at interpolation, generalization is distance to Bayes

March 16, 2026

TL;DR. The previous post exhibited a situation where common complexity measures struggle to explain the generalization gap: the same model and optimizer can interpolate every dataset yet generalize very differently depending on the labels. In that same setup, however, the gap itself has a simple structure, so we know exactly what quantity a bound would need to capture. At interpolation it equals the population error, which decomposes into the Bayes error plus a disagreement term with the Bayes classifier. In one dimension, that disagreement term can be computed exactly from the learned parameters.

In the previous post, we saw that the same model and optimizer can generalize very differently depending on the data.

But the data-generating process is usually unknown (which is precisely why we train a model to recover it in some sense), so we cannot directly inspect it to understand generalization.

After training, what we can actually inspect are the learned parameters. If generalization depends on the data distribution, then that dependence must leave an observable footprint in those parameters.

This post asks: what is that footprint, exactly? In a simple one-dimensional setting, it can be computed in closed form.

At interpolation (zero training error), we simply have $$\text{generalization gap} = \text{population error} - \text{train error} = \text{population error}$$ so the question becomes: what is the population error equal to?


what actually determines population error?

We consider binary classification with inputs $X\in\mathbb{R}^d$ and labels $Y\in\{-1,+1\}$.

A classifier $\hat y$ is a function $\mathbb{R}^d\to\{-1,+1\}$. Its population error is $$ \P(\hat y(X)\neq Y). $$

There exists an optimal classifier, called a Bayes classifier $y^\star$, which minimizes this quantity; it provides a natural reference point.

For any classifier $\hat y$, we can decompose its population error as $$ \P(\hat y(X)\neq Y) = \P(y^\star(X)\neq Y) + \mathcal{E}(\hat y). $$

The first term is the Bayes error, which is independent of the learned classifier $\hat y$. The second term depends on $\hat y$ and measures the excess error relative to Bayes: $$ \mathcal{E}(\hat y) \coloneqq \E\left[ \lvert 2\eta(X)-1\rvert \, \cdot \, \mathbf{1}_{\{\hat y(X)\neq y^\star(X)\}} \right], \qquad \eta(x)\coloneqq\P(Y=+1\mid X=x). $$

When training error is zero, the generalization gap is entirely captured by this excess error term, which measures the deviation from a Bayes classifier. Here, a convenient choice of Bayes classifier is $$\text{Bayes classifier}(x)\coloneqq\operatorname{sign}(\eta(x)-1/2).$$

details (click to expand): why $\operatorname{sign}(\eta(x)-1/2)$ is Bayes optimal and proof of the decomposition

Fix $x$. Once we condition on $X=x$, the prediction $\hat y(x)$ is just a constant in $\{-1,+1\}$. The only randomness left is in $Y$, so the conditional error is $$ \P(\hat y(X)\neq Y\mid X=x) = \begin{cases} \P(Y=-1\mid X=x), & \text{if }\hat y(x)=+1,\\ \P(Y=+1\mid X=x), & \text{if }\hat y(x)=-1. \end{cases} $$

Since $ \eta(x)=\P(Y=+1\mid X=x), $ we get the compact indicator form $$ \P(\hat y(X)\neq Y\mid X=x) = (1-\eta(x)) \mathbf{1}_{\{\hat y(x)=+1\}} + \eta(x) \mathbf{1}_{\{\hat y(x)=-1\}}. $$

At a fixed $x$, an optimal prediction $\hat y(x)$ is one that minimizes this conditional error.

  • Predicting $+1$ gives conditional error $1-\eta(x)$.
  • Predicting $-1$ gives conditional error $\eta(x)$.

Since $1-\eta(x)\leq \eta(x)$ if and only if $\eta(x)\geq 1/2$, this shows that we can indeed choose a Bayes classifier of the form $$ y^\star(x)= \begin{cases} +1, & \eta(x)\ge 1/2,\\ -1, & \eta(x)< 1/2, \end{cases} \qquad\text{i.e.}\qquad y^\star(x)=\operatorname{sign}\big(\eta(x)-1/2\big) \ \ (\text{ties arbitrary}). $$

We now prove the decomposition of the population error of any classifier $\hat y$ into Bayes error + excess error. Fix $x$ again.

  • If $\hat y(x)=y^\star(x)$, then $\hat y$ achieves the Bayes conditional error at $x$.
  • If $\hat y(x)\neq y^\star(x)$, then $\hat y(x)$ picks the other label, so its conditional error becomes $\max\{\eta(x),1-\eta(x)\}$ instead of $\min\{\eta(x),1-\eta(x)\}$. The gap is therefore $$ \max\{\eta(x),1-\eta(x)\}-\min\{\eta(x),1-\eta(x)\}=\lvert 2\eta(x)-1\rvert. $$

So we have the pointwise identity $$ \P(\hat y(X)\neq Y\mid X=x) = \P(y^\star(X)\neq Y\mid X=x) + \lvert 2\eta(x)-1\rvert \,\mathbf{1}_{\{\hat y(x)\neq y^\star(x)\}}. $$

Taking expectation over $X$ gives, as claimed: $$ \P(\hat y(X)\neq Y) = \P(y^\star(X)\neq Y) + \mathcal{E}(\hat y). $$ As a side note, even though we will not use it, since $\lvert 2\eta(X)-1\rvert\le 1$, we also obtain a cruder but more interpretable bound via the disagreement set: $$ \mathcal{E}(\hat y)\le \P(\hat y(X)\neq y^\star(X)). $$
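The decomposition is easy to check numerically by Monte Carlo. The sketch below uses an arbitrary toy distribution (a logistic $\eta$ and a shifted-threshold classifier, both illustrative choices, not anything from the post's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400_000

# toy 1d distribution: X uniform on [-1, 1], logistic eta(x) = P(Y=+1 | X=x)
X = rng.uniform(-1.0, 1.0, size=n)
eta = 1.0 / (1.0 + np.exp(-4.0 * X))
Y = np.where(rng.uniform(size=n) < eta, 1, -1)

y_star = np.where(eta >= 0.5, 1, -1)   # Bayes classifier: sign(eta - 1/2)
y_hat = np.where(X > 0.2, 1, -1)       # a deliberately suboptimal classifier

pop_err = np.mean(y_hat != Y)          # population error of y_hat (empirical)
bayes_err = np.mean(y_star != Y)       # Bayes error (empirical)
excess = np.mean(np.abs(2 * eta - 1) * (y_hat != y_star))  # excess error term

# the two sides of the decomposition agree up to Monte Carlo error
```

The disagreement set here is the interval $(0, 0.2]$, and the excess error weights it by the confidence $|2\eta(x)-1|$, exactly as in the identity above.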


why this is usually hard to compute

Ideally, there would be a "simple" recipe to compute the excess error of a learned classifier from its parameters. But in general, and especially for high-dimensional inputs $x$:

  • the conditional probability $\eta$, and hence the Bayes classifier $y^\star$, is unknown, since it is determined by the data distribution we do not have access to;
  • even if $y^\star$ were known, the disagreement set $\{x:\hat y(x)\neq y^\star(x)\}$ of a trained network is a complicated region whose mass is intractable to compute exactly.

This is why people rely on indirect proxies: margins, flatness, norms of the parameters, etc.

However, when the input is one-dimensional, exact computation is sometimes possible. The rest of this post is devoted to such an example.


a setting where everything is explicit

Consider the 1d problem (the same as in the previous post), with inputs in $[-1,1]$: an input $x$ has probability $1-p/2$ of having its label equal to $\operatorname{sign}(x)$, and probability $p/2$ of having its label equal to the opposite sign. Therefore the function $\eta(x)$ is

$$ \eta(x) = \begin{cases} 1-p/2, & \text{if } x>0,\\ p/2, & \text{if } x<0, \end{cases} $$

and the Bayes classifier $y^\star(x)\coloneqq\operatorname{sign}(\eta(x)-1/2)$ is simply the sign rule $y^\star(x)=\operatorname{sign}(x)$.

details (click to expand): why $\operatorname{sign}(x)$ is Bayes optimal

We know that a Bayes classifier is $$ y^\star(x)= \begin{cases} +1, & \text{if }\eta(x)\ge 1/2,\\ -1, & \text{if }\eta(x)< 1/2. \end{cases} $$

From the expression of $\eta(x)$ above, we have for every $p\in[0,1]$:

  • if $x>0$, then $\eta(x)=1-p/2$ and since this is at least 1/2, we have $y^\star(x)=+1$.
  • otherwise $x<0$, in which case $\eta(x)=p/2\le 1/2$, hence $y^\star(x)=-1$.

Therefore, $$ y^\star(x)=\operatorname{sign}(x)\quad \text{for all } p\in[0,1]. $$

(at $p=1$ we have $\eta(x)=1/2$ everywhere, so every classifier is Bayes-optimal; the sign rule is still one of them.)

details (click to expand): intuition for why $\operatorname{sign}(x)$ is Bayes optimal
  • Corrupted labels are pure independent noise, so all rules perform equally on them: error is $1/2$ for every classifier, including the sign rule.
  • Clean labels are perfectly classified by the sign rule.

This is why the sign rule is optimal in all cases.

In particular, the Bayes error is $p/2$, and the Bayes confidence is $\lvert 2\eta(x)-1\rvert=1-p$ for all $x$ (except at 0, which is irrelevant since $X$ has no mass there).

Therefore, for this 1d problem, the population error of any classifier $\hat y$ simplifies to $$ \text{population error}(p) = p/2 + (1-p)\,\P\big(\hat y(X)\neq \operatorname{sign}(X)\big). $$

In the previous post, we trained a one-hidden-layer ReLU network with a 2d output on this problem. Let the network output be $f(x)=(f_1(x),f_2(x))$. Define the logit gap $$ \Delta(x)\coloneqq f_2(x)-f_1(x). $$

The learned classifier is $\hat y(x)=\operatorname{sign}(\Delta(x))$. Define $$ \mathrm{Leb\_wrong} \coloneqq \mathrm{Leb}\Big\{ x: \operatorname{sign}(\Delta(x))\neq \operatorname{sign}(x) \Big\}. $$

We obtain, in terms of the network’s output: $$ \boxed{ \text{population error}(p) = p/2 + (1-p)\,\mathrm{Leb\_wrong} } $$


computing the disagreement set in 1d

For a one-hidden-layer ReLU network, $\Delta$ is piecewise affine: there is a partition of the input space into intervals such that $$ \Delta(x)=mx+c \quad\text{on each interval}. $$
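This piecewise-affine structure can be verified directly: on a uniform grid between two consecutive breakpoints, the second differences of $\Delta$ vanish. A minimal sketch with hand-picked weights (all values below are illustrative, not from the trained network):

```python
import torch
import torch.nn as nn

# a tiny hand-crafted one-hidden-layer ReLU net
net = nn.Sequential(nn.Linear(1, 3), nn.ReLU(), nn.Linear(3, 2, bias=False))
with torch.no_grad():
    net[0].weight.copy_(torch.tensor([[1.0], [-2.0], [0.5]]))
    net[0].bias.copy_(torch.tensor([0.3, 0.4, -0.1]))
    net[2].weight.copy_(torch.tensor([[1.0, 0.5, -1.0], [0.2, -0.3, 0.7]]))

    # breakpoints -b_j / w_j are -0.3, 0.2, 0.2, so Delta is affine on (-0.3, 0.2)
    xs = torch.linspace(-0.25, 0.15, 9).unsqueeze(1)
    out = net(xs)
    delta = out[:, 1] - out[:, 0]          # logit gap Delta(x) = f_2(x) - f_1(x)
    # an affine function has vanishing second differences on a uniform grid
    d2 = delta[2:] - 2 * delta[1:-1] + delta[:-2]
```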

The intervals are determined by the breakpoints where the ReLU units switch from inactive to active. For a hidden neuron implementing $x\mapsto \max(0, w x + b)$ with $w\neq 0$, its breakpoint is at $$ x=-b/w. $$ (A neuron with $w=0$ is constant and creates no breakpoint.)

By collecting and sorting these breakpoints, we obtain intervals on which $\Delta$ is affine, and the associated slope $m$ and intercept $c$.

On each interval, we can compute the length of the region where the signs differ, and summing over intervals gives $\mathrm{Leb\_wrong}$.

In summary: parameters $\to$ breakpoints $\to$ partition of $[-1,1]$ $\to$ affine pieces of $\Delta$ $\to$ $\mathrm{Leb\_wrong}$.

Below is how this computation looks in code.

import numpy as np
import torch
import torch.nn as nn

width = 512            # illustrative value; use the width from the previous post's setup
nonlinearity = nn.ReLU()

net = nn.Sequential(
    nn.Linear(1, width, bias=True),
    nonlinearity,
    nn.Linear(width, 2, bias=False),
)
# ... train net as in the previous post ...

w = net[0].weight[:, 0]
b = net[0].bias
v = net[2].weight[1, :] - net[2].weight[0, :]  # logit gap(x) = f_2(x) - f_1(x) = v^T ReLU(w x + b)

# breakpoints are the -b_j / w_j inside (-1, 1)
mask_w = (w.abs() > 1e-12)  # if w_j ≈ 0, the neuron contributes the constant ReLU(b_j): no breakpoint (and it is a dead neuron if b_j < 0)
bk = (-b[mask_w] / w[mask_w]).detach().cpu().numpy()
bk = np.unique(bk[(bk > -1) & (bk < 1)])
bk = np.concatenate(([-1.0], bk, [1.0]))

Leb_wrong = 0.0

for i in range(len(bk) - 1):
    l, r = float(bk[i]), float(bk[i + 1])  # interval [left, right]

    mid = 0.5 * (l + r)
    act = ((w * mid + b) > 0).to(w.dtype)  # which ReLU neurons are active on this interval
    # logit gap(x) = f_2(x) - f_1(x) = m * x + c on this interval
    m = (v * act * w).sum().item()
    c = (v * act * b).sum().item()

    # Leb( sign(Delta(x)) != sign(x) ) on [l, r]: split at 0 so that sign(x) is constant on each piece
    cuts = [(l, r)]
    if l < 0 < r:
        cuts = [(l, 0.0), (0.0, r)]
    for (ll, rr) in cuts:
        true_sign = -1 if rr <= -1e-12 else (1 if ll >= 1e-12 else 0)
        if true_sign == 0:  # degenerate sliver around 0, negligible
            continue

        if abs(m) <= 1e-12:
            # constant Delta(x) = c on this piece
            if c * true_sign < -1e-12:  # c has the sign opposite to true_sign: the whole piece is wrong
                Leb_wrong += rr - ll
        else:
            x0 = -c / m  # zero of m * x + c, where Delta changes sign
            if m > 0:
                pos_subinterval = max(0.0, rr - max(ll, x0))  # Delta > 0 on x > x0
            else:
                pos_subinterval = max(0.0, min(rr, x0) - ll)  # Delta > 0 on x < x0
            Leb_wrong += pos_subinterval if true_sign == -1 else (rr - ll - pos_subinterval)
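One way to validate such a computation is to compare it against a brute-force grid estimate of the disagreement set. The sketch below does this on a hand-crafted net (weights chosen for illustration) whose logit gap is $\Delta(x)=x-0.2$ on $[-1,1]$ by construction, so that $\mathrm{Leb\_wrong}=0.2$ exactly:

```python
import torch
import torch.nn as nn

# hand-crafted net: on [-1, 1] both neurons stay active, and
# Delta(x) = f_2(x) - f_1(x) = 0.45 * (x + 2) - 0.55 * (2 - x) = x - 0.2
net = nn.Sequential(nn.Linear(1, 2), nn.ReLU(), nn.Linear(2, 2, bias=False))
with torch.no_grad():
    net[0].weight.copy_(torch.tensor([[1.0], [-1.0]]))
    net[0].bias.copy_(torch.tensor([2.0, 2.0]))
    net[2].weight.copy_(torch.tensor([[0.0, 0.0], [0.45, -0.55]]))

    # sign(Delta(x)) != sign(x) exactly on (0, 0.2], so Leb_wrong = 0.2
    xs = torch.linspace(-1.0, 1.0, 200_001)
    out = net(xs.unsqueeze(1))
    delta = out[:, 1] - out[:, 0]
    wrong = torch.sign(delta) != torch.sign(xs)
    # fraction of wrong grid points, times the length 2 of [-1, 1]
    leb_grid = 2.0 * wrong.float().mean().item()
```

The grid estimate converges to the exact Lebesgue measure as the grid is refined; with 200k points it already matches $0.2$ to within $10^{-3}$.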

what we observe

When we run this computation in the setup from the previous post:


takeaway

Generalization in a classification problem is about where the learned classifier deviates from the Bayes rule.

In high dimension, this deviation is hard to characterize.

In one dimension, with a simple model, it can be computed exactly from the learned parameters.


cite this post as:

@misc{gonon2026interpolation_distance_to_bayes,
  title  = {At Interpolation, Generalization is Distance to Bayes},
  author = {Antoine Gonon},
  year   = {2026},
  url    = {https://dimension-one.github.io/blog/2026-03-16-post-2/}
}