proxies for generalization: what survives in 1d when generalization gap is explicit?
This post closes the first group of posts built around the same 1d classification example. In that example, post 1 showed that useful explanations of generalization cannot ignore the data, and post 2 identified the exact quantity that remains at interpolation: Bayes error plus a disagreement term with the Bayes classifier.
That raises a natural question:
do popular complexity measures for generalization say something about the mass of the region where the classifier has the wrong sign, or are they really measuring something else?
All the complexity measures we look at do depend on the learned decision rule, so they are eligible candidates to explain generalization in this setup. In particular, they do not fall in the “data-agnostic” category of post 1. But being an eligible candidate is not the same as tracking the right object.
We stay in the same setup as before:
- $X\sim \mathrm{Unif}([-1,1])$
- label $y(x)=\operatorname{sign}(x)$
- with probability $p$, the label is replaced by a random sign $\text{Unif}(\{-1,+1\})$.
We train a 1-hidden-layer ReLU net for binary classification. Write its 2d output as $f(x)=(f_1(x),f_2(x))$. Define the logit gap
$$ \Delta(x)=f_2(x)-f_1(x), $$with predicted label
$$ \hat y(x) = \operatorname{sign}(\Delta(x)). $$The population error is (proved in the previous post):
$$ \boxed{ \text{population error}(p) = p/2 + (1-p)\,\mathrm{Leb\_wrong}/2 } $$with the Lebesgue measure of the disagreement set: $\mathrm{Leb\_wrong} = \mathrm{Leb}\{x\in[-1,1]: \operatorname{sign}(\Delta(x))\neq \operatorname{sign}(x)\}$.
This formula is the takeaway of post 2. Once training error is zero, the classifier-dependent part of generalization is controlled by the disagreement region: how much of the interval is predicted with the wrong sign. That is the quantity the proxies should somehow capture.
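As a sanity check, the formula can be verified by Monte Carlo. Below is a minimal numpy sketch; the threshold classifier $\hat y(x)=\operatorname{sign}(x-t)$ is an illustrative stand-in for the trained network (its disagreement set with $\operatorname{sign}(x)$ is $[0,t)$, so $\mathrm{Leb\_wrong}=t$), not part of the original experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.4   # corruption level (illustrative)
t = 0.2   # threshold of a stand-in classifier sign(x - t)

# Closed form: population error = p/2 + (1-p) * Leb_wrong / 2,
# where the disagreement set of sign(x - t) with sign(x) is [0, t).
leb_wrong = t
closed_form = p / 2 + (1 - p) * leb_wrong / 2

# Monte Carlo estimate of the population error
n = 1_000_000
x = rng.uniform(-1.0, 1.0, n)
y = np.sign(x)
corrupted = rng.random(n) < p
y[corrupted] = rng.choice([-1.0, 1.0], size=corrupted.sum())
y_hat = np.sign(x - t)
mc_error = np.mean(y_hat != y)

print(f"closed form: {closed_form:.4f}, monte carlo: {mc_error:.4f}")
```

The two numbers agree up to Monte Carlo noise, as the formula predicts.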
In more general classification settings, there is no similar easy formula since the Bayes classifier is not tractable. Instead, one falls back on proxies computed from the learned parameters: margins, counts of decision changes, slopes, sharpness, and related quantities.
But in our 1d example, these indirect explanations can be compared directly to the exact Bayes-disagreement formula above.
what do these proxies actually measure, compared with the quantity that really determines generalization here?
usual proxies and their computation
Since we consider a one-hidden-layer ReLU net, many complexity measures are easy to compute. Indeed, each ReLU hidden unit $x\mapsto \max(0,w x+b)$ is piecewise affine with a single breakpoint at
$$ x=-b/w. $$If we sort these breakpoints, we get a partition of the domain into intervals on which $\Delta$ is affine:
$$ \Delta(x)=m x + c. $$Computing all these intervals and the corresponding slopes $m$ and intercepts $c$ is easy, and allows us to compute many proxies exactly, such as:
- the number of affine pieces,
- the number of sign changes of $\Delta$,
- slopes $\lvert m \rvert$ on each region, and in particular the exact Lipschitz constant $\max \lvert \Delta'(x) \rvert$ or the averaged Lipschitz constant $\int_{-1}^1 \lvert\Delta'(x)\rvert\,dx / 2$,
- margin-band mass $\mathrm{Leb}\{\lvert\Delta(x)\rvert\le \gamma\}$,
- sharpness around the learned parameters,
- and of course $\mathrm{Leb\_wrong}$.
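For concreteness, here is a minimal numpy sketch of this computation (weight shapes and names are our own conventions, not the experiment code): it extracts the affine pieces of $\Delta$ from the parameters and computes $\mathrm{Leb\_wrong}$ exactly.

```python
import numpy as np

def affine_pieces(W1, b1, w2, b2):
    """Affine pieces of Delta(x) = f2(x) - f1(x) on [-1, 1] for a
    one-hidden-layer ReLU net: W1, b1 have shape (h,), w2 has shape
    (2, h), b2 has shape (2,). Returns a list of (a, b, m, c) with
    Delta(x) = m*x + c on [a, b]."""
    v = w2[1] - w2[0]                      # output weights feeding Delta
    c0 = b2[1] - b2[0]
    nz = W1 != 0                           # units with an actual breakpoint
    kinks = -b1[nz] / W1[nz]
    kinks = np.sort(kinks[(kinks > -1) & (kinks < 1)])
    knots = np.concatenate(([-1.0], kinks, [1.0]))
    pieces = []
    for a, b in zip(knots[:-1], knots[1:]):
        x0 = (a + b) / 2                   # interior point of the piece
        active = W1 * x0 + b1 > 0          # ReLU activation pattern there
        m = float(np.sum(v * W1 * active))
        c = float(np.sum(v * b1 * active)) + float(c0)
        pieces.append((a, b, m, c))
    return pieces

def leb_wrong(pieces):
    """Exact Leb{x in [-1,1] : sign(Delta(x)) != sign(x)}."""
    total = 0.0
    for a, b, m, c in pieces:
        cuts = {a, b}
        if a < 0.0 < b:                    # true label flips at 0
            cuts.add(0.0)
        if m != 0.0 and a < -c / m < b:    # Delta flips at its root
            cuts.add(-c / m)
        cuts = sorted(cuts)
        for u, w in zip(cuts[:-1], cuts[1:]):
            x0 = (u + w) / 2
            if np.sign(m * x0 + c) != np.sign(x0):
                total += w - u
    return total
```

The other piecewise quantities (number of pieces, slopes, sign flips) read off directly from the returned `(a, b, m, c)` tuples.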
two families of proxies
In this setup, most of the proxies we look at fall into one of two broad families.
family A: decision-geometry proxies (about correctness of the decision)
These try to say something meaningful about the geometry of the disagreement set $\{x\in[-1,1]: \operatorname{sign}(\Delta(x))\neq \operatorname{sign}(x)\}$.
- $\mathrm{Leb\_wrong}$ itself. Not a proxy: it is the measure of the disagreement set, and it determines the classifier-dependent part of the exact population error (through the factor $(1-p)/2$), hence of the generalization gap once the training error is zero.
- number of sign flips (roots) of $\Delta$. A coarse summary of how often the classifier changes its mind on $[-1,1]$. This is not a standard generalization proxy in modern settings, but here it is a natural candidate to track the corruption level $p$ at interpolation: as $p$ grows, the learned rule needs to flip its decision more and more often to fit the corrupted labels. However, there is no direct link with $\mathrm{Leb\_wrong}$: a classifier can flip often while being wrong on a tiny set, or flip rarely while being wrong on a large one.
- margin-band mass $$ \mathrm{Leb}\{x\in[-1,1]: |\Delta(x)|\le \gamma\}. $$ This measures how much input mass lives near the decision boundary $\Delta=0$, i.e. how fragile predictions are under small perturbations of the score. It does not say whether the boundary is placed correctly: a classifier can be confidently wrong (large $|\Delta|$ with the wrong sign), so the band mass can be small while $\mathrm{Leb\_wrong}$ is large. Here, margins are mostly about robustness of the decision, not correctness.
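The margin-band mass is also exact to compute for a piecewise-affine $\Delta$. A minimal sketch, assuming the pieces are given as tuples $(a, b, m, c)$ with $\Delta(x)=mx+c$ on $[a,b]$ (our own convention, not the experiment code):

```python
def band_mass(pieces, gamma):
    """Exact Leb{x in [-1,1] : |Delta(x)| <= gamma} for a piecewise-affine
    Delta given as tuples (a, b, m, c) with Delta(x) = m*x + c on [a, b]."""
    total = 0.0
    for a, b, m, c in pieces:
        if m == 0.0:
            # constant piece: inside the band entirely or not at all
            total += (b - a) if abs(c) <= gamma else 0.0
        else:
            # |m*x + c| <= gamma is an interval; intersect it with [a, b]
            lo, hi = sorted(((-gamma - c) / m, (gamma - c) / m))
            total += max(0.0, min(b, hi) - max(a, lo))
    return total
```

For instance, for $\Delta(x)=x$ on the whole interval, `band_mass([(-1, 1, 1, 0)], 0.5)` is exactly the length of $[-0.5, 0.5]$.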
family B: scale / landscape proxies (about sensitivity of the decision, not correctness)
These are about “how wild” $\Delta$ is, or how sensitive the parameters are.
- slope / Lipschitz-type quantities in input space. For instance, the maximal slope $\max \lvert \Delta'(x) \rvert$, or the average slope $\int_{-1}^1 \lvert\Delta'(x)\rvert\,dx / 2$. Large slopes mean large sensitivity to input perturbations, but sensitivity has no direct link with whether the decision is correct: there are classifiers with large slope but small disagreement mass, and vice versa. Note that while we report exact versions of these slopes, in more general settings one often replaces them with upper bounds, since computing the exact Lipschitz constant is NP-hard in high dimension.
- flatness / sharpness in parameter space. These measure how sensitive the training loss is to perturbations of the parameters. Sharpness can track memorization difficulty rather than disagreement geometry: here, interpolating increasingly corrupted labels requires more and more oscillations of the learned rule, and small parameter perturbations can easily break these oscillations and cause a large loss increase, even if the disagreement mass (hence the generalization gap) remains small. These proxies are the only ones we consider that depend on the parameters chosen to represent the learned decision rule: the ReLU network admits several parameter values that represent the same function, and among these, some can be sharper than others. Below, we report different notions of sharpness, each with different sensitivity to these symmetries.
which proxies are closest to the right geometry?
Below, we run experiments where we vary the corruption level $p$ and look at the correlation between each proxy and the generalization gap across $p$. Before looking at the numbers, we can already expect a rough hierarchy in this 1d setup:
- $\mathrm{Leb\_wrong}$ should be among the most reliable, since it is the classifier-dependent geometric part of the exact formula for the generalization gap. But we should not expect perfect correlation: varying $p$ not only changes the learned decision rule (hence $\mathrm{Leb\_wrong}$), but also the Bayes error term $p/2$ in the formula, which is independent of the learned rule.
- The sign-flip count should also strongly correlate: larger $p$ requires more oscillations of the learned rule to fit the corrupted labels, while the generalization gap also grows with $p$. Note that this proxy is specific to the setting considered here. In the absence of progressively corrupted labels, the decision-flip count would not necessarily be a relevant quantity to track.
- Then should come sensitivity proxies such as margin-band mass, slopes, and sharpness, which are not directly about the correctness of the decision.
results
The one-hidden-layer ReLU network has 1d input, 2d output, and width 8096.
We train long enough to reach near-zero training error for all corruption levels. Experiments are repeated 5 times with different random seeds (training data, initialization).
Within each run, proxies are ranked by absolute Pearson correlation with the generalization gap across corruption levels $p$. We report the mean rank $\pm$ standard deviation and the mean correlation $\pm$ standard deviation.
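The ranking step within a run is straightforward; a minimal sketch (function and variable names are our own, not the experiment code):

```python
import numpy as np

def rank_proxies(gap, proxies):
    """gap: generalization gaps across corruption levels p;
    proxies: dict name -> proxy values on the same p grid.
    Returns (name, corr) pairs sorted by decreasing |Pearson corr|."""
    scored = {name: float(np.corrcoef(vals, gap)[0, 1])
              for name, vals in proxies.items()}
    return sorted(scored.items(), key=lambda kv: -abs(kv[1]))
```

Note that ranking by $\lvert\mathrm{corr}\rvert$ means a proxy that is strongly anti-correlated with the gap would also rank near the top.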
The exact list of proxies computed is the following:
- Leb_wrong: exact disagreement mass $\mathrm{Leb}\{\operatorname{sign}(\Delta(x)) \ne \operatorname{sign}(x)\}$
- nb_sign_flips: number of sign changes of $\Delta$ on $[-1,1]$
- band_mass_γ: margin-band mass $\mathrm{Leb}\{\lvert\Delta(x)\rvert \le \gamma\}$ for $\gamma \in \{0.1, 0.2, 0.5, 1.0\}$
- nb_regions: number of affine pieces of $\Delta$ on $[-1,1]$
- Lmax: max input-space slope $\max \lvert\Delta'(x)\rvert$
- Lavg: average slope $\mathbb{E}\lvert\Delta'(X)\rvert$
- Lavg_flip: average $\lvert\Delta'(x)\rvert$ restricted to regions containing a root of $\Delta$
- hess_top: top Hessian eigenvalue (power iteration via HVP)
- flat_rel: mean loss increase under relative Gaussian perturbations
- sam_sharp: loss increase when moving in the direction of the worst-case perturbation in a small $L_2$ ball (one step of so-called sharpness-aware minimization)
- train_err: training error
We obtain the following results (reproduce in colab).
Mean rank +/- std (ranked by $\lvert\mathrm{corr}\rvert$ within each run):

| stat | mean rank +/- std |
|---|---|
| nb_flips | 1.20 +/- 0.40 |
| Leb_wrong | 1.80 +/- 0.40 |
| band_mass_1.0 | 4.00 +/- 0.89 |
| band_mass_0.5 | 5.60 +/- 2.42 |
| sam_sharp | 6.40 +/- 0.49 |
| band_mass_0.2 | 6.60 +/- 1.85 |
| band_mass_0.1 | 7.20 +/- 3.37 |
| hess_top | 7.20 +/- 3.31 |
| flat_rel | 9.00 +/- 1.55 |
| Lmax | 10.00 +/- 1.79 |
| nb_regions | 10.20 +/- 2.04 |
| train_err | 11.20 +/- 1.33 |
| Lavg_flip | 12.00 +/- 4.00 |
| Lavg | 12.60 +/- 1.02 |
Mean corr +/- std:

| stat | mean corr +/- std |
|---|---|
| nb_flips | +0.92 +/- 0.04 |
| Leb_wrong | +0.88 +/- 0.03 |
| band_mass_1.0 | +0.72 +/- 0.11 |
| band_mass_0.5 | +0.67 +/- 0.15 |
| band_mass_0.2 | +0.64 +/- 0.15 |
| band_mass_0.1 | +0.63 +/- 0.19 |
| sam_sharp | +0.62 +/- 0.13 |
| hess_top | +0.62 +/- 0.09 |
| flat_rel | +0.51 +/- 0.03 |
| Lmax | +0.49 +/- 0.11 |
| train_err | +0.42 +/- 0.13 |
| nb_regions | +0.41 +/- 0.18 |
| Lavg | +0.29 +/- 0.09 |
| Lavg_flip | +0.20 +/- 0.21 |
The ordering roughly matches the expected hierarchy. Across repeated runs, three quantities stay near the top:
- $\mathrm{Leb\_wrong}$, which is the exact disagreement-mass term appearing in the formula above,
- nb_sign_flips, a proxy specific to this 1d setup, which tracks the need for more oscillations as $p$ grows,
- band_mass_1.0, one of the margin-band mass proxies.
Most other quantities show a much larger spread across setups (training set, initialization, width of the network). This is consistent with the fact that they measure the sensitivity of the decision rule rather than its correctness.
takeaway
This closes the first group of posts built around understanding generalization in the same 1d classification example.
In that example, we now know the target exactly: at interpolation, the generalization gap equals the population error, namely Bayes error plus a term controlled by the mass of the region where the learned classifier disagrees with the Bayes rule. Most common proxies do not measure that quantity. They measure how sensitive the learned rule is to changes in inputs (slopes, margins) or parameters (sharpness), or how often it changes its prediction.
The repeated-run picture illustrates that. The quantities that stay strongest are either the exact geometric target itself, or 1d comparators tailored to that same geometry. The more generic proxies are much less stable.
This is not specific to this 1d example: a fixed proxy only measures one aspect of the learned rule, and that aspect can be irrelevant to many Bayes-disagreement geometries that can arise in more general classification problems.
cite this post as:
```bibtex
@misc{gonon2026proxies1d,
  title  = {Proxies for Generalization: What Survives in 1d When Generalization Gap Is Explicit?},
  author = {Antoine Gonon},
  year   = {2026},
  url    = {https://dimension-one.github.io/blog/2026-04-13-proxies-generalization-1d/}
}
```