proxies for generalization: what survives in 1d when generalization gap is explicit?

April 13, 2026

TL;DR. This post closes the first group of posts built around the same 1d classification example. In that example, post 1 showed that useful explanations of generalization cannot ignore the data, and post 2 showed that at interpolation the relevant target is Bayes error plus a term measuring disagreement with the Bayes rule. Here we compare common proxies such as margins, slopes and sharpness against that target. The main point is geometric: most of these quantities measure sensitivity or oscillation of the learned rule, not the mass of the region where its decision disagrees with Bayes. In this toy setting, Bayes-aware 1d quantities such as disagreement mass and decision-flip count stay near the top, while more generic slope, margin and sharpness proxies are less stable. This highlights a common failure mode of popular proxies: they track only one aspect of the learned rule, and that aspect can be irrelevant to the geometry of generalization in many problems.

This post closes the first group of posts built around the same 1d classification example. In that example, post 1 showed that useful explanations of generalization cannot ignore the data, and post 2 identified the exact quantity that remains at interpolation: Bayes error plus a term measuring disagreement with the Bayes classifier.

That raises a natural question:

do popular complexity measures for generalization say something about the mass of the region where the classifier has the wrong sign, or are they really measuring something else?

All the complexity measures we look at do depend on the learned decision rule, so they are eligible candidates to explain generalization in this setup. In particular, they do not fall in the “data-agnostic” category of post 1. But being an eligible candidate is not the same as tracking the right object.

We stay in the same setup as before:

We train a 1-hidden-layer ReLU net for binary classification. Write its 2d output as $f(x)=(f_1(x),f_2(x))$. Define the logit gap

$$ \Delta(x)=f_2(x)-f_1(x), $$

with predicted label

$$ \hat y(x) = \operatorname{sign}(\Delta(x)). $$

The population error (proved in the previous post) is:

$$ \boxed{ \text{population error}(p) = p/2 + (1-p)\,\mathrm{Leb\_wrong}/2 } $$

where $\mathrm{Leb\_wrong} = \mathrm{Leb}\{x\in[-1,1]: \operatorname{sign}(\Delta(x))\neq \operatorname{sign}(x)\}$ is the Lebesgue measure of the disagreement set.

This formula is the takeaway of post 2. Once training error is zero, the classifier-dependent part of generalization is controlled by the disagreement region: how much of the interval is predicted with the wrong sign. That is the quantity the proxies should somehow capture.
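As a sanity check, the formula can be verified by Monte Carlo. The snippet below is a sketch, assuming the corruption model implied by the formula (inputs uniform on $[-1,1]$, true label $\operatorname{sign}(x)$, and with probability $p$ the label replaced by a fair coin flip); `predict` is a hypothetical classifier, not the trained net, chosen so that $\mathrm{Leb\_wrong}=0.2$.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3  # corruption level (assumed model: label replaced by a coin flip w.p. p)

def predict(x):
    # Hypothetical classifier sign(x - 0.2): it disagrees with sign(x)
    # exactly on (0, 0.2), so Leb_wrong = 0.2.
    return np.sign(x - 0.2)

x = rng.uniform(-1.0, 1.0, size=2_000_000)
bayes = np.sign(x)
flip = rng.random(x.size) < p
coin = rng.choice([-1.0, 1.0], size=x.size)
y = np.where(flip, coin, bayes)

mc_error = np.mean(predict(x) != y)
formula_error = p / 2 + (1 - p) * 0.2 / 2  # = 0.22 for these numbers
```

With these numbers the formula gives $0.3/2 + 0.7\cdot 0.1 = 0.22$, and the Monte Carlo estimate matches to a few decimal places.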

In more general classification settings there is no comparably simple formula, since the Bayes classifier is not tractable. Instead, one falls back on proxies computed from the learned parameters: margins, counts of decision changes, slopes, sharpness, and related quantities.

But in our 1d example, these indirect explanations can be compared directly to the exact Bayes-disagreement formula above.

what do these proxies actually measure, compared with the quantity that really determines generalization here?


usual proxies and their computation

Since we consider a one-hidden-layer ReLU net, many complexity measures are easy to compute. Indeed, each ReLU hidden unit $x\mapsto \max(0,w x+b)$ is piecewise affine with a single breakpoint at

$$ x=-b/w. $$

If we sort these breakpoints, we get a partition of the domain into intervals on which $\Delta$ is affine:

$$ \Delta(x)=m x + c. $$

Computing all these intervals, together with their slopes $m$ and intercepts $c$, is straightforward, and allows many proxies to be computed exactly.
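As a sketch of that computation, here is how the affine pieces can be extracted for a small net. The parameters `w1`, `b1`, `W2`, `b2` are hypothetical random stand-ins for the trained weights, and the width is 8 instead of the post's 8096 for readability.

```python
import numpy as np

rng = np.random.default_rng(0)
width = 8  # the post uses width 8096; 8 keeps the sketch readable

# Hypothetical parameters standing in for the trained weights.
w1 = rng.standard_normal(width)       # hidden weights
b1 = rng.standard_normal(width)       # hidden biases
W2 = rng.standard_normal((2, width))  # output weights
b2 = rng.standard_normal(2)           # output biases

def delta(x):
    """Logit gap Delta(x) = f_2(x) - f_1(x) for a batch of inputs."""
    h = np.maximum(0.0, np.outer(np.atleast_1d(x), w1) + b1)  # (n, width)
    f = h @ W2.T + b2                                         # (n, 2)
    return f[:, 1] - f[:, 0]

# Each unit contributes a breakpoint at x = -b/w; keep those inside [-1, 1].
bp = -b1[w1 != 0] / w1[w1 != 0]
knots = np.concatenate(([-1.0], np.sort(bp[(bp > -1) & (bp < 1)]), [1.0]))

# Between consecutive knots, Delta is affine; recover slope m and intercept c
# from two evaluations strictly inside the piece.
pieces = []
for a, b in zip(knots[:-1], knots[1:]):
    xa, xb = a + 1e-9, b - 1e-9
    ya, yb = delta(np.array([xa, xb]))
    m = (yb - ya) / (xb - xa)
    pieces.append((a, b, m, ya - m * xa))
```

Each tuple in `pieces` records an interval $[a,b]$ and the affine map $\Delta(x)=mx+c$ on it.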


two families of proxies

In this setup, most of the proxies we look at fall into one of two broad families.

family A: decision-geometry proxies (about correctness of the decision)

These try to say something meaningful about the geometry of the disagreement set $\{x\in[-1,1]: \operatorname{sign}(\Delta(x))\neq \operatorname{sign}(x)\}$.

family B: scale / landscape proxies (about sensitivity of the decision, not correctness)

These are about “how wild” $\Delta$ is, or how sensitive the parameters are.
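To make both families concrete, here is a simple grid-based stand-in for a few of the proxies. The names `Leb_wrong`, `nb_flips`, `Lmax`, `Lavg` follow the result tables below, but these grid versions are approximations for illustration; the post computes the proxies exactly from the affine pieces.

```python
import numpy as np

def proxies(delta, n=200_000):
    """Grid-based stand-ins for a few proxies of both families.

    delta: callable mapping an array of inputs to the logit gap Delta(x).
    """
    x = np.linspace(-1.0, 1.0, n)
    d = delta(x)
    pred = np.sign(d)
    # Family A: mass of the Bayes-disagreement set, and decision flips.
    leb_wrong = 2.0 * np.mean(pred != np.sign(x))  # Lebesgue measure on [-1, 1]
    nb_flips = int(np.sum(pred[1:] != pred[:-1]))  # sign changes of Delta
    # Family B: slope statistics of Delta via finite differences.
    slopes = np.abs(np.diff(d)) / (x[1] - x[0])
    return dict(Leb_wrong=leb_wrong, nb_flips=nb_flips,
                Lmax=slopes.max(), Lavg=slopes.mean())
```

For instance, for the affine rule $\Delta(x)=x-0.2$ this returns $\mathrm{Leb\_wrong}\approx 0.2$, one flip, and unit slopes.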


which proxies are closest to the right geometry?

Below, we run experiments where we vary the corruption level $p$ and look at the correlation between each proxy and the generalization gap across $p$. Before looking at the numbers, we can already expect a rough hierarchy in this 1d setup: the family A proxies, which target the Bayes-disagreement geometry directly, should track the gap better than the family B proxies, which only measure scale or sensitivity.


results

The one-hidden-layer ReLU network has 1d input, 2d output, and width 8096.

We train long enough to reach near-zero training error for all corruption levels. Experiments are repeated 5 times with different random seeds (training data, initialization).

Within each run, proxies are ranked by absolute Pearson correlation with the generalization gap across corruption levels $p$. We report the mean rank $\pm$ standard deviation and the mean correlation $\pm$ standard deviation.
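A minimal sketch of that per-run ranking step, assuming each proxy and the gap are stored as arrays indexed by corruption level:

```python
import numpy as np

def rank_proxies(gap, proxy_values):
    """Rank proxies by |Pearson correlation| with the generalization gap.

    gap: array of generalization gaps across corruption levels p (one run).
    proxy_values: dict name -> array of proxy values across the same levels.
    """
    corrs = {name: float(np.corrcoef(gap, v)[0, 1])
             for name, v in proxy_values.items()}
    order = sorted(corrs, key=lambda name: -abs(corrs[name]))
    ranks = {name: i + 1 for i, name in enumerate(order)}
    return corrs, ranks
```

Mean rank and mean correlation across seeds are then simple averages of these per-run outputs.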

The exact list of proxies computed can be read off the result tables below.

We obtain the following results (reproduce in colab).

Mean rank ± std (ranked by $|\mathrm{corr}|$ within each run):

| stat | mean rank ± std |
|------|-----------------|
| nb_flips | 1.20 ± 0.40 |
| Leb_wrong | 1.80 ± 0.40 |
| band_mass_1.0 | 4.00 ± 0.89 |
| band_mass_0.5 | 5.60 ± 2.42 |
| sam_sharp | 6.40 ± 0.49 |
| band_mass_0.2 | 6.60 ± 1.85 |
| band_mass_0.1 | 7.20 ± 3.37 |
| hess_top | 7.20 ± 3.31 |
| flat_rel | 9.00 ± 1.55 |
| Lmax | 10.00 ± 1.79 |
| nb_regions | 10.20 ± 2.04 |
| train_err | 11.20 ± 1.33 |
| Lavg_flip | 12.00 ± 4.00 |
| Lavg | 12.60 ± 1.02 |

Mean corr ± std:

| stat | mean corr ± std |
|------|-----------------|
| nb_flips | +0.92 ± 0.04 |
| Leb_wrong | +0.88 ± 0.03 |
| band_mass_1.0 | +0.72 ± 0.11 |
| band_mass_0.5 | +0.67 ± 0.15 |
| band_mass_0.2 | +0.64 ± 0.15 |
| band_mass_0.1 | +0.63 ± 0.19 |
| sam_sharp | +0.62 ± 0.13 |
| hess_top | +0.62 ± 0.09 |
| flat_rel | +0.51 ± 0.03 |
| Lmax | +0.49 ± 0.11 |
| train_err | +0.42 ± 0.13 |
| nb_regions | +0.41 ± 0.18 |
| Lavg | +0.29 ± 0.09 |
| Lavg_flip | +0.20 ± 0.21 |

The ordering roughly matches the expected hierarchy. Across repeated runs, three quantities stay near the top: nb_flips, Leb_wrong, and band_mass_1.0.

Most other quantities show much larger spread across setups (training set, initialization, width of the network). This is consistent with the fact that they measure sensitivity of the decision rule rather than its correctness.


takeaway

This closes the first group of posts built around understanding generalization in the same 1d classification example.

In that example, we now know the target exactly: at interpolation, the generalization gap equals the population error, namely Bayes error plus a term controlled by the mass of the region where the learned classifier disagrees with the Bayes rule. Most common proxies do not measure that quantity. They measure how sensitive the learned rule is to changes in inputs (slopes, margins) or in parameters (sharpness), or how often its prediction changes.

The repeated-run picture illustrates that. The quantities that stay strongest are either the exact geometric target itself, or 1d comparators tailored to that same geometry. The more generic proxies are much less stable.

This is not specific to this 1d example: a fixed proxy only measures one aspect of the learned rule, and that aspect can be irrelevant to the Bayes-disagreement geometries that arise in more general classification problems.


cite this post as:

@misc{gonon2026proxies1d,
  title  = {Proxies for Generalization: What Survives in 1d When Generalization Gap Is Explicit?},
  author = {Antoine Gonon},
  year   = {2026},
  url    = {https://dimension-one.github.io/blog/2026-04-13-proxies-generalization-1d/}
}