
What Models Prefer to Learn: A Geometric Framing of Architecture and Regularization
Author(s): Sigurd Roll Solberg
Intro
What does a neural network really learn?
Every machine learning model, deep or shallow, learns by searching within a "hypothesis space": the set of functions it can, in principle, represent. But this space is not neutral territory. It is carved out and weighted by two forces: architecture and regularization.
- The architecture defines what can be expressed.
- Regularization defines how likely different regions of this space are to be explored or trusted.
This isn't a new observation. But as models grow more expressive and application-specific, understanding how these two elements interact becomes not just academic but foundational to intelligent model design.
Our goal in this post is to take this question seriously. We'll explore how different neural architectures sculpt the geometry and topology of hypothesis spaces, and how regularization can be viewed not simply as a constraint but as a prioritization scheme: a way of emphasizing certain "regions" of the hypothesis space over others. By reframing the problem geometrically, we aim to build intuition for what models prefer to learn, and why.
Exploration
1. A Tale of Two Learners
Imagine two neural networks trained on the same data. One is a shallow MLP; the other is a convolutional neural network. Both converge to low training error. Yet their generalization behavior differs dramatically.
Why?
Because even though both underlying architectures are "universal approximators," the shape of their hypothesis spaces is different. The MLP has no built-in notion of locality or translation invariance. It must learn such inductive biases from scratch. The CNN, by contrast, starts with a geometry: spatial locality is baked in.
This difference reflects not just a shift in what functions are representable, but in how easy it is for the optimizer to find and prefer certain solutions. The architecture defines not just a boundary around the space, but a gradient-weighted landscape over it.
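To make the contrast concrete, here is a minimal sketch (using PyTorch; the layer sizes are arbitrary illustrative choices, not from any particular experiment). It builds a small MLP and a small convolutional feature extractor and checks the CNN's shift equivariance directly. Circular padding is used so that equivariance to a circular shift holds exactly, rather than only away from the image boundary.

```python
# Illustrative sketch (PyTorch assumed; layer sizes are arbitrary choices).
import torch
import torch.nn as nn

# An MLP sees the image as an unstructured vector: no locality, no weight sharing.
mlp = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

# A convolutional feature extractor bakes locality and translation equivariance in.
# Circular padding makes equivariance to circular shifts exact.
cnn_features = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1, padding_mode="circular"), nn.ReLU(),
    nn.Conv2d(8, 8, kernel_size=3, padding=1, padding_mode="circular"), nn.ReLU(),
)

x = torch.randn(1, 1, 28, 28)
x_shifted = torch.roll(x, shifts=3, dims=-1)  # shift the input horizontally

# Shifting the input shifts the CNN's feature maps by the same amount ...
same = torch.allclose(
    cnn_features(x_shifted), torch.roll(cnn_features(x), shifts=3, dims=-1), atol=1e-5
)
print(same)  # True

# ... while the MLP's outputs bear no structural relationship to the shift.
print((mlp(x) - mlp(x_shifted)).abs().max())
```

The point is not that the CNN is "better": weight sharing and locality simply make shifted versions of a pattern the same computation, while the MLP has to discover that relationship from data.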
2. From Functions to Manifolds
To make this precise, think of the hypothesis space as a manifold embedded in a larger function space. An architecture carves out a submanifold of functions it can express. But this isn't a flat, uniform surface. It has:
- Curvature: Some functions are easier to reach (lower curvature), others harder (steep gradients, complex compositions).
- Volume: Some function classes occupy more "space"; e.g., shallow networks more easily model linear or low-frequency functions.
- Topology: Some architectures enforce continuity or symmetries that others do not.
This brings us to a geometric deep learning lens: architectural priors shape the metric and topology of the hypothesis space [2]. CNNs favor translationally equivariant functions. GNNs favor permutation invariance. Transformers? Attention-weighted global interactions.
The optimizer doesn't explore all of function space; it flows along this curved, structured manifold defined by the architecture.
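One convenient way to write this picture down (the notation here is ours, purely for illustration) is to treat the architecture as a parameterization map from weight space into function space. The hypothesis "manifold" is its image, and the pullback of the function-space inner product is what makes some directions easy to move in and others hard.

```latex
% The architecture as a parameterization map (notation is ours, for illustration).
% Weights \theta live in \Theta \subseteq \mathbb{R}^p; the architecture sends each
% \theta to the function f_\theta that the network computes.
\[
\Phi : \Theta \to \mathcal{F}, \qquad \Phi(\theta) = f_\theta,
\qquad \mathcal{H} = \Phi(\Theta) \subseteq \mathcal{F}.
\]
% Locally, the geometry of \mathcal{H} is the pullback of the function-space inner product:
\[
g_{ij}(\theta) = \big\langle \partial_{\theta_i} f_\theta,\; \partial_{\theta_j} f_\theta \big\rangle_{\mathcal{F}},
\]
% so "curvature" and "volume" above can be read as statements about how \Phi stretches
% and compresses parameter space on its way into function space.
```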
3. Regularization as a Measure over Hypothesis Space
Now enter regularization. In its classic form (e.g., L2 norm), it's often interpreted as penalizing complexity. But this view is limited. More deeply, regularization defines a measure over the hypothesis space, a way of saying: "These functions are more likely. These ones are suspect."
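The textbook case makes this concrete. Under a maximum a posteriori (MAP) reading (a standard identity, written here in our own notation), adding an L2 penalty to a negative log-likelihood loss is the same as placing a Gaussian prior on the weights. The penalty literally is a log-density over parameter space, and, pushed through the parameterization, over the functions the architecture can express.

```latex
% L2 regularization read as MAP estimation under a Gaussian prior
% (a standard identity in our own notation; L is the negative log-likelihood).
\[
\hat{\theta}
  = \arg\min_{\theta} \Big[ \underbrace{L(\theta)}_{-\log p(\mathcal{D}\mid\theta)}
      + \lambda \lVert \theta \rVert_2^2 \Big]
  = \arg\max_{\theta}\; p(\mathcal{D}\mid\theta)\, p(\theta),
\qquad
p(\theta) = \mathcal{N}\!\Big(0, \tfrac{1}{2\lambda} I\Big).
\]
```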
Dropout, for example, flattens reliance on specific units, favoring more distributed representations. Spectral norm regularization constrains Lipschitz continuity, biasing toward smoother functions. Bayesian neural networks make this idea explicit: the prior over weights induces a prior over functions.
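In code, these biases attach at different points of the training setup. A minimal PyTorch sketch, purely illustrative (layer sizes and coefficients are placeholders, not recommendations):

```python
# Illustrative sketch (PyTorch assumed): where three common regularizers attach.
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

model = nn.Sequential(
    # Spectral norm caps the layer's largest singular value, bounding its Lipschitz
    # constant and biasing the model toward smoother functions.
    spectral_norm(nn.Linear(784, 256)),
    nn.ReLU(),
    # Dropout randomly zeroes units during training, discouraging reliance on any
    # single unit and favoring more distributed representations.
    nn.Dropout(p=0.5),
    nn.Linear(256, 10),
)

# Weight decay enters through the optimizer: the L2 penalty that, in the MAP reading
# above, corresponds to a Gaussian prior over the weights.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)
```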
Viewed this way, regularization isn't a constraint on learning; it's a shaping force. It sculpts the energy landscape. It changes which valleys the optimizer is most likely to settle into.
This becomes especially interesting when we realize that different regularizers and architectures may interact nonlinearly. A regularizer that improves generalization in one architecture may hurt it in another, simply because the underlying hypothesis space is differently curved or composed.
Resolution
A Geometric Framing of Learning Bias
Let's sharpen the central claim:
Learning is a process of moving along a structured manifold, defined by the architecture, following a flow field shaped by regularization, in pursuit of a low-energy state defined by the loss function.
In this framing:
- Architecture defines the manifold of functions the model can express: the terrain on which learning happens.
- Regularization imposes a density or potential field over this terrain: some directions become easier, some harder.
- The loss function defines the energy landscape: it tells us where the valleys lie, where the model should settle.
The optimization algorithm, usually gradient descent, acts as a navigator. But it doesn't traverse all of function space. It flows along this manifold, biased by regularization, toward regions of low loss.
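In symbols (again our own notation, tying the pieces together): the trajectory in parameter space is a regularized gradient flow, and the architecture's parameterization carries it onto the hypothesis manifold.

```latex
% Regularized gradient flow (our notation). The loss L defines the energy landscape,
% the regularizer R tilts it, and the parameterization \Phi from earlier carries the
% trajectory onto the hypothesis manifold \mathcal{H} = \Phi(\Theta).
\[
\dot{\theta}(t) = -\,\nabla_{\theta}\Big[ L\big(f_{\theta(t)}\big) + \lambda\, R\big(\theta(t)\big) \Big],
\qquad
f_{\theta(t)} = \Phi\big(\theta(t)\big) \in \mathcal{H}.
\]
```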
This perspective reframes generalization not as mere convergence, but as a bias-aware descent on a curved manifold, where both geometry and preference shape the final outcome.
Conclusion
Designing With Geometry in Mind
If we accept that architecture and regularization jointly shape the hypothesis space, then several strategic insights follow:
- Architectural choices should be guided not just by empirical performance but by understanding what kind of manifold they induce. Geometry matters.
- Regularization strategies should be tuned to the architecture β not just in hyperparameter terms, but in philosophical terms: what kind of functions are we favoring?
- Future research might benefit from explicit characterizations of these manifolds: can we map the implicit bias of different models, or even interpolate between hypothesis spaces?
Perhaps most provocatively: we may want to design architectures and regularizers in tandem, as complementary instruments in sculpting the modelβs functional landscape.
This is not a call to abandon empirical methods. But it is a call to infuse them with geometric and probabilistic awareness. To think not just in terms of performance, but of preference β what our models are predisposed to learn, and why.
If geometric deep learning taught us that data lives on a manifold, then perhaps the next lesson is this: so do our models.
References
- [1] Poggio et al., βTheory of Deep Learning III: Explaining the Non-overfitting Puzzleβ
- [2] Bronstein et al., βGeometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gaugesβ