
Perceptron Is Not SGD: A Better Interpretation Through Pseudogradients

Last Updated on October 29, 2021 by Editorial Team

Author(s): Francesco Orabona, Ph.D.

Actual conversation with my wife (Image from xkcd.com)

There is a popular interpretation of the Perceptron as a stochastic (sub)gradient descent procedure. I even found slides online with this idea. The thought of so many young minds twisted by these false claims was too much to bear. So, I felt compelled to write a blog post to explain why this is wrong… 😉
Moreover, I will also give a different and (I think) much better interpretation of the Perceptron algorithm.

1. Perceptron Algorithm

The Perceptron algorithm was introduced by Rosenblatt in 1958. To be more precise, he introduced a family of algorithms characterized by a certain architecture, and he considered what we now call supervised and unsupervised training procedures. However, nowadays when we talk about the Perceptron we mean the following algorithm:

In the algorithm, the pairs {({\boldsymbol z}_t,y_t)} for {t=1,\dots,N}, with {{\boldsymbol z}_t \in {\mathbb R}^d} and {y_t \in \{-1, 1\}}, represent a set {S} of {N} input/output pairs that we want to learn to classify correctly in the two categories {+1} and {-1}. We assume that there exists an unknown vector {{\boldsymbol u}} that correctly classifies all the samples, that is {\mathrm{sign}(\langle {\boldsymbol z}_t, {\boldsymbol u}\rangle) = y_t}. Note that any scaling of {{\boldsymbol u}} by a positive constant still correctly classifies all the samples, so there are infinitely many solutions. The aim of the Perceptron is to find any of these solutions.
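
For concreteness, here is a minimal Python sketch of the Perceptron just described; the function name and the NumPy-based interface are my own illustration, not from the original post.

```python
import numpy as np

def perceptron(Z, y, max_epochs=100):
    """Classic Perceptron. Z: (N, d) array of inputs z_t; y: (N,) labels in {-1, +1}."""
    N, d = Z.shape
    x = np.zeros(d)                           # initialize x_1 = 0
    for _ in range(max_epochs):               # repeated passes over the data
        updates = 0
        for t in range(N):
            if y[t] * np.dot(Z[t], x) <= 0:   # mistake (or zero margin): update
                x = x + y[t] * Z[t]
                updates += 1
        if updates == 0:                      # all samples classified correctly
            break
    return x
```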

From an optimization point of view, this is called a feasibility problem, that is, something like

\displaystyle \begin{aligned} \text{Find} & \ {\boldsymbol x} \\ \text{s.t.} & \ {\boldsymbol x} \in C, \end{aligned}

where {C} is some set. Feasibility problems are an essential step in constrained optimization for algorithms that require a feasible initial point. They are not optimization problems, even if in some cases they can be solved with an optimization formulation.

In the Perceptron case, we can restate the problem as

\displaystyle \begin{aligned} \text{Find} & \ {\boldsymbol x}\\ \text{s.t.} & \ y_i \langle {\boldsymbol z}_i,{\boldsymbol x}\rangle \geq 1 , i=1, \dots, N, \end{aligned}

where the "1" on the r.h.s. is clearly arbitrary and can be changed by rescaling {{\boldsymbol x}}. So, in optimization language, the Perceptron algorithm is nothing else than an iterative procedure to solve the above feasibility problem.

2. Issues with the SGD Interpretation

As said above, sometimes people refer to the Perceptron as a stochastic (sub)gradient descent algorithm on the objective function

\displaystyle \label{eq:obj_relu_perceptron} F({\boldsymbol x}) = \sum_{i=1}^N \max(-y_i \langle {\boldsymbol z}_i,{\boldsymbol x}\rangle,0)~. \ \ \ \ \ (1)

I think there are many problems with this idea; let me list some of them:

  • First of all, the above interpretation assumes that we take the samples randomly from {S}. However, this is not needed in the Perceptron, and it was not needed in the first proofs of the Perceptron convergence (Novikoff, 1963). There is a tendency to call anything that receives one sample at a time "stochastic", but "arbitrary order" and "stochastic" are clearly not the same.
  • The Perceptron is typically initialized with {{\boldsymbol x}_1=\boldsymbol{0}}. Now, we have two problems. The first one is that with a black-box first-order oracle, we would get a subgradient of {\max(-y_i \langle {\boldsymbol z}_i,{\boldsymbol x}\rangle,0)}, where {i} is drawn uniformly at random in {[1,N]}. A possible subgradient for any {i} is {\boldsymbol{0}}. This means that SGD would not update. Instead, the Perceptron in this case does update. So, we are forced to consider a different model than the black-box one. Changing the oracle model is a minor problem, but this fact hints at another very big issue.
  • The biggest issue is that {{\boldsymbol x}_1=\boldsymbol{0}} is a global optimum of {F}! Indeed, {F} is nonnegative everywhere and {F(\boldsymbol{0})=0} (see the numeric check after this list). So, there is nothing to minimize: we are already done at the first iteration. However, from a classification point of view, this solution seems clearly wrong. So, it seems we constructed an objective function we want to minimize and a corresponding algorithm, but for some reason we do not like one of its infinitely many minimizers. So, maybe, the objective function is wrong? So, maybe, this interpretation misses something?
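
Here is a tiny numeric check of the last point, on made-up separable data (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(20, 5))
y = np.sign(Z @ rng.normal(size=5))      # made-up linearly separable labels

def F(x):
    # objective (1): sum over the samples of max(-y_i <z_i, x>, 0)
    return np.sum(np.maximum(-y * (Z @ x), 0.0))

x1 = np.zeros(5)
print(F(x1))                             # 0.0: the initialization already minimizes F
print(F(rng.normal(size=5)) >= F(x1))    # True: F is nonnegative everywhere
print(y[0] * np.dot(Z[0], x1) <= 0)      # True: yet the Perceptron would update here
```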

There is an easy way to avoid some of the above problems: change the objective function to a parametrized loss that has a non-zero gradient at zero. For example, something like this:

\displaystyle F_\alpha({\boldsymbol x}) = \sum_{i=1}^N \log(1+\exp(-\alpha y_i \langle {\boldsymbol z}_i,{\boldsymbol x}\rangle))~.

Now, when {\alpha} goes to infinity, {F_\alpha/\alpha} recovers the function {F}. However, for any finite {\alpha}, {\boldsymbol{0}} is not a global optimum anymore. As a side effect, we also solved the issue of the subgradient of the max function. In this way, you could interpret the Perceptron algorithm as the limit behaviour of SGD on a family of optimization problems.
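
To see why {\boldsymbol{0}} is no longer optimal, here is the quick gradient computation, denoting by {\sigma(u)=1/(1+e^{-u})} the sigmoid function:

\displaystyle \nabla F_\alpha({\boldsymbol x}) = -\alpha \sum_{i=1}^N \sigma(-\alpha y_i \langle {\boldsymbol z}_i,{\boldsymbol x}\rangle)\, y_i {\boldsymbol z}_i, \qquad \text{so} \qquad \nabla F_\alpha(\boldsymbol{0}) = -\frac{\alpha}{2} \sum_{i=1}^N y_i {\boldsymbol z}_i,

which is non-zero in general, so (S)GD started at {\boldsymbol{0}} does move.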

To be honest, I am not sure this is a satisfying solution. Moreover, the stochasticity is still there and it should be removed.

Now, I already proved a mistake bound for the Perceptron, without any particular interpretation attached to it. As a matter of fact, proofs do not need interpretations to be correct. I showed that the Perceptron competes with a family of loss functions, which implies that it does not just use the subgradient of a single function. However, if you need an intuitive way to think about it, let me present the idea of pseudogradients.

3. Pseudogradients

Suppose we want to minimize an {L}-smooth function {F} and we would like to use something like gradient descent. However, we do not have access to its gradient. In this situation, (Polyak and Tsypkin, 1973) proposed to use a "pseudogradient", that is, any vector {{\boldsymbol g}} that forms an angle of 90 degrees or less with the actual gradient at {{\boldsymbol x}}:

\displaystyle \langle {\boldsymbol g}, \nabla F({\boldsymbol x})\rangle \geq \gamma \geq 0~.

In a very intuitive way, {{\boldsymbol g}} gives me some information that should allow me to minimize {F}, at least in the limit. The algorithm then becomes a "pseudogradient descent" procedure that updates the current solution in the direction of the negative pseudogradient:

\displaystyle {\boldsymbol x}_{t+1} = {\boldsymbol x}_t - \eta_t {\boldsymbol g}_t,

where {\eta_t\geq0} are the step sizes or learning rates.

Note that (Polyak and Tsypkin, 1973) define the pseudogradient as a stochastic vector that satisfies the above inequality in conditional expectation and for a time-varying {\gamma}. Indeed, there are a number of very interesting results in that paper. However, for simplicity of exposition, I will only consider the deterministic case and only describe the application to the Perceptron.
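
To make the definition concrete, here is a small deterministic sketch (my own illustration, not from the paper): any vector {{\boldsymbol g} = D \nabla F({\boldsymbol x})} with {D} positive definite satisfies the inequality above, and descending along it still decreases a smooth {F}.

```python
import numpy as np

# Toy smooth objective F(x) = 0.5 * ||x - u||^2 with gradient x - u.
u = np.array([3.0, -1.0, 2.0])
F = lambda x: 0.5 * np.sum((x - u) ** 2)

rng = np.random.default_rng(0)
D = np.diag(rng.uniform(0.5, 2.0, size=3))   # positive definite scaling

x = np.zeros(3)
eta = 0.2                                    # small enough step size
for _ in range(200):
    g = D @ (x - u)     # pseudogradient: <g, grad F(x)> = (x-u)^T D (x-u) >= 0
    x = x - eta * g     # pseudogradient descent step
print(F(x))             # approximately 0: F was minimized without the exact gradient
```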

Let’s see how this would work. Let’s assume that {\gamma>0}, at least for an initial number of rounds, which means that the angle between the pseudogradient and the gradient is acute. From the {L}-smoothness of {F}, we have that

\displaystyle \begin{aligned} F({\boldsymbol x}_{t+1}) &\leq F({\boldsymbol x}_t) +\langle \nabla F({\boldsymbol x}_t), {\boldsymbol x}_{t+1} - {\boldsymbol x}_t\rangle + \frac{L}{2}\|{\boldsymbol x}_{t+1}-{\boldsymbol x}_t\|^2 \\ &= F({\boldsymbol x}_t) - \eta_t \langle \nabla F({\boldsymbol x}_t), {\boldsymbol g}_t \rangle + \eta_t^2 \frac{L}{2}\|{\boldsymbol g}_t\|^2 \\ &\leq F({\boldsymbol x}_t) - \eta_t \gamma + \eta_t^2 \frac{L}{2}\|{\boldsymbol g}_t\|^2~. \end{aligned}

Now, if {0 < \eta_t < \frac{2\gamma}{L \|{\boldsymbol g}_t\|^2}}, we have that {- \eta_t \gamma + \eta_t^2 \frac{L}{2}\|{\boldsymbol g}_t\|^2<0}, so we can guarantee that the value of {F} decreases at each step. So, we are minimizing {F} without using a gradient!

To get a rate of convergence, we should know something more about {{\boldsymbol g}_t}. For example, we could assume that {\|{\boldsymbol g}_t\|\leq G}. Then, setting {\eta_t = \frac{\gamma}{L G^2}}, we obtain

\displaystyle F({\boldsymbol x}_{t+1}) \leq F({\boldsymbol x}_t) - \frac{\gamma^2}{2 L G^2}~.
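
Indeed, this step size is just the minimizer of the quadratic in the decrease guarantee above:

\displaystyle \eta^\star = \mathop{\rm argmin}_{\eta>0} \left(-\eta \gamma + \eta^2 \frac{L G^2}{2}\right) = \frac{\gamma}{L G^2}, \qquad -\eta^\star \gamma + (\eta^\star)^2 \frac{L G^2}{2} = -\frac{\gamma^2}{2 L G^2}~.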

This is still not enough, because it is clear that {\gamma>0} cannot hold on all rounds: at a minimizer, {\nabla F({\boldsymbol x}_t)=\boldsymbol{0}}. However, with enough assumptions, following this route you can even get a rate of convergence.

4. Pseudogradients for the Perceptron

How do we use this to explain the Perceptron? Suppose your set {S} is linearly separable with a margin of 1. This means that there exists a vector {{\boldsymbol u}\in {\mathbb R}^d} such that

\displaystyle \label{eq:margin_assumption} \min_i y_i \langle {\boldsymbol z}_i, {\boldsymbol u}\rangle = 1~. \ \ \ \ \ (2)

Note that the value of the margin is arbitrary; we can change it just by rescaling {{\boldsymbol u}}.

Remark 1. An equivalent way to restate this condition is to constrain {{\boldsymbol u}} to have unitary norm and require

\displaystyle \min_i y_i \langle {\boldsymbol z}_i, {\boldsymbol u}\rangle = \gamma,

where {\gamma>0} is called the maximum margin of {S}. However, in the following I will not use the margin notation because it makes things a bit less clear from an optimization point of view.

We would like to construct an algorithm to find {{\boldsymbol u}} (or any positive scaling of it) from the samples {\{({\boldsymbol z}_i,y_i)\}_{i=1}^N}. So, we need an objective function. Here is the brilliant idea of Polyak and Tsypkin: in each iteration, take an arbitrary {({\boldsymbol z}_i,y_i)} and define {{\boldsymbol g}_t = -y_i {\boldsymbol z}_i \boldsymbol{1}[y_i \langle {\boldsymbol z}_i, {\boldsymbol x}_t\rangle \leq 0]}, which is exactly the negative of the update we use in the Perceptron. This turns out to be a pseudogradient for {F({\boldsymbol x}) = \frac{1}{2}\|{\boldsymbol x}-{\boldsymbol u}\|^2}. Indeed,

\displaystyle \begin{aligned} \langle {\boldsymbol g}_t, \nabla F({\boldsymbol x}_t) \rangle &= \langle -y_i {\boldsymbol z}_i \boldsymbol{1}[y_i \langle {\boldsymbol z}_i, {\boldsymbol x}_t\rangle \leq 0], {\boldsymbol x}_t-{\boldsymbol u}\rangle \\ &= \boldsymbol{1}[y_i \langle {\boldsymbol z}_i, {\boldsymbol x}_t\rangle \leq 0](\langle -y_i {\boldsymbol z}_i, {\boldsymbol x}_t\rangle + y_i \langle {\boldsymbol z}_i, {\boldsymbol u}\rangle) \\ &\geq \boldsymbol{1}[y_i \langle {\boldsymbol z}_i, {\boldsymbol x}_t\rangle \leq 0], \end{aligned}

where in the last inequality we used (2) and the fact that the indicator is non-zero only when {-y_i \langle {\boldsymbol z}_i, {\boldsymbol x}_t\rangle \geq 0}.

Let’s pause for a moment to look at what we did: we want to minimize {F({\boldsymbol x})=\frac{1}{2}\|{\boldsymbol x}-{\boldsymbol u}\|^2}, but its gradient is just impossible to calculate because it depends on {{\boldsymbol u}}, which we clearly do not know. However, every time the Perceptron finds a sample on which its prediction is wrong, we can construct a pseudogradient, without any knowledge of {{\boldsymbol u}}. It is even more surprising if you consider the fact that there is an infinite number of possible solutions {{\boldsymbol u}}, and hence functions {F}, yet the pseudogradient correlates positively with the gradient of any of them! Moreover, no stochasticity is necessary.
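
This property is easy to verify numerically; here is a sketch on synthetic separable data (all names and data are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.normal(size=4)
Z = rng.normal(size=(50, 4))
y = np.sign(Z @ u)
u = u / np.min(y * (Z @ u))   # rescale u so the margin condition (2) holds

x = rng.normal(size=4)        # an arbitrary current iterate x_t
for i in range(50):
    active = y[i] * np.dot(Z[i], x) <= 0     # indicator of a mistake
    g = -y[i] * Z[i] * active                # Perceptron pseudogradient
    # <g, grad F(x_t)> = <g, x_t - u> >= indicator, for this (unknown) u
    assert np.dot(g, x - u) >= float(active) - 1e-9
```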

At this point we are basically done. In fact, observe that {F({\boldsymbol x}) =\frac{1}{2}\|{\boldsymbol x}-{\boldsymbol u}\|^2} is 1-smooth. So, every time {{\boldsymbol g}_t\neq \boldsymbol{0}}, the analysis above tells us that

\displaystyle \begin{aligned} \frac{1}{2}\|{\boldsymbol x}_{t+1}-{\boldsymbol u}\|^2 - \frac{1}{2}\|{\boldsymbol x}_t - {\boldsymbol u}\|^2 \leq -\eta_t + \frac{\eta_t^2}{2} \|{\boldsymbol z}_t\|^2 \leq -\eta_t + \frac{\eta_t^2}{2} G^2, \end{aligned}

where in the last inequality we have assumed {\|{\boldsymbol z}_t\|\leq G}.

Setting {\eta_t=\eta}, summing over time, and denoting by {M} the number of updates over {T} iterations, we obtain

\displaystyle M \leq \frac{\eta M G^2}{2} + \frac{1}{2\eta}\|{\boldsymbol u}-{\boldsymbol x}_1\|^2 - \frac{1}{2\eta} \|{\boldsymbol u}-{\boldsymbol x}_{T+1}\|^2 \leq \frac{\eta M G^2}{2} + \frac{1}{2\eta}\|{\boldsymbol u}\|^2,

where we used the fact that {{\boldsymbol x}_1=\boldsymbol{0}}.

Now, there is the actual magic of the (parameter-free!) Perceptron update rule: as we explained here, the updates of the Perceptron are independent of {\eta>0}. That is, given an order in which the samples are presented to the algorithm, any fixed {\eta} makes the Perceptron update on the same samples; it only changes the scale of {{\boldsymbol x}_t}. Hence, even if the Perceptron algorithm uses {\eta=1}, we can consider an arbitrary {\eta} decided post-hoc to minimize the upper bound, obtaining

\displaystyle M \leq G \|{\boldsymbol u}\|\sqrt{M},

that is

\displaystyle M \leq G^2 \|{\boldsymbol u}\|^2~.

Now, observing that the r.h.s. is independent of {T}, we proved that the maximum number of updates, or equivalently mistakes, of the Perceptron algorithm is bounded.
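
For completeness, here is the one-line post-hoc optimization used above:

\displaystyle \eta^\star = \mathop{\rm argmin}_{\eta>0} \left(\frac{\eta M G^2}{2} + \frac{\|{\boldsymbol u}\|^2}{2\eta}\right) = \frac{\|{\boldsymbol u}\|}{G\sqrt{M}},

which, substituted back into the bound, gives exactly {M \leq G \|{\boldsymbol u}\|\sqrt{M}}.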

Are we done? Not yet! We can now improve the Perceptron algorithm by taking full advantage of the pseudogradient interpretation.

5. An Improved Perceptron

This is a little-known idea to improve the Perceptron. It can be shown with the classic analysis as well, but it comes very naturally from the pseudogradient analysis.

Let’s start from

\displaystyle \frac{1}{2}\|{\boldsymbol x}_{t+1}-{\boldsymbol u}\|^2 - \frac{1}{2}\|{\boldsymbol x}_t - {\boldsymbol u}\|^2 \leq -\eta_t + \frac{\eta_t^2}{2} \|{\boldsymbol z}_t\|^2~.

Now consider only the rounds in which {{\boldsymbol z}_t\neq \boldsymbol{0}} and set {\eta_t = \frac{1}{\|{\boldsymbol z}_t\|^2}}, which is the minimizer of the expression {-\eta_t + \frac{\eta_t^2}{2} \|{\boldsymbol z}_t\|^2}. So, we obtain

\displaystyle \label{eq:improved_perc_1} \frac{1}{2}\|{\boldsymbol x}_{t+1}-{\boldsymbol u}\|^2 - \frac{1}{2}\|{\boldsymbol x}_t - {\boldsymbol u}\|^2 \leq -\frac{1}{2\|{\boldsymbol z}_t\|^2}~. \ \ \ \ \ (3)

This means that now the update rule becomes

\displaystyle {\boldsymbol x}_{t+1} = {\boldsymbol x}_t + \frac{y_i {\boldsymbol z}_i}{\|{\boldsymbol z}_i\|^2} \boldsymbol{1}[y_i \langle {\boldsymbol z}_i, {\boldsymbol x}_t \rangle \leq 0]~.

Now, summing (3) over time, we get

\displaystyle \sum_{i=1}^M \frac{1}{\|{\boldsymbol z}_i\|^2} \leq \|{\boldsymbol u}\|^2 - \|{\boldsymbol x}_{T+1}-{\boldsymbol u}\|^2 \leq \|{\boldsymbol u}\|^2~.

It is clear that this inequality implies the previous one, because {\sum_{i=1}^M \frac{1}{\|{\boldsymbol z}_i\|^2} \geq \frac{M}{G^2}}. But we can even obtain a tighter bound. Using the inequalities between the harmonic, geometric, and arithmetic means, we have

\displaystyle M \leq \|{\boldsymbol u}\|^2 \frac{M}{\sum_{i=1}^M \frac{1}{\|{\boldsymbol z}_i\|^2}} \leq \|{\boldsymbol u}\|^2 \left(\prod_{i=1}^M \|{\boldsymbol z}_i\|^2\right)^{1/M} \leq \|{\boldsymbol u}\|^2 \frac{1}{M} \sum_{i=1}^M \|{\boldsymbol z}_i\|^2 \leq \|{\boldsymbol u}\|^2 G^2~.

In words, the original Perceptron bound depends on the maximum squared norm of the samples on which we updated. Instead, this bound depends on the harmonic mean of the squared norms of the samples on which we updated, which is upper bounded by their geometric and arithmetic means, and hence by the maximum.
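
In code, the change to the earlier Perceptron sketch is one line; again a minimal illustration under the same made-up interface:

```python
import numpy as np

def improved_perceptron(Z, y, max_epochs=100):
    """Perceptron with the normalized step size eta_t = 1 / ||z_t||^2."""
    N, d = Z.shape
    x = np.zeros(d)
    for _ in range(max_epochs):
        updates = 0
        for t in range(N):
            # skip zero samples; update on mistakes (or zero margin)
            if np.any(Z[t]) and y[t] * np.dot(Z[t], x) <= 0:
                x = x + y[t] * Z[t] / np.dot(Z[t], Z[t])  # normalized update
                updates += 1
        if updates == 0:
            break
    return x
```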

6. Pseudogradients and Lyapunov Potential Functions

Some people might have already realized yet another way to look at this: {F({\boldsymbol x}) = \frac{1}{2}\|{\boldsymbol x}-{\boldsymbol u}\|^2} is the Lyapunov function typically used to analyze subgradient descent. In fact, the classic analysis of SGD considers the guaranteed decrement of this function at each step. The two things coincide, but I find that the pseudogradient idea adds a non-trivial amount of information, because it allows us to bypass the idea of using a subgradient of the loss function completely.

Moreover, the idea of the pseudogradients is more general because it applies to any smooth function, not only to the choice of {F({\boldsymbol x}) = \frac{1}{2}\|{\boldsymbol x}-{\boldsymbol u}\|^2}.

Overall, it is clear that all the good analyses of the Perceptron must have something in common. However, sometimes recasting a problem in a particular framework might have some advantages because it helps our intuition. In this view, I find the pseudogradient view particularly compelling because it aligns with my intuition of how an optimization algorithm is supposed to work.

7. History Bits

I already wrote about the Perceptron, so I will just add a few more relevant bits.

As I said, it seems that the family of Perceptron algorithms was intended to be something much more general than what we mean now. The particular class of Perceptron we use nowadays was called the {\alpha}-system (Block, 1962). I hypothesize that the fact that the {\alpha}-system survived the test of time is exactly due to the simple convergence proof in (Block, 1962) and (Novikoff, 1963). Both proofs are non-stochastic. For the sake of proper credit assignment, it seems that the convergence of the Perceptron was proved by many others before Block and Novikoff (see the references in Novikoff, 1963). However, the proof in (Novikoff, 1963) seems to be the cleanest one. (Aizerman, Braverman, and Rozonoer, 1964) (essentially) describe for the first time the Kernel Perceptron and prove a finite mistake bound for it.

I got the idea of smoothing the Perceptron algorithm with a scaled logistic loss from a discussion on Twitter with Maxim Raginsky. He wrote that (Aizerman, Braverman, and Rozonoer, 1970) proposed some kind of smoothing of the objective function in (1) in a Russian book, but I don't have access to it, so I am not sure what the details are. I just thought of a very natural one.

The idea of pseudogradients and the application to the Perceptron algorithm is in (Polyak and Tsypkin, 1973). However, there the input/output samples are still stochastic and the finite bound is not explicitly calculated. As I have shown, stochasticity is not needed. It is important to remember that online convex optimization as a field came much later, so there was no reason for these people to consider arbitrary or even adversarial orders of the samples.

The improved Perceptron mistake bound could be new (but please let me know if it isn't!) and it is inspired by the idea in (Graepel, Herbrich, and Williamson, 2001) of normalizing the samples to show a tighter bound.

Acknowledgements

Given the insane amount of mistakes that Nicolò Campolongo usually finds in my posts, this time I asked him to proofread it. So, I thank Nicolò for finding an insane amount of mistakes in a draft of this post 🙂

Bio: Francesco Orabona is an Assistant Professor at Boston University. He received his Ph.D. in Electronic Engineering from the University of Genoa, Italy, in 2007. As a result of his activities, Dr. Orabona has published more than 70 papers in scientific journals and conferences on the topics of online learning, optimization, and statistical learning theory. His latest work is on "parameter-free" machine learning and optimization algorithms, that is, algorithms that achieve the best performance without the need to tune parameters such as regularization, learning rate, momentum, etc.


Perceptron Is Not SGD: A Better Interpretation Through Pseudogradients was originally published in Parameter-Free Learning and Optimization Algorithms.

Published via Towards AI
