Mastering Deep Learning: The Art of Approximating Non-Linearities with Piecewise Estimations Part-2
Last Updated on November 5, 2023 by Editorial Team
Author(s): Raagulbharatwaj K
Originally published on Towards AI.
Greetings, everyone! Welcome to the second installment of my Mastering Deep Learning series. This article is a continuation of the first part, titled The Art of Approximating Non-Linearities with Piecewise Estimations Part-1. In the first article, we saw that neural networks combine multiple linear functions of an input x to estimate an output y by learning a mapping f(x, φ) from the input space to the output space. We observed that these mappings are inherently linear, and that it is the activation functions that introduce non-linearity. We also saw that we can approximate a non-linear function by linearly combining several piece-wise linear functions. The creation of these piece-wise linear segments is a property of functions that threshold or clip the input at one or more points (for example, ReLU). As the number of linear regions approaches infinity, the length of each region becomes infinitesimally small, and what was previously a piecewise linear structure evolves into a non-linear function. The universal approximation theorem states that for any continuous function, there exists a shallow network that can approximate it to any specified precision.
However, some functions require an impractically large number of hidden units to be estimated to the required precision. This limitation led to the development of deep neural networks: for a given number of parameters, a deep network can create far more linear regions than a shallow one. Let us build the intuition behind deep neural networks by considering two shallow neural networks with 3 hidden units each, where the output of network-1 is fed as the input to network-2.
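As a rough sketch of this setup (the author's actual code lives in the GitHub repository linked at the end, so the framework, class names, and details here are assumptions), the two composed shallow ReLU networks could look like this in PyTorch:

```python
import torch
import torch.nn as nn

class ShallowNet(nn.Module):
    """A shallow network: one hidden ReLU layer and a linear output."""
    def __init__(self, hidden_units=3):
        super().__init__()
        self.hidden = nn.Linear(1, hidden_units)   # x -> 3 hidden pre-activations
        self.act = nn.ReLU()
        self.out = nn.Linear(hidden_units, 1)      # hidden units -> scalar output

    def forward(self, x):
        return self.out(self.act(self.hidden(x)))

class ComposedNet(nn.Module):
    """Two shallow networks stacked: the output of net1 is the input of net2."""
    def __init__(self):
        super().__init__()
        self.net1 = ShallowNet(3)   # maps x to the intermediate output y1
        self.net2 = ShallowNet(3)   # maps y1 to the final estimate y

    def forward(self, x):
        # pass the intermediate output through a ReLU before feeding network-2,
        # consistent with the equation for y1 given later in the article
        y1 = torch.relu(self.net1(x))
        return self.net2(y1)
```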
Now let us train the above network to estimate the function y = sin(2x) + cos(x) + x and try to understand how it works.
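A hedged sketch of such a training run is shown below; the sampling range, optimizer, learning rate, and number of steps are illustrative choices, not the article's exact settings:

```python
# generate training data for y = sin(2x) + cos(x) + x
x = torch.linspace(-3, 3, 500).unsqueeze(1)     # inputs, shape (500, 1)
y = torch.sin(2 * x) + torch.cos(x) + x          # targets

model = ComposedNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for step in range(5000):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)   # mean-squared error against the targets
    loss.backward()
    optimizer.step()
```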
The above plot depicts the function estimated by the deep network. Let us try to understand this by dissecting the estimation layer by layer.
The plots may not make much sense at this point; one reason is that the network's capacity is not sufficient. Still, let us work through what we have step by step, starting with the relationship
y = θ₁₁h₁₁ + θ₁₂h₁₂ + θ₁₃h₁₃ + β₁
Here, y describes a 3-dimensional hyperplane in h₁₁, h₁₂, and h₁₃. Unfortunately, we cannot visualize this hyperplane, but if we look carefully, we can rewrite the relationship as follows:
y = θ₁₁ReLU[φ₁y₁ + β₁₁] + θ₁₂ReLU[φ₂y₁ + β₁₂] + θ₁₃ReLU[φ₃y₁ + β₁₃] + β₁.
The above relationship describes a piece-wise linear 1-D hyperplane in y₁: we took the 3-dimensional hyperplane and unfolded it into a single-dimensional one by moving from y to y₁. Deep neural networks do exactly the same thing, but in reverse: they take a lower-dimensional surface and fold it in higher dimensions to produce complex representations. The higher-dimensional space that arises from this folding is often known as the latent space, and the estimated higher-dimensional surface is known as the latent representation of the input. In our example, the mapping from x to y is obtained through the following latent representations, from x to y₁ and from y₁ to y:
y₁ = ReLU[θ₁h₁ + θ₂h₂ + θ₃h₃ + β]
y = θ₁₁h₁₁ + θ₁₂h₁₂ + θ₁₃h₁₃ + β₁
Again, these are 3-D hyperplanes, even though the underlying mapping from x to y is 1-D. These latent relationships can be unfolded to obtain the underlying 1-D mapping from x to y. Hence, we can think of deep networks as folding the input space. If all the math and dimensions confuse you, just imagine your input space as a sheet of paper: to fold the paper, you have to move through the third dimension. A deep neural network does exactly the same thing, but in much higher dimensions.
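One simple way to inspect these two latent mappings in code (again, a sketch rather than the author's plotting code) is to evaluate the intermediate output y₁ of the composed model from the earlier sketch and plot each stage separately:

```python
import matplotlib.pyplot as plt

with torch.no_grad():
    y1 = torch.relu(model.net1(x))   # latent output of the first shallow network
    y_hat = model.net2(y1)           # final estimate produced from y1

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].plot(x.squeeze(), y1.squeeze())      # stage 1: x -> y1
axes[0].set_title("x -> y1 (first network)")
axes[1].plot(y1.squeeze(), y_hat.squeeze())  # stage 2: y1 -> y
axes[1].set_title("y1 -> y (second network)")
axes[2].plot(x.squeeze(), y_hat.squeeze())   # full composition: x -> y
axes[2].set_title("x -> y (composition)")
plt.tight_layout()
plt.show()
```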
Till now, we have considered a toy example in which we thought of the deep network as a composition of two shallow networks. Now let us look at a deeper network of the kind used in practice:
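A plausible version of such a network is sketched below with the 7-unit-wide hidden layers used in the next experiment; the number of hidden layers (here, three) is an assumption, since the article only specifies the width. It can be trained exactly like the composed toy model above (Adam plus mean-squared error).

```python
# a fully connected ReLU network with 7 units in each hidden layer
deep_model = nn.Sequential(
    nn.Linear(1, 7), nn.ReLU(),
    nn.Linear(7, 7), nn.ReLU(),
    nn.Linear(7, 7), nn.ReLU(),
    nn.Linear(7, 1),
)
```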
Let us train the above network, see how well it estimates the function, and find out whether we can capture the folding. This time, let us use 7 neurons in each layer instead of 3. The piece-wise function estimated by the network is visualized below:
We can see that the network did a fair job of estimating the function. Now, let's investigate whether there are any indications of folding by visualizing the latent representations. If folding has occurred, as previously discussed, we might observe one or more of the following signs (a rough way to check for them in code is sketched after this list):
- Change in Value Range: We would detect alterations in the range of values that the function represents along the x-axis, y-axis, or both, contingent upon how the deep network executed the folding operation.
- Overlapping or Loops: We might witness the function winding over itself in certain instances, resulting in loop-like structures.
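The sketch below shows one way to produce such plots, under the assumption that deep_model is the Sequential network defined earlier: collect the activations after each ReLU and plot them both against x (to spot changes in value range) and against each other (where loop-like structures can appear). Which pair of latent units to trace is an arbitrary choice here.

```python
with torch.no_grad():
    h = x
    latents = []
    for layer in deep_model:
        h = layer(h)
        if isinstance(layer, nn.ReLU):
            latents.append(h)            # latent representation after this layer

first = latents[0]                       # shape (500, 7): 7 hidden units over x
plt.plot(x.squeeze(), first)             # one curve per hidden unit
plt.title("First latent representation vs x")
plt.show()

plt.plot(first[:, 0], first[:, 1])       # parametric trace of two latent units
plt.title("Two latent coordinates traced as x varies")
plt.show()
```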
As anticipated, the network did fold the input space to create latent representations, which is clearly noticeable from the above plot. For instance, if we examine the pink line, we observe that its x and y ranges have been interchanged due to this transformation. Similarly, the blue line has undergone a transformation resulting in the formation of a loop, along with a reduction in the x-axis range. These transformations can be as straightforward as swapping coordinate axes or, in more complex cases, giving rise to intricate loop-like structures.
The concept of folding in deep neural networks can be likened to a simple, intuitive analogy. Imagine you have a piece of paper with a curve drawn on it. When you fold the paper along the curve and make cuts, you'll notice that these cuts effectively double when you unfold the paper. Deep neural networks employ a similar principle, with one crucial distinction: instead of the cuts being made by hand, the folds are learned during training. The network adjusts its latent representations by effectively clipping and transforming them to create more linear regions, and this learned folding allows it to fit the data distribution as closely as possible.
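This "cuts multiply when you unfold" effect can also be checked numerically. The sketch below (an illustration, not the author's code) counts the linear pieces of a 1-D ReLU network by evaluating its exact slope on a fine grid with autograd and counting slope changes; applied to the shallow net1 and to the full composition from the earlier sketch, the composition typically shows noticeably more regions.

```python
def count_linear_regions(net, lo=-3.0, hi=3.0, n=5001, tol=1e-6):
    """Count linear pieces of a scalar piece-wise linear network on [lo, hi]."""
    xs = torch.linspace(lo, hi, n).unsqueeze(1).requires_grad_(True)
    ys = net(xs)
    # exact slope of the piece-wise linear output at every grid point
    slopes = torch.autograd.grad(ys.sum(), xs)[0].squeeze(1)
    changes = (slopes[1:] - slopes[:-1]).abs() > tol
    return int(changes.sum().item()) + 1

print(count_linear_regions(model.net1))   # shallow network: at most 4 regions
print(count_linear_regions(model))        # composition: usually many more
```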
Let's construct a dataset using a more intricate function, f(x) = sin(x) + sin(2x)cos(3x) + 0.2sin(5x)cos(7x) + 0.1sin(10x), combined with Gaussian noise. Our goal is to assess whether deep neural networks can effectively estimate the underlying distribution in the presence of this noise.
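A sketch of how such a noisy dataset could be generated follows; the input range, sample count, and noise standard deviation of 0.1 are assumptions, since the article only says "Gaussian noise".

```python
def target_fn(x):
    # f(x) = sin(x) + sin(2x)cos(3x) + 0.2 sin(5x)cos(7x) + 0.1 sin(10x)
    return (torch.sin(x) + torch.sin(2 * x) * torch.cos(3 * x)
            + 0.2 * torch.sin(5 * x) * torch.cos(7 * x)
            + 0.1 * torch.sin(10 * x))

x_train = torch.linspace(-3, 3, 1000).unsqueeze(1)
y_train = target_fn(x_train) + 0.1 * torch.randn_like(x_train)  # add Gaussian noise
```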
It's evident that the model has captured the underlying distribution to a reasonable degree. If we were to increase the model's capacity, our estimate would likely improve further. This demonstrates the remarkable potential of deep neural networks as powerful estimators, which has earned them a significant place in the contemporary machine-learning landscape.
This blog draws significant inspiration from the book "Understanding Deep Learning" by Simon J.D. Prince (udlbook.github.io/udlbook/). In the final section, we will delve into how these networks make estimations for multivariate inputs and outputs. The code I utilized to generate the plots can be found below. If you've found this blog insightful, I would greatly appreciate your support by giving it a like.
Code: Understanding-Deep-Learning/Mastering_Deep_Learning_The_Art_of_Approximating_Non_Linearities_with_Pi… in the Raagulbharatwaj/Understanding-Deep-Learning repository on GitHub (github.com/Raagulbharatwaj/Understanding-Deep-Learning).
Published via Towards AI