Mastering Deep Learning: The Art of Approximating Non-Linearities with Piecewise Estimations Part-1

Last Updated on November 5, 2023 by Editorial Team

Author(s): Raagulbharatwaj K

Originally published on Towards AI.

In the past year, we’ve witnessed an explosive surge in the popularity of Deep Learning. Large Language Models like GPT-4 and generative models like DALL·E are dominating conversations across the internet. The enthusiasm surrounding Deep Learning has spurred the development of supercomputers like Nvidia’s DGX GH200, a computational powerhouse boasting an astonishing 1 exaflop of performance, tailor-made for neural network computations. Whenever Deep Learning is the subject of discussion, Deep Neural Networks inevitably come up, and the reverse holds as well. Deep Neural Networks are equations that can represent an extremely broad family of relationships between input and output. Often these relationships are extremely complicated, non-linear, and difficult to visualize, so how do Deep Neural Networks represent them so comfortably?

To grasp how a deep neural network approximates such intricate functions, we begin by examining a basic shallow neural network and dissecting how it models these associations. Neural networks in general are functions f(x, ϕ) that map multivariate inputs x to multivariate outputs y, where ϕ is the set of parameters of the function f. For example, if f(x, ϕ) = ax + b, then ϕ is the set {a, b}. Shallow neural networks are made of fundamental computational units referred to as neurons, which act as linear estimators, each estimating a different linear function.
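To make this concrete, the f(x, ϕ) = ax + b example can be written directly in Python (a purely illustrative snippet, not taken from the repository linked at the end):

```python
def f(x, phi):
    a, b = phi              # phi is the parameter set {a, b}
    return a * x + b

y = f(2.0, (3.0, 1.0))      # a = 3, b = 1  ->  y = 3*2 + 1 = 7
```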

A simple shallow neural network with 3 hidden units

In general, a simple shallow neural network like the one shown in the figure above, with one input x, one output y, and D hidden units, can be represented as:

y = ϕ₀ + Σᵢ ϕᵢhᵢ (the sum running over the D hidden units)

where ϕ₀ is the bias, ϕᵢ is the output weight corresponding to hidden unit i, and hᵢ is the value computed by hidden unit i. Each hidden unit computes hᵢ as follows:

hᵢ = a[θᵢ₀ + θᵢ₁x]

where θᵢ₀ is the bias and θᵢ₁ is the input weight corresponding to hidden unit i, and a[.] is an activation function. So far we have only built a linear function, yet the neural network is supposed to estimate a non-linear function. The activation function is what introduces this non-linearity. To be precise, it produces what is known as a piecewise linear function, which serves as an approximation for handling non-linear functions.
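Written out in code, these two equations amount to only a few lines. Here is a minimal NumPy sketch (the function and argument names are my own, not those used in the repository linked at the end):

```python
import numpy as np

def relu(z):
    # ReLU activation: a[z] = max(0, z)
    return np.maximum(0.0, z)

def shallow_network(x, phi_0, phi, theta_0, theta_1, activation=relu):
    """Forward pass of a shallow network with one input, one output, and D hidden units.

    phi_0   : output bias (scalar)
    phi     : output weights phi_i, shape (D,)
    theta_0 : hidden-unit biases theta_i0, shape (D,)
    theta_1 : hidden-unit input weights theta_i1, shape (D,)
    """
    h = activation(theta_0 + theta_1 * x)   # h_i = a[theta_i0 + theta_i1 * x]
    return phi_0 + np.dot(phi, h)           # y = phi_0 + sum_i phi_i * h_i
```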

Let us conduct a simple experiment to see how this activation function introduces non-linearity. For this experiment, let us take three simple linear functions of x, call them y₁, y₂, and y₃, each with its own slope and offset.

Now let us combine y₁, y₂, and y₃ to find y using the above equation and visualize the result.

We can see that the result is a linear function. Now let us pass y₁, y₂, and y₃ through an activation function before combining them and visualize the result.
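The original plots were produced with the code linked at the end; the sketch below reproduces the same idea with made-up slopes, offsets, and output weights (these particular numbers are illustrative, not the ones used in the post):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-2.0, 2.0, 400)

# Three simple linear functions (illustrative coefficients)
y1 = -1.0 * x + 0.3
y2 =  0.5 * x - 0.2
y3 =  2.0 * x - 1.0

# Output bias and weights phi (also illustrative)
phi_0, phi_1, phi_2, phi_3 = 0.1, 1.0, -2.0, 1.5

relu = lambda z: np.maximum(0.0, z)

linear_combo    = phi_0 + phi_1 * y1 + phi_2 * y2 + phi_3 * y3                    # still a straight line
piecewise_combo = phi_0 + phi_1 * relu(y1) + phi_2 * relu(y2) + phi_3 * relu(y3)  # piecewise linear

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].plot(x, linear_combo)
axes[0].set_title("Combined without activation")
axes[1].plot(x, piecewise_combo)
axes[1].set_title("Combined after ReLU")
plt.show()
```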

We can see that the resulting function is no longer linear but piecewise linear. Such piecewise linear functions are an excellent approximation of non-linear functions: as the number of linear regions increases, each region shrinks, giving the impression of a smooth non-linear function even though the underlying function is piecewise linear. There are also genuinely non-linear activation functions, such as the sigmoid, which produce an estimate that is smooth and non-linear, as shown below.

Even though we can estimate a non-linear function with the sigmoid activation, ReLU or one of its variants is the activation of choice in most modern neural networks. Several factors make ReLU attractive, the most influential being that ReLU and its derivative, the Heaviside step function, are extremely cheap to compute compared to sigmoid or tanh. Using ReLU can therefore reduce compute time greatly without compromising performance.
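The cost difference is easy to see in code: ReLU and its derivative boil down to a comparison with zero, while the sigmoid and its derivative need an exponential. A small sketch (the helper names are mine):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)        # just a comparison with zero

def relu_grad(z):
    return np.heaviside(z, 0.0)      # derivative of ReLU: the Heaviside step function

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # requires computing an exponential

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)
```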

Now let us train a shallow neural network with ReLU activation to estimate sin(x), increasing the number of hidden units each time and visualizing the estimated function until the estimation is satisfactory (a training sketch follows the plots below). We then compare this estimate with the ones obtained using the sigmoid and tanh activation functions to see whether there is any compromise in performance.

Shallow neural network with 5 hidden units and ReLU activation
Shallow neural network with 50 hidden units and ReLU activation
Shallow neural network with 500 hidden units and ReLU activation
Shallow neural network with 1500 hidden units and ReLU activation
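The plots above come from the code linked at the end of the post; the PyTorch sketch below shows the kind of training loop involved, assuming a mean-squared-error loss and the Adam optimizer (function and variable names are my own):

```python
import torch
import torch.nn as nn

def fit_shallow_net(num_hidden, activation=nn.ReLU(), steps=5000, lr=1e-3):
    """Fit y = sin(x) with a single-hidden-layer network and return the trained model."""
    x = torch.linspace(-2 * torch.pi, 2 * torch.pi, 1000).unsqueeze(1)
    y = torch.sin(x)

    model = nn.Sequential(
        nn.Linear(1, num_hidden),    # theta_i0 + theta_i1 * x for each hidden unit
        activation,                  # a[.]
        nn.Linear(num_hidden, 1),    # phi_0 + sum_i phi_i * h_i
    )
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()

    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return model

# Increase capacity as in the plots above
models = {d: fit_shallow_net(d) for d in (5, 50, 500, 1500)}
```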

From the above plots, we can see that the piecewise estimates made with ReLU approximate a non-linear function such as sin(x) very well. As the number of hidden units increases, the piecewise linear segments resemble the smooth non-linear curve more and more closely. In fact, with D hidden units it is possible to create as many as D+1 linear regions. During training, the model learns the slopes θᵢ₁ and the offsets θᵢ₀ (together with the output weights ϕᵢ) that determine these linear segments, which is what makes this approach so effective.
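To make the D+1 regions concrete: with a single input, each ReLU unit switches on where θᵢ₀ + θᵢ₁x crosses zero, i.e. at x = -θᵢ₀/θᵢ₁, so the D joints split the x-axis into at most D+1 linear segments. A small sketch, reusing the hypothetical fit_shallow_net from above:

```python
model = fit_shallow_net(num_hidden=5)       # from the sketch above

first   = model[0]                          # nn.Linear(1, 5): the hidden layer
theta_1 = first.weight.detach().squeeze(1)  # input weights theta_i1, shape (5,)
theta_0 = first.bias.detach()               # biases theta_i0, shape (5,)

# Each unit's "joint": the x where its pre-activation crosses zero
# (assumes the learned input weights are nonzero)
joints = -theta_0 / theta_1
print(sorted(joints.tolist()))              # up to D joints -> up to D+1 linear regions
```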

Now let us compare this with the estimates made using the sigmoid and tanh activation functions (a short snippet to reproduce these fits follows the plots below).

Shallow neural network with 1500 hidden units and sigmoid activation
Shallow neural network with 1500 hidden units and tanh activation
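These two fits can be reproduced with the same hypothetical fit_shallow_net sketch from earlier, simply by swapping the activation:

```python
import torch.nn as nn

sigmoid_model = fit_shallow_net(num_hidden=1500, activation=nn.Sigmoid())
tanh_model    = fit_shallow_net(num_hidden=1500, activation=nn.Tanh())
```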

From the above plots, we can observe that even though these estimates are non-linear, they are not as good as the estimates made with ReLU. One way to explain this is that when we train a shallow neural network with D hidden units and ReLU activation, we essentially get D+1 separate linear regions, and these regions have the flexibility to be oriented and positioned as necessary, adapting to the underlying data distribution. Let us understand this with an extremely simple real-world analogy; look at the figure below.

Let us imagine we have to create a circle using only rigid matchsticks. As depicted in the figure, as we increase the number of matchsticks, the resulting shape resembles a circle more and more closely. Each matchstick can be likened to one of the linear regions that the hidden units of the neural network estimate. When we assemble these matchsticks, the collective result represents the final estimate produced by the neural network; it is as simple as that.

This blog draws significant inspiration from the book “Understanding Deep Learning” by Simon J.D. Prince (udlbook.github.io/udlbook/). I’m planning to extend this blog with two more parts to delve further into the subject. The code I used to generate the plots can be found below. If you’ve found this blog insightful, I would greatly appreciate your support by giving it a like.

Understanding-Deep-Learning/Mastering_Deep_Learning_The_Art_of_Approximating_Non_Linearities_with_Pi… (github.com/Raagulbharatwaj/Understanding-Deep-Learning)
