Understanding Convolution
Author(s): Ayo Akinkugbe
Originally published on Towards AI.
To better understand what convolution is, it helps to know why dense neural networks (DNNs) don't work well for images. If you train both a DNN and a CNN (convolutional neural network) on the same image data, you are bound to get higher accuracy and lower loss from the CNN model. Here are some reasons why:
1. High Dimensionality and Computational Complexity
Images typically have a large number of pixels. For example, a 200×200 image has 40,000 pixels, and a dense neural network would need to treat each pixel as an independent input. A fully connected layer with 40,000 inputs would require an enormous number of connections to the next layer, leading to:
- High memory usage: Storing the weights for every pixel connection in large images becomes impractical.
- Increased computational cost: Processing becomes slow and inefficient because dense layers don't take advantage of the spatial structure of images.
In contrast, convolutional layers in CNNs use small filters that share weights across the image, drastically reducing the number of parameters and making computations more efficient.
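To see the scale of this difference, here is a minimal sketch (using PyTorch, with a hidden-layer size and filter count chosen purely for illustration) comparing the parameter count of a fully connected layer and a convolutional layer operating on a 200×200 grayscale image:

```python
import torch.nn as nn

# Dense layer: the 200x200 image is flattened to 40,000 inputs,
# fully connected to a hidden layer of 128 units.
dense = nn.Linear(200 * 200, 128)

# Convolutional layer: 32 filters of size 3x3 share their weights
# across every position in the image.
conv = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3)

count = lambda layer: sum(p.numel() for p in layer.parameters())
print(f"Dense layer parameters: {count(dense):,}")  # 40,000 * 128 + 128 = 5,120,128
print(f"Conv layer parameters:  {count(conv):,}")   # 32 * (3 * 3 * 1) + 32 = 320
```

The dense layer needs millions of weights for a single modest hidden layer, while the convolutional layer gets by with a few hundred because the same small filters are reused everywhere in the image.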
2. Loss of Spatial Hierarchy
DNNs treat all pixels as independent features, ignoring the fact that neighboring pixels in an image are closely related. This means that in a DNN:
- Spatial relationships are not considered: Dense layers don't account for spatial patterns like edges, textures, or shapes that are present in nearby pixels. Images have local features (e.g., eyes, corners of objects) that need to be preserved.
- No translation invariance: Dense networks struggle to recognize a pattern, such as an object in an image, if it appears in a different position. Convolutional layers, on the other hand, apply the same filters across the entire image, making them good at recognizing objects regardless of their location.
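As a quick illustration of this translation invariance, here is a small sketch (using NumPy and SciPy's `correlate2d`, which performs the same sliding operation CNN libraries call convolution; the image and filter values are made up) showing that one filter produces the same peak response no matter where the pattern sits:

```python
import numpy as np
from scipy.signal import correlate2d

# A tiny vertical-edge filter; the same weights are applied at every position.
edge_filter = np.array([[1, -1],
                        [1, -1]])

# Two 6x6 images containing the same bright vertical stripe at different positions.
img_left = np.zeros((6, 6))
img_left[:, 1] = 1.0
img_right = np.zeros((6, 6))
img_right[:, 4] = 1.0

resp_left = correlate2d(img_left, edge_filter, mode="valid")
resp_right = correlate2d(img_right, edge_filter, mode="valid")

# The strongest response has the same value in both cases; only its location moves.
print(resp_left.max(), resp_right.max())  # 2.0 2.0
```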
3. Inefficient Feature Learning
In DNNs, each layer needs to learn global patterns from scratch. This makes it difficult to detect complex hierarchical features in images, such as edges in earlier layers and entire objects in later layers.
In contrast, CNNs can learn hierarchical features. Early layers in a CNN focus on low-level features (like edges and textures), while deeper layers learn more abstract concepts (like parts of objects or even whole objects). Dense layers do not efficiently capture this hierarchical structure, leading to poor performance on image data.
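To picture what this hierarchy looks like in practice, here is a rough PyTorch sketch of a small CNN (the layer sizes and depths are illustrative choices, not something prescribed here); each convolutional stage builds on the feature maps produced by the one before it:

```python
import torch.nn as nn

# A small illustrative CNN for 3-channel (RGB) images.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # early layer: edges, color blobs
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # middle layer: textures, simple motifs
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),  # deeper layer: object parts
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                      # summarize each feature map
    nn.Flatten(),
    nn.Linear(64, 10),                            # classifier over 10 hypothetical classes
)
```

Each pooling step also widens the region of the original image that later filters effectively respond to, which connects to the idea of the receptive field discussed below.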
4. Overfitting
With a large number of parameters in fully connected layers, a dense network is more prone to overfitting, especially with smaller datasets. Images usually contain a lot of redundant information, and fully connected networks have no mechanism to reduce this redundancy. Convolutional layers reduce overfitting through the concept of weight sharing (the same filter slides over different parts of the image). This greatly reduces the number of parameters, leading to more generalizable models with less risk of overfitting.
How Then Does Convolution Work?
Imagine sweeping a magnifying glass across an image to detect specific patterns (like lines or shapes). Convolution in CNNs can be thought of as a way to capture patterns in data by sliding a small magnifying glass (filter) across an image or other data to focus on specific local features. Each filter looks for a different kind of pattern, and the CNN uses many of them to understand the image, layer by layer, from simple features to complex ones.
- Filter as a Pattern Detector: Imagine you have an image of a cat. A filter (or kernel) in a CNN is a small matrix (e.g., 3×3 or 5×5) that scans across this image. Each filter looks for a specific feature like edges, textures, or shapes. For example, one filter might detect horizontal lines, another might detect vertical lines, and yet another could find corners.
- Sliding Across the Image: The filter moves over the image (convolves) in small steps. At each step, it performs a dot product between the values in the filter and the corresponding region of the image. This helps the CNN extract local information about the image (such as edges or texture patterns) without looking at the entire image at once.
- Feature Map: The result of this sliding process is a new matrix called a feature map. The values in the feature map represent how strongly the feature (pattern) the filter is looking for is present in different parts of the image. For example, if the filter is detecting vertical edges, the feature map will have high values where vertical edges appear in the image (a worked sketch follows this list).
- Multiple Filters, Rich Features: A CNN uses many different filters to capture various features. Early layers typically learn simple features like edges, while deeper layers learn more complex patterns (e.g., eyes, faces, or even abstract shapes).
- Receptive Field: The filter's size limits how much of the image it "sees" at once, which is called its receptive field. As you go deeper in the network, the filters "see" larger parts of the image, which allows the network to detect higher-level features, like objects or parts of objects.
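To make the sliding dot product and the resulting feature map concrete, here is a minimal from-scratch sketch in NumPy (the image and filter values are invented for illustration; a real CNN learns its filter values during training rather than using hand-designed ones):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image, taking a dot product at each position.
    (What deep-learning libraries call 'convolution' is this cross-correlation.)"""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(patch * kernel)  # dot product of patch and filter
    return feature_map

# A 6x6 image: dark on the left half, bright on the right half -> one vertical edge.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# A 3x3 vertical-edge filter: responds strongly to dark-to-bright transitions.
vertical_edge = np.array([[-1, 0, 1],
                          [-1, 0, 1],
                          [-1, 0, 1]])

print(convolve2d(image, vertical_edge))
# The feature map is largest in the columns where the edge sits and zero in flat
# regions -- exactly the "how strongly is the pattern here?" signal described above.
```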
Conclusion
Convolution improves image prediction because its filters drastically reduce the number of parameters to train while preserving the spatial hierarchy of the image. These properties are what allow CNNs to deliver higher accuracy and lower loss than dense networks on image data.