How Computers Visualise Images: A Beginner’s Guide to CNNs

Last Updated on October 18, 2025 by Editorial Team

Author(s): Aditya Gupta

Originally published on Towards AI.

Think about how you scroll through your phone’s photo gallery. You don’t even have to think before recognizing your friends, your pet, or the places you’ve been. Your brain instantly knows what’s in each picture.

But a computer doesn’t have that ability naturally. To it, every image is just a grid of numbers. Each pixel has values that represent brightness and color. So how does a computer move from those numbers to actually recognizing what’s inside an image? That’s where Convolutional Neural Networks, or CNNs, come into play.

What Exactly Are CNNs?

Think about how a child learns to draw. When someone first starts learning art, you don’t hand them a blank canvas and ask them to paint an entire landscape. You start small. You teach them how to draw a simple line, then how to make a few different kinds of lines — maybe curved, maybe straight. Then you show how lines form shapes like triangles, squares, or circles. Once they can draw these shapes, you move to simple objects like a hut, a tree, or clouds. And when these small objects come together, they finally form a complete scenery.


A Convolutional Neural Network, or CNN, works in almost the same way. It doesn’t recognize an image all at once. It starts by identifying small, simple features such as edges or lines. Then it moves on to find patterns that combine those edges into shapes. Eventually, it learns to detect more complex features, like eyes, wheels, or windows, until it can recognize the entire object, maybe a face, a car, or a house.

In short, CNNs learn visual understanding step by step, layer by layer, just like how we build up our artistic understanding from basic lines to full images.

How Computers Interpret an Image

When we look at a photo, our brain instantly understands what it is. We see a tree, a car, or a person without thinking about tiny details. A computer, however, does not see objects naturally. To a computer, an image is just a grid of numbers, which we call a matrix.

1. Pixels: The Building Blocks of Images

An image is made of tiny dots called pixels. Each pixel stores information about the color and brightness at that spot.

  • In a black-and-white (grayscale) image, each pixel has a value that shows how bright it is.
  • 0 means black
  • 255 means white
  • Numbers in between represent shades of gray

For example, a small 3×3 grayscale image could look like this:

[[255, 128, 0],
[64, 200, 150],
[0, 50, 255]]

Here, 255 is the brightest pixel, and 0 is completely dark. Each number tells the computer how much light is at that pixel.

More pixels mean higher resolution. A 1000×1000 image has a million pixels, which makes it very detailed. A 100×100 image has only 10,000 pixels, so it will look blurrier if you try to zoom in.
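
To make this concrete, here is a minimal sketch (using NumPy, the same library we rely on in the coding section later) of that 3×3 grayscale image stored as a matrix of pixel values. The numbers are the ones from the example above:

import numpy as np

# A 3x3 grayscale image: each entry is a brightness value from 0 (black) to 255 (white)
gray = np.array([[255, 128,   0],
                 [ 64, 200, 150],
                 [  0,  50, 255]], dtype=np.uint8)

print(gray.shape)               # (3, 3) -> 3 rows, 3 columns of pixels
print(gray.max(), gray.min())   # 255 0 -> the brightest and darkest pixels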

2. Colored Images and RGB Channels

Most images we see are in color. Computers represent colors using three channels: Red, Green, and Blue (RGB). Each channel is its own matrix, and each pixel in that channel stores a number from 0 to 255 indicating how much of that color is present.

For example, a 2×2 colored image might have:

Red channel:
[[255, 0],
 [128, 64]]

Green channel:
[[0, 255],
 [128, 64]]

Blue channel:
[[0, 0],
 [255, 128]]

  • Red channel matrix shows how much red is in each pixel
  • Green channel matrix shows how much green is in each pixel
  • Blue channel matrix shows how much blue is in each pixel

When combined, these three matrices reconstruct the full color of each pixel. The computer doesn’t see a dog or a car immediately; it only sees numbers in three layers.
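
As a quick illustration (a NumPy sketch using the example values above, not a full program), the three channel matrices can be stacked into a single Height × Width × 3 array, which is exactly how a color image is stored in memory:

import numpy as np

# The three 2x2 channel matrices from the example above
red   = np.array([[255,   0], [128,  64]])
green = np.array([[  0, 255], [128,  64]])
blue  = np.array([[  0,   0], [255, 128]])

# Stack them along a new last axis to get one color image of shape (2, 2, 3)
rgb = np.stack([red, green, blue], axis=-1)
print(rgb.shape)   # (2, 2, 3)
print(rgb[0, 0])   # [255 0 0] -> the top-left pixel is pure red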

3. Image Dimensions

Images have three main dimensions:

  1. Height — number of pixels vertically
  2. Width — number of pixels horizontally
  3. Depth — number of channels (1 for grayscale, 3 for RGB)

So, a 64×64 RGB image has a shape of 64 x 64 x 3. This means the image has 64 rows, 64 columns, and three color layers.

More pixels = clearer image.

  • A 32×32 image is very coarse.
  • A 1024×1024 image is very detailed.

This is why image resolution is so important in computer vision.

4. How Computers Interpret Pixels as Numbers

Each pixel value tells the computer how bright or intense a color is. Think of it like a heat map:

  • 0 means no light or zero intensity
  • 255 means full intensity
  • Intermediate numbers represent varying brightness or color intensity

Computers use these numbers in matrices to perform calculations. They don’t “see” objects yet. They can only process patterns of numbers.

5. From Pixels to Patterns

A single pixel doesn’t mean much. The real information comes when you combine groups of pixels:

  • Neighboring pixels form edges
  • Edges combine into shapes
  • Shapes combine into features
  • Features combine into objects

Just like a student learns to draw by starting with lines and shapes before creating a full painting, a CNN starts by looking at small patterns in these matrices and gradually builds up an understanding of objects in the image.

Why Do We Need CNNs?

Now that we know that an image is just a set of numbers in matrices, you might wonder: why not just use traditional machine learning algorithms or regular neural networks to recognize images? After all, a computer can process numbers, right?

It turns out, images are very different from simple tabular data. Each image might have thousands or even millions of pixels. Let’s explore why this is a problem for traditional approaches.

1. Why Machine Learning Algorithms Struggle

Traditional machine learning algorithms like SVMs, decision trees, or logistic regression expect input as flat feature vectors. For an image, that would mean converting the entire 2D pixel grid into a single long list of numbers.

For example:

  • A 64 x 64 grayscale image has 4096 pixels → 4096 input features.
  • A 64 x 64 RGB image has 64 x 64 x 3 = 12,288 features.

That is already a huge number of inputs for classical algorithms.
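
To see what flattening looks like in practice, here is a small sketch (using NumPy and a random stand-in image) of a 64×64×3 image turned into the single long feature vector a classical algorithm would receive:

import numpy as np

# A dummy 64x64 RGB image (random values stand in for real pixels)
img = np.random.randint(0, 256, size=(64, 64, 3))

# Classical ML algorithms expect one flat feature vector per sample
features = img.reshape(-1)
print(features.shape)   # (12288,) -> 64 * 64 * 3 input features, spatial layout lost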

Problems with this approach:

  1. Loss of spatial information: Flattening the image into a vector destroys the structure of the image; the algorithm can no longer tell which pixels are next to each other.
  2. High dimensionality: With thousands of features, machine learning algorithms require much more data to generalize well and are prone to overfitting.
  3. Feature engineering required: Classical ML algorithms cannot automatically detect edges, shapes, or textures. You would need to manually extract these features, which is time-consuming and limited.

So traditional ML cannot effectively “see” patterns in images the way a human or a CNN can.

2. Why Fully Connected Neural Networks (ANNs) Struggle

Artificial neural networks (ANNs) can theoretically learn patterns from data. However, if we try to feed an image directly into a fully connected network, we run into huge computational problems.

Consider an RGB image of 64 x 64 x 3 = 12,288 values as input. Suppose the first hidden layer has 1000 neurons.

  • The number of parameters (weights) connecting the input to the first layer is:

Number of weights = 12,288 x 1000 = 12,288,000

That is over 12 million parameters just for the first layer. Training such a network would require:

  • Huge amounts of memory
  • Massive computational power
  • Very large datasets to avoid overfitting

For larger images like 224 x 224 x 3 (standard for ImageNet), the number of parameters explodes into hundreds of millions. Clearly, fully connected ANNs are not practical for images.
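
If you want to check these numbers yourself, here is a rough sketch (assuming TensorFlow/Keras, the same library used in the coding section below) that builds this fully connected network and lets Keras count the parameters:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense

# A fully connected network on a 64x64x3 input image
model = Sequential([
    Flatten(input_shape=(64, 64, 3)),   # 64 * 64 * 3 = 12,288 inputs
    Dense(1000, activation='relu'),     # 12,288 * 1000 weights + 1,000 biases = 12,289,000 parameters
    Dense(10, activation='softmax')
])
model.summary()   # the first Dense layer alone has over 12 million parameters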

3. How CNNs Solve These Problems

Convolutional Neural Networks were designed specifically for image processing. They solve the issues of ANNs and classical ML in three main ways:

  1. Local connectivity: Instead of connecting every pixel to every neuron, CNNs focus on small local regions (like 3×3 or 5×5 patches). This dramatically reduces the number of parameters.
  2. Weight sharing: The same filter (or kernel) is applied across the entire image, so the network learns the same feature everywhere. This reduces computation and helps detect patterns regardless of location.
  3. Hierarchical feature learning: CNNs learn edges → shapes → objects step by step, so they can generalize better without manually extracting features.

In short, CNNs combine efficiency and effectiveness for image tasks: they reduce the number of parameters, preserve spatial information, and automatically detect relevant patterns.
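
For comparison, here is a similar Keras sketch of a single convolutional layer with 32 filters of size 3×3 applied to the same 64×64×3 input. Because the same small filters are shared across every position, it needs only 3×3×3×32 + 32 = 896 parameters:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D

model = Sequential([
    # 32 filters, each of shape 3x3x3: 3*3*3*32 weights + 32 biases = 896 parameters
    Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3))
])
model.summary()   # compare with the ~12.3 million parameters of the dense layer above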

How CNNs Work Step by Step

Convolutional Neural Networks process images in a layer-by-layer hierarchy. Each step extracts patterns and gradually builds up an understanding of the objects in the image. Let’s go through every step in detail.

Step 1: Input Image

  • The CNN takes an image as input, represented as a 3D matrix: Height x Width x Channels.
  • Example: a 64 x 64 RGB image → shape 64 x 64 x 3.
  • At this stage, the computer still only sees numbers, not objects.
Source: https://community.element14.com/members-area/personalblogs/b/frank-milburn-s-blog/posts/a-beginning-journey-in-tensorflow-5-color-images

Step 2: Convolution

Goal: Extract local features like edges, corners, and textures.

Filter (Kernel):

  • A small matrix (e.g., 3×3 or 5×5) that “slides” over the image.
  • Each filter detects a specific feature, like vertical edges, horizontal edges, or textures.

Operation:

  • For each small patch in the image, multiply each pixel by the corresponding value in the filter and sum them.
  • This produces a single number in the feature map.
Source: https://www.semiconductorforu.com/artificial-intelligence-impacts-automotive-design/a-cnn-breaks-an-image-into-feature-maps/

Mathematical formula:

Feature_map(i, j) = Σ_m Σ_n Input(i + m, j + n) × Filter(m, n)

  • i, j → position in the output feature map
  • m, n → position in the filter

Example:

  • 5×5 input matrix
  • 3×3 filter
  • Output feature map size = 3×3 (if stride = 1, no padding)

Stride and Padding:

  • Stride = how many pixels the filter moves at each step. Stride 1 → move 1 pixel, stride 2 → move 2 pixels.
  • Padding = adding extra pixels (usually 0) around the input to control output size.

Output size formula:

Output_size = ((Input_size - Filter_size + 2*Padding) / Stride) + 1
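
Here is a minimal NumPy sketch of this sliding-window operation: a 3×3 filter moving over a 5×5 input with stride 1 and no padding, producing the 3×3 feature map the output size formula predicts ((5 - 3 + 2*0)/1 + 1 = 3). The edge-detecting filter values are just an illustrative choice:

import numpy as np

# 5x5 input and a 3x3 filter that responds to vertical edges
image = np.random.randint(0, 256, size=(5, 5)).astype(float)
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])

out_size = (image.shape[0] - kernel.shape[0]) // 1 + 1   # ((5 - 3 + 2*0) / 1) + 1 = 3
feature_map = np.zeros((out_size, out_size))

for i in range(out_size):
    for j in range(out_size):
        patch = image[i:i+3, j:j+3]                 # local 3x3 region of the input
        feature_map[i, j] = np.sum(patch * kernel)  # multiply element-wise and sum

print(feature_map.shape)   # (3, 3)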

Step 3: Activation

  • After convolution, we apply an activation function to introduce non-linearity.
  • Most common: ReLU (Rectified Linear Unit) → replaces negative values with 0.

Why ReLU:

  • Helps the network learn complex patterns
  • Keeps computation simple
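
In code, ReLU is a one-liner (a NumPy sketch with illustrative values):

import numpy as np

x = np.array([-3.0, -0.5, 0.0, 1.2, 4.0])
relu = np.maximum(0, x)   # negative values become 0, positive values pass through
print(relu)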

Step 4: Pooling (Downsampling)

  • Goal: Reduce the spatial size of feature maps while keeping the important information.
  • Most common: Max Pooling → takes the maximum value from a small patch (e.g., 2×2).

Example:

  • 4×4 feature map → 2×2 after 2×2 max pooling
  • Reduces computation for next layers
  • Provides translation invariance (small shifts in the image won’t change detection)
Source: https://towardsdatascience.com/image-classification-with-convolutional-neural-networks-12a7b4fb4c91/
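
Here is a small sketch of 2×2 max pooling with stride 2 on a 4×4 feature map (NumPy, illustrative values):

import numpy as np

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 9, 1],
                 [3, 4, 6, 8]])

pooled = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        # Take the maximum of each non-overlapping 2x2 patch
        pooled[i, j] = fmap[2*i:2*i+2, 2*j:2*j+2].max()

print(pooled)   # [[6. 4.]
                #  [7. 9.]]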

Step 5: Flattening

  • After several convolution + pooling layers, we flatten the 3D feature maps into a 1D vector.
  • This prepares the features for a fully connected layer, which acts like a traditional neural network.

Example:

  • 8x8x16 feature map → 1024-length vector (8 x 8 x 16)
Source: Research Gate
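
The flattening step itself is just a reshape, as this short NumPy sketch shows:

import numpy as np

feature_maps = np.random.rand(8, 8, 16)   # 16 feature maps of size 8x8
flat = feature_maps.reshape(-1)           # one long vector for the dense layer
print(flat.shape)                         # (1024,) -> 8 * 8 * 16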

Step 6: Fully Connected (Dense) Layer

  • Each neuron in this layer is connected to all the outputs of the previous layer.
  • Learns high-level features and combines them to predict the final class.

Example:

  • Input vector (flattened) → 128 neurons → output layer (softmax for classification)

Mathematical formula for one neuron:

output = Activation( Σ_i (w_i × x_i) + b )

  • x_i → input feature
  • w_i → weight
  • b → bias
  • Activation → ReLU, sigmoid, or softmax
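
A single dense neuron can be sketched in a few lines of NumPy (the inputs, weights, and bias below are illustrative values, not learned ones):

import numpy as np

x = np.array([0.5, 0.8, 0.2])   # input features from the previous layer
w = np.array([0.4, 0.6, 0.9])   # learned weights
b = 0.1                         # learned bias

z = np.dot(w, x) + b            # weighted sum plus bias
output = max(0.0, z)            # ReLU activation
print(output)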

Step 7: Output Layer

  • For classification, usually softmax is applied to get probabilities for each class.
  • The class with the highest probability is the network’s prediction.

Step 8: Backpropagation and Learning

  • CNN learns by adjusting weights in filters and fully connected layers based on the error between predicted and actual labels.
  • Uses gradient descent and backpropagation to minimize loss.

Step 9: Putting It All Together

  • Input Image → Convolution + ReLU → Pooling → Convolution + ReLU → Pooling → Flatten → Fully Connected → Output
  • Each step reduces computation, detects features, and builds understanding from edges → shapes → features → objects.
Source: https://mediacy.com/blog/ai-essentials-cnns-microscopy/

Coding a CNN of Our Own!

Okay, now let’s create a CNN of our own. Sounds fun, right?

Step 1: Import Libraries

import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing import image

In this step, we import the required Python libraries:

  • numpy for handling arrays and numerical operations.
  • matplotlib.pyplot for plotting images.
  • tensorflow.keras.datasets.cifar10 to load the CIFAR-10 dataset.
  • Sequential for building our CNN model layer by layer.
  • Conv2D, MaxPooling2D, Flatten, and Dense are layers of our CNN.
  • to_categorical converts labels into one-hot encoded vectors.
  • image helps in loading custom images for prediction.

Step 2: Load and Explore CIFAR-10 Dataset

# Load CIFAR-10 dataset
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
# Normalize pixel values (0-255 -> 0-1)
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255
# Convert labels to one-hot encoding
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)
# Check dataset shapes
print("Training data shape:", x_train.shape)
print("Test data shape:", x_test.shape)
  • CIFAR-10 has 60,000 color images of size 32×32 in 10 classes.
  • x_train and x_test contain the image data, while y_train and y_test contain labels.
  • We normalize pixel values to 0–1 for faster and better training.
  • Labels are converted to one-hot encoding so the neural network can use them for classification.
  • Printing shapes helps us confirm the dataset size.

Step 3: Build the CNN Model

model = Sequential()
# First convolutional layer
model.add(Conv2D(32, (3,3), activation='relu', input_shape=(32,32,3)))
# First max pooling layer
model.add(MaxPooling2D(pool_size=(2,2)))
# Second convolutional layer
model.add(Conv2D(64, (3,3), activation='relu'))
# Second max pooling layer
model.add(MaxPooling2D(pool_size=(2,2)))
# Flatten layer to convert 2D feature maps into 1D
model.add(Flatten())
# Fully connected layer
model.add(Dense(64, activation='relu'))
# Output layer for 10 classes
model.add(Dense(10, activation='softmax'))
  • Conv2D layers detect patterns like edges, shapes, and textures. The first layer has 32 filters, the second has 64.
  • MaxPooling2D reduces the spatial dimensions and helps the network focus on important features.
  • Flatten converts 2D feature maps into a 1D vector for the fully connected layer.
  • Dense layers act like a normal neural network. The last Dense layer uses softmax for class probabilities.

Step 4: Compile the Model

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
  • Optimizer: adam helps the network learn efficiently.
  • Loss function: categorical_crossentropy is used for multi-class classification.
  • Metrics: We track accuracy to see how well the model is performing.

Step 5: Train the Model

history = model.fit(x_train, y_train,
                    batch_size=64,
                    epochs=10,
                    validation_data=(x_test, y_test))
  • batch_size: Number of images processed at a time.
  • epochs: Number of times the entire dataset passes through the network.
  • validation_data: We test the model on unseen data after each epoch to monitor performance.

Step 6: Evaluate the Model

test_loss, test_accuracy = model.evaluate(x_test, y_test)
print("Test Accuracy:", test_accuracy)
  • evaluate calculates the final accuracy and loss on the test set.
  • This gives an idea of how well the model can classify new images it hasn’t seen before.

Step 7: Make Predictions on a Custom Image

# Load custom image
img_path = 'your_image.png' # replace with your image path
img = image.load_img(img_path, target_size=(32,32))
img_array = image.img_to_array(img) / 255.0
img_array = np.expand_dims(img_array, axis=0) # Make it batch size 1

# Predict
prediction = model.predict(img_array)
predicted_class = np.argmax(prediction)
class_labels = ['airplane','automobile','bird','cat','deer','dog','frog','horse','ship','truck']
print("Predicted Class:", class_labels[predicted_class])
print("Class Probabilities:", prediction)
  • Load any image and resize it to 32×32 (CIFAR-10 input size).
  • Normalize pixel values (0–1) and add batch dimension.
  • model.predict returns probabilities for all classes.
  • np.argmax selects the class with the highest probability.
  • class_labels maps the predicted index to a human-readable label.

Applications of CNNs in the Real World

Convolutional Neural Networks are not just theoretical. They are used in many real-world scenarios where computers need to understand images or visual data. Here are some of the most common applications:

1. Image Classification

  • CNNs can identify what an image contains.
  • Example: CIFAR-10 or MNIST datasets where the network classifies objects like cats, dogs, or cars.
  • Real-world use: Sorting photos in your phone gallery automatically, or detecting objects in security cameras.

2. Object Detection

  • Going a step further than classification, CNNs can locate where objects are in an image.
  • Example: Detecting pedestrians, cars, or traffic signs in self-driving car systems.
  • Real-world use: Autonomous vehicles, surveillance systems, and robotics.

3. Face Recognition

  • CNNs can identify faces, match them to a database, or detect facial features.
  • Real-world use: Unlocking your phone with your face, social media photo tagging, airport security.

4. Medical Imaging

  • CNNs can analyze X-rays, MRIs, or skin lesion images to detect diseases.
  • Example: Detecting cancerous cells or anomalies in scans.
  • Real-world use: Healthcare applications assisting doctors in early diagnosis.

5. Image Segmentation

  • CNNs can classify each pixel of an image to separate objects from the background.
  • Example: Segmenting roads, cars, and pedestrians in an image.
  • Real-world use: Self-driving cars, satellite image analysis, and urban planning.

6. Style Transfer and Image Generation

  • CNNs can apply artistic styles to images or generate realistic images.
  • Example: Transforming a photo into a painting style.
  • Real-world use: Photo editing apps, content creation, and games.

7. Video Analysis

  • CNNs can analyze video frames to detect motion, objects, or actions.
  • Real-world use: Security surveillance, sports analytics, and activity recognition.

CNNs are extremely powerful whenever visual data is involved. From classifying tiny images to driving autonomous cars, medical diagnosis, and face recognition, CNNs are behind many technologies we use every day.

Common CNN Variations and Why They Are Used

Over time, researchers have designed different CNN architectures to improve performance, reduce computation, or handle very deep networks. Here are some of the most common variations:

1. LeNet-5

  • One of the first CNN architectures, developed in the 1990s for handwritten digit recognition (MNIST).
  • Structure: Convolution → Pooling → Fully Connected → Output
  • Key idea: Introduced convolution + pooling layers for feature extraction, instead of fully connected layers alone.
  • Real-world impact: Paved the way for modern CNNs, though not used directly in large-scale tasks today.

2. AlexNet

  • Won the ImageNet 2012 competition, making CNNs popular again.
  • Structure: Deeper than LeNet, uses ReLU activation, dropout for regularization, and overlapping max pooling.
  • Key idea: ReLU accelerates training, dropout prevents overfitting.
  • Real-world impact: Can classify high-resolution images (like 224×224) efficiently.

3. VGG (VGG16, VGG19)

  • Very deep networks with 16 or 19 layers.
  • Structure: Repeated blocks of 3×3 convolution layers followed by pooling.
  • Key idea: Simplicity — stacking small filters multiple times gives better features.
  • Pros: Easy to understand, performs well on large datasets.
  • Cons: Very large number of parameters → high memory and computation requirements.

4. ResNet (Residual Networks)

  • Introduced skip connections, which allow the network to bypass certain layers.
  • Key idea: Helps train very deep networks (50, 101, 152 layers) without the “vanishing gradient problem.”
  • Real-world impact: State-of-the-art performance in image classification and object detection.

5. Inception (GoogLeNet)

  • Uses multiple filter sizes at the same layer (1×1, 3×3, 5×5) and concatenates results.
  • Key idea: Lets the network capture features at multiple scales efficiently.
  • Pros: Reduces computation while keeping the network deep and powerful.

Why These Variations Exist

  1. Deeper networks = better feature extraction, but harder to train → ResNet solves this with skip connections.
  2. Efficiency vs performance tradeoff → Inception reduces computation while keeping accuracy high.
  3. Prevent overfitting → Dropout layers (AlexNet, VGG) and batch normalization help generalize better.
  4. Adapt to different tasks → Some architectures are better for small images, others for high-resolution or multi-scale images.

Conclusion

Convolutional Neural Networks have transformed how computers understand the visual world. From recognizing handwritten digits to classifying images, detecting objects in real-time, or assisting doctors in diagnosis, CNNs are at the heart of modern computer vision.

Like a student learning to draw a landscape, CNNs start with simple lines, then shapes, then objects, and finally the full picture. This layer-by-layer learning turns numbers in a matrix into meaningful insights.

We also saw why traditional machine learning or fully connected networks struggle with images and how CNNs solve these problems through local connectivity, weight sharing, and hierarchical feature learning. The evolution from LeNet to AlexNet, VGG, ResNet, and Inception shows how CNNs handle deeper networks, larger datasets, and complex tasks efficiently.

Looking ahead, the future of CNNs is bright. Faster, more efficient architectures, multi-modal networks, and applications in autonomous vehicles, robotics, healthcare, and art will continue to expand the possibilities of computer vision.

CNNs have truly made it possible for computers to see and interpret the world.

To visualize Convolution: [Video]

To learn how to implement CNN: [Video]

To read about Neural Networks: [Article]

“If you want the rainbow, you gotta put up with the rain.” — Dolly Parton

Published via Towards AI

