Understanding the Layers of Convolutional Neural Networks (CNNs)

Surendra Allam
8 min read · Jun 10, 2024


A post that helps you understand the most commonly used layers of Convolutional Neural Networks (CNNs)

[Image: The architecture of a simple CNN model]

Hey, hello! When I was a beginner in AI/ML/DL, I searched a lot to understand the layers of neural networks, which are the building blocks of these models. To my surprise, I couldn’t find a single reference with a complete explanation, and that’s when I decided to write this post explaining the important layers of CNNs, hoping to help beginners in AI/ML/DL build an intuition for these layers. Without wasting much time, let’s dive in.

Here we will try to understand the most commonly used layers in CNNs. Broadly, the layers in a CNN can be categorized as below, depending on how they are used in the model architecture.

Layers of a Convolutional Neural Network:

  1. Input Layer
  2. Convolution Layer
  3. Pooling Layer
  4. Flatten layer
  5. Fully Connected Layer or Dense Layer
  6. Activation Layer
  7. Normalization layer
  8. Regularization layer
  9. Output Layer

Input Layer:

The input layer is where we provide input to the CNN model. In general, the model is given an image or a series of images, and we also specify the dimensions (size) of the input. For example, if we provide an RGB image of size 224x224, the dimensions of the input layer are 224x224x3, where 3 is the number of channels of an RGB image.
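To make the shape concrete, here is a minimal sketch. I’m assuming PyTorch purely for illustration (the post itself is framework-agnostic), and the batch size of 8 is an arbitrary choice:

```python
import torch

# A mini-batch of 8 RGB images of size 224x224.
# PyTorch uses channels-first layout: (batch, channels, height, width).
images = torch.randn(8, 3, 224, 224)
print(images.shape)  # torch.Size([8, 3, 224, 224])
```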

Convolution Layer:

These are the primary, foundational layers of a CNN, responsible for extracting features from the images (or other input data) using convolutional filters (kernels). The layer applies a set of learnable filters, known as kernels, to the input image. The filters/kernels are small matrices, usually of shape 2×2, 3×3, or 5×5. Each filter slides over the input image and computes the dot product between the kernel weights and the corresponding input image patch. The output of this layer is referred to as feature maps. For example, for a 32x32 input, if we use a total of 12 filters in this layer (with padding that preserves the spatial size), we get an output volume of dimension 32 x 32 x 12.
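As a rough sketch of that example (again assuming PyTorch; the 3x3 kernel and padding choice are just for illustration), a convolution layer with 12 filters could look like this:

```python
import torch
import torch.nn as nn

# 12 learnable 3x3 filters applied to a 3-channel (RGB) input.
# padding=1 keeps the 32x32 spatial size, so the output volume is 12 x 32 x 32.
conv = nn.Conv2d(in_channels=3, out_channels=12, kernel_size=3, padding=1)

x = torch.randn(1, 3, 32, 32)    # one 32x32 RGB image
feature_maps = conv(x)
print(feature_maps.shape)        # torch.Size([1, 12, 32, 32])
```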

Pooling Layer:

The pooling layer reduces spatial dimensions by down-sampling the feature maps. Its main variants are described below, followed by a short sketch.

  • Spatial Reduction: Pooling layers shrink the spatial dimensions (width and height) of feature maps.
  • Feature Consolidation: They consolidate the features learned by CNNs.
  • Parameter Reduction: By reducing spatial dimensions, pooling minimizes the number of parameters and computations in the network.

Max Pooling:

  • Selects the maximum element from each pooling region.
  • Highlights the most prominent features.
  • Commonly used in CNN architectures.

Average Pooling:

  • Computes the average of elements within each region.
  • Provides a smoother representation.

Global Pooling:

  • Reduces each channel to a single value, producing a 1 x 1 x n_c feature map (where n_c is the number of channels).
  • Equivalent to using a pooling window that spans the entire feature map.
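A minimal sketch of the three pooling variants, assuming PyTorch and an arbitrary 12-channel, 32x32 input:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 12, 32, 32)           # feature maps from a previous layer

max_pool = nn.MaxPool2d(kernel_size=2)   # keeps the strongest activation per 2x2 region
avg_pool = nn.AvgPool2d(kernel_size=2)   # averages each 2x2 region
global_pool = nn.AdaptiveAvgPool2d(1)    # global average pooling -> one value per channel

print(max_pool(x).shape)     # torch.Size([1, 12, 16, 16])
print(avg_pool(x).shape)     # torch.Size([1, 12, 16, 16])
print(global_pool(x).shape)  # torch.Size([1, 12, 1, 1])
```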

Flatten layer:

It serves a critical role in reshaping the output from preceding layers (such as convolutional or pooling layers) into a one-dimensional vector. Here are the key points:

Purpose:

  • The flatten layer converts the multi-dimensional output from the previous layer into a one-dimensional array.
  • This transformation is necessary before feeding the data into subsequent fully connected layers (also known as dense layers) for further processing.

Function:

  • When applied to a multi-dimensional tensor output (e.g., from a convolutional or pooling layer), the flatten layer collapses all dimensions except the batch dimension.
  • For example, if the output tensor has dimensions (batch_size, height, width, channels), the flatten layer reshapes it to (batch_size, height * width * channels).

Position in the Network:

  • The flatten layer typically appears after the convolutional and pooling layers in CNN architectures.
  • It acts as a bridge between the spatial feature extraction layers (convolutional/pooling) and the fully connected layers that perform classification or regression tasks.

Role in Parameter Reduction:

  • Flattening itself has no learnable parameters; it simply reshapes the tensor without discarding information.
  • However, the length of the flattened vector determines the number of weights in the first fully connected layer, which is why pooling is usually applied beforehand to keep this vector, and hence the parameter count, manageable.

Example:

  • Suppose we have a CNN architecture that processes images with convolutional and pooling layers.
  • After these layers, the output might be a tensor with dimensions (batch_size, height, width, channels).
  • The flatten layer reshapes this tensor into a one-dimensional array of length (height * width * channels) before passing it to fully connected layers for classification.

In summary, the flatten layer ensures that the spatial information extracted by earlier layers is properly prepared for subsequent fully connected layers. It’s a crucial step in the transition from feature extraction to classification or regression tasks.
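A short sketch of the flatten step, assuming PyTorch and an illustrative (batch_size, channels, height, width) tensor:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 12, 16, 16)   # (batch_size, channels, height, width)

flatten = nn.Flatten()           # collapses every dimension except the batch dimension
flat = flatten(x)

print(flat.shape)                # torch.Size([8, 3072]) because 12 * 16 * 16 = 3072
```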

Fully Connected layer:

  • A fully connected layer, also known as a dense layer, is a type of neural network layer where every neuron in the layer is connected to every neuron in the previous and subsequent layers.
  • The term “fully connected” refers to the dense interconnectivity between neurons within this layer.

Key Components of a Fully Connected Layer:

  • Neurons: These are the basic units within the fully connected layer. Each neuron receives inputs from all neurons in the previous layer and sends outputs to all neurons in the subsequent layer.
  • Weights and Biases: The fully connected layer consists of learnable weights and biases associated with each neuron. These parameters are adjusted during training to optimize the network’s performance.
  • Activation Function: Typically, an activation function (such as ReLU, sigmoid, or tanh) is applied to the weighted sum of inputs to introduce non-linearity.

Role of Fully Connected Layers in CNNs:

  • Fully connected layers are usually placed before the output layer in a CNN architecture.
  • They serve as a bridge between the convolutional and pooling layers (which extract features) and the final classification or regression layer.
  • These layers learn complex combinations of features from the extracted representations and make predictions based on them.

Example Usage:

  • In an image classification CNN, the fully connected layers take flattened feature maps (output from previous layers) and learn to associate them with specific classes (e.g., “cat,” “dog,” etc.).

Remember, fully connected layers play a crucial role in capturing high-level abstractions and enabling end-to-end learning in CNNs.
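A minimal sketch of a fully connected block, assuming PyTorch; the 3072 input size, 128 hidden neurons, and 10 classes are illustrative choices, not fixed values:

```python
import torch
import torch.nn as nn

# A dense stack mapping 3072 flattened features to 10 class scores.
fc = nn.Sequential(
    nn.Linear(3072, 128),   # every input feature connects to every one of the 128 neurons
    nn.ReLU(),              # non-linearity applied to the weighted sums
    nn.Linear(128, 10),     # 10 output neurons, e.g. one per class
)

flat = torch.randn(8, 3072)
logits = fc(flat)
print(logits.shape)          # torch.Size([8, 10])
```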

Activation Layer:

It introduces non-linearity into the network, allowing it to learn complex mappings. The most commonly used activation functions are described below, followed by a short sketch.

ReLU (Rectified Linear Unit):

It replaces negative input values with zeros, introducing nonlinearity.

Formula: ReLU(x) = max(0, x)

Sigmoid:

It maps inputs to the range (0, 1).

Formula: sigmoid(x) = 1 / (1 + e^(-x))

Softmax Activation:

It converts a vector of real numbers into a probability distribution. It exponentiates each input value and normalizes them to sum up to 1. This is useful for multi-class classification tasks, where the network predicts the probabilities of different classes.

Formula: softmax(x_i) = e^(x_i) / Σ_j e^(x_j)

Tanh (Hyperbolic Tangent):

Tanh squashes input values to the range [-1, 1].

Formula: tanh(x) = (e^x − e^(-x)) / (e^x + e^(-x))

Leaky ReLU:

Leaky ReLU allows a small gradient for negative inputs.

Formula: LeakyReLU(x) = x if x > 0, otherwise αx,

where α is a small positive constant

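The following sketch (assuming PyTorch) applies each of these activations to a small example tensor so their outputs can be compared:

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 1.0, 3.0])

print(nn.ReLU()(x))              # negatives become 0
print(nn.Sigmoid()(x))           # values squashed into (0, 1)
print(nn.Tanh()(x))              # values squashed into (-1, 1)
print(nn.LeakyReLU(0.01)(x))     # negatives scaled by alpha = 0.01 instead of zeroed
print(nn.Softmax(dim=0)(x))      # exponentiated and normalized to sum to 1
```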

Normalization Layer:

It plays a crucial role in improving training stability and accelerating convergence.

Batch Normalization (BN):

BN normalizes the activations of internal layers over each mini-batch during training (a short sketch follows the benefits list below).

How It Works:

  • Computes the mean and variance of activations within a mini-batch.
  • Normalizes each activation by subtracting the mean and dividing by the standard deviation.
  • Scales and shifts the normalized activations using learnable parameters (gamma and beta).

Benefits:

  • Faster convergence during training.
  • Reduces internal covariate shift (fluctuations in layer activations).
  • Improves gradient flow and stability.
  • Acts as a regularizer, reducing overfitting.
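A minimal BN sketch, assuming PyTorch; the 12 channels and batch size of 16 are illustrative:

```python
import torch
import torch.nn as nn

# BatchNorm2d keeps one mean/variance estimate (and learnable gamma/beta) per channel.
bn = nn.BatchNorm2d(num_features=12)

x = torch.randn(16, 12, 32, 32)    # statistics are computed over the 16-sample mini-batch
out = bn(x)
print(out.shape)                   # torch.Size([16, 12, 32, 32])
```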

Layer Normalization (LN):

LN normalizes the activations of a single sample across the features of a layer, independently of the other samples in the batch.

How It Works:

  • Computes the mean and variance across all units within the layer, separately for each sample.
  • Normalizes each activation using these per-sample, layer-wide statistics.
  • Similar to BN, but it does not depend on the batch dimension.

Use Cases:

  • LN is useful when batch sizes are small or when applying neural networks to sequences (e.g., recurrent networks).

Instance Normalization (IN):

  • IN standardizes each feature map with respect to each sample (instance).
  • It is commonly used in style transfer and other tasks where contrast normalization is essential.

How It Works:

  • For each feature map, IN computes the mean and variance across spatial locations (H x W) independently for each sample (instance).
  • Normalizes the feature map using these per-instance statistics.
  • Does not exploit the batch dimension.

Use Cases:

  • Useful when the class label should not depend on the contrast of the input image (e.g., style transfer).
  • Not commonly used for image classification due to its independence from batch information.

Group Normalization (GN):

  • GN is a trade-off between Layer Normalization (LN) and IN.
  • Divides the channel dimension into groups and normalizes within each group.

How It Works:

  • GN divides the C channels into G groups of C / G channels each (where C is the total number of channels and G is the number of groups).
  • Normalizes each group independently across spatial locations (H x W).
  • Does not exploit the batch dimension.

Benefits:

  • GN achieves similar performance to BN when the batch size is medium or high.
  • Outperforms BN when there are fewer instances in each batch.
  • A potential replacement for BN in simple usage scenarios (a combined sketch of LN, IN, and GN follows below).
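A combined sketch of LN, IN, and GN on the same illustrative tensor, again assuming PyTorch; the choice of 4 groups for GN is arbitrary:

```python
import torch
import torch.nn as nn

x = torch.randn(16, 12, 32, 32)    # (batch, channels, height, width)

ln = nn.LayerNorm([12, 32, 32])    # per-sample statistics over all channels and positions
inorm = nn.InstanceNorm2d(12)      # per-sample, per-channel statistics over H x W
gn = nn.GroupNorm(num_groups=4, num_channels=12)  # 4 groups of 3 channels each

for layer in (ln, inorm, gn):
    print(layer(x).shape)          # torch.Size([16, 12, 32, 32]) in every case
```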

Regularization layer:

  • Regularization techniques prevent overfitting by controlling the complexity of the model.
  • They encourage the network to learn simpler, more generalizable representations.

Dropout Layer:

  • Randomly drops a fraction of neurons during training.
  • Prevents co-adaptation of neurons and encourages robustness.

L1 and L2 Regularization (Weight Decay):

  • Adds a penalty term to the loss function based on the magnitude of weights.
  • L1 regularization encourages sparsity (some weights become exactly zero).
  • L2 regularization encourages small weights (but doesn’t force them to zero).

Early Stopping:

  • Monitors validation loss during training.
  • Stops training when validation loss starts increasing (indicating overfitting). A short regularization sketch follows below.
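A short sketch of dropout and L2 weight decay, assuming PyTorch; note that early stopping is normally implemented as a check inside the training loop rather than as a layer:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 128)

# Dropout: randomly zeroes 50% of activations during training (and rescales the rest).
dropout = nn.Dropout(p=0.5)
dropout.train()
print(dropout(x)[0, :8])   # roughly half of these values are zero
dropout.eval()             # at inference time dropout is a no-op
print(dropout(x)[0, :8])   # values pass through unchanged

# L2 regularization (weight decay) is typically applied through the optimizer.
model = nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
```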

Output Layer:

  • It provides the final predictions based on the learned information.
  • It represents the network’s final output classifications or regression values.
  • The number of neurons in the output layer corresponds to the number of output classes (for classification tasks).
  • In regression problems, the output layer typically has only one neuron. A minimal end-to-end sketch putting these layers together follows below.
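To tie everything together, here is a minimal end-to-end sketch (assuming PyTorch; the 32x32 RGB input and 10 output classes are illustrative) that stacks the layers discussed in this post:

```python
import torch
import torch.nn as nn

# A minimal CNN that strings together the layers discussed above:
# convolution -> normalization -> activation -> pooling -> flatten -> dropout -> dense output.
num_classes = 10  # assumed number of classes for illustration

model = nn.Sequential(
    nn.Conv2d(3, 12, kernel_size=3, padding=1),   # convolution layer
    nn.BatchNorm2d(12),                           # normalization layer
    nn.ReLU(),                                    # activation layer
    nn.MaxPool2d(2),                              # pooling layer
    nn.Flatten(),                                 # flatten layer
    nn.Dropout(0.5),                              # regularization layer
    nn.Linear(12 * 16 * 16, num_classes),         # fully connected output layer
)

x = torch.randn(8, 3, 32, 32)      # a batch of eight 32x32 RGB images
logits = model(x)
print(logits.shape)                # torch.Size([8, 10]) -- one score per class
# For classification, softmax (or CrossEntropyLoss, which applies it internally)
# turns these scores into class probabilities.
```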


This brings us to the end of the post. Thank you so much for reading; I hope you enjoyed it and gained some knowledge about the layers of CNNs. In continuation of this post, I am planning to write a few more articles.
