
Implementing Tensor Contractions in Modern C++
Author: Luiz d'Oleron
Originally published on Towards AI.
Tensor contractions are of fundamental importance in modern artificial intelligence systems, playing a central role in the computation performed by the underlying linear algebra engines.
Despite their relevance, there are only a few technical references on contractions, and many of them cover only advanced topics or implementation challenges in specific cases or strategies.
As a result, many in the AI/ML community can only describe contractions as a "generalized case of matrix multiplication" without knowing exactly how the operation works.
To shed some light on this scenario, this story introduces tensor contractions from a coding point of view, focusing on:
- How to define the resulting tensor dimensions
- How to compute each coefficient of the resulting tensor
The discussion is supported by a fully functional, readable, from-scratch implementation in modern C++ that computes tensor contractions in the multidimensional case.
Let's start by understanding how contractions are used and where they are needed.
Where tensor contractions come from
Consider the following representation of a 3-layer neural network:

I assume this architecture is familiar to you. For now, we will focus on the first layer (the input layer) and the next one (the first hidden layer).
The value of each neuron Y(j) is given by the formula:
Y(j) = A( Σ(i) X(i) · M(i, j) )
where each M(i, j) represents the weight (or synapse) connecting X(i) to Y(j). The function A(...) is called the activation function. Because activation functions are not strictly relevant for this talk, we will rewrite the previous formula as:
Y(j) = Σ(i) X(i) · M(i, j)
It is easier to obtain the value of each Y(j) by performing the following (schoolbook) matrix multiplication:

If you do not remember how to compute this product, check the image below:

This operation results in the following computation:

In general, instead of computing a single instance alone, we compute batches of instances. For example, if we have 6 input arrays, we can perform a batch multiplication by stacking the 6 arrays into a single matrix:

and repeating the last multiplication:

In this case, the input matrix X is 6×5, the weight matrix M is 5×4, and the result matrix Y is 6×4. The computation is similar to that in the last case:

Resulting in the sum:

This operation is only possible because the number of "columns" in X is equal to the number of "rows" in M:

This mechanism is how the dense layers of many neural networks work. The implementation of such models uses the GEneral Matrix Multiply (GEMM) primitive available in many Basic Linear Algebra Subprograms (BLAS) libraries. These GEMMs are highly optimized, allowing state-of-the-art performance at both training and inference time.
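Before generalizing, it may help to see the batched product above written out as plain code. The following is a minimal, unoptimized sketch (not a real GEMM) of Y = X · M for the 6×5 by 5×4 case, with the matrices stored as flat row-major vectors and filled with placeholder values:
#include <vector>
#include <cstdio>

int main()
{
    const int B = 6, I = 5, J = 4;          // batch size, input neurons, output neurons
    std::vector<float> X(B * I, 1.0f);      // 6x5 input batch (placeholder values)
    std::vector<float> M(I * J, 0.5f);      // 5x4 weight matrix (placeholder values)
    std::vector<float> Y(B * J, 0.0f);      // 6x4 result

    for (int b = 0; b < B; ++b)
        for (int j = 0; j < J; ++j)
        {
            float acc = 0.0f;
            for (int i = 0; i < I; ++i)
                acc += X[b * I + i] * M[i * J + j];   // Y(b, j) = sum_i X(b, i) * M(i, j)
            Y[b * J + j] = acc;
        }

    std::printf("Y(0, 0) = %f\n", Y[0]);
    return 0;
}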
Of course, matrices are not suitable to represent all types of data. As shown next, matrix multiplication is only one particular case of a more generic and powerful tool: tensor contractions.
The multidimensional case(s)
The previous matrix multiplication supports scenarios where the inputs are 1-dimensional streams, which can be combined to form 2D data grids. But what happens if, instead of holding a single value, each X(i) is a multi-valued array?
Consider, for example, that each X encodes an RGB pixel. Thus, each X(i) is now an array of 3 positions: X(i, r), X(i, g), and X(i, b):
Considering that each Y(j) is also an RGB pixel, we can perform the following operation:

This multiplication produces a 3-dimensional, grid-like object. We call such objects tensors; more precisely, this one is a rank-3 tensor. The computation of each coefficient of tensor Y can be visualized as follows:

Tensors are cuboid, grid-like structures with 0, 1, 2, or n dimensions. In the previous example, the first tensor X has dimensions {6, 3, 5}. The second tensor M has dimensions {5, 4}, and the result tensor Y is {6, 3, 4}.
Note that matrices are simply the family of tensors with exactly 2 dimensions.
Now, we can define a tensor contraction as a binary operation over two tensors that takes a set of index pairs as its configuration and results in another tensor:

In schoolbook matrix multiplication, the pair is always {1, 0}: we intersect dimension 1 (the columns) of the first matrix with dimension 0 (the rows) of the second matrix. Tensor contraction pairs do not have this limitation.
The intersection pairs
Consider our previous example again, where an X {6, 3, 5} tensor and an M {5, 4} tensor contract to form a Y {6, 3, 4} tensor. This operation is represented below:

As shown above, this contraction uses a single pair {2, 0} to intersect the two tensors: dimension 2 of X and dimension 0 of M.
As in schoolbook matrix multiplication, the paired dimensions must match, and that is the case here: dimension 2 of X and dimension 0 of M are both 5.
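In code, this consistency check is straightforward. Here is a minimal, hypothetical sketch (the helper name is illustrative; the aliases are the same ones used by the implementation later in this story):
#include <vector>
#include <utility>

using DIMS = std::vector<int>;
using INDEX_PAIRS = std::vector<std::pair<int, int>>;

// Hypothetical helper: verify that every intersection pair matches.
bool pairs_match(const DIMS &x_dims, const DIMS &m_dims, const INDEX_PAIRS &pairs)
{
    for (const auto &pair : pairs)
        if (x_dims[pair.first] != m_dims[pair.second])
            return false;   // the paired dimensions have different sizes
    return true;
}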
The resulting tensor dimensions are formed by concatenating the non-intersected dimensions of tensor X with the non-intersected dimensions of tensor M. Thus, we can write the following formula for the tensors' ranks:
rank(Y) = rank(X) + rank(M) − 2 · (number of pairs)
Applying this formula to our previous example gives
rank(Y) = 3 + 2 − 2 · 1 = 3
which is indeed the rank of Y.
Let's consider another case, where we use two pairs. In this scenario, X is {6, 3, 5} (as in the previous example), but M is {3, 5, 4}, and we use two pairs {{1, 0}, {2, 1}} to output a tensor Y with rank 2:

In this last example, the dimensions of the resulting tensor Y are {6, 4}.
In a nutshell, the following rule is all we need to keep in mind to obtain the shape of a contraction:
The resulting tensor dimensions are the concatenation of the dimensions of the two input tensors, excluding the dimensions listed in the pairs.
Let's check out other examples (a short sketch that computes these output shapes follows them):
- X {2, 3}, M{3, 2}, pairs {{1, 0}} = Y {2, 2}
- X {2, 3}, M{3, 2}, pairs {{0, 1}} = Y {3, 3}
- X {11, 12, 13, 14}, M{12, 13, 14}, pairs {{1, 0}, {2, 1}, {3, 2}} = Y {11}
- X {2, 3}, M{4, 5}, pairs {} = Y {2, 3, 4, 5 }
Note that the order of pairs does not matter:
- X {4, 3, 5, 10, 3}, M{5, 9, 4, 8}, pairs {{0, 2}, {2, 0}} = Y {3, 10, 3, 9, 8}
- X {4, 3, 5, 10, 3}, M{5, 9, 4, 8}, pairs {{2, 0}, {0, 2}} = Y {3, 10, 3, 9, 8}
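The examples above can be reproduced with a small stand-alone sketch of the shape rule (this is an illustrative helper, not the Tensor class implemented later):
#include <vector>
#include <utility>
#include <unordered_set>

using DIMS = std::vector<int>;
using INDEX_PAIRS = std::vector<std::pair<int, int>>;

// Concatenate the non-intersected dimensions of X and M.
DIMS contraction_shape(const DIMS &x_dims, const DIMS &m_dims, const INDEX_PAIRS &pairs)
{
    std::unordered_set<int> x_used, m_used;
    for (const auto &pair : pairs)
    {
        x_used.insert(pair.first);
        m_used.insert(pair.second);
    }
    DIMS result;
    for (int i = 0; i < (int)x_dims.size(); ++i)
        if (!x_used.count(i)) result.push_back(x_dims[i]);
    for (int i = 0; i < (int)m_dims.size(); ++i)
        if (!m_used.count(i)) result.push_back(m_dims[i]);
    return result;
}

// Example: contraction_shape({2, 3}, {3, 2}, {{1, 0}}) returns {2, 2}.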
Once we understand how to define the output tensor shape, the last step is knowing how to compute each coefficient in the output tensor.
Computing the output coefficients
Now we focus on computing the output coefficients of the contraction of X by M using the provided pairs configuration. The first thing to note is that, for each output coefficient, part of its indices come from the X tensor and the rest come from the M tensor.
Let's make this easier to understand with an illustrative example. Consider the contraction of X {6, 3, 5} by M {3, 5, 4} using pairs {{1, 0}, {2, 1}}. This contraction results in the 2-dimensional tensor Y {6, 4}:

Now, we need to compute each one of the 24 coefficients Y(i, j), where i varies from 0 to 5 and j varies from 0 to 3. This space is represented by the following grid:

To compute each coefficient Y(i, j), we must sum the products of X(i, a, b) and M(a, b, j) over every combination (a, b) of the intersected dimensions {3, 5}.
Note that there are 3×5 = 15 combinations of (a, b).
The computation of each coefficient Y(i, j) is summarized in the following formula:
Y(i, j) = Σ(a = 0..2) Σ(b = 0..4) X(i, a, b) · M(a, b, j)
Let's draw the computation of Y(2, 3):

The colors highlight where each index comes from.
The summation used to compute each coefficient Y(i, j) is called an inner product.
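As a sanity check, the formula above can be written directly as nested loops. The sketch below computes every Y(i, j) for the X {6, 3, 5} by M {3, 5, 4} example, storing the tensors as flat row-major arrays with placeholder values:
#include <vector>
#include <cstdio>

int main()
{
    const int I = 6, A = 3, B = 5, J = 4;
    std::vector<float> X(I * A * B, 1.0f);  // X {6, 3, 5}, placeholder values
    std::vector<float> M(A * B * J, 0.5f);  // M {3, 5, 4}, placeholder values
    std::vector<float> Y(I * J, 0.0f);      // Y {6, 4}

    for (int i = 0; i < I; ++i)
        for (int j = 0; j < J; ++j)
        {
            float acc = 0.0f;
            for (int a = 0; a < A; ++a)          // intersected dimension of size 3
                for (int b = 0; b < B; ++b)      // intersected dimension of size 5
                    acc += X[(i * A + a) * B + b] * M[(a * B + b) * J + j];
            Y[i * J + j] = acc;                  // inner product of 15 terms
        }

    std::printf("Y(2, 3) = %f\n", Y[2 * J + 3]);
    return 0;
}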
To make sure we understand how to compute the coefficients, let's list each coefficient computation of the X {2, 3} by M {2, 3} contraction using pairs {{0, 0}}:

As seen above, the contraction results in a Y{3, 3} tensor.
Below, an example of the contraction of X {2, 2, 2} by M {2, 2, 2, 2} using pairs {{1, 2}, {2, 3}} resulting in Y {2, 2, 2}:

The examples provide insight into the expected behavior of contractions. Now, let's start thinking about the contraction implementation.
Implementing Tensors
By definition, a tensor is a grid-like object with one, two, or more dimensions. However, implementing tensors as literal grids (for example, using arrays of arrays or bi-dimensional arrays) is uncommon. Instead, tensors are internally represented by contiguous, linear containers.
How does this work? It works by adopting a linear index order such as column-major or row-major. In our implementation, we will use row-major order, where the last index varies fastest, as represented below:

Once we have chosen the tensor layout, we need to work out how to perform the following operations:
- How to convert a tensor index into an internal position and
- How to convert an internal position into a multidimensional index in the tensor space
Of course, one operation is the inverse of the other. Let us start by converting a tensor index into an internal position:
#include <iostream>
#include <vector>
#include <numeric>

int main(int, char **)
{
    std::vector<int> dimensions {6, 7, 8, 9};
    std::vector<int> prefix (dimensions.size(), 1);
    for (int j = dimensions.size() - 2; j >= 0; j--) {
        prefix[j] = dimensions[j + 1] * prefix[j + 1];
    }
    // prefix is {7*8*9, 8*9, 9, 1}
    std::vector<int> indices {3, 1, 4, 1};
    int pos = std::inner_product(indices.begin(),
                                 indices.end(), prefix.begin(), 0);
    std::cout << pos << "\n"; // prints 1621
    return 0;
}
The code above is self-explanatory. In general, we store prefix somewhere so we can reuse it when computing multiple indices. Some references call this operation ravel_index because of its similarity to the structure of a woven fabric.

The next operation is unravel_index, that is: given an internal position, find the corresponding sequence of tensor indices. The name comes from the idea of separating, or disentangling, the threads of a woven or knitted fabric.
For practical purposes, we will enclose this operation inside a functor:
#include <vector>

using DIMS = std::vector<int>;

class INDEX_GENERATOR
{
public:
    INDEX_GENERATOR(const DIMS &dims)
    {
        const int DIM_SIZE = dims.size();
        this->divisors.resize(DIM_SIZE, 1);
        this->length = 1;
        // build the row-major strides from the last dimension to the first
        for (int j = DIM_SIZE - 1; j >= 0; j--)
        {
            this->divisors[j] = this->length;
            this->length *= dims[j];
        }
    }

    DIMS operator()(const int pos) const
    {
        const int DIM_SIZE = this->divisors.size();
        DIMS result(DIM_SIZE, 0);
        int residual = pos;
        for (int j = 0; j < DIM_SIZE; j++)
        {
            // divide by the stride of dimension j and keep the remainder
            const int acc = this->divisors[j];
            const int index = residual / acc;
            result[j] = index;
            residual -= index * acc;
        }
        return result;
    }

private:
    DIMS divisors;
    int length;
};
We can use this functor like below:
INDEX_GENERATOR generator({6,7,8,9});
auto indices = generator(1621); // {3, 1, 4, 1}
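A quick way to convince ourselves that the two conversions are inverses of each other is to feed the result back into the ravel step from before (the strides below are the same prefix values computed earlier, written out by hand):
// strides for {6, 7, 8, 9} in row-major order (same values as `prefix` above)
DIMS strides {7 * 8 * 9, 8 * 9, 9, 1};
int pos = std::inner_product(indices.begin(), indices.end(), strides.begin(), 0);
// pos is 1621 again: unravel followed by ravel recovers the original position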
Now we know how to represent tensors using linear containers. It is time to move forward and implement the contraction.
Finding contraction dimensions
As explained at the beginning of this story, the first thing to do is to compute the dimensions of the resulting tensor:
#include <vector>
#include <utility>
#include <unordered_set>

template <typename FP_TYPE>
class Tensor
{
    using DIMS = std::vector<int>;
    using INDEX_PAIRS = std::vector<std::pair<int, int>>;
    using STORAGE = std::vector<FP_TYPE>;

private:
    STORAGE storage;
    DIMS dimensions;

public:
    Tensor<FP_TYPE> contraction(const Tensor<FP_TYPE> &other,
                                const INDEX_PAIRS &pairs) const
    {
        const int this_size = this->dimensions.size();
        const int other_size = other.dimensions.size();
        const int pairs_size = pairs.size();
        // rank of the result: non-intersected dimensions of both tensors
        const int dim_size = this_size + other_size - 2 * pairs_size;

        DIMS intersection_dims(pairs_size, 0);
        std::unordered_set<int> this_pairs_set;
        std::unordered_set<int> other_pairs_set;
        for (int idx = 0; idx < pairs_size; ++idx)
        {
            const auto &pair = pairs[idx];
            this_pairs_set.insert(pair.first);
            other_pairs_set.insert(pair.second);
            intersection_dims[idx] = this->dimensions[pair.first];
        }

        DIMS result_dimensions;
        result_dimensions.reserve(dim_size);
        for (int i = 0; i < this_size; ++i)
        {
            if (this_pairs_set.find(i) == this_pairs_set.end())
            {
                int dim = this->dimensions[i];
                result_dimensions.push_back(dim);
            }
        }
        for (int i = 0; i < other_size; ++i)
        {
            if (other_pairs_set.find(i) == other_pairs_set.end())
            {
                int dim = other.dimensions[i];
                result_dimensions.push_back(dim);
            }
        }
        //.. continued in the next listing
    }
};
Check the complete code on GitHub.
At this point, we have two arrays: result_dimensions and intersection_dims. If we invoke the contraction as follows:
Tensor<float> X = Tensor<float>::RANDOM({2, 5, 3});
Tensor<float> M = Tensor<float>::RANDOM({2, 11, 13, 3});
Tensor<float> Y = X.contraction(M, {{0, 0}, {2, 3}});
then the contents of result_dimensions will be {5, 11, 13} and intersection_dims will be {2, 3}. This means that the result tensor Y has dimensions {5, 11, 13} and 5 × 11 × 13 = 715 coefficients.
To compute each coefficient Y(i, j, k), we must find all coefficients X(a, i, b) and M(a, j, k, b), multiply them pairwise, and sum the products.
This is what the following code does:
Tensor<FP_TYPE> contraction(const Tensor<FP_TYPE> &other,
                            const INDEX_PAIRS &pairs) const
{
    //... previous part: the result tensor, the index generators, and the
    //    stride arrays this_pre / other_pre are built here (see GitHub)

    // precaching intersection offsets
    std::vector<int> position_this_cache(intersection_length, 0);
    std::vector<int> position_other_cache(intersection_length, 0);
    for (int j = 0; j < intersection_length; ++j)
    {
        DIMS intersection = intersection_generator(j);
        int position_this = 0;
        int position_other = 0;
        for (int idx = 0; idx < pairs_size; ++idx)
        {
            position_this += intersection[idx] * this_pre[idx + dim_size];
            position_other += intersection[idx] * other_pre[idx + dim_size];
        }
        position_this_cache[j] = position_this;
        position_other_cache[j] = position_other;
    }

    // computing each coefficient
    for (int i = 0; i < length; ++i)
    {
        DIMS indices = index_generator(i);
        int pos_this_before = 0;
        int pos_other_before = 0;
        for (int idx = 0; idx < dim_size; ++idx)
        {
            pos_this_before += indices[idx] * this_pre[idx];
            pos_other_before += indices[idx] * other_pre[idx];
        }
        FP_TYPE acc = FP_TYPE(0);
        for (int j = 0; j < intersection_length; ++j)
        {
            int pos_this = pos_this_before + position_this_cache[j];
            int pos_other = pos_other_before + position_other_cache[j];
            FP_TYPE this_coeff = this->storage[pos_this];
            FP_TYPE other_coeff = other.storage[pos_other];
            acc += this_coeff * other_coeff;
        }
        result.set(indices, acc);
    }
    return result;
}
Assuming that each dimension has size N, this code has a time complexity of:
O(N^(x + m − p))
where x is the rank of tensor X, m is the rank of tensor M, and p is the number of pairs.
This time complexity tells us that contractions are very compute-intensive operations: the output has N^(x + m − 2p) coefficients, each requiring an inner product over N^p terms, so the cost grows as a power of N whose exponent increases with the ranks of the inputs.
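As an illustrative instance of this formula (the numbers are hypothetical): contracting two rank-3 tensors over a single pair (x = 3, m = 3, p = 1), with every dimension equal to N = 100, costs on the order of 100^(3 + 3 - 1) = 100^5 = 10^10 multiply-add operations.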
Check the chart below, which depicts the total operation time of 100 contractions using configurations where x + m - p is 2, 3, 4, 5, 6, and 7:

In this story, the code is written solely to make the mechanisms behind contractions easier to understand. Of course, the implementation here does not aim to be as fast as production-grade implementations. But this does not mean that our code cannot run at least decently!
There are a few details that we can implement without making the code too hard to read.
A guideline for achieving state-of-the-art performance is discussed at the end of this story.
To make things run faster, we pre-cached part of the computation of the vectors X(a, i, b) and M(a, j, k, b):
// precaching intersection offsets
std::vector<int> position_this_cache(intersection_length, 0);
std::vector<int> position_other_cache(intersection_length, 0);
for (int j = 0; j < intersection_length; ++j)
{
    DIMS intersection = intersection_generator(j);
    int position_this = 0;
    int position_other = 0;
    for (int idx = 0; idx < pairs_size; ++idx)
    {
        position_this += intersection[idx] * this_pre[idx + dim_size];
        position_other += intersection[idx] * other_pre[idx + dim_size];
    }
    position_this_cache[j] = position_this;
    position_other_cache[j] = position_other;
}
Invoking the contraction is pretty simple:
Tensor<float> Y = X.contraction(M, {{0, 0}, {2, 3}});
As shown above, we pre-cache some computations to avoid re-evaluating them in each iteration of the main loop.
You can find more implementation details in the full source on GitHub.
Optimization tips
Whenever someone asks me for performance advice, I say:
- Only optimize after the whole thing runs correctly
- Check if your algorithm complexity is as good as you expect
- Avoid recomputing things inside loops
- Avoid allocating memory inside loops
- Use the STL and canonical algorithms (std::transform(), std::for_each()) as much as possible
- Use const variables (including lambdas)
- Use unordered containers such as std::unordered_map and std::unordered_set instead of std::map and std::set
- Avoid the "one-liner" habit
- Keep the code clean so you can understand it, and improve it, more easily
- Use the right compiler flags
Those are "general-purpose" tips. They help with any project. There are other nice tips, but let's move on to more specific advice.
Making things run even faster
The code in this story is not intended to replace a proper BLAS implementation. We are using it only to explain how contractions work. However, we can think of several ways to improve the code to achieve production-level performance.
Reducing cache misses
In general, tensors are huge objects occupying large amounts of memory in the computer's RAM. Unfortunately, the CPU cannot operate on RAM directly: tensor data must be copied from RAM into the CPU caches before the program can use it.
The problem is that CPU caches are usually much smaller than the tensors, resulting in cache misses that increase latency. As a result, the overall time can be 4 times slower, or worse, if the program causes cache misses too often.
The solution to this issue is well explored in the GEMM routines of popular BLAS libraries, where the whole operation is broken into smaller blocks (packed panels) that fit in the CPU cache.
Implementing the contraction with this kind of blocking and packing improves performance but reduces code readability.
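To illustrate the idea on plain matrices (not on the tensor code above), here is a hedged sketch of a blocked multiplication; the block size of 64 is an arbitrary placeholder and should be tuned to the actual cache sizes:
#include <vector>
#include <algorithm>

// Multiply an (n x n) A by an (n x n) B into C, which is assumed to be
// zero-initialized, visiting the data in cache-friendly blocks instead of
// full rows and columns.
void blocked_matmul(const std::vector<float> &A, const std::vector<float> &B,
                    std::vector<float> &C, int n, int block = 64)
{
    for (int i0 = 0; i0 < n; i0 += block)
        for (int k0 = 0; k0 < n; k0 += block)
            for (int j0 = 0; j0 < n; j0 += block)
                // multiply one pair of blocks; their footprint fits in cache
                for (int i = i0; i < std::min(i0 + block, n); ++i)
                    for (int k = k0; k < std::min(k0 + block, n); ++k)
                    {
                        const float a = A[i * n + k];
                        for (int j = j0; j < std::min(j0 + block, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}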
Multi-core implementation
The code so far uses a single-core implementation. We can use modern C++ to turn it into a multi-core program very easily, for example, by splitting the main loop into threads:
// requires <algorithm>, <execution>, and <numeric> in addition to the previous headers
const auto compute = [&](int i)
{
    DIMS indices = index_generator(i);
    int pos_this_before = 0;
    int pos_other_before = 0;
    for (int idx = 0; idx < dim_size; ++idx)
    {
        pos_this_before += indices[idx] * this_pre[idx];
        pos_other_before += indices[idx] * other_pre[idx];
    }
    FP_TYPE acc = FP_TYPE(0);
    for (int j = 0; j < intersection_length; ++j)
    {
        int pos_this = pos_this_before + position_this_cache[j];
        int pos_other = pos_other_before + position_other_cache[j];
        FP_TYPE this_coeff = this->storage[pos_this];
        FP_TYPE other_coeff = other.storage[pos_other];
        acc += this_coeff * other_coeff;
    }
    result.set(indices, acc);
};

std::vector<int> coefficients(length);
std::iota(coefficients.begin(), coefficients.end(), 0);
std::for_each(std::execution::par, coefficients.begin(),
              coefficients.end(), compute);
The point here is running the loop in parallel using std::execution::par, enclosing the loop body in a lambda.
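One practical detail, stated as an assumption about your toolchain: parallel algorithms need a C++17 compiler, and with GCC's libstdc++ they are typically backed by Intel TBB, so the build command may look like this (file name and flags are illustrative):
g++ -O3 -std=c++17 contraction.cpp -o contraction -ltbb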
How fast does the code run after this change? This is what I get by running the same previous example using multithreaded contractions:

Using the multithreaded version, the program executes roughly 6 times faster. This happens because the processing load is shared among multiple cores.
Multicore programming is only one alternative for parallel execution. Today, the most popular way to distribute processing loads is GPU programming.
CUDA
Contraction is the type of computation that can run much faster on GPUs. However, refactoring the previous code to use CUDA is not simple, because GPU processors work quite differently from CPUs.
If we want to use CUDA, we need to rewrite the whole thing and implement a function that:
- Streams memory blocks to the GPU
- Runs the kernel and
- Gets each slice back to build the resulting tensor.
This is not hard to implement, but we need to rethink the code to work out how to divide the work into blocks that suit the GPU.
This is a good topic for a future story.
Epilogue: Why We Love Contractions
Back to the math concepts, it is clear why tensor contractions are so useful for pattern recognition tasks in computer vision or NLP systems. Essentially, each coefficient that a tensor contraction computes is an inner product, and inner products are closely related to the cosine similarity between vectors.
Remember how the coefficients are evaluated:
Y(i, j) = Σ(a) Σ(b) X(i, a, b) · M(a, b, j)
This formula has the form of a dot product between two vectors a and b:
a · b = Σ(i) a(i) · b(i)
Note that the dot product can also be expressed as:
a · b = |a| · |b| · cos(θ)
where θ is the angle between the vectors a and b.
This relation provides an interesting geometric insight into what contractions represent, since cosines are very informative for indicating the similarity between two elements of a Euclidean space.
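As a small illustration (not part of the contraction code), the same std::inner_product used earlier can compute this cosine directly:
#include <vector>
#include <numeric>
#include <cmath>
#include <cstdio>

int main()
{
    // two arbitrary example vectors
    std::vector<float> a {1.0f, 2.0f, 3.0f};
    std::vector<float> b {4.0f, 5.0f, 6.0f};

    const float dot = std::inner_product(a.begin(), a.end(), b.begin(), 0.0f);
    const float norm_a = std::sqrt(std::inner_product(a.begin(), a.end(), a.begin(), 0.0f));
    const float norm_b = std::sqrt(std::inner_product(b.begin(), b.end(), b.begin(), 0.0f));

    // cos(theta) = (a . b) / (|a| |b|)
    std::printf("cosine similarity = %f\n", dot / (norm_a * norm_b));
    return 0;
}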
Conclusions
Contractions are among the key concepts that every AI/ML engineer should know well. I hope this story helps those looking for guidance on the topic.
Source Code
The code used here can be found on GitHub.
Questions, corrections, or concerns
Can you spot something wrong? Drop a message. Have a question? Drop a message. Do you want to talk about the code or contractions? Post your message. It is welcome and free!
How to contribute
Nobody paid me to write this story. If you enjoyed this tutorial and want to contribute, find a charity institution near you and provide your assistance with money and/or time. No citation is needed. You needn't let me know either.
Have a question? Drop a message and I will reply as soon as possible.
Regards,
Luiz
Published via Towards AI