
Implementing Tensor Contractions in Modern C++
Author: Luiz d'Oleron
Originally published on Towards AI.
Tensor contractions are of fundamental importance in modern artificial intelligence systems, playing a central role in the computation performed by the underlying linear algebra engines.
Despite their relevance, there are only a few technical references on contractions, and many of them cover only advanced topics or implementation challenges in specific cases or strategies.
As a result, many in the AI/ML community can only describe contractions as a "generalized case of matrix multiplication" without knowing exactly how the operation works.
To shed some light on this scenario, this story introduces tensor contractions from a coding point of view, focusing on:
- How to define the resulting tensor dimensions
- How to compute each coefficient of the resulting tensor
The discussion is supported by a fully functional, readable, from-scratch implementation in modern C++ that computes tensor contractions in the multidimensional case.
Let's start by understanding how contractions are used and where they are needed.
Where tensor contractions come from
Consider the following representation of a 3-layer neural network:

I assume this architecture is familiar to you. For now, we will focus on the first layer (the input layer) and the next one (the first hidden layer).
The value of each neuron Y(j) is given by the formula:
Y(j) = A( Σ(i) X(i) · M(i, j) )
where each M(i, j) represents the weight (or synapse) connecting X(i) to Y(j). The function A(...) is called the activation function. Because activation functions are not strictly relevant for this talk, we will rewrite the previous formula as:
Y(j) = Σ(i) X(i) · M(i, j)
It is easier to obtain the value of each Y(j) by performing the following (schoolbook) matrix multiplication:

If you do not remember how to compute this product, check the image below:

This operation results in the following computation:

In general, instead of computing a single instance alone, we compute batches of instances. For example, if we have 6 input arrays, we can perform a batch multiplication by stacking the 6 arrays into a single matrix:

and repeating the last multiplication:

In this case, the input matrix X is 6×5, the weight matrix M is 5×4, and the result matrix Y is 6×4. The computation is similar to that in the last case:

Resulting in the sum:

This operation is only possible because the number of "columns" in X is equal to the number of "rows" in M:

This mechanism is how the dense layers of many neural networks work. The implementation of such models uses the GEneral Matrix Multiply (GEMM) primitive available in many Basic Linear Algebra Subprograms (BLAS) libraries. These GEMMs are highly optimized, allowing state-of-the-art performance at both training and inference time.
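Before generalizing, it may help to see the batched product above written out as plain code. The following is a minimal, unoptimized sketch (not a real GEMM) of Y = X · M for the 6×5 by 5×4 case, with the matrices stored as flat row-major vectors and filled with placeholder values:
#include <vector>
#include <cstdio>

int main()
{
    const int B = 6, I = 5, J = 4;          // batch size, input neurons, output neurons
    std::vector<float> X(B * I, 1.0f);      // 6x5 input batch (placeholder values)
    std::vector<float> M(I * J, 0.5f);      // 5x4 weight matrix (placeholder values)
    std::vector<float> Y(B * J, 0.0f);      // 6x4 result

    for (int b = 0; b < B; ++b)
        for (int j = 0; j < J; ++j)
        {
            float acc = 0.0f;
            for (int i = 0; i < I; ++i)
                acc += X[b * I + i] * M[i * J + j];   // Y(b, j) = sum_i X(b, i) * M(i, j)
            Y[b * J + j] = acc;
        }

    std::printf("Y(0, 0) = %f\n", Y[0]);
    return 0;
}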
Of course, matrices are not suitable to represent all types of data. As shown next, matrix multiplication is only one particular case of a more generic and powerful tool: tensor contractions.
The multidimensional case(s)
The previous matrix multiplication supports scenarios where the inputs are 1-dimensional streams, which can be combined to form 2D data grids. But what happens if, instead of holding a single value, each X(i) is a multi-valued array?
Consider, for example, that each X encodes an RGB pixel. Thus, each X(i) is now an array of 3 positions: X(i, r), X(i, g), and X(i, b):
Considering that each Y(j) is also an RGB pixel, we can perform the following operation:

This multiplication produces a 3-dimensional, grid-like object. We call such objects tensors; more precisely, this one is a rank-3 tensor. The computation of each coefficient of tensor Y can be visualized as follows:

Tensors are cuboid, grid-like structures with 0, 1, 2, or n dimensions. In the previous example, the first tensor X has dimensions {6, 3, 5}. The second tensor M has dimensions {5, 4}, and the result tensor Y is {6, 3, 4}.
Note that matrices are simply the family of tensors with exactly 2 dimensions.
Now, we can define a tensor contraction as a binary operation over two tensors that takes a set of index pairs as its configuration and results in another tensor:

In schoolbook matrix multiplication, the pair is always {1, 0}: we intersect dimension 1 (the columns) of the first matrix with dimension 0 (the rows) of the second matrix. Tensor contraction pairs do not have this limitation.
The intersection pairs
Consider our previous example again, where an X {6, 3, 5} tensor and an M {5, 4} tensor contract to form a Y {6, 3, 4} tensor. This operation is represented below:

As shown above, this contraction uses a single pair {2, 0} to intersect the two tensors: dimension 2 of X and dimension 0 of M.
As in schoolbook matrix multiplication, the paired dimensions must match, and that is the case here: dimension 2 of X and dimension 0 of M are both 5.
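In code, this consistency check is straightforward. Here is a minimal, hypothetical sketch (the helper name is illustrative; the aliases are the same ones used by the implementation later in this story):
#include <vector>
#include <utility>

using DIMS = std::vector<int>;
using INDEX_PAIRS = std::vector<std::pair<int, int>>;

// Hypothetical helper: verify that every intersection pair matches.
bool pairs_match(const DIMS &x_dims, const DIMS &m_dims, const INDEX_PAIRS &pairs)
{
    for (const auto &pair : pairs)
        if (x_dims[pair.first] != m_dims[pair.second])
            return false;   // the paired dimensions have different sizes
    return true;
}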
The resulting tensor dimensions are formed by concatenating the non-intersected dimensions of tensor X with the non-intersected dimensions of tensor M. Thus, we can write the following formula for the tensors' ranks:
rank(Y) = rank(X) + rank(M) − 2 · (number of pairs)
Applying this formula to our previous example gives
rank(Y) = 3 + 2 − 2 · 1 = 3
which is indeed the rank of Y.
Let's consider another case, where we use two pairs. In this scenario, X is {6, 3, 5} (as in the previous example), but M is {3, 5, 4}, and we use two pairs {{1, 0}, {2, 1}} to output a tensor Y with rank 2:

In this last example, the dimensions of the resulting tensor Y are {6, 4}.
In a nutshell, the following rule is all we need to keep in mind to obtain the shape of a contraction:
The resulting tensor dimensions are the concatenation of the dimensions of the two input tensors, excluding the dimensions listed in the pairs.
Let's check out other examples (a short sketch that computes these output shapes follows them):
- X {2, 3}, M{3, 2}, pairs {{1, 0}} = Y {2, 2}
- X {2, 3}, M{3, 2}, pairs {{0, 1}} = Y {3, 3}
- X {11, 12, 13, 14}, M{12, 13, 14}, pairs {{1, 0}, {2, 1}, {3, 2}} = Y {11}
- X {2, 3}, M{4, 5}, pairs {} = Y {2, 3, 4, 5 }
Note that the order of pairs does not matter:
- X {4, 3, 5, 10, 3}, M{5, 9, 4, 8}, pairs {{0, 2}, {2, 0}} = Y {3, 10, 3, 9, 8}
- X {4, 3, 5, 10, 3}, M{5, 9, 4, 8}, pairs {{2, 0}, {0, 2}} = Y {3, 10, 3, 9, 8}
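The examples above can be reproduced with a small stand-alone sketch of the shape rule (this is an illustrative helper, not the Tensor class implemented later):
#include <vector>
#include <utility>
#include <unordered_set>

using DIMS = std::vector<int>;
using INDEX_PAIRS = std::vector<std::pair<int, int>>;

// Concatenate the non-intersected dimensions of X and M.
DIMS contraction_shape(const DIMS &x_dims, const DIMS &m_dims, const INDEX_PAIRS &pairs)
{
    std::unordered_set<int> x_used, m_used;
    for (const auto &pair : pairs)
    {
        x_used.insert(pair.first);
        m_used.insert(pair.second);
    }
    DIMS result;
    for (int i = 0; i < (int)x_dims.size(); ++i)
        if (!x_used.count(i)) result.push_back(x_dims[i]);
    for (int i = 0; i < (int)m_dims.size(); ++i)
        if (!m_used.count(i)) result.push_back(m_dims[i]);
    return result;
}

// Example: contraction_shape({2, 3}, {3, 2}, {{1, 0}}) returns {2, 2}.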
Once we understand how to define the output tensor shape, the last step is knowing how to compute each coefficient in the output tensor.
Computing the output coefficients
Now we focus on computing the output coefficients of the contraction of X by M using the provided pairs configuration. The first thing to note is that, for each output coefficient, part of its indices come from the X tensor and the rest come from the M tensor.
Let's make this easier to understand with an illustrative example. Consider the contraction of X {6, 3, 5} by M {3, 5, 4} using pairs {{1, 0}, {2, 1}}. This contraction results in the 2-dimensional tensor Y {6, 4}:

Now, we need to compute each one of the 24 coefficients Y(i, j), where i varies from 0 to 5 and j varies from 0 to 3. This space is represented by the following grid:

To compute each coefficient Y(i, j), we must sum the products of X(i, a, b) and M(a, b, j) over every combination (a, b) of the intersected dimensions {3, 5}.
Note that there are 3×5 = 15 combinations of (a, b).
The computation of each coefficient Y(i, j) is summarized in the following formula:
Y(i, j) = Σ(a = 0..2) Σ(b = 0..4) X(i, a, b) · M(a, b, j)
Let's draw the computation of Y(2, 3):

The colors highlight where each index comes from.
The summation used to compute each coefficient Y(i, j) is called an inner product.
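As a sanity check, the formula above can be written directly as nested loops. The sketch below computes every Y(i, j) for the X {6, 3, 5} by M {3, 5, 4} example, storing the tensors as flat row-major arrays with placeholder values:
#include <vector>
#include <cstdio>

int main()
{
    const int I = 6, A = 3, B = 5, J = 4;
    std::vector<float> X(I * A * B, 1.0f);  // X {6, 3, 5}, placeholder values
    std::vector<float> M(A * B * J, 0.5f);  // M {3, 5, 4}, placeholder values
    std::vector<float> Y(I * J, 0.0f);      // Y {6, 4}

    for (int i = 0; i < I; ++i)
        for (int j = 0; j < J; ++j)
        {
            float acc = 0.0f;
            for (int a = 0; a < A; ++a)          // intersected dimension of size 3
                for (int b = 0; b < B; ++b)      // intersected dimension of size 5
                    acc += X[(i * A + a) * B + b] * M[(a * B + b) * J + j];
            Y[i * J + j] = acc;                  // inner product of 15 terms
        }

    std::printf("Y(2, 3) = %f\n", Y[2 * J + 3]);
    return 0;
}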
To make sure we understand how to compute the coefficients, let's list each coefficient computation of the X {2, 3} by M {2, 3} contraction using pairs {{0, 0}}:

As seen above, the contraction results in a Y{3, 3} tensor.
Below, an example of the contraction of X {2, 2, 2} by M {2, 2, 2, 2} using pairs {{1, 2}, {2, 3}} resulting in Y {2, 2, 2}:

The examples provide insight into the expected behavior of contractions. Now, let's start thinking about the contraction implementation.
Implementing Tensors
By definition, a tensor is a grid-like object with one, two, or more dimensions. However, implementing tensors as literal grids (for example, using arrays of arrays or bi-dimensional arrays) is uncommon. Instead, tensors are internally represented by contiguous, linear containers.
How does this work? It works by adopting a linear index order such as column-major or row-major. In our implementation, we will use row-major order, where the last index varies fastest, as represented below:

Once we have chosen the tensor layout, we need to work out how to perform the following operations:
- How to convert a tensor index into an internal position and
- How to convert an internal position into a multidimensional index in the tensor space
Of course, one operation is the inverse of the other. Let us start by converting a tensor index into an internal position:
#include <iostream>
#include <vector>
#include <numeric>

int main(int, char **)
{
    std::vector<int> dimensions {6, 7, 8, 9};
    std::vector<int> prefix (dimensions.size(), 1);
    for (int j = dimensions.size() - 2; j >= 0; j--) {
        prefix[j] = dimensions[j + 1] * prefix[j + 1];
    }
    // prefix is {7*8*9, 8*9, 9, 1}
    std::vector<int> indices {3, 1, 4, 1};
    int pos = std::inner_product(indices.begin(),
                                 indices.end(), prefix.begin(), 0);
    std::cout << pos << "\n"; // prints 1621
    return 0;
}
The code above is self-explanatory. In general, we store prefix somewhere so we can reuse it when computing multiple indices. Some references call this operation ravel_index because of its similarity to the structure of a woven fabric.

The next operation is unravel_index, that is: given an internal position, find the corresponding sequence of tensor indices. The name comes from the idea of separating, or disentangling, the threads of a woven or knitted fabric.
For practical purposes, we will enclose this operation inside a functor:
#include <vector>

using DIMS = std::vector<int>;

class INDEX_GENERATOR
{
public:
    INDEX_GENERATOR(const DIMS &dims)
    {
        const int DIM_SIZE = dims.size();
        this->divisors.resize(DIM_SIZE, 1);
        this->length = 1;
        // build the row-major strides from the last dimension to the first
        for (int j = DIM_SIZE - 1; j >= 0; j--)
        {
            this->divisors[j] = this->length;
            this->length *= dims[j];
        }
    }

    DIMS operator()(const int pos) const
    {
        const int DIM_SIZE = this->divisors.size();
        DIMS result(DIM_SIZE, 0);
        int residual = pos;
        for (int j = 0; j < DIM_SIZE; j++)
        {
            // divide by the stride of dimension j and keep the remainder
            const int acc = this->divisors[j];
            const int index = residual / acc;
            result[j] = index;
            residual -= index * acc;
        }
        return result;
    }

private:
    DIMS divisors;
    int length;
};
We can use this functor like below:
INDEX_GENERATOR generator({6,7,8,9});
auto indices = generator(1621); // {3, 1, 4, 1}
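A quick way to convince ourselves that the two conversions are inverses of each other is to feed the result back into the ravel step from before (the strides below are the same prefix values computed earlier, written out by hand):
// strides for {6, 7, 8, 9} in row-major order (same values as `prefix` above)
DIMS strides {7 * 8 * 9, 8 * 9, 9, 1};
int pos = std::inner_product(indices.begin(), indices.end(), strides.begin(), 0);
// pos is 1621 again: unravel followed by ravel recovers the original position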
Now we know how to represent tensors using linear containers. It is time to move forward and implement the contraction.
Finding contraction dimensions
As explained at the beginning of this story, the first thing to do is to compute the dimensions of the resulting tensor:
#include <vector>
#include <utility>
#include <unordered_set>

template <typename FP_TYPE>
class Tensor
{
    using DIMS = std::vector<int>;
    using INDEX_PAIRS = std::vector<std::pair<int, int>>;
    using STORAGE = std::vector<FP_TYPE>;

private:
    STORAGE storage;
    DIMS dimensions;

public:
    Tensor<FP_TYPE> contraction(const Tensor<FP_TYPE> &other,
                                const INDEX_PAIRS &pairs) const
    {
        const int this_size = this->dimensions.size();
        const int other_size = other.dimensions.size();
        const int pairs_size = pairs.size();
        // rank of the result: non-intersected dimensions of both tensors
        const int dim_size = this_size + other_size - 2 * pairs_size;

        DIMS intersection_dims(pairs_size, 0);
        std::unordered_set<int> this_pairs_set;
        std::unordered_set<int> other_pairs_set;
        for (int idx = 0; idx < pairs_size; ++idx)
        {
            const auto &pair = pairs[idx];
            this_pairs_set.insert(pair.first);
            other_pairs_set.insert(pair.second);
            intersection_dims[idx] = this->dimensions[pair.first];
        }

        DIMS result_dimensions;
        result_dimensions.reserve(dim_size);
        for (int i = 0; i < this_size; ++i)
        {
            if (this_pairs_set.find(i) == this_pairs_set.end())
            {
                int dim = this->dimensions[i];
                result_dimensions.push_back(dim);
            }
        }
        for (int i = 0; i < other_size; ++i)
        {
            if (other_pairs_set.find(i) == other_pairs_set.end())
            {
                int dim = other.dimensions[i];
                result_dimensions.push_back(dim);
            }
        }
        //.. continued in the next listing
    }
};
Check the complete code on GitHub.
At this point, we have two arrays: result_dimensions and intersection_dims. If we invoke the contraction as follows:
Tensor<float> X = Tensor<float>::RANDOM({2, 5, 3});
Tensor<float> M = Tensor<float>::RANDOM({2, 11, 13, 3});
Tensor<float> Y = X.contraction(M, {{0, 0}, {2, 3}});
then the contents of result_dimensions will be {5, 11, 13} and intersection_dims will be {2, 3}. This means that the result tensor Y has dimensions {5, 11, 13} and 5 × 11 × 13 = 715 coefficients.
To compute each coefficient Y(i, j, k), we must find all coefficients X(a, i, b) and M(a, j, k, b), multiply them pairwise, and sum the products.
This is what the following code does:
Tensor<FP_TYPE> contraction(const Tensor<FP_TYPE> &other,
                            const INDEX_PAIRS &pairs) const
{
    //... previous part: the result tensor, the index generators, and the
    //    stride arrays this_pre / other_pre are built here (see GitHub)

    // precaching intersection offsets
    std::vector<int> position_this_cache(intersection_length, 0);
    std::vector<int> position_other_cache(intersection_length, 0);
    for (int j = 0; j < intersection_length; ++j)
    {
        DIMS intersection = intersection_generator(j);
        int position_this = 0;
        int position_other = 0;
        for (int idx = 0; idx < pairs_size; ++idx)
        {
            position_this += intersection[idx] * this_pre[idx + dim_size];
            position_other += intersection[idx] * other_pre[idx + dim_size];
        }
        position_this_cache[j] = position_this;
        position_other_cache[j] = position_other;
    }

    // computing each coefficient
    for (int i = 0; i < length; ++i)
    {
        DIMS indices = index_generator(i);
        int pos_this_before = 0;
        int pos_other_before = 0;
        for (int idx = 0; idx < dim_size; ++idx)
        {
            pos_this_before += indices[idx] * this_pre[idx];
            pos_other_before += indices[idx] * other_pre[idx];
        }
        FP_TYPE acc = FP_TYPE(0);
        for (int j = 0; j < intersection_length; ++j)
        {
            int pos_this = pos_this_before + position_this_cache[j];
            int pos_other = pos_other_before + position_other_cache[j];
            FP_TYPE this_coeff = this->storage[pos_this];
            FP_TYPE other_coeff = other.storage[pos_other];
            acc += this_coeff * other_coeff;
        }
        result.set(indices, acc);
    }
    return result;
}
Assuming that each dimension has size N, this code has a time complexity of:
O(N^(x + m − p))
where x is the rank of tensor X, m is the rank of tensor M, and p is the number of pairs.
This time complexity tells us that contractions are very compute-intensive operations: the output has N^(x + m − 2p) coefficients, each requiring an inner product over N^p terms, so the cost grows as a power of N whose exponent increases with the ranks of the inputs.
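As an illustrative instance of this formula (the numbers are hypothetical): contracting two rank-3 tensors over a single pair (x = 3, m = 3, p = 1), with every dimension equal to N = 100, costs on the order of 100^(3 + 3 - 1) = 100^5 = 10^10 multiply-add operations.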
Check the chart below, which depicts the total operation time of 100 contractions using configurations where x + m - p is 2, 3, 4, 5, 6, and 7:

In this story, the code is written solely to make the mechanisms behind contractions easier to understand. Of course, the implementation here does not aim to be as fast as production-grade implementations. But this does not mean that our code cannot run at least decently!
There are a few details that we can implement without making the code too hard to read.
A guideline for achieving state-of-the-art performance is discussed at the end of this story.
To make things run faster, we pre-cached part of the computation of the vectors X(a, i, b) and M(a, j, k, b):
// precaching intersection offsets
std::vector<int> position_this_cache(intersection_length, 0);
std::vector<int> position_other_cache(intersection_length, 0);
for (int j = 0; j < intersection_length; ++j)
{
    DIMS intersection = intersection_generator(j);
    int position_this = 0;
    int position_other = 0;
    for (int idx = 0; idx < pairs_size; ++idx)
    {
        position_this += intersection[idx] * this_pre[idx + dim_size];
        position_other += intersection[idx] * other_pre[idx + dim_size];
    }
    position_this_cache[j] = position_this;
    position_other_cache[j] = position_other;
}
Invoking the contraction is pretty simple:
Tensor<float> Y = X.contraction(M, {{0, 0}, {2, 3}});
As shown above, we pre-cache some computations to avoid re-evaluating them in each iteration of the main loop.
You can find more implementation details in the full source on GitHub.
Optimization tips
Whenever someone asks me for performance advice, I say:
- Only optimize after the whole thing runs correctly
- Check if your algorithm complexity is as good as you expect
- Avoid recomputing things inside loops
- Avoid allocating memory inside loops
- Use the STL and canonical algorithms (std::transform(), std::for_each()) as much as possible
- Use const variables (including lambdas)
- Use unordered containers such as std::unordered_map and std::unordered_set instead of std::map and std::set
- Avoid the "one-liner" habit
- Keep the code clean so you can understand it, and improve it, more easily
- Use the right compiler flags
Those are "general-purpose" tips. They help with any project. There are other nice tips, but let's move on to more specific advice.
Making things run even faster
The code in this story is not intended to replace a proper BLAS implementation. We are using it only to explain how contractions work. However, we can think of several ways to improve the code to achieve production-level performance.
Reducing cache misses
In general, tensors are huge objects occupying large amounts of memory in the computer's RAM. Unfortunately, the CPU cannot operate on RAM directly: tensor data must be copied from RAM into the CPU caches before the program can use it.
The problem is that CPU caches are usually much smaller than the tensors, resulting in cache misses that increase latency. As a result, the overall time can be 4 times slower, or worse, if the program causes cache misses too often.
The solution to this issue is well explored in the GEMM routines of popular BLAS libraries, where the whole operation is broken into smaller blocks (packed panels) that fit in the CPU cache.
Implementing the contraction with this kind of blocking and packing improves performance but reduces code readability.
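To illustrate the idea on plain matrices (not on the tensor code above), here is a hedged sketch of a blocked multiplication; the block size of 64 is an arbitrary placeholder and should be tuned to the actual cache sizes:
#include <vector>
#include <algorithm>

// Multiply an (n x n) A by an (n x n) B into C, which is assumed to be
// zero-initialized, visiting the data in cache-friendly blocks instead of
// full rows and columns.
void blocked_matmul(const std::vector<float> &A, const std::vector<float> &B,
                    std::vector<float> &C, int n, int block = 64)
{
    for (int i0 = 0; i0 < n; i0 += block)
        for (int k0 = 0; k0 < n; k0 += block)
            for (int j0 = 0; j0 < n; j0 += block)
                // multiply one pair of blocks; their footprint fits in cache
                for (int i = i0; i < std::min(i0 + block, n); ++i)
                    for (int k = k0; k < std::min(k0 + block, n); ++k)
                    {
                        const float a = A[i * n + k];
                        for (int j = j0; j < std::min(j0 + block, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}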
Multi-core implementation
The code so far uses a single-core implementation. We can use modern C++ to turn it into a multi-core program very easily, for example, by splitting the main loop into threads:
// requires <algorithm>, <execution>, and <numeric> in addition to the previous headers
const auto compute = [&](int i)
{
    DIMS indices = index_generator(i);
    int pos_this_before = 0;
    int pos_other_before = 0;
    for (int idx = 0; idx < dim_size; ++idx)
    {
        pos_this_before += indices[idx] * this_pre[idx];
        pos_other_before += indices[idx] * other_pre[idx];
    }
    FP_TYPE acc = FP_TYPE(0);
    for (int j = 0; j < intersection_length; ++j)
    {
        int pos_this = pos_this_before + position_this_cache[j];
        int pos_other = pos_other_before + position_other_cache[j];
        FP_TYPE this_coeff = this->storage[pos_this];
        FP_TYPE other_coeff = other.storage[pos_other];
        acc += this_coeff * other_coeff;
    }
    result.set(indices, acc);
};

std::vector<int> coefficients(length);
std::iota(coefficients.begin(), coefficients.end(), 0);
std::for_each(std::execution::par, coefficients.begin(),
              coefficients.end(), compute);
The point here is running the loop in parallel using std::execution::par, enclosing the loop body in a lambda.
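One practical detail, stated as an assumption about your toolchain: parallel algorithms need a C++17 compiler, and with GCC's libstdc++ they are typically backed by Intel TBB, so the build command may look like this (file name and flags are illustrative):
g++ -O3 -std=c++17 contraction.cpp -o contraction -ltbb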
How fast does the code run after this change? This is what I get by running the same previous example using multithreaded contractions:

Using the multithreaded version, the program executes roughly 6 times faster. This happens because the processing load is shared among multiple cores.
Multicore programming is only one alternative for parallel execution. Today, the most popular way to distribute processing loads is GPU programming.
CUDA
Contraction is the type of computation that can run much faster on GPUs. However, refactoring the previous code to use CUDA is not simple, because GPU processors work quite differently from CPUs.
If we want to use CUDA, we need to rewrite the whole thing and implement a function that:
- Streams memory blocks to the GPU
- Runs the kernel and
- Gets each slice back to build the resulting tensor.
This is not hard to implement, but we need to rethink the code to work out how to divide the work into blocks that suit the GPU.
This is a good topic for a future story.
Epilogue: Why We Love Contractions
Back to the math concepts, it is clear why tensor contractions are so useful for pattern recognition tasks in computer vision or NLP systems. Essentially, each coefficient that a tensor contraction computes is an inner product, and inner products are closely related to the cosine similarity between vectors.
Remember how the coefficients are evaluated:
Y(i, j) = Σ(a) Σ(b) X(i, a, b) · M(a, b, j)
This formula has the form of a dot product between two vectors a and b:
a · b = Σ(i) a(i) · b(i)
Note that the dot product can also be expressed as:
a · b = |a| · |b| · cos(θ)
where θ is the angle between the vectors a and b.
This relation provides an interesting geometric insight into what contractions represent, since cosines are very informative for indicating the similarity between two elements of a Euclidean space.
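As a small illustration (not part of the contraction code), the same std::inner_product used earlier can compute this cosine directly:
#include <vector>
#include <numeric>
#include <cmath>
#include <cstdio>

int main()
{
    // two arbitrary example vectors
    std::vector<float> a {1.0f, 2.0f, 3.0f};
    std::vector<float> b {4.0f, 5.0f, 6.0f};

    const float dot = std::inner_product(a.begin(), a.end(), b.begin(), 0.0f);
    const float norm_a = std::sqrt(std::inner_product(a.begin(), a.end(), a.begin(), 0.0f));
    const float norm_b = std::sqrt(std::inner_product(b.begin(), b.end(), b.begin(), 0.0f));

    // cos(theta) = (a . b) / (|a| |b|)
    std::printf("cosine similarity = %f\n", dot / (norm_a * norm_b));
    return 0;
}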
Conclusions
Contractions are among the key concepts that every AI/ML engineer should know well. I hope this story helps those looking for guidance on the topic.
Source Code
The code used here can be found on GitHub.
Questions, corrections, or concerns
Can you spot something wrong? Drop a message. Have a question? Drop a message. Do you want to talk about the code or contractions? Post your message. It is welcome and free!
How to contribute
Nobody paid me to write this story. If you enjoyed this tutorial and want to contribute, find a charity institution near you and provide your assistance with money and/or time. No citation is needed. You needn't let me know either.
Have a question? Drop a message and I will reply as soon as possible.
Regards,
Luiz
Published via Towards AI