Creation date: 2021-10-11
In the last posts in this series we talked quite a bit about the basic tools you need to write Julia code, ranging from understanding how multiple dispatch works to tools for debugging and profiling.
This time I want to use a specific example to dive deeper into performance than before. This includes cache misses, which we haven't discussed yet, as well as writing parallel code. I recently stumbled across the wonderful MIT course Performance Engineering of Software Systems.
The first lecture tackles the problem of matrix-matrix multiplication. They work with C as the programming language, but I will of course try to achieve the same performance with Julia instead.
First we write our baseline code, which performs a dense matrix multiplication, and compare its performance to the corresponding function in the standard library LinearAlgebra.
```julia
using Random
using BenchmarkTools

function my_mul!(C, A, B)
    n, m = size(A)
    # C is expected to be filled with zeros as we accumulate into it
    for i in 1:n
        for j in 1:m
            for k in 1:m
                C[i,j] += A[i,k] * B[k,j]
            end
        end
    end
end

function main(n=512)
    Random.seed!(1)
    A = rand(n, n)
    B = rand(n, n)
    C = zeros(n, n)
    return A, B, C
end

A, B, C = main();
@benchmark my_mul!($C, $A, $B)
```
The multiplication takes about 790ms on my machine.
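Before we make it faster, let's quickly check that the function is actually correct. A minimal sanity check could look like this (resetting \(C\) first, since my_mul! accumulates into it and the benchmark ran it many times):

```julia
C .= 0           # reset, as the benchmark above accumulated into C repeatedly
my_mul!(C, A, B)
C ≈ A * B        # should return true
```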
A few words about the code: we first iterate over the rows and columns of the resulting matrix \(C\). To compute each entry we fix a row of \(A\) and a column of \(B\) and perform the row-times-column multiplication.
One thing I should note is that Julia stores matrices in column-major order, in contrast to C, C++ and most other languages. That means iterating over a single column is faster than iterating over a single row.
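If you want to see this effect in isolation, here is a minimal sketch (the two helper functions are made up just for this illustration):

```julia
# inner loop walks down a column: contiguous memory accesses
function sum_cols(A)
    s = 0.0
    for j in axes(A, 2), i in axes(A, 1)
        s += A[i, j]
    end
    return s
end

# inner loop jumps from column to column: strided memory accesses
function sum_rows(A)
    s = 0.0
    for i in axes(A, 1), j in axes(A, 2)
        s += A[i, j]
    end
    return s
end

M = rand(2048, 2048)
@btime sum_cols($M)  # should be noticeably faster than sum_rows
@btime sum_rows($M)
```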
Back to our multiplication: currently we have good temporal locality for the matrix \(C\), as we load the same C[i,j] in every iteration of the innermost loop, so it simply stays in the cache. For \(B\) we iterate over the row index \(k\) in the innermost loop, which gives good spatial locality, but for \(A\) we iterate over the column index \(k\), which results in cache misses.
You might have realized that the loop order can be changed however we want: there are no dependencies between the iterations.
```julia
function my_mul!(C, A, B)
    n, m = size(A)
    for j in 1:m
        for k in 1:m
            for i in 1:n
                C[i,j] += A[i,k] * B[k,j]
            end
        end
    end
end
```
In this version we iterate over a column of \(A\) in the innermost loop by changing the row index \(i\) before changing the column index \(k\): that is good spatial locality. For \(B\) we iterate over the row index \(k\) before the column index \(j\), which is also good, but more importantly B[k,j] doesn't change at all in the innermost loop, which means we have good temporal locality. Finally we also have good spatial locality for \(C\), as we again iterate over the row index before the column index.
You're probably waiting for the benchmark results: it takes about 73ms on my machine now. That is a speedup by a factor of more than 10. We might want to increase the size of the matrix a bit for further improvements, but first let's benchmark the mul! function of LinearAlgebra.
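The call could look like this (a minimal sketch; note that mul! overwrites \(C\) in place):

```julia
using LinearAlgebra

@benchmark mul!($C, $A, $B)
```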
That takes about 3.8ms for the same task. That means there is still a lot of room for improvement for us 😉
Let's increase \(n\) to 2048 for the next experiments; the running times for this size are given in this table:
Implementation | Running time (s) | Relative speedup | Absolute speedup |
---|---|---|---|
i,j,k loop | 117.96 | 1.0 | 1.0 |
j,k,i loop | 6.41 | 18.4 | 18.4 |
To be honest I don't know why we got a speedup factor of 10 for \(n=512\) but one of 18 for \(n=2048\). I suspect that for 512 the matrices were still small enough to use the cache reasonably well in some cases, while for 2048 only the good access pattern made the cache useful. If you have more knowledge please let me know 😉 either via email o.kroeger <at> opensourc.es or via Twitter.
Before we try some more sophisticated things, let's make a very simple improvement. Currently, every time we access an element of the three arrays, Julia checks whether the index is out of bounds. We can avoid this by using the @inbounds macro. (Alternatively, bounds checking can be disabled globally by starting Julia with --check-bounds=no.)
```julia
function my_mul!(C, A, B)
    n, m = size(A)
    @inbounds for j in 1:m
        for k in 1:m
            for i in 1:n
                C[i,j] += A[i,k] * B[k,j]
            end
        end
    end
end
```
Implementation | Running time (s) | Relative speedup | Absolute speedup |
---|---|---|---|
i,j,k loop | 117.96 | 1.0 | 1.0 |
j,k,i loop | 6.41 | 18.4 | 18.4 |
@inbounds | 5.20 | 1.23 | 22.7 |
For parallelization you need to start Julia with julia -t THREADS, where THREADS should be replaced with the number of threads you want to use. In my case: julia -t 8.
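Inside the session you can verify that the threads are actually available:

```julia
Threads.nthreads()  # returns 8 in my case
```

With that, we can parallelize the outer loop: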
```julia
function my_mul!(C, A, B)
    n, m = size(A)
    # parallelize the outer loop: each thread works on its own set of columns of C
    @inbounds Threads.@threads for j in 1:m
        for k in 1:m
            for i in 1:n
                C[i,j] += A[i,k] * B[k,j]
            end
        end
    end
end
```
Implementation | Running time (s) | Relative speedup | Absolute speedup |
---|---|---|---|
i,j,k loop | 117.96 | 1.0 | 1.0 |
j,k,i loop | 6.41 | 18.4 | 18.4 |
@inbounds | 5.20 | 1.23 | 22.7 |
@threads | 3.32 | 1.57 | 35.5 |
To be honest I'm quite disappointed by this speedup for parallelizing the outer loop with 8 threads. I'm not sure whether there is too much overhead or whether the number of cache misses increases.
As you probably realized by now: I'm still learning 😄 I will definitely update this post whenever I learn more. Nevertheless, I'll continue with more optimizations for now 😉
Let's calculate how much memory is accessed when computing one column of \(C\):

- We access \(C\) 2048 times (the elements of that column)
- We access \(B\) 2048 times (one column of \(B\))
- We access every single element of \(A\), which means \(2048^2 = 4,194,304\) accesses

If we instead use a block structure we can change this memory access pattern. For this calculation we use a block size of \(45 \times 45\), as \(45^2 \approx 2048\) 😄

For one \(45 \times 45\) block of \(C\) we need:

- \(\approx 2048\) writes to \(C\)
- \(2048 \cdot 45 \approx 92,000\) reads from \(A\), as we need to access 45 full rows
- \(2048 \cdot 45 \approx 92,000\) reads from \(B\), as we need to access 45 full columns
- Total: \(\approx 186,368\) accesses, which is roughly 20 times less than before (a quick check follows below)
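We can verify this back-of-the-envelope calculation directly in the REPL (plain arithmetic, nothing specific to the implementation):

```julia
n, b = 2048, 45
per_column = n^2 + 2n         # all of A plus one column each of B and C
per_block  = b^2 + 2 * b * n  # one block of C plus 45 rows of A and 45 columns of B
per_column / per_block        # ≈ 22.5
```

With that motivation, here is a blocked version of the multiplication: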
```julia
function my_mul_block!(C, A, B)
    n, m = size(A)
    @assert n == m
    block_size = 32
    @assert n % block_size == 0
    # outer loops iterate over the blocks, inner loops within a single block
    @inbounds Threads.@threads for jh in 1:block_size:n
        for ih in 1:block_size:n
            for kh in 1:block_size:n
                for j in 0:block_size-1
                    for k in 0:block_size-1
                        for i in 0:block_size-1
                            C[ih+i, jh+j] += A[ih+i, kh+k] * B[kh+k, jh+j]
                        end
                    end
                end
            end
        end
    end
end
```
Here I've added some @assert statements at the beginning and only work with square matrices whose size is divisible by the block size. (Making my life a bit easier 😄)
Implementation | Running time (s) | Relative speedup | Absolute speedup |
---|---|---|---|
i,j,k loop | 117.96 | 1.0 | 1.0 |
j,k,i loop | 6.41 | 18.4 | 18.4 |
@inbounds | 5.20 | 1.23 | 22.7 |
@threads | 3.32 | 1.57 | 35.5 |
blocks | 3.13 | 1.06 | 37.7 |
Okay, that speedup doesn't really seem to be that impressive. Maybe another idea?
We can use the fact that the calculation can be split into eight smaller matrix multiplications and one matrix-matrix addition like this:
\[ \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix} \]

\[ = \begin{pmatrix} A_{11}B_{11} & A_{11}B_{12} \\ A_{21}B_{11} & A_{21}B_{12} \end{pmatrix} + \begin{pmatrix} A_{12}B_{21} & A_{12}B_{22} \\ A_{22}B_{21} & A_{22}B_{22} \end{pmatrix} \]

```julia
function my_mul_seq!(C, A, B)
    n, m = size(A)
    @inbounds for j in 1:m
        for k in 1:m
            for i in 1:n
                C[i,j] += A[i,k] * B[k,j]
            end
        end
    end
end

# views on the four quadrants of an n×n matrix x
UL(x, n) = @views x[1:div(n,2), 1:div(n,2)]
UR(x, n) = @views x[1:div(n,2), div(n,2)+1:end]
LL(x, n) = @views x[div(n,2)+1:end, 1:div(n,2)]
LR(x, n) = @views x[div(n,2)+1:end, div(n,2)+1:end]

function my_mul_tile!(C, A, B)
    n, m = size(A)
    @assert n == m
    thresh_size = 64
    if n <= thresh_size
        my_mul_seq!(C, A, B)
        return
    end
    @sync begin
        Threads.@spawn my_mul_tile!(UL(C,n), UL(A,n), UL(B,n))
        Threads.@spawn my_mul_tile!(UR(C,n), UL(A,n), UR(B,n))
        Threads.@spawn my_mul_tile!(LL(C,n), LL(A,n), UL(B,n))
        Threads.@spawn my_mul_tile!(LR(C,n), LL(A,n), UR(B,n))
    end
    @sync begin
        Threads.@spawn my_mul_tile!(UL(C,n), UR(A,n), LL(B,n))
        Threads.@spawn my_mul_tile!(UR(C,n), UR(A,n), LR(B,n))
        Threads.@spawn my_mul_tile!(LL(C,n), LR(A,n), LL(B,n))
        Threads.@spawn my_mul_tile!(LR(C,n), LR(A,n), LR(B,n))
    end
end
```
This involves quite some code but shouldn't be too hard to understand. Let me walk you through it: we have a base case of size \(64 \times 64\), as spawning tasks for smaller matrices would only add overhead. That is the reason for the first function:
```julia
function my_mul_seq!(C, A, B)
    n, m = size(A)
    @inbounds for j in 1:m
        for k in 1:m
            for i in 1:n
                C[i,j] += A[i,k] * B[k,j]
            end
        end
    end
end
```
which is what we had in the @inbounds section. Next I've added four one-line methods: UL(x, n) returns a view of the upper-left quadrant of the matrix \(x\), where \(x\) is of size \(n \times n\); the same exists for the upper-right (UR), lower-left (LL) and lower-right (LR) quadrants.
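These views share memory with the original matrix, so writing into them writes into the matrix itself. A tiny illustration in the REPL (the 4×4 matrix here is just for demonstration):

```julia
x = reshape(collect(1:16), 4, 4)
UL(x, 4)  # 2×2 view of rows 1:2 and columns 1:2
LR(x, 4)  # 2×2 view of rows 3:4 and columns 3:4
```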
Then we have our main method, which starts with the base case and then contains two @sync blocks. These mean that the four @spawn tasks inside each block run in parallel, but all of them need to be finished before execution continues.
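In isolation the pattern looks like this (a minimal sketch, independent of our matrices):

```julia
@sync begin
    Threads.@spawn println("task 1 on thread ", Threads.threadid())
    Threads.@spawn println("task 2 on thread ", Threads.threadid())
end
# this line only runs after both tasks have finished
println("both tasks are done")
```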
The synchronization is necessary because we don't store \(A_{11}B_{11}\), for example, in a temporary variable but accumulate it directly into \(C_{11}\). For the second matrix

\[ \begin{pmatrix} A_{12}B_{21} & A_{12}B_{22} \\ A_{22}B_{21} & A_{22}B_{22} \end{pmatrix} \]

we therefore need \(C_{11}\) to be fully computed before we can add \(A_{12}B_{21}\) to it. Because we always accumulate into the \(C\) matrix directly, we also don't need an explicit \(+\), as you might have expected from the second equation above.
Now let's see the result:
Implementation | Running time (s) | Relative speedup | Absolute speedup |
---|---|---|---|
i,j,k loop | 117.96 | 1.0 | 1.0 |
j,k,i loop | 6.41 | 18.4 | 18.4 |
@inbounds | 5.20 | 1.23 | 22.7 |
@threads | 3.32 | 1.57 | 35.5 |
blocks | 3.13 | 1.06 | 37.7 |
tile | 0.44 | 7.11 | 268.1 |
That is quite an impressive speedup, compared to the blocks version but also in general: we have nearly a 300x speedup over our first simple implementation 🎉
One last thing I wanted to try is LoopVectorization. It provides a @turbo macro, which tries to rearrange the loop order to minimize computation time and vectorizes the loop. For this I loaded the library with using LoopVectorization (after installing it, of course) and changed the my_mul_seq! code to:
```julia
function my_mul_seq!(C, A, B)
    n, m = size(A)
    @turbo for j in 1:m
        for k in 1:m
            for i in 1:n
                C[i,j] += A[i,k] * B[k,j]
            end
        end
    end
end
```
Implementation | Running time (s) | Relative speedup | Absolute speedup |
---|---|---|---|
i,j,k loop | 117.96 | 1.0 | 1.0 |
j,k,i loop | 6.41 | 18.4 | 18.4 |
@inbounds | 5.20 | 1.23 | 22.7 |
@threads | 3.32 | 1.57 | 35.5 |
blocks | 3.13 | 1.06 | 37.7 |
tile | 0.44 | 7.11 | 268.1 |
@turbo | 0.27 | 1.63 | 436.9 |
I think we are at a good point to stop now, as benchmarking the standard library version:

```julia
@btime mul!($C, $A, $B)
```

results in:

```
238.187 ms (0 allocations: 0 bytes)
```

That means we are about as fast as the standard library!
Full disclaimer: the standard library isn't limited to square matrices of size \(2^k \times 2^k\) like our implementation 😄
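One more thing worth keeping in mind when comparing: mul! calls into the underlying BLAS library, which manages its own thread pool independently of Julia's threads. To my knowledge (on recent Julia versions) you can inspect and change it like this:

```julia
using LinearAlgebra

BLAS.get_num_threads()   # number of threads the BLAS library uses
BLAS.set_num_threads(8)  # change it for a fair comparison
```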
Hope you have enjoyed this deep dive into benchmarking and improving matrix multiplication, from a naive implementation to one that is nearly as fast as the standard library.
Thanks for reading, and if you don't want to miss a post: please sign up for the email newsletter 😉
You might enjoy the next post in this series about a diffing algorithm.