# Matrix multiplication: Performance

Creation date: 2021-10-11

In the last posts in this series we talked quite a bit about the very basic tools you need to learn to write Julia code. Ranging from understanding how Multiple distpatch works to tools for debugging and profiling.

This time I want to use a specific example to dive deeper into performance than before. This includes cache misses which we haven't discussed yet and also writing parallel code. I lately stumbled across this wonderful MIT course Performance Engineering of Software Systems.

The first episode tackles the problem of matrix matrix multiplication. They work with C as the programming language but I of course try to achieve the performance with Julia instead.

## The Basics: Matrix Matrix Multiplication

First we write our baseline code which performs a dense matrix multiplication and compare the performance to the function in the standard library LinearAlgebra.

using Random
using BenchmarkTools
function my_mul!(C,A,B)
n,m = size(A)
for i in 1:n
for j in 1:m
for k in 1:m
C[i,j] += A[i,k]*B[k,j]
end
end
end
end

function main(n=512)
Random.seed!(1)
A = rand(n, n)
B = rand(n, n)
C = zeros(n, n)
return A,B,C
end

A,B,C = main();
@benchmark my_mul!($C,$A, B) The multiplication takes about 790ms on my machine. Maybe a bit to the code: We first iterate through the rows and columns of the resulting matrix $$C$$ here. Then to compute the result we always fix the row of $$A$$ and the column of $$B$$ to do the row times column multiplication. One thing I should note is that Julia stores matrices in column order in contrast to C,C++ and most other languages. That means iterating over a single column is faster than iterating over a single row. Currently we have good temporal locality for the matrix $$C$$ as we load the same C[i,j] in each most inner loop and it therefore just stays in the cache. For $$B$$ we iterate over the row index which is also fine but for $$A$$ we iterate over the column index which results in cache misses. You might have realized that the loop order can be changed however we want, right? There are no dependencies between them. function my_mul!(C,A,B) n,m = size(A) for j in 1:m for k in 1:m for i in 1:n C[i,j] += A[i,k]*B[k,j] end end end end In this example we iterate over the a column of $$A$$ in the inner loop by changing the row index $$i$$ before changing the column index $$k$$. That is good spatial locality. For $$B$$ we iterate over the row index $$k$$ before we iterate over the column index $$j$$ which is also good but here it's more important that in the inner most loop we don't change anything for B[k,j] which means that we have good temporal locality. Finally we also have good spatial locality for $$C$$ as we iterate again over the row index before the column index. You're probably waiting for the benchmark results? It takes about 73ms on my machine now. That is a speed up factor of more than 10. We might want to increase the size of the matrix a little bit for further improvements but first let's benchmark the function of LinearAlgebra. That takes about 3.8ms for the same task. That means there is still a lot of room for improvement for us 😉 Let's increase $$n$$ to 2048 for the next tasks with the current running times given in this table: ImplementationRunning time (s)Relative speedupAbsolute speedup i,j,k loop117.961.01.0 j,k,i loop6.4118.418.4 To be honest I don't know why we had a speed up of a factor of 10 for $$n=512$$ but one of 18 for $$n=2048$$. I suspect because the cache lines might have been long enough for 512 in some cases but for 2048 only the good alignment helped to use the cache. If you have more knowledge please let me know 😉 Either via email o.kroeger <at> opensourc.es or via Twitter. ## Inbounds Before we try some more sophisticated things let's make a very simple improvement. Currently every time we access an element in the three arrays it will get checked whether the index is out of bounds. We can however overcome this by using the macro. function my_mul!(C,A,B) n,m = size(A) @inbounds for j in 1:m for k in 1:m for i in 1:n C[i,j] += A[i,k]*B[k,j] end end end end ImplementationRunning time (s)Relative speedupAbsolute speedup i,j,k loop117.961.01.0 j,k,i loop6.4118.418.4 @inbounds5.201.2322.7 ## Parallel For parallelization you need to start Julia with julia -t THREADS where THREADS should be replaced with the number of threads you have/need/want to use. In my case: julia -t 8. function my_mul!(C,A,B) n,m = size(A) @inbounds Threads.@threads for j in 1:m for k in 1:m for i in 1:n C[i,j] += A[i,k]*B[k,j] end end end end ImplementationRunning time (s)Relative speedupAbsolute speedup i,j,k loop117.961.01.0 j,k,i loop6.4118.418.4 @inbounds5.201.2322.7 @threads3.321.5735.5 To be honest I'm quite disappointed by this speed up here for parallelizing the outer loop here with 8 cores. I'm not sure whether there is too much overhead here or whether it the cache misses increase. As you probably realized by now: I'm still learning 😄 Will definitely update this post whenever I obtain more knowledge. Nevertheless I'll continue with more optimizations for now 😉 Note Regarding cache misses I tried LinuxPerf.jl but it seems to calculate the misses maybe only for a single processor. ## Block structure Let's calculate how much memory is accessed by calculating one column of $$C$$. • We access $$C$$ 2048 times • We access $$B$$ 2048 times • We access each element of $$A$$ though which means $$2048^2 = 4,194,304$$ times If we instead use a block structure we can change this memory access. For this we calculate with a block size of $$45 \times 45$$ as $$45^2 \approx 2048$$ 😄 Note In the lecture they use $$n=4096$$ and are able to split it into blocks of size $$64 \times 64$$ but I didn't want to wait too long to let the slow version compute the result for such a large matrix 😄 Okay so for 45x45 we need • ~2048 writes to $$C$$ • $$2048 \cdot 45 \approx 92,000$$ reads from $$A$$ as we need to access all 45 rows • $$2048 \cdot 45 \approx 92,000$$ reads from $$B$$ as we need to access all 45 columns • Total: $$\approx 186,368$$ which is roughly 20 times less than the previous version function my_mul_block!(C,A,B) n,m = size(A) @assert n == m block_size = 32 @assert n % block_size == 0 @inbounds Threads.@threads for jh in 1:block_size:n for ih in 1:block_size:n for kh in 1:block_size:n for j in 0:block_size-1 for k in 0:block_size-1 for i in 0:block_size-1 C[ih+i,jh+j] += A[ih+i,kh+k]*B[kh+k,jh+j] end end end end end end end Here I've added some statements at the beginning and worked with square matrices only. (Making my life a bit easier 😄) ImplementationRunning time (s)Relative speedupAbsolute speedup i,j,k loop117.961.01.0 j,k,i loop6.4118.418.4 @inbounds5.201.2322.7 @threads3.321.5735.5 blocks3.131.0637.7 Okay for that the speedup doesn't really seem to be that nice. Maybe another idea? ## Tiling We can use the fact that we can split the calculation into four matrix multiplications and one matrix matrix addition like this: $\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \\ \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \\ \end{pmatrix} \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \\ \end{pmatrix}$ $= \begin{pmatrix} A_{11}B_{11} & A_{11}B_{12} \\ A_{21}B_{11} & A_{21}B_{12} \\ \end{pmatrix} + \begin{pmatrix} A_{12}B_{21} & A_{12}B_{22} \\ A_{22}B_{21} & A_{22}B_{22} \\ \end{pmatrix}$ function my_mul_seq!(C,A,B) n,m = size(A) @inbounds for j in 1:m for k in 1:m for i in 1:n C[i,j] += A[i,k]*B[k,j] end end end end UL(x, n) = @views x[1:div(n,2), 1:div(n,2)] UR(x, n) = @views x[1:div(n,2), div(n,2)+1:end] LL(x, n) = @views x[div(n,2)+1:end, 1:div(n,2)] LR(x, n) = @views x[div(n,2)+1:end, div(n,2)+1:end] function my_mul_tile!(C,A,B) n,m = size(A) @assert n == m thresh_size = 64 if n <= thresh_size my_mul_seq!(C,A,B) return end @sync begin Threads.@spawn my_mul_tile!(UL(C,n), UL(A,n), UL(B,n)) Threads.@spawn my_mul_tile!(UR(C,n), UL(A,n), UR(B,n)) Threads.@spawn my_mul_tile!(LL(C,n), LL(A,n), UL(B,n)) Threads.@spawn my_mul_tile!(LR(C,n), LL(A,n), UR(B,n)) end @sync begin Threads.@spawn my_mul_tile!(UL(C,n), UR(A,n), LL(B,n)) Threads.@spawn my_mul_tile!(UR(C,n), UR(A,n), LR(B,n)) Threads.@spawn my_mul_tile!(LL(C,n), LR(A,n), LL(B,n)) Threads.@spawn my_mul_tile!(LR(C,n), LR(A,n), LR(B,n)) end end This involves quite some code but shouldn't be too hard to understand. Let me walk you through it: We'll have a base case of size $$64 \times 64$$ in this case as spawning processors for those small matrices will result in overhead. That is the reason for the first function: function my_mul_seq!(C,A,B) n,m = size(A) @inbounds for j in 1:m for k in 1:m for i in 1:n C[i,j] += A[i,k]*B[k,j] end end end end which is what we had in the section. Next I've added four one line methods which are named as follows: UL(x,n) for the upper left part of the matrix $$x$$ where $$x$$ is of size $$n \times n$$. Then we have the same for upper right (UR), lower left (LL) and lower right (LR). ⚠ Important note I assume that we can always split the matrix into four matrices of the same size or in other words that we have a starting matrix of size $$2^k \times 2^k$$. Then we have our main method which has the base case at the beginning and then two @sync blocks. Those mean that we can do the four @spawn tasks inside in parallel but before we continue all of them needs to be finished. This is the case as we don't save $$A_{11}B_{11}$$ as an example in a temporary variable but instead in $$C_{11}$$. For the second matrix $\begin{pmatrix} A_{12}B_{21} & A_{12}B_{22} \\ A_{22}B_{21} & A_{22}B_{22} \\ \end{pmatrix}$ we then need to have $$C_{11}$$ computed before we can add $$A_{12}B_{21}$$ to it. Due to the fact that we always manipulate the $$C$$ matrix directly we also don't need to use an explicit $$+$$ as you might have thought from equation (2). Now let's see the result: ImplementationRunning time (s)Relative speedupAbsolute speedup i,j,k loop117.961.01.0 j,k,i loop6.4118.418.4 @inbounds5.201.2322.7 @threads3.321.5735.5 blocks3.131.0637.7 tile0.447.11268.1 That is quite an impressive speed up compared to the blocks but also in general. We have nearly a 300x speedup to our first simple implementation 🎉 ## @Turbo One last thing I wanted to try is LoopVectorization it provides a @turbo macro which itself tries to rearrange the loop ordering to minimize computation time and vectorizes the loop. For this I loaded the library with using LoopVectorization after installing it of course and changed the my_mul_seq! code to: function my_mul_seq!(C,A,B) n,m = size(A) @turbo for j in 1:m for k in 1:m for i in 1:n C[i,j] += A[i,k]*B[k,j] end end end end ImplementationRunning time (s)Relative speedupAbsolute speedup i,j,k loop117.961.01.0 j,k,i loop6.4118.418.4 @inbounds5.201.2322.7 @threads3.321.5735.5 blocks3.131.0637.7 tile0.447.11268.1 @turbo0.271.63436.9 I think we are at a point to stop now as testing the standard library version: @btime mul!(C,$A,$B)

results in:

238.187 ms (0 allocations: 0 bytes)

That means we are about as fast as the standard library!

Full disclaimer: The standard library doesn't just work for square matrices with size $$2^k \times 2^k$$ 😄

Hope you have enjoyed this deep dive into benchmarking and improving matrix multiplication from a naive implementation to one which is nearly as fast as the standard library.

You might enjoy the next post in this series about a diffing algorithm

Thanks to my 12 patrons!

Special special thanks to my >4$patrons. The ones I thought couldn't be found 😄 • Anonymous • Kangpyo • Gurvesh Sanghera • Szymon Bęczkowski • Håkan Kjellerstrand • Colin Phillips • Jérémie Knuesel For a donation of a single dollar per month you get early access to these posts. Your support will increase the time I can spend on working on this blog. There is also a special tier if you want to get some help for your own project. You can checkout my mentoring post if you're interested in that and feel free to write me an E-mail if you have questions: o.kroeger <at> opensourc.es I'll keep you updated on Twitter opensourcesblog. mul!(Y, A, B) -> Y Calculates the matrix-matrix or matrix-vector product$AB\$ and stores the result in Y, overwriting the existing value of Y. Note that Y must not be aliased with either A or B.

# Examples

julia> A=[1.0 2.0; 3.0 4.0]; B=[1.0 1.0; 1.0 1.0]; Y = similar(B); mul!(Y, A, B);

julia> Y
2×2 Matrix{Float64}:
3.0  3.0
7.0  7.0

# Implementation

For custom matrix and vector types, it is recommended to implement 5-argument mul! rather than implementing 3-argument mul! directly if possible.

@inbounds(blk)

Eliminates array bounds checking within expressions.

In the example below the in-range check for referencing element i of array A is skipped to improve performance.

function sum(A::AbstractArray)
r = zero(eltype(A))
for i in eachindex(A)
@inbounds r += A[i]
end
return r
end
Warning

Using @inbounds may return incorrect results/crashes/corruption for out-of-bounds indices. The user is responsible for checking it manually. Only use @inbounds when it is certain from the information locally available that all accesses are in bounds. In particular, using 1:length(A) instead of eachindex(A) in a function like the one above is not safely inbounds because the first index of A may not be 1 for all user defined types that subtype AbstractArray.

@assert cond [text]

Throw an AssertionError if cond is false. Preferred syntax for writing assertions. Message text is optionally displayed upon assertion failure.

Warning

An assert might be disabled at various optimization levels. Assert should therefore only be used as a debugging tool and not used for authentication verification (e.g., verifying passwords), nor should side effects needed for the function to work correctly be used inside of asserts.

# Examples

julia> @assert iseven(3) "3 is an odd number!"
ERROR: AssertionError: 3 is an odd number!

julia> @assert isodd(3) "What even are numbers?"

Want to be updated? Consider subscribing and receiving a mail whenever a new post comes out.