Creation date: 2021-10-11
In the last posts in this series we talked quite a bit about the basic tools you need to write Julia code, ranging from understanding how multiple dispatch works to tools for debugging and profiling.
This time I want to use a specific example to dive deeper into performance than before. This includes cache misses, which we haven't discussed yet, as well as writing parallel code. I recently stumbled across the wonderful MIT course Performance Engineering of Software Systems.
The first lecture tackles the problem of matrix-matrix multiplication. They work with C as the programming language, but I will of course try to achieve the same performance with Julia instead.
First we write our baseline code, which performs a dense matrix multiplication, and compare its performance to the corresponding function in the standard library LinearAlgebra.
```julia
using Random
using BenchmarkTools

function my_mul!(C, A, B)
    n, m = size(A)
    # C is expected to be filled with zeros as we accumulate into it
    for i in 1:n
        for j in 1:m
            for k in 1:m
                C[i,j] += A[i,k] * B[k,j]
            end
        end
    end
end

function main(n=512)
    Random.seed!(1)
    A = rand(n, n)
    B = rand(n, n)
    C = zeros(n, n)
    return A, B, C
end

A, B, C = main();
@benchmark my_mul!($C, $A, $B)
```
The multiplication takes about 790ms on my machine.
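Before we make it faster, let's quickly check that the function is actually correct. A minimal sanity check could look like this (resetting \(C\) first, since my_mul! accumulates into it and the benchmark ran it many times):

```julia
C .= 0           # reset, as the benchmark above accumulated into C repeatedly
my_mul!(C, A, B)
C ≈ A * B        # should return true
```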
A few words about the code: we first iterate over the rows and columns of the resulting matrix \(C\). To compute each entry we fix a row of \(A\) and a column of \(B\) and perform the row-times-column multiplication.
One thing I should note is that Julia stores matrices in column-major order, in contrast to C, C++ and most other languages. That means iterating over a single column is faster than iterating over a single row.
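If you want to see this effect in isolation, here is a minimal sketch (the two helper functions are made up just for this illustration):

```julia
# inner loop walks down a column: contiguous memory accesses
function sum_cols(A)
    s = 0.0
    for j in axes(A, 2), i in axes(A, 1)
        s += A[i, j]
    end
    return s
end

# inner loop jumps from column to column: strided memory accesses
function sum_rows(A)
    s = 0.0
    for i in axes(A, 1), j in axes(A, 2)
        s += A[i, j]
    end
    return s
end

M = rand(2048, 2048)
@btime sum_cols($M)  # should be noticeably faster than sum_rows
@btime sum_rows($M)
```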
Back to our multiplication: currently we have good temporal locality for the matrix \(C\), as we load the same C[i,j] in every iteration of the innermost loop, so it simply stays in the cache. For \(B\) we iterate over the row index \(k\) in the innermost loop, which gives good spatial locality, but for \(A\) we iterate over the column index \(k\), which results in cache misses.
You might have realized that the loop order can be changed however we want: there are no dependencies between the iterations.
```julia
function my_mul!(C, A, B)
    n, m = size(A)
    for j in 1:m
        for k in 1:m
            for i in 1:n
                C[i,j] += A[i,k] * B[k,j]
            end
        end
    end
end
```
In this version we iterate over a column of \(A\) in the innermost loop by changing the row index \(i\) before changing the column index \(k\): that is good spatial locality. For \(B\) we iterate over the row index \(k\) before the column index \(j\), which is also good, but more importantly B[k,j] doesn't change at all in the innermost loop, which means we have good temporal locality. Finally we also have good spatial locality for \(C\), as we again iterate over the row index before the column index.
You're probably waiting for the benchmark results: it takes about 73ms on my machine now. That is a speedup by a factor of more than 10. We might want to increase the size of the matrix a bit for further improvements, but first let's benchmark the mul! function of LinearAlgebra.
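The call could look like this (a minimal sketch; note that mul! overwrites \(C\) in place):

```julia
using LinearAlgebra

@benchmark mul!($C, $A, $B)
```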
That takes about 3.8ms for the same task. That means there is still a lot of room for improvement for us 😉
Let's increase \(n\) to 2048 for the next experiments; the running times for this size are given in this table:
Implementation | Running time (s) | Relative speedup | Absolute speedup |
---|---|---|---|
i,j,k loop | 117.96 | 1.0 | 1.0 |
j,k,i loop | 6.41 | 18.4 | 18.4 |
To be honest I don't know why we got a speedup factor of 10 for \(n=512\) but one of 18 for \(n=2048\). I suspect that for 512 the matrices were still small enough to use the cache reasonably well in some cases, while for 2048 only the good access pattern made the cache useful. If you have more knowledge please let me know 😉 either via email o.kroeger <at> opensourc.es or via Twitter.
Before we try some more sophisticated things, let's make a very simple improvement. Currently, every time we access an element of the three arrays, Julia checks whether the index is out of bounds. We can avoid this by using the @inbounds macro. (Alternatively, bounds checking can be disabled globally by starting Julia with --check-bounds=no.)
```julia
function my_mul!(C, A, B)
    n, m = size(A)
    @inbounds for j in 1:m
        for k in 1:m
            for i in 1:n
                C[i,j] += A[i,k] * B[k,j]
            end
        end
    end
end
```
Implementation | Running time (s) | Relative speedup | Absolute speedup |
---|---|---|---|
i,j,k loop | 117.96 | 1.0 | 1.0 |
j,k,i loop | 6.41 | 18.4 | 18.4 |
@inbounds | 5.20 | 1.23 | 22.7 |
For parallelization you need to start Julia with julia -t THREADS, where THREADS should be replaced with the number of threads you want to use. In my case: julia -t 8.
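Inside the session you can verify that the threads are actually available:

```julia
Threads.nthreads()  # returns 8 in my case
```

With that, we can parallelize the outer loop: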
```julia
function my_mul!(C, A, B)
    n, m = size(A)
    # parallelize the outer loop: each thread works on its own set of columns of C
    @inbounds Threads.@threads for j in 1:m
        for k in 1:m
            for i in 1:n
                C[i,j] += A[i,k] * B[k,j]
            end
        end
    end
end
```
Implementation | Running time (s) | Relative speedup | Absolute speedup |
---|---|---|---|
i,j,k loop | 117.96 | 1.0 | 1.0 |
j,k,i loop | 6.41 | 18.4 | 18.4 |
@inbounds | 5.20 | 1.23 | 22.7 |
@threads | 3.32 | 1.57 | 35.5 |
To be honest I'm quite disappointed by this speedup for parallelizing the outer loop with 8 threads. I'm not sure whether there is too much overhead or whether the number of cache misses increases.
As you probably realized by now: I'm still learning 😄 I will definitely update this post whenever I learn more. Nevertheless, I'll continue with more optimizations for now 😉
Let's calculate how much memory is accessed when computing one column of \(C\):

- We access \(C\) 2048 times (the elements of that column)
- We access \(B\) 2048 times (one column of \(B\))
- We access every single element of \(A\), which means \(2048^2 = 4,194,304\) accesses

If we instead use a block structure we can change this memory access pattern. For this calculation we use a block size of \(45 \times 45\), as \(45^2 \approx 2048\) 😄

For one \(45 \times 45\) block of \(C\) we need:

- \(\approx 2048\) writes to \(C\)
- \(2048 \cdot 45 \approx 92,000\) reads from \(A\), as we need to access 45 full rows
- \(2048 \cdot 45 \approx 92,000\) reads from \(B\), as we need to access 45 full columns
- Total: \(\approx 186,368\) accesses, which is roughly 20 times less than before (a quick check follows below)
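We can verify this back-of-the-envelope calculation directly in the REPL (plain arithmetic, nothing specific to the implementation):

```julia
n, b = 2048, 45
per_column = n^2 + 2n         # all of A plus one column each of B and C
per_block  = b^2 + 2 * b * n  # one block of C plus 45 rows of A and 45 columns of B
per_column / per_block        # ≈ 22.5
```

With that motivation, here is a blocked version of the multiplication: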
```julia
function my_mul_block!(C, A, B)
    n, m = size(A)
    @assert n == m
    block_size = 32
    @assert n % block_size == 0
    # outer loops iterate over the blocks, inner loops within a single block
    @inbounds Threads.@threads for jh in 1:block_size:n
        for ih in 1:block_size:n
            for kh in 1:block_size:n
                for j in 0:block_size-1
                    for k in 0:block_size-1
                        for i in 0:block_size-1
                            C[ih+i, jh+j] += A[ih+i, kh+k] * B[kh+k, jh+j]
                        end
                    end
                end
            end
        end
    end
end
```
Here I've added some @assert statements at the beginning and only work with square matrices whose size is divisible by the block size. (Making my life a bit easier 😄)
Implementation | Running time (s) | Relative speedup | Absolute speedup |
---|---|---|---|
i,j,k loop | 117.96 | 1.0 | 1.0 |
j,k,i loop | 6.41 | 18.4 | 18.4 |
@inbounds | 5.20 | 1.23 | 22.7 |
@threads | 3.32 | 1.57 | 35.5 |
blocks | 3.13 | 1.06 | 37.7 |
Okay, that speedup doesn't really seem to be that impressive. Maybe another idea?
We can use the fact that the calculation can be split into eight smaller matrix multiplications and one matrix-matrix addition like this:
\[ \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix} \]

\[ = \begin{pmatrix} A_{11}B_{11} & A_{11}B_{12} \\ A_{21}B_{11} & A_{21}B_{12} \end{pmatrix} + \begin{pmatrix} A_{12}B_{21} & A_{12}B_{22} \\ A_{22}B_{21} & A_{22}B_{22} \end{pmatrix} \]

```julia
function my_mul_seq!(C, A, B)
    n, m = size(A)
    @inbounds for j in 1:m
        for k in 1:m
            for i in 1:n
                C[i,j] += A[i,k] * B[k,j]
            end
        end
    end
end

# views on the four quadrants of an n×n matrix x
UL(x, n) = @views x[1:div(n,2), 1:div(n,2)]
UR(x, n) = @views x[1:div(n,2), div(n,2)+1:end]
LL(x, n) = @views x[div(n,2)+1:end, 1:div(n,2)]
LR(x, n) = @views x[div(n,2)+1:end, div(n,2)+1:end]

function my_mul_tile!(C, A, B)
    n, m = size(A)
    @assert n == m
    thresh_size = 64
    if n <= thresh_size
        my_mul_seq!(C, A, B)
        return
    end
    @sync begin
        Threads.@spawn my_mul_tile!(UL(C,n), UL(A,n), UL(B,n))
        Threads.@spawn my_mul_tile!(UR(C,n), UL(A,n), UR(B,n))
        Threads.@spawn my_mul_tile!(LL(C,n), LL(A,n), UL(B,n))
        Threads.@spawn my_mul_tile!(LR(C,n), LL(A,n), UR(B,n))
    end
    @sync begin
        Threads.@spawn my_mul_tile!(UL(C,n), UR(A,n), LL(B,n))
        Threads.@spawn my_mul_tile!(UR(C,n), UR(A,n), LR(B,n))
        Threads.@spawn my_mul_tile!(LL(C,n), LR(A,n), LL(B,n))
        Threads.@spawn my_mul_tile!(LR(C,n), LR(A,n), LR(B,n))
    end
end
```
This involves quite some code but shouldn't be too hard to understand. Let me walk you through it: we have a base case of size \(64 \times 64\), as spawning tasks for smaller matrices would only add overhead. That is the reason for the first function:
```julia
function my_mul_seq!(C, A, B)
    n, m = size(A)
    @inbounds for j in 1:m
        for k in 1:m
            for i in 1:n
                C[i,j] += A[i,k] * B[k,j]
            end
        end
    end
end
```
which is what we had in the @inbounds section. Next I've added four one-line methods: UL(x, n) returns a view of the upper-left quadrant of the matrix \(x\), where \(x\) is of size \(n \times n\); the same exists for the upper-right (UR), lower-left (LL) and lower-right (LR) quadrants.
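These views share memory with the original matrix, so writing into them writes into the matrix itself. A tiny illustration in the REPL (the 4×4 matrix here is just for demonstration):

```julia
x = reshape(collect(1:16), 4, 4)
UL(x, 4)  # 2×2 view of rows 1:2 and columns 1:2
LR(x, 4)  # 2×2 view of rows 3:4 and columns 3:4
```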
Then we have our main method, which starts with the base case and then contains two @sync blocks. These mean that the four @spawn tasks inside each block run in parallel, but all of them need to be finished before execution continues.
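In isolation the pattern looks like this (a minimal sketch, independent of our matrices):

```julia
@sync begin
    Threads.@spawn println("task 1 on thread ", Threads.threadid())
    Threads.@spawn println("task 2 on thread ", Threads.threadid())
end
# this line only runs after both tasks have finished
println("both tasks are done")
```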
The synchronization is necessary because we don't store \(A_{11}B_{11}\), for example, in a temporary variable but accumulate it directly into \(C_{11}\). For the second matrix

\[ \begin{pmatrix} A_{12}B_{21} & A_{12}B_{22} \\ A_{22}B_{21} & A_{22}B_{22} \end{pmatrix} \]

we therefore need \(C_{11}\) to be fully computed before we can add \(A_{12}B_{21}\) to it. Because we always accumulate into the \(C\) matrix directly, we also don't need an explicit \(+\), as you might have expected from the second equation above.
Now let's see the result:
Implementation | Running time (s) | Relative speedup | Absolute speedup |
---|---|---|---|
i,j,k loop | 117.96 | 1.0 | 1.0 |
j,k,i loop | 6.41 | 18.4 | 18.4 |
@inbounds | 5.20 | 1.23 | 22.7 |
@threads | 3.32 | 1.57 | 35.5 |
blocks | 3.13 | 1.06 | 37.7 |
tile | 0.44 | 7.11 | 268.1 |
That is quite an impressive speedup, compared to the blocks version but also in general: we have nearly a 300x speedup over our first simple implementation 🎉
One last thing I wanted to try is LoopVectorization. It provides a @turbo macro, which tries to rearrange the loop order to minimize computation time and vectorizes the loop. For this I loaded the library with using LoopVectorization (after installing it, of course) and changed the my_mul_seq! code to:
```julia
function my_mul_seq!(C, A, B)
    n, m = size(A)
    @turbo for j in 1:m
        for k in 1:m
            for i in 1:n
                C[i,j] += A[i,k] * B[k,j]
            end
        end
    end
end
```
Implementation | Running time (s) | Relative speedup | Absolute speedup |
---|---|---|---|
i,j,k loop | 117.96 | 1.0 | 1.0 |
j,k,i loop | 6.41 | 18.4 | 18.4 |
@inbounds | 5.20 | 1.23 | 22.7 |
@threads | 3.32 | 1.57 | 35.5 |
blocks | 3.13 | 1.06 | 37.7 |
tile | 0.44 | 7.11 | 268.1 |
@turbo | 0.27 | 1.63 | 436.9 |
I think we are at a good point to stop now, as benchmarking the standard library version:

```julia
@btime mul!($C, $A, $B)
```

results in:

```
238.187 ms (0 allocations: 0 bytes)
```

That means we are about as fast as the standard library!
Full disclaimer: the standard library isn't limited to square matrices of size \(2^k \times 2^k\) like our implementation 😄
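One more thing worth keeping in mind when comparing: mul! calls into the underlying BLAS library, which manages its own thread pool independently of Julia's threads. To my knowledge (on recent Julia versions) you can inspect and change it like this:

```julia
using LinearAlgebra

BLAS.get_num_threads()   # number of threads the BLAS library uses
BLAS.set_num_threads(8)  # change it for a fair comparison
```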
Hope you have enjoyed this deep dive into benchmarking and improving matrix multiplication, from a naive implementation to one that is nearly as fast as the standard library.
Thanks for reading, and if you don't want to miss a post: please sign up for the email newsletter 😉
You might enjoy the next post in this series about a diffing algorithm.