Creation date: 2021-03-21

Welcome back to a new part in our Julia basics series where I explain some basic functionality of the Julia programming language.

In this post I'll explain various ways to find slow code in your project. You might have already researched the Julia documentation and other websites regarding this. Often they do this using small code examples which can be easily analyzed and are helpful for training. Nevertheless, I feel some things only become visible when having a look at bigger packages.

Well, you're here, so I assume that small code examples didn't help you out too much and that you want to investigate some bigger projects of your own. Then you're in the right place.

Let's get started with a small code example as well and use it for most of this post. In the last part, however, I'll have a look at my own package which implements a constraint solver: ConstraintSolver. You can also find quite a few posts about that project on this blog when you check out the sidebar on the left.

I know that this mathematical optimization thing can be a bit overwhelming, but I assure you that I will not talk about that project here. I'll simply use it as a codebase to analyze the performance, and the first 90% of the post is about a very small code example.

Even so, we'll learn about tools that stay manageable for bigger code bases, whereas the Julia documentation is a bit too fine-grained for my taste and often produces a performance profile that is longer than the code itself, which feels daunting.

One last thing before we dive into it. I've recently started an email subscription list to send out the newest posts. It's often a bit hard to keep track of new posts when they don't get published on a regular schedule. So if you want to get informed about new posts roughly twice a month, consider subscribing to the list for free. Thanks a lot!

The post is split into sections to make it easier for you to jump around, as the post is a bit long. We'll start by having a look at BenchmarkTools to understand basic time measurement of functions and get an overall feeling for the speed. This includes some things to consider when timing functions in Julia, as one has to be aware of compile time.

Afterwards we will have a look at memory allocations in particular, as this is often a good place to start when trying to speed up code. For both of these sections we will use a small Julia script as an example to make sure we understand everything for a small enough function.

The final step will be to have a look at the ConstraintSolver codebase to find a particular section of the code that is slowing down the solver.

Let's assume we want to find all prime numbers up to a number.

```
function get_primes(to)
    possible_primes = collect(2:to)
    # iterate over the primes and remove multiples
    i = 1
    while i <= length(possible_primes)
        prime = possible_primes[i]
        possible_primes = filter(n -> n <= prime || n % prime != 0, possible_primes)
        i += 1
    end
    return possible_primes
end

# find all primes up to 100
possible_primes = get_primes(100)
```

Let's not analyze the code above in detail just yet and instead figure out how fast it is for different input sizes.

It's possible to just use the macro `@time` for this and run:

`@time get_primes(100)`

to see that it takes 0.00....X seconds, so one might consider that fast. The problem with using the `@time` macro is that it can include compile time if it's the first time we call that function. Additionally, it calls the function only once, so the result can depend on a lot of other variables, like how busy your CPU is because you're currently watching a YouTube video.
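You can observe the compile-time effect directly by timing two consecutive calls with the built-in `@elapsed` macro. The `sum_squares` function below is just a made-up example for illustration:

```julia
# A quick demonstration of the compile-time effect: the first call to a
# function includes JIT compilation, subsequent calls do not.
function sum_squares(v)
    s = zero(eltype(v))
    for x in v
        s += x^2
    end
    return s
end

v = collect(1:10_000)
t_first = @elapsed sum_squares(v)    # includes compilation
t_second = @elapsed sum_squares(v)   # already compiled, much faster
println("first call:  $t_first s")
println("second call: $t_second s")
```

The second call is typically orders of magnitude faster, which is exactly why a single `@time` measurement can be misleading.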

For this you want to run the function several times and get some statistics like the median and mean running time. Therefore we want to use the BenchmarkTools package.

In one of the earlier posts about packages I explained how to install packages, environments and the like. If you haven't already, you can install the package by typing `] add BenchmarkTools` into your REPL.

Perfect! Now let's do some measurements, starting with the prime numbers up to 100.

```
julia> using BenchmarkTools
julia> @benchmark get_primes(100)
BenchmarkTools.Trial:
  memory estimate:  8.80 KiB
  allocs estimate:  26
  --------------
  minimum time:     2.165 μs (0.00% GC)
  median time:      2.592 μs (0.00% GC)
  mean time:        3.064 μs (3.01% GC)
  maximum time:     85.641 μs (96.03% GC)
  --------------
  samples:          10000
  evals/sample:     9
```

I think it might be hard to get a feeling for whether this is fast or not, so let's test it for bigger numbers. We use the macro `@btime` for this to get only the minimum elapsed time.

```
for n in [1000, 10000, 100000, 300000]
    println("n: $n")
    @btime get_primes($n)
end
```

Here we use `$` to interpolate the variable inside the `@btime` macro.

```
n: 1000
  71.448 μs (169 allocations: 264.69 KiB)
n: 10000
  3.926 ms (1241 allocations: 11.95 MiB)
n: 100000
  235.344 ms (19190 allocations: 705.86 MiB)
n: 300000
  1.698 s (52000 allocations: 5.05 GiB)
```

I would say the scaling of this is quite bad. Certainly we don't want to wait more than a second to get the prime numbers up to 300,000.

As we have only a single function we are more or less done with the benchmarking section for this code. You might be interested in plotting the timing.

Actually, yeah, why not. A little graph is always nice and we can learn how to retrieve the timing information from BenchmarkTools.

Install Plots with `] add Plots`:

```
using Plots
xs = Float64[]
ys = Float64[]
for n in range(100, stop=50000, length=20)
    println("n: $n")
    t = @benchmark get_primes($n) seconds=1
    push!(xs, n)
    # convert from nanoseconds to seconds
    push!(ys, minimum(t).time / 10^9)
end
plot(xs, ys, legend=:topleft, label="Calculating prime numbers", xaxis="Primes up to", yaxis="Time in seconds")
```

You can of course experiment a bit with logarithmic scales and more data, but I think we have a rough understanding of how it scales now.

In this section we will try to understand the memory allocations of the prime code above.

You might have noticed that `@btime` and `@benchmark` also show the memory consumption.

```
n: 1000
  71.448 μs (169 allocations: 264.69 KiB)
n: 10000
  3.926 ms (1241 allocations: 11.95 MiB)
n: 100000
  235.344 ms (19190 allocations: 705.86 MiB)
n: 300000
  1.698 s (52000 allocations: 5.05 GiB)
```

It certainly shouldn't take 5 GiB to store the prime numbers from 2 to 300,000, right? We don't even need that much memory to store all the numbers from 2 to 300,000, which should be roughly our upper limit.

```
julia> sizeof(collect(2:300000))
2399992
```

Based on Wolfram Alpha, this is `2399992 bytes in MiB => 2.29 MiB`.
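Instead of asking Wolfram Alpha, we can of course do that conversion in Julia directly:

```julia
# 2:300000 holds 299_999 integers of 8 bytes each
n_elements = length(2:300_000)
bytes = n_elements * sizeof(Int64)
mib = bytes / 2^20
println("$bytes bytes ≈ $(round(mib, digits=2)) MiB")   # 2399992 bytes ≈ 2.29 MiB
```

So the final result needs only about 2.29 MiB, while the benchmark reported gigabytes of allocations along the way.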

Okay, without diving into the code just yet, as it's probably not that hard to see why we need so much memory, let's create a file `primes.jl` for the program and then start a Julia session with `julia --track-allocation=user`.

Then use:

```
include("primes.jl")
get_primes(100000)
```

and close the Julia session.

This generates a `.mem` file in the directory where we created `primes.jl`. Unfortunately, this is the output:

```
        - function get_primes(to)
   800080     possible_primes = collect(2:to)
        -     # iterate over the primes and remove multiples
        -     i = 1
        0     while i <= length(possible_primes)
        0         prime = possible_primes[i]
        0         possible_primes = filter(n->n <= prime || n % prime != 0, possible_primes)
        0         i += 1
        -     end
        0     return possible_primes
        - end
```

This is clearly not the full picture, as we saw far more memory allocations before.

It was actually a bit unexpected for me to see this, as it's normally my go-to tool for having a look at memory allocations.

Fortunately there are other ways to track this, which we will cover now.

TimerOutputs is a package that times annotated sections of your code and accumulates their running time and allocations across calls, e.g. when a section runs inside a loop.

One drawback is that you need to change the code a bit to tell TimerOutputs where to measure time and allocations.

I have changed it to:

```
using TimerOutputs
tmr = TimerOutput()
function get_primes(to)
    @timeit tmr "collect" possible_primes = collect(2:to)
    # iterate over the primes and remove multiples
    i = 1
    @timeit tmr "loop" begin
        while i <= length(possible_primes)
            prime = possible_primes[i]
            possible_primes = filter(n->n <= prime || n % prime != 0, possible_primes)
            i += 1
        end
    end
    return possible_primes
end
```

and after running `get_primes(100000)` we can get the results:

```
julia> show(tmr)
──────────────────────────────────────────────────────────────────
                          Time                   Allocations
                  ──────────────────────   ───────────────────────
Tot / % measured:      11.1s / 2.58%            783MiB / 90.1%

Section   ncalls     time   %tot      avg     alloc   %tot      avg
──────────────────────────────────────────────────────────────────
loop           1    280ms  97.9%    280ms    705MiB   100%   705MiB
collect        1   5.97ms  2.09%   5.97ms    790KiB  0.11%   790KiB
──────────────────────────────────────────────────────────────────
```

Here one can see that `collect` is not the main memory allocator in our code; the allocations happen somewhere in our loop.

> **Note:** One can disable the timer and reset it, which can be important for running these measurements several times. It's all explained in the README of TimerOutputs.
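A reset between measurement runs could look like the following sketch, using the `reset_timer!`, `disable_timer!` and `enable_timer!` functions from TimerOutputs:

```julia
using TimerOutputs

tmr = TimerOutput()
@timeit tmr "work" sum(rand(1_000))

# wipe everything measured so far, e.g. before a clean second run
reset_timer!(tmr)

# while the timer is disabled, @timeit still runs the expression
# but records nothing
disable_timer!(tmr)
@timeit tmr "work" sum(rand(1_000))
enable_timer!(tmr)
```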

We could add more `@timeit` macros inside our code, like:

```
function get_primes(to)
    @timeit tmr "collect" possible_primes = collect(2:to)
    # iterate over the primes and remove multiples
    i = 1
    @timeit tmr "loop" begin
        while i <= length(possible_primes)
            prime = possible_primes[i]
            @timeit tmr "filter" possible_primes = filter(n->n <= prime || n % prime != 0, possible_primes)
            i += 1
        end
    end
    return possible_primes
end
```

and get:

```
julia> show(tmr)
───────────────────────────────────────────────────────────────────
                           Time                   Allocations
                   ──────────────────────   ───────────────────────
 Tot / % measured:      6.62s / 4.26%           711MiB / 99.3%

 Section    ncalls     time   %tot      avg     alloc   %tot      avg
───────────────────────────────────────────────────────────────────
 loop            1    282ms   100%    282ms    706MiB   100%   706MiB
   filter    9.59k    268ms  95.1%   27.9μs    705MiB   100%  75.3KiB
 collect         1   73.9μs  0.03%   73.9μs    781KiB  0.11%   781KiB
───────────────────────────────────────────────────────────────────
```

This also shows how often `filter` is called, and we can indeed see that `filter` is responsible for the memory allocations.

> **Note:** `@timeit` itself has some running time and memory allocation overhead, but it gives us a good idea of where we need to take a deeper look.

Even though this post is about profiling, I want to point out that one can use `filter!` with the `!` to update the collection in place. Currently we create a new vector every time.
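To see the difference between `filter` and `filter!`, here is a tiny standalone example:

```julia
v = collect(1:10)

# filter allocates a new vector and leaves v untouched
evens = filter(iseven, v)
println(evens)          # [2, 4, 6, 8, 10]
println(length(v))      # 10

# filter! mutates and returns the very same array
w = filter!(iseven, v)
println(w === v)        # true
println(v)              # [2, 4, 6, 8, 10]
```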

```
using TimerOutputs
tmr = TimerOutput()
function get_primes(to)
    @timeit tmr "collect" possible_primes = collect(2:to)
    # iterate over the primes and remove multiples
    i = 1
    @timeit tmr "loop" begin
        while i <= length(possible_primes)
            prime = possible_primes[i]
            @timeit tmr "filter!" possible_primes = filter!(n->n <= prime || n % prime != 0, possible_primes)
            i += 1
        end
    end
    return possible_primes
end
```

with the TimerOutputs table:

```
julia> show(tmr)
────────────────────────────────────────────────────────────────────
                            Time                   Allocations
                    ──────────────────────   ───────────────────────
  Tot / % measured:      6.06s / 3.53%          5.31MiB / 22.5%

 Section     ncalls     time   %tot      avg     alloc   %tot      avg
────────────────────────────────────────────────────────────────────
 loop             1    214ms   100%    214ms    442KiB  36.2%   442KiB
   filter!    9.59k    201ms  94.1%   21.0μs    300KiB  24.5%    32.0B
 collect          1   73.9μs  0.03%   73.9μs    781KiB  63.8%   781KiB
────────────────────────────────────────────────────────────────────
```

As mentioned above, for small allocations and times the overhead of `@timeit` can come into play. We can easily verify with BenchmarkTools that `filter!` actually doesn't have any allocations.
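A simple way to check this without any `@timeit` overhead is the built-in `@allocated` macro; the warm-up call below makes sure compilation isn't part of the measurement:

```julia
# Measure the bytes allocated by filter! itself; the warm-up call makes
# sure compilation of filter! is not part of the measurement.
function allocs_of_filter!(v)
    filter!(iseven, copy(v))          # warm-up (the copy keeps v intact)
    return @allocated filter!(iseven, v)
end

v = collect(1:100_000)
println(allocs_of_filter!(v))   # 0 — filter! reuses the existing array
```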

Before we continue with the profiling section and the bigger codebase, you might want to experiment a bit with the code and see how much faster you can make it.

```
using Plots, BenchmarkTools
# get_primes_not_inplace: the original version using `filter`
# get_primes: the in-place version using `filter!`
xs = Float64[]
ys_slow = Float64[]
ys_faster = Float64[]
for n in range(100, stop=50000, length=20)
    println("n: $n")
    push!(xs, n)
    t_slow = @benchmark get_primes_not_inplace($n) seconds=1
    t_faster = @benchmark get_primes($n) seconds=1
    # convert from nanoseconds to seconds
    push!(ys_slow, minimum(t_slow).time / 10^9)
    push!(ys_faster, minimum(t_faster).time / 10^9)
end
plot(;title="Runtime of prime number algorithms", legend=:topleft, xaxis="Primes up to", yaxis="Time in seconds")
plot!(xs, ys_slow, label="Slow version")
plot!(xs, ys_faster, label="Faster version")
```

We got a bit faster and reduced memory consumption. In the last section we will have a short look at the flamegraph of this prime function and I'll give one other faster way of computing the primes.

We had a look at basic benchmarking where we get the running time of the function we call. How about getting this information for all functions inside the main function and recursively applying that down to the very basic functions like `==` and `+`?

Profiling is a bit like that, but it does not necessarily give information about the exact running time of these functions. I would say it's a bit more vague, but it gives a very nice overview of where your code spends the most time.

Let's start with our in-place version of the prime function.

For reference:

```
function get_primes(to)
    possible_primes = collect(2:to)
    # iterate over the primes and remove multiples
    i = 1
    while i <= length(possible_primes)
        prime = possible_primes[i]
        possible_primes = filter!(n->n <= prime || n % prime != 0, possible_primes)
        i += 1
    end
    return possible_primes
end
```

For profiling we need two more packages. One of them (Profile) is part of the standard library, so we don't need to run `] add` for it. The other is PProf, which visualizes the profiling data and needs to be installed.

We write:

```
using Profile, PProf
get_primes(100000);  # run once so compilation isn't profiled
@profile get_primes(100000)
pprof(;webport=58699)
```

You can then click on the link to "http://localhost:58699", which shows a call graph. I don't like the graph representation that much, which is the reason why I often switch to the flamegraph view. This will then show something like this (scroll down a bit):

For our current implementation this is quite simple: we have the `get_primes` function and beneath it the `filter!` function. The flamegraph consists of blocks whose width represents the time spent in that function. Blocks below other blocks are inner functions.

Two important things to note:

- The color has no meaning; it should just look like a flame.
- The x-axis can't be read from left to right as time. Only the width of the blocks matters, **not** the x position.

The `filter!` part takes more than 98% of the time, which means it is **the** part we want to optimize, unless we'd rather improve the performance of `==` and `<=`.

Let's have a look at the `filter!` line again:

`possible_primes = filter!(n->n <= prime || n % prime != 0, possible_primes)`

We can see that the `n <= prime` check is actually a bit wasteful: we still iterate over those elements and check `<=` every time, even though we know that `possible_primes` is in ascending order. That is information we have but the compiler doesn't.

You might want to have a look at how to achieve the same as `filter!` with a plain loop or something like that; I think there are a lot of different ways to solve this. I personally used a different kind of data structure: a boolean array which stores whether a number is a prime (or couldn't be ruled out yet) or whether it's a multiple.

```
function get_primes_without_filter(to)
    possible_primes = 1:to
    # initially we assume every number is prime ;)
    primes = ones(Bool, to)
    primes[1] = false
    i = 2
    while i <= div(to, 2)
        prime = possible_primes[i]
        # use a step range and broadcasting to remove the multiples
        primes[2*prime:prime:end] .= false
        i += 1
        # go to the next prime number and break early
        while i <= div(to, 2) + 1 && !primes[i]
            i += 1
        end
    end
    return findall(primes)
end
```
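Before benchmarking, a quick sanity check that the sieve really returns primes is reassuring. A naive trial-division test gives us a reference list (`isprime_naive` is just an assumed helper name for this check, not meant to be fast):

```julia
# brute-force primality check by trial division, only for verification
isprime_naive(n) = n >= 2 && all(n % d != 0 for d in 2:isqrt(n))

reference = [n for n in 2:100 if isprime_naive(n)]
println(reference)
# there are 25 primes below 100
println(length(reference))   # 25
# compare with: get_primes_without_filter(100) == reference
```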

Let's run our benchmarking again:

The result looks quite funny, as the fastest version seems to always take the same amount of time. We can look a bit deeper into the details by using a logarithmic y-axis though:

It's roughly 300 times faster for 50,000 and the factor increases with the input size.

I've used it to compute the prime numbers up to 100,000,000 in about 3 seconds, which hopefully isn't too bad, but I'm looking forward to your solutions.
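If you would rather stay closer to the `filter!` approach, one possible solution is a manual two-index loop that removes multiples in place (`remove_multiples!` is a hypothetical helper name, not from any package):

```julia
# In-place removal of the multiples of `prime` from a sorted vector,
# written as an explicit two-index loop instead of filter!.
function remove_multiples!(v::Vector{Int}, prime::Int)
    keep = 0                          # write position
    for n in v                        # read position
        if n <= prime || n % prime != 0
            keep += 1
            v[keep] = n
        end
    end
    resize!(v, keep)                  # shrink without allocating a new array
    return v
end

v = collect(2:20)
remove_multiples!(v, 2)
println(v)   # [2, 3, 5, 7, 9, 11, 13, 15, 17, 19]
```

This is essentially what `filter!` does internally, but writing it out yourself makes it easy to experiment, e.g. with starting the scan after the current prime.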

Before we conclude this post I want to show a flamegraph for my ConstraintSolver project.

This can look a bit daunting at first, but we now understand how to read it in general, so let's step back a bit and just see which blocks are wide.

> **Note:** You might want to click on the image to see a bigger version.

One thing you might spot is the big `Array` block in the middle with `_new_array` below it. It seems like we are creating a lot of new arrays in between. This is something we might also be able to catch with `--track-allocation=user`, but the flamegraph is a very quick way of getting a big-picture overview.

The solver currently has roughly 9K lines of code, so checking it line by line, which might be a good approach for the prime function from before, is not really feasible. In this post I don't want to look at how I resolved this issue, but you might want to subscribe to the mailing list to get notified about my next ConstraintSolver post.

I hope you've learned different ways to benchmark and profile your code using a variety of tools. There are some specific issues that are not easy to see with these tools, like type instability or bad memory layout, but I feel these can often be avoided by having a basic understanding of them beforehand. I might do a blog post about these as well.

Additionally, it's possible to have a look at a more lowered code level with some macros to dive deeper into functions for micro-optimizations, which might be interesting for a future blog post as well.

**List of other interesting tools and links you might want to have a look at**

Performance Tips in Julia

ProfileView.jl: similar to PProf, but shows the result in a Gtk window instead of a browser

Cthulhu: for finding type inference issues (I might use this in another post)

**Thanks to my 11 patrons!**

Special special thanks to my >4$ patrons. The ones I thought couldn't be found.

Site Wang

Gurvesh Sanghera

Szymon Bączkowski

Colin Phillips

For a donation of a **single** dollar per month you get early access to these posts. Your support will increase the time I can spend on working on this blog.

There is also a special tier if you want to get some help for your own project. You can check out my mentoring post if you're interested in that, and feel free to write me an e-mail if you have questions: o.kroeger <at> opensourc.es

I'll keep you updated on Twitter: OpenSourcES.

Want to be updated? Consider subscribing and receiving a mail whenever a new post comes out.

© Ole Kröger. Last modified: June 15, 2021. Website built with Franklin.jl.