Creation date: 2022-11-03

I've just read about the Myers difference algorithm via a thread on hackernews and thought about doing a short implementation using JuliaLang. In other words I got nerd sniped.

There were many applications were I was interested in doing some kind of diff sometimes in different ways than the standard git diff which are more relevant for humans in specific settings. Like sometimes we only want to see real structural changes and not all the parts where float values in the 10th decimal increased by 1. Other times one wants to see changes like: Variable `x`

was changed to `y`

ones at the top and not on all 100s of places where this was the case. Maybe though the parts were it maybe should have changed 😂 It's a very complicated problem...

This however is about the Myers difference algorithm which is very well explained in the link above and I only want to provide a starter Julia implementation for this. We are here in the Julia basics section so let's start with something not too complicated and maybe explore some ideas at a different time. If you know my blog you know that I love to dive deep deep down into something but not this time (for now 😄 ) If you're new to the blog please check out some other posts in this series or others by checking out the navbar.

Some suggestions:

In simple words we want to compare two strings and get the shortest edit distance between the two. For this we create a graph with `N = (A+1)*(B+1)`

nodes where `A`

is the length of the first word and `B`

the length of the second word. Then we add edges for possible moves that we can make which are, delete a char from our old word, insert a char from the new word or when the char is the same to just keep it. Then we want to output the shortest possible way by reducing the number of edits we need to perform.

We start with creating the grid graph.

```
function word_diff(old_word, new_word)
A = length(old_word)
B = length(new_word)
N = (A+1)*(B+1)
g = SimpleGraph(N)
end
```

This creates a simple graph `g`

with the correct amount of nodes. Now we need to add the edges. For this we need to remember that Julia starts indexing at 1 and therefore the upper left vertex in the grid is the 1. Which represents our starting node.

We can add the following to the above function:

```
for i in 0:A
for j in 0:B
# one to the right
if i+1 <= A
add_edge!(g, j*(A+1)+i+1, j*(A+1)+i+2)
end
# one below
if j+1 <= B
add_edge!(g, j*(A+1)+i+1, (j+1)*(A+1)+i+1)
end
# diagonal
if i > 0 && j > 0 && old_word[i] == new_word[j]
add_edge!(g, (j-1)*(A+1)+(i-1)+1, j*(A+1)+i+1)
end
end
end
```

One should probably refactor the calculation of the 1d index from the 2d index. That is a task for you (the reader 😉 ) We basically always add an edge to the right if there is a vertex to the right and below if there is one. Then if the character is the same as the current position we add a diagonal edge.

Now we can use Breadth-first search to find the shortest path from the top left vertex to the bottom right one. We can use the `shortest_paths`

function for this.

`bfs_result = shortest_paths(g, 1)`

The output for this for our test example with `old_word = "driftwood"`

and `new_word = "artwork"`

this results in:

```
Graphs.Experimental.ShortestPaths.BFSResult{Int64}(
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 12, 23, 24, 25, 26, 27, 28, 29, 21, 22, 23, 24, 25, 25, 36, 37, 38, 39, 31, 32, 33, 34, 35, 36, 36, 47, 48, 49, 41, 42, 43, 44, 45, 46, 47, 47, 48, 59, 51, 52, 52, 63, 64, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2, 3, 3, 4, 5, 6, 7, 8, 9, 10, 3, 4, 4, 5, 6, 6, 7, 8, 9, 10, 4, 5, 5, 6, 7, 7, 7, 8, 9, 10, 5, 6, 6, 7, 8, 8, 8, 8, 9, 10, 6, 7, 7, 8, 9, 9, 9, 9, 10, 11, 7, 8, 8, 9, 10, 10, 10, 10, 11, 12]
)
```

Which might not be the easiest one to read. It shows the parent for each vertex first and then the distance to our start node `1`

. So the last value of the second array shows us the length of our path. Then we can start from the last value of the first array and go along the parents to get our path.

I wrote this function for it:

```
function get_path_list(bfs_parents)
c = length(bfs_parents)
l = Int[]
while c != 1
push!(l, c)
c = bfs_parents[c]
end
return reverse!(l)
end
```

In our case it is:

`[2, 12, 23, 24, 25, 36, 47, 48, 59, 60, 70, 80]`

It never starts with a `1`

due to my while loop break but you can read it as we start at the top left corner. Then go one right to the `d`

of `driftwood`

which means we delete that `d`

. Then we go down (a plus of 10 in our case as `driftwood`

has 9 chars). Then diagonally (a plus of 11) and so on.

The last step that we need to do is now to make this path readable as I tried in my last sentence 😄 but hopefully more visually pleasing. For this I installed Crayons.jl. First I wanted to have tuples which contain whether we delete, insert or keep a char and then the char itself.

```
function get_insert_delete_tuples(path, old_word, new_word)
A = length(old_word)
tpls = Vector{Tuple{Symbol, Char}}()
last = 1
for p in path
lj, li = divrem(last-1, A+1)
j, i = divrem(p-1, A+1)
# deletion
if li+1 == i && j == lj
push!(tpls, (:delete, old_word[i]))
elseif i == li && lj+1 == j
# insertion
push!(tpls, (:insert, new_word[j]))
else
push!(tpls, (:same, old_word[i]))
end
last = p
end
return tpls
end
```

Here we go along the path keep track of the last vertex we came from and convert that into two dimensional coordinates. Then we check whether we came from the left, top or diagonally.

My last function is to convert that to crayons:

```
function print_in_crayons(ins_del_tpls)
for tpl in ins_del_tpls
cray = Crayon(reset=true)
if tpl[1] == :delete
cray = Crayon(reset=true, strikethrough=true)
elseif tpl[1] == :insert
cray = Crayon(reset=true, foreground=:green)
end
print(cray, tpl[2])
end
println()
end
```

And in our `word_diff`

method I've added:

```
path_list = get_path_list(bfs_result.parents)
ins_del_tpls = get_insert_delete_tuples(path_list, old_word, new_word)
print_in_crayons(ins_del_tpls)
```

For our example the output reads like:

I totally agree that it isn't the easiest to read so I give you the task to improve the output here as well 😉

I think there are several options this could be improved:

We would like to keep the whole "two" and then remove the second "o" from "wood" afterwards instead of deleting the first "o"

Check this out for several lines.

Do we have a line by line diff and one for each individual line?

It might be interesting to test this out for larger files then and do some benchmarks.

Thanks for reading!

**Special thanks to my 11 patrons!**

Special special thanks to my >4$ patrons. The ones I thought couldn't be found 😄

Anonymous

Gurvesh Sanghera

Szymon Bęczkowski

Colin Phillips

Want to be updated? Consider subscribing and receiving a mail whenever a new post comes out.

© Ole Kröger. Last modified: January 29, 2023. Website built with Franklin.jl.