Graphs.jl: The Myers difference algorithm


Creation date: 2022-11-03

Tags: julia, graphs, optimization

I've just read about the Myers difference algorithm via a thread on hackernews and thought about doing a short implementation using JuliaLang. In other words I got nerd sniped.

There were many applications were I was interested in doing some kind of diff sometimes in different ways than the standard git diff which are more relevant for humans in specific settings. Like sometimes we only want to see real structural changes and not all the parts where float values in the 10th decimal increased by 1. Other times one wants to see changes like: Variable x was changed to y ones at the top and not on all 100s of places where this was the case. Maybe though the parts were it maybe should have changed 😂 It's a very complicated problem...

This however is about the Myers difference algorithm which is very well explained in the link above and I only want to provide a starter Julia implementation for this. We are here in the Julia basics section so let's start with something not too complicated and maybe explore some ideas at a different time. If you know my blog you know that I love to dive deep deep down into something but not this time (for now 😄 ) If you're new to the blog please check out some other posts in this series or others by checking out the navbar.

Some suggestions:

The problem description

In simple words we want to compare two strings and get the shortest edit distance between the two. For this we create a graph with N = (A+1)*(B+1) nodes where A is the length of the first word and B the length of the second word. Then we add edges for possible moves that we can make which are, delete a char from our old word, insert a char from the new word or when the char is the same to just keep it. Then we want to output the shortest possible way by reducing the number of edits we need to perform.

Creating the graph

We start with creating the grid graph.

function word_diff(old_word, new_word)
    A = length(old_word)
    B = length(new_word)
    N = (A+1)*(B+1)
    g = SimpleGraph(N)
end

This creates a simple graph g with the correct amount of nodes. Now we need to add the edges. For this we need to remember that Julia starts indexing at 1 and therefore the upper left vertex in the grid is the 1. Which represents our starting node.

We can add the following to the above function:

for i in 0:A
    for j in 0:B
        # one to the right
        if i+1 <= A
            add_edge!(g, j*(A+1)+i+1, j*(A+1)+i+2) 
        end
        # one below
        if j+1 <= B
            add_edge!(g, j*(A+1)+i+1, (j+1)*(A+1)+i+1) 
        end
        # diagonal
        if i > 0 && j > 0 && old_word[i] == new_word[j]
            add_edge!(g, (j-1)*(A+1)+(i-1)+1, j*(A+1)+i+1) 
        end
    end
end

One should probably refactor the calculation of the 1d index from the 2d index. That is a task for you (the reader 😉 ) We basically always add an edge to the right if there is a vertex to the right and below if there is one. Then if the character is the same as the current position we add a diagonal edge.

Calculating the solution

Now we can use Breadth-first search to find the shortest path from the top left vertex to the bottom right one. We can use the shortest_paths function for this.

bfs_result = shortest_paths(g, 1)

The output for this for our test example with old_word = "driftwood" and new_word = "artwork" this results in:

Graphs.Experimental.ShortestPaths.BFSResult{Int64}(
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 12, 23, 24, 25, 26, 27, 28, 29, 21, 22, 23, 24, 25, 25, 36, 37, 38, 39, 31, 32, 33, 34, 35, 36, 36, 47, 48, 49, 41, 42, 43, 44, 45, 46, 47, 47, 48, 59, 51, 52, 52, 63, 64, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70], 
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2, 3, 3, 4, 5, 6, 7, 8, 9, 10, 3, 4, 4, 5, 6, 6, 7, 8, 9, 10, 4, 5, 5, 6, 7, 7, 7, 8, 9, 10, 5, 6, 6, 7, 8, 8, 8, 8, 9, 10, 6, 7, 7, 8, 9, 9, 9, 9, 10, 11, 7, 8, 8, 9, 10, 10, 10, 10, 11, 12]
)

Which might not be the easiest one to read. It shows the parent for each vertex first and then the distance to our start node 1. So the last value of the second array shows us the length of our path. Then we can start from the last value of the first array and go along the parents to get our path.

I wrote this function for it:

function get_path_list(bfs_parents)
    c = length(bfs_parents)
    l = Int[]
    while c != 1
        push!(l, c)
        c = bfs_parents[c]
    end
    return reverse!(l)
end

In our case it is:

[2, 12, 23, 24, 25, 36, 47, 48, 59, 60, 70, 80]

It never starts with a 1 due to my while loop break but you can read it as we start at the top left corner. Then go one right to the d of driftwood which means we delete that d. Then we go down (a plus of 10 in our case as driftwood has 9 chars). Then diagonally (a plus of 11) and so on.

Show a visual diff

The last step that we need to do is now to make this path readable as I tried in my last sentence 😄 but hopefully more visually pleasing. For this I installed Crayons.jl. First I wanted to have tuples which contain whether we delete, insert or keep a char and then the char itself.

function get_insert_delete_tuples(path, old_word, new_word)
    A = length(old_word)
    tpls = Vector{Tuple{Symbol, Char}}()
    last = 1
    for p in path 
        lj, li = divrem(last-1, A+1)
        j, i = divrem(p-1, A+1)
        # deletion
        if li+1 == i && j == lj 
            push!(tpls, (:delete, old_word[i]))
        elseif  i == li && lj+1 == j
            # insertion 
            push!(tpls, (:insert, new_word[j]))
        else 
            push!(tpls, (:same, old_word[i]))
        end
        
        last = p
    end
    return tpls
end

Here we go along the path keep track of the last vertex we came from and convert that into two dimensional coordinates. Then we check whether we came from the left, top or diagonally.

My last function is to convert that to crayons:

function print_in_crayons(ins_del_tpls)
    for tpl in ins_del_tpls
        cray = Crayon(reset=true)
        if tpl[1] == :delete
            cray = Crayon(reset=true, strikethrough=true)
        elseif tpl[1] == :insert
            cray = Crayon(reset=true, foreground=:green)
        end
        print(cray, tpl[2])
    end
    println()
end

And in our word_diff method I've added:

path_list = get_path_list(bfs_result.parents)
ins_del_tpls = get_insert_delete_tuples(path_list, old_word, new_word)
print_in_crayons(ins_del_tpls)

For our example the output reads like:

dariftwoodrk

I totally agree that it isn't the easiest to read so I give you the task to improve the output here as well 😉

Future possibilities

I think there are several options this could be improved:

It might be interesting to test this out for larger files then and do some benchmarks.

Thanks for reading!

Special thanks to my 11 patrons!

Special special thanks to my >4$ patrons. The ones I thought couldn't be found 😄



Want to be updated? Consider subscribing and receiving a mail whenever a new post comes out.

Powered by Buttondown.


Subscribe to RSS