Dear difflib, What The F@&#! ?
TL;DR:
According to difflib, the strings 'baaaaa', 'aaabaa' and 'aaaaba' all share 5 characters with 'aaaaaa', but 'aabaaa' shares only 3... and this only happens when the change is in the first half of the string.
import difflib
print sum(x.size for x in difflib.SequenceMatcher(None, '-aaaaa', 'aaaaaa').get_matching_blocks())
print sum(x.size for x in difflib.SequenceMatcher(None, 'a-aaaa', 'aaaaaa').get_matching_blocks())
print sum(x.size for x in difflib.SequenceMatcher(None, 'aa-aaa', 'aaaaaa').get_matching_blocks())
print sum(x.size for x in difflib.SequenceMatcher(None, 'aaa-aa', 'aaaaaa').get_matching_blocks())
print '- Control -'
print sum(x.size for x in difflib.SequenceMatcher(None, 'aaaaaa', 'aaaaaa').get_matching_blocks())
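For the curious, printing the actual blocks for the 'aa-aaa' case shows what happens: the leading 'aa' is thrown away entirely (expected output in the comments, if I traced the algorithm correctly):
print difflib.SequenceMatcher(None, 'aa-aaa', 'aaaaaa').get_matching_blocks()
# -> [Match(a=3, b=0, size=3), Match(a=6, b=6, size=0)]
# only a[3:6] is matched, against b[0:3]; the leading 'aa' is never matched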
It only gets weirder and more fractal if you read the rest.
Context
A few weeks back I was at EuroSciPy, where I had to dive a little deeper than usual into difflib. Indeed, one often-requested feature in IPython is the ability to diff notebooks, so I started looking at how this could be done. Thanks to @MinRK for helping me figure out the rest of this post and giving me some nice ideas for graphs.
Naturally, I turned to difflib:
This module provides classes and functions for comparing sequences. It can be used for example, for comparing files, and can produce difference information in various formats, including HTML and context and unified diffs. For comparing directories and files, see also, the filecmp module.
More specifically, I am interested in SequenceMatcher, which aims to be:
[...] a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable.
It is also explicitly stated that the generated diff might not be minimal, but might look more "human readable" than a classical diff (emphasis mine).
The basic algorithm [..] is a little fancier than,[...] “gestalt pattern matching.” The idea is to find the longest contiguous matching subsequence that contains no “junk” elements [...]. The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but does tend to yield matches that “look right” to people.
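To get a feel for this, here is a rough sketch of that recursion (my own illustration built on find_longest_match, not difflib's actual implementation):
from difflib import SequenceMatcher

def blocks(sm, alo, ahi, blo, bhi):
    """ Collect matching blocks: longest block first, then the same
    idea applied to what is left on each side. """
    m = sm.find_longest_match(alo, ahi, blo, bhi)
    if m.size == 0:
        return []
    return (blocks(sm, alo, m.a, blo, m.b)
            + [m]
            + blocks(sm, m.a + m.size, ahi, m.b + m.size, bhi))

sm = SequenceMatcher(None, 'abxcd', 'abycd')
print blocks(sm, 0, 5, 0, 5)
# -> [Match(a=0, b=0, size=2), Match(a=3, b=3, size=2)]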
To do that, we need to define "junk", which is anything you don't want the algorithm to match on. Let's look at a common example in Python: when you add a decorated function, the decorator is often attached to the "next" function by the diff.
s1 = """
@decorate
def fun1():
return 1
@decorate
def fun3():
return 3
"""
s2 = """
@decorate
def fun1():
return 1
@decorate
def fun2():
return 2
@decorate
def fun3():
return 3
"""
Classical diff
import difflib
print ''.join(difflib.ndiff(s1.splitlines(1), s2.splitlines(1)))
Now we will tell it that blank lines are junk:
blankline = lambda x: x.strip() == ''
print ''.join(difflib.ndiff(s1.splitlines(1), s2.splitlines(1), linejunk=blankline))
This is clearly better, as hunks do not have blank lines in the middle, but rather on their sides.
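Note that difflib also ships a ready-made helper, difflib.IS_LINE_JUNK, which (if I read it correctly) treats lines that are blank or contain only a '#' as junk, so you can also write:
print ''.join(difflib.ndiff(s1.splitlines(1), s2.splitlines(1), linejunk=difflib.IS_LINE_JUNK))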
Where it gets weird
Thing is, SequenceMatcher is also the class that helps you get the proximity of two strings. If the sequences you pass to SequenceMatcher are strings, it will try to match each character.
from difflib import SequenceMatcher
print SequenceMatcher('hello world','hello world').ratio()
print SequenceMatcher('xyz','abc').ratio()
Oh, sorry, you need to explicitly pass the isjunk function first. Thankfully, it accepts None.
print SequenceMatcher(None, 'hello world','hello world').ratio()
print SequenceMatcher(None, 'xyz','abc').ratio()
OK, that's better: it goes from 0.0 for completely different strings (nothing in common) to 1.0 for perfectly matching strings.
The API is weird, but the old one is kept for compatibility... fine with me.
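For reference, the docs define ratio() as 2.0*M / T, where T is the total number of elements in both sequences and M is the number of matches, so it can be recomputed by hand:
# recompute ratio() by hand from the documented formula 2.0*M / T
a, b = 'abcd', 'abcf'
sm = SequenceMatcher(None, a, b)
M = sum(m.size for m in sm.get_matching_blocks())
T = len(a) + len(b)
print 2.0 * M / T   # 0.75
print sm.ratio()    # 0.75 as well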
Let's try longer strings...
print SequenceMatcher(None, 'y'+'abc'*150,'x'+'abc'*150).ratio()
Still, be careful with sequences longer than 200 elements: by default it automatically treats "popular" elements (those accounting for more than 1% of a sequence that is at least 200 elements long) as junk... Yes, this is documented, but not obvious.
print SequenceMatcher(None, 'y'+'abc'*150,'x'+'abc'*150, autojunk=False).ratio()
So let's define a custom SequenceMatcher that makes sense for the rest of this post, where isjunk is None by default and autojunk is disabled even for n > 200, plus a simple ratio(a, b) function as a shortcut.
def MySequenceMatcher(seq1, seq2, isjunk=None, autojunk=False):
    return SequenceMatcher(isjunk, seq1, seq2, autojunk=autojunk)

def ratio(a, b):
    return MySequenceMatcher(a, b).ratio()
print ratio('abc','abc')
print ratio('abcd','abcf')
Where it gets weirder
I probably won't go into how we found out the following, but for some reason ratio(a, b) was different from ratio(reversed(a), reversed(b)); that is to say, ratio('ab...yz', 'ab...yz') != ratio('zy...ba', 'zy...ba') in some cases.
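A minimal demonstration with the TL;DR strings (slicing is used to reverse, since reversed() returns an iterator; expected values in the comments):
a, b = 'aa-aaa', 'aaaaaa'
print ratio(a, b)              # 0.5    : only 3 characters matched
print ratio(a[::-1], b[::-1])  # 0.8333 : same strings reversed, 5 matched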
So let's look at what happens if we compare a string against itself with only one modification, as a function of the position of that modification; that is to say, the ratio of aaaa... vs baaa..., then vs abaa..., then vs aaba..., and so on.
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

n = 100
step = 1
s1 = u'a'*n
r = lambda x: ratio(x, s1)

def modified(i):
    """ return the string s1, where the i-th character is replaced by - """
    return s1[:i]+u'-'+s1[i+1:]

xx = range(1, n, step)
print s1[:10], '...'
for i in range(10):
    print modified(i)[:10], '...'
print '..............'

distance = map(r, [modified(i) for i in xx])
plt.plot(xx, distance)
plt.ylabel('similarity')
plt.xlabel('position of modified character')
WWWHHHHAAAAAT ???
Hmm... let's look at some other strings, basically the same as before, but with repeating sequences: ababab..., abcabcabcabc...
import string

def test_rep(k=1, n=128):
    s1 = string.letters[:k]*int(n/k*2)
    s1 = s1[:n]
    r = lambda x: ratio(x, s1)
    def modified(i):
        """ return the string s1, where the i-th character is replaced by - """
        return s1[:i]+u'-'+s1[i+1:]
    xx = range(1, n, step)
    distance = map(r, [modified(i) for i in xx])
    return xx, distance

fig, ax = plt.subplots(1, 1)
fig.set_figwidth(16)
fig.set_figheight(8)
for k in [1, 2, 4, 8, 16, 32]:
    xx, distance = test_rep(k, n=64)
    plt.step(xx, distance, where='post', label='k={k}'.format(k=k))
plt.ylabel('similarity')
plt.xlabel('position of modified character')
plt.legend(loc=4)
(bottom of the graph is at 0.5, not 0.0)
Hmmm... this is definitely not what I was expecting (mainly something constant); it looks more like a Sierpinski triangle to me.
Bottom line: even if difflib makes some pretty-looking diffs, I cannot trust it to give me the proximity of two sequences.
Still some good sides...
Anyway, I knew what the Levenshtein distance was, but not really what the efficient algorithms were. I played around a little bit with them and recoded them in pure Python. I'll post the link soon, once I've had time to clean things up and package them as a pure Python module. Below is what it looks like.
Thing is, as far as I can tell, difflib stays an order of magnitude faster in some cases, and computes the matches at the same time.
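For reference, here is the textbook dynamic-programming recurrence that lcs_len3 below implements with a rolling buffer (my own naive version, handy for cross-checking):
def lcs_len_naive(Seq1, Seq2):
    """ Reference LCS length with the full O(n*m) matrix, no tricks:
    L[i][j] = L[i-1][j-1] + 1             if Seq1[i-1] == Seq2[j-1]
    L[i][j] = max(L[i-1][j], L[i][j-1])   otherwise
    """
    L = [[0]*(len(Seq2)+1) for _ in range(len(Seq1)+1)]
    for i in range(1, len(Seq1)+1):
        for j in range(1, len(Seq2)+1):
            if Seq1[i-1] == Seq2[j-1]:
                L[i][j] = L[i-1][j-1] + 1
            else:
                L[i][j] = max(L[i-1][j], L[i][j-1])
    return L[-1][-1]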
def lcs_len3(Seq1, Seq2):
    """ Compute the LCS length of 2 sequences.

    Do not calculate the full matrix; try to be as efficient as possible
    by storing only the minimal amount of elements in memory, mainly the
    previous matrix row + 1 element.
    """
    LL1 = len(Seq1)+1
    LL2 = len(Seq2)+1
    ## we will do the big loop over the longest sequence (L1)
    ## and store the previous row of the matrix (L2+1)
    if LL2 > LL1:
        Seq2, Seq1 = Seq1, Seq2
        LL2, LL1 = LL1, LL2
    previousrow = [0]*(LL2)
    cindex = 0
    for Seq1ii in Seq1:
        for jj in range(1, LL2):
            cindex = (cindex+1) % LL2
            if Seq1ii == Seq2[jj-1]:
                if jj == 1:
                    previousrow[cindex] = 1
                else:
                    previousrow[cindex] += 1
            if Seq1ii != Seq2[jj-1]:
                # take the max of the "up" and "left" neighbours
                up = previousrow[(cindex+1) % LL2]
                if jj != 1:
                    left = previousrow[(cindex-1) % LL2]
                    if left > up:
                        previousrow[cindex] = left
                        continue
                previousrow[cindex] = up
    return previousrow[cindex]
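A quick sanity check against the TL;DR example (my own spot check; remember difflib only found 3 matching characters for the first pair):
print lcs_len3('aa-aaa', 'aaaaaa')   # 5: drop the '-' and everything else matches
print lcs_len3('xyz', 'abc')         # 0: nothing in common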
import numpy as np
import difflib

def compare(s1, s2):
    m0 = difflib.SequenceMatcher(None, s1, s2, autojunk=False).get_matching_blocks()
    m1 = lcs_len3(s1, s2)
    a, b = sum([x.size for x in m0]), m1
    return a, b

for k in range(10):
    s1 = np.random.randint(0, 250, 100)
    s2 = np.random.randint(0, 250, 100)
    a, b = compare(s1, s2)
    print 'random', u'√' if a == b else 'x', a, b
    a, b = compare(sorted(s1), sorted(s2))
    print 'sorted', u'√' if a == b else 'x', a, b
Except in the sorted case, SequenceMatcher from the stdlib gives the wrong match length almost all the time (sometimes it is right).
%timeit difflib.SequenceMatcher(None, s1, s2, autojunk=False).get_matching_blocks()
%timeit lcs_len3(s1,s2)
print '--------- sorted ----------'
s1.sort()
s2.sort()
%timeit difflib.SequenceMatcher(None, s1, s2, autojunk=False).get_matching_blocks()
%timeit lcs_len3(s1,s2)
On both sorted and unsorted arrays, SequenceMatcher is way faster (10 to 20 times) than my implementation, but wrong.
s1 = 'a'*251
i = 124
s2 = 'a'*(i)+'b'+'a'*(250-i)
%timeit MySequenceMatcher(s1, s2).get_matching_blocks()
%timeit lcs_len3(s1,s2)
sum(x.size for x in MySequenceMatcher(s1, s2).get_matching_blocks())
But in the cases where SequenceMatcher is clearly wrong about the optimal sequence, the two take around the same time to execute.
Not that I don't like the current Python difflib, but I hope to build a lower-level library (pure Python, of course) without any surprising behavior, on which maybe one could rebuild the current difflib, and avoid relying on regular expressions to parse the output of another function.
I'd be happy to get any tips to make it faster; of course it is not possible in all cases, but I'm sure we can figure it out.
As usual, comments, typo fixes, and PRs are welcome on the repo that hosts these notebooks. I should probably learn how to use Pelican now that it supports notebooks.
Edit: At the request of some people, I opened a place to comment on this post until I turn it into a real blog post.