I became curious about how string comparison works internally in Python while solving the following example algorithm problem:
Given two strings, return the length of the longest common prefix
Solution 1: charByChar
My intuition told me that the optimal solution would be to start with one cursor at the beginning of both words and iterate forward until the prefixes no longer match. Something like
```python
def charByChar(smaller, bigger):
    assert len(smaller) <= len(bigger)
    for p in range(len(smaller)):
        if smaller[p] != bigger[p]:
            return p
    return len(smaller)
```
To simplify the code, the function assumes that the length of the first string, smaller, is always smaller than or equal to the length of the second string, bigger.
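As a quick sanity check (the definition is repeated so the snippet runs on its own; the test inputs are mine):

```python
def charByChar(smaller, bigger):
    assert len(smaller) <= len(bigger)
    for p in range(len(smaller)):
        if smaller[p] != bigger[p]:
            return p
    return len(smaller)

print(charByChar("walker", "wallet"))  # -> 3 ("wal" is the common prefix)
print(charByChar("abc", "abcdef"))     # -> 3 (smaller is itself the prefix)
print(charByChar("", "anything"))      # -> 0 (no common prefix)
```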
Solution 2: binarySearch
Another method is to bisect the two strings to create two prefix substrings. If the prefixes are equal, we know that the common prefix is at least as long as the midpoint. Otherwise it is shorter than the midpoint. We can then recurse to find the common prefix length. Aka binary search.
```python
def binarySearch(smaller, bigger):
    assert len(smaller) <= len(bigger)
    lo = 0
    hi = len(smaller)
    # binary search for prefix
    while lo < hi:
        # +1 for even lengths
        mid = ((hi - lo + 1) // 2) + lo
        if smaller[:mid] == bigger[:mid]:
            # prefixes equal
            lo = mid
        else:
            # prefixes not equal
            hi = mid - 1
    return lo
```
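Tracing the search on a small pair makes the bisection concrete (definition repeated so the snippet runs on its own; the example strings are mine):

```python
def binarySearch(smaller, bigger):
    assert len(smaller) <= len(bigger)
    lo, hi = 0, len(smaller)
    while lo < hi:
        # round the midpoint up so lo always advances
        mid = ((hi - lo + 1) // 2) + lo
        if smaller[:mid] == bigger[:mid]:
            lo = mid
        else:
            hi = mid - 1
    return lo

# "walker" vs "wallet":
#   lo=0, hi=6 -> mid=3, "wal" == "wal"     -> lo=3
#   lo=3, hi=6 -> mid=5, "walke" != "walle" -> hi=4
#   lo=3, hi=4 -> mid=4, "walk" != "wall"   -> hi=3
print(binarySearch("walker", "wallet"))  # -> 3
```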
At first I assumed that binarySearch would be slower, because string comparison would compare all characters several times rather than just the prefix characters as in charByChar. Surprisingly, binarySearch turned out to be much faster after some preliminary benchmarking.
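A minimal version of such a benchmark might look like the following; the lengths here (a 10,000-character shared prefix and a 50-character differing suffix) are illustrative choices, not the exact ones behind the figures:

```python
import timeit

def charByChar(smaller, bigger):
    assert len(smaller) <= len(bigger)
    for p in range(len(smaller)):
        if smaller[p] != bigger[p]:
            return p
    return len(smaller)

def binarySearch(smaller, bigger):
    assert len(smaller) <= len(bigger)
    lo, hi = 0, len(smaller)
    while lo < hi:
        mid = ((hi - lo + 1) // 2) + lo
        if smaller[:mid] == bigger[:mid]:
            lo = mid
        else:
            hi = mid - 1
    return lo

# long shared prefix, short differing suffix
prefix = "a" * 10_000
smaller = prefix + "x" * 50
bigger = prefix + "y" * 50

for fn in (charByChar, binarySearch):
    t = timeit.timeit(lambda: fn(smaller, bigger), number=100)
    print(f"{fn.__name__}: {t:.4f}s")
```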
The graph above shows how performance is affected as the prefix length increases, with the suffix length held constant at 50 characters.
This graph shows two things:

- As expected, both algorithms perform linearly worse as the prefix length increases.
- The performance of charByChar degrades at a much faster rate.

Why is binarySearch so much better? I think it is because:

- The string comparison in binarySearch is presumably optimized by the interpreter / CPU behind the scenes.
- charByChar actually creates new strings for each character accessed, and this produces significant overhead.
To validate this I benchmarked the performance of comparing and slicing a string, labelled cmp and slice respectively below.
This graph shows two important things:
- As expected, comparing and slicing increase linearly with length.
- The cost of comparing and slicing grows very slowly with length relative to overall algorithm performance (Figure A). Note that both figures go up to strings of length 1 billion characters. Therefore, the cost of comparing 1 character 1 billion times is much, much greater than the cost of comparing 1 billion characters once. But this still doesn’t answer why …
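The claim can be sanity-checked directly; the lengths below are scaled down to a million characters (rather than a billion) purely so the snippet runs quickly:

```python
import timeit

n = 1_000_000
a = "x" * n
b = "x" * n  # equal contents, distinct objects

# one comparison of n characters
one_big = timeit.timeit(lambda: a == b, number=5)

# n comparisons of 1 character
c, d = "x", "x"
many_small = timeit.timeit(lambda: all(c == d for _ in range(n)), number=5)

print(f"compare {n} chars once:   {one_big:.4f}s")
print(f"compare 1 char {n} times: {many_small:.4f}s")
```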
In an effort to find out how the CPython interpreter optimizes string comparison, I generated the bytecode for the following function.
```
In : def char_cmp(a, b): return a[0] == b[0]

In : dis.dis(char_cmp)
              0 LOAD_FAST                0 (a)
              2 LOAD_CONST               1 (0)
              4 BINARY_SUBSCR
              6 LOAD_FAST                1 (b)
              8 LOAD_CONST               1 (0)
             10 BINARY_SUBSCR
             12 COMPARE_OP               2 (==)
             14 RETURN_VALUE
```
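For contrast, disassembling a whole-string comparison (the whole_cmp name is mine) shows the entire operation collapsing into a single COMPARE_OP with no per-character subscripting; the heavy lifting happens inside the interpreter's C implementation:

```python
import dis

def whole_cmp(a, b):
    return a == b

# list the opcode names; the exact surrounding opcodes vary by CPython version
ops = [ins.opname for ins in dis.Bytecode(whole_cmp)]
print(ops)
```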
- Where in the CPython source does string comparison occur?
- Is there a CPU optimization? Is there a special x86 instruction?
- Why is comparing a long string so much faster than comparing each of its characters?
Bonus question: When is charByChar more performant?
If the prefix is sufficiently small in comparison to the rest of the string, at some point the cost of creating substrings in charByChar becomes less than the cost of comparing the substrings in binarySearch.
To describe this relationship I delved into runtime analysis.
To simplify the equations below, let’s assume that smaller and bigger are the same size, and refer to them as s1 and s2.
charByChar(s1, s2) = costOfOneChar * prefixLen
costOfOneChar = cmp(1) + slice(s1Len, 1) + slice(s2Len, 1)
cmp(1) is the cost of comparing two strings of length 1 char.
slice is the cost of accessing a char, the equivalent of charAt(i) in other languages. Python has immutable strings, so accessing a char actually creates a new string of length 1.
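A quick illustration of that point: indexing a string hands back a full str of length 1, since Python has no separate character type:

```python
s = "hello"
c = s[1]  # indexing yields a str, not a char
print(type(c), len(c))  # -> <class 'str'> 1
```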
slice(string_len, slice_len) is the cost of slicing a string of length string_len down to a slice of size slice_len.
So the big-O of charByChar will grow according to

charByChar(s1, s2) = O((cmp(1) + slice(s1Len, 1)) * prefixLen)
binarySearch(s1, s2) = costOfHalfOfEachString * log_2(s1Len)
log_2(s1Len) is the number of times to divide the strings in half until reaching a string of length 1, and
costOfHalfOfEachString = slice(s1Len, s1Len / 2) + slice(s2Len, s1Len / 2) + cmp(s1Len / 2)
So the big-O of binarySearch will grow according to

binarySearch(s1, s2) = O((slice(s2Len, s1Len) + cmp(s1Len)) * log_2(s1Len))
Based on our previous analysis of the cost of cmp and slice, if we assume that costOfHalfOfEachString is approximately the same as costOfOneChar, then we can refer to them both as x:
charByChar(s1, s2) = O(x * prefixLen)
binarySearch(s1, s2) = O(x * log_2(s1Len))
If we equate them:

O(charByChar(s1, s2)) = O(binarySearch(s1, s2))
x * prefixLen = x * log_2(s1Len)
prefixLen = log_2(s1Len)
2 ** prefixLen = s1Len
O(charByChar(s1, s2)) > O(binarySearch(s1, s2)) when

2 ** prefixLen > s1Len
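To put a number on that crossover point, take a string of a million characters (an example length of my choosing); the model says charByChar should only win for prefixes shorter than about 20 characters:

```python
import math

s1Len = 1_000_000
# predicted crossover prefix length: log_2(s1Len)
print(math.log2(s1Len))  # -> ~19.93
```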
So, plugging in the above formula, I regenerated tests for Figure A but with strings of total length 2 ** prefixLen, expecting the performance of the two algorithms to be roughly equal. However, charByChar performs much better. With a bit of trial and error, the performance of the two algorithms is roughly equal when
s1Len = 200 * prefixLen
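That ratio can be checked empirically on a given machine (both definitions repeated so the snippet runs on its own; the 1,000-character prefix is my choice):

```python
import timeit

def charByChar(smaller, bigger):
    assert len(smaller) <= len(bigger)
    for p in range(len(smaller)):
        if smaller[p] != bigger[p]:
            return p
    return len(smaller)

def binarySearch(smaller, bigger):
    assert len(smaller) <= len(bigger)
    lo, hi = 0, len(smaller)
    while lo < hi:
        mid = ((hi - lo + 1) // 2) + lo
        if smaller[:mid] == bigger[:mid]:
            lo = mid
        else:
            hi = mid - 1
    return lo

# strings at the empirical crossover ratio: s1Len = 200 * prefixLen
prefixLen = 1_000
s1Len = 200 * prefixLen
smaller = "a" * prefixLen + "b" * (s1Len - prefixLen)
bigger = "a" * prefixLen + "c" * (s1Len - prefixLen)

for fn in (charByChar, binarySearch):
    t = timeit.timeit(lambda: fn(smaller, bigger), number=50)
    print(f"{fn.__name__}: {t:.4f}s")
```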
Why is the relationship 200x?