Tripped up by Python string comparison


Thursday, I was irked by a bug.

I had modified a background task so it could import a range of documents from another subsystem into our datastore, instead of only one. Its parameters had included one “document id”, which identified the patent document to import. Now, it could be given that, or two document ids representing a document range.

In one instance, it reported a successful completion yet the desired patents weren’t loaded. What had gone wrong?

Multiple official and de facto formats exist for US patent application and grant document ids. To keep this simple, let’s consider US Design Patents. Their document id is a “D” followed by a number. This looks like “D4432”, or “D902”.

So if you wanted to import a range of Design Patents, you might say, “Import the patents D900 through D4000, inclusive.” “D900” is the lower bound and “D4000” is the upper bound. Right?

Not so fast!

>>> "D900" < "D4000"
False
>>>

Gah!

Like most languages and command shells, Python’s string comparison differs from a human’s. It uses lexicographical ordering. More specifically, case-sensitive lexicographical ordering. So “D9000” > “D4000”, which is what you’d expect. But “D900” > “D4000”, which you don’t expect.

A simple explanation is it compares two strings one character at a time, left-to-right, and the first inequality provides the result. If you want the set theory, go read the Wikipedia article or search for it on Google.

The code didn’t process the range because it got confused about the boundaries. I had forgotten a cardinal rule: Never compare floating-point numbers for equality, and be careful when comparing strings for less-than or greater-than.

You almost always want to compare strings using natural ordering, which compares them as a human would. Natsort is an excellent Python package for natural order sorting, but it doesn’t provide methods comparing string arguments, which is what this code needed.

I considered a couple of fixes, including normalizing the document ids to integers elsewhere in the code. As the strings’ format was well-defined because it’s internal data, I decided to craft my own solution, viz.:

def in_natural_range(low, item, high):
    """Return True if low <= item <= high, using natural ordering.

    :param low: The lower bound
    :type low: str
    :param item: The item to be compared to low and high
    :type item: str
    :param high: The upper bound
    :type high: str

    """

    # Keep these values locally because they're used multiple times
    lenlow = len(low)
    lenitem = len(item)
    lenhigh = len(high)

    if lenlow == lenitem == lenhigh:
        return low <= item <= high
    elif lenlow < lenitem < lenhigh:
        return True
    elif lenlow > lenitem or lenitem > lenhigh:
        return False
    elif lenlow == lenitem:
        # Item is shorter than high. How do low and item compare?
        return low <= item
    else:
        # Low is shorter than item.  How do item and high compare?
        return item <= high

How would you do it?

3 comments
  1. In PHP, there’s a function for comparing string using ‘natural’ ordering: http://www.php.net/manual/en/function.strnatcmp.php

    PHP has a natural sort function too: http://php.net/manual/en/function.natsort.php

    What I would normally do is write a Comparator that would split the string into a string portion and a numeric portion. If the string portions are equal, then cast the numeric portion into integers and perform a numeric comparison. Of course, if there is no particular form for the document ID (i.e. the ID could be “1345-D” or “D-969-A”) then we can’t really do much.

  2. John said:

    My friend Kirk pointed out that if the document id formats were static, I could do something like this:

    def my_order(value):
        return (value[0], int(value[1:]))
    

    Then sort a list:

    listQuery.sort(key=my_order)
    

    A couple of reasons why I didn’t do this. One, the doc id format is 0 to 2 letters followed by 1 to 11 digits. Not every combination occurs, but we see “D1234”, “1234567”, “PP5556”, etc. There’s enough variation that you’d have to make my_order more complex. Two, I wanted to compare a string to two other strings, not sort a list, and coding a class with a custom __lt__ would be much too heavy a solution.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: