A performance lesson on Django QuerySets


At work, we’ve contracted with PostgreSQL Experts to help us improve our Postgres performance. After analyzing our system, one of their consultants, Christophe Pettus, found glaring problems in how some of my code accessed our database.

I consider myself well-informed about good database access practices in Django, and in general. I might not exactly hit the bull’s-eye, but I’m sufficiently savvy to avoid making a “WTF” mistake, right?

Nope!

Goof #1: Bad iterator() use

We use Celery for background asynchronous tasks. One task periodically updates our full-text search engine. It wakes up, sees if there’s work to be done, and if so launches 20 sub-tasks (across 10 servers) to do the work.

This loop is in each sub-task:

# Iterate through some items in need of indexing, making a
# document for each one. When we've accumulated a handful,
# ship them off to the server.
vitals = Vital.objects.filter(id__range=(id_start, id_end),
                              fts_handled=False).iterator()
for item in vitals:
    do_stuff()
    do_more_stuff()

    # etc...

Simple and trivial — right?

I used iterator() for the standard reason: A large number of rows could be returned. But, consider: This loop iterates very quickly — it does some text parsing and extraction, with no other I/O activity. Twenty of these sub-tasks means there are 20 of these loops simultaneously executing.

The performance data showed iterator() was a problem. While it uses less memory on the slave, it generates a request on every iteration. Multiplied by 20. By comparison, a non-iterator() loop would generate only one QuerySet for the entire sub-task. I.e.: Twenty requests returning data blocks, as opposed to a bazillion requests returning just one row.

After reconsidering the QuerySet’s size, I removed the iterator(). Postgres’ workload significantly decreased, and the tasks’ memory use wasn’t that bad after all.

My sins: Not considering the combined effect of a fast loop replicated in multiple tasks, and premature optimization.

Goof #2: QuerySets aren’t in-memory filtered

Two other asynchronous tasks periodically wake up, check if a particular table needs updating, and do the update. One launches 20 sub-tasks, while the other updates in situ.

Both have the following kind of loop. (I’m showing only the relevant code. Don’t be distracted by the goofy names — I want to anonymize this.):

def subtask(start, end):
    """The sub-task main loop."""

    things = Foobar.objects.filter(id__range=(start, end),
                                   updated=False).iterator()
    for thing in things:
        # do some work....
        _private_update(thing)
        # do some other work...

def _private_update(thing):
    """Update a row with new information."""
    from .models import TIMERANGE_CHOICES

    # For every timerange in this class...
    for timerange in [x[0] for x in TIMERANGE_CHOICES]:

        # Extract the ending year
        endyear = int(timerange[4:])

        # Calculate the four variants.
        p_a = Vital.objects.filter(year__lte=endyear, owners=thing)
        value = foobar(p_a)
        if value:
            thing.most_pa[timerange] = value

        p = p_a.filter(fieldb='')
        value = foobar(p)
        if value:
            thing.most_p[timerange] = value
.
        p = p_a.exclude(fieldb='')
        value = foobar(p)
        if value:
            thing.most_g[timerange] = value

        p = p.filter(expire_date__gt=date(endyear, 12, 31))
        value = foobar(p)
        if value:
            thing.most_up[timerange] = value

    thing.save()

The iterator() is again a problem, although minor this time. The “things” loop does more work than the previous example, so it didn’t beat on Postgres as much. But “more is more” and the filter() result’s memory use turned out to not be so bad. So, again, I had done a premature optimization, resulting in a little more work for Postgres because it got hit once per iteration instead of once per task invocation.

The major problem is in _private_update(). It iterates over four QuerySet references. QuerySets are lazy and cached, and I thought “cached” included caching for use in other QuerySets. Since the latter three QuerySets derive from the “p_a” QuerySet, I thought Postgres got hit only once per iteration, because the “p_a” QuerySet was reused.

Way Wrong. Each QuerySet is a new QuerySet and hits the database! The documentation does imply this: “Filter…returns a new QuerySet.” A new QuerySet means a new QuerySet. The “root” QuerySet (the Vital.objects.filter()) is not reused in the subsequent database statements, and so this loop generates four database requests, not one. And that’s across simultaneous tasks.

I replaced the latter three QuerySets with list comprehensions. Presto, database operations drop by 75%.

My sins: Not being curious about when the QuerySet cache is used, careless reading, and not considering the effects of multiple simultaneous tasks.

Big take-aways

Performance monitoring is good. Smart experts are good. Never think you’re as smart as you think you are. (That might be a contradiction…)

12 comments
  1. Hey man, good post. I don’t think those are “WTF” mistakes. Maybe you’re being to hard on yourself, ha. If you’re scaling so much that you’re finding problems like these then you must be doing something right!

  2. Ben said:

    Thanks for the advice and examples. What I would love is a guide for how to best police your code for bad queries. I use django debug toolbar on my public-facing pages, but don’t have a similar method for my backend code.

    • John said:

      I don’t have any grand insight to share. Except: Think. Stay informed by following blogs, forums, and conferences. Analyze your running system for chokepoints. Hire smart specialists to tell you, “Dude, you’re goofing up in this function…”

    • debugsqlshell, which is part of debug-toolbar (not everyone is aware of it) can be a nice companion while you’re prototyping code.

      Another easy thing to do is just wrap appropriate sections of your code (in a decorator for example) to capture queries from django.db.connection.queries. That’s where Django logs queries when DEBUG is on and you can replace connection.queries with an empty list wherever you want to begin capture for a particular piece of code.

    • New Relic (http://newrelic.com) is invaluable for us in diagnosing performance issues.

  3. Kevin said:

    The Django DevServer project (a drop-in replacement for django’s runserver) allows you to configure SQL Profiling so you can see the queries being executed in the shell, time of execution, integrates with werkzeug interactive debugger, etc. https://github.com/dcramer/django-devserver.

  4. James said:

    I’m not sure why you still work after committing such stupid mistakes. I guess the crisis urged your company to save on Django experts and hire staff like you.

  5. Reblogged this on Anand Jeyahar's blah..blah..radio static…..

  6. Adam Hammouda said:

    If you haven’t seen it yet, this talks about more query optimization:
    http://www.mellowmorning.com/2010/03/03/django-query-set-iterator-for-really-large-querysets/
    (check out the last comment: also an implementation)

    and this:
    http://djangosnippets.org/snippets/1949/

  7. Awesome, thanks for this. Was just trying to figure out why I couldn’t chain cached QuerySets without hitting the database again–as you pointed out, the docs make it sound like this would work.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 9,373 other followers

%d bloggers like this: