Django vs. PostgreSQL IN operations


Here’s another cautionary performance tale, wherein I thought I was clever but was not.

A table (“Vital”) holds widget information. Another table (“Furball”) holds other information, with an M:M relationship to Vital.
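A minimal sketch of the two models as described (fields other than the relation are hypothetical; the field name vital matches the lookups below):

from django.db import models

class Vital(models.Model):
    # Only the pk matters for this story; other fields are hypothetical.
    name = models.CharField(max_length=100)

class Furball(models.Model):
    name = models.CharField(max_length=100)   # used in _get_top() below
    vital = models.ManyToManyField(Vital)     # the M:M relationship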

We want to do inferential computations on filtered Furball rows. So we generate a pk list from a Vital QuerySet, and call this function:

def _get_top(vitals):
    from django.db.models import Count

    TOP_NUMBER = 5

    # Materialize every pk in the Vital QuerySet into a Python list.
    vitalids = [x.id for x in vitals]

    # Find the Furballs referenced by those Vitals, count the matches,
    # and keep the TOP_NUMBER most-referenced.
    top_balls = Furball.objects.filter(vital__id__in=vitalids)\
                               .annotate(count=Count('id'))\
                               .order_by('-count')[:TOP_NUMBER]
    top_list = [(x.name, x.count) for x in top_balls]

    return top_list

I coded this using an __in operation for a couple of reasons, which won’t be apparent from this simplified code fragment. My point is, I didn’t think this approach would be a performance problem.

Look at the top_balls assignment, which filters the Furball rows referenced by the Vital rows.

Goof #1: The vitalids list could be really big. Really big. How big? It could have up to 15 K entries.

I didn’t expect this when I coded this function. That’s a huge amount of data to send with a db request.
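You can see the scale of the problem by rendering the SQL Django will produce. A sketch with stand-in ids (str(qs.query) interpolates the parameters, so the IN list appears inline):

vitalids = list(range(15000))   # stand-in for ~15 K real pks
qs = Furball.objects.filter(vital__id__in=vitalids)

# The rendered statement carries one literal value per pk; the
# IN (...) clause alone runs to roughly 100 KB of SQL here.
print(len(str(qs.query)))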

Goof #2: Although there’s a relationship between these tables, I had thought an __in field lookup, which asks Postgres to search for pks in the intermediary JOIN table, would be better than doing a full-blown JOIN.

Reality: PG Experts said, “Jane, you ignorant slut!”

  1. The parse time for such a very large SQL command was substantial.
  2. The logging time was non-trivial, and gobbled up disk space like there was no tomorrow.
  3. But the biggest problem was this: within PostgreSQL, the JOIN execution path is far more optimized than the IN path. My assumption that I was saving Postgres work was False. I was, in fact, making more work for it. (One way to compare the two plans is sketched just after this list.)
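To see what the planner actually does with each shape, QuerySet.explain() (available since Django 2.1) runs EXPLAIN on the compiled SQL. A sketch, reusing vitals and vitalids from above:

# EXPLAIN both shapes: the literal pk list vs. letting PostgreSQL
# resolve the relationship itself.
in_qs = Furball.objects.filter(vital__id__in=vitalids)   # giant IN (...) list
join_qs = Furball.objects.filter(vital__in=vitals)       # compiled to a subquery

print(in_qs.explain())
print(join_qs.explain())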

The IN form is often a longer, sometimes far longer, command than the equivalent JOIN, and the logs showed it being sent much more often than I had expected. The cumulative command transfer time alone was substantial.

I restructured the code to use a JOIN, and the performance went way, way up.
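For illustration, the restructured version looks roughly like this. The status filter is hypothetical, standing in for whatever condition originally produced vitals; expressing it as a lookup across the relation lets Django compile a JOIN through the M:M table instead of shipping a pk list. (As the comments below note, my real fix ultimately needed more than this.)

from django.db.models import Count

TOP_NUMBER = 5

# Filter across the relation: Django JOINs Furball to Vital through
# the M:M table, so no pk list ever leaves the application.
top_balls = Furball.objects.filter(vital__status='active')\
                           .annotate(count=Count('id'))\
                           .order_by('-count')[:TOP_NUMBER]   # 'status' is hypothetical
top_list = [(x.name, x.count) for x in top_balls]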

The lesson: Measure. Test. Be ready to learn that your assumptions are wrong. Be ready to learn that there’s a better way. And, as always, hire smart consultants to tell you what you’re doing wrong.

5 thoughts on “Django vs. PostgreSQL IN operations”

  1. Did you compare with .filter(vital__id__in=vitals.values_list("pk"))? When first getting started with Django, I missed that this creates a sub-select, and built lists instead, before realizing that was both more work and slower. Postgres does quite well with subselects, although MySQL can be unfortunate.

     1. Fair question; I’ll post what I did. It wasn’t particularly cosmic, IIRC.

        Update: I didn’t RC! Removing the __in required replacing these calculations with a lot more server code, which included using a Postgres materialized view. We needed the count information in addition to the relationships, so a JOIN alone wouldn’t have been sufficient. Removing the __in wasn’t easy in this case, but it was a big win. (No poetry intended. 🙂 )

        My point remains that __in operations are expensive, and large __in operations are very expensive.
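For reference, a minimal sketch of the form the commenter suggests; passing the values_list QuerySet (here with flat=True for a single-column result) keeps the pk lookup as a sub-select instead of a materialized list:

from django.db.models import Count

TOP_NUMBER = 5

# Django compiles this to IN (SELECT ... FROM the Vital table),
# so the pk list is never pulled into Python.
top_balls = Furball.objects.filter(vital__id__in=vitals.values_list('pk', flat=True))\
                           .annotate(count=Count('id'))\
                           .order_by('-count')[:TOP_NUMBER]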
