Here’s another cautionary performance tale, wherein I thought I was clever but was not.
A table (“Vital”) holds widget information. Another table (“Furball”) holds other information, with an M:M relationship to Vital.
We want to do inferential computations on filtered Furball rows. So we generate a pk list from a Vital QuerySet, and call this function:
```python
def _get_top(vitals):
    from django.db.models import Count

    TOP_NUMBER = 5
    # Materialize the primary keys of every Vital in the queryset.
    vitalids = [x.id for x in vitals]
    # Fetch the Furballs related to any of those Vitals, count the
    # matching rows for each one, and keep the top few.
    top_balls = Furball.objects.filter(vital__id__in=vitalids)\
                               .annotate(count=Count('id'))\
                               .order_by('-count')[:TOP_NUMBER]
    top_list = [(x.name, x.count) for x in top_balls]
    return top_list
```
I coded this using an __in operation for a couple of reasons, which won’t be apparent from this simplified code fragment. My point is, I didn’t think this approach would be a performance problem.
Look at the top_balls assignment, which filters the Furball rows referenced by the Vital rows.
Goof #1: The vitalids list could be really big. Really big. How big? It could have up to 15,000 entries.
I didn’t expect this when I coded this function. That’s a huge amount of data to send with a db request.
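To make that concrete, here's roughly the shape of what goes over the wire. This is only a sketch: the table and column names in the comment are Django's defaults and purely illustrative, and `vitalids` is the big pk list built in the function above.

```python
# A hedged illustration of the query the __in version produces.
from django.db.models import Count

# vitalids is the ~15,000-element pk list built in _get_top().
qs = (Furball.objects
      .filter(vital__id__in=vitalids)
      .annotate(count=Count('id'))
      .order_by('-count')[:5])

print(qs.query)
# ... WHERE "myapp_furball_vital"."vital_id" IN (1, 2, 3, ..., 14999, 15000) ...
# Every one of those literals has to be transferred, parsed, planned,
# and logged by Postgres on every call.
```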
Goof #2: Although there’s a relationship between these tables, I had thought an __in field lookup, which asks Postgres to search for pks in the intermediary JOIN table, would be better than doing a full-blown JOIN.
Reality: PG Experts said, “Jane, you ignorant slut!”
- The parse time for such a large SQL command was substantial
- The logging time was non-trivial, and gobbled up disk space like there was no tomorrow
- But the biggest problem was this: Within PostgreSQL, the JOIN execution path is far more optimized than the IN path. My assumption that I was saving Postgres work was False. I was, in fact, making more work for it.
An IN query is often a longer (sometimes far longer) command than the equivalent JOIN, and the logs showed that this query was being issued much more often than I had expected. The cumulative command transfer time alone was substantial.
I restructured the code to use a JOIN, and the performance went way, way up.
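For illustration only, here is a minimal sketch of the JOIN-style direction, assuming the filtered Vital queryset can simply be passed through so the ORM keeps everything in SQL. As the update in the comments below explains, the real rewrite ended up needing more than this.

```python
# A hedged sketch, not the actual rewrite: passing the Vital queryset
# itself lets Django render a sub-select instead of a 15,000-literal
# IN list, which Postgres can plan as a join.
from django.db.models import Count

TOP_NUMBER = 5

def _get_top(vitals):
    top_balls = (Furball.objects
                 .filter(vital__in=vitals)        # no pk list materialized in Python
                 .annotate(count=Count('vital'))  # matching Vitals per Furball
                 .order_by('-count')[:TOP_NUMBER])
    return [(x.name, x.count) for x in top_balls]
```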
The lesson: Measure. Test. Be ready to learn that your assumptions are wrong. Be ready to learn that there’s a better way. And, as always, hire smart consultants to tell you what you’re doing wrong.
So how did you actually restructure this particular piece of code?
Fair question; I'll post what I did. It wasn't particularly cosmic, IIRC.
Update: I didn’t RC! Removing the __in required replacing these calculations with a lot more server code, which included using a Postgres materialized view. We needed the count information in addition to the relationships, and so using just a JOIN wouldn’t have been sufficient. Removing the __in wasn’t easy in this case, but it was a big win. (No poetry intended. 🙂 )
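Since the actual code isn't shown here, the following is only a hedged sketch of the general shape of the materialized-view approach. Every name below (app label, view, tables, columns) is illustrative, not taken from the real application.

```python
# A hedged sketch: creating a materialized view with the counts from a
# Django migration. All names are illustrative.
from django.db import migrations

CREATE_VIEW = """
CREATE MATERIALIZED VIEW furball_vital_counts AS
SELECT f.id   AS furball_id,
       f.name AS name,
       COUNT(fv.vital_id) AS vital_count
FROM   myapp_furball f
JOIN   myapp_furball_vital fv ON fv.furball_id = f.id
GROUP  BY f.id, f.name;
"""

class Migration(migrations.Migration):
    dependencies = [("myapp", "0001_initial")]
    operations = [
        migrations.RunSQL(
            CREATE_VIEW,
            reverse_sql="DROP MATERIALIZED VIEW furball_vital_counts;",
        ),
        # Refresh after bulk data changes, e.g.:
        #   REFRESH MATERIALIZED VIEW furball_vital_counts;
    ]
```

The top-N query can then read from the view (ordered by the precomputed count) instead of re-aggregating the M:M table on every request.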
My point remains that __in operations are expensive, and large __in operations are very expensive.
Yup, that would be interesting 🙂
Btw, .values_list() saved me many times in cases like "vitalids = [x.id for x in vitals]".
Did you compare with .filter(vital__id__in=vitals.values_list("pk"))? When I first got started with Django, I missed that this creates a sub-select, and I built lists instead before realizing that was both more work and slower. Postgres does quite well with subselects, although MySQL can be unfortunate.
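For reference, a hedged sketch of the forms being compared in this thread, using the model names from the post:

```python
# Three ways to feed the pks in. The first two materialize a Python
# list; the third stays in SQL as a sub-select.
vitalids = [x.id for x in vitals]                     # full objects fetched, list built
vitalids = list(vitals.values_list('id', flat=True))  # same list, only the id column fetched
top_balls = Furball.objects.filter(
    vital__id__in=vitals.values_list('pk')            # no list: rendered as a sub-select
)
```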
@Thomas and @Chris:
Check out Christophe Pettus’ OSCON 2011 slide deck, “Unbreaking Your Django Application.” He discusses __in starting at slide 66.
The issue isn’t how I specify the pk list on the Django side. It’s what Postgres does with it.