Here’s another cautionary performance tale, wherein I thought I was clever but was not.
A table (“Vital”) holds widget information. Another table (“Furball”) holds other information, with an M:M relationship to Vital.
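For concreteness, here’s a minimal sketch of the two models as described; the `vital` field name is implied by the lookup in the code below, but everything else is illustrative, since the real schema isn’t shown:

```python
# Illustrative models only: the real schema has more fields.
from django.db import models

class Vital(models.Model):
    name = models.CharField(max_length=100)

class Furball(models.Model):
    name = models.CharField(max_length=100)
    # M:M relationship; Django creates an intermediary join table for this.
    vital = models.ManyToManyField(Vital)
```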
We want to do inferential computations on filtered Furball rows. So we generate a pk list from a Vital QuerySet, and call this function:
```python
def _get_top(vitals):
    from django.db.models import Count

    TOP_NUMBER = 5
    vitalids = [x.id for x in vitals]
    top_balls = (Furball.objects
                 .filter(vital__id__in=vitalids)
                 .annotate(count=Count('id'))
                 .order_by('-count')[:TOP_NUMBER])
    top_list = [(x.name, x.count) for x in top_balls]
    return top_list
```
I coded this using an __in operation for a couple of reasons, which won’t be apparent from this simplified code fragment. My point is, I didn’t think this approach would be a performance problem.
Look at the top_balls assignment, which filters the Furball rows referenced by the Vital rows.
Goof #1: The vitalids list could be really big. Really big. How big? It could have up to 15,000 entries.
I didn’t expect this when I coded this function. That’s a huge amount of data to send with a db request.
Goof #2: Although there’s a relationship between these tables, I had thought that an __in field lookup, which asks Postgres to search for pks in the intermediary M:M table, would be cheaper than doing a full-blown JOIN.
Reality: PG Experts said, “Jane, you ignorant slut!”
- The parse time for such a very large SQL statement was substantial.
- The logging time was non-trivial, and it gobbled up disk space like there was no tomorrow.
- But the biggest problem was this: within PostgreSQL, the JOIN execution path is far more optimized than the IN path. My assumption that I was saving Postgres work was false. I was, in fact, making more work for it.
An IN statement will often be longer, sometimes vastly longer, than the equivalent JOIN, and the logs showed that this happened much more often than I had expected. The cumulative command transfer time alone was substantial.
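A rough way to see this from the Django side is to compare the SQL text each form generates, e.g. via str(queryset.query). A sketch, reusing vitals and vitalids from the function above:

```python
# Sketch: compare the size of the statement Django will send in each case.
in_qs = Furball.objects.filter(vital__id__in=vitalids)  # 15,000 literal pks
join_qs = Furball.objects.filter(vital__in=vitals)      # nested SELECT instead

print(len(str(in_qs.query)))    # on the order of 100 KB of SQL text
print(len(str(join_qs.query)))  # a few hundred bytes, whatever the match count
```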
I restructured the code to use a JOIN, and the performance went way, way up.
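For the curious, a minimal sketch of the join-style rewrite: pass the queryset itself through the relationship instead of materializing a pk list. (Per the comments below, the production fix ultimately also needed a materialized view, so treat this as the shape of the change, not the whole fix.)

```python
from django.db.models import Count

def _get_top(vitals):
    TOP_NUMBER = 5
    # Passing the queryset makes Django emit a subquery/JOIN,
    # so no 15,000-entry pk list ever crosses the wire.
    top_balls = (Furball.objects
                 .filter(vital__in=vitals)
                 .annotate(count=Count('id'))  # one joined row per matched Vital
                 .order_by('-count')[:TOP_NUMBER])
    return [(x.name, x.count) for x in top_balls]
```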
The lesson: Measure. Test. Be ready to learn that your assumptions are wrong. Be ready to learn that there’s a better way. And, as always, hire smart consultants to tell you what you’re doing wrong.
5 thoughts on “Django vs. PostgreSQL IN operations”
So how did you actually restructure this particular piece of code?
Fair question; I’ll post what I did. It wasn’t particularly cosmic, IIRC.
Update: I didn’t RC! Removing the __in required replacing these calculations with a lot more server code, which included using a Postgres materialized view. We needed the count information in addition to the relationships, and so using just a JOIN wouldn’t have been sufficient. Removing the __in wasn’t easy in this case, but it was a big win. (No poetry intended. 🙂 )
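For readers wondering what the materialized-view piece might look like, here’s a hypothetical sketch (view, app, and table names invented) of precomputing the per-Furball counts in a Django migration, so the hot path reads precomputed rows instead of re-running the aggregation:

```python
# Hypothetical sketch only: view and table names are invented.
from django.db import migrations

class Migration(migrations.Migration):
    dependencies = [('myapp', '0001_initial')]

    operations = [
        migrations.RunSQL(
            """
            CREATE MATERIALIZED VIEW furball_counts AS
            SELECT f.id, f.name, COUNT(fv.vital_id) AS count
            FROM myapp_furball f
            JOIN myapp_furball_vital fv ON fv.furball_id = f.id
            GROUP BY f.id, f.name;
            """,
            reverse_sql="DROP MATERIALIZED VIEW furball_counts;",
        ),
    ]
```

The view would be refreshed out of band (REFRESH MATERIALIZED VIEW furball_counts;), so the read path pays none of the aggregation cost.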
My point remains that __in operations are expensive, and large __in operations are very expensive.
Yup, that would be interesting 🙂
Btw, .values_list() has saved me many times in cases like "vitalids = [x.id for x in vitals]".
Did you compare with .filter(vital__id__in=vitals.values_list('pk'))? When I first got started with Django, I missed that this creates a sub-select; I was building lists before realizing that was both more work and slower. Postgres does quite well with sub-selects, although MySQL can be unfortunate.
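For anyone following along, a sketch of the sub-select form Thomas describes, reusing the models from the post (the inner queryset is rendered as a nested SELECT rather than evaluated in Python):

```python
from django.db.models import Count

TOP_NUMBER = 5
pks = vitals.values_list('pk', flat=True)  # lazy; never evaluated in Python

top_balls = (Furball.objects
             .filter(vital__id__in=pks)    # SQL: ... IN (SELECT "id" FROM ...)
             .annotate(count=Count('id'))
             .order_by('-count')[:TOP_NUMBER])
```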
@Thomas and @Chris:
Check out Christophe Pettus’ OSCON 2011 slide deck, “Unbreaking Your Django Application.” He discusses __in starting at slide 66.
The issue isn’t how I specify the pk list on the Django side. It’s what Postgres does with it.