Here’s another cautionary performance tale, wherein I thought I was clever but was not.
A table (“Vital”) holds widget information. Another table (“Furball”) holds other information, with an M:M relationship to Vital.
We want to do inferential computations on filtered Furball rows. So we generate a pk list from a Vital QuerySet, and call this function:
```python
def _get_top(vitals):
    from django.db.models import Count

    TOP_NUMBER = 5
    # Materialize the primary keys of every Vital in the queryset.
    vitalids = [x.id for x in vitals]
    # Fetch the Furballs related to any of those Vitals, count the
    # matching rows for each one, and keep the top few.
    top_balls = Furball.objects.filter(vital__id__in=vitalids)\
                               .annotate(count=Count('id'))\
                               .order_by('-count')[:TOP_NUMBER]
    top_list = [(x.name, x.count) for x in top_balls]
    return top_list
```
I coded this using an __in operation for a couple of reasons, which won’t be apparent from this simplified code fragment. My point is, I didn’t think this approach would be a performance problem.
Look at the top_balls assignment, which filters the Furball rows referenced by the Vital rows.
Goof #1: The vitalids list could be really big. Really big. How big? It could have up to 15,000 entries.
I didn’t expect this when I coded this function. That’s a huge amount of data to send with a db request.
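To make that concrete, here's roughly the shape of what goes over the wire. This is only a sketch: the table and column names in the comment are Django's defaults and purely illustrative, and `vitalids` is the big pk list built in the function above.

```python
# A hedged illustration of the query the __in version produces.
from django.db.models import Count

# vitalids is the ~15,000-element pk list built in _get_top().
qs = (Furball.objects
      .filter(vital__id__in=vitalids)
      .annotate(count=Count('id'))
      .order_by('-count')[:5])

print(qs.query)
# ... WHERE "myapp_furball_vital"."vital_id" IN (1, 2, 3, ..., 14999, 15000) ...
# Every one of those literals has to be transferred, parsed, planned,
# and logged by Postgres on every call.
```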
Goof #2: Although there’s a relationship between these tables, I had thought an __in field lookup, which asks Postgres to search for pks in the intermediary JOIN table, would be better than doing a full-blown JOIN.
Reality: PG Experts said, “Jane, you ignorant slut!”
- The parse time for such a large SQL command was substantial
- The logging time was non-trivial, and gobbled up disk space like there was no tomorrow
- But the biggest problem was this: Within PostgreSQL, the JOIN execution path is far more optimized than the IN path. My assumption that I was saving Postgres work was False. I was, in fact, making more work for it.
An IN query is often a longer (sometimes far longer) command than the equivalent JOIN, and the logs showed that this query was being issued much more often than I had expected. The cumulative command transfer time alone was substantial.
I restructured the code to use a JOIN, and the performance went way, way up.
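For illustration only, here is a minimal sketch of the JOIN-style direction, assuming the filtered Vital queryset can simply be passed through so the ORM keeps everything in SQL. As the update in the comments below explains, the real rewrite ended up needing more than this.

```python
# A hedged sketch, not the actual rewrite: passing the Vital queryset
# itself lets Django render a sub-select instead of a 15,000-literal
# IN list, which Postgres can plan as a join.
from django.db.models import Count

TOP_NUMBER = 5

def _get_top(vitals):
    top_balls = (Furball.objects
                 .filter(vital__in=vitals)        # no pk list materialized in Python
                 .annotate(count=Count('vital'))  # matching Vitals per Furball
                 .order_by('-count')[:TOP_NUMBER])
    return [(x.name, x.count) for x in top_balls]
```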
The lesson: Measure. Test. Be ready to learn that your assumptions are wrong. Be ready to learn that there’s a better way. And, as always, hire smart consultants to tell you what you’re doing wrong.
So how did you actually restructure this particular piece of code?
Fair question; I'll post what I did. It wasn't particularly cosmic, IIRC.
Update: I didn’t RC! Removing the __in required replacing these calculations with a lot more server code, which included using a Postgres materialized view. We needed the count information in addition to the relationships, and so using just a JOIN wouldn’t have been sufficient. Removing the __in wasn’t easy in this case, but it was a big win. (No poetry intended. 🙂 )
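Since the actual code isn't shown here, the following is only a hedged sketch of the general shape of the materialized-view approach. Every name below (app label, view, tables, columns) is illustrative, not taken from the real application.

```python
# A hedged sketch: creating a materialized view with the counts from a
# Django migration. All names are illustrative.
from django.db import migrations

CREATE_VIEW = """
CREATE MATERIALIZED VIEW furball_vital_counts AS
SELECT f.id   AS furball_id,
       f.name AS name,
       COUNT(fv.vital_id) AS vital_count
FROM   myapp_furball f
JOIN   myapp_furball_vital fv ON fv.furball_id = f.id
GROUP  BY f.id, f.name;
"""

class Migration(migrations.Migration):
    dependencies = [("myapp", "0001_initial")]
    operations = [
        migrations.RunSQL(
            CREATE_VIEW,
            reverse_sql="DROP MATERIALIZED VIEW furball_vital_counts;",
        ),
        # Refresh after bulk data changes, e.g.:
        #   REFRESH MATERIALIZED VIEW furball_vital_counts;
    ]
```

The top-N query can then read from the view (ordered by the precomputed count) instead of re-aggregating the M:M table on every request.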
My point remains that __in operations are expensive, and large __in operations are very expensive.
Yup, that would be interesting 🙂
Btw, .values_list() saved me many times in cases like "vitalids = [x.id for x in vitals]".
Did you compare with .filter(vital__id__in=vitals.values_list("pk"))? When I first got started with Django, I missed that this creates a sub-select, and I built lists instead before realizing that was both more work and slower. Postgres does quite well with subselects, although MySQL can be unfortunate.
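For reference, a hedged sketch of the forms being compared in this thread, using the model names from the post:

```python
# Three ways to feed the pks in. The first two materialize a Python
# list; the third stays in SQL as a sub-select.
vitalids = [x.id for x in vitals]                     # full objects fetched, list built
vitalids = list(vitals.values_list('id', flat=True))  # same list, only the id column fetched
top_balls = Furball.objects.filter(
    vital__id__in=vitals.values_list('pk')            # no list: rendered as a sub-select
)
```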
@Thomas and @Chris:
Check out Christophe Pettus’ OSCON 2011 slide deck, “Unbreaking Your Django Application.” He discusses __in starting at slide 66.
The issue isn’t how I specify the pk list on the Django side. It’s what Postgres does with it.