Django vs. PostgreSQL IN operations
Here’s another cautionary performance tale, wherein I thought I was clever but was not.
A table (“Vital”) holds widget information. Another table (“Furball”) holds other information, with an M:M relationship to Vital.
We want to do inferential computations on filtered Furball rows. So we generate a pk list from a Vital QuerySet, and call this function:
def _get_top(vitals): from django.db.models import Count TOP_NUMBER = 5 vitalids = [x.id for x in vitals] top_balls = Furball.objects.filter(vital__id__in=vitalids)\ .annotate(count=Count('id'))\ .order_by('-count')[:TOP_NUMBER] top_list = [(x.name, x.count)for x in top_balls] return top_list
I coded this using an __in operation for a couple of reasons, which won’t be apparent from this simplified code fragment. My point is, I didn’t think this approach would be a performance problem.
Look at the top_balls assignment, which filters the Furball rows referenced by the Vital rows.
Goof #1: The vitalids list could be really big. Really big. How big? It could have up to 15 K entries.
I didn’t expect this when I coded this function. That’s a huge amount of data to send with a db request.
Goof #2: Although there’s a relationship between these tables, I had thought an __in field lookup, which asks Postgres to search for pks in the intermediary JOIN table, would be better than doing a full-blown JOIN.
- The parse time for such a very large SQL command was substantial
- The logging time was non-trivial, and gobbled up disk space like there was no tomorrow
- But the biggest problem was this: Within PostgreSQL, the JOIN execution path is far more optimized than the IN path. My assumption that I was saving Postgres work was False. I was, in fact, making more work for it.
The IN command will often be a longer — sometimes way longer — command than the equivalent JOIN. And the logs showed that this happened much more often than I had expected. Just the cumulative command transfer time alone was substantial.
I restructured the code to use a JOIN, and the performance went way, way up.
The lesson: Measure. Test. Be ready to learn that your assumptions are wrong. Be ready to learn that there’s a better way. And, as always, hire smart consultants to tell you what you’re doing wrong.