At work, we’ve contracted with PostgreSQL Experts to help us improve our Postgres performance. After analyzing our system, one of their consultants, Christophe Pettus, found glaring problems in how some of my code accessed our database.
I consider myself well-informed about good database access practices in Django, and in general. I might not exactly hit the bull’s-eye, but I’m sufficiently savvy to avoid making a “WTF” mistake, right?
Goof #1: Bad iterator() use
We use Celery for background asynchronous tasks. One task periodically updates our full-text search engine. It wakes up, sees if there’s work to be done, and if so launches 20 sub-tasks (across 10 servers) to do the work.
This loop is in each sub-task:
# Iterate through some items in need of indexing, making a # document for each one. When we've accumulated a handful, # ship them off to the server. vitals = Vital.objects.filter(id__range=(id_start, id_end), fts_handled=False).iterator() for item in vitals: do_stuff() do_more_stuff() # etc...
Simple and trivial — right?
I used iterator() for the standard reason: A large number of rows could be returned. But, consider: This loop iterates very quickly — it does some text parsing and extraction, with no other I/O activity. Twenty of these sub-tasks means there are 20 of these loops simultaneously executing.
The performance data showed iterator() was a problem. While it uses less memory on the slave, it generates a request on every iteration. Multiplied by 20. By comparison, a non-iterator() loop would generate only one QuerySet for the entire sub-task. I.e.: Twenty requests returning data blocks, as opposed to a bazillion requests returning just one row.
After reconsidering the QuerySet’s size, I removed the iterator(). Postgres’ workload significantly decreased, and the tasks’ memory use wasn’t that bad after all.
My sins: Not considering the combined effect of a fast loop replicated in multiple tasks, and premature optimization.
Goof #2: QuerySets aren’t in-memory filtered
Two other asynchronous tasks periodically wake up, check if a particular table needs updating, and do the update. One launches 20 sub-tasks, while the other updates in situ.
Both have the following kind of loop. (I’m showing only the relevant code. Don’t be distracted by the goofy names — I want to anonymize this.):
def subtask(start, end): """The sub-task main loop.""" things = Foobar.objects.filter(id__range=(start, end), updated=False).iterator() for thing in things: # do some work.... _private_update(thing) # do some other work... def _private_update(thing): """Update a row with new information.""" from .models import TIMERANGE_CHOICES # For every timerange in this class... for timerange in [x for x in TIMERANGE_CHOICES]: # Extract the ending year endyear = int(timerange[4:]) # Calculate the four variants. p_a = Vital.objects.filter(year__lte=endyear, owners=thing) value = foobar(p_a) if value: thing.most_pa[timerange] = value p = p_a.filter(fieldb='') value = foobar(p) if value: thing.most_p[timerange] = value . p = p_a.exclude(fieldb='') value = foobar(p) if value: thing.most_g[timerange] = value p = p.filter(expire_date__gt=date(endyear, 12, 31)) value = foobar(p) if value: thing.most_up[timerange] = value thing.save()
The iterator() is again a problem, although minor this time. The “things” loop does more work than the previous example, so it didn’t beat on Postgres as much. But “more is more” and the filter() result’s memory use turned out to not be so bad. So, again, I had done a premature optimization, resulting in a little more work for Postgres because it got hit once per iteration instead of once per task invocation.
The major problem is in _private_update(). It iterates over four QuerySet references. QuerySets are lazy and cached, and I thought “cached” included caching for use in other QuerySets. Since the latter three QuerySets derive from the “p_a” QuerySet, I thought Postgres got hit only once per iteration, because the “p_a” QuerySet was reused.
Way Wrong. Each QuerySet is a new QuerySet and hits the database! The documentation does imply this: “Filter…returns a new QuerySet.” A new QuerySet means a new QuerySet. The “root” QuerySet (the Vital.objects.filter()) is not reused in the subsequent database statements, and so this loop generates four database requests, not one. And that’s across simultaneous tasks.
I replaced the latter three QuerySets with list comprehensions. Presto, database operations drop by 75%.
My sins: Not being curious about when the QuerySet cache is used, careless reading, and not considering the effects of multiple simultaneous tasks.
Performance monitoring is good. Smart experts are good. Never think you’re as smart as you think you are. (That might be a contradiction…)