What to do about the queue?

By Serdar Yegulalp | 2016/04/02 10:00

MeTal controls all of its publishing actions through a job queue. Sometimes the queue stalls: one of the jobs causes an error of some kind, and the whole queue has to be stopped.

What's the best way to deal with a job in the publishing queue that fails? Here are the options I've come up with so far.

  1. Delete all the pending items in the queue, at least for that blog, and log an error.
  2. Delete everything in the queue, period, and log an error.
  3. Halt the queue and optionally delete the offending job.
  4. Skip the offending job and plow on ahead.

Here's where each of these options stands for me:

  1. From the point of view of handling things elegantly, this seems ugly, but also necessary, and I suspect this is going to end up being the least-worst option.
  2. Same as #1, but in fact this may be preferable. For instance, if we have something in one blog that potentially modifies another, we want all of those changes to happen as atomically as possible.
  3. Not a great idea because it means the stuff in the queue is just waiting to start up again and potentially cause the same problem.
  4. Ditto, and see #2 - not very atomic.

One other thing that's becoming clear is that operations to create and write files out to disk should also be done atomically, by way of a context that's similar to the ones we use for database operations. What I'm thinking is that writes should be built up in a list, and then upon successfully reaching the end of a given job and completing a transaction, we should write out all the files in question (testing first to make sure that they can be written). This way an incomplete operation doesn't write out things partway.
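Here's a rough sketch of what such a context could look like in Python. This is not MeTal's actual code: the `DeferredWrites` class and its `write` method are names made up for illustration. It assumes a job hands over page text instead of writing it directly, and that "testing first" can be done by staging each page as a temporary file next to its destination.

```python
import os
import tempfile


class DeferredWrites:
    """Hypothetical context: collect writes, perform them only on success."""

    def __init__(self):
        self.pending = []  # (path, text) pairs, in the order they were queued

    def write(self, path, text):
        # Nothing touches the disk yet; the write is only recorded.
        self.pending.append((path, text))

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self._commit()
        self.pending.clear()
        return False  # never swallow the job's own exception

    def _commit(self):
        staged = []  # (temp_path, final_path) pairs
        try:
            # Pass 1: stage every file. This is the "can it be written?" test.
            for path, text in self.pending:
                directory = os.path.dirname(path) or "."
                os.makedirs(directory, exist_ok=True)
                fd, tmp = tempfile.mkstemp(dir=directory)
                staged.append((tmp, path))
                with os.fdopen(fd, "w", encoding="utf-8") as handle:
                    handle.write(text)
        except Exception:
            # Something could not be staged: clean up and publish nothing.
            for tmp, _ in staged:
                os.remove(tmp)
            raise
        # Pass 2: move the staged files into place.
        for tmp, path in staged:
            os.replace(tmp, path)
```

A job would wrap its work in `with DeferredWrites() as files:` and call `files.write(path, text)` for each page. If the job raises partway through, nothing is written. The second pass is still not atomic across several files, and this sketch doesn't carry over the permissions of files it replaces, so it narrows the window for a half-finished publish without closing it entirely.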

If a queue job fails, it's usually for one of a few reasons:

  1. A broken template
  2. A permissions error (a page can't be written to disk)
  3. A larger problem like a database crash

The first of those three has proven to be the most common type of issue. A broken template wrecks everything, but on the plus side, I tend to find out about it pretty quickly.

In the meantime, the best thing I can do is make sure that a broken queue job at least doesn't hold things up too much. I will probably default to emptying the queue if things crash, at least for now, with the admin notified by email that queues aren't completing, and with some notes about the offending job. As time goes on, we'll work in more robust ways to recover from such disasters.
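To make that default concrete, here's a minimal sketch of the "empty the queue and tell the admin" behavior. Everything in it is an assumption for illustration, not MeTal's real queue: the `Job` record, the `run_queue` function, and the `notify_admin` callback (which stands in for whatever actually sends the email) are all hypothetical, and the queue is just an in-memory `deque`.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass
class Job:
    """Hypothetical queue entry; a real job would carry more detail."""
    blog_id: int
    name: str
    action: Callable[[], None]


def run_queue(queue, notify_admin):
    """Run jobs in order. On the first failure, empty the queue and report.

    Returns the number of jobs that completed successfully.
    """
    completed = 0
    while queue:
        job = queue.popleft()
        try:
            job.action()
        except Exception as exc:
            dropped = len(queue)
            queue.clear()  # option 2 from the list: delete everything
            notify_admin(
                f"Queue stopped: job {job.name!r} (blog {job.blog_id}) "
                f"failed with {type(exc).__name__}: {exc}. "
                f"{dropped} pending job(s) were discarded."
            )
            return completed
        completed += 1
    return completed
```

Calling `run_queue(deque(jobs), print)` is enough to try it out. To get option 1 instead, the `queue.clear()` line would be swapped for a filter that drops only the jobs whose `blog_id` matches the failed one.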

Tags: publishing queue
