MeTal controls all of its publishing actions through a job queue. Sometimes the queue doesn't finish running, because one of the jobs causes an error of some kind, and the whole queue has to be stopped.
What's the best way to deal with a job in the publishing queue that fails? Here are my thoughts so far.
Here's where each of these things stands for me:
One other thing that's becoming clear is that operations to create and write files out to disk should also be done atomically, by way of a context that's similar to the ones we use for database operations. What I'm thinking is that writes should be built up in a list, and then upon successfully reaching the end of a given job and completing a transaction, we should write out all the files in question (testing first to make sure that they can be written). This way an incomplete operation doesn't write out things partway.
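To make the idea concrete, here is a minimal sketch of such a context in Python. Everything in it is an assumption for illustration: the `DeferredWrites` class, its `write` method, and the stage-then-rename approach are hypothetical and not taken from MeTal's code. Writes are only recorded while the job runs. If the block finishes without an exception, each file is first staged as a temporary file next to its destination (which doubles as the "can it be written?" test) and only then moved into place.

```python
import os
import tempfile
from pathlib import Path


class DeferredWrites:
    """Hypothetical context: collect file writes, commit only on success."""

    def __init__(self):
        self.pending = []  # (Path, text) pairs, in the order they were queued

    def write(self, path, text):
        # Nothing touches the disk here; the write is only recorded.
        self.pending.append((Path(path), text))

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self._commit()
        else:
            self.pending.clear()  # the job failed, so discard its writes
        return False  # never swallow the job's own exception

    def _commit(self):
        staged = []  # (temp file name, final Path)
        try:
            # Stage every file first. If any one cannot be written,
            # no destination file has been touched yet.
            for path, text in self.pending:
                path.parent.mkdir(parents=True, exist_ok=True)
                fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
                staged.append((tmp, path))
                with os.fdopen(fd, "w", encoding="utf-8") as handle:
                    handle.write(text)
        except OSError:
            # Clean up whatever was staged, then report the failure.
            for tmp, _ in staged:
                if os.path.exists(tmp):
                    os.remove(tmp)
            raise
        # All staging succeeded, so move each file into place.
        for tmp, path in staged:
            os.replace(tmp, path)
        self.pending.clear()
```

A job would then wrap its work in `with DeferredWrites() as files:` and call `files.write(path, text)` instead of opening files directly. This is not fully atomic across many files, since a crash during the final loop of renames could still leave some files updated and others not, but it narrows the window considerably. Note too that `tempfile.mkstemp` creates files readable only by the owner on POSIX systems, so a real version would likely need to set permissions before the rename.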
If a queue job fails, it's usually for one of a few reasons:
The first of those three has proven to be the most common type of issue. A broken template wrecks everything, but on the plus side, I tend to find out about it pretty quickly.
In the meantime, the best thing I can do is make sure that a broken queue job at least doesn't hold things up too much. I will probably default to emptying the queue if things crash, at least for now, with the admin notified by email that queues aren't completing, and with some notes about the offending job. As time goes on, we'll work in more robust ways to recover from such disasters.
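Here is a rough sketch of what that default could look like, again in Python. The names (`run_queue`, `build_failure_report`, `send_via_smtp`), the addresses, and the use of a plain `deque` as the queue are all assumptions made for the example, not MeTal's actual queue code. The first job that raises causes the rest of the queue to be emptied, and a report about the offending job goes to the admin.

```python
import smtplib
import traceback
from collections import deque
from email.message import EmailMessage


def build_failure_report(job, error, dropped, admin="admin@example.com"):
    """Compose the notification; the addresses are placeholders."""
    msg = EmailMessage()
    msg["Subject"] = "Publishing queue stopped: job failed"
    msg["From"] = "metal@example.com"
    msg["To"] = admin
    details = "".join(
        traceback.format_exception(type(error), error, error.__traceback__)
    )
    msg.set_content(
        f"Offending job: {job!r}\n"
        f"Error: {error!r}\n"
        f"Jobs dropped from the queue: {dropped}\n\n"
        f"{details}"
    )
    return msg


def send_via_smtp(msg, host="localhost"):
    # Assumes a mail server is listening on the given host.
    with smtplib.SMTP(host) as server:
        server.send_message(msg)


def run_queue(queue, run_job, notify=send_via_smtp):
    """Run jobs in order; on the first failure, empty the queue and notify.

    Returns the number of jobs that completed.
    """
    completed = 0
    while queue:
        job = queue.popleft()
        try:
            run_job(job)
        except Exception as error:
            dropped = len(queue)
            queue.clear()  # don't let the remaining jobs hold things up
            notify(build_failure_report(job, error, dropped))
            return completed
        completed += 1
    return completed
```

Passing `notify` as a parameter keeps the mail step swappable, so the same loop can log to a file or collect reports in a list during testing, for example `run_queue(deque(jobs), publish, notify=reports.append)`, where `jobs`, `publish`, and `reports` are stand-ins of your own.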