Gearman is a great tool for farmed or distributed work. It's a kind of basic cloud handling: many worker processes (usually running on many different servers) are waiting for jobs to process, and there's (hopefully) always another worker to take over if one worker (or its server) should ever die. But what happens if a single job dies and should be repeated?
Let's take a short look at what Gearman does during a job's life:
|Client|Dispatcher|Worker|
|Send a job| | |
| |Store the job in the queue| |
| |Wait for a free worker slot| |
| | |Ask for something to do|
| |Send the job to the worker| |
| | |Confirm the job|
| | |Run the job|
| | |Report the result|
| |Delete the job from the queue| |
| |Report the result to the client| |
|Receive the job result| | |
Gearman has some nice features. One of them is the ability to retry jobs when a worker crashes. Let's say the worker server running our job is the bottom-most server in a datacenter rack. Some emergency occurs somewhere else and the technical staff has to go to the datacenter, but there's nobody to take care of the baby. The baby is crawling around and suddenly discovers that great blinking box! It starts crawling toward those blinking lights and pulls itself up to get a better look, when the lights suddenly disappear. The baby didn't know that its lovely soft baby hand had just pressed the power switch.
The worker server is gone now and so is the running job! We don't care about the baby at the moment, but look at the dispatcher: the worker process has disappeared too, and as soon as the Gearman server realizes this, it starts to panic: a job disappeared while a client is waiting for it! This must not happen! But there is a solution: other workers (installed much higher and out of the baby's reach) are still there, so the server pushes the job to one of them. As soon as the job is done, its result is returned to the client, which had to wait a bit longer but finally got the result.
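The requeue behaviour described above can be illustrated with a small in-memory toy. This is only a sketch: `ToyDispatcher`, its method names, the worker names and the job name are all hypothetical and are not part of Gearman or any Gearman client library. It merely mimics the idea that a job stays with the server until a result is reported.

```python
from collections import deque


class ToyDispatcher:
    """Hypothetical in-memory stand-in for a job server; not the Gearman API."""

    def __init__(self):
        self.queue = deque()   # jobs waiting for a free worker
        self.running = {}      # worker name -> job currently assigned to it

    def submit(self, job):
        """A client sends a job; it is stored in the queue."""
        self.queue.append(job)

    def grab_job(self, worker):
        """A worker asks for something to do."""
        if not self.queue:
            return None
        job = self.queue.popleft()
        self.running[worker] = job
        return job

    def work_complete(self, worker, result):
        """The worker reports a result: the job is done and forgotten."""
        job = self.running.pop(worker)
        return job, result

    def worker_lost(self, worker):
        """The worker vanished: put its unfinished job back in front of the queue."""
        job = self.running.pop(worker, None)
        if job is not None:
            self.queue.appendleft(job)


dispatcher = ToyDispatcher()
dispatcher.submit("resize-image-42")

# The bottom-most server picks up the job, then loses power mid-run.
first_try = dispatcher.grab_job("rack-bottom")
dispatcher.worker_lost("rack-bottom")

# A worker higher up in the rack gets the very same job and finishes it.
job = dispatcher.grab_job("rack-top")
finished = dispatcher.work_complete("rack-top", "ok")
print(finished)
```

Note that only `worker_lost` puts a job back into the queue; `work_complete` always removes it.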
Risks
Any data manipulation done by a job might end up corrupting data. The job might INSERT a record INTO one table and UPDATE another to increase a row count. The baby incident might occur between the two commands. The second worker then tries to run both commands again, but it either INSERTs a duplicate record or fails because a unique key is violated. No matter what happens, the data is no longer consistent and there might be a stale record left over from the first (interrupted) job run. (Transactions might solve this, but there are still many things that could go wrong if a job is started twice.)
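The INSERT/UPDATE problem can be reproduced with Python's built-in `sqlite3` module. This is a minimal sketch under assumptions: the `orders` and `stats` tables, the `job` and `run` functions and the `PowerLoss` exception are made up for illustration, and the "crash" is simulated by raising an exception between the two statements.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode,
# so transactions only exist where BEGIN/COMMIT are issued explicitly.
db = sqlite3.connect(":memory:", isolation_level=None)
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
db.execute("CREATE TABLE stats (name TEXT PRIMARY KEY, row_count INTEGER)")
db.execute("INSERT INTO stats VALUES ('orders', 0)")


class PowerLoss(Exception):
    """Stands in for the worker server being switched off."""


def job(order_id, crash_between=False, use_transaction=False):
    if use_transaction:
        db.execute("BEGIN")
    try:
        db.execute("INSERT INTO orders VALUES (?, 'book')", (order_id,))
        if crash_between:
            raise PowerLoss()
        db.execute("UPDATE stats SET row_count = row_count + 1 WHERE name = 'orders'")
        if use_transaction:
            db.execute("COMMIT")
    except Exception:
        # After a real power loss nobody runs this; the database itself
        # discards the uncommitted transaction instead.
        if use_transaction:
            db.execute("ROLLBACK")
        raise


def run(order_id, **kwargs):
    try:
        job(order_id, **kwargs)
        return "ok"
    except PowerLoss:
        return "interrupted"
    except sqlite3.IntegrityError:
        return "duplicate key"


# Without a transaction: the first run leaves a stale row, the retry fails.
plain_first = run(1, crash_between=True)
plain_retry = run(1)

# With a transaction: the interrupted run leaves nothing behind, the retry works.
tx_first = run(2, crash_between=True, use_transaction=True)
tx_retry = run(2, use_transaction=True)

order_ids = [row[0] for row in db.execute("SELECT id FROM orders ORDER BY id")]
row_count = db.execute("SELECT row_count FROM stats").fetchone()[0]
print(plain_first, plain_retry, tx_first, tx_retry, order_ids, row_count)
```

After the script runs, order 1 exists in `orders` but was never counted in `stats`, while order 2 is both stored and counted. The transaction only protects the database, though; side effects outside of it (mails, files, calls to other services) would still happen twice.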
Forcing a retry
Some jobs should be retried intentionally: something happened and the job didn't finish, but it knows that a restart would be safe and much better than losing the job. Any worker process may return three different status codes to the Gearman dispatcher server (besides dying badly, which could be considered an additional type of "return status"). In Gearman protocol terms, a job ends with WORK_COMPLETE, WORK_FAIL, or WORK_EXCEPTION.
But all of them are "results" from a job, and any result will make the Gearman server consider the job "done", report the result to the client and delete the job from the queue. None of them will trigger a retry of the job on another server!
I did some tests and asked the all-knowing internet, but the only way of forcing a retry seems to be an exit call within the worker process, which kills not only the running job but also the worker process. Herman reported that PHP requires a non-zero exit code, but I can't confirm this for Perl.
Killing worker processes is always a bad idea, because they're supposed to be long (forever) running processes. Restarting them is usually very expensive compared to running a job and a restart should not happen after (or within) every job.
There are some situations where every worker process runs one single job before exiting, but they're very, very rare. In those cases, exiting before completing the job would be perfectly OK.
There seems to be only one solution: add a retry function to the client or to your own worker code. I usually wrap the Gearman::Worker module in a project module to add a common environment for the worker subs. That would be a good place to start.
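To make the idea of a retry function in the worker wrapper concrete, here is a language-neutral sketch written in Python rather than Perl. Everything in it is an assumption made for illustration: the `RetryJob` exception, the `with_retries` decorator and `flaky_job` are hypothetical names, and no Gearman library is involved. The wrapper re-runs the job function inside the same worker process, so the process never has to exit.

```python
import functools


class RetryJob(Exception):
    """Raised by a job function that knows a restart would be safe."""


def with_retries(max_attempts=3):
    """Wrap a job function so it is re-run in-process when it raises RetryJob."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except RetryJob:
                    if attempt == max_attempts:
                        raise  # give up; the caller reports a normal failure
        return wrapper
    return decorator


calls = []  # records every attempt, only for demonstration


@with_retries(max_attempts=3)
def flaky_job(payload):
    calls.append(payload)
    if len(calls) < 3:
        raise RetryJob("upstream not ready yet")
    return payload.upper()


result = flaky_job("hello")
print(result, len(calls))
```

This retries on the same worker, not on another server, so it helps with jobs that fail for transient reasons but not with a worker whose machine is broken. For that case, the client would have to resubmit the job after receiving the failure.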