
Retry failed Gearman jobs

This post was imported from my old WordPress installation. If you notice any display problems, broken links, or missing images, please just leave a comment here. Thanks.


Gearman is a great tool for farmed or distributed work. It's a kind of basic cloud handling: many worker processes (usually running on many different servers) are waiting for jobs to process, and there's (hopefully) always another worker taking over if one worker (server) should ever die. But what happens if a single job dies and needs to be repeated?

[Image: broken_gearman]

Let's have a short look at what Gearman does during a job's life:

Client                   Gearman                           Worker
----------------------   -------------------------------   -----------------------
Send a job
                         Store the job in the queue
                         Wait for a free worker slot
                                                           Ask for something to do
                         Send the job to the worker
                                                           Confirm the job
                                                           Run the job
                                                           Report the result
                         Delete the job from the queue
                         Report the result to the client
Receive the job result
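
For reference, here's roughly what that lifecycle looks like in Perl; a minimal sketch using the Gearman::Worker and Gearman::Client modules (the server address and the reverse_string function are made up for this example):

    ### worker.pl -- register a function, then wait for jobs forever
    use strict;
    use warnings;
    use Gearman::Worker;

    my $worker = Gearman::Worker->new;
    $worker->job_servers('127.0.0.1:4730');    # assumed gearmand address

    $worker->register_function(reverse_string => sub {
        my $job = shift;
        return scalar reverse $job->arg;       # "Run the job" / "Report the result"
    });

    $worker->work while 1;                     # "Ask for something to do"

    ### client.pl -- send a job and wait for the result
    use strict;
    use warnings;
    use Gearman::Client;

    my $client = Gearman::Client->new;
    $client->job_servers('127.0.0.1:4730');

    my $result_ref = $client->do_task('reverse_string', 'hello');
    print $$result_ref, "\n";                  # "Receive the job result"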
The client might also start a fire-and-forget job and ignore the job result (actually, the Gearman server won't even try to send a result to the client for those jobs, but we don't care about these details right now), which basically removes the last two lines from the table above.
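
In the Perl client, that would be dispatch_background() instead of do_task(); a short sketch, reusing the $client from above:

    # Fire-and-forget: no result is ever sent back to the client.
    my $handle = $client->dispatch_background('reverse_string', 'hello');

    # The handle can still be used to poll the job's status if needed:
    my $status = $client->get_status($handle);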

Gearman has some nice features. One of them is the ability to retry jobs when workers crash. Let's say the worker server running our job is the bottom-most server in a datacenter rack. Some emergency occurs somewhere else and the technical staff has to go to the datacenter, but there's nobody taking care of their baby. The baby is crawling around and suddenly discovers that great blinking box! It starts crawling towards those blinking lights and pulls itself up to get a better look as the lights suddenly disappear. The baby didn't know that its lovely soft baby hand had just pressed the power switch.

The worker server is gone now and so is the running job! We don't care about the baby at the moment, but look at the dispatcher: the worker process also disappeared, and as soon as the Gearman server realizes this fact, it starts to panic: a job disappeared while a client is waiting for it! This must not happen! But there is a solution: other workers (installed much higher up and out of the baby's reach) are still there, so it pushes the job to one of them. As soon as the job is done, its result is returned to the client, which had to wait a bit longer but finally got the result.

Risks

Any data manipulation done by a job might screw up data in the end. The job might INSERT a record INTO one table and then UPDATE a row counter in another. The baby error might occur between the two commands. The second worker also tries to do both, but either INSERTs a duplicate record or fails because some unique key is violated. No matter what happens, the data is no longer consistent and there might be a stale record left over from the first (interrupted) job run. (Transactions might solve this, but there are still many things that could go wrong if a job is started twice.)
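
One way to at least contain that damage is to wrap both statements in a single transaction, so an interrupted job leaves nothing half-written behind; a minimal sketch using DBI (the connection details and table names are invented for this example):

    use DBI;

    my $dbh = DBI->connect('dbi:mysql:example', 'user', 'password',
                           { RaiseError => 1, AutoCommit => 0 });

    sub process_job {
        my ($job) = @_;
        eval {
            $dbh->do('INSERT INTO items (payload) VALUES (?)', undef, $job->arg);
            $dbh->do('UPDATE stats SET row_count = row_count + 1');
            $dbh->commit;        # both changes become visible together
        };
        if ($@) {
            $dbh->rollback;      # an interrupted job leaves no stale record
            die $@;              # report the failure to the worker loop
        }
        return 1;
    }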

Forcing a retry

Some jobs should be retried intentionally: something happened and the job didn't finish, but it knows that a restart would be safe and much better than losing the job. Any worker process may return three different status codes to the Gearman dispatcher server (besides dying badly, which could be considered an additional type of "return status"):
  • WORK_COMPLETE
  • WORK_FAIL
  • WORK_EXCEPTION
The first one is used for successful jobs and the other two are used for failed jobs. WORK_FAIL is a bare status by itself, while WORK_EXCEPTION includes an error message. The Gearman::Worker Perl module will treat any defined return value from the job sub as WORK_COMPLETE, an undefined value as WORK_FAIL, and issue a WORK_EXCEPTION if the sub dies.
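
In Gearman::Worker terms, the three cases look like this (the function names are invented for the example, and $worker is the worker from the sketch above):

    $worker->register_function(ok_job => sub {
        return 'some result';    # defined return value -> WORK_COMPLETE
    });

    $worker->register_function(failing_job => sub {
        return undef;            # undefined return value -> WORK_FAIL
    });

    $worker->register_function(crashing_job => sub {
        die "something broke";   # die() -> WORK_EXCEPTION, message included
    });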

But all of these are "results" from a job, and any result will make the Gearman server consider the job "done", report the result to the client, and delete it from the queue; none of them will trigger a retry of the job on another server!

I did some tests and asked the all-knowing internet, but the only way of forcing a retry seems to be an exit call within the worker process, killing not only the running job but also the worker process. Herman reported that PHP requires an exit code above zero, but I can't confirm this for Perl.
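
A sketch of that exit trick in Perl (do_the_work is a hypothetical helper standing in for the real job code):

    $worker->register_function(retryable_job => sub {
        my $job    = shift;
        my $result = eval { do_the_work($job->arg) };

        # Dying without ever answering gearmand: the job is still in the
        # queue and will be handed to another worker. The whole worker
        # process is gone afterwards, though.
        exit 1 if $@;

        return $result;
    });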

Killing worker processes is always a bad idea, because they're supposed to be long-running (ideally forever-running) processes. Restarting them is usually very expensive compared to running a job, and a restart should not happen after (or within) every job.

There are some setups where every worker process runs one single job before exiting, but they're very, very rare. There, exiting before completing the job would be perfectly OK.

There seems to be only one real solution: add retry logic within the client or your own worker source. I usually wrap the Gearman::Worker module into a project module to provide a common environment for the worker subs; that would be a good place to start.
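
A minimal sketch of that wrapper idea; MyProject::Worker, $MAX_TRIES and register_retrying_function are invented names, and the retries happen in-process before a failure is ever reported to the server:

    package MyProject::Worker;

    use strict;
    use warnings;
    use Gearman::Worker;

    our $MAX_TRIES = 3;

    sub new {
        my ($class, @job_servers) = @_;
        my $worker = Gearman::Worker->new;
        $worker->job_servers(@job_servers);
        return bless { worker => $worker }, $class;
    }

    # Register a job sub that is retried in-process before giving up.
    sub register_retrying_function {
        my ($self, $name, $sub) = @_;
        $self->{worker}->register_function($name => sub {
            my $job = shift;
            for my $try (1 .. $MAX_TRIES) {
                my $result = eval { $sub->($job) };
                return $result if !$@ && defined $result;
                warn "job '$name' failed (try $try/$MAX_TRIES): "
                   . ($@ || 'undefined result');
            }
            return undef;   # all tries failed: the client sees WORK_FAIL
        });
    }

    sub run {
        my ($self) = @_;
        $self->{worker}->work while 1;
    }

    1;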


2 comments

  1. Hi!

Gearmand will restart any job that fails because the socket connection was closed. In libgearman this is implemented; I don't know about other libraries.

The problem you describe, i.e. where one bit of failed logic could follow completed logic, can be handled by building a pipeline of jobs that understand failure. You layer the workers to create a transactional space where you then handle rollback based on the error of any sub-jobs you create. This is a common solution, though for some reason folks don't always seem to leap to it.

    Cheers,
    -Brian

    • Sebastian

      Thank you, Brian, for pointing this out.
