Web scripts usually do one thing at a time, which is fine as long as there is not too much to do. But serving thousands of requests per minute, with thousands of source code lines per request, starts getting challenging. Bon Jovi offered help for web scripts some time ago by writing his song "(Who's gonna) work for the working man".
New, faster, better, cheaper hardware is released nearly every week (day?), and the sky is the limit as long as one can just buy more CPU power and memory to cover increasing source code complexity and request counts. Most web applications will never ever reach the sky (which is the limit), but could still benefit from multitasking. Three examples:
Not for your eyes
Complex web scripts do some work to calculate the next page shown to the user, but also do many things which don't change the resulting web page at all, like logging statistics data, sending emails or updating database records whose result isn't relevant for this page. There is no need for these tasks to be performed while an impatient user is waiting for the result of his click. Remember - web scripts live in fractions of seconds (well, they should).
I'm not strong enough
A web server is always a mixture of high CPU power, enough memory, disks for quick log file access and high bandwidth for fast database and internet connections. Some tasks don't require all of this, like resizing a user picture, extracting preview pictures from a video file or compressing uploaded data for storage. They need huge CPU power but little (if any) disk space and speed. Running those jobs on a web server will not only slow down the request itself but also all other requests due to high CPU load. The job may run at reduced priority, giving other jobs more room to breathe, but often all other parts of the request (the source code besides the resizing snippet) will take the same time as (or less than) the resizing process.
It's a long way home
Some work requires time. More time than a request should take, but still many websites just let the user wait for 10, 30, 60 or more seconds. Some of them show a "processing will take time, don't click this button twice" message or a waiting animation. Nice, but the user is still waiting, the request may get canceled and everything is lost. If anything goes wrong, the user often doesn't know whether anything was done at all or whether the request was successful. Why not start the work, show a waiting page to the user and poll the web server for the result at regular intervals? Any other script on the site could quickly check whether an unreported result is waiting and show a notification to the user as soon as the long running job is done.
Threads
Perl started to support threads a long, long time ago and I did some threading tests as soon as our distribution provided a threading Perl. Threading is cool but dangerous: developers must have a deep understanding of what's going on (they should in any case, but threads are unforgiving about mistakes).
Threading has two other major drawbacks: you can't use it in persistent environments like mod_perl or FastCGI, because the Perl part is only a guest within a parent process - which may be using threads itself, maybe even for running multiple Perl web workers in parallel. The other drawback is the wealth of Perl modules: many aren't thread-safe, or nobody knows whether they're safe or not.
Threads will usually benefit from multiple CPU cores (while a single process runs on a single core), but can't use more than one server. A single box with four, eight or sixteen CPU cores is the limit - and 16 cores in one server aren't cheap at all.
Fork, Spoon and Knife
Spooning and knifing aren't quite common among developers, but forking is an option: split a running process into two like single-cell microorganisms do, but easily controllable by the developer. The new child won't disturb the parent and vice versa, but forking also has some drawbacks: it doubles the required memory (clearly something to consider if your typical project modules take 100 MB just after use'ing them), and all variables are duplicated, which makes life easier most of the time but could also create unexpected side effects.
Network and database connections are really critical: the child will close them (at the latest on exit) and the parent's copy will also be closed without any notice to the parent. But the child also must not try to use them, as that would mix up the connection state known to the parent.
mod_perl and FastCGI environments also don't work well with fork, because everything is copied, including the Apache handler process or the FastCGI main loop, and the parent would end up with two processes sitting in one slot.
Finally, there is no good way for the child to inform the parent of any result: they're different processes and must use some kind of inter-process communication.
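The classic workaround for that last point is to create a pipe before forking and let the child write its result into it. A minimal sketch (the "job" is a made-up placeholder calculation; POSIX::_exit is used so the child skips destructors that would otherwise close handles shared with the parent):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX ();

# Create the pipe before forking so both processes hold one end.
pipe(my $reader, my $writer) or die "pipe failed: $!";

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {
    # Child: close the unused end, do the heavy work, report back.
    close $reader;
    my $answer = 6 * 7;            # stands in for the real job
    print {$writer} "$answer\n";
    close $writer;
    # _exit() skips global destruction, so shared database or
    # network handles stay untouched from the parent's point of view.
    POSIX::_exit(0);
}

# Parent: close the unused end, read the child's result, reap it.
close $writer;
my $result = <$reader>;
chomp $result;
waitpid($pid, 0);
print "child computed: $result\n";
```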
Open, system, nohup
Most languages can easily execute other processes, and a simple "&" at the end of a command line will send the new process to the background. Prepend "nohup" to finally disconnect STDIN, STDOUT and STDERR (or do it yourself within the child) and forget the child; it won't share anything with the parent and many of the drawbacks are gone. One-way signaling is easy if the child is started using "open" and either input or output is piped back to the parent.
But starting a new (Perl) process is an expensive thing, as the interpreter has to be loaded again and every module used by the new process has to be compiled (again). Passing arguments isn't easy and the risk of opening injection holes is high (just imagine system "foo.pl $original_filename_of_uploaded_file &"; and a client sending ;rm -rf * as the filename). Finally, all children will run on the web server, even if they might use other CPU cores.
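The injection risk disappears when no shell is involved at all: Perl's list forms of open and system pass each argument directly to the child process. A minimal sketch (the child here is just an inline perl -e one-liner standing in for a real resize script):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hostile input - harmless below, because no shell ever sees it.
my $filename = '; rm -rf *';

# List-form open('-|'): the arguments go straight to exec(),
# so $filename arrives as one literal argument, not shell code.
open(my $child, '-|',
     'perl', '-e', 'print "resized: $ARGV[0]\n"', $filename)
    or die "can't start child: $!";
my $reply = <$child>;
close $child;
print $reply;
```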
Standing in line
Building a job queue is easy: add a database table (queue) and push (INSERT) every new job which has to be done into it. A cronjob or even a daemon could process the queue on another server. Adding database triggers (which inform the client once a new row has been added, instead of letting the client search for new jobs) isn't easy but may speed things up, and with additional work many workers on many servers could even process the same queue in parallel.
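A minimal sketch of such a queue using DBI - all table and column names are made up for illustration, and an in-memory SQLite database (DBD::SQLite) stands in for the real server:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do(q{
    CREATE TABLE queue (
        id    INTEGER PRIMARY KEY AUTOINCREMENT,
        task  TEXT NOT NULL,
        arg   TEXT,
        done  INTEGER NOT NULL DEFAULT 0
    )
});

# Web script side: push the job and return to the user immediately.
$dbh->do('INSERT INTO queue (task, arg) VALUES (?, ?)',
         undef, 'resize_image', '/uploads/photo.jpg');

# Cronjob/daemon side (usually another process on another server):
# fetch the oldest open job, do the work, mark it done.
my $job = $dbh->selectrow_hashref(
    'SELECT id, task, arg FROM queue WHERE done = 0 ORDER BY id LIMIT 1');
if ($job) {
    # ... do the actual work for $job->{task} here ...
    $dbh->do('UPDATE queue SET done = 1 WHERE id = ?', undef, $job->{id});
}
```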
But databases aren't built for this: most don't perform well on a table which gets thousands of INSERTs, UPDATEs and DELETEs per minute, and things start getting worse if the database is replicated. Keeping MySQL in sync is nearly impossible (Seconds_Behind_Master and the binary log position aren't accurate) and every new slave has to have enough power to handle all writing operations plus additional resources for read requests.
Throwing a massive amount of developer time and hardware cost at the problem might reduce it and shorten the time between job insertion and processing start (as well as the way back once the work is done), but I doubt it would be safe to rely on this while the user is waiting for the reply.
The Diary of Jane
...probably wouldn't fill up a single web server, but all the others do. LiveJournal isn't as big as Facebook but created a lot of impressive technologies - including Gearman, a job dispatcher system.
It consists of three parts: a client, a (Gearman) dispatcher server and as many worker tasks as you like. Every web script, program or anything else able to open a network socket may act as a client. Basically, a client is someone who wants to give away part of his work (the working man in Bon Jovi's song).
The Gearman dispatcher server simply accepts all the work from the clients, queues up everything, handles priorities and results (if there is a result at all) and keeps track of the progress of the running jobs. It's a simple and small piece of software, not doing anything more than necessary, but doing a good job at its part.
The workers may be written in any language (not necessarily the same as the clients); they sit around until the Gearman dispatcher has something for them to do. Once they get a job, they do the work for the working man, pass their results back to the dispatcher - and settle down to wait for the next job.
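In Perl the three parts look roughly like this sketch using the Gearman::Worker and Gearman::Client CPAN modules. The function name, the file path and the dispatcher address (127.0.0.1:4730) are made-up assumptions, and the sketch needs a running gearmand to actually do anything:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# --- worker.pl: registers a function and waits for jobs ---
use Gearman::Worker;
my $worker = Gearman::Worker->new;
$worker->job_servers('127.0.0.1:4730');
$worker->register_function(resize_image => sub {
    my $job      = shift;
    my $filename = $job->arg;
    # ... do the actual resizing here ...
    return "resized $filename";
});
$worker->work while 1;       # block forever, one job after another

# --- client.pl: the web script hands work over and moves on ---
use Gearman::Client;
my $client = Gearman::Client->new;
$client->job_servers('127.0.0.1:4730');

# Fire-and-forget: the user never waits for the resizing.
$client->dispatch_background(resize_image => '/uploads/photo.jpg');

# Or wait for the result (returned as a scalar ref) when needed:
my $result_ref = $client->do_task(resize_image => '/uploads/photo.jpg');
```

In a real setup the worker and client live in separate scripts on separate machines; they are shown side by side here only to keep the sketch in one place.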
The Gearman solution also has drawbacks - it involves network communication, and passing huge amounts of data requires all involved servers to push it around. But the workers run completely independently from the clients/web scripts; they don't share any network or database connections, filehandles or variables.
Workers may run on any number of servers and CPU cores without any additional development time, and the dispatchers may run redundantly as well. The worker tasks start once (loading the interpreter), load their modules once and connect to the database(s) once - no need to do this again for every new job and no risk of mixing up parent and child.
A small environment may use some worker tasks and a dispatcher task running on the web server box, but would scale to many workers on different servers and two or more dispatchers without any problem. The dispatcher(s) don't know about workers until they register themselves - no need to configure IP addresses and ports for the workers on the dispatcher side - and the dispatcher silently forgets a worker once it has died, optionally assigning its last task to another one. Workers may run on cheap hardware, with no need to buy 4-CPU boards or RAID disk arrays.
A really big environment may even provide a diskless boot service for worker servers. Connect a new box to your network, power it on and it will boot from the network and start one or more workers which register with the dispatchers. Once the rush hour is done, worker servers could finish their running jobs and shut down, waiting for a wake-on-LAN signal to come back for the next rush. Sorry, just dreaming...
There are still drawbacks: what happens if the (running) worker source changes? You have to take care of detecting and loading the new version yourself, you can't pass files from client to worker without a network filesystem or shared drive, and the client shouldn't wait forever if the result is to be shown to the user.
To be continued...
A co-worker suggested using some kind of dispatcher solution to solve our scaling problems at work not long ago. I liked the idea and started to read about it, work on implementing things and provide a simple framework taking care of our project-dependent details. Gearman and its Perl modules turned out to be much easier to set up and use than expected (even if the documentation is not as detailed as I'd like it to be), and today we're using it not only within cronjobs or our internal admin websites but also on public sites which earn our money and must be up and running 24/7. We experienced some smaller problems, but much fewer than expected.
I'll write about setting up Gearman, the clients and workers in another post following shortly.
PS: This post features 4 songs - did you find all of them? :-)