In between the busyness of moving office, bug fixing Nightvibes and doing some dev on our new company website, I was fortunate enough to spend some time in the past few weeks on a cool upcoming project at Inlight that came with an interesting challenge. A key part of the project required a queuing solution so that intensive image processing tasks could be performed at a more appropriate time than while an HTTP request is being answered. Finding the best solution was the challenge.

Over a surprisingly delicious lunch at Taste of Malaysia, Tony, our friend Anthony and I discussed how to approach the architecture of the queuing system. We wanted to be able to scale easily to many worker machines processing items from the queue. At a finer level of detail we discussed database-level locking, and Anthony mentioned the delayed_job project as an implementation example.

Tony and I continued the conversation and played around with the idea of rolling our own custom solution, such as a PHP layer between the database and the worker machines. I even had the audacity to suggest we allow the workers direct access to our database - although we smartly dismissed this idea quickly. As with most instances in... well, pretty much everything, there are rarely new ideas or problems that others haven't thought of or faced. Research resulted in the exciting discovery of Beanstalk - a simple queue service for web apps.

What is Beanstalk?

Beanstalk makes it very easy to implement a scalable queuing system for any web app - there are client libraries for most major languages. It was initially developed to scale the Causes on Facebook app.

Beanstalk works by having a server and one or more clients. The server maintains a queue of "jobs" that have states ("ready", "reserved", "delayed", "buried"), priorities (0 is the highest, and larger numbers mean lower priority) and payloads (the content/data to process). Jobs are added to "tubes", which are like channels that clients can watch for new jobs of a particular category. Clients can add new jobs to the server, process existing ones and delete them when they are done.

An example would be a web app that generates custom PDF newsletters. The web app would implement a lightweight client that pushes a new job to the server with a "ready" state in the "newsletter" tube and with a payload of the data to populate the PDF with. A separate client shell/daemon running continuously on another server would be watching the "newsletter" tube and would be told when a new job has arrived. It would then "reserve" that job, process it and either delete it on success or "bury" it if it fails (it can then be restored at a later stage).

That's the Beanstalk protocol in a nutshell. If you want more detail, the protocol documentation is extensive and insightful.
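To make the job description above a little more concrete, here is a minimal sketch of what a client sends when it adds a job, based on the "put" command in the protocol documentation. The function name build_put_command and the choice of JSON for the payload are my own assumptions for illustration; real client libraries wrap this for you.

```php
<?php
// Hypothetical helper (not part of any library): builds the raw text of a
// Beanstalk "put" command. Nothing is sent over the network here.
function build_put_command($payload, $priority = 1024, $delay = 0, $ttr = 120)
{
    // The server treats the payload as opaque bytes; JSON is just one
    // convenient encoding and is an assumption of this sketch.
    $body = json_encode($payload);

    // Wire format: put <pri> <delay> <ttr> <bytes>\r\n<data>\r\n
    return sprintf(
        "put %d %d %d %d\r\n%s\r\n",
        $priority,
        $delay,
        $ttr,
        strlen($body),
        $body
    );
}

// A low priority newsletter job (0 would be the most urgent).
$command = build_put_command(array('subject' => 'Monthly Newsletter'), 5000);
```

The byte count in the first line lets the server know exactly how much payload to read, which is why the payload itself can contain anything, including line breaks.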

As with most of my personal and work projects, the framework of choice is CakePHP, so this post is tailored towards it accordingly. However, the concepts are much the same for other languages and particularly for framework-less PHP projects.

Installing Beanstalk

These are slightly updated instructions based on those provided by SeventyTwo. Beanstalk requires Libevent. Download and extract a release of Libevent, then run the following.


cd /extracted/libevent/directory
./configure
make
sudo make install

Download the latest release of Beanstalk.


cd /extracted/beanstalk/directory
./configure
make
sudo make install

Check that it's all working by running beanstalkd -h in Terminal; you should be presented with the available help options. You'll want to keep Beanstalk running for the rest of the setup, as it'll allow you to add new queue items from the CakePHP web app and handle them using the PHP worker clients. Just execute beanstalkd in Terminal to start the server (add the -d flag, where your release supports it, to run it as a detached process).
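If you would rather confirm from PHP that the server is reachable, something like the following sketch could be used. It opens a plain socket and sends the protocol's "stats" command. The function name and the 127.0.0.1 default host are assumptions for illustration; 11300 is the port Beanstalk listens on by default.

```php
<?php
// Hypothetical helper: returns true if something that answers the Beanstalk
// protocol is listening on $host:$port, false otherwise.
function beanstalkd_is_running($host = '127.0.0.1', $port = 11300, $timeout = 1.0)
{
    // Suppress the warning PHP raises when the connection is refused.
    $socket = @fsockopen($host, $port, $errno, $errstr, $timeout);
    if ($socket === false) {
        return false;
    }

    // Do not hang forever if the peer accepts the connection but never replies.
    stream_set_timeout($socket, 1);
    fwrite($socket, "stats\r\n");
    $reply = fgets($socket);
    fclose($socket);

    // A healthy server answers "OK <bytes>" followed by the statistics.
    return is_string($reply) && strpos($reply, 'OK ') === 0;
}

echo beanstalkd_is_running() ? "beanstalkd is up\n" : "beanstalkd is not reachable\n";
```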

Setting up the CakePHP Plugin

David Persson has kindly offered up a solid CakePHP plugin on github for Beanstalk. Go ahead and grab it and I'll run you through getting it working in your CakePHP app.

These instructions are for CakePHP 1.3x. Including datasources from plugins caused issues for me in CakePHP 1.2x. The difference looks to be the updates made to connection_manager.php in 1.3x, which allow datasources to be loaded from plugin folders. If you've yet to use 1.3x, check out how to have both 1.2x and 1.3x set up on your machine.

Firstly, copy and paste the extracted directory into your app/plugins folder and rename it to "queue". Add the following to app/config/database.php.


var $queue = array(
	'datasource' => 'Queue.BeanstalkdSource',
	'host' => '0.0.0.0', // address of the machine running beanstalkd
	'port' => 11300, // beanstalkd's default port
);

Then add the following to locations where you want to add new jobs to Beanstalk, customising the payload to match the job you want to create.


$payload = array('key1' => 'value1', 'key2' => 'value2');
ClassRegistry::init('Queue.Job')->put($payload);

You can also specify options such as priority and delay values, as well as the tube to insert the job into. If you have jobs you think will take longer than 120 seconds (the default) to run, you should also increase the ttr (time to run) value; otherwise the server will assume the worker has died and put the job back in the queue.


$payload = array(
	'subject' => 'Monthly Newsletter',
	'email' => 'john.smith@example.com',
	'message' => 'Lorem ipsum dolor sit amet, consectetur adipisicing elit.'
);
$options = array(
	'priority' => 5000, // fairly low given "0" is highest priority
	'tube' => 'newsletter'
);
ClassRegistry::init('Queue.Job')->put($payload, $options);
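If jobs are added from several places, the options could be gathered in one small helper so every call site shares the same defaults. This is only a sketch: the helper name job_options, the 'delay' and 'ttr' key names and the default priority of 1024 are assumptions on my part, so check them against the plugin before relying on them. Only the 120 second ttr default and the 'priority' and 'tube' keys come from the snippets above.

```php
<?php
// Hypothetical helper, not part of the queue plugin: merges per-job options
// over one shared set of defaults.
function job_options($overrides = array())
{
    $defaults = array(
        'priority' => 1024,      // assumed default; 0 is the most urgent
        'delay'    => 0,         // assumed key: seconds before the job becomes "ready"
        'ttr'      => 120,       // assumed key: seconds a worker may hold the job
        'tube'     => 'default',
    );

    // For string keys, later arrays win in array_merge, so overrides
    // replace the defaults they name and leave the rest alone.
    return array_merge($defaults, $overrides);
}

// A slow image job: low priority, ten minutes to run, its own tube.
$options = job_options(array('priority' => 5000, 'ttr' => 600, 'tube' => 'image'));

// With the plugin loaded, this would be passed straight through:
// ClassRegistry::init('Queue.Job')->put($payload, $options);
```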

You can check that jobs are being added using the shell cake queue statistics in Terminal. It'll give you a dump of information about the current beanstalk state. Look over it, specifically focusing on these lines to see if jobs are being added:


current-jobs-urgent: 1
current-jobs-ready: 1
current-jobs-reserved: 0
current-jobs-delayed: 0
current-jobs-buried: 0
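If you want to check these numbers from a script rather than by eye, the "key: value" lines are easy to parse. The sketch below assumes you have already captured the statistics output as a string; the function name parse_queue_stats is made up for this example.

```php
<?php
// Hypothetical helper: turns "key: value" statistics lines into an array.
function parse_queue_stats($text)
{
    $stats = array();
    foreach (preg_split('/\R/', $text) as $line) {
        // Skip blank lines and anything that is not shaped like "key: value".
        if (!preg_match('/^([a-z0-9-]+):\s*(.*)$/i', trim($line), $m)) {
            continue;
        }
        // Convert numeric values to numbers so they can be compared and summed.
        $stats[$m[1]] = is_numeric($m[2]) ? $m[2] + 0 : $m[2];
    }
    return $stats;
}

$stats = parse_queue_stats(
    "current-jobs-urgent: 1\n" .
    "current-jobs-ready: 1\n" .
    "current-jobs-buried: 0\n"
);

if ($stats['current-jobs-buried'] > 0) {
    echo "Some jobs failed and were buried\n";
}
```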

CakePHP Example

You can download a CakePHP (version 1.3) sample application that demonstrates this code and the folder structure of the plugin, etc. Visit /example/newsletter and /example/image to add the different types of jobs to the beanstalk queue. The only files to really focus on are config/database.php, plugins/queue, and controllers/example_controller.php. The rest are the same stock CakePHP files.

Processing the Jobs

So now that there are jobs in your queue, it's a good time to set up how you'll handle and process them. Firstly, you should already have beanstalkd running in a Terminal (you won't have been able to add queue items otherwise). The CakePHP queue plugin has a tidy implementation of a worker client that can easily be run as a CakePHP shell: just type cake queue and select from the options under "worker". You can also set the worker and the tube to watch in one line with something like cake queue debug_worker default, which will start the debug_worker monitoring the "default" tube.

This approach, however, tightly couples your CakePHP code and your worker code, meaning both have to be on every machine you want to run a worker on. That's not good: you don't want to be passing around your source code and potentially opening your site up to exploits. The better option is a separate PHP-only worker script that doesn't rely on the CakePHP framework, which also makes the worker script more portable. The script would run as a constantly running daemon that waits for remote queue jobs to enter the beanstalk "tubes".

Pure PHP Beanstalk Workers

Having only just started extensively using different tubes in some code I'm hacking away at, I realised that this PHP worker code was not correctly switching tubes and was always falling back to the "default" tube. The problematic code was on line 22 of worker.php: it called the method use_tube where it needed to call watch. In Beanstalk, "use" selects the tube a client puts new jobs into, while "watch" selects the tubes a client reserves jobs from, so a worker has to watch. This confusing method naming cost me many hours until I was enlightened by comment number 7 by avip on a Google Groups forum.

I've updated the code downloads accordingly; the changed line is shown below:

	
// was this:
// $this->Beanstalk->use_tube($this->config['tube']);
// now is this:
$this->Beanstalk->watch($this->config['tube']);
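To see why the two calls are not interchangeable, it can help to look at the protocol commands each side sends. The sketch below only builds the command text and sends nothing; the function names are hypothetical, while "use", "watch", "ignore" and "reserve" are commands from the Beanstalk protocol.

```php
<?php
// Hypothetical helpers that only build protocol text.

// Producer side: "use" picks the tube that later "put" commands go into.
// It has no effect on which jobs "reserve" hands back.
function producer_commands($tube)
{
    return array("use $tube\r\n");
}

// Consumer side: "watch" adds a tube to the list that "reserve" pulls from.
// Every connection starts out watching "default", so ignore it when only
// jobs from $tube are wanted.
function consumer_commands($tube)
{
    $commands = array("watch $tube\r\n");
    if ($tube !== 'default') {
        $commands[] = "ignore default\r\n";
    }
    $commands[] = "reserve\r\n";
    return $commands;
}

print_r(consumer_commands('newsletter'));
```

A worker that only ever sends "use" is still watching just the "default" tube, which matches the behaviour described above.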
	
	

With separation a priority, I've put together a small framework for pure PHP workers - lightweight and portable. Feel free to grab a copy to look through while we discuss it.

The basic idea behind the folder structure is that you have a simple runner script that initialises the worker you want to use. The worker is a subclass of a superclass worker (worker.php) located in the /workers directory. The subclass should only override the task($job) method with the custom processing required. So when you fire off the runner script with the Terminal command php runner.php, the specified subclassed worker will run and execute the code you wrote in the task method every time a new job appears.
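The shape of that superclass/subclass split can be sketched as follows. This is not the code from the download: apart from task($job), every class and method name here is hypothetical, and an in-memory stand-in replaces the real Beanstalk connection so the example runs on its own.

```php
<?php
// Stand-in for a real Beanstalk connection (hypothetical, for illustration).
class InMemoryQueue
{
    public $ready = array();
    public $deleted = array();
    public $buried = array();

    // Returns the next job, or null when the queue is empty.
    public function reserve() { return array_shift($this->ready); }
    public function delete($job) { $this->deleted[] = $job; }
    public function bury($job) { $this->buried[] = $job; }
}

// Superclass: owns the reserve / task / delete-or-bury loop.
abstract class Worker
{
    protected $queue;

    public function __construct($queue)
    {
        $this->queue = $queue;
    }

    // Subclasses override only this. Return true on success.
    abstract public function task($job);

    public function run()
    {
        // A real daemon would block on reserve forever; this sketch
        // stops as soon as the stand-in queue is empty.
        while (($job = $this->queue->reserve()) !== null) {
            try {
                $ok = $this->task($job);
            } catch (Exception $e) {
                $ok = false;
            }
            if ($ok) {
                $this->queue->delete($job); // done, remove it for good
            } else {
                $this->queue->bury($job);   // keep it aside for inspection
            }
        }
    }
}

class NewsletterWorker extends Worker
{
    public function task($job)
    {
        // Real processing would go here; this only checks the payload.
        return isset($job['email']);
    }
}

// What a runner script might do.
$queue = new InMemoryQueue();
$queue->ready = array(
    array('email' => 'john.smith@example.com'),
    array('subject' => 'no recipient'),
);
$worker = new NewsletterWorker($queue);
$worker->run();
```

Keeping the loop in the superclass means each new kind of job only needs a small subclass, and the delete-or-bury decision is made in one place.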

There is also the /lib directory, which stores the Beanstalk PHP Class from Sourceforge and other classes that might be useful when writing your task method - it's simply a directory for separating and reusing logic. I've included a Console class to make it easier to write comments out to Terminal and a REST class, which is really just a set of basic POST and GET wrappers that you might need. Adding classes that can, for instance, translate JSON and XML, upload to Amazon S3 or send emails will really start to bring power to your Beanstalk job processing.

In Summary

Beanstalk is all about deferring the heavy lifting of a request to a more appropriate time so that our users don't get held up unnecessarily. As the code above shows, it's simple to integrate into both new and existing CakePHP projects. With the pure PHP workers you also have the power of an unlimited sea of workers ready to handle your jobs as they come flying in. The potential of the technology is exciting and it's quite easy to imagine the many uses it has. If you are using it or start using it, let us know what cool things you are up to. As always, feel free to throw down questions or comments to discuss. Enjoy.