
Lock and execute!

Exploiting ZooKeeper for managing processes with lockex

As an engineer here at Logentries I maintain a complex system that must remain available to our customers. We always build systems to be resilient to failure.

In our environment, we have processes and daemons which would benefit from a mechanism that guarantees only one instance runs at a time. Examples include a producer daemon of which only one instance may run, and cron jobs that need to run at least once from a single host in the environment. A more concrete example of the daemon case is a celery beat process responsible for scheduling customer billing reports.

Possible solutions to the above problems include modifying the application to support distributed locking, adopting modern platforms such as Mesos, or exposing the locking mechanisms of existing systems such as ZooKeeper. In our environment the latter was chosen as the path to building better systems.

We built a small application, lockex, which leverages our existing ZooKeeper infrastructure to provide locking and semaphores via the kazoo library. It can be found on GitHub and can be installed via pip or as a Debian package generated with dh-virtualenv.
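To give a flavour of what kazoo exposes, here is a minimal sketch of its lock recipe. This is not lockex's actual source; the lock path /lockex/demo and identifier are illustrative.

from kazoo.client import KazooClient
from kazoo.exceptions import LockTimeout
import subprocess

# Connect to the ZooKeeper ensemble (same address as the examples below)
zk = KazooClient(hosts="zk0.local.net:2181")
zk.start()

# kazoo's distributed lock recipe; "/lockex/demo" is an illustrative path
lock = zk.Lock("/lockex/demo", identifier="host1")
try:
    lock.acquire(timeout=1)  # give up after one second, like lockex -t 1
except LockTimeout:
    print("lock held elsewhere, not running the command")
else:
    try:
        # the guarded, user-supplied command
        subprocess.call(["my_garbage_collection_process", "--myoptions=true"])
    finally:
        lock.release()
zk.stop()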

lockex's command line interface was modeled loosely on sudo's, so that it fits into existing workflows and pipelines with as few modifications as possible. In short, to use lockex you need to know where your ZooKeeper cluster is and what command you wish to run.
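Putting that together, an invocation has the general shape below (TIMEOUT is in seconds; both options appear in the examples that follow):

$ lockex [-t TIMEOUT] --zkhosts HOST:PORT -- COMMAND [ARGS...]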

The use case of running jobs from cron

When a command must be run regularly from cron, but only once across a cluster of machines, it is typically pinned to a single 'reliable' host. That host then becomes a single point of failure: if it is down, the process is not started at all.

host1$ my_garbage_collection_process --myoptions=true

To make this more reliable, we can run the process on multiple hosts through lockex. On host1

host1$ lockex -t 1 --zkhosts zk0.local.net:2181 -- sleep 10 && my_garbage_collection_process --myoptions=true

Then on host2

host2$ lockex -t 1 --zkhosts zk0.local.net:2181 -- sleep 10 && my_garbage_collection_process --myoptions=true

What is happening above is that host1 and host2 are racing to acquire a lock from ZooKeeper. Once the lock has been acquired, the sleep command is executed before the garbage collection process runs. If the lock cannot be acquired within 1 second (the -t 1 option), lockex exits without executing the user-supplied command. The purpose of the sleep command is to act as a barrier: it holds the lock long enough to prevent another lockex instance from acquiring it, which is especially useful for commands with a short runtime. With this pattern, it is easy to add the same command on a number of hosts in your environment to ensure that there will always be at least one host that runs it.
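In practice this means installing an identical crontab entry on every participating host; the schedule here is illustrative:

# identical entry on host1, host2, ... (runs daily at 03:00)
0 3 * * * lockex -t 1 --zkhosts zk0.local.net:2181 -- sleep 10 && my_garbage_collection_process --myoptions=true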

The use case of adding a cold standby process to celery beat

Celery beat is the component that schedules tasks for celery workers. In general only one celery beat process should be running in a given environment; running more than one may lead to undesirable effects, such as tasks being scheduled more than once.

To provide continuous service with celery beat, we can use lockex to acquire a lock before execution, ensuring only one celery beat process is running.

On host1 we launch celery beat

host1$ lockex --zkhosts zk0.local.net:2181 -- $PROJECT/bin/celerybeat.py

On host2 we do the same thing

host2$ lockex --zkhosts zk0.local.net:2181 -- $PROJECT/bin/celerybeat.py

In the above, host1 acquires the lock and starts the celery beat process. On host2, lockex also tries to acquire the lock, but since it is already held it blocks until host1's lock is released. When lockex on host1 is sent a SIGTERM, it propagates the signal to its child processes and then releases the lock; host2 immediately acquires the lock and starts celery beat. In the window between the SIGTERM being sent and the cold standby starting up, some events may be missed. However, the time without a running celery beat is kept to a minimum and the service continues as normal.
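The mechanics of that hand-off can be sketched with kazoo directly. This toy script is not lockex's source, merely an illustration of the behaviour described above; the lock path is hypothetical.

import signal
import subprocess
import sys

from kazoo.client import KazooClient

zk = KazooClient(hosts="zk0.local.net:2181")
zk.start()

# "/lockex/celerybeat" is an illustrative lock path
lock = zk.Lock("/lockex/celerybeat")
lock.acquire()  # blocks until the current holder releases the lock

# run the user-supplied command, e.g. $PROJECT/bin/celerybeat.py
child = subprocess.Popen(sys.argv[1:])

def forward_sigterm(signum, frame):
    child.terminate()  # propagate SIGTERM to the child process

signal.signal(signal.SIGTERM, forward_sigterm)

child.wait()    # returns once the child has exited
lock.release()  # the blocked standby acquires immediately and starts its child
zk.stop()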

In summary

The patterns that lockex enables are useful as a stopgap measure for building available and reliable systems with minimal changes to existing code bases and applications. That said, lockex should be used judiciously: it buys engineers time to develop better solutions rather than replacing them. Exposing ZooKeeper functionality through kazoo was also a rather painless exercise.