Monthly Archives: December 2010

The Importance of Being Tested

In addition to the regular posts documenting features of 0.6, giving hints and tips about its usage, release announcements and so forth, I’ll also be posting insights and anecdotes about Upstart’s ongoing development.  A particular story cropped up again this month, and I thought I’d share it with you.

When I began work on Upstart, one of the earliest decisions I made was to make sure the code was very well covered by a comprehensive test suite.  I’d been working with Robert Collins a lot in the previous couple of years and he is very much an advocate of practices such as Extreme Programming (XP) and Agile Development; especially the discipline of Test Driven Development.

I’d also recently seen a keynote by Andrew Tridgell in which he talked about some of the development of Samba 4, in particular the high use of both test cases and code generation in that code-base.  Something he said in the keynote stuck with me: “untested code is broken code”.

Statistics obviously depend on exactly how you count lines of code, but using a simple semi-colon count the combined source code of libnih and Upstart is slightly over 20,000 lines of code.  The combined source code of the test suite for both is slightly over 120,000 lines of code.
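For the curious, a semicolon count of this sort can be approximated from the shell.  This is only a rough sketch: the count_semicolons name is made up for illustration, it assumes the sources are *.c and *.h files, and it counts every semicolon, including those in comments, strings and for (;;) headers.

```shell
#!/bin/sh
# count_semicolons DIR (hypothetical helper, not part of Upstart or libnih):
# print the number of ';' characters in all *.c and *.h files below DIR.
count_semicolons() {
    find "$1" -type f \( -name '*.c' -o -name '*.h' \) -exec cat {} + |
        tr -cd ';' |     # delete everything that is not a semicolon
        wc -c |          # what is left is one byte per semicolon
        tr -d ' '        # some wc implementations pad the number with spaces
}
```

Running it once over the source directory and once over the test directory would give two numbers to compare, in the same spirit as the figures above.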

The init daemon is an extremely important part of a Linux system: if it crashes then you’re left with a kernel panic; if it simply misbehaves, you’re left with merely severe problems.  Not only was I changing it, but I was replacing a very simple, dumb system (Sys V init) with something comparatively complex, with rules and behaviours that needed rigorous testing.

It would have been very scary to have developed it without the careful testing, and I would have been very worried if anyone had agreed to replace such a core component of the system without this test suite to back up its behaviour.

That being said, maintaining the test suite can be a huge burden.  Don’t believe what anybody tells you: if you’re writing test cases as well as code, then your pace of development slows.  They’re right that you spend a lot less time debugging of course, but unlike in the commercial software business, free software developers tend to release first and debug later.  If you use a similarly high test-to-code ratio in your own project, then you’ll find that the time until your first release will be pretty long, and the time between releases longer as well.

Another decision is whether to do Test Driven Development or not; that discipline requires that you always write the tests first, to fail, and only write code in order to make the tests pass.  I’m not a fan of TDD, and I’ve no problem admitting that I mostly did not use it for Upstart.  My gut feel is that TDD produces code that hangs, swings and loops just to deal with testing.  It also just doesn’t suit my coding style: I like to write code from the middle outwards, the function API is the last thing I tend to fix, where TDD forces it to be the first.

I’m also not convinced TDD is really suitable for a language like C; it’s pretty hard to get a test case to compile, run and fail without writing any supporting code such as a header file, etc.

I have found TDD useful when the code really does break down into a single unit with a well-defined and obvious API, where the inputs and outputs were obvious but the algorithm for getting between them wasn’t at the time.

What I’ve tended to do instead is write code the way I naturally would, and write test cases alongside to run the code and make sure it’s working.  As the code grows more complex, more test cases appear for it.  One big advantage of this is that I don’t need to reboot or fire up a VM as much; I can exercise a large proportion of Upstart’s operation through the test suite.

Now, onto the stories.  There are two similar ones.

One of the side-effects of testing Upstart so strongly is that the tests exercise not only the code I’ve written but also code in libraries and even in the kernel.  One particular set of tests covered the code in libnih and Upstart that handles watching the configuration directory for changes; it’s this code that means Upstart automatically reloads jobs when you edit them, without needing an explicit signal.

One day these test cases started failing without warning.  Investigation showed that they passed fine under older kernels, but with the newest kernel update to Ubuntu, they failed.

The inotify subsystem in the kernel had undergone a radical overhaul and rewrite.  Rather than being its own code, it was completely rebased onto the new fsnotify system.  Fortunately I was aware of this, and after careful checking that it was indeed the kernel behaviour that was now incorrect (and that it wasn’t incorrect before), I got in touch with Eric Paris, the author of the new code, and was able to give him minimal example code to replicate the problem.

inotify: check filename before dropping repeat events

This was a while ago, but pretty much the same story happened again recently, just this time not with the kernel.

Again, the story started with Upstart’s test suite failing.  The engineer who first noticed it assumed it was an issue with the new build daemon and disabled the test for the time being.  The test was in the part of the code testing Upstart’s interaction with D-Bus.

Now, sometimes I write tests to deal with corner-cases and “what if” scenarios that I dream up.  This isn’t always about testing my code; often it’s a case of finding out whether something is really possible, or whether that thing misbehaves.  These tests still stay in the suite of course.

A particular set of tests was intended to find out what happened if the D-Bus daemon crashed during initial connection.  I considered this fairly important because at times the libdbus library has called exit() or abort() when things happened that it didn’t like.  If you call either of those from the init daemon, the kernel panics.

These tests had worked fine for a couple of years (actually at the time I had to fix bugs in libdbus to make them pass) but now one of these tests was breaking.  The disconnection was causing SIGPIPE to be delivered to the test.

Again, this turned out to be due to a change to D-Bus.  Lennart Poettering had been working on some changes to avoid libdbus’s awkward SIGPIPE handling and replace it with the use of the MSG_NOSIGNAL flag.  Unfortunately he’d missed a case in the authentication code.  The side-effect was that if the D-Bus daemon had crashed, been killed, OOM’d, etc. during initial connection, the connecting application would have gone too.  Especially bad for an init daemon.

Fortunately Upstart’s test suite caught it, and the fix was simple.

sysdeps-unix: use MSG_NOSIGNAL when sending creds

(reposted from http://upstart.at/2010/12/20/the-importance-of-being-tested/ – post comments there)

Events are like Methods

In last week’s post I talked about how Events can be treated like Signals; this week we’ll be looking at how Events can be treated like Methods.  That might seem a little surprising, since normally one considers signals and methods as very different things, but to Upstart they are both just events.

What do I mean by Methods?  You’ve almost certainly done some kind of programming, even if just a little scripting, so you should know about methods or functions.

In contrast to signals, which are just a notification that something happened on the system, a method is a request for the system to do something on your behalf.  Usually to make some kind of change to the system state.

Likewise in contrast to the signals where you don’t care about the result, for a method you want to wait for the changes to be completed and perhaps even be notified if the method failed.

It’s just as easy to implement a method in Upstart as it is to implement something that considers an event a signal.  Here’s an example of how you might implement a suspend method:

start on suspend

task
exec pm suspend

That doesn’t look much different from a signal; the only new stanza in this is task (and that’s not necessary for a method either).  So what happens if we want to trigger a suspend?  We use the command:

root@worldofwarcraft:~# initctl emit suspend

The difference from emitting a signal, as demonstrated in the previous post, is that we aren’t using the --no-wait flag.

So we emit the suspend event, and Upstart will start our job as a result; but initctl emit will not return immediately, it waits for the results of the event to complete before it returns.

Because we used the task stanza in the configuration, we’ve told Upstart that the process we execute is expected to take a limited amount of time and then finish by itself.  This means that Upstart will not believe the job is complete until the process has exited, and will continue to block the event while it is still running.

Finally if the command exited with an error, that error is propagated back to the event that started it, and the initctl emit command will exit with an error code.
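Because the result comes back as the exit status of initctl emit, a script can treat an event much like a function call.  The following is a minimal sketch rather than anything shipped with Upstart: call_method and the EMIT override are names invented here, and EMIT exists only so the logic can be tried on a machine without Upstart.

```shell
#!/bin/sh
# EMIT is an illustrative override; by default it is the real command
# used in this post.
EMIT=${EMIT:-initctl emit}

# call_method EVENT [KEY=VALUE]...   (hypothetical wrapper name)
call_method() {
    # No --no-wait here, so this blocks until the event has been handled.
    if $EMIT "$@"; then
        echo "$1: completed"
    else
        status=$?
        echo "$1: failed with exit status $status" >&2
        return "$status"
    fi
}
```

With that defined, call_method suspend would return only once the suspend job had finished, and would pass any failure on to the caller.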

So now we can use Upstart events and jobs for two different purposes; we can announce changes to the system, and we can use them as methods to make changes to the system.

The most typical event that is used as a method on your system is the runlevel event, used to change the runlevel for System-V compatibility and generally emitted by the telinit and shutdown tools.  The /etc/init/rc.conf script that handles it can be pretty simple and looks not unlike the suspend example above:

start on runlevel [0123456]

task
exec /etc/init.d/rc $RUNLEVEL

What happens if you don’t include task?  Well, that means Upstart will consider the job as ready when the process executed is running, and the event will be unblocked and initctl emit will return.  If the service fails to start, then initctl will return with an error.  This is great for methods that start (or stop) services.

Side-note: the start and stop commands act very much like method events, they block until the service is running or the task has finished and they return errors as well.  However they’re not actually implemented as events right now, an oversight I intend to correct in Upstart 2.

(reposted from http://upstart.at/2010/12/16/events-are-like-methods/ – post comments there)

Event matching in Upstart

A little while ago I was asked to solve a problem that somebody was having with Upstart, and I realised that people weren’t understanding how things were actually working and were just muddling along when doing event matching in jobs.  This is unfortunate, because it hides some of Upstart’s true power, so I thought it high time I actually explained this.

Let’s start with a simple example.  Fire up any Linux distribution with Upstart 0.6 (the current releases of Ubuntu or Fedora will do) and create a file named /etc/init/example1.conf with the following content:

start on surprise

This is pretty simple: it’s a job that does nothing except declare that it’s started when the surprise event happens.  We can demonstrate that it works by emitting the event ourselves and checking the status of the job before and afterwards:

root@angrybirds:/etc/init# status example1
example1 stop/waiting
root@angrybirds:/etc/init# initctl emit surprise
root@angrybirds:/etc/init# status example1
example1 start/running

Nothing too surprising after all, I hope.  The job did indeed start on the surprise event, and would now be running if we’d actually told Upstart to run something.

Incidentally, I’m often asked why there isn’t a single list of events anywhere.  That’s because you can match any event you like as long as you know something emits it; events are supposed to come from all manner of sources.  I do try to document them though: try running man 7 startup on your system to see an example of an event’s man page.

If events were just names, they’d be pretty boring.  Events can also have attached environment variables, and these get put into the environment of any job’s process started by the event.  Here’s /etc/init/example2.conf:

start on weather

script
    echo $KIND > /tmp/weather
end script

This will now run a small shell script that outputs the $KIND environment variable to a file.  This isn’t set anywhere, but we can pass it in the event.

root@angrybirds:/etc/init# cat /tmp/weather
cat: /tmp/weather: No such file or directory
root@angrybirds:/etc/init# initctl emit weather KIND=RAIN
root@angrybirds:/etc/init# cat /tmp/weather
RAIN

Ok, these are just examples but there are plenty of useful events on your system right now which carry environment variables such as which network interface just came up, and so on.

If you wanted to only run on a certain type of weather, you might think to check the value of $KIND within the script.  You could do that, but it’s inefficient; ideally you don’t want your script run at all.  Fortunately we can match the environment of an event in the job easily enough.  Here’s /etc/init/example3.conf:

start on weather KIND=snow

Hopefully you’ll figure that this one will only start if it’s snowing, and you’d be right:

root@angrybirds:/etc/init# status example3
example3 stop/waiting
root@angrybirds:/etc/init# initctl emit weather KIND=hail
root@angrybirds:/etc/init# status example3
example3 stop/waiting
root@angrybirds:/etc/init# initctl emit weather KIND=snow
root@angrybirds:/etc/init# status example3
example3 start/running

Events can have more than one environment variable, and you can have more than one match:

start on weather KIND=rain INTENSITY=heavy

The matches are actually globs, so you can use * and ? in them; and as well as =, there’s also !=.
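If you want a feel for how such globs behave before putting them in a job, the shell’s own pattern matching follows the same family of rules.  This is a sketch under that assumption: glob_match is an illustrative helper, not an Upstart command, and finer details such as character classes may not behave identically in Upstart.

```shell
#!/bin/sh
# glob_match VALUE PATTERN (illustrative helper): succeed if VALUE matches
# the glob PATTERN, using the shell's own pattern matching.
glob_match() {
    case $1 in
        $2) return 0 ;;   # unquoted, so * and ? act as wildcards
        *)  return 1 ;;
    esac
}

glob_match snow 's*'   && echo "KIND=s* matches snow"
glob_match rain 'r?in' && echo "KIND=r?in matches rain"
glob_match hail 's*'   || echo "KIND=s* does not match hail"
# In a job, != simply inverts the result: KIND!=s* would match hail.
```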

One good use for != is in the stop on stanza.  As well as being available to the job’s processes, the job’s environment variables can also be used in other stanzas within the job.  Here’s a cute example for /etc/init/example4.conf:

start on weather KIND=rain or weather KIND=snow
stop on weather KIND!=$KIND

This one takes a bit of explaining.  First of all to start the job we match the weather event with $KIND set to either rain or snow.  Now we supply a condition to stop the job, and we also match the weather event with a given value of $KIND except this time we match what looks like itself.

In fact this expansion of $KIND is the value that variable had when the job was started, not the value in the new event.  It says to stop the job if it stops raining, or stops snowing depending on which of the two started it.  Most importantly, if an event simply repeats the same kind of weather, but maybe with a different intensity, the job carries on running (but it doesn’t have its environment updated – UNIX can’t do that).

root@angrybirds:/etc/init# status example4
example4 stop/waiting
root@angrybirds:/etc/init# initctl emit weather KIND=rain INTENSITY=heavy
root@angrybirds:/etc/init# status example4
example4 start/running
root@angrybirds:/etc/init# initctl emit weather KIND=rain INTENSITY=light
root@angrybirds:/etc/init# status example4
example4 start/running
root@angrybirds:/etc/init# initctl emit weather KIND=sun
root@angrybirds:/etc/init# status example4
example4 stop/waiting

Ok, last fake example before we get onto the fun bits.  Remember the example from above:

start on weather KIND=rain INTENSITY=heavy

Upstart lets us shortcut this a little.  The environment variables are specified in a particular order on the initctl command-line, and if we know what that order is, we can match on position alone without naming the variable.  So as long as we know a weather event always has a KIND followed by an INTENSITY, we could shorten that to:

start on weather rain heavy
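To make the positional rule concrete, here is a toy model of it in shell.  It is not Upstart’s code: positional_match is a made-up name, values containing spaces are not handled, and the behaviour when there are more patterns than variables is my assumption of the sensible outcome (no match).

```shell
#!/bin/sh
# positional_match "ENV" PATTERN...   (toy model, not Upstart's code)
# ENV is the event's variables in emission order, for example
# "KIND=rain INTENSITY=heavy".  Each PATTERN is compared with the value
# found at the same position.
positional_match() {
    event_env=$1
    shift
    for pair in $event_env; do
        [ $# -gt 0 ] || break      # no patterns left: later variables are unconstrained
        case ${pair#*=} in
            $1) shift ;;           # this position matches, move on to the next pattern
            *)  return 1 ;;
        esac
    done
    [ $# -eq 0 ]                   # assumed: a pattern with no variable to match fails
}
```

So positional_match "KIND=rain INTENSITY=heavy" rain heavy succeeds, while swapping the two patterns fails, which is why knowing the documented order matters.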

If you’ve used Upstart at all, you’ve seen that shortcut before.  A lot.  You may not have even realised it was a shortcut at all, and that’s what I hope to fix here.

Here’s an example of where you’ve used that:

start on started dbus

You should hopefully now recognise that started is the name of the event there, and dbus is simply the value of its first argument, whatever that might be.  Remember I mentioned that events have man pages?  Take a look at man 7 started, which is the man page for this event.

It documents which environment variables are attached to the started event, and most importantly what order they come in.

started JOB=JOB INSTANCE=INSTANCE [ENV]...

So when we wrote the earlier example, we were really just using a shortcut for:

start on started JOB=dbus

You might wonder what difference this makes.  A good example of how to exploit this is the stopped event.  If you look at its man page (man 7 stopped) you’ll see it has a large number of environment variables specifying not only which job stopped but the reason for it stopping.  One of those is the exit signal, for example.

Now you know that you’re just matching the $JOB environment variable, it’s obvious that you don’t have to!  You can match any other environment variable or variables in the event, or none at all.

Here’s how to run a script if any other job on the system exits with a segmentation fault:

start on stopped EXIT_SIGNAL=SEGV
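As a sketch of what the script run by such a job might look like, the following reports which job died.  It assumes the stopped event’s JOB and EXIT_SIGNAL variables are present in the job’s environment, as the man page describes; report_crash is a name invented for this example, and the fallback values are there only so the snippet can be run by hand.

```shell
#!/bin/sh
# report_crash (hypothetical helper): describe the job that just stopped,
# based on variables expected to come from the stopped event.
report_crash() {
    echo "job ${JOB:-unknown} was killed by SIG${EXIT_SIGNAL:-?}"
}
```

Inside the job you would probably send the message somewhere durable, for example by piping report_crash into logger, since a job’s standard output may well be discarded.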

I said you didn’t have to match any variables at all, just as we didn’t in the first examples, and there’s a neat use for that with the job events.  The starting event blocks the named job from actually starting until anything run by it is started; or, in the case of jobs marked task, finished.

Here’s a little job that runs every time another job is started, and blocks that job from actually starting until the script finishes.

start on starting
task

script
    ....
end script

Useful both for debugging and performance analysis.

Now for the really neat bit.  So far we’ve concentrated on the environment variables that come from events, and those that Upstart puts into the job events.  But we can influence these in rather useful ways.

Firstly we can declare a default value for an environment variable in a job; if no alternate value is given in the start event or command, then this default value wins:

start on mounted

env MOUNTPOINT=/tmp
script
    ....
end script

This script will run for each occurrence of the mounted event, and will hopefully get the value for $MOUNTPOINT from that event.  But should the value be missing from the event, or the script be started manually by a system administrator, a default value is provided.

This isn’t a contrived example; it’s from the job on your system that cleans up the /tmp directory on boot.  The default value wasn’t there in earlier versions of Ubuntu, and this had a rather disastrous side-effect when run by hand.

Ok, we can set the values of environment variables from a job, and we don’t have to match the job name in the usual job events.  We can combine these two facts in a very interesting way, because we can export the value of a job’s environment variable into its job events.

Here’s the first job:

env AM_A_DISPLAY_MANAGER=1
export AM_A_DISPLAY_MANAGER

This sets the default value of $AM_A_DISPLAY_MANAGER, but this isn’t a variable we ever expect to be supplied by an event so it just gets passed into the environment of its processes.  It’s not that useful either on its own.

The export line is the useful one: it adds the value of the named environment variable to the job’s events, that is, the starting, started, stopping and stopped events.

Now, in another job, we can do:

start on started AM_A_DISPLAY_MANAGER=1

This is run when any job is started that has that environment variable in its events.  In other words, we can tag classes of services so we don’t have to list every single one.

And because everything in Upstart is the same fundamental type of thing, this can work in the opposite direction.  For example we can put in our job:

env NEED_PORTMAP=1
export NEED_PORTMAP

This means our events will have NEED_PORTMAP=1 in them.  Now, remembering that the job waits for the side-effects of the starting event to complete, we can write in /etc/init/portmap.conf:

start on starting NEED_PORTMAP=1

So we can implement a dependency-based init system with Upstart, an event-based init system.

I look forward to finding out what else you can do with it.

(reposted from http://upstart.at/2010/12/03/event-matching-in-upstart/ – post comments there)