Category Archives: Technology
iOS App Design Notes
Since iOS apps are now worth a billion dollars, here are a few design notes worth bearing in mind when writing your new app:
- Users love receiving push notifications, and love it when you take them to the appropriate page in your app, but they hate it if they see the notification there. Make sure that the page you show contains stale information and does not automatically update to show the notification they just tapped on.
- Increase the number of times users open your app by keeping them in a continual state of confusion as to whether they have any unread notifications. At no time should the notifications shown in the iOS Notification tray, the number of notifications shown in the app icon badge, the number of notifications in your app’s own button badge for notifications, or indeed the list of notifications in your app match.
- For additional bonus confusion, none of the above should match the set or number of notifications the user would see if they opened the notification list on a computer.
- Users love pulling down to refresh. Even if your app is only ever used in a “read new things” kind of mode, always force the user to pull down to refresh, never update for them.
- Your app is only ever going to be used online, on a super-fast internet connection. Never cache data between invocations. Always clear the screen when loading the new data, users love seeing a spinner and a “Loading” message when on a slow connection.
- If you build your app around WebView you may be forced to accept that there will be some caching, fight against it and avoid a consistent experience at all costs by never saving this cache between invocations. Since it’s almost random whether your app will remain in memory between uses, this means it will be almost random whether your user sees stale cached data (which you don’t auto-refresh, obviously) or a white loading screen while you pull all new data.
- Tap targets should be as small as possible, testing in the Simulator with your mouse proves that Apple’s 44px guideline is way too large!
- The app should regularly reflow its content so that tap targets move under the user’s finger between the time they start their finger moving and the time it reaches the screen.
- Using a WebView for your app is great, the lack of JIT and other performance improvements means you don’t need too many servers to serve content to your app. Users love it when shared photos show up as black boxes because the servers are too busy.
- Using a WebView also means you can continually adjust the layout of your page. For the ultimate rapid deployment, never cache resources such as CSS or images client-side. Users love that retro look and feel when your page renders without them on a slow connection.
- Your app probably has to let users post content too. Users are always on fast, reliable data networks. Don’t worry about handling error cases, timeouts, etc.; it’s OK to just throw an error, and ideally throw away their post too. Never double-check whether the post succeeded: it’s impossible for the post to go through and the response to be lost.
- Whenever Apple release a new API feature, make sure you update your code to take advantage of it. Apple’s documentation can be kinda waffly though, don’t bother reading all of it, you never need to handle the edge cases like someone taking a phone call on their phone (who does that?) or receiving a notification while you’re dragging your neat side bar out. Users love it when that side bar sticks half way, it’s hilarious.
For a perfect score on all of the above tips, I highly recommend taking a look at Facebook’s iPhone app.
git smash
A common thing I end up doing while working on code is to make a series of commits, and then end up with changes in my working directory which I need to apply to an earlier revision in the history than the top-most one.
One common way to do this is to make a temporary commit with those changes, then use git rebase -i and move that commit below the one I want to amend and choose fixup to have it applied.
But that’s annoying manual work. There’s a more fun way. I have this script in my path as git-smash, it takes a revision as a single argument, e.g. git smash deadbeef:
git reset --keep "$1"
EDITOR=true git commit -a --amend
git checkout HEAD@{2}
git rebase --onto HEAD@{1} HEAD@{2}
This resets the revision history, keeping local changes, back to the given revision. Unfortunately git reset doesn’t have a mode which preserves the index so we then have to use commit -a to capture all of the local changes.
Now we use the reflog (history of revisions in the working tree) to manipulate the tree back to the previous state, first checking out the revision that was two back (before the amended commit and the reset, i.e. where we began). Then we rebase that onto the revision one back (before the checkout, i.e. the amended revision) using the revision that’s now two back (before the checkout and commit, i.e. the original revision we changed).
Mental gymnastics over, this is the same as what we were doing before, just in one handy command.
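For comparison, git can get to much the same place with a fixup commit and an autosquashing rebase. What follows is only a sketch of that alternative, not the script above: the function name git_smash_autosquash is made up for illustration, it assumes a git that supports commit --fixup and rebase --autosquash, and it assumes the target revision is not the root commit.

```shell
#!/bin/sh
# Sketch: fold uncommitted changes to tracked files into an earlier commit.
# git_smash_autosquash is a hypothetical name, not a git built-in.
git_smash_autosquash() {
    target=$1

    # Record the working-tree changes as a "fixup!" commit aimed at $target.
    git commit -a --fixup="$target" || return 1

    # Replay history starting from the target's parent. --autosquash moves
    # the fixup commit directly after $target and marks it to be folded in;
    # GIT_SEQUENCE_EDITOR=true accepts the generated todo list unedited.
    GIT_SEQUENCE_EDITOR=true git rebase -i --autosquash "${target}^"
}

# Usage, from inside a repository with local changes:
#   git_smash_autosquash deadbeef
```

Like the reflog version, this can stop with conflicts if later commits touch the same lines, in which case the usual rebase --continue or --abort applies.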
Git still sucks though.
A new release process for Ubuntu?
With the nomination period beginning for the Ubuntu Technical Board, big changes like Unity having arrived in Ubuntu recently, and the upcoming UDS being for what will likely be a new LTS release of Ubuntu, it’s as good a time as any to ask big questions about the development process, challenge assumptions, and make suggestions for big changes.
Cadence
The Ubuntu release process is well known, and its developers talk regularly about the cadence of it. A new release of Ubuntu comes out every six months, and each release follows a predictable pattern. I’ve stolen the following image from OMG! Ubuntu’s recent series about Ubuntu Development.

Each developer working on Ubuntu follows this cycle. When Ubuntu 11.10 is released on October 13th, they’ll begin again. After they recover, of course.
First there’ll be a bit of a wait for the archive to be open, this gets quicker and quicker each release but since it depends on a toolchain being built and other similarly fundamental things, it tends to be a period where most people figure out what they’re going to discuss at UDS.
UDS is a bit late for the 12.04 cycle, so the merge period will probably occupy developer time both before and after UDS. This isn’t represented on Daniel’s chart above, but this is the time when massive amounts of updates arrive from Debian; it’s a time of great instability for Ubuntu. At some point there will be an Alpha 1, but you won’t want to try and install that.
Planning for UDS is going to take up some time, as will writing up the results of the plans afterwards and turning them into work items. There’s also a UDS Hangover which nobody (except Robbie Williamson, when drafting the 10.10 Release Cycle) seems to like to talk about. Nothing gets done in the week or two following UDS; everybody is too wiped out.
So realistically speaking, development of features for 12.04 is going to start around mid-November at the earliest. And by features I mean the big headline things in Ubuntu; like Unity, like the Software Center, like the Installer. These things are important to get right.
Pretending for a moment that features are developed over the winter holidays like Thanksgiving, Christmas and New Year, you’ve got clear development time until Feature Freeze. The 12.04 Release Schedule isn’t published yet, but I figure that’s going to be somewhere around February 16th after which everyone switches to bug fixing and release testing.
That’s just 13 weeks of development time!
Chaos
So you’re an Ubuntu developer working on features for the upcoming release, you don’t have anywhere near as much time as you’d expect to actually do the development work. What happens if you’re replacing something that works with something completely new? Can’t you just target a later release, and work continually until the feature freeze of that release?
It turns out that you can’t. There is an incredible emphasis in the Ubuntu planning process on targeting features for particular releases. This is the exact thing you’re not supposed to do with a time-based release schedule.
Unfortunately Canonical’s own performance-review and management is also based around this schedule. The Ubuntu developers so employed (the vast majority) have such fundamentals as their pay, bonuses, etc. dictated by how many of their assigned features and work items are into the release by feature freeze. It’s not the only requirement, but it’s the biggest one.
Your new feature is going to take twelve months of development time to fully develop before it’s truly a replacement for the existing feature in Ubuntu. What you don’t do is spend twelve months developing and land it when it’s a perfect replacement.
What you do do is develop it in 12-13 week bursts, which means it’s going to take you roughly four release cycles before it’s ready rather than two. And you land the quarter-complete feature in the first release, replacing the older stable feature.
Consequence
If this were true, you would expect to see new features repeatedly arriving in Ubuntu before they were ready. Removing the old, deprecated feature and breaking things temporarily with the promise that everything will be better in the next release, certainly the one after that, definitely by the LTS.
Maybe you don’t believe that characterizes Ubuntu, in which case you should probably just stop reading now because we’re not going to agree on my fundamental complaint.
But I will say this: I know I’m responsible for doing this on more than one occasion because I had to; and I saw the exact same pattern in others’ work, when I was a manager my reports complained that they had to follow this pattern and I still see the same pattern today with features such as Unity and the Software Center.
Follow this pattern and developers are going to complain that they need a release where they don’t have any features to work on, and can just spend the time stabilizing and bug fixing.
Worse, follow this pattern and you’re going to create a user expectation that releases are going to be largely unstable and contain sweeping changes that are going to be surprising to administrators of Enterprise desktop deployments, and discourage them from using your distribution at all.
A kludge to work around this would be to overlay a second release schedule onto your first one, with more of an emphasis on stability and support. It’s a target for your developers to complete their features, or at least stabilize them in those 12 weeks; and it’s a target for your users to consider deployment. So three out of four of your releases are really just unstable previews of that final fourth release.
Complacency
This second LTS release cycle solves the unstable release issue, so why is this a problem?
Because developer time is wasted; because user time is wasted; because user confidence is lost.
Because features can take longer than two years to develop; or even if a feature takes just two years, if it’s not begun immediately after the previous LTS release, it’s not going to be ready for the next one, so you might postpone and lose the lead.
Because you might expect a knock-on degeneracy effect in the LTS releases as well; with 12.04 LTS being less stable than 10.04 LTS, which was less stable than 8.04 LTS which was less stable than 6.06 LTS. And it’s far too late now to have considered the 10.10/11.04/11.10/12.04 cycle to have been a Super-Long-Term-Support release and kept back the complete replacement of the desktop environment.
Because the original reason for the six-month cycle has already been forgotten: features are targeted towards releases, rather than released when ready; because the original base for the release schedule (GNOME) is no longer a key component of the distribution; because no other key component has adopted this schedule.
Because there might be a better way.
Cataclysm
What I’m going to suggest here is a completely new development process for Ubuntu, complete with details about how it would be implemented.
I’m going to suggest a monthly release process, beginning with the 11.10 release. It so happens that this fits perfectly with Ubuntu’s version numbering system, the next release would be 11.11, followed by 11.12, followed by 12.01 and so on.
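The numbering rule is simple enough to sketch. The next_release helper below is hypothetical, written purely to illustrate how YY.MM version numbers would roll over under a monthly scheme:

```shell
#!/bin/sh
# Sketch: compute the version that follows a YY.MM release number when
# releases are monthly. next_release is a made-up helper for illustration.
next_release() {
    yy=${1%%.*}          # part before the dot, e.g. "11"
    mm=${1##*.}          # part after the dot, e.g. "10"

    # Strip one leading zero so "08" is not read as an octal number.
    yy=${yy#0}
    mm=${mm#0}

    if [ "$mm" -ge 12 ]; then
        yy=$((yy + 1))   # December rolls over into January of the next year
        mm=1
    else
        mm=$((mm + 1))
    fi

    printf '%02d.%02d\n' "$yy" "$mm"
}

next_release 11.10   # prints 11.11
next_release 11.12   # prints 12.01
```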
This monthly release would be simply known as release in your sources.list, updates would be published to it on the first week of the month. There would be no codenames, and due to the rapid releases, changes would be largely unsurprising and iterative on the previous releases.
In order to provide user testing, a second release known as beta would exist. It’s from this release that release would be copied on that first week of the month. beta would be updated every two weeks, on the first week of the month after it became the new release, and then on the half-way point of the month. Users who like a little bleeding on their edge can change their sources.list to use this more exciting release, or download appropriate disk images.
Developers wouldn’t run either of these, they would run the third release branch alpha. It’s from here that beta is updated; and from here that daily disk images would be generated.
Publishing from alpha to beta, and then from beta to release is handled semi-automatically. The release manager will track Release Critical bugs, and will hold up packages from copying from one to the other if they have outstanding problems. If this sounds familiar, it’s because this is exactly how the Debian testing distribution works and I recommend using the same software (which Ubuntu already uses to check for archive issues).
So where do developers upload? It’s tempting to just say to alpha, but if we say that, alpha will end up looking very different from release because it will be filled with unstable software that’s not ready for users yet. This will make it harder for problems in the release branch to be fixed, because none of the components are left in alpha because they’ve been replaced by something that’s not ready yet.
Developers will upload to an unpublished trunk branch. Packages will be copied to alpha provided:
- there is a signed-off code review for the upload
- the upload meets policy (lintian clean)
- the upload builds on all released architectures
- unit tests pass on all released architectures
- functional and verification tests pass on all architectures for the archive as a whole
I just introduced a bunch of new checks to the developer process there; I just introduced code review, mandatory unit tests and then piled functional tests and verification tests on top.
The first four are relatively self-explanatory; fail any of these tests and your upload has marked the tree red. In which case not only will your package fail to copy to alpha, but you’re about to have a conversation with the Release Manager.
For functional and verification tests, this means doing more automated QA. A failing test could be an automated installer run, or an automated boot-and-test run, etc. They’ll run sometime after the fact and will make the entire tree red. The Release Manager or their team will have to examine the logs to figure out the culprit.
So things aren’t copying to alpha, now one of two things is going to happen.
- the Release Manager reverts your upload. Because trunk is unpublished, this is simply overwriting with the older package from alpha; nobody except the original developer is going to have known about it
- after talking with the developer, it’s decided that further uploads of other packages are required (e.g. due to dep-wait, or the bug being elsewhere) in which case the tree remains red while the developer (or another in rare cases) prepares that fix upload.
While the tree is red, nobody else is allowed to upload unless it’s a fix for the problem. All effort should go to fixing the tree.
If the archive has to always remain stable, how do you develop large features such as Upstart, Unity, Ubiquity, Software Center, etc.? You use a PPA to do development, on your own timeline.
If your feature takes twelve months to develop, you take twelve months to develop it in that PPA. You’re going to be posting regularly to mailing lists or blogging about your feature to encourage users to add your PPA to their sources.list to gain testing. Obviously you’ll be doing various uploads to the main series over time to get all your dependencies in early where they don’t conflict with what’s already there.
Conclusion
My proposal is a radical change to the Ubuntu Release Process, but surprisingly it would take very little technical effort to implement because all the pieces are already there including the work on performing automated functional and verification tests.
I believe it solves the problem of landing unstable features before they’re ready, because it almost entirely removes releases as a thing. As a developer you simply work in a PPA until you’ll pass review, and land a stable feature that can replace what was there before.
It solves the need for occasional stabilization and bug-fixing releases because the main series is always stable and can receive bug-fixes easily, separate from any development work going on. A developer can choose to focus on looking after the main series for some of their time in addition to their feature development work, or devote all of their time to it.
Another problem I’ve not talked about is that of building software on an unstable foundation, also solved by this change. Since developers will run alpha, and vendor developers can just run a relatively up-to-date, yet stable, release branch, software can be built on a solid foundation. Only the new feature or software itself is unstable until ready.
Canonical can keep its review schedule, and use developer uploads and work items; except rather than landing in a release, they can now land in a PPA.
Merges from Debian unstable can be handled pretty much continually as long as they keep the tree green, alternatively one can decide that users ultimately don’t care about an updated version of cat and until a case can be made (e.g. an open bug) for a package’s update, it need not be merged.
Users can now be confident of always receiving a stable operating system, because of the multiple testing and QA passes each change continually receives. Updates come in monthly, two-weekly or dailyish batches depending on where in the main series they choose to run.
Enterprise administrators can run this stable release, because it only changes gradually with well-tested updates. The big changes and features have a long gestation period in PPAs, with many advance notices and blog posts about them. They’re not a surprise and can be planned for well in advance of their landing.
Downsides will, doubtless, be found in the comments below.
For your consideration.
Tracing on Linux
The Linux tracing APIs are a relatively new addition to the kernel and one of the most powerful new features it’s gained in a long time. Unfortunately the plethora of terms and names for the system can be confusing, so in this follow-up to my previous post on the proc connector and socket filter, I’ll take a look at achieving the same result using tracing and hopefully unravel a little of the mystery along the way.
Rather than write a program along the way, I’ll be referring to sample code found in the kernel tree itself so you’ll want a checkout. If you’re doing any work that touches the kernel further than standard POSIX APIs, I highly recommend this anyway; it’s quite readable and once you find your way around, is the quickest way to answer questions.
Grab your checkout with git:
# git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
# cd linux-2.6
Tracepoints
One of the reasons there are so many terms and names is that, like most kernel systems, there are many layers and each of those layers is exposed as different developers have different requirements. An important lower layer is that of tracepoints, also known as static tracepoints. For these we’ll be looking at the code in the samples/tracepoints directory of the kernel source; kernelese documentation of the API can be found in Documentation/trace/tracepoints.txt
A tracepoint is a placeholder function call in kernel code that the developer of that subsystem has deemed a useful point for debugging code to be able to hook into. Static refers to the fact they are fixed in point by the original developer. You can think of them as the kind of code you’d tend to guard with #if DEBUG in traditional C development, and like those statements they’re nearly free when they’re not in use except that you can turn these on and off at runtime.
The samples/tracepoints/tracepoint-sample.c file is a kernel module that creates a /proc/tracepoint-sample file, and has a couple of tracepoints coded into it by the developer. First it includes the samples/tracepoints/tp-samples-trace.h which actually declares the tracepoints.
DECLARE_TRACE(subsys_event,
    TP_PROTO(struct inode *inode, struct file *file),
    TP_ARGS(inode, file));
DECLARE_TRACE_NOARGS(subsys_eventb);
You can think of these as declaring the function prototypes, one trace function has two arguments: an inode and a file; the other has no arguments. And if they’re function prototypes, we need to define a function; this is done back in the main tracepoint-sample.c file.
DEFINE_TRACE(subsys_event);
DEFINE_TRACE(subsys_eventb);
These tracepoints can now be called from the kernel code, passing the arguments that may need to be traced; remember that these have no side-effects unless enabled. The code that calls out to the tracepoints is in the my_open() function.
trace_subsys_event(inode, file);
for (i = 0; i < 10; i++)
    trace_subsys_eventb();
Simple, huh? Don’t worry about the rest, this primer is simply so you can recognise tracepoints in the kernel source when you see them. I don’t expect you to go leaping around the kernel adding tracepoints and rebuilding it, unless you want to, of course.
So how do you hook into tracepoints? The answer is from other kernel code, usually in the form of a loadable module such as that defined by samples/tracepoints/tracepoint-probe-sample.c; this includes the same header file as before to get the prototypes.
#include "tp-samples-trace.h"
In the module __init function it registers two functions of its own as hooks into the tracepoint, this activates the tracepoint and turns the code in the previous module from a near no-op to a function call that will call these functions.
ret = register_trace_subsys_event(probe_subsys_event, NULL);
WARN_ON(ret);
ret = register_trace_subsys_eventb(probe_subsys_eventb, NULL);
WARN_ON(ret);
And obviously in the module __exit function we have to unregister these, otherwise we leave dangling things.
unregister_trace_subsys_eventb(probe_subsys_eventb, NULL);
unregister_trace_subsys_event(probe_subsys_event, NULL);
tracepoint_synchronize_unregister();
As to those functions, they take an argument which is a pointer to the same data as the second argument to the register call, and then otherwise take the arguments defined in DECLARE_TRACE. You can do pretty much what you want here; the example simply extracts the filename and outputs it with a printk():
static void probe_subsys_event(void *ignore,
struct inode *inode, struct file *file){
path_get(&file->f_path);
dget(file->f_path.dentry);
printk(KERN_INFO "Event is encountered with filename %s\n",
file->f_path.dentry->d_name.name);
dput(file->f_path.dentry);
path_put(&file->f_path);
}
So that’s tracepoints; they’re a low-level method for a kernel developer to pick places in their code that may be useful for debugging and a method for loadable kernel code such as modules to hook into those places.
Trace Events (Kernel API)
So you know about tracepoints, and you’ve almost certainly heard about Trace Events, but what’s the difference? Well firstly trace events are actually built on tracepoints, you can think of them as a higher level API – and that’s why I covered tracepoints first. Secondly trace events are usable from userspace! We don’t need to write kernel modules to be able to hook into them, but obviously we can only read data this way.
In fact, since they’re tracepoints with extra benefits, you wouldn’t think anyone would use the basic tracepoints at all, and you’d be right! A git grep DECLARE_TRACE in a current kernel tree will show you that the only user of the raw tracepoint macros is actually the trace events system.
Since everyone just defines trace events, a primer on the kernel-side will be useful, so we’ll be looking at the code in samples/trace_events and if you want to read the userspace API documentation, it’s in Documentation/trace/events.txt
Just one source file and header file this time, first we’ll look at the header samples/trace_events/trace-events-sample.h; this seems pretty complicated at first, but almost all of this is boiler-plate code that gets copied into every trace events header. The important bit is the TRACE_EVENT macro:
TRACE_EVENT(foo_bar,
TP_PROTO(char *foo, int bar),
TP_ARGS(foo, bar),
TP_STRUCT__entry(
__array( char, foo, 10 )
__field( int, bar )
),
TP_fast_assign(
strncpy(__entry->foo, foo, 10);
__entry->bar = bar;
),
TP_printk("foo %s %d", __entry->foo, __entry->bar)
);
The first part of this looks just like DECLARE_TRACE, and that’s no accident, we’re still declaring a tracepoint too so this will give us a function with the prototype declared in TP_PROTO and argument names in TP_ARGS.
The TP_STRUCT__entry and TP_fast_assign bits are new though. As well as declaring a tracepoint, trace events come with the equivalent “loadable module” code that copies data from the arguments of the function into a struct that can be examined from userspace. TP_STRUCT__entry defines that structure, and TP_fast_assign is C code that should quickly copy data into that structure.
So we’ve declared a tracepoint, we’ve defined a structure containing an array of 10 char and an int, and we’ve written C code to copy from the tracepoint arguments into that structure. The last bit of the trace event is TP_printk, which does exactly what you’d expect. Since the most common (at least, first) use of a trace event is going to be to output something, this macro defines a format string for that printk() call.
Back in the samples/trace_events/trace-events-sample.c file, we include this header but first set a special define. This is only set once in the entire kernel source, and this results in all of the functions being defined; i.e. TRACE_EVENT becomes DEFINE_TRACE rather than DECLARE_TRACE.
#define CREATE_TRACE_POINTS
#include "trace-events-sample.h"
All other users of this header simply include the header.
From here on in the source, the trace event is just a tracepoint and is called in the same way: as a function call.
trace_foo_bar("hello", cnt);
That’s a kernel-side primer, you should be able to git grep through the source and find trace events. But now it’s time to get into the fun bit and look at the userspace API for dealing with them; remember if you want anything more complicated, they’re just tracepoints so you can write kernel modules and hook into them as before.
Trace Events (Userspace API)
We’re in userspace now, so you can leave the kernel source directory, but you do need to be root and you may need to mount a filesystem. This is because some distributions (like Ubuntu) have an allergy to debugging (seriously, they even disable things like gdb -p).
Try and change into the /sys/kernel/debug/tracing directory.
# cd /sys/kernel/debug/tracing
If this fails, you’ll need to mount the debugfs filesystem and try again.
# mount -t debugfs none /sys/kernel/debug
# cd /sys/kernel/debug/tracing
With that done, we should make sure tracing is enabled.
# cat tracing_enabled
1
If that’s 0, enable it:
# echo 1 > tracing_enabled
So we’ve enabled tracing, but what can we trace? Trace events are exposed in the events sub-directory in two levels, the first is the subsystem and the second are the trace events themselves. Since in my last blog post we were looking at tracing forks, it would be great if there were trace events for doing just that. This is where it helps to be able to git grep around the kernel source and recognise trace events, so you at least know the right subsystem name; and it turns out that the sched subsystem has exactly the events we wanted.
deathspank tracing# ls events/sched
enable                   sched_process_exit/  sched_stat_sleep/
filter                   sched_process_fork/  sched_stat_wait/
sched_kthread_stop/      sched_process_free/  sched_switch/
sched_kthread_stop_ret/  sched_process_wait/  sched_wait_task/
sched_migrate_task/      sched_stat_iowait/   sched_wakeup/
sched_pi_setprio/        sched_stat_runtime/  sched_wakeup_new/
sched_process_fork sounds exactly right, if you look at it, it’s a directory that contains four files: enable, filter, format and id. I bet you can guess how to enable fork tracing, but if not:
# cat events/sched/sched_process_fork/enable
0
# echo 1 > events/sched/sched_process_fork/enable
Pretty painless, so go ahead and run a few things, and turn the tracing off again when you’re done.
# echo 0 > events/sched/sched_process_fork/enable
Now let’s look at the result of our trace; recall that every trace event comes with a free printk() of formatted output? We can find the output from those in the top-level trace file.
# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
             zsh-2667  [001]  6658.716936: sched_process_fork: comm=zsh pid=2667 child_comm=zsh child_pid=2748
So for each process fork, we get the parent and child process ids along with the process name. Pretty much exactly what we want!
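The enable, run, disable, read sequence can be wrapped up in a few lines of shell. This is a minimal sketch rather than a finished tool: trace_forks and the TRACING_DIR variable are names invented for the example, it needs to run as root against a mounted debugfs, and because the trace file is global it will also show older entries and forks from unrelated processes.

```shell
#!/bin/sh
# Sketch: run a command with sched_process_fork tracing switched on, then
# switch it off again and print whatever is in the trace file.
# TRACING_DIR and trace_forks are names chosen for this example.
TRACING_DIR=${TRACING_DIR:-/sys/kernel/debug/tracing}

trace_forks() {
    enable_file=$TRACING_DIR/events/sched/sched_process_fork/enable

    echo 1 > "$enable_file" || return 1   # start recording fork events
    "$@"                                  # the command we want to watch
    status=$?
    echo 0 > "$enable_file"               # always turn the event off again

    cat "$TRACING_DIR/trace"              # formatted output of the events
    return "$status"
}

# Usage (as root):
#   trace_forks make -j4
```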
There’s plenty to play around with using this API, as you’ve probably noticed you can enable entire subsystems or all events using the enable files at the subsystem and events-levels; there’s also a set_event file at the tracing level which can be used to make batch changes to tracing, see the kernel documentation for more details.
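As a rough illustration of the set_event file, the sketch below appends event names in the subsystem:event form to turn them on, and the same names prefixed with an exclamation mark to turn them off, which is the syntax described in the kernel documentation. The enable_events and disable_events helpers and the TRACING_DIR variable are assumptions made for this example.

```shell
#!/bin/sh
# Sketch: batch-enable and batch-disable trace events through set_event.
# Helper names and TRACING_DIR are invented for this example.
TRACING_DIR=${TRACING_DIR:-/sys/kernel/debug/tracing}

# Enable each event given as "subsystem:event".
enable_events() {
    for ev in "$@"; do
        echo "$ev" >> "$TRACING_DIR/set_event"
    done
}

# Disable each event by writing its name prefixed with "!".
disable_events() {
    for ev in "$@"; do
        echo "!$ev" >> "$TRACING_DIR/set_event"
    done
}

# Usage (as root):
#   enable_events sched:sched_process_fork sched:sched_process_exit
#   disable_events sched:sched_process_exit
```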
You’re probably wondering though what happened to the rest of the struct, especially if there are fields that aren’t included in the default printk(). You can examine the struct format by reading the format file of a trace event, and you can use this with the filter file to exclude events you’re not interested in. Again anything I write here would be just duplicating the kernel documentation, so go read Documentation/trace/events.txt
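To give a flavour of it without repeating the documentation, here is a small sketch for the fork event. It assumes the struct has a parent_pid field, so check the format file on your own kernel before relying on that name; the helper names and TRACING_DIR are made up for the example.

```shell
#!/bin/sh
# Sketch: inspect the sched_process_fork struct and filter on one field.
# Helper names and TRACING_DIR are invented; parent_pid is an assumed
# field name that should be confirmed against the format file.
TRACING_DIR=${TRACING_DIR:-/sys/kernel/debug/tracing}
FORK_EVENT=events/sched/sched_process_fork

# Print the field names, types and offsets of the event's struct.
show_fork_format() {
    cat "$TRACING_DIR/$FORK_EVENT/format"
}

# Only record forks made by the process with the given pid.
filter_forks_by_parent() {
    echo "parent_pid == $1" > "$TRACING_DIR/$FORK_EVENT/filter"
}

# Writing 0 to the filter file removes the filter again.
clear_fork_filter() {
    echo 0 > "$TRACING_DIR/$FORK_EVENT/filter"
}

# Usage (as root):
#   show_fork_format
#   filter_forks_by_parent 2667
#   clear_fork_filter
```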
Perf
After a little bit of playing you’ll realise that tracing is not limited to your current process or shell: you’ll get events for processes you’re not interested in, and also events for subsystems you’re not interested in if other processes are doing traces of their own. There’s also only one global filter for the entire trace events system, so other users or processes doing tracing could override yours.
There’s an even higher level that we can use to work around those problems: the perf tool. Originally designed as a userspace component to the performance counters system, it’s grown a wide variety of extra features, one of which is the ability to work with kernel tracepoints as an input source.
Since trace events are tracepoints, these count!
So let’s say we want to record the forks made by a process we run, without fear of contamination from other processes on the system or other users performing tracing. Using perf we can simply run
# perf record -e sched:sched_process_fork bash
And run as many commands as we like in that shell. When the shell exits, perf will write the results of the tracing to a perf.data file for analysis.
# exit
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.017 MB perf.data (~735 samples) ]
We can analyse this later using various perf sub-commands, the simplest of which is an argument-less perf script which outputs the equivalent of reading the trace file.
# perf script
bash-3141 [003] 10201.049939: sched_process_fork: comm=bash pid=3141 child_comm=bash child_pid=3142
:3142-3142 [001] 10201.050391: sched_process_fork: comm=bash pid=3142 child_comm=bash child_pid=3143
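If you want to post-process that output, the lines are regular enough to pick apart. This is only a sketch: parse_fork_line is a name I made up, and it assumes the line layout shown above with a comm that doesn’t itself contain " pid=".

```c
#include <stdio.h>
#include <string.h>

/* Extract the parent and child pids from one sched_process_fork line
 * of "perf script" output.  Returns 0 on success, -1 if the line does
 * not look like a fork event. */
int
parse_fork_line (const char *line, int *parent, int *child)
{
    const char *event = strstr (line, "sched_process_fork:");
    const char *p, *c;

    if (! event)
        return -1;

    /* The leading space stops " pid=" matching inside "child_pid=". */
    p = strstr (event, " pid=");
    c = strstr (event, " child_pid=");
    if (! p || ! c)
        return -1;

    if (sscanf (p, " pid=%d", parent) != 1)
        return -1;
    if (sscanf (c, " child_pid=%d", child) != 1)
        return -1;

    return 0;
}
```

Feed it each line read from a pipe to perf script and you have the parent/child pairs for the whole session.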
Conclusion
As an administrator debugging their system, or a developer trying to understand the performance or events timeline of their work, perf is perfect. It’s a very well documented tool with all of the bells and whistles you need for tracing a wide variety of events.
Unfortunately the API between perf and the kernel is a private one; the perf tool source is shipped as part of the kernel source, and they are version-mated with each other.
Recall that the topic of the previous blog post was to write a program to follow forks, rather than doing it as a system administrator.
If we want to write software to do it, the lower (but still high) level trace events API seems a better bet. There are a wide range of applications of this API, for example the ureadahead program in an Ubuntu system uses it to trace the open() and exec() syscalls the system performs during boot so it knows which files to cache for faster boot times. But it’s easy for another process, or a user, to interfere with the results of this tracing so it’s not ideal for our purpose either.
Finally the tracepoints API is too low-level, writing a kernel module and building and maintaining it for each kernel version is just not on the cards.
So it would appear we’re at a dead-end for using tracing to do what we want. That’s not the end of the story though; there are other tracing tools such as kprobes and ftrace that I haven’t covered yet. Unfortunately this blog post has gotten a little too long, and the coverage of tracepoints, trace events and perf was worthwhile in and of itself, so we’ll have to pick those up next time!
The Proc Connector and Socket Filters
The proc connector is one of those interesting kernel features that most people rarely come across, and even more rarely find documentation on. Likewise the socket filter. This is a shame, because they’re both really quite useful interfaces that might serve a variety of purposes if they were better documented.
The proc connector allows you to receive notification of process events such as fork and exec calls, as well as changes to a process’s uid, gid or sid (session id). These are provided through a socket-based interface by reading instances of struct proc_event defined in the kernel header.
#include <linux/cn_proc.h>
The interface is built on the more generic connector API, which itself is built on the generic netlink API. These interfaces add some complexity as they are intended to provide bi-directional communication between the kernel and userspace; the connector API appears to have been largely forgotten as newer such socket interfaces simply declare their own first-class socket classes. So we need the headers for those too.
#include <linux/netlink.h>
#include <linux/connector.h>
(For brevity, I’ll omit any standard boilerplate such as the headers you need for syscalls and library functions that you should be used to as well as function definitions, error checking, and so-forth.)
Ok, now we’re ready to create the connector socket. This is straightforward enough; since we’re dealing with atomic messages rather than a stream, datagram is appropriate.
int sock;
sock = socket (PF_NETLINK, SOCK_DGRAM | SOCK_NONBLOCK | SOCK_CLOEXEC,
NETLINK_CONNECTOR);
To select the proc connector we bind the socket using a struct sockaddr_nl object.
struct sockaddr_nl addr;

addr.nl_family = AF_NETLINK;
addr.nl_pid = getpid ();
addr.nl_groups = CN_IDX_PROC;

bind (sock, (struct sockaddr *)&addr, sizeof addr);
Unfortunately that’s not quite enough yet; the proc connector socket is a bit of a firehose, so it doesn’t in fact send any messages until a process has subscribed to it. So we have to send a subscription message.
As I mentioned before, the proc connector is built on top of the generic connector, and that itself is on top of netlink, so sending that subscription message also involves embedding a message inside a message inside a message. If you understood Christopher Nolan’s Inception, you should do just fine.
Since we’re nesting a proc connector operation message inside a connector message inside a netlink message, it’s easiest to use an iovec for this kind of thing.
struct iovec iov[3];
char nlmsghdrbuf[NLMSG_LENGTH (0)];
struct nlmsghdr *nlmsghdr = (struct nlmsghdr *)nlmsghdrbuf;
struct cn_msg cn_msg;
enum proc_cn_mcast_op op;

nlmsghdr->nlmsg_len = NLMSG_LENGTH (sizeof cn_msg + sizeof op);
nlmsghdr->nlmsg_type = NLMSG_DONE;
nlmsghdr->nlmsg_flags = 0;
nlmsghdr->nlmsg_seq = 0;
nlmsghdr->nlmsg_pid = getpid ();

iov[0].iov_base = nlmsghdrbuf;
iov[0].iov_len = NLMSG_LENGTH (0);

cn_msg.id.idx = CN_IDX_PROC;
cn_msg.id.val = CN_VAL_PROC;
cn_msg.seq = 0;
cn_msg.ack = 0;
cn_msg.len = sizeof op;

iov[1].iov_base = &cn_msg;
iov[1].iov_len = sizeof cn_msg;

op = PROC_CN_MCAST_LISTEN;

iov[2].iov_base = &op;
iov[2].iov_len = sizeof op;

writev (sock, iov, 3);
The netlink message length is the combined length of the following connector and proc connector operation messages, and is otherwise simply a message from our process id with no following messages. However all of the interfaces to netlink take a lot of care to make sure the following structure in the message is aligned as wide as possible using the NLMSG_LENGTH macro, to avoid issues with platforms that have fixed alignment for data types, so we have to be careful of that too.
So we actually have a bit of padding between the struct nlmsghdr and the struct cn_msg, this is accomplished by actually using a character buffer of the right size for the first iovec element and accessing it through a struct nlmsghdr pointer.
The connector message indicates that it is relevant to the proc connector through the idx and val fields, and the length is the length of the proc connector operation message.
Finally the proc connector operation message (just an enum) says we want to subscribe. Why isn’t there padding between the connector and proc connector operation messages? Because the last element in struct cn_msg is a zero-width type which results in the right padding; this interface is rather newer than netlink.
iovec stitches it all together so it’s sent as a single message: the netlink header (padded as needed for alignment), followed by the connector message, followed by the proc connector operation.
There’s a matching PROC_CN_MCAST_IGNORE message if you want to turn off the firehose without closing the socket.
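If the iovec juggling isn’t to your taste, the same bytes can be laid out in one flat buffer instead. This is a sketch of that alternative, not what the code above does; build_proc_cn_message and PROC_CN_MSG_SIZE are names I’ve invented for it, and the caller is assumed to supply a buffer aligned for a struct nlmsghdr.

```c
#include <string.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/netlink.h>
#include <linux/connector.h>
#include <linux/cn_proc.h>

/* Size of the whole message: netlink header (including any alignment
 * padding), connector header, and the operation enum. */
#define PROC_CN_MSG_SIZE \
    NLMSG_LENGTH (sizeof (struct cn_msg) + sizeof (enum proc_cn_mcast_op))

/* Fill buf (at least PROC_CN_MSG_SIZE bytes) with a subscribe
 * (PROC_CN_MCAST_LISTEN) or unsubscribe (PROC_CN_MCAST_IGNORE) message
 * and return the number of bytes to send. */
size_t
build_proc_cn_message (void *buf, enum proc_cn_mcast_op op)
{
    struct nlmsghdr *nlh = buf;
    struct cn_msg   *cn;

    memset (buf, 0, PROC_CN_MSG_SIZE);

    /* Outermost layer: the netlink header. */
    nlh->nlmsg_len = PROC_CN_MSG_SIZE;
    nlh->nlmsg_type = NLMSG_DONE;
    nlh->nlmsg_pid = getpid ();

    /* Middle layer: the connector header, placed where NLMSG_DATA
     * says the netlink payload starts. */
    cn = NLMSG_DATA (nlh);
    cn->id.idx = CN_IDX_PROC;
    cn->id.val = CN_VAL_PROC;
    cn->len = sizeof op;

    /* Innermost layer: the operation, directly after the cn_msg. */
    memcpy ((char *)cn + sizeof *cn, &op, sizeof op);

    return PROC_CN_MSG_SIZE;
}
```

The result can be handed to a single send (sock, buf, len, 0), and calling it with PROC_CN_MCAST_IGNORE gives you the message to turn the firehose back off.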
Ok, the firehose is on; now we need to read the stream of messages. Just like the message we sent, the messages we receive are actually netlink messages, and inside those netlink messages are connector messages, and inside those are proc connector messages.
Netlink allows for all sorts of things like multi-part messages, but in reality we can ignore most of that since connector doesn’t use them. Still, it’s worth future-proofing ourselves and being liberal in what we accept.
struct msghdr msghdr;
struct sockaddr_nl addr;
struct iovec iov[1];
char buf[PAGE_SIZE];
ssize_t len;

msghdr.msg_name = &addr;
msghdr.msg_namelen = sizeof addr;
msghdr.msg_iov = iov;
msghdr.msg_iovlen = 1;
msghdr.msg_control = NULL;
msghdr.msg_controllen = 0;
msghdr.msg_flags = 0;

iov[0].iov_base = buf;
iov[0].iov_len = sizeof buf;

len = recvmsg (sock, &msghdr, 0);
Why do we use recvmsg rather than just read? Because netlink allows arbitrary processes to send messages to each other, so we need to make sure the message actually comes from the kernel; otherwise you have a potential security vulnerability. recvmsg lets us receive the sender address as well as the data.
if (addr.nl_pid != 0)
continue;
(I’m assuming you’re reading in a loop there.)
So now we have a netlink message packet from the kernel; this may contain multiple individual netlink messages (it doesn’t, but it may). So we iterate over those.
for (struct nlmsghdr *nlmsghdr = (struct nlmsghdr *)buf;
NLMSG_OK (nlmsghdr, len);
nlmsghdr = NLMSG_NEXT (nlmsghdr, len))
And we should ignore error or no-op messages from netlink.
if ((nlmsghdr->nlmsg_type == NLMSG_ERROR)
|| (nlmsghdr->nlmsg_type == NLMSG_NOOP))
continue;
Inside each individual netlink message is a connector message, we extract that and make sure it comes from the proc connector system.
struct cn_msg *cn_msg = NLMSG_DATA (nlmsghdr);
if ((cn_msg->id.idx != CN_IDX_PROC)
|| (cn_msg->id.val != CN_VAL_PROC))
continue;
Now we can safely extract the proc connector message; this is a struct proc_event that we haven’t seen before. It’s quite a large structure definition so I won’t paste it here, since it contains a union for each of the different possible message types. Instead here’s code to actually print the relevant contents for an example message.
struct proc_event *ev = (struct proc_event *)cn_msg->data;
switch (ev->what) {
case PROC_EVENT_FORK:
printf ("FORK %d/%d -> %d/%d\n",
ev->event_data.fork.parent_pid,
ev->event_data.fork.parent_tgid,
ev->event_data.fork.child_pid,
ev->event_data.fork.child_tgid);
break;
/* more message types here */
}
As you can see, each message type has an associated member of the event_data union containing the information fields for it. And as you can see, this gives you information about each individual kernel task, not just the top-level processes you’re normally used to seeing. In other words, you see threads as well as processes.
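Pulling the receive-side fragments together, here is a sketch of a function that walks whatever recvmsg handed back and collects the fork events. The struct fork_info type and the collect_forks name are mine, purely for illustration. One liberty taken: it copies the event out of the buffer before reading it, since the payload isn’t necessarily aligned the way struct proc_event would like.

```c
#include <stddef.h>
#include <string.h>
#include <linux/netlink.h>
#include <linux/connector.h>
#include <linux/cn_proc.h>

/* Hypothetical record type for this sketch. */
struct fork_info {
    int parent_pid, parent_tgid, child_pid, child_tgid;
};

/* Walk every netlink message in buf (len bytes, as returned by
 * recvmsg) and store up to max fork events in out.  Returns the
 * number stored. */
int
collect_forks (void *buf, int len, struct fork_info *out, int max)
{
    /* Bytes of a proc_event needed in order to read a fork event. */
    const size_t need = offsetof (struct proc_event, event_data)
                        + sizeof (struct fork_proc_event);
    int count = 0;

    for (struct nlmsghdr *nlh = buf; NLMSG_OK (nlh, len);
         nlh = NLMSG_NEXT (nlh, len)) {
        struct cn_msg    *cn = NLMSG_DATA (nlh);
        struct proc_event ev;

        if (nlh->nlmsg_type == NLMSG_ERROR
            || nlh->nlmsg_type == NLMSG_NOOP)
            continue;

        /* Too short to hold a connector header plus a fork event. */
        if (nlh->nlmsg_len < NLMSG_LENGTH (sizeof *cn + need))
            continue;

        if (cn->id.idx != CN_IDX_PROC || cn->id.val != CN_VAL_PROC)
            continue;

        /* Copy out rather than cast, to sidestep alignment worries. */
        memset (&ev, 0, sizeof ev);
        memcpy (&ev, (char *)cn + sizeof *cn, need);

        if (ev.what != PROC_EVENT_FORK || count >= max)
            continue;

        out[count].parent_pid = ev.event_data.fork.parent_pid;
        out[count].parent_tgid = ev.event_data.fork.parent_tgid;
        out[count].child_pid = ev.event_data.fork.child_pid;
        out[count].child_tgid = ev.event_data.fork.child_tgid;
        count++;
    }

    return count;
}
```

Call it with the buffer and length from recvmsg once you’ve checked the sender; a fork of a real process is one where the pid and tgid fields agree on both sides.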
Like I keep saying, it’s a firehose. It would be great if there was some way to filter the socket in the kernel so that our process doesn’t even get woken up for messages. Wake-ups are bad, especially in the embedded space.
Fortunately there is a way to filter sockets on the kernel-side, the kernel socket filter interface. Unfortunately this isn’t too well documented either; but let’s use this opportunity to document an example.
We’ll filter the socket so that we only receive fork notifications, discarding the other proc connector event types and, most importantly, discarding the messages that indicate new threads being created (those where the pid and tgid fields differ). One important part of filtering is that you should be careful that only expected messages are filtered, and that unexpected messages are still passed through.
The filter machine consists of a set of machine language instructions added to the socket through a special socket option. Fortunately this machine language is copied from the Berkeley Packet Filter from BSD, so we can find documentation for it in the bpf(4) manual page there. Just ignore the structure definitions, because they are different on Linux.
So let’s get started with our example; first we need to add the right header.
#include <linux/filter.h>
And now we need to attach the filter to the socket; just after creating it, before the subscription message is sent, is usually a good place. On Linux the instructions are given as an array of struct sock_filter members which we can construct using the BPF_STMT and BPF_JUMP macros.
Just to make sure everything is working, we’ll create a simple “no-op” filter.
struct sock_filter filter[] = {
BPF_STMT (BPF_RET|BPF_K, 0xffffffff),
};
struct sock_fprog fprog;
fprog.filter = filter;
fprog.len = sizeof filter / sizeof filter[0];
setsockopt (sock, SOL_SOCKET, SO_ATTACH_FILTER, &fprog, sizeof fprog);
Not very useful, but it means we can now concentrate on writing the filter code itself. This filter consists of a single statement, BPF_RET that tells the kernel to deliver an amount of bytes of the packet to the receiving process and to return from the filter. The BPF_K option means that we give the amount of bytes as the argument to the statement, and in this case we give the largest possible value. In other words, this statement declares to deliver the whole packet and return from the filter.
To filter everything and not wake up the process at all, we deliver no bytes and return from the filter.
BPF_STMT (BPF_RET|BPF_K, 0),
You may want to test that too.
Ok, now let’s actually do some examination of the packets to filter out the noise. Recall that we’re dealing with nested messages here, messages inside messages, inside messages. Visualizing this is really important to understanding what you’re dealing with.
The most basic filter code consists of three operations: load a value from the packet into the machine’s accumulator, compare that against a value and jump to a different instruction if equal (or not equal), and then possibly return or perform another operation.
All of the following filter code replaces whatever you had in the filter[] array before.
So first we should examine the nlmsghdr at the start of the packet; we want to make sure that there is just one netlink message in this packet. If there are multiple, we just pass the whole packet to userspace for dealing with. We check the nlmsg_type field to make sure it contains the value NLMSG_DONE.
BPF_STMT (BPF_LD|BPF_H|BPF_ABS,
          offsetof (struct nlmsghdr, nlmsg_type)),
BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_K,
          htons (NLMSG_DONE),
          1, 0),
BPF_STMT (BPF_RET|BPF_K, 0xffffffff),
The first statement says to load (BPF_LD) a “halfword” (16-bit) value (BPF_H) from the absolute offset (BPF_ABS) equivalent to the position of the nlmsg_type member in struct nlmsghdr. Since we expect that structure to be the start of the message, this means the accumulator should now have that value.
The next statement is a jump (BPF_JMP), it says to compare the accumulator for equality (BPF_JEQ) against the constant argument (BPF_K). We only want to continue if this is the sole message, so the value we compare against is NLMSG_DONE – first remembering to deal with host and network ordering.
If true, the jump will jump one statement; if false the jump will not jump any statements. These are the third and fourth arguments to the BPF_JUMP macro.
Note that the error case is always to return the whole packet to the process, waking it up. And the success case is further processing of the packet. This makes sure that we don’t filter unexpected packets that userspace may really need to deal with. Don’t use the socket filter for security filtering, it’s for reducing wake-ups.
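The jump offsets are the part that tends to trip people up, so here’s a toy userspace interpreter for just the handful of instructions used in this post. To be clear, this is my own sketch for stepping through the logic, not the kernel’s implementation; run_filter and check_single_message are made-up names, and the final return of zero is only a stand-in for "the rest of the filter".

```c
#include <stddef.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <linux/filter.h>
#include <linux/netlink.h>

/* Toy interpreter: returns the number of bytes the program asks to
 * deliver.  Loads are big-endian, and like the real thing a load
 * beyond the end of the packet drops it (returns 0). */
uint32_t
run_filter (const struct sock_filter *prog, size_t n,
            const unsigned char *pkt, size_t len)
{
    uint32_t a = 0, x = 0, mem[16] = { 0 };

    for (size_t pc = 0; pc < n; pc++) {
        const struct sock_filter *i = &prog[pc];

        switch (i->code) {
        case BPF_LD|BPF_H|BPF_ABS:
            if ((size_t)i->k + 2 > len)
                return 0;
            a = ((uint32_t)pkt[i->k] << 8) | pkt[i->k + 1];
            break;
        case BPF_LD|BPF_W|BPF_ABS:
            if ((size_t)i->k + 4 > len)
                return 0;
            a = ((uint32_t)pkt[i->k] << 24)
                | ((uint32_t)pkt[i->k + 1] << 16)
                | ((uint32_t)pkt[i->k + 2] << 8)
                | pkt[i->k + 3];
            break;
        case BPF_ST:
            if (i->k >= 16)
                return 0;
            mem[i->k] = a;
            break;
        case BPF_LDX|BPF_W|BPF_MEM:
            if (i->k >= 16)
                return 0;
            x = mem[i->k];
            break;
        case BPF_JMP|BPF_JEQ|BPF_K:
            /* Skip jt or jf statements beyond the next one. */
            pc += (a == i->k) ? i->jt : i->jf;
            break;
        case BPF_JMP|BPF_JEQ|BPF_X:
            pc += (a == x) ? i->jt : i->jf;
            break;
        case BPF_RET|BPF_K:
            return i->k;
        default:
            return 0; /* instruction not handled by this toy */
        }
    }

    return 0;
}

/* The nlmsg_type check, with a stand-in final statement. */
uint32_t
check_single_message (const unsigned char *pkt, size_t len)
{
    struct sock_filter prog[] = {
        BPF_STMT (BPF_LD|BPF_H|BPF_ABS,
                  offsetof (struct nlmsghdr, nlmsg_type)),
        BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_K, htons (NLMSG_DONE), 1, 0),
        BPF_STMT (BPF_RET|BPF_K, 0xffffffff), /* unexpected: pass */
        BPF_STMT (BPF_RET|BPF_K, 0),          /* stand-in for the rest */
    };

    return run_filter (prog, sizeof prog / sizeof prog[0], pkt, len);
}
```

Feed it a header with nlmsg_type set to NLMSG_DONE and the jump skips over the pass-through return; anything else falls into it and the whole packet is delivered.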
So let’s filter the next set of values, we want to make sure that this netlink message is from the connector interface. Again we load the right “word” (32-bit) values (BPF_W) from the appropriate offsets and check them against constants.
BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
          NLMSG_LENGTH (0) + offsetof (struct cn_msg, id)
          + offsetof (struct cb_id, idx)),
BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_K,
          htonl (CN_IDX_PROC),
          1, 0),
BPF_STMT (BPF_RET|BPF_K, 0xffffffff),
BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
          NLMSG_LENGTH (0) + offsetof (struct cn_msg, id)
          + offsetof (struct cb_id, val)),
BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_K,
          htonl (CN_VAL_PROC),
          1, 0),
BPF_STMT (BPF_RET|BPF_K, 0xffffffff),
So after this filter code has executed, we know the packet contains a single netlink message from the proc connector. Now we want to make sure it’s a fork message; this is a bit different from before, because now we explicitly do filter out the other message types so the return case for non-equality is to return zero bytes.
BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
          NLMSG_LENGTH (0) + offsetof (struct cn_msg, data)
          + offsetof (struct proc_event, what)),
BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_K,
          htonl (PROC_EVENT_FORK),
          1, 0),
BPF_STMT (BPF_RET|BPF_K, 0),
And now we can compare the pid and tgid values for the parent process and the child process fields. This is again slightly interesting because we can’t compare against an absolute offset with the jump instruction so we use the second index register instead (BPF_X in the jump instruction). Of course it would be too easy if we could load directly into that, so we have to do it via the scratch memory store instead; this requires loading into the accumulator (BPF_LD), storing into scratch memory (BPF_ST) and loading the index register (BPF_LDX) from scratch memory (BPF_MEM).
BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
          NLMSG_LENGTH (0) + offsetof (struct cn_msg, data)
          + offsetof (struct proc_event, event_data)
          + offsetof (struct fork_proc_event, parent_pid)),
BPF_STMT (BPF_ST, 0),
BPF_STMT (BPF_LDX|BPF_W|BPF_MEM, 0),
Then we load the tgid value into the accumulator and we can compare and jump as before; if they are equal we want to continue, if they are unequal we want to filter the packet.
BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
          NLMSG_LENGTH (0) + offsetof (struct cn_msg, data)
          + offsetof (struct proc_event, event_data)
          + offsetof (struct fork_proc_event, parent_tgid)),
BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_X,
          0,
          1, 0),
BPF_STMT (BPF_RET|BPF_K, 0),
Then we do the same for the child field.
BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
          NLMSG_LENGTH (0) + offsetof (struct cn_msg, data)
          + offsetof (struct proc_event, event_data)
          + offsetof (struct fork_proc_event, child_pid)),
BPF_STMT (BPF_ST, 0),
BPF_STMT (BPF_LDX|BPF_W|BPF_MEM, 0),
BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
          NLMSG_LENGTH (0) + offsetof (struct cn_msg, data)
          + offsetof (struct proc_event, event_data)
          + offsetof (struct fork_proc_event, child_tgid)),
BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_X,
          0,
          1, 0),
BPF_STMT (BPF_RET|BPF_K, 0),
After all that filter hurdling, we have a packet that we want to pass through to the process, so the final instruction is a return of the largest packet size.
BPF_STMT (BPF_RET|BPF_K, 0xffffffff),
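For reference, here is the whole filter assembled in one place, as a sketch. The three offset macros and the build_fork_filter function are conveniences I’ve invented to keep the lines short; the instructions themselves follow the steps above, with the second pair of comparisons reading the child_pid and child_tgid fields.

```c
#include <stddef.h>
#include <string.h>
#include <arpa/inet.h>
#include <linux/netlink.h>
#include <linux/connector.h>
#include <linux/cn_proc.h>
#include <linux/filter.h>

/* Hypothetical helpers: offsets from the start of the packet. */
#define CN_OFF(field) \
    (NLMSG_LENGTH (0) + offsetof (struct cn_msg, field))
#define EV_OFF(field) \
    (CN_OFF (data) + offsetof (struct proc_event, field))
#define FORK_OFF(field) \
    (EV_OFF (event_data) + offsetof (struct fork_proc_event, field))

/* Copy the filter into out (room for max instructions) and return
 * the number of instructions, or 0 if out is too small. */
size_t
build_fork_filter (struct sock_filter *out, size_t max)
{
    const struct sock_filter filter[] = {
        /* Exactly one netlink message, otherwise pass through. */
        BPF_STMT (BPF_LD|BPF_H|BPF_ABS,
                  offsetof (struct nlmsghdr, nlmsg_type)),
        BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_K, htons (NLMSG_DONE), 1, 0),
        BPF_STMT (BPF_RET|BPF_K, 0xffffffff),

        /* From the proc connector, otherwise pass through. */
        BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
                  CN_OFF (id) + offsetof (struct cb_id, idx)),
        BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_K, htonl (CN_IDX_PROC), 1, 0),
        BPF_STMT (BPF_RET|BPF_K, 0xffffffff),
        BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
                  CN_OFF (id) + offsetof (struct cb_id, val)),
        BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_K, htonl (CN_VAL_PROC), 1, 0),
        BPF_STMT (BPF_RET|BPF_K, 0xffffffff),

        /* A fork event, otherwise drop. */
        BPF_STMT (BPF_LD|BPF_W|BPF_ABS, EV_OFF (what)),
        BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_K, htonl (PROC_EVENT_FORK), 1, 0),
        BPF_STMT (BPF_RET|BPF_K, 0),

        /* Parent pid == tgid, otherwise drop. */
        BPF_STMT (BPF_LD|BPF_W|BPF_ABS, FORK_OFF (parent_pid)),
        BPF_STMT (BPF_ST, 0),
        BPF_STMT (BPF_LDX|BPF_W|BPF_MEM, 0),
        BPF_STMT (BPF_LD|BPF_W|BPF_ABS, FORK_OFF (parent_tgid)),
        BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_X, 0, 1, 0),
        BPF_STMT (BPF_RET|BPF_K, 0),

        /* Child pid == tgid, otherwise drop. */
        BPF_STMT (BPF_LD|BPF_W|BPF_ABS, FORK_OFF (child_pid)),
        BPF_STMT (BPF_ST, 0),
        BPF_STMT (BPF_LDX|BPF_W|BPF_MEM, 0),
        BPF_STMT (BPF_LD|BPF_W|BPF_ABS, FORK_OFF (child_tgid)),
        BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_X, 0, 1, 0),
        BPF_STMT (BPF_RET|BPF_K, 0),

        /* Everything checks out: deliver the whole packet. */
        BPF_STMT (BPF_RET|BPF_K, 0xffffffff),
    };
    size_t n = sizeof filter / sizeof filter[0];

    if (n > max)
        return 0;
    memcpy (out, filter, sizeof filter);

    return n;
}
```

The array is built inside a function because htons and htonl aren’t guaranteed to be constant expressions. Point a struct sock_fprog at the result and hand it to setsockopt with SO_ATTACH_FILTER, exactly as for the no-op filter.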
That’s it. Of course, what you do with this is up to you. One example could be a daemon that watches for excessive forks and kills fork bombs before they kill the machine. Since you get notification of changes of uid or gid, another example could be a security audit daemon, etc.
Upstart uses this interface for its own nefarious process tracking purposes.
The Importance of Being Tested
In addition to the regular posts documenting features of 0.6 and giving hints and tips about its usage, release announcements and so forth, I’ll also be posting insights and anecdotes about Upstart’s ongoing development. A particular story cropped up again this month, and I thought I’d share it with you.
When I began work on Upstart, one of the earliest decisions I made was to make sure the code was very-well covered by a comprehensive test suite. I’d been working with Robert Collins a lot in the previous couple of years and he is very much an advocate of practices such as Extreme Programming (XP) and Agile Development; especially the discipline of Test Driven Development.
I’d also recently seen a keynote by Andrew Tridgell in which he talked about some of the development of Samba 4, in particular the high use of both test cases and code generation in that code-base. Something he said in the keynote stuck with me: “untested code is broken code”.
Statistics obviously depend on exactly how you count lines of code, but using a simple semi-colon count the combined source code of libnih and Upstart is slightly over 20,000 lines of code. The combined source code of the test suite for both is slightly over 120,000 lines of code.
The init daemon is an extremely important part of a Linux system, if it crashes then you’re left with a kernel panic; if it simply misbehaves, you’re left with just severe problems. Not only was I changing it, but I was replacing a very simple dumb system (Sys V init) with something comparatively complex with rules and behaviours that needed rigorous testing.
It would have been very scary to have developed it without the careful testing, and I would have been very worried if anyone had agreed to replace such a core component of the system without this test suite to back up its behaviour.
That being said, maintaining the test suite can be a huge burden. Don’t believe what anybody tells you, if you’re writing test cases as well as code, then your pace of development slows as well. They’re right that you spend a lot less time debugging of course, but unlike in the commercial software business free software developers tend to release first and debug later. If you use a similarly high test to code ratio in your own project, then you’ll find that the time until your first release will be pretty long and the time between releases longer as well.
Another decision is whether to do Test Driven Development or not; that discipline requires that you always write the tests first, to fail, and only write code in order to make the tests pass. I’m not a fan of TDD, and I’ve no problem admitting that I mostly did not use it for Upstart. My gut feel is that TDD produces code that hangs, swings and loops just to deal with testing. It also just doesn’t suit my coding style: I like to write code from the middle outwards, the function API is the last thing I tend to fix, where TDD forces it to be the first.
I’m also not convinced TDD is really suitable for a language like C; it’s pretty hard to get a test case to compile, run and fail without writing any supporting code such as a header file, etc.
I have found TDD useful when I have code that really does break down into a single unit with a well-defined and obvious API, and that while the inputs and outputs have been obvious, the algorithm for getting between them wasn’t at the time.
What I’ve tended to do instead is write code naturally how I would, and write test cases alongside to run the code and make sure it’s working. As the code grows more complex, more test cases appear for it. One big advantage to this is then I don’t need to reboot or fire up a VM as much, I can test a large proportion of Upstart’s operation through testing.
Now, onto the stories. There are two similar ones.
One of the side-effects of testing Upstart so strongly is that the tests are not only driving the code I’ve written but also code in libraries and even in the kernel. One particular set of tests was covering the code in libnih and Upstart that handles watching the configuration directory for changes; it’s this code that means Upstart automatically reloads jobs when you edit them without needing an explicit signal.
One day these test cases started failing without warning. Investigation showed that they passed fine under older kernels, but with the newest kernel update to Ubuntu, they failed.
The inotify subsystem in the kernel had undergone a radical overhaul and rewrite. Rather than being its own code, it was completely rebased onto the new fsnotify system. Fortunately I was aware of this, and after careful checking that it was indeed the kernel behaviour that was now incorrect (and that it wasn’t incorrect before), I got in touch with Eric Paris, the author of the new code, and was able to give him minimal example code to replicate the problem.
inotify: check filename before dropping repeat events
This was a while ago, but pretty much the same story happened again recently, just this time not with the kernel.
Again, the story started with Upstart’s test suite failing. The engineer who first noticed it assumed it was an issue with the new build daemon and disabled the test for the time being. The test was in the part of the code testing Upstart’s interaction with D-Bus.
Now, sometimes I tend to write tests to deal with corner-cases and “what if” scenarios that I dream up. This isn’t always about testing my code, often it’s a case of finding out whether something is really possible or whether that thing misbehaves. These tests still stay in the suite of course.
A particular set of tests were intended to find out what happened if the D-Bus daemon crashed during initial connection, I considered this fairly important because at times the libdbus library has called exit() or abort() when things happened that it didn’t like. If you call that from the init daemon, the kernel panics.
These tests had worked fine for a couple of years (actually at the time I had to fix bugs in libdbus to make them pass) but now one of these tests was breaking. The disconnection was causing SIGPIPE to be delivered to the test.
Again, this turned out to be due to a change to D-Bus. Lennart Poettering had been working on some changes to avoid libdbus’s awkward SIGPIPE handling and replace it with the use of the MSG_NOSIGNAL flag. Unfortunately he’d missed a case in the authentication code. The side-effect was that if the D-Bus daemon had crashed, been killed, OOM’d, etc. during initial connection – the connecting application would have gone too. Especially bad for an init daemon.
Fortunately Upstart’s test suite caught it, and the fix was simple.
sysdeps-unix: use MSG_NOSIGNAL when sending creds
(reposted from http://upstart.at/2010/12/20/the-importance-of-being-tested/ – post comments there)
Events are like Methods
In last week’s post I talked about how Events can be treated like Signals, this week we’ll be looking at how Events can be treated like Methods. That might seem a little surprising, since normally one considers signals and methods as very different things, but to Upstart they are both just events.
What do I mean by Methods? You’ve almost certainly done some kind of programming, even if just a little scripting, so you should know about methods or functions.
In contrast to signals, which are just a notification that something happened on the system, a method is a request for the system to do something on your behalf. Usually to make some kind of change to the system state.
Likewise in contrast to the signals where you don’t care about the result, for a method you want to wait for the changes to be completed and perhaps even be notified if the method failed.
It’s just as easy to implement a method in Upstart as it is to implement something that considers an event a signal. Here’s an example of how you might implement a suspend method:
start on suspend
task
exec pm suspend
Doesn’t look that much different from a signal; the only new stanza in this is task (and that’s not necessary for a method either). So what happens if we want to trigger a suspend? We use the command:
root@worldofwarcraft:~# initctl emit suspend
The difference here from emitting a signal we demonstrated in the previous post is that we aren’t using the --no-wait flag.
So we emit the suspend event, and Upstart will start our job as a result; but initctl emit will not return immediately, it waits for the results of the event to complete before it returns.
Because we used the task stanza in the configuration, we’ve told Upstart that the process we execute is expected to take a limited amount of time and then finish by itself. This means that Upstart will not believe the job is complete until the process has exited, and will continue to block the event while it is still running.
Finally if the command exited with an error, that error is propagated back to the event that started it, and the initctl emit command will exit with an error code.
So now we can use Upstart events and jobs for two different purposes; we can announce changes to the system, and we can use them as methods to make changes to the system.
The most typical event that is used as a method on your system is the runlevel event used to change the runlevel for System-V compatibility and generally emitted by the telinit and shutdown tools. The /etc/init/rc.conf script that handles it can be pretty simple and looks not unlike the suspend example above:
start on runlevel [0123456]
task
exec /etc/init.d/rc $RUNLEVEL
What happens if you don’t include task? Well, that means Upstart will consider the job as ready when the process executed is running, and the event will be unblocked and initctl emit will return. If the service fails to start, then initctl will return with an error. This is great for methods that start (or stop) services.
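Because initctl emit blocks and reports failure through its exit code, calling a method from a program is just a matter of running it and checking the result. A rough sketch in C follows; run_and_wait and emit_event are names I’ve made up, and all it assumes is that initctl is somewhere on the PATH.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run a command and wait for it to finish.  Returns its exit status,
 * or -1 if it could not be started or was killed by a signal. */
int
run_and_wait (char *const argv[])
{
    pid_t pid = fork ();
    int   status;

    if (pid < 0)
        return -1;

    if (pid == 0) {
        execvp (argv[0], argv);
        _exit (127); /* exec failed */
    }

    if (waitpid (pid, &status, 0) < 0)
        return -1;

    return WIFEXITED (status) ? WEXITSTATUS (status) : -1;
}

/* Treat an event as a method call: emit it without --no-wait and
 * return once the jobs it affects have finished (or failed). */
int
emit_event (const char *event)
{
    char *const argv[] = { "initctl", "emit", (char *)event, NULL };

    return run_and_wait (argv);
}
```

A zero return from emit_event ("suspend") then means the method succeeded; anything else means it failed or couldn’t be run.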
Side-note: the start and stop commands act very much like method events, they block until the service is running or the task has finished and they return errors as well. However they’re not actually implemented as events right now, an oversight I intend to correct in Upstart 2.
(reposted from http://upstart.at/2010/12/16/events-are-like-methods/ – post comments there)
Not a WordPress Certified Engineer
Sorry for the spam folks, I’ve no idea what happened there, for some reason WordPress decided it was going to keep making duplicates of the most recent blog post. I have no idea why, but it does seem to have stopped.
Event matching in Upstart
A little while ago I was asked to solve a problem that somebody was having with Upstart, and I realised that people weren’t understanding how things were actually working and were just muddling along when doing event matching in jobs. This is unfortunate, because it hides some of Upstart’s true power, so I thought it high time I actually explained this.
Let’s start with a simple example. Fire up any Linux distribution with Upstart 0.6, Ubuntu or Fedora current releases will do, and create a file named /etc/init/example1.conf with the following content:
start on surprise
This is pretty simple, it’s a job that does nothing except declare that it’s started when the surprise event happens. We can demonstrate that works by emitting the event ourselves and checking the status of the job before and afterwards:
root@angrybirds:/etc/init# status example1
example1 stop/waiting
root@angrybirds:/etc/init# initctl emit surprise
root@angrybirds:/etc/init# status example1
example1 start/running
Nothing too surprising after all, I hope. The job did indeed start on the surprise event, and would now be running if we’d actually told Upstart to run something.
Incidentally I’m often asked why there isn’t a single list of events anywhere, that’s because you can match any event you like as long as you know something emits it. Events are supposed to come from all manner of sources. I do try and document them though, try running man 7 startup on your system to see an example of an event’s man page.
If events were just names, they’d be pretty boring. Events can also have attached environment variables, and these get put into the environment of any job’s process started by the event. Here’s /etc/init/example2.conf:
start on weather
script
echo $KIND > /tmp/weather
end script
This will now run a small shell script that outputs the $KIND environment variable to a file. This isn’t set anywhere, but we can pass it in the event.
root@angrybirds:/etc/init# cat /tmp/weather
cat: /tmp/weather: No such file or directory
root@angrybirds:/etc/init# initctl emit weather KIND=RAIN
root@angrybirds:/etc/init# cat /tmp/weather
RAIN
Ok, these are just examples but there are plenty of useful events on your system right now which carry environment variables such as which network interface just came up, and so on.
If you wanted to only run on a certain type of weather, you might think to check the value of $KIND within the script; you could do that, but it’s inefficient, ideally you don’t want your script run at all. Fortunately we can match the environment of an event in the job easily enough, here’s /etc/init/example3.conf:
start on weather KIND=snow
Hopefully you’ll figure that this one will only start if it’s snowing, and you’d be right:
root@angrybirds:/etc/init# status example3
example3 stop/waiting
root@angrybirds:/etc/init# initctl emit weather KIND=hail
root@angrybirds:/etc/init# status example3
example3 stop/waiting
root@angrybirds:/etc/init# initctl emit weather KIND=snow
root@angrybirds:/etc/init# status example3
example3 start/running
Events can have more than one environment variable, and you can have more than one match:
start on weather KIND=rain INTENSITY=heavy
The matches are actually globs, so you can use * and ? in there, and as well as = there’s also !=.
One good use for the latter is in the stop on stanza. As well as being available to the job’s processes, the environment variables can also be used in other stanzas within the job. Here’s a cute example for /etc/init/example4.conf:
start on weather KIND=rain or weather KIND=snow
stop on weather KIND!=$KIND
This one takes a bit of explaining. First of all to start the job we match the weather event with $KIND set to either rain or snow. Now we supply a condition to stop the job, and we also match the weather event with a given value of $KIND except this time we match what looks like itself.
In fact this expansion of $KIND is the value that variable had when the job was started, not the value in the new event. It says to stop the job if it stops raining, or stops snowing depending on which of the two started it. Most importantly, if an event simply repeats the same kind of weather, but maybe with a different intensity, the job carries on running (but it doesn’t have its environment updated – UNIX can’t do that).
root@angrybirds:/etc/init# status example4
example4 stop/waiting
root@angrybirds:/etc/init# initctl emit weather KIND=rain INTENSITY=heavy
root@angrybirds:/etc/init# status example4
example4 start/running
root@angrybirds:/etc/init# initctl emit weather KIND=rain INTENSITY=light
root@angrybirds:/etc/init# status example4
example4 start/running
root@angrybirds:/etc/init# initctl emit weather KIND=sun
root@angrybirds:/etc/init# status example4
example4 stop/waiting
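If the matching rules feel a bit abstract, here’s a rough model of them in Python. It’s only a sketch of the behaviour described above, not Upstart’s actual code. The matches function and its arguments are made up for illustration, and treating a variable that’s missing from the event as a non-match is an assumption.

```python
from fnmatch import fnmatchcase
from string import Template


def matches(condition, event_env, job_env=None):
    """Toy model of one NAME=pattern or NAME!=pattern match.

    condition -- for example "KIND=s*" or "KIND!=$KIND"
    event_env -- variables carried by the incoming event
    job_env   -- variables the job was started with, used to expand $NAME
    """
    name, pattern = condition.split("=", 1)
    negate = name.endswith("!")
    name = name.rstrip("!")
    # Assumption: a variable that is absent from the event never matches.
    if name not in event_env:
        return False
    # $KIND in the pattern is the running job's value, not the new event's.
    pattern = Template(pattern).safe_substitute(job_env or {})
    hit = fnmatchcase(event_env[name], pattern)
    return not hit if negate else hit


# The job was started by rain, so its own $KIND is "rain".
started_with = {"KIND": "rain", "INTENSITY": "heavy"}

# More rain, different intensity: the stop condition does not fire.
still_raining = matches("KIND!=$KIND", {"KIND": "rain", "INTENSITY": "light"}, started_with)

# The sun comes out: the stop condition fires.
sun_is_out = matches("KIND!=$KIND", {"KIND": "sun"}, started_with)
```

The point to notice is that the pattern is expanded from the job’s environment before it is compared with the event, which is why KIND!=$KIND isn’t the contradiction it looks like.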
Ok, last fake example before we get onto the fun bits. Remember the example from above:
start on weather KIND=rain INTENSITY=heavy
Upstart lets us shorten this a little. The environment variables are given in a particular order on the initctl command line, and if we know what that order is, we can match on position alone. So as long as we know a weather event always has a KIND followed by an INTENSITY, we could shortcut that to:
start on weather rain heavy
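Here’s a small Python sketch of what that shortcut amounts to. The expand_positional function is hypothetical, the variable order has to come from whatever you know about the event, and only accepting bare values before any NAME=value pairs is a simplifying assumption rather than a statement about Upstart’s parser.

```python
def expand_positional(args, order):
    """Rewrite positional values into NAME=value form.

    args  -- the words after the event name, for example ["rain", "heavy"]
    order -- the order the event's variables are known to come in
    """
    expanded = []
    seen_named = False
    for position, arg in enumerate(args):
        if "=" in arg:
            # Already a named match, keep it as written.
            seen_named = True
            expanded.append(arg)
        elif seen_named:
            # Assumption: bare values are only allowed before named ones.
            raise ValueError("positional value after a named match")
        else:
            # A bare value takes the name of the variable in that position.
            expanded.append(f"{order[position]}={arg}")
    return expanded


# "weather rain heavy", given that weather carries KIND then INTENSITY
long_form = expand_positional(["rain", "heavy"], ["KIND", "INTENSITY"])
```

With that order, long_form comes out as the two named matches from the longer version of the stanza.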
If you’ve used Upstart at all, you’ve seen that shortcut before. A lot. You may not have even realised it was a shortcut at all, and that’s what I hope to fix here.
Here’s an example of where you’ve used that:
start on started dbus
You should hopefully now recognise that started is the name of the event there, and dbus is simply the value of its first argument, whatever that might be. Remember I mentioned that events have man pages? Take a look at man 7 started, which is the man page for this event.
It documents which environment variables are attached to the started event, and most importantly what order they come in.
started JOB=JOB INSTANCE=INSTANCE [ENV]...
So really when we wrote the previous example, we were just using a shortcut to specify:
start on started JOB=dbus
You might wonder what difference this makes. A good example of how to exploit this is the stopped event. If you look at its man page (man 7 stopped) you’ll see it has a large number of environment variables specifying not only which job stopped but the reason for it stopping. One of those is the exit signal, for example.
Now you know that you’re just matching the $JOB environment variable, it’s obvious that you don’t have to! You can match any other environment variable or variables in the event, or none at all.
Here’s how to run a script if any other job on the system exits with a segmentation fault:
start on stopped EXIT_SIGNAL=SEGV
I said you didn’t have to match any variables at all, just as we didn’t in the first examples, and there’s a neat use for that with the job events. The starting event blocks the named job from actually starting until anything started by that event is itself started or, in the case of jobs marked task, finished.
Here’s a little job that runs every time another job is started, and blocks that job from actually starting until the script finishes.
start on starting
task
script
....
end script
Useful both for debugging and performance analysis.
Now for the really neat bit. So far we’ve concentrated on the environment variables that come from events, and those that Upstart puts into the job events. But we can influence these in rather useful ways.
Firstly we can declare a default value for an environment variable in a job. If no alternate value is given in the start event or command, then this default value wins:
start on mounted
env MOUNTPOINT=/tmp
script
....
end script
This script will run for each occurrence of the mounted event, and will hopefully get the value for $MOUNTPOINT from that event. But should the value be missing from the event, or the script be started manually by a system administrator, a default value is provided.
This isn’t a false example, that’s from the job on your system that cleans up the /tmp directory on boot. The default value wasn’t there in earlier versions of Ubuntu, and this had a rather disastrous side-effect when run by hand.
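The precedence is simple enough to model in a couple of lines of Python. This is just a sketch of the rule described above, with a hypothetical job_environment function. The /home value is invented for the example.

```python
def job_environment(defaults, supplied):
    """Combine a job's env defaults with values from the start event or command."""
    environment = dict(defaults)
    # Anything supplied by the event or the command line wins over the default.
    environment.update(supplied)
    return environment


# Started by hand with no variables: the default fills the gap.
by_hand = job_environment({"MOUNTPOINT": "/tmp"}, {})

# Started by an event that carries its own value: the event's value is used.
by_event = job_environment({"MOUNTPOINT": "/tmp"}, {"MOUNTPOINT": "/home"})
```

The first case is the one that matters for safety: without the default, a job started by hand would see an empty $MOUNTPOINT.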
Ok, we can set the values of environment variables from a job, and we don’t have to match the job name in the usual job events. We can combine these two facts in a very interesting way, because we can export the value of a job’s environment variable into its job events.
Here’s the first job:
env AM_A_DISPLAY_MANAGER=1
export AM_A_DISPLAY_MANAGER
This sets the default value of $AM_A_DISPLAY_MANAGER, but this isn’t a variable we ever expect to be supplied by an event so it just gets passed into the environment of its processes. It’s not that useful either on its own.
The export line is the useful one: it adds the value of the named environment variable to the job’s events, that is, the starting, started, stopping and stopped events.
Now, in another job, we can do:
start on started AM_A_DISPLAY_MANAGER=1
This is run when any job is started that has that environment variable in its events. In other words, we can tag classes of services so we don’t have to list every single one.
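As a rough illustration of the tagging idea, here’s a Python sketch. The job names gdm and cron and both functions are hypothetical, and the comparison is a plain equality test rather than a glob, to keep it short.

```python
def job_event(name, job, environment, exported):
    """Build a job event such as starting or started for the given job.

    Exported variables are copied from the job's environment into the event.
    """
    variables = {"JOB": job}
    for key in exported:
        variables[key] = environment[key]
    return name, variables


def start_on_matches(event, wanted_name, key, value):
    """Model "start on <wanted_name> <key>=<value>" with no match on JOB."""
    name, variables = event
    return name == wanted_name and variables.get(key) == value


# A hypothetical display manager job that tags itself, and one that does not.
gdm_started = job_event(
    "started", "gdm", {"AM_A_DISPLAY_MANAGER": "1"}, ["AM_A_DISPLAY_MANAGER"]
)
cron_started = job_event("started", "cron", {}, [])

reacts_to_gdm = start_on_matches(gdm_started, "started", "AM_A_DISPLAY_MANAGER", "1")
reacts_to_cron = start_on_matches(cron_started, "started", "AM_A_DISPLAY_MANAGER", "1")
```

The listening job never mentions gdm by name, so any other job that exports the same variable would trigger it too.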
And because everything in Upstart is the same fundamental type of thing, this can work in the opposite direction. For example we can put in our job:
env NEED_PORTMAP=1
export NEED_PORTMAP
This means our events will have NEED_PORTMAP=1 in them. Remembering that the job waits for the side-effects of the starting event to complete, we can now write in /etc/init/portmap.conf:
start on starting NEED_PORTMAP=1
So we can implement a dependency-based init system with Upstart, an event-based init system.
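To see why that gives you dependency ordering, here’s a toy simulation in Python. It’s a sketch under heavy assumptions: the jobs dictionary layout, the start function and the nfs-client job name are all invented for illustration, and the only thing it models is that a job waits for anything triggered by its starting event.

```python
def start(job, jobs, order=None):
    """Start a job, first starting anything that listens for its starting event.

    jobs maps a job name to a dict with two keys:
      "export"            -- variables the job exports into its events
      "start_on_starting" -- a (name, value) pair to match, or None
    Returns the order in which jobs finished starting.
    """
    order = [] if order is None else order
    if job in order:
        # Already started, nothing to do.
        return order
    # The starting event carries JOB plus whatever the job exports.
    event = dict(jobs[job]["export"], JOB=job)
    for other, spec in jobs.items():
        wanted = spec["start_on_starting"]
        if other != job and wanted and event.get(wanted[0]) == wanted[1]:
            # The original job is blocked until this call returns.
            start(other, jobs, order)
    order.append(job)
    return order


jobs = {
    "portmap": {"export": {}, "start_on_starting": ("NEED_PORTMAP", "1")},
    "nfs-client": {"export": {"NEED_PORTMAP": "1"}, "start_on_starting": None},
}

boot_order = start("nfs-client", jobs)
```

Asking for nfs-client pulls portmap in first, even though neither job names the other.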
I look forward to finding out what else you can do with it.
(reposted from http://upstart.at/2010/12/03/event-matching-in-upstart/ – post comments there)



