<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Scott James Remnant &#187; Technology</title>
	<atom:link href="http://netsplit.com/category/tech/feed/" rel="self" type="application/rss+xml" />
	<link>http://netsplit.com</link>
	<description></description>
	<lastBuildDate>Fri, 18 May 2012 22:02:21 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>git smash</title>
		<link>http://netsplit.com/2012/02/06/git-smash/</link>
		<comments>http://netsplit.com/2012/02/06/git-smash/#comments</comments>
		<pubDate>Mon, 06 Feb 2012 18:21:33 +0000</pubDate>
		<dc:creator>scott</dc:creator>
				<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://netsplit.com/?p=464</guid>
		<description><![CDATA[A common thing I end up doing while working on code is to make a series of commits, and then end up with changes in my working directory which I need to apply to an earlier revision in the history &#8230; <a href="http://netsplit.com/2012/02/06/git-smash/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>A common thing I end up doing while working on code is to make a series of commits, and then end up with changes in my working directory which I need to apply to an earlier revision in the history than the top-most one.</p>
<p>One common way to do this is to make a temporary commit with those changes, then use <em>git rebase -i</em> and move that commit below the one I want to amend and choose <em>fixup</em> to have it applied.</p>
<p>But that&#8217;s annoying manual work. There&#8217;s a more fun way. I have this script in my path as <em>git-smash</em>; it takes a revision as its single argument, e.g. <em>git smash deadbeef</em>:</p>
<pre>git reset --keep "$1"
EDITOR=true git commit -a --amend
git checkout HEAD@{2}
git rebase --onto HEAD@{1} HEAD@{2}</pre>
<p>This resets the revision history, keeping local changes, back to the given revision. Unfortunately <em>git reset</em> doesn&#8217;t have a mode which preserves the index so we then have to use <em>commit -a</em> to capture all of the local changes.</p>
<p>Now we use the reflog (history of revisions in the working tree) to manipulate the tree back to the previous state, first checking out the revision that was two back (before the amended commit and the reset, i.e. where we began). Then we rebase that onto the revision one back (before the checkout, i.e. the amended revision) using the revision that&#8217;s now two back (before the checkout and commit, i.e. the original revision we changed).</p>
<p>Mental gymnastics over, this is the same as what we were doing before, just in one handy command.</p>
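<p>For comparison, the temporary-commit approach can itself be scripted with git&#8217;s own <em>--fixup</em> and <em>--autosquash</em> options. This is a different technique from the reflog juggling above, not a description of <em>git-smash</em>. The function name <em>git_fixup_into</em> is made up for illustration, and this minimal sketch assumes the target revision is not the root commit and that the replay does not conflict.</p>
<pre># Hypothetical helper: fold all local changes into an earlier commit.
git_fixup_into() {
    # Resolve the argument to a commit id, failing early if it is not one.
    target=$(git rev-parse --verify "$1^{commit}") || return 1
    # Record the local changes as a "fixup!" commit aimed at the target.
    git commit -q -a --fixup="$target" || return 1
    # Replay history from the target's parent. --autosquash moves the fixup
    # commit under its target; GIT_SEQUENCE_EDITOR=true accepts the todo
    # list unedited, so no editor opens.
    GIT_SEQUENCE_EDITOR=true git rebase -q -i --autosquash "$target^"
}</pre>
<p>It would be used the same way, e.g. <em>git_fixup_into deadbeef</em>.</p>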
<p>Git still <a href="http://netsplit.com/2009/02/17/git-sucks/">sucks</a> though.</p>
]]></content:encoded>
			<wfw:commentRss>http://netsplit.com/2012/02/06/git-smash/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>A new release process for Ubuntu?</title>
		<link>http://netsplit.com/2011/09/08/new-ubuntu-release-process/</link>
		<comments>http://netsplit.com/2011/09/08/new-ubuntu-release-process/#comments</comments>
		<pubDate>Thu, 08 Sep 2011 23:06:02 +0000</pubDate>
		<dc:creator>scott</dc:creator>
				<category><![CDATA[Ubuntu]]></category>

		<guid isPermaLink="false">http://netsplit.com/?p=445</guid>
		<description><![CDATA[With the nomination period beginning for the Ubuntu Technical Board, big changes like Unity having arrived in Ubuntu recently, and the upcoming UDS for what will likely be a new LTS release of Ubuntu, it&#8217;s as good a time &#8230; <a href="http://netsplit.com/2011/09/08/new-ubuntu-release-process/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>With the nomination period beginning for the Ubuntu Technical Board, big changes like Unity having arrived in Ubuntu recently, and the upcoming UDS for what will likely be a new LTS release of Ubuntu, it&#8217;s as good a time as any to ask big questions about the development process, challenge assumptions, and make suggestions for big changes.</p>
<h2>Cadence</h2>
<p>The Ubuntu release process is well known, and its developers talk regularly about the <em>cadence</em> of it. A new release of Ubuntu comes out every six months, and each release follows a predictable pattern. I&#8217;ve stolen the following image from OMG! Ubuntu&#8217;s recent series about Ubuntu Development.</p>
<p><img class="aligncenter" title="Ubuntu Release Cycle" src="http://cdn.omgubuntu.co.uk/wp-content/uploads/2011/08/cycle-items.png" alt="" width="636" height="236" /></p>
<p>Each developer working on Ubuntu follows this cycle. When Ubuntu 11.10 is released on October 13th, they&#8217;ll begin again. After they recover, of course.</p>
<p>First there&#8217;ll be a bit of a wait for the archive to open; this gets quicker and quicker each release, but since it depends on a toolchain being built and other similarly fundamental things, it tends to be a period where most people figure out what they&#8217;re going to discuss at UDS.</p>
<p>UDS is a bit late for the 12.04 cycle, so the <em>merge period</em> will probably occupy developer time both before and after UDS. This isn&#8217;t represented on Daniel&#8217;s chart above, but this is the time when massive amounts of updates arrive from Debian; it&#8217;s a time of great instability for Ubuntu. At some point there will be an <em>Alpha 1</em>, but you won&#8217;t want to try and install that.</p>
<p>Planning for UDS is going to take up some time, as will writing up the results afterwards and turning them into <em>work items</em>. There&#8217;s also a UDS Hangover which nobody (except Robbie Williamson, when drafting the 10.10 Release Cycle) seems to like to talk about. Nothing gets done in the week or two following UDS; everybody is too wiped out.</p>
<p>So realistically speaking, development of features for 12.04 is going to start around mid-November at the earliest. And by features I mean the big headline things in Ubuntu; like Unity, like the Software Center, like the Installer. These things are important to get right.</p>
<p>Pretending for a moment that features are developed over the winter holidays like Thanksgiving, Christmas and New Year, you&#8217;ve got clear development time until Feature Freeze. The 12.04 Release Schedule isn&#8217;t published yet, but I figure that&#8217;s going to be somewhere around February 16th after which everyone switches to bug fixing and release testing.</p>
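<p>As a rough check of that window, taking 17 November 2011 as a stand-in for &#8220;mid-November&#8221; and the guessed 16 February 2012 as Feature Freeze, the arithmetic can be sketched in shell. Both dates are assumptions, the <em>weeks_between</em> name is made up, and the <em>-d</em> option assumes GNU <em>date</em>.</p>
<pre># Hypothetical helper: whole weeks between two YYYY-MM-DD dates (GNU date).
weeks_between() {
    start=$(date -u -d "$1" +%s) || return 1
    end=$(date -u -d "$2" +%s) || return 1
    # 604800 seconds in a week; integer division drops any partial week.
    echo $(( (end - start) / 604800 ))
}

weeks_between 2011-11-17 2012-02-16</pre>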
<p><strong>That&#8217;s just 13 weeks of development time!</strong></p>
<h2>Chaos</h2>
<p>So you&#8217;re an Ubuntu developer working on features for the upcoming release, and you don&#8217;t have anywhere near as much time as you&#8217;d expect to actually do the development work. What happens if you&#8217;re replacing something that works with something completely new? Can&#8217;t you just target a later release, and work continually until the feature freeze of that release?</p>
<p>It turns out that you can&#8217;t. There is an incredible emphasis in the Ubuntu planning process on targeting features for particular releases. This is the exact thing <em>you&#8217;re not supposed to do</em> with a time-based release schedule.</p>
<p>Unfortunately Canonical&#8217;s own performance review and management process is also based around this schedule. The Ubuntu developers so employed (the vast majority) have such fundamentals as their pay, bonuses, etc. dictated by how many of their assigned features and work items are in the release by feature freeze. It&#8217;s not the only requirement, but it&#8217;s the biggest one.</p>
<p>Your new feature is going to take twelve months of development time to fully develop before it&#8217;s truly a replacement for the existing feature in Ubuntu. What you <em>don&#8217;t</em> do is spend twelve months developing and land it when it&#8217;s a perfect replacement.</p>
<p>What you <em>do</em> do is develop it in 12-13 week bursts, which means it&#8217;s going to take you roughly four release cycles before it&#8217;s ready rather than two. <strong>And you land the quarter-complete feature in the first release, replacing the older stable feature.</strong></p>
<h2>Consequence</h2>
<p>If this were true, you would expect to see new features repeatedly arriving in Ubuntu before they were ready. Removing the old, deprecated feature and breaking things temporarily with the promise that everything will be better in the next release, certainly the one after that, definitely by the LTS.</p>
<p>Maybe you don&#8217;t believe that characterizes Ubuntu, in which case you should probably just stop reading now because we&#8217;re not going to agree on my fundamental complaint.</p>
<p>But I will say this: I know I&#8217;m responsible for doing this on more than one occasion <em>because I had to</em>, and I saw the exact same pattern in others&#8217; work. When I was a manager my reports complained that they had to follow this pattern, and I still see the same pattern today with features such as Unity and the Software Center.</p>
<p>Follow this pattern and developers are going to complain that they need a release where they don&#8217;t have any features to work on, and can just spend the time stabilizing and bug fixing.</p>
<p>Worse, follow this pattern and you&#8217;re going to create a user expectation that releases are going to be largely unstable and contain sweeping changes that are going to be surprising to administrators of Enterprise desktop deployments, and discourage them from using your distribution at all.</p>
<p>A kludge to this would be to overlay a second release schedule onto your first one, with more of an emphasis on stability and support. It&#8217;s a target for your developers to complete their features, or at least stabilize them in those 12 weeks; and it&#8217;s a target for your users to consider deployment. <strong>So three out of four of your releases are really just unstable previews of that final fourth release.</strong></p>
<h2>Complacency</h2>
<p>This second LTS release cycle solves the unstable release issue, so why is this a problem?</p>
<p>Because developer time is wasted; because user time is wasted; because user confidence is lost.</p>
<p>Because features can take longer than two years to develop; or even if a feature takes just two years, if it&#8217;s not begun immediately after the previous LTS release, it&#8217;s not going to be ready for the next one so you might postpone and lose the lead.</p>
<p>Because you might expect a knock-on degeneracy effect in the LTS releases as well; with 12.04 LTS being less stable than 10.04 LTS, which was less stable than 8.04 LTS which was less stable than 6.06 LTS. And it&#8217;s far too late now to have considered the 10.10/11.04/11.10/12.04 cycle to have been a Super-Long-Term-Support release and kept back the complete replacement of the desktop environment.</p>
<p>Because the original reason for the six-month cycle has already been forgotten: features are targeted towards releases, rather than released when ready; because the original base for the release schedule (GNOME) is no longer a key component of the distribution; because no other key component has adopted this schedule.</p>
<p><strong>Because there might be a better way.</strong></p>
<h2>Cataclysm</h2>
<p>What I&#8217;m going to suggest here is a completely new development process for Ubuntu, complete with details about how it would be implemented.</p>
<p><strong>I&#8217;m going to suggest a monthly release process, beginning with the 11.10 release.</strong> It so happens that this fits perfectly with Ubuntu&#8217;s version numbering system, the next release would be 11.11, followed by 11.12, followed by 12.01 and so on.</p>
<p>This monthly release would be simply known as <strong>release</strong> in your <em>sources.list</em>, updates would be published to it on the first week of the month. There would be no codenames, and due to the rapid releases, changes would be largely unsurprising and iterative on the previous releases.</p>
<p>In order to provide user testing, a second release known as <strong>beta</strong> would exist. It&#8217;s from this release that <strong>release</strong> would be copied on that first week of the month. <strong>beta</strong> would be updated every two weeks: on the first week of the month after it became the new <strong>release</strong>, and then at the half-way point of the month. Users who like a little bleeding on their edge can change their <em>sources.list</em> to use this more exciting release, or download appropriate disk images.</p>
<p>Developers wouldn&#8217;t run either of these, they would run the third release branch <strong>alpha</strong>. It&#8217;s from here that <strong>beta</strong> is updated; and from here that daily disk images would be generated.</p>
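<p>As an illustration only, a user following the monthly branch might carry a <em>sources.list</em> entry like the first line below, with the other two branches shown commented out. The suite names come from this proposal and do not exist in the real archive; the mirror URL and component list are simply the usual Ubuntu ones.</p>
<pre># Hypothetical sources.list for the proposed branches; pick one.
deb http://archive.ubuntu.com/ubuntu release main restricted universe multiverse
# deb http://archive.ubuntu.com/ubuntu beta main restricted universe multiverse
# deb http://archive.ubuntu.com/ubuntu alpha main restricted universe multiverse</pre>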
<p>Publishing from <strong>alpha</strong> to <strong>beta</strong>, and then from <strong>beta</strong> to <strong>release</strong> is handled semi-automatically. The release manager will track Release Critical bugs, and will hold up packages from copying from one to the other if they have outstanding problems. If this sounds familiar, it&#8217;s because this is exactly how the <a href="http://www.debian.org/devel/testing">Debian testing</a> distribution works and I recommend using the same software (which Ubuntu already uses to check for archive issues).</p>
<p>So where do developers upload? It&#8217;s tempting to just say to <strong>alpha</strong>, but if we say that, <strong>alpha</strong> will end up looking very different from <strong>release</strong> because it will be filled with unstable software that&#8217;s not ready for users yet. This will make it harder for problems in the <strong>release</strong> branch to be fixed, because none of the components are left in <strong>alpha</strong> because they&#8217;ve been replaced by something that&#8217;s not ready yet.</p>
<p><strong>Developers will upload to an <em>unpublished</em> trunk branch.</strong> Packages will be copied to <strong>alpha</strong> provided:</p>
<ul>
<li>there is a signed-off code review for the upload</li>
<li>the upload meets policy (lintian clean)</li>
<li>the upload builds on all released architectures</li>
<li>unit tests pass on all released architectures</li>
<li>functional and verification tests pass on all architectures for the archive as a whole</li>
</ul>
<p>I just introduced a bunch of new checks to the developer process there; I just introduced <em>code review</em>, mandatory <em>unit tests</em> and then piled <em>functional tests</em> and <em>verification tests</em> on top.</p>
<p>The first four are relatively self-explanatory; fail any of these tests and your upload has marked the <em><strong>tree red</strong></em>, in which case not only will your package fail to copy to <strong>alpha</strong>, but you&#8217;re about to have a conversation with the Release Manager.</p>
<p>For functional and verification tests, this means doing more automated QA. A failing test could be an automated installer run, or an automated boot-and-test run, etc. They&#8217;ll run sometime after the fact and will make the entire <em><strong>tree red</strong></em>. The Release Manager or their team will have to examine the logs to figure out the culprit.</p>
<p>So things aren&#8217;t copying to <strong>alpha</strong>; now one of two things is going to happen.</p>
<ul>
<li>the Release Manager <em>reverts</em> your upload. Because <strong>trunk</strong> is unpublished, this is simply overwriting with the older package from <strong>alpha</strong>; nobody except the original developer is going to have known about it</li>
<li>after talking with the developer, it&#8217;s decided that further uploads of other packages are required (e.g. due to dep-wait, or the bug being elsewhere) in which case the <em><strong>tree remains red</strong></em> while the developer (or another in rare cases) prepares that fix upload.</li>
</ul>
<p>While the <em><strong>tree is red</strong></em>, nobody else is allowed to upload unless it&#8217;s a fix for the problem. All effort should go to fixing the tree.</p>
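<p>The gate described above can be sketched as a small shell function. This is only an illustration of the rule, not an existing archive tool: the name <em>may_copy_to_alpha</em> and the <em>pass</em>/<em>fail</em> convention are invented here.</p>
<pre># Hypothetical sketch of the trunk-to-alpha gate. Arguments are the results
# ("pass" or "fail") of: code review, lintian, build, unit tests, and the
# archive-wide functional and verification tests.
may_copy_to_alpha() {
    [ "$#" -eq 5 ] || return 2    # expect exactly one result per check
    for result in "$@"; do
        if [ "$result" != "pass" ]; then
            echo "tree red: uploads blocked until fixed or reverted"
            return 1
        fi
    done
    echo "tree green: copy to alpha"
}</pre>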
<p>If the archive has to always remain stable, how do you develop large features such as Upstart, Unity, Ubiquity, Software Center, etc.? <strong>You use a PPA to do development, on your own timeline.</strong></p>
<p>If your feature takes twelve months to develop, you take twelve months to develop it in that PPA. You&#8217;re going to be posting regularly to mailing lists or blogging about your feature to encourage users to add your PPA to their <em>sources.list</em> to gain testing. Obviously you&#8217;ll be doing various uploads to the main series over time to get all your dependencies in early where they don&#8217;t conflict with what&#8217;s already there.</p>
<h2>Conclusion</h2>
<p>My proposal is a radical change to the Ubuntu Release Process, but surprisingly it would take very little technical effort to implement because all the pieces are already there including the work on performing automated functional and verification tests.</p>
<p>I believe it solves the problem of landing unstable features before they&#8217;re ready, because it almost entirely removes releases as a <em>thing</em>. As a developer you simply work in a PPA until you&#8217;ll pass review, and land a stable feature that can replace what was there before.</p>
<p>It solves the need for occasional stabilization and bug-fixing releases because the main series is always stable and can receive bug-fixes easily, separately from any development work going on. A developer can choose to focus on looking after the main series for some of their time in addition to their feature development work, or devote all of their time to it.</p>
<p>Another problem I&#8217;ve not talked about is that of building software on an unstable foundation, also solved by this change. Since developers will run <strong>alpha</strong>, and vendor developers can just run a relatively up-to-date, yet stable, <strong>release</strong> branch, software can be built on a solid foundation. Only the new feature or software itself is unstable until ready.</p>
<p>Canonical can keep its review schedule, and use developer uploads and work items; except rather than landing in a release, they can now land in a PPA.</p>
<p>Merges from Debian unstable can be handled pretty much continually as long as they keep the tree green, alternatively one can decide that users ultimately don&#8217;t care about an updated version of <em>cat</em> and until a case can be made (e.g. an open bug) for a package&#8217;s update, it need not be merged.</p>
<p>Users can now be confident of always receiving a stable operating system, because of the multiple testing and QA passes each change continually receives. Updates come in monthly, two-weekly or dailyish batches depending on where in the main series they choose to run.</p>
<p>Enterprise administrators can run this stable release, because it only changes <em>gradually</em> with well-tested updates. The big changes and features have a long gestation period in PPAs, with many advance notices and blog posts about them. They&#8217;re not a surprise and can be planned for well in advance of their landing.</p>
<p>Downsides will, doubtless, be found in the comments below.</p>
<p><em>For your consideration.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://netsplit.com/2011/09/08/new-ubuntu-release-process/feed/</wfw:commentRss>
		<slash:comments>189</slash:comments>
		</item>
		<item>
		<title>Tracing on Linux</title>
		<link>http://netsplit.com/2011/03/07/tracing-on-linux/</link>
		<comments>http://netsplit.com/2011/03/07/tracing-on-linux/#comments</comments>
		<pubDate>Mon, 07 Mar 2011 20:18:54 +0000</pubDate>
		<dc:creator>scott</dc:creator>
				<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://netsplit.com/?p=433</guid>
		<description><![CDATA[The Linux tracing APIs are a relatively new addition to the kernel and one of the most powerful new features it&#8217;s gained in a long time. Unfortunately the plethora of terms and names for the system can be confusing, so &#8230; <a href="http://netsplit.com/2011/03/07/tracing-on-linux/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>The Linux tracing APIs are a relatively new addition to the kernel and one of the most powerful new features it&#8217;s gained in a long time. Unfortunately the plethora of terms and names for the system can be confusing, so in this follow-up to my previous post on the proc connector and socket filter, I&#8217;ll take a look at achieving the same result using tracing and hopefully unravel a little of the mystery along the way.</p>
<p>Rather than write a program along the way, I&#8217;ll be referring to sample code found in the kernel tree itself so you&#8217;ll want a checkout. If you&#8217;re doing any work that touches the kernel further than standard POSIX APIs, I highly recommend this anyway; it&#8217;s quite readable and once you find your way around, is the quickest way to answer questions.</p>
<p>Grab your checkout with <em>git</em>:</p>
<pre># git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
# cd linux-2.6</pre>
<h2>Tracepoints</h2>
<p>One of the reasons there are so many terms and names is that, like most kernel systems, there are many layers and each of those layers is exposed as different developers have different requirements. An important lower layer is that of <em>tracepoints</em>, also known as <em>static tracepoints</em>. For these we&#8217;ll be looking at the code in the <em>samples/tracepoints</em> directory of the kernel source; kernelese documentation of the API can be found in <em>Documentation/trace/tracepoints.txt</em>.</p>
<p>A tracepoint is a placeholder function call in kernel code that the developer of that subsystem has deemed a useful point for debugging code to be able to hook into. Static refers to the fact they are fixed in place by the original developer. You can think of them as the kind of code you&#8217;d tend to guard with <em>#if DEBUG</em> in traditional C development, and like those statements they&#8217;re nearly free when they&#8217;re not in use, except that you can turn these on and off at runtime.</p>
<p>The <em>samples/tracepoints/tracepoint-sample.c</em> file is a kernel module that creates a <em>/proc/tracepoint-sample</em> file, and has a couple of tracepoints coded into it by the developer. First it includes the <em>samples/tracepoints/tp-samples-trace.h</em> header, which actually declares the tracepoints.</p>
<pre>DECLARE_TRACE(subsys_event,
        TP_PROTO(struct inode *inode, struct file *file),
        TP_ARGS(inode, file));
DECLARE_TRACE_NOARGS(subsys_eventb);</pre>
<p>You can think of these as declaring the function prototypes: one trace function has two arguments, an <em>inode</em> and a <em>file</em>; the other has no arguments. And if they&#8217;re function prototypes, we need to define a function; this is done back in the main <em>tracepoint-sample.c</em> file.</p>
<pre>DEFINE_TRACE(subsys_event);
DEFINE_TRACE(subsys_eventb);</pre>
<p>These tracepoints can now be called from the kernel code, passing the arguments that may need to be traced; remember that these have no side-effects unless enabled. The code that calls out to the tracepoints is in the <em>my_open()</em> function.</p>
<pre>trace_subsys_event(inode, file);
for (i = 0; i &lt; 10; i++)
        trace_subsys_eventb();</pre>
<p>Simple, huh? Don&#8217;t worry about the rest, this primer is simply so you can recognise tracepoints in the kernel source when you see them. I don&#8217;t expect you to go leaping around the kernel adding tracepoints and rebuilding it, unless you want to, of course.</p>
<p>So how do you hook into tracepoints? The answer is from other kernel code, usually in the form of a loadable module such as that defined by <em>samples/tracepoints/tracepoint-probe-sample.c</em>; this includes the same header file as before to get the prototypes.</p>
<pre>#include "tp-samples-trace.h"</pre>
<p>In the module <em>__init</em> function it registers two functions of its own as hooks into the tracepoints; this activates each tracepoint and turns the code in the previous module from a near no-op into a call to these functions.</p>
<pre>ret = register_trace_subsys_event(probe_subsys_event, NULL);
WARN_ON(ret);
ret = register_trace_subsys_eventb(probe_subsys_eventb, NULL);
WARN_ON(ret);</pre>
<p>And obviously in the module <em>__exit</em> function we have to unregister these, otherwise we leave dangling things.</p>
<pre>unregister_trace_subsys_eventb(probe_subsys_eventb, NULL);
unregister_trace_subsys_event(probe_subsys_event, NULL);
tracepoint_synchronize_unregister();</pre>
<p>As to those functions, they take an argument which is a pointer to the same data as the second argument to the register call, and then otherwise take the arguments defined in <em>DECLARE_TRACE</em>. You can do pretty much what you want here; the example simply extracts the filename and outputs it with a <em>printk()</em>.</p>
<pre>static void probe_subsys_event(void *ignore,
                               struct inode *inode, struct file *file)
{
        path_get(&amp;file-&gt;f_path);
        dget(file-&gt;f_path.dentry);
        printk(KERN_INFO "Event is encountered with filename %s\n",
                file-&gt;f_path.dentry-&gt;d_name.name);
        dput(file-&gt;f_path.dentry);
        path_put(&amp;file-&gt;f_path);
}</pre>
<p>So that&#8217;s tracepoints; they&#8217;re a low-level method for a kernel developer to pick places in their code that may be useful for debugging and a method for loadable kernel code such as modules to hook into those places.</p>
<h2>Trace Events (Kernel API)</h2>
<p>So you know about tracepoints, and you&#8217;ve almost certainly heard about <em>Trace Events</em>, but what&#8217;s the difference? Well, firstly, trace events are actually built on tracepoints; you can think of them as a higher level API &#8211; and that&#8217;s why I covered tracepoints first. Secondly, trace events are usable from userspace! We don&#8217;t need to write kernel modules to be able to hook into them, but obviously we can only read data this way.</p>
<p>In fact, since they&#8217;re tracepoints with extra benefits, you wouldn&#8217;t think anyone would use the basic tracepoints at all, and you&#8217;d be right! A <em>git grep DECLARE_TRACE</em> in a current kernel tree will show you that the only user of the raw tracepoint macros is actually the trace events system.</p>
<p>Since everyone just defines trace events, a primer on the kernel side will be useful, so we&#8217;ll be looking at the code in <em>samples/trace_events</em>; if you want to read the userspace API documentation, it&#8217;s in <em>Documentation/trace/events.txt</em>.</p>
<p>Just one source file and header file this time, first we&#8217;ll look at the header <em>samples/trace_events/trace-events-sample.h</em>; this seems pretty complicated at first, but almost all of this is boiler-plate code that gets copied into every trace events header. The important bit is the <em>TRACE_EVENT</em> macro:</p>
<pre>TRACE_EVENT(foo_bar,
        TP_PROTO(char *foo, int bar),
        TP_ARGS(foo, bar),
        TP_STRUCT__entry(
                __array(        char,   foo,    10              )
                __field(        int,    bar                     )
        ),
        TP_fast_assign(
                strncpy(__entry-&gt;foo, foo, 10);
                __entry-&gt;bar    = bar;
        ),
        TP_printk("foo %s %d", __entry-&gt;foo, __entry-&gt;bar)
);</pre>
<p>The first part of this looks just like <em>DECLARE_TRACE</em>, and that&#8217;s no accident; we&#8217;re still declaring a tracepoint too, so this will give us a function with the prototype declared in <em>TP_PROTO</em> and argument names in <em>TP_ARGS</em>.</p>
<p>The <em>TP_STRUCT__entry</em> and <em>TP_fast_assign</em> bits are new though. As well as declaring a tracepoint, trace events come with the equivalent &#8220;loadable module&#8221; code that copies data from the arguments of the function into a struct that can be examined from userspace. <em>TP_STRUCT__entry</em> defines that structure, and <em>TP_fast_assign</em> is C code that should quickly copy data into that structure.</p>
<p>So we&#8217;ve declared a tracepoint, we&#8217;ve defined a structure containing an array of 10 char and an int, and we&#8217;ve written C code to copy from the tracepoint arguments into that structure. The last bit of the trace event is <em>TP_printk</em>, which does exactly what you&#8217;d expect. Since the most common (at least, first) use of a trace event is going to be to output something, this macro defines a format string for that <em>printk()</em> call.</p>
<p>Back in the <em>samples/trace_events/trace-events-sample.c</em> file, we include this header but first set a special define. This is only set once in the entire kernel source, and this results in all of the functions being defined; i.e. <em>TRACE_EVENT</em> becomes <em>DEFINE_TRACE</em> rather than <em>DECLARE_TRACE</em>.</p>
<pre>#define CREATE_TRACE_POINTS
#include "trace-events-sample.h"</pre>
<p>All other users of this header simply include the header.</p>
<p>From here on in the source, the trace event is just a tracepoint and is called in the same way: as a function call.</p>
<pre>trace_foo_bar("hello", cnt);</pre>
<p>That&#8217;s a kernel-side primer, you should be able to <em>git grep</em> through the source and find trace events. But now it&#8217;s time to get into the fun bit and look at the userspace API for dealing with them; remember if you want anything more complicated, they&#8217;re just tracepoints so you can write kernel modules and hook into them as before.</p>
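<p>One way to do that search, sketched under some assumptions: many events live in headers under <em>include/trace/events</em>, where <em>TRACE_SYSTEM</em> names the subsystem, and events built from a shared class use <em>DEFINE_EVENT</em> rather than <em>TRACE_EVENT</em>, so this list is not exhaustive. The function name is made up, and it uses plain <em>grep -r</em> so it also works outside a git checkout.</p>
<pre># Hypothetical helper: print the names of TRACE_EVENT()s found under a path.
list_trace_events() {
    # -r recurse, -h omit file names, -o print only the match, -E extended
    # regex; \b keeps longer macro names ending in TRACE_EVENT from matching.
    grep -rhoE '\bTRACE_EVENT\([A-Za-z0-9_]+' "$1" |
        sed 's/^TRACE_EVENT(//' |
        sort -u
}</pre>
<p>From the top of the kernel tree, <em>list_trace_events include/trace/events/sched.h</em> should include <em>sched_process_fork</em> among its output.</p>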
<h2>Trace Events (Userspace API)</h2>
<p>We&#8217;re in userspace now, so you can leave the kernel source directory, but you do need to be root and you may need to mount a filesystem. This is because some distributions (like Ubuntu) have an allergy to debugging (seriously, they even disable things like <em>gdb -p</em>).</p>
<p>Try and change into the <em>/sys/kernel/debug/tracing</em> directory.</p>
<pre># cd /sys/kernel/debug/tracing</pre>
<p>If this fails, you&#8217;ll need to mount the <em>debugfs</em> filesystem and try again.</p>
<pre># mount -t debugfs none /sys/kernel/debug
# cd /sys/kernel/debug/tracing</pre>
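<p>If you script this, it helps to check first whether <em>debugfs</em> is already mounted. A minimal sketch, assuming the usual <em>/proc/mounts</em> layout of device, mount point and filesystem type separated by single spaces; the function name is made up.</p>
<pre># Hypothetical helper: succeed if debugfs is mounted at the given directory.
# The second argument (default /proc/mounts) exists only to ease testing.
debugfs_mounted() {
    grep -qs " $1 debugfs " "${2:-/proc/mounts}"
}</pre>
<p>As root, <em>debugfs_mounted /sys/kernel/debug || mount -t debugfs none /sys/kernel/debug</em> then mounts it only when needed.</p>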
<p>With that done, we should make sure tracing is enabled.</p>
<pre># cat tracing_enabled
1</pre>
<p>If that&#8217;s 0, enable it:</p>
<pre># echo 1 &gt; tracing_enabled</pre>
<p>So we&#8217;ve enabled tracing, but what can we trace? Trace events are exposed in the <em>events</em> sub-directory in two levels: the first is the subsystem and the second is the trace events themselves. Since in my last blog post we were looking at tracing <em>forks</em>, it would be great if there were trace events for doing just that. This is where it helps to be able to <em>git grep</em> around the kernel source and recognise trace events, so you at least know the right subsystem name; and it turns out that the <em>sched</em> subsystem has exactly the events we wanted.</p>
<pre>deathspank tracing# ls events/sched
enable                   sched_process_exit/  sched_stat_sleep/
filter                   sched_process_fork/  sched_stat_wait/
sched_kthread_stop/      sched_process_free/  sched_switch/
sched_kthread_stop_ret/  sched_process_wait/  sched_wait_task/
sched_migrate_task/      sched_stat_iowait/   sched_wakeup/
sched_pi_setprio/        sched_stat_runtime/  sched_wakeup_new/</pre>
<p><em>sched_process_fork</em> sounds exactly right. If you look at it, it&#8217;s a directory that contains four files: <em>enable</em>, <em>filter</em>, <em>format</em> and <em>id</em>. I bet you can guess how to enable fork tracing, but if not:</p>
<pre># cat events/sched/sched_process_fork/enable
0
# echo 1 &gt; events/sched/sched_process_fork/enable</pre>
<p>Pretty painless, so go ahead and run a few things, and turn the tracing off again when you&#8217;re done.</p>
<pre># echo 0 &gt; events/sched/sched_process_fork/enable</pre>
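<p>Since the eventual goal is to do this from a program rather than a shell, here is a minimal sketch of the same toggle in C. The helper name <em>write_control_file</em> is my own invention, not a kernel or library API; all it assumes is that the control file accepts a short string written to it, as the <em>echo</em> commands above do.</p>

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write a short string such as "1" or "0" to an
 * existing control file.  Returns 0 on success, -1 on failure. */
int
write_control_file (const char *path, const char *value)
{
        size_t  len = strlen (value);
        ssize_t written;
        int     fd;

        /* No O_CREAT: the file must already exist, as it does in debugfs. */
        fd = open (path, O_WRONLY);
        if (fd < 0)
                return -1;

        written = write (fd, value, len);
        if (close (fd) < 0)
                return -1;

        return (written == (ssize_t)len) ? 0 : -1;
}
```

<p>Called with the full path of the <em>events/sched/sched_process_fork/enable</em> file and the string "1" or "0", this should be equivalent to the two <em>echo</em> commands; it still needs to run as root.</p>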
<p>Now let&#8217;s look at the result of our trace; recall that every trace event comes with a free <em>printk()</em> of formatted output? We can find the output from those in the top-level <em>trace</em> file.</p>
<pre># tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
             zsh-2667  [001]  6658.716936: sched_process_fork: comm=zsh pid=2667 child_comm=zsh child_pid=2748</pre>
<p>So for each process fork, we get the parent and child process ids along with the process name. Pretty much exactly what we want!</p>
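<p>If you wanted to consume that output from a program, a rough sketch of parsing one line might look like the following. The <em>struct fork_record</em> type and <em>parse_fork_line</em> function are hypothetical names of mine, and the sketch assumes the line format shown above and a process name without spaces in it.</p>

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record type for one sched_process_fork line. */
struct fork_record {
        char comm[17];        /* parent name, up to 16 characters plus NUL */
        int  pid;
        char child_comm[17];
        int  child_pid;
};

/* Returns 1 and fills *rec if line is a sched_process_fork event,
 * otherwise returns 0 (header lines, other events). */
int
parse_fork_line (const char *line, struct fork_record *rec)
{
        const char *p = strstr (line, "sched_process_fork: ");

        if (! p)
                return 0;
        p += strlen ("sched_process_fork: ");

        /* %16s stops at whitespace, so names containing spaces are
         * not handled by this sketch. */
        return sscanf (p, "comm=%16s pid=%d child_comm=%16s child_pid=%d",
                       rec->comm, &rec->pid,
                       rec->child_comm, &rec->child_pid) == 4;
}
```

<p>Feeding it each line read from the <em>trace</em> file skips the comment header and picks out the parent and child process ids.</p>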
<p>There&#8217;s plenty to play around with using this API. As you&#8217;ve probably noticed, you can enable entire subsystems or all events using the <em>enable</em> files at the subsystem and events levels; there&#8217;s also a <em>set_event</em> file at the <em>tracing</em> level which can be used to make batch changes to tracing. See the kernel documentation for more details.</p>
<p>You&#8217;re probably wondering though what happened to the rest of the struct, especially if there are fields that aren&#8217;t included in the default <em>printk()</em>. You can examine the struct format by reading the <em>format</em> file of a trace event, and you can use this with the <em>filter</em> file to exclude events you&#8217;re not interested in. Again anything I write here would just be duplicating the kernel documentation, so go read <em>Documentation/trace/events.txt</em>.</p>
<h2>Perf</h2>
<p>After a little bit of playing you&#8217;ll realise that tracing is not limited to your current process or shell: you&#8217;ll get events for processes you&#8217;re not interested in, and also events for subsystems you&#8217;re not interested in if other processes are doing traces of their own. There&#8217;s also only one global filter for the entire trace events system, so other users or processes doing tracing could override yours.</p>
<p>There&#8217;s an even higher-level interface that we can use to work around those problems: the <em>perf</em> tool. Originally designed as the userspace component of the performance counters system, it&#8217;s grown a wide variety of extra features, one of which is the ability to work with kernel tracepoints as an input source.</p>
<p>Since trace events are tracepoints, these count!</p>
<p>So let&#8217;s say we want to record the forks made by a process we run, without fear of contamination from other processes on the system or other users performing tracing. Using perf we can simply run</p>
<pre># perf record -e sched:sched_process_fork bash</pre>
<p>And run as many commands as we like in that shell. When the shell exits, perf will write the results of the tracing to a perf.data file for analysis.</p>
<pre># exit
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.017 MB perf.data (~735 samples) ]</pre>
<p>We can analyse this later using various <em>perf</em> sub-commands, the simplest of which is an argument-less <em>perf script</em> which outputs the equivalent of reading the trace file.</p>
<pre># perf script
            bash-3141  [003] 10201.049939: sched_process_fork: comm=bash pid=3141 child_comm=bash child_pid=3142
           :3142-3142  [001] 10201.050391: sched_process_fork: comm=bash pid=3142 child_comm=bash child_pid=3143</pre>
<h2>Conclusion</h2>
<p>As an administrator debugging their system, or a developer trying to understand the performance or events timeline of their work, <em>perf</em> is perfect. It&#8217;s a very well documented tool with all of the bells and whistles you need for tracing a wide variety of events.</p>
<p>Unfortunately the API between <em>perf</em> and the kernel is a private one; the perf tool source is shipped as part of the kernel source, and they are version-mated with each other.</p>
<p>Recall that the topic of the previous blog post was to write a program to follow forks, rather than doing it as a system administrator.</p>
<p>If we want to write software to do it, the lower (but still high) level <em>trace events</em> API seems a better bet. There is a wide range of applications of this API; for example, the <em>ureadahead</em> program in an Ubuntu system uses it to trace the <em>open()</em> and <em>exec()</em> syscalls the system performs during boot so it knows which files to cache for faster boot times. But it&#8217;s easy for another process, or a user, to interfere with the results of this tracing, so it&#8217;s not ideal for our purpose either.</p>
<p>Finally the <em>tracepoints</em> API is too low-level, writing a kernel module and building and maintaining it for each kernel version is just not on the cards.</p>
<p>So it would appear we&#8217;re at a dead-end for using tracing to do what we want. That&#8217;s not the end of the story though; there are other tracing tools such as <em>kprobes</em> and <em>ftrace</em> that I haven&#8217;t covered yet. Unfortunately this blog post has gotten a little too long, and the coverage of tracepoints, trace events and perf was worthwhile in and of itself, so we&#8217;ll have to pick those up next time!</p>
]]></content:encoded>
			<wfw:commentRss>http://netsplit.com/2011/03/07/tracing-on-linux/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
		</item>
		<item>
		<title>The Proc Connector and Socket Filters</title>
		<link>http://netsplit.com/2011/02/09/the-proc-connector-and-socket-filters/</link>
		<comments>http://netsplit.com/2011/02/09/the-proc-connector-and-socket-filters/#comments</comments>
		<pubDate>Wed, 09 Feb 2011 19:55:20 +0000</pubDate>
		<dc:creator>scott</dc:creator>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[Upstart]]></category>

		<guid isPermaLink="false">http://netsplit.com/?p=420</guid>
		<description><![CDATA[The proc connector is one of those interesting kernel features that most people rarely come across, and even more rarely find documentation on. Likewise the socket filter. This is a shame, because they&#8217;re both really quite useful interfaces that might &#8230; <a href="http://netsplit.com/2011/02/09/the-proc-connector-and-socket-filters/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>The proc connector is one of those interesting kernel features that most people rarely come across, and even more rarely find documentation on. Likewise the socket filter. This is a shame, because they&#8217;re both really quite useful interfaces that might serve a variety of purposes if they were better documented.</p>
<p>The proc connector allows you to receive notification of process events such as <em>fork</em> and <em>exec</em> calls, as well as changes to a process&#8217;s <em>uid</em>, <em>gid</em> or <em>sid</em> (session id). These are provided through a socket-based interface by reading instances of <em>struct proc_event</em> defined in the kernel header.</p>
<pre>#include &lt;linux/cn_proc.h&gt;</pre>
<p>The interface is built on the more generic <em>connector</em> API, which itself is built on the generic <em>netlink</em> API. These interfaces add some complexity as they are intended to provide bi-directional communication between the kernel and userspace; the connector API appears to have been largely forgotten as newer such socket interfaces simply declare their own first-class socket classes. So we need the headers for those too.</p>
<pre>#include &lt;linux/netlink.h&gt;
#include &lt;linux/connector.h&gt;</pre>
<p>(For brevity, I&#8217;ll omit any standard boilerplate such as the headers you need for syscalls and library functions that you should be used to as well as function definitions, error checking, and so-forth.)</p>
<p>Ok, now we&#8217;re ready to create the <em>connector</em> socket. This is straightforward enough; since we&#8217;re dealing with atomic messages rather than a stream, a datagram socket is appropriate.</p>
<pre>int sock;
sock = socket (PF_NETLINK, SOCK_DGRAM | SOCK_NONBLOCK | SOCK_CLOEXEC,
               NETLINK_CONNECTOR);</pre>
<p>To select the <em>proc connector</em> we bind the socket using a <em>struct sockaddr_nl</em> object.</p>
<pre>struct sockaddr_nl addr;
memset (&amp;addr, 0, sizeof addr);
addr.nl_family = AF_NETLINK;
addr.nl_pid = getpid ();
addr.nl_groups = CN_IDX_PROC;

bind (sock, (struct sockaddr *)&amp;addr, sizeof addr);</pre>
<p>Unfortunately that&#8217;s not quite enough yet; the proc connector socket is a bit of a firehose, so it doesn&#8217;t in fact send any messages until a process has subscribed to it. So we have to send a subscription message.</p>
<p>As I mentioned before, the <em>proc connector</em> is built on top of the generic <em>connector</em> and that itself is on top of <em>netlink</em>, so sending that subscription message also involves embedding a message inside a message inside a message.  If you understood Christopher Nolan&#8217;s Inception, you should do just fine.</p>
<p>Since we&#8217;re nesting a <em>proc connector operation</em> message inside a <em>connector</em> message inside a <em>netlink</em> message, it&#8217;s easiest to use an <em>iovec</em> for this kind of thing.</p>
<pre>struct iovec iov[3];
char nlmsghdrbuf[NLMSG_LENGTH (0)];
struct nlmsghdr *nlmsghdr = (struct nlmsghdr *)nlmsghdrbuf;
struct cn_msg cn_msg;
enum proc_cn_mcast_op op;

nlmsghdr-&gt;nlmsg_len = NLMSG_LENGTH (sizeof cn_msg + sizeof op);
nlmsghdr-&gt;nlmsg_type = NLMSG_DONE;
nlmsghdr-&gt;nlmsg_flags = 0;
nlmsghdr-&gt;nlmsg_seq = 0;
nlmsghdr-&gt;nlmsg_pid = getpid ();

iov[0].iov_base = nlmsghdrbuf;
iov[0].iov_len = NLMSG_LENGTH (0);

cn_msg.id.idx = CN_IDX_PROC;
cn_msg.id.val = CN_VAL_PROC;
cn_msg.seq = 0;
cn_msg.ack = 0;
cn_msg.len = sizeof op;

iov[1].iov_base = &amp;cn_msg;
iov[1].iov_len = sizeof cn_msg;

op = PROC_CN_MCAST_LISTEN;

iov[2].iov_base = &amp;op;
iov[2].iov_len = sizeof op;

writev (sock, iov, 3);</pre>
<p>The <em>netlink </em>message length is the combined length of the following <em>connector</em> and <em>proc connector operation</em> messages, and is otherwise simply a message from our process id with no following messages.  However all of the interfaces to netlink take a lot of care to make sure the following structure in the message is aligned as wide as possible using the <em>NLMSG_LENGTH</em> macro, to avoid issues with platforms that have fixed alignment for data types, so we have to be careful of that too.</p>
<p>So we actually have a bit of padding between the <em>struct nlmsghdr</em> and the <em>struct cn_msg</em>, this is accomplished by actually using a character buffer of the right size for the first iovec element and accessing it through a <em>struct nlmsghdr</em> pointer.</p>
<p>The <em>connector</em> message indicates that it is relevant to the <em>proc connector</em> through the <em>idx</em> and <em>val</em> fields, and the length is the length of the <em>proc connector operation</em> message.</p>
<p>Finally the <em>proc connector operation</em> message (just an enum) says we want to subscribe. Why isn&#8217;t there padding between the <em>connector</em> and <em>proc connector operation</em> messages? Because the last element in <em>struct cn_msg</em> is a zero-width type which results in the right padding; this interface is rather newer than netlink.</p>
<p><em>iovec</em> stitches it all together so it&#8217;s sent as a single message. Visualized, the message looks like this:</p>
<p><a href="http://netsplit.com/wp-content/uploads/2011/02/enum-proc_cn_mcast_op.png"><img class="aligncenter size-full wp-image-426" title="enum proc_cn_mcast_op" src="http://netsplit.com/wp-content/uploads/2011/02/enum-proc_cn_mcast_op.png" alt="" width="640" height="240" /></a></p>
<p>There&#8217;s a matching <em>PROC_CN_MCAST_IGNORE</em> message if you want to turn off the firehose without closing the socket.</p>
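<p>As an alternative view of the same layout, here is a sketch that packs the three nested messages into one contiguous buffer instead of an <em>iovec</em>. The <em>build_subscription</em> function and <em>SUBSCRIBE_LEN</em> macro are hypothetical names of mine; the structures and constants come from the headers already included above.</p>

```c
#include <string.h>
#include <unistd.h>
#include <linux/netlink.h>
#include <linux/connector.h>
#include <linux/cn_proc.h>

/* Total length: netlink header (with its padding), connector header,
 * then the operation enum. */
#define SUBSCRIBE_LEN \
        NLMSG_LENGTH (sizeof (struct cn_msg) + sizeof (enum proc_cn_mcast_op))

/* Hypothetical helper: fill buf with a LISTEN or IGNORE request.
 * Returns the message length, or 0 if buf is too small. */
size_t
build_subscription (void *buf, size_t size, enum proc_cn_mcast_op op)
{
        struct nlmsghdr nlmsghdr;
        struct cn_msg   cn_msg;
        char           *p = buf;

        if (size < SUBSCRIBE_LEN)
                return 0;
        memset (buf, 0, SUBSCRIBE_LEN);

        /* Outermost: the netlink header. */
        memset (&nlmsghdr, 0, sizeof nlmsghdr);
        nlmsghdr.nlmsg_len = SUBSCRIBE_LEN;
        nlmsghdr.nlmsg_type = NLMSG_DONE;
        nlmsghdr.nlmsg_pid = getpid ();
        memcpy (p, &nlmsghdr, sizeof nlmsghdr);

        /* Middle: the connector header, placed after the aligned
         * netlink header. */
        memset (&cn_msg, 0, sizeof cn_msg);
        cn_msg.id.idx = CN_IDX_PROC;
        cn_msg.id.val = CN_VAL_PROC;
        cn_msg.len = sizeof op;
        memcpy (p + NLMSG_LENGTH (0), &cn_msg, sizeof cn_msg);

        /* Innermost: the proc connector operation. */
        memcpy (p + NLMSG_LENGTH (0) + sizeof cn_msg, &op, sizeof op);

        return SUBSCRIBE_LEN;
}
```

<p>The resulting buffer could then be handed to a single <em>send()</em> call on the socket; passing <em>PROC_CN_MCAST_IGNORE</em> instead builds the matching unsubscribe request.</p>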
<p>Ok, the firehose is on; now we need to read the stream of messages.  Just like the message we sent, the messages we receive are actually <em>netlink</em> messages, and inside those <em>netlink</em> messages are <em>connector</em> messages, and inside those are <em>proc connector</em> messages.</p>
<p>Netlink allows for all sorts of things like multi-part messages, but in reality we can ignore most of that since connector doesn&#8217;t use them; still, it&#8217;s worth future-proofing ourselves and being liberal in what we accept.</p>
<pre>struct msghdr msghdr;
struct sockaddr_nl addr;
struct iovec iov[1];
char buf[PAGE_SIZE];
ssize_t len;

msghdr.msg_name = &amp;addr;
msghdr.msg_namelen = sizeof addr;
msghdr.msg_iov = iov;
msghdr.msg_iovlen = 1;
msghdr.msg_control = NULL;
msghdr.msg_controllen = 0;
msghdr.msg_flags = 0;

iov[0].iov_base = buf;
iov[0].iov_len = sizeof buf;

len = recvmsg (sock, &amp;msghdr, 0);</pre>
<p>Why do we use <em>recvmsg</em> rather than just <em>read</em>? Because <em>netlink</em> allows arbitrary processes to send messages to each other, so we need to make sure the message actually comes from the kernel; otherwise you have a potential security vulnerability. <em>recvmsg</em> lets us receive the sender address as well as the data.</p>
<pre>if (addr.nl_pid != 0)
        continue;</pre>
<p>(I&#8217;m assuming you&#8217;re reading in a loop there.)</p>
<p>So now we have a <em>netlink</em> packet from the kernel; this may contain multiple individual netlink messages (it doesn&#8217;t, but it may). So we iterate over those.</p>
<pre>for (struct nlmsghdr *nlmsghdr = (struct nlmsghdr *)buf;
     NLMSG_OK (nlmsghdr, len);
     nlmsghdr = NLMSG_NEXT (nlmsghdr, len))</pre>
<p>And we should ignore error or no-op messages from netlink.</p>
<pre>if ((nlmsghdr-&gt;nlmsg_type == NLMSG_ERROR)
    || (nlmsghdr-&gt;nlmsg_type == NLMSG_NOOP))
        continue;</pre>
<p>Inside each individual <em>netlink</em> message is a <em>connector</em> message, we extract that and make sure it comes from the <em>proc connector</em> system.</p>
<pre>struct cn_msg *cn_msg = NLMSG_DATA (nlmsghdr);

if ((cn_msg-&gt;id.idx != CN_IDX_PROC)
    || (cn_msg-&gt;id.val != CN_VAL_PROC))
        continue;</pre>
<p>Now we can safely extract the <em>proc connector</em> message; this is a <em>struct proc_event</em> that we haven&#8217;t seen before. It&#8217;s quite a large structure definition so I won&#8217;t paste it here, since it contains a <em>union</em> for each of the different possible message types. Instead here&#8217;s code to actually print the relevant contents for an example message.</p>
<pre>struct proc_event *ev = (struct proc_event *)cn_msg-&gt;data;

switch (ev-&gt;what) {
case PROC_EVENT_FORK:
        printf ("FORK %d/%d -&gt; %d/%d\n",
                ev-&gt;event_data.fork.parent_pid,
                ev-&gt;event_data.fork.parent_tgid,
                ev-&gt;event_data.fork.child_pid,
                ev-&gt;event_data.fork.child_tgid);
        break;
/* more message types here */
}</pre>
<p>As you can see, each message type has an associated member of the <em>event_data</em> union containing the information fields for it. And as you can see, this gives you information about each individual kernel task, not just the top-level processes you&#8217;re normally used to seeing. In other words, you see threads as well as processes.</p>
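<p>To make the thread-versus-process distinction concrete, here is a small sketch of a predicate over <em>struct proc_event</em>. The function name <em>is_process_fork</em> is hypothetical; it assumes, as the filter later in this post does, that a fork creates a new process when <em>pid</em> and <em>tgid</em> match for both parent and child.</p>

```c
#include <linux/cn_proc.h>

/* Hypothetical helper: returns 1 if ev is a fork that created a new
 * process, 0 for other events or for forks that involve a thread
 * (pid and tgid differing on either side). */
int
is_process_fork (const struct proc_event *ev)
{
        if (ev->what != PROC_EVENT_FORK)
                return 0;

        return (ev->event_data.fork.parent_pid
                == ev->event_data.fork.parent_tgid)
            && (ev->event_data.fork.child_pid
                == ev->event_data.fork.child_tgid);
}
```

<p>Calling this on each event before the <em>switch</em> would let userspace skip thread creation, at the cost of still being woken up for every message.</p>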
<p>Like I keep saying, it&#8217;s a firehose. It would be great if there was some way to filter the socket in the kernel so that our process doesn&#8217;t even get woken up for messages. Wake-ups are bad, especially in the embedded space.</p>
<p>Fortunately there is a way to filter sockets on the kernel-side, the kernel <em>socket filter</em> interface. Unfortunately this isn&#8217;t too well documented either; but let&#8217;s use this opportunity to document an example.</p>
<p>We&#8217;ll filter the socket so that we only receive <em>fork</em> notifications, discarding the other types of <em>proc connector</em> event and, most importantly, discarding the messages that indicate new threads being created (those where the <em>pid</em> and <em>tgid</em> fields differ). One important part of filtering is being careful that only expected messages are filtered, and that unexpected messages are still passed through.</p>
<p>The filter machine consists of a set of machine language instructions added to the socket through a special socket option. Fortunately this machine language is copied from the Berkeley Packet Filter from BSD, so we can find documentation for it in the <a href="http://www.gsp.com/cgi-bin/man.cgi?section=4&amp;topic=bpf#5">bpf(4)</a> manual page there. Just ignore the structure definitions, because they are different on Linux.</p>
<p>So let&#8217;s get started with our example; first we need to add the right header.</p>
<pre>#include &lt;linux/filter.h&gt;</pre>
<p>Now we need to attach the filter to the socket; after creating the socket and before sending the subscription message is usually a good place. On Linux the instructions are given as an array of <em>struct sock_filter</em> members which we can construct using the <em>BPF_STMT</em> and <em>BPF_JUMP</em> macros.</p>
<p>Just to make sure everything is working, we&#8217;ll create a simple &#8220;no-op&#8221; filter.</p>
<pre>struct sock_filter filter[] = {
        BPF_STMT (BPF_RET|BPF_K, 0xffffffff),
};

struct sock_fprog fprog;
fprog.filter = filter;
fprog.len = sizeof filter / sizeof filter[0];

setsockopt (sock, SOL_SOCKET, SO_ATTACH_FILTER, &amp;fprog, sizeof fprog);</pre>
<p>Not very useful, but it means we can now concentrate on writing the filter code itself. This filter consists of a single statement, <em>BPF_RET</em> that tells the kernel to deliver an amount of bytes of the packet to the receiving process and to return from the filter. The <em>BPF_K</em> option means that we give the amount of bytes as the argument to the statement, and in this case we give the largest possible value. In other words, this statement declares to deliver the whole packet and return from the filter.</p>
<p>To filter everything and not wake up the process at all, we deliver no bytes and return from the filter.</p>
<pre>BPF_STMT (BPF_RET|BPF_K, 0),</pre>
<p>You may want to test that too.</p>
<p>Ok, now let&#8217;s actually do some examination of the packets to filter out the noise. Recall that we&#8217;re dealing with nested messages here, messages inside messages, inside messages. Visualizing this is really important to understanding what you&#8217;re dealing with.</p>
<p><a href="http://netsplit.com/wp-content/uploads/2011/02/struct-proc_event.png"><img class="aligncenter size-full wp-image-427" title="struct proc_event" src="http://netsplit.com/wp-content/uploads/2011/02/struct-proc_event.png" alt="" width="640" height="240" /></a></p>
<p>The most basic filter code consists of three operations: load a value from the packet into the machine&#8217;s accumulator, compare that against a value and jump to a different instruction if equal (or not equal), and then possibly return or perform another operation.</p>
<p>All of the following filter code replaces whatever you had in the <em>filter[]</em> array before.</p>
<p>So first we should examine the <em>nlmsghdr</em> at the start of the packet; we want to make sure that there is just one <em>netlink</em> message in this packet. If there are multiple, we just pass the whole packet to userspace for dealing with. We check the <em>nlmsg_type</em> field to make sure it contains the value <em>NLMSG_DONE</em>.</p>
<pre>BPF_STMT (BPF_LD|BPF_H|BPF_ABS,
          offsetof (struct nlmsghdr, nlmsg_type)),
BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_K,
          htons (NLMSG_DONE),
          1, 0),
BPF_STMT (BPF_RET|BPF_K, 0xffffffff),</pre>
<p>The first statement says to load (<em>BPF_LD</em>) a <em>&#8220;halfword&#8221; </em>(16-bit) value (<em>BPF_H</em>) from the absolute offset (<em>BPF_ABS</em>) equivalent to the position of the <em>nlmsg_type</em> member in <em>struct nlmsghdr</em>. Since we expect that structure to be the start of the message, this means the accumulator should now have that value.</p>
<p>The next statement is a jump (<em>BPF_JMP</em>), it says to compare the accumulator for equality (<em>BPF_JEQ</em>) against the constant argument (<em>BPF_K</em>). We only want to continue if this is the sole message, so the value we compare against is <em>NLMSG_DONE</em> &#8211; first remembering to deal with host and network ordering.</p>
<p>If <em>true</em>, the jump will jump one statement; if <em>false</em> the jump will not jump any statements. These are the third and fourth arguments to the <em>BPF_JUMP</em> macro.</p>
<p>Note that the error case is always to return the whole packet to the process, waking it up. And the success case is further processing of the packet. This makes sure that we don&#8217;t filter unexpected packets that userspace may really need to deal with. Don&#8217;t use the socket filter for security filtering; it&#8217;s for reducing wake-ups.</p>
<p>So let&#8217;s filter the next set of values, we want to make sure that this netlink message is from the connector interface. Again we load the right <em>&#8220;word&#8221;</em> (32-bit) values (<em>BPF_W</em>) from the appropriate offsets and check them against constants.</p>
<pre>BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
          NLMSG_LENGTH (0) + offsetof (struct cn_msg, id)
          + offsetof (struct cb_id, idx)),
BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_K,
          htonl (CN_IDX_PROC),
          1, 0),
BPF_STMT (BPF_RET|BPF_K, 0xffffffff),

BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
          NLMSG_LENGTH (0) + offsetof (struct cn_msg, id)
          + offsetof (struct cb_id, val)),
BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_K,
          htonl (CN_VAL_PROC),
          1, 0),
BPF_STMT (BPF_RET|BPF_K, 0xffffffff),</pre>
<p>So after this filter code has executed, we know the packet contains a single netlink message from the proc connector. Now we want to make sure it&#8217;s a fork message; this is a bit different from before, because now we explicitly <em>do</em> filter out the other message types so the return case for non-equality is to return zero bytes.</p>
<pre>BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
          NLMSG_LENGTH (0) + offsetof (struct cn_msg, data)
          + offsetof (struct proc_event, what)),
BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_K,
          htonl (PROC_EVENT_FORK),
          1, 0),
BPF_STMT (BPF_RET|BPF_K, 0),</pre>
<p>And now we can compare the <em>pid</em> and <em>tgid</em> values for the parent process and the child process fields. This is again slightly interesting because the jump instruction can&#8217;t compare the accumulator against another offset in the packet, so we use the second <em>index register</em> instead (<em>BPF_X</em> in the jump instruction). Of course it would be too easy if we could load directly into that, so we have to do it via the <em>scratch memory store</em> instead; this requires loading into the accumulator (<em>BPF_LD</em>), storing into scratch memory (<em>BPF_ST</em>) and loading the index register (<em>BPF_LDX</em>) from scratch memory (<em>BPF_MEM</em>).</p>
<pre>BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
          NLMSG_LENGTH (0) + offsetof (struct cn_msg, data)
          + offsetof (struct proc_event, event_data)
          + offsetof (struct fork_proc_event, parent_pid)),
BPF_STMT (BPF_ST, 0),
BPF_STMT (BPF_LDX|BPF_W|BPF_MEM, 0),</pre>
<p>Then we load the <em>tgid</em> value into the accumulator and we can compare and jump as before; if they are equal we want to continue, if they are not equal we want to filter the packet.</p>
<pre>BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
          NLMSG_LENGTH (0) + offsetof (struct cn_msg, data)
          + offsetof (struct proc_event, event_data)
          + offsetof (struct fork_proc_event, parent_tgid)),
BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_X,
          0,
          1, 0),
BPF_STMT (BPF_RET|BPF_K, 0),</pre>
<p>Then we do the same for the child field.</p>
<pre>BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
          NLMSG_LENGTH (0) + offsetof (struct cn_msg, data)
          + offsetof (struct proc_event, event_data)
          + offsetof (struct fork_proc_event, child_pid)),
BPF_STMT (BPF_ST, 0),
BPF_STMT (BPF_LDX|BPF_W|BPF_MEM, 0),

BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
          NLMSG_LENGTH (0) + offsetof (struct cn_msg, data)
          + offsetof (struct proc_event, event_data)
          + offsetof (struct fork_proc_event, child_tgid)),

BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_X,
          0,
          1, 0),

BPF_STMT (BPF_RET|BPF_K, 0),</pre>
<p>After all that filter hurdling, we have a packet that we want to pass through to the process, so the final instruction is a return of the largest packet size.</p>
<pre>BPF_STMT (BPF_RET|BPF_K, 0xffffffff),</pre>
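<p>Filter programs are awkward to debug once attached, so it can help to keep a plain C mirror of the same decisions for unit testing. The following is only a sketch of that idea: <em>would_deliver</em> is a hypothetical name, it makes the same checks in the same order as the filter described above, and it declines to model packets shorter than a full fork message.</p>

```c
#include <stddef.h>
#include <string.h>
#include <linux/netlink.h>
#include <linux/connector.h>
#include <linux/cn_proc.h>

/* Hypothetical userspace mirror of the filter: returns 1 if the
 * packet would be delivered, 0 if it would be dropped, and -1 if it
 * is too short for this sketch to model. */
int
would_deliver (const void *packet, size_t len)
{
        const char       *p = packet;
        struct nlmsghdr   nlmsghdr;
        struct cn_msg     cn_msg;
        struct proc_event ev;

        if (len < NLMSG_LENGTH (sizeof cn_msg + sizeof ev))
                return -1;

        memcpy (&nlmsghdr, p, sizeof nlmsghdr);
        if (nlmsghdr.nlmsg_type != NLMSG_DONE)
                return 1;       /* unexpected: let userspace deal with it */

        memcpy (&cn_msg, p + NLMSG_LENGTH (0), sizeof cn_msg);
        if ((cn_msg.id.idx != CN_IDX_PROC)
            || (cn_msg.id.val != CN_VAL_PROC))
                return 1;       /* not the proc connector: pass through */

        memcpy (&ev, p + NLMSG_LENGTH (0) + offsetof (struct cn_msg, data),
                sizeof ev);
        if (ev.what != PROC_EVENT_FORK)
                return 0;       /* some other proc event: drop */

        if (ev.event_data.fork.parent_pid != ev.event_data.fork.parent_tgid)
                return 0;       /* parent pid and tgid differ: drop */
        if (ev.event_data.fork.child_pid != ev.event_data.fork.child_tgid)
                return 0;       /* new thread, not a new process: drop */

        return 1;               /* a process fork: deliver */
}
```

<p>Because it works on host-order structures it needs none of the <em>htons()</em> and <em>htonl()</em> wrapping that the filter constants do.</p>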
<p>That&#8217;s it. Of course, what you do with this is up to you. One example could be a daemon that watches for excessive forks and kills fork bombs before they kill the machine. Since you get notification of changes of <em>uid</em> or <em>gid</em>, another example could be a security audit daemon, etc.</p>
<p>Upstart uses this interface for its own nefarious process tracking purposes.</p>
]]></content:encoded>
			<wfw:commentRss>http://netsplit.com/2011/02/09/the-proc-connector-and-socket-filters/feed/</wfw:commentRss>
		<slash:comments>26</slash:comments>
		</item>
		<item>
		<title>The Importance of Being Tested</title>
		<link>http://netsplit.com/2010/12/30/the-importance-of-being-tested/</link>
		<comments>http://netsplit.com/2010/12/30/the-importance-of-being-tested/#comments</comments>
		<pubDate>Thu, 30 Dec 2010 17:46:09 +0000</pubDate>
		<dc:creator>scott</dc:creator>
				<category><![CDATA[Upstart]]></category>

		<guid isPermaLink="false">http://www.netsplit.com/?p=310</guid>
		<description><![CDATA[In addition to the regular posts documenting features of 0.6 and giving hints and tips about it’s usage, release announcements and so-forth; I’ll also be posting insights and anecdotes about Upstart’s ongoing development.  A particular story cropped up again this month, &#8230; <a href="http://netsplit.com/2010/12/30/the-importance-of-being-tested/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>In addition to the regular posts documenting <a href="http://upstart.at/category/0-6/">features of 0.6</a> and giving hints and tips about its usage, <a href="http://upstart.at/category/release/">release announcements</a> and so-forth; I’ll also be posting insights and anecdotes about Upstart’s <a href="http://upstart.at/category/development/">ongoing development</a>.  A particular story cropped up again this month, and I thought I’d share it with you.</p>
<p>When I began work on Upstart, one of the earliest decisions I made was to make sure the code was very-well covered by a comprehensive test suite.  I’d been working with <a href="https://launchpad.net/~lifeless">Robert Collins</a> a lot in the previous couple of years and he is very much an advocate of practices such as <a href="http://en.wikipedia.org/wiki/Extreme_Programming">Extreme Programming</a> (XP) and <a href="http://en.wikipedia.org/wiki/Agile_software_development">Agile Development</a>; especially the discipline of <a href="http://en.wikipedia.org/wiki/Test-driven_development">Test Driven Development</a>.</p>
<p>I’d also recently seen a keynote by <a href="http://en.wikipedia.org/wiki/Andrew_Tridgell">Andrew Tridgell</a> in which he talked about some of the development of Samba 4, in particular the high use of both test cases and code generation in that code-base.  Something he said in the keynote stuck with me: “untested code is broken code”.</p>
<p>Statistics obviously depend on exactly how you count lines of code, but using a simple semi-colon count the combined source code of libnih and Upstart is slightly over 20,000 lines of code.  The combined source code of the test suite for both is slightly over 120,000 lines of code.</p>
<p>The init daemon is an extremely important part of a Linux system, if it crashes then you’re left with a kernel panic; if it simply misbehaves, you’re left with just severe problems.  Not only was I changing it, but I was replacing a very simple dumb system (Sys V init) with something comparatively complex with rules and behaviours that needed rigorous testing.</p>
<p>It would have been very scary to have developed it without the careful testing, and I would have been very worried if anyone had agreed to replace such a core component of the system without this test suite to back up its behaviour.</p>
<p>That being said, maintaining the test suite can be a huge burden.  Don’t believe what anybody tells you, if you’re writing test cases as well as code, then your pace of development slows as well.  They’re right that you spend a lot less time debugging of course, but unlike in the commercial software business free software developers tend to release first and debug later.   If you use a similarly high test to code ratio in your own project, then you’ll find that the time until your first release will be pretty long and the time between releases longer as well.</p>
<p>Another decision is whether to do Test Driven Development or not; that discipline requires that you always write the tests first, to fail, and only write code in order to make the tests pass.  I’m not a fan of TDD, and I’ve no problem admitting that I mostly did not use it for Upstart.  My gut feel is that TDD produces code that hangs, swings and loops just to deal with testing.  It also just doesn’t suit my coding style: I like to write code from the middle outwards, the function API is the last thing I tend to fix, where TDD forces it to be the first.</p>
<p>I’m also not convinced TDD is really suitable for a language like C; it’s pretty hard to get a test case to compile, run and fail without writing any supporting code such as a header file, etc.</p>
<p>I have found TDD useful when I have code that really does break down into a single unit with a well-defined and obvious API, and that while the inputs and outputs have been obvious, the algorithm for getting between them wasn’t at the time.</p>
<p>What I’ve tended to do instead is write code naturally how I would, and write test cases alongside to run the code and make sure it’s working.  As the code grows more complex, more test cases appear for it.  One big advantage to this is then I don’t need to reboot or fire up a VM as much, I can test a large proportion of Upstart’s operation through testing.</p>
<p>Now, onto the stories.  There are two similar ones.</p>
<p>One of the side-effects of testing Upstart so strongly is that the tests are not only driving the code I’ve written but also code in libraries and even in the kernel.  One particular set of tests was covering the code in libnih and Upstart that handles watching the configuration directory for changes; it’s this code that means Upstart automatically reloads jobs when you edit them without needing an explicit signal.</p>
<p>One day these test cases started failing without warning.  Investigation showed that they passed fine under older kernels, but with the newest kernel update to Ubuntu, they failed.</p>
<p>The inotify subsystem in the kernel had undergone a radical overhaul and rewrite.  Rather than being its own code, it was completely rebased onto the new fsnotify system.  Fortunately I was aware of this, and after careful checking that it was indeed the kernel behaviour that was now incorrect (and that it wasn’t incorrect before), I got in touch with Eric Paris, the author of the new code, and was able to give him minimal example code to replicate the problem.</p>
<p><strong><a href="http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=4a148ba988988b9c400ad0f2cbccc155289b954b">inotify: check filename before dropping repeat events</a></strong></p>
<p>This was a while ago, but pretty much the same story happened again recently, just this time not with the kernel.</p>
<p>Again, the story started with Upstart’s test suite failing.  The engineer who first noticed it assumed it was an issue with the new build daemon and disabled the test for the time being.  The test was in the part of the code testing Upstart’s interaction with D-Bus.</p>
<p>Now, sometimes I tend to write tests to deal with corner-cases and “what if” scenarios that I dream up.  This isn’t always about testing my code, often it’s a case of finding out whether something is really possible or whether that thing misbehaves.  These tests still stay in the suite of course.</p>
<p>A particular set of tests was intended to find out what happened if the D-Bus daemon crashed during initial connection.  I considered this fairly important because at times the libdbus library has called <em>exit()</em> or <em>abort()</em> when things happened that it didn’t like.  If you call that from the init daemon, the kernel panics.</p>
<p>These tests had worked fine for a couple of years (actually at the time I had to fix bugs in libdbus to make them pass) but now one of these tests was breaking.  The disconnection was causing SIGPIPE to be delivered to the test.</p>
<p>Again, this turned out to be due to a change to D-Bus.  Lennart Poettering had been working on some changes to avoid libdbus’s awkward SIGPIPE handling and replace it with the use of the MSG_NOSIGNAL flag.  Unfortunately he’d missed a case in the authentication code.  The side-effect was that if the D-Bus daemon had crashed, been killed, OOM’d, etc. during initial connection – the connecting application would have gone too.  Especially bad for an init daemon.</p>
<p>Fortunately Upstart’s test suite caught it, and the fix was simple.</p>
<p><strong><a href="http://cgit.freedesktop.org/dbus/dbus/commit/?id=c5d0998295a15fe649da854b68334c767aad1049">sysdeps-unix: use MSG_NOSIGNAL when sending creds</a></strong></p>
<p><em>(reposted from <a href="http://upstart.at/2010/12/20/the-importance-of-being-tested/">http://upstart.at/2010/12/20/the-importance-of-being-tested/</a> &#8211; post comments there)</em></p>
]]></content:encoded>
			<wfw:commentRss>http://netsplit.com/2010/12/30/the-importance-of-being-tested/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Events are like Methods</title>
		<link>http://netsplit.com/2010/12/20/events-are-like-methods/</link>
		<comments>http://netsplit.com/2010/12/20/events-are-like-methods/#comments</comments>
		<pubDate>Mon, 20 Dec 2010 07:44:59 +0000</pubDate>
		<dc:creator>scott</dc:creator>
				<category><![CDATA[Upstart]]></category>

		<guid isPermaLink="false">http://www.netsplit.com/?p=302</guid>
		<description><![CDATA[In last week’s post I talked about how Events can be treated like Signals, this week we’ll be looking at how Events can be treated like Methods.  That might seem a little surprising, since normally one considers signals and methods &#8230; <a href="http://netsplit.com/2010/12/20/events-are-like-methods/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>In <a href="http://upstart.at/2010/12/08/events-are-like-signals/">last week’s post</a> I talked about how Events can be treated like Signals, this week we’ll be looking at how Events can be treated like Methods.  That might seem a little surprising, since normally one considers signals and methods as very different things, but to Upstart they are both just events.</p>
<p>What do I mean by Methods?  You’ve almost certainly done some kind of programming, even if just a little scripting, so you should know about methods or functions.</p>
<p>In contrast to signals, which are just a notification that something happened on the system, a method is a request for the system to do something on your behalf.  Usually to make some kind of change to the system state.</p>
<p>Likewise in contrast to the signals where you don’t care about the result, for a method you want to wait for the changes to be completed and perhaps even be notified if the method failed.</p>
<p>It’s just as easy to implement a method in Upstart as it is to implement something that considers an event a signal.  Here’s an example of how you might implement a <em>suspend</em> method:</p>
<pre>start on suspend

task
exec pm suspend</pre>
<p>That doesn’t look much different from a signal; the only new stanza in this is <em>task</em> (and that’s not necessary for a method either).  So what happens if we want to trigger a suspend?  We use the command:</p>
<pre>root@worldofwarcraft:~# initctl emit suspend</pre>
<p>The difference here from emitting a signal, as we demonstrated in the previous post, is that we <strong>aren’t</strong> using the <em>--no-wait</em> flag.</p>
<p>So we emit the <em>suspend</em> event, and Upstart will start our job as a result; but <em>initctl emit</em> will not return immediately, it waits for the results of the event to complete before it returns.</p>
<p>Because we used the <em>task</em> stanza in the configuration, we’ve told Upstart that the process we execute is expected to take a limited amount of time and then finish by itself.  This means that Upstart will not believe the job is complete until the process has exited, and will continue to block the event while it is still running.</p>
<p>Finally if the command exited with an error, that error is propagated back to the event that started it, and the <em>initctl emit</em> command will exit with an error code.</p>
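<p>As a minimal sketch of using that exit code from a shell script (the messages are purely illustrative, and this assumes the <em>suspend</em> job above is installed):</p>
<pre># initctl emit blocks until the suspend task has finished, and
# exits non-zero if the job it started failed
if initctl emit suspend; then
    echo "suspend completed"
else
    echo "suspend failed" &gt;&amp;2
fi</pre>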
<p>So now we can use Upstart events and jobs for two different purposes; we can announce changes to the system, and we can use them as methods to make changes to the system.</p>
<p>The most typical event that is used as a method on your system is the <em>runlevel</em> event, used to change the runlevel for System-V compatibility and generally emitted by the <em>telinit</em> and <em>shutdown</em> tools.  The <em>/etc/init/rc.conf</em> script that handles it can be pretty simple and looks not unlike the <em>suspend</em> example above:</p>
<pre>start on runlevel [0123456]

task
exec /etc/init.d/rc $RUNLEVEL</pre>
<p>What happens if you don’t include <em>task</em>?  Well, that means Upstart will consider the job as ready when the process executed is running, and the event will be unblocked and <em>initctl emit</em> will return.  If the service fails to start, then <em>initctl</em> will return with an error.  This is great for methods that start (or stop) services.</p>
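<p>For instance, here is a sketch of a method that starts a service rather than running a task; the <em>need-database</em> event and the daemon path are hypothetical names made up for illustration, and the daemon is assumed to stay in the foreground:</p>
<pre># no "task" stanza: the event is unblocked once the process is running
start on need-database

exec /usr/sbin/hypothetical-dbd</pre>
<p>With this in place, <em>initctl emit need-database</em> should return as soon as the daemon is running, or with an error if it failed to start.</p>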
<blockquote><p>Side-note: the start and stop commands act very much like method events, they block until the service is running or the task has finished and they return errors as well.  However they’re not actually implemented as events right now, an oversight I intend to correct in Upstart 2.</p></blockquote>
<p><em>(reposted from <a href="http://upstart.at/2010/12/16/events-are-like-methods/">http://upstart.at/2010/12/16/events-are-like-methods/</a> &#8211; post comments there)</em></p>
]]></content:encoded>
			<wfw:commentRss>http://netsplit.com/2010/12/20/events-are-like-methods/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Event matching in Upstart</title>
		<link>http://netsplit.com/2010/12/03/event-matching-in-upstart/</link>
		<comments>http://netsplit.com/2010/12/03/event-matching-in-upstart/#comments</comments>
		<pubDate>Fri, 03 Dec 2010 05:38:54 +0000</pubDate>
		<dc:creator>scott</dc:creator>
				<category><![CDATA[Upstart]]></category>

		<guid isPermaLink="false">http://www.netsplit.com/2010/12/03/event-matching-in-upstart/</guid>
		<description><![CDATA[A little while ago I was asked to solve a problem that somebody was having with Upstart, and I realised that people weren’t understanding how things were actually working and were just muddling along when doing event matching in jobs. &#8230; <a href="http://netsplit.com/2010/12/03/event-matching-in-upstart/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>A little while ago I was asked to solve a problem that somebody was having with Upstart, and I realised that people weren’t understanding how things were actually working and were just muddling along when doing event matching in jobs.  This is unfortunate, because it hides some of Upstart’s true power, so I thought it high time I actually explained this.</p>
<p>Let’s start with a simple example.  Fire up any Linux distribution with Upstart 0.6 (the current releases of Ubuntu or Fedora will do) and create a file named <em>/etc/init/example1.conf</em> with the following content:</p>
<pre>start on surprise</pre>
<p>This is pretty simple, it’s a job that does nothing except declare that it’s started when the <em>surprise</em> event happens.  We can demonstrate that works by emitting the event ourselves and checking the status of the job before and afterwards:</p>
<pre>root@angrybirds:/etc/init# status example1
example1 stop/waiting
root@angrybirds:/etc/init# initctl emit surprise
root@angrybirds:/etc/init# status example1
example1 start/running</pre>
<p>Nothing too surprising after all, I hope.  The job did indeed start on the <em>surprise</em> event, and would now be running if we’d actually told Upstart to run something.</p>
<p>Incidentally I’m often asked why there isn’t a single list of events anywhere, that’s because you can match any event you like as long as you know something emits it.  Events are supposed to come from all manner of sources.  I do try and document them though, try running <em>man 7 startup</em> on your system to see an example of an event’s man page.</p>
<p>If events were just names, they’d be pretty boring.  Events can also have attached environment variables, and these get put into the environment of any job’s process started by the event.  Here’s <em>/etc/init/example2.conf</em>:</p>
<pre>start on weather

script
    echo $KIND &gt; /tmp/weather
end script</pre>
<p>This will now run a small shell script that outputs the $KIND environment variable to a file.  This isn’t set anywhere, but we can pass it in the event.</p>
<pre>root@angrybirds:/etc/init# cat /tmp/weather
cat: /tmp/weather: No such file or directory
root@angrybirds:/etc/init# initctl emit weather KIND=RAIN
root@angrybirds:/etc/init# cat /tmp/weather
RAIN</pre>
<p>Ok, these are just examples but there are plenty of useful events on your system right now which carry environment variables such as which network interface just came up, and so on.</p>
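<p>As a sketch, assuming your distribution emits a <em>net-device-up</em> event carrying an <em>IFACE</em> variable (Ubuntu’s network scripts do), a job can record each interface as it comes up; the file name is just an illustration:</p>
<pre>start on net-device-up

script
    # $IFACE is taken from the environment of the event that started us
    echo "$IFACE is up" &gt;&gt; /tmp/interfaces
end script</pre>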
<p>If you wanted to only run on a certain type of weather, you might think to check the value of <em>$KIND</em> within the script; you could do that, but it’s inefficient, ideally you don’t want your script run at all.  Fortunately we can match the environment of an event in the job easily enough, here’s <em>/etc/init/example3.conf</em>:</p>
<pre>start on weather KIND=snow</pre>
<p>Hopefully you’ll figure that this one will only start if it’s snowing, and you’d be right:</p>
<pre>root@angrybirds:/etc/init# status example3
example3 stop/waiting
root@angrybirds:/etc/init# initctl emit weather KIND=hail
root@angrybirds:/etc/init# status example3
example3 stop/waiting
root@angrybirds:/etc/init# initctl emit weather KIND=snow
root@angrybirds:/etc/init# status example3
example3 start/running</pre>
<p>Events can have more than one environment variable, and you can have more than one match:</p>
<pre>start on weather KIND=rain INTENSITY=heavy</pre>
<p>The matches are actually globs, so you can use <em>*</em> and <em>?</em> in there, and as well as <em>=</em> there’s also <em>!=</em>.</p>
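<p>For example, a sketch using both at once, sticking with the made-up <em>weather</em> event (the values are equally made up):</p>
<pre># matches e.g. KIND=snowstorm INTENSITY=heavy,
# but not KIND=snowstorm INTENSITY=light
start on weather KIND=*storm INTENSITY!=light</pre>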
<p>One useful place for the latter is the <em>stop on</em> stanza: as well as being available to the job’s processes, the environment variables can also be used in other stanzas within the job.  Here’s a cute example for <em>/etc/init/example4.conf</em>:</p>
<pre>start on weather KIND=rain or weather KIND=snow
stop on weather KIND!=$KIND</pre>
<p>This one takes a bit of explaining.  First of all to start the job we match the <em>weather</em> event with <em>$KIND</em> set to either <em>rain</em> or <em>snow</em>.  Now we supply a condition to stop the job, and we also match the <em>weather</em> event with a given value of <em>$KIND</em> except this time we match what looks like itself.</p>
<p>In fact this expansion of <em>$KIND</em> is the value that variable had when the job was started, not the value in the new event.  It says to stop the job if it stops raining, or stops snowing depending on which of the two started it.  Most importantly, if an event simply repeats the same kind of weather, but maybe with a different intensity, the job carries on running (but it doesn’t have its environment updated – UNIX can’t do that).</p>
<pre>root@angrybirds:/etc/init# status example4
example4 stop/waiting
root@angrybirds:/etc/init# initctl emit weather KIND=rain INTENSITY=heavy
root@angrybirds:/etc/init# status example4
example4 start/running
root@angrybirds:/etc/init# initctl emit weather KIND=rain INTENSITY=light
root@angrybirds:/etc/init# status example4
example4 start/running
root@angrybirds:/etc/init# initctl emit weather KIND=sun
root@angrybirds:/etc/init# status example4
example4 stop/waiting</pre>
<p>Ok, last fake example before we get onto the fun bits.  Remember the example from above:</p>
<pre>start on weather KIND=rain INTENSITY=heavy</pre>
<p>Upstart lets us shortcut this a little, the environment variables are specified in an order on the <em>initctl</em> command-line and if we know what that order is, we can just assume what variable is in that position.  So as long as we know a <em>weather</em> event always has a <em>KIND</em> followed by an <em>INTENSITY</em>, we could shortcut that to:</p>
<pre>start on weather rain heavy</pre>
<p>If you’ve used Upstart at all, you’ve seen that shortcut before.  A lot.  You may not have even realised it was a shortcut at all, and that’s what I hope to fix here.</p>
<p>Here’s an example of where you’ve used that:</p>
<pre>start on started dbus</pre>
<p>You should hopefully now recognise that <em>started</em> is the name of the event there, and <em>dbus</em> is simply the value of its first argument, whatever that might be.  Remember I mentioned that events have man pages?  Take a look at <em>man 7 started</em>, which is the man page for this event.</p>
<p>It documents which environment variables are attached to the <em>started</em> event, and most importantly what order they come in.</p>
<pre><strong>started</strong> <strong>JOB</strong>=<span>JOB</span> <strong>INSTANCE</strong>=<span>INSTANCE</span> [<span>ENV</span>]...</pre>
<p>So really when we wrote the previous, we were just using a shortcut to specify:</p>
<pre>start on started JOB=dbus</pre>
<p>You might wonder what difference this makes.  A good example of how to exploit this is the <em>stopped</em> event.  If you look at its man page (<em>man 7 stopped</em>) you’ll see it has a large number of environment variables specifying not only which job stopped but the reason for it stopping.  One of those is the exit signal, for example.</p>
<p>Now you know that you’re just matching the <em>$JOB</em> environment variable, it’s obvious that you don’t have to!  You can match any other environment variable or variables in the event, or none at all.</p>
<p>Here’s how to run a script if any other job on the system exits with a segmentation fault:</p>
<pre>start on stopped EXIT_SIGNAL=SEGV</pre>
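<p>Filled out into a complete job, that might look like the following sketch; the log file is a hypothetical path, and this assumes the <em>JOB</em> and <em>EXIT_SIGNAL</em> variables documented in <em>man 7 stopped</em> are passed into the script’s environment like any other event variable:</p>
<pre>start on stopped EXIT_SIGNAL=SEGV

task
script
    # $JOB names the job that stopped, not this one
    echo "$JOB was killed by signal $EXIT_SIGNAL" &gt;&gt; /var/log/hypothetical-crashes.log
end script</pre>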
<p>I said you didn’t have to match any variables, just like in the first examples we didn’t, there’s a neat use for that with the job events.  The <em>starting</em> event blocks the named job from actually starting until anything run by it is started; or, in the case of jobs marked <em>task</em>, finished.</p>
<p>Here’s a little job that runs every time another job is started, and blocks that job from actually starting until the script finishes.</p>
<pre>start on starting
task

script
    ....
end script</pre>
<p>Useful both for debugging and performance analysis.</p>
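<p>As a rough sketch of the performance-analysis case, the script could simply timestamp each job as it starts; the log path is hypothetical and assumes the filesystem holding it is already writable at that point:</p>
<pre>start on starting
task

script
    # $JOB comes from the starting event; date +%s is seconds since the epoch
    echo "$(date +%s) $JOB" &gt;&gt; /var/run/hypothetical-job-times.log
end script</pre>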
<p>Now for the really neat bit.  So far we’ve concentrated on the environment variables that come from events, and those that Upstart puts into the job events.  But we can influence these in rather useful ways.</p>
<p>Firstly we can declare a default value for an environment variable in a job, if no alternate value is given in the start event or command, then this default value wins:</p>
<pre>start on mounted

env MOUNTPOINT=/tmp
script
    ....
end script</pre>
<p>This script will run for each occurrence of the <em>mounted</em> event, and will hopefully get the value for <em>$MOUNTPOINT</em> from that event.  But should the value be missing from the event, or the script be started manually by a system administrator, a default value is provided.</p>
<p>This isn’t a false example, that’s from the job on your system that cleans up the <em>/tmp</em> directory on boot.  The default value wasn’t there in earlier versions of Ubuntu, and this had a rather disastrous side-effect when run by hand.</p>
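<p>To illustrate the by-hand case with a sketch (the job name <em>example5</em> and its contents are made up), a job with such a default can be started with or without an explicit value:</p>
<pre># /etc/init/example5.conf
start on mounted

env MOUNTPOINT=/tmp
task
script
    echo "would clean $MOUNTPOINT" &gt;&gt; /tmp/example5.log
end script</pre>
<p>Running <em>start example5</em> should log <em>/tmp</em>, while <em>start example5 MOUNTPOINT=/var/tmp</em> overrides the default, just as a <em>mounted</em> event carrying its own <em>MOUNTPOINT</em> would.</p>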
<p>Ok, we can set the values of environment variables from a job, and we don’t have to match the job name in the usual job events.  We can combine these two facts in a very interesting way when we can <em>export</em> the value of a job’s environment variable into its job events.</p>
<p>Here’s the first job:</p>
<pre>env AM_A_DISPLAY_MANAGER=1
export AM_A_DISPLAY_MANAGER</pre>
<p>This sets the default value of <em>$AM_A_DISPLAY_MANAGER</em>, but this isn’t a variable we ever expect to be supplied by an event so it just gets passed into the environment of its processes.  It’s not that useful either on its own.</p>
<p>The <em>export</em> line is the useful one, it adds the value of the named environment variable to the job’s events.  That is the <em>starting</em>, <em>started</em>, <em>stopping</em> and <em>stopped</em> events.</p>
<p>Now, in another job, we can do:</p>
<pre>start on started AM_A_DISPLAY_MANAGER=1</pre>
<p>This is run when any job is started that has that environment variable in its events.  In other words, we can tag classes of services so we don’t have to list every single one.</p>
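<p>Put together as a sketch, with hypothetical file names and daemon paths, the pair of jobs might look like this:</p>
<pre># /etc/init/hypothetical-dm.conf: a display manager that tags itself
env AM_A_DISPLAY_MANAGER=1
export AM_A_DISPLAY_MANAGER
exec /usr/sbin/hypothetical-dm

# /etc/init/hypothetical-helper.conf: a separate file, follows any tagged job
start on started AM_A_DISPLAY_MANAGER=1
stop on stopping AM_A_DISPLAY_MANAGER=1
exec /usr/bin/hypothetical-helper</pre>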
<p>And because everything in Upstart is the same fundamental type of thing, this can work in the opposite direction.  For example we can put in our job:</p>
<pre>env NEED_PORTMAP=1
export NEED_PORTMAP</pre>
<p>This means our events will have <em>NEED_PORTMAP=1</em> in them, now remembering that the job waits for the side-effects of the <em>starting</em> event to complete, we can now write in <em>/etc/init/portmap.conf</em>:</p>
<pre>start on starting NEED_PORTMAP=1</pre>
<p>So we can implement a dependency-based init system with Upstart, an event-based init system.</p>
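<p>As a sketch of both halves side by side (the client job and its path are hypothetical, and this assumes <em>portmap</em> forks once into the background, hence <em>expect fork</em>):</p>
<pre># /etc/init/hypothetical-rpc-client.conf
env NEED_PORTMAP=1
export NEED_PORTMAP
exec /usr/sbin/hypothetical-rpc-client

# /etc/init/portmap.conf: a separate file
start on starting NEED_PORTMAP=1
expect fork
exec portmap</pre>
<p>Starting the client emits its <em>starting</em> event, which starts portmap; the client’s own process should only be run once portmap is up.</p>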
<p>I look forwards to finding out what else you can do with it.</p>
<p><em>(reposted from <a href="http://upstart.at/2010/12/03/event-matching-in-upstart/">http://upstart.at/2010/12/03/event-matching-in-upstart/</a> &#8211; post comments there)</em></p>
]]></content:encoded>
			<wfw:commentRss>http://netsplit.com/2010/12/03/event-matching-in-upstart/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Dependency-based &amp; Event-based init daemons and launchd</title>
		<link>http://netsplit.com/2010/05/27/dependency-based-event-based-init-daemons-and-launchd/</link>
		<comments>http://netsplit.com/2010/05/27/dependency-based-event-based-init-daemons-and-launchd/#comments</comments>
		<pubDate>Thu, 27 May 2010 13:30:08 +0000</pubDate>
		<dc:creator>scott</dc:creator>
				<category><![CDATA[Upstart]]></category>

		<guid isPermaLink="false">http://www.netsplit.com/?p=255</guid>
		<description><![CDATA[With the recent announcement of systemd, I&#8217;ve noticed some increased confusion around Upstart and what it means to be an event-based init daemon.  Now seems as good a time as any to try and clear that up by describing what &#8230; <a href="http://netsplit.com/2010/05/27/dependency-based-event-based-init-daemons-and-launchd/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>With the recent announcement of systemd, I&#8217;ve noticed some increased confusion around Upstart and what it means to be an <em>event-based init daemon</em>.  Now seems as good a time as any to try and clear that up by describing what I mean by that.</p>
<h3>Dependency-based init</h3>
<p>Before <a href="http://upstart.ubuntu.com/">Upstart</a> came along, the state of the art in init daemon replacements was the <em>dependency-based init daemon</em>.  The two most well-known at the time were the <a href="http://www.sun.com/bigadmin/content/selfheal/smf-quickstart.jsp">Service Management Facility</a> (SMF) of Solaris, and <a href="http://initng.sourceforge.net/trac">initng</a> on Linux.</p>
<p>The easiest way to understand how a dependency-based init daemon works is to look at another dependency-based system you&#8217;re probably more familiar with: the package manager of your Linux distribution.</p>
<p>When you want to install a package, for example the Apache Web Server, you tell the package manager to do that.  The Apache package will list additional dependencies that it requires to be installed, and those in turn will list additional dependencies, and so on.  The package manager will walk this dependency tree, eliminating those that you already have installed, and it will then flatten the remaining tree to get an order in which those remaining can be safely installed.</p>
<p>To put it simply: you say that you want Apache installed, but you may get more than that installed to ensure that Apache works.</p>
<p>A dependency-based init daemon works in fundamentally the same way.  When you say that you want Apache started, it looks at the configuration for that service for the list of dependency services, and builds up a similar tree.  Eliminating those already running, and flattening the tree, gives you a list of services that must be started in an order that they should be safe to start in.</p>
<p>You say you want Apache running, but you may get more than Apache running as a result.</p>
<p>Booting a system with a dependency-based init daemon, however, is a little strange.  They need to know the target set of services that must be running, otherwise they would start nothing.  SMF simply started all services that were not in manual start mode, initng had the concept of goal services whose dependencies were those that should be running &#8212; and used these to define the runlevels.</p>
<p>Once you have that list of goal services, you work out the dependency trees, and flatten them as normal &#8211; and thus you get an order that all services on the system should be started in.</p>
<p>Dependency-based init daemons work, but I believed there was a better way to do things.  I invented the <em>event-based init daemon</em> instead.</p>
<h3>Event-based init</h3>
<p>An event-based init daemon isn&#8217;t really a great leap from a dependency-based init daemon, it simply does everything backwards.  A simplistic view says that instead of starting Apache&#8217;s dependencies because Apache is started, it starts Apache because its dependencies are now running.</p>
<p>But it&#8217;s much more interesting than that, and much more flexible.  Most people don&#8217;t get the epiphany.</p>
<p>A better description might be that services are started and stopped due to external influences on them.  Those external influences can be anything, for example: hardware coming and going; changes in the time; and not least, other services.</p>
<p>The events represent changes in the system state, and services define the states in which they can be running, and the system reacts accordingly.</p>
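<p>In current Upstart syntax, a minimal sketch of a service declaring the states in which it can run might be the following; the daemon path is hypothetical, and this assumes the system emits <code>net-device-up</code> with an <code>IFACE</code> variable as Ubuntu does:</p>
<pre><code>start on started dbus and net-device-up IFACE=eth0
stop on stopping dbus
exec /usr/sbin/hypothetical-daemon</code></pre>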
<p>I&#8217;m still convinced this is the best way to work, not in the least because you can implement a dependency-based system with an event-based init daemon.  Starting a service causes an event for each of its dependencies declaring a need for them, and the service waits for those events to complete; those events cause the dependencies to be started.</p>
<h3>launchd</h3>
<p>The other well-known init daemon out there is Apple&#8217;s <a href="http://developer.apple.com/macosx/launchd.html">launchd</a>, of which Lennart&#8217;s recent <a href="http://www.freedesktop.org/wiki/Software/systemd">systemd</a> project is a similar implementation in some ways but not in others.</p>
<p>launchd&#8217;s modus operandi is that it starts services on demand, and it does this on the assumption that all services communicate through sockets or through the Mach IPC model.  For the socket-based services, launchd itself creates the listening sockets, and when it receives a connection it starts the service and hands off the listening socket to it.</p>
<p>This has a beautiful engineering elegance, and it&#8217;s easy to see why it appeals to us.</p>
<p>You don&#8217;t need to configure a service&#8217;s dependencies or requirements in the init daemon, instead the service causes its dependencies to be started through this on-demand activation.  If the dependency isn&#8217;t ready to be started, the service simply blocks in the <code>connect</code> or <code>open</code> syscall until it is ready.</p>
<p>As launchd has matured, Apple have added support to watch for files on the disk and for cron-like schedule events.  In many ways, this makes launchd kinda like an event-based init daemon, except with listening sockets.</p>
<p>systemd takes a similar approach with regard to the listening sockets, though my understanding so far is that it combines it with a dependency-based resolution procedure for other parts of the system, rather than an event-based one.  I&#8217;m willing to be corrected on this though.</p>
<h3>Upstart</h3>
<p>Upstart is an event-based init daemon; it&#8217;s taken a little while to develop because it&#8217;s the first pure example of its kind, and I only replaced the working sysvinit cautiously.  I basically had to prove to myself, and others, that an event-based init daemon can really work.  That&#8217;s why Ubuntu 9.10 and 10.04 were the first versions to really start taking advantage of it.</p>
<p>I also wanted to keep it relatively stable to encourage adoption by other distributions, and I believe this has also paid off given that Fedora, RedHat and OpenSuSE have all adopted it now.</p>
<p>I&#8217;ve proven it works, and it&#8217;s been adopted, now the fun development can begin!</p>
<p>Two of the main complaints about Upstart are that the <code>start on</code> and <code>stop on</code> mechanism to define services is complicated and exposes far too much of the event model, and that it&#8217;s not very well documented.  Ironically, these two complaints are entirely related.</p>
<p>The <code>start on</code>/<code>stop on</code> mechanism is basically just a debug interface; it allowed me during early development to access the raw event queue and find out what types of service model we really needed.  Since it&#8217;s a debug interface, it wasn&#8217;t documented; I knew that future versions of Upstart would have a much better model.</p>
<p>So to correct a common misconception, the hideous <code>start on</code> lines are not a side-effect of event-based init daemons; they&#8217;re a side-effect of developing an event-based init daemon in a release-early, open-source way.</p>
<p>I&#8217;ve also mentioned that events can be just about anything, not just directly from other services.  This includes on-demand activation; I don&#8217;t see any reason why Upstart should not be able to create sockets as launchd does, a connection on those sockets would simply be an event that would cause a service to be started.</p>
<p>Likewise, I fully intend Upstart to take over activation of system and session bus services from D-Bus, using an event from the D-Bus daemon to start and manage the service on its behalf.</p>
<p>This latter example neatly illustrates how <code>start on</code> will be replaced.  Take a system bus service; you might declare such a service like this:</p>
<pre><code>dbus system-bus org.freedesktop.UDisks
exec /usr/lib/udisks-daemon</code></pre>
<p>That initial line replaces a whole slew of previous verbs.  It tells Upstart that this service should be activated from the D-Bus system bus when a message for the given name has no destination in the bus.  It also tells Upstart that this service should not be considered &#8220;ready&#8221; until it actually registers that name on the bus.</p>
<p>Finally it tells Upstart that the service can only be run while the D-Bus system bus service is running.  You might think this superfluous, but remember from above that an event-based init daemon can work both ways; starting this service manually as a system administrator would start the message bus for you, if it wasn&#8217;t already running.  This can be done with either an event or through the service connecting to the message bus via a known socket.</p>
<p>It&#8217;s this flexibility that still leaves me convinced that Upstart is a better all-round approach than the purity of launchd (or systemd).</p>
<p>Take another service, for example, the printing service: CUPS.  At first glance, you might believe that it can be on-demand activated when something connects to its socket.</p>
<p>And that would certainly appear to work, you&#8217;d click Print in an application and the printer service would be started.</p>
<p>But that&#8217;s not the full picture; what if there was a job in the queue from before you shut down?  You also need the service started if there are any files in the named queue directory.</p>
<p>And that&#8217;s still not the full picture; CUPS performs remote printer discovery, and you most certainly don&#8217;t want to click Print and see no printers because CUPS hasn&#8217;t had time to discover them, having only just been started.  Users have short attention spans when waiting; I know I certainly do.</p>
<p>You need a combination of different conditions to start CUPS; it should be started on demand, it should be started if there are files in the print queue, and it should be still started on boot (just low-priority once the system is idle) to discover remote printers.</p>
<p>A pure on-demand daemon just doesn&#8217;t cut it, you need something more flexible.</p>
<p>The last point about user impatience is also my other major disagreement here.  launchd supposes that you should always optimise for the minimum system footprint, at a cost to interaction performance.</p>
<p>It assumes that it&#8217;s ok to wait for a service to start when you click a button the first time, or bogusly that all services start immediately!</p>
<p>While this might be true in many situations, it&#8217;s also not true in many others.  I&#8217;ve met very few system administrators who think that their web server should only ever be started on demand, and shut down again once there are no users browsing it.</p>
<p>And if you&#8217;re going to do always-running services like this, you do need to be able to encode their dependencies and requirements in the init-daemon configuration, which negates the engineering precision of avoiding doing so through on-demand activation.</p>
]]></content:encoded>
			<wfw:commentRss>http://netsplit.com/2010/05/27/dependency-based-event-based-init-daemons-and-launchd/feed/</wfw:commentRss>
		<slash:comments>26</slash:comments>
		</item>
		<item>
		<title>btrfs by default in Maverick?</title>
		<link>http://netsplit.com/2010/05/14/btrfs-by-default-in-maverick/</link>
		<comments>http://netsplit.com/2010/05/14/btrfs-by-default-in-maverick/#comments</comments>
		<pubDate>Fri, 14 May 2010 16:21:18 +0000</pubDate>
		<dc:creator>scott</dc:creator>
				<category><![CDATA[Canonical]]></category>
		<category><![CDATA[Ubuntu]]></category>

		<guid isPermaLink="false">http://www.netsplit.com/?p=250</guid>
		<description><![CDATA[UDS is over! And in the customary wrap-up I stood up and told the audience what the Foundations team have been discussing all week. One of the items is almost certainly going to get a little bit of publicity. We &#8230; <a href="http://netsplit.com/2010/05/14/btrfs-by-default-in-maverick/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>UDS is over!  And in the customary wrap-up I stood up and told the audience what the Foundations team have been discussing all week.  One of the items is almost certainly going to get a little bit of publicity.</p>
<p>We are going to be doing the work to have btrfs as an installation option, and we have not ruled out making it the default.</p>
<p>I do want to stress the emphasis of that statement; a number of things would have to be true for us to take that decision:</p>
<ol>
<li>btrfs would need to not be marked &#8220;experimental&#8221; in the kernel config; we understand that this is planned for 2.6.35, which is the kernel version we are expecting to ship in Maverick.</li>
<li>btrfs is not currently supported by GRUB2 (our boot loader) or the installer; these pieces would need to be finished <em>before</em> Feature Freeze.</li>
<li>If that happens, we <em>may</em> make it the default for Alpha releases to gain testing; that testing must go smoothly.</li>
<li>The btrfs upstream must be happy with the idea.</li>
<li>We must be happy with the idea.</li>
</ol>
<p>It&#8217;s a tough gauntlet, and the decision would only be made with the knowledge that production servers and desktops can be run on Lucid as a fully supported version of Ubuntu at the same time.  I&#8217;d give it a 1-in-5 chance.</p>
]]></content:encoded>
			<wfw:commentRss>http://netsplit.com/2010/05/14/btrfs-by-default-in-maverick/feed/</wfw:commentRss>
		<slash:comments>56</slash:comments>
		</item>
		<item>
		<title>On systemd</title>
		<link>http://netsplit.com/2010/04/30/on-systemd/</link>
		<comments>http://netsplit.com/2010/04/30/on-systemd/#comments</comments>
		<pubDate>Fri, 30 Apr 2010 11:47:31 +0000</pubDate>
		<dc:creator>scott</dc:creator>
				<category><![CDATA[Canonical]]></category>
		<category><![CDATA[Ubuntu]]></category>
		<category><![CDATA[Upstart]]></category>

		<guid isPermaLink="false">http://www.netsplit.com/?p=246</guid>
		<description><![CDATA[I&#8217;m sure you&#8217;ve all by now read the announcement of systemd, and have probably come running to my blog to see what the reaction of Ubuntu and the Upstart author is! As you know, improving the boot process has &#8230; <a href="http://netsplit.com/2010/04/30/on-systemd/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m sure you&#8217;ve all by now read the announcement of <a href="http://0pointer.de/blog/projects/systemd.html">systemd</a>, and have probably come running to my blog to see what the reaction of Ubuntu and the <a href="http://upstart.ubuntu.com/">Upstart</a> author is!</p>
<p>As you know, improving the boot process has been something that Ubuntu have been working on for a few years now, and this led to the development of Upstart.  We&#8217;re not the only ones working in this area; Intel have also been hard at work on different improvements of their own with the Moblin and MeeGo projects.</p>
<p>So it&#8217;s great to see some Fedora and OpenSuSE guys working on this too, and bringing some different ideas to the table!</p>
<p>I can&#8217;t say I disagree with some of Lennart&#8217;s observations about problems with Upstart; it&#8217;s certainly nowhere near perfect.  Now that the stable period leading up to the release of Ubuntu 10.04 LTS is over, I&#8217;m looking forward to getting back into the code and trying to address them.</p>
<p>It&#8217;s far too early to tell which approach is going to work out better in the end; but that&#8217;s one of the great things about Linux.  The different distributions are able to develop in different directions, and we&#8217;re able to try out many different things.</p>
<p>On a personal note, I&#8217;m particularly pleased that Lennart has continued the punny naming scheme I began with <a href="http://www.thefreedictionary.com/upstart">Upstart</a>. <a href="http://www.urbandictionary.com/define.php?term=System%20D">System D</a> is a French concept that embraces responding to challenges when they happen, thinking fast and on your feet, and adapting and improvising to get the job done.</p>
]]></content:encoded>
			<wfw:commentRss>http://netsplit.com/2010/04/30/on-systemd/feed/</wfw:commentRss>
		<slash:comments>17</slash:comments>
		</item>
	</channel>
</rss>

