The proc connector is one of those interesting kernel features that most people rarely come across, and even more rarely find documentation on. Likewise the socket filter. This is a shame, because they’re both really quite useful interfaces that might serve a variety of purposes if they were better documented.
The proc connector allows you to receive notification of process events such as fork and exec calls, as well as changes to a process’s uid, gid or sid (session id). These are provided through a socket-based interface by reading instances of struct proc_event defined in the kernel header.
#include <linux/cn_proc.h>
The interface is built on the more generic connector API, which itself is built on the generic netlink API. These interfaces add some complexity as they are intended to provide bi-directional communication between the kernel and userspace; the connector API appears to have been largely forgotten as newer such socket interfaces simply declare their own first-class socket classes. So we need the headers for those too.
#include <linux/netlink.h>
#include <linux/connector.h>
(For brevity, I’ll omit any standard boilerplate such as the headers you need for syscalls and library functions that you should be used to as well as function definitions, error checking, and so-forth.)
Ok, now we’re ready to create the connector socket. This is straightforward enough; since we’re dealing with atomic messages rather than a stream, a datagram socket is appropriate.
int sock;
sock = socket (PF_NETLINK, SOCK_DGRAM | SOCK_NONBLOCK | SOCK_CLOEXEC,
NETLINK_CONNECTOR);
To select the proc connector we bind the socket using a struct sockaddr_nl object.
struct sockaddr_nl addr;
addr.nl_family = AF_NETLINK;
addr.nl_pid = getpid ();
addr.nl_groups = CN_IDX_PROC;
bind (sock, (struct sockaddr *)&addr, sizeof addr);
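Since the snippets here leave out error checking, the following is a minimal sketch of how the socket creation and bind might be wrapped up with it. The function name open_proc_connector is made up for illustration, and zeroing the address structure first is my own precaution rather than something the interface documents as required.

```c
#include <errno.h>
#include <linux/connector.h>
#include <linux/netlink.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns a connector socket bound to the proc
 * connector group, or -1 with errno set. */
int open_proc_connector(void)
{
    struct sockaddr_nl addr;
    int sock;

    sock = socket(PF_NETLINK, SOCK_DGRAM | SOCK_NONBLOCK | SOCK_CLOEXEC,
                  NETLINK_CONNECTOR);
    if (sock < 0)
        return -1;

    memset(&addr, 0, sizeof addr);      /* also clears the padding field */
    addr.nl_family = AF_NETLINK;
    addr.nl_pid = getpid();
    addr.nl_groups = CN_IDX_PROC;

    if (bind(sock, (struct sockaddr *)&addr, sizeof addr) < 0) {
        int saved = errno;              /* close() may overwrite errno */

        close(sock);
        errno = saved;
        return -1;
    }
    return sock;
}
```

Depending on the kernel and your privileges the bind may be refused, so a caller should be prepared for -1 and check errno.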
Unfortunately that’s not quite enough yet; the proc connector socket is a bit of a firehose, so it doesn’t in fact send any messages until a process has subscribed to it. So we have to send a subscription message.
As I mentioned before, the proc connector is built on top of the generic connector and that itself is on top of netlink, so sending that subscription message also involves embedding a message inside a message inside a message. If you understood Christopher Nolan’s Inception, you should do just fine.
Since we’re nesting a proc connector operation message inside a connector message inside a netlink message, it’s easiest to use an iovec for this kind of thing.
struct iovec iov[3];
char nlmsghdrbuf[NLMSG_LENGTH (0)];
struct nlmsghdr *nlmsghdr = (struct nlmsghdr *)nlmsghdrbuf;
struct cn_msg cn_msg;
enum proc_cn_mcast_op op;
/* outermost layer: the netlink header */
nlmsghdr->nlmsg_len = NLMSG_LENGTH (sizeof cn_msg + sizeof op);
nlmsghdr->nlmsg_type = NLMSG_DONE;
nlmsghdr->nlmsg_flags = 0;
nlmsghdr->nlmsg_seq = 0;
nlmsghdr->nlmsg_pid = getpid ();
iov[0].iov_base = nlmsghdrbuf;
iov[0].iov_len = NLMSG_LENGTH (0);
/* middle layer: the connector message */
cn_msg.id.idx = CN_IDX_PROC;
cn_msg.id.val = CN_VAL_PROC;
cn_msg.seq = 0;
cn_msg.ack = 0;
cn_msg.len = sizeof op;
iov[1].iov_base = &cn_msg;
iov[1].iov_len = sizeof cn_msg;
/* innermost layer: the proc connector operation */
op = PROC_CN_MCAST_LISTEN;
iov[2].iov_base = &op;
iov[2].iov_len = sizeof op;
writev (sock, iov, 3);
The netlink message length is the combined length of the following connector and proc connector operation messages, and is otherwise simply a message from our process id with no following messages. However all of the interfaces to netlink take a lot of care to make sure the following structure in the message is aligned as wide as possible using the NLMSG_LENGTH macro, to avoid issues with platforms that have fixed alignment for data types, so we have to be careful of that too.
So there may be padding between the struct nlmsghdr and the struct cn_msg. We allow for it by using a character buffer of the right size for the first iovec element and accessing it through a struct nlmsghdr pointer.
The connector message indicates that it is relevant to the proc connector through the idx and val fields, and the length is the length of the proc connector operation message.
Finally the proc connector operation message (just an enum) says we want to subscribe. Why isn’t there padding between the connector and proc connector operation messages? Because the last element in struct cn_msg is a zero-width type which results in the right padding; this interface is rather newer than netlink.
iovec stitches it all together so it’s sent as a single message. Laid out in order, that message is the struct nlmsghdr (plus any alignment padding), then the struct cn_msg, then the enum proc_cn_mcast_op.
There’s a matching PROC_CN_MCAST_IGNORE message if you want to turn off the firehose without closing the socket.
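If you would rather not juggle three iovec elements, the same message can be assembled in one flat buffer. This is only a sketch of that alternative: build_mcast_op is a name I have made up, and copying with memcpy is my choice to sidestep alignment questions. It takes the operation as a parameter, so it covers both PROC_CN_MCAST_LISTEN and PROC_CN_MCAST_IGNORE.

```c
#include <linux/cn_proc.h>
#include <linux/connector.h>
#include <linux/netlink.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: writes a complete netlink + connector + operation
 * message into buf.  Returns the message length, or 0 if buf is too small. */
size_t build_mcast_op(void *buf, size_t size, pid_t pid,
                      enum proc_cn_mcast_op op)
{
    size_t total = NLMSG_LENGTH(sizeof(struct cn_msg) + sizeof op);
    struct nlmsghdr nlh;
    struct cn_msg cn;

    if (size < total)
        return 0;
    memset(buf, 0, total);              /* zero any alignment padding */

    memset(&nlh, 0, sizeof nlh);
    nlh.nlmsg_len = total;
    nlh.nlmsg_type = NLMSG_DONE;
    nlh.nlmsg_pid = pid;

    memset(&cn, 0, sizeof cn);
    cn.id.idx = CN_IDX_PROC;
    cn.id.val = CN_VAL_PROC;
    cn.len = sizeof op;

    /* The connector message starts where NLMSG_LENGTH(0) says it does. */
    memcpy(buf, &nlh, sizeof nlh);
    memcpy((char *)buf + NLMSG_LENGTH(0), &cn, sizeof cn);
    memcpy((char *)buf + NLMSG_LENGTH(0) + sizeof cn, &op, sizeof op);
    return total;
}
```

The result can then be handed to a single write or send call on the connector socket, for example with the length returned from build_mcast_op (buf, sizeof buf, getpid (), PROC_CN_MCAST_LISTEN).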
Ok, the firehose is on; now we need to read the stream of messages. Just like the message we sent, the messages we receive are actually netlink messages, and inside those netlink messages are connector messages, and inside those are proc connector messages.
Netlink allows for all sorts of things like multi-part messages, but in reality we can ignore most of that since connector doesn’t use them. Still, it’s worth future-protecting ourselves and being liberal in what we accept.
struct msghdr msghdr;
struct sockaddr_nl addr;
struct iovec iov[1];
char buf[PAGE_SIZE];
ssize_t len;
msghdr.msg_name = &addr;
msghdr.msg_namelen = sizeof addr;
msghdr.msg_iov = iov;
msghdr.msg_iovlen = 1;
msghdr.msg_control = NULL;
msghdr.msg_controllen = 0;
msghdr.msg_flags = 0;
iov[0].iov_base = buf;
iov[0].iov_len = sizeof buf;
len = recvmsg (sock, &msghdr, 0);
Why do we use recvmsg rather than just read? Because netlink allows arbitrary processes to send messages to each other, so we need to make sure the message actually comes from the kernel; otherwise you have a potential security vulnerability. recvmsg lets us receive the sender address as well as the data.
if (addr.nl_pid != 0)
continue;
(I’m assuming you’re reading in a loop there.)
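For reference, here is roughly how the receive and the sender check might be folded into one function. Treat it as a sketch: recv_kernel_datagram is a hypothetical name, and the conventions of returning 0 for a datagram to be ignored and retrying on EINTR are my own choices.

```c
#include <errno.h>
#include <linux/netlink.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: reads one datagram into buf.  Returns its length,
 * 0 if it did not come from the kernel (the caller should skip it), or
 * -1 with errno set (EAGAIN when a non-blocking socket has nothing queued). */
ssize_t recv_kernel_datagram(int sock, void *buf, size_t size)
{
    struct sockaddr_nl addr;
    struct iovec iov = { .iov_base = buf, .iov_len = size };
    struct msghdr msghdr;
    ssize_t len;

    memset(&addr, 0, sizeof addr);
    memset(&msghdr, 0, sizeof msghdr);
    msghdr.msg_name = &addr;
    msghdr.msg_namelen = sizeof addr;
    msghdr.msg_iov = &iov;
    msghdr.msg_iovlen = 1;

    do {
        len = recvmsg(sock, &msghdr, 0);
    } while (len < 0 && errno == EINTR);
    if (len < 0)
        return -1;

    /* Only the kernel sends with a netlink port id of zero. */
    if (msghdr.msg_namelen != sizeof addr
        || addr.nl_family != AF_NETLINK
        || addr.nl_pid != 0)
        return 0;

    return len;
}
```

A read loop would call this until it returns -1 with errno set to EAGAIN, skipping any datagram for which it returns 0.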
So now we have a netlink message package from the kernel; this may contain multiple individual netlink messages (it doesn’t, but it may). So we iterate over those.
for (struct nlmsghdr *nlmsghdr = (struct nlmsghdr *)buf;
NLMSG_OK (nlmsghdr, len);
nlmsghdr = NLMSG_NEXT (nlmsghdr, len))
And we should ignore error or no-op messages from netlink.
if ((nlmsghdr->nlmsg_type == NLMSG_ERROR)
|| (nlmsghdr->nlmsg_type == NLMSG_NOOP))
continue;
Inside each individual netlink message is a connector message, we extract that and make sure it comes from the proc connector system.
struct cn_msg *cn_msg = NLMSG_DATA (nlmsghdr);
if ((cn_msg->id.idx != CN_IDX_PROC)
|| (cn_msg->id.val != CN_VAL_PROC))
continue;
Now we can safely extract the proc connector message; this is a struct proc_event that we haven’t seen before. It’s quite a large structure definition so I won’t paste it here, since it contains a union for each of the different possible message types. Instead here’s code to actually print the relevant contents for an example message.
struct proc_event *ev = (struct proc_event *)cn_msg->data;
switch (ev->what) {
case PROC_EVENT_FORK:
printf ("FORK %d/%d -> %d/%d\n",
ev->event_data.fork.parent_pid,
ev->event_data.fork.parent_tgid,
ev->event_data.fork.child_pid,
ev->event_data.fork.child_tgid);
break;
/* more message types here */
}
As you can see, each message type has an associated member of the event_data union containing the information fields for it. And as you can see, this gives you information about each individual kernel task, not just the top-level processes you’re normally used to seeing. In other words, you see threads as well as processes.
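To tie the parsing steps together, here is a sketch of one function that walks a received datagram and counts the fork events that created a new process rather than a new thread. The name count_process_forks is invented for this example. Copying the event out with memcpy instead of casting the pointer is my own precaution, since the event does not necessarily start on an 8-byte boundary within the buffer.

```c
#include <linux/cn_proc.h>
#include <linux/connector.h>
#include <linux/netlink.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: buf holds len bytes as returned by recvmsg and
 * must be suitably aligned for a struct nlmsghdr. */
int count_process_forks(void *buf, int len)
{
    /* Bytes of struct proc_event needed to read a fork event. */
    const size_t need = offsetof(struct proc_event, event_data)
                        + sizeof(struct fork_proc_event);
    int forks = 0;

    for (struct nlmsghdr *nlh = buf; NLMSG_OK(nlh, len);
         nlh = NLMSG_NEXT(nlh, len)) {
        struct cn_msg *cn = NLMSG_DATA(nlh);
        struct proc_event ev;

        if (nlh->nlmsg_type == NLMSG_ERROR || nlh->nlmsg_type == NLMSG_NOOP)
            continue;
        if (nlh->nlmsg_len < NLMSG_LENGTH(sizeof *cn + need))
            continue;                   /* too short to hold a fork event */
        if (cn->id.idx != CN_IDX_PROC || cn->id.val != CN_VAL_PROC)
            continue;

        memset(&ev, 0, sizeof ev);
        memcpy(&ev, cn->data, need);
        if (ev.what == PROC_EVENT_FORK
            && ev.event_data.fork.child_pid == ev.event_data.fork.child_tgid)
            forks++;                    /* pid == tgid: a new process */
    }
    return forks;
}
```

A real program would do something more useful than count, of course, but the loop and the three checks would stay the same.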
Like I keep saying, it’s a firehose. It would be great if there was some way to filter the socket in the kernel so that our process doesn’t even get woken up for messages. Wake-ups are bad, especially in the embedded space.
Fortunately there is a way to filter sockets on the kernel-side, the kernel socket filter interface. Unfortunately this isn’t too well documented either; but let’s use this opportunity to document an example.
We’ll filter the socket so that we only receive fork notifications, discarding the other types of proc connector event type and most importantly discarding the messages that indicate new threads being created (those where the pid and tgid fields differ). One important part of filtering is that you should be careful so that only expected messages are filtered, and that unexpected messages are still passed through.
The filter machine consists of a set of machine language instructions added to the socket through a special socket option. Fortunately this machine language is copied from the Berkeley Packet Filter from BSD, so we can find documentation for it in the bpf(4) manual page there. Just ignore the structure definitions, because they are different on Linux.
So let’s get started with our example; first we need to add the right header.
#include <linux/filter.h>
And now we need to attach the filter to the socket; just before the subscription message is sent is usually a good place. On Linux the instructions are given as an array of struct sock_filter members which we can construct using the BPF_STMT and BPF_JUMP macros.
Just to make sure everything is working, we’ll create a simple “no-op” filter.
struct sock_filter filter[] = {
BPF_STMT (BPF_RET|BPF_K, 0xffffffff),
};
struct sock_fprog fprog;
fprog.filter = filter;
fprog.len = sizeof filter / sizeof filter[0];
setsockopt (sock, SOL_SOCKET, SO_ATTACH_FILTER, &fprog, sizeof fprog);
Not very useful, but it means we can now concentrate on writing the filter code itself. This filter consists of a single statement, BPF_RET that tells the kernel to deliver an amount of bytes of the packet to the receiving process and to return from the filter. The BPF_K option means that we give the amount of bytes as the argument to the statement, and in this case we give the largest possible value. In other words, this statement declares to deliver the whole packet and return from the filter.
To not wake up the process at all and filter everything, we deliver no bytes and return from the filter.
BPF_STMT (BPF_RET|BPF_K, 0),
You may want to test that too.
Ok, now let’s actually do some examination of the packets to filter out the noise. Recall that we’re dealing with nested messages here, messages inside messages, inside messages. Visualizing this is really important to understanding what you’re dealing with.
The most basic filter code consists of three operations: load a value from the packet into the machine’s accumulator, compare that against a value and jump to a different instruction if equal (or not equal), and then possibly return or perform another operation.
All of the following filter code replaces whatever you had in the filter[] array before.
So first we should examine the nlmsghdr at the start of the packet; we want to make sure that there is just one netlink message in this packet. If there are multiple, we just pass the whole packet to userspace for dealing with. We check the nlmsg_type field to make sure it contains the value NLMSG_DONE.
BPF_STMT (BPF_LD|BPF_H|BPF_ABS,
offsetof (struct nlmsghdr, nlmsg_type)),
BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_K,
htons (NLMSG_DONE),
1, 0),
BPF_STMT (BPF_RET|BPF_K, 0xffffffff),
The first statement says to load (BPF_LD) a “halfword” (16-bit) value (BPF_H) from the absolute offset (BPF_ABS) equivalent to the position of the nlmsg_type member in struct nlmsghdr. Since we expect that structure to be the start of the message, this means the accumulator should now have that value.
The next statement is a jump (BPF_JMP), it says to compare the accumulator for equality (BPF_JEQ) against the constant argument (BPF_K). We only want to continue if this is the sole message, so the value we compare against is NLMSG_DONE – first remembering to deal with host and network ordering.
If true, the jump will jump one statement; if false the jump will not jump any statements. These are the third and fourth arguments to the BPF_JUMP macro.
Note that the error case is always to return the whole packet to the process, waking it up. And the success case is further processing of the packet. This makes sure that we don’t filter unexpected packets that userspace may really need to deal with. Don’t use the socket filter for security filtering; it’s for reducing wake-ups.
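If the load, compare and jump semantics are hard to picture, a toy interpreter can help. The following is a simplified userspace model of just the instructions this article uses, written for illustration; it is not the kernel’s implementation, and run_filter is a made-up name. It shows why the comparison constants go through htons and htonl (absolute loads read the packet as big-endian) and how the jump offsets count from the instruction after the jump.

```c
#include <arpa/inet.h>
#include <linux/filter.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: returns the filter's verdict (the number of bytes
 * to deliver), or 0 for a failed load or an unsupported instruction. */
uint32_t run_filter(const struct sock_filter *prog, size_t n,
                    const unsigned char *pkt, size_t len)
{
    uint32_t a = 0, x = 0, mem[BPF_MEMWORDS] = { 0 };

    for (size_t pc = 0; pc < n; pc++) {
        const struct sock_filter *f = &prog[pc];
        uint16_t h;
        uint32_t w;

        switch (f->code) {
        case BPF_LD | BPF_H | BPF_ABS:  /* A = 16 bits at offset k */
            if (f->k > len || len - f->k < sizeof h)
                return 0;
            memcpy(&h, pkt + f->k, sizeof h);
            a = ntohs(h);               /* loads are big-endian */
            break;
        case BPF_LD | BPF_W | BPF_ABS:  /* A = 32 bits at offset k */
            if (f->k > len || len - f->k < sizeof w)
                return 0;
            memcpy(&w, pkt + f->k, sizeof w);
            a = ntohl(w);
            break;
        case BPF_ST:                    /* scratch[k] = A */
            if (f->k >= BPF_MEMWORDS)
                return 0;
            mem[f->k] = a;
            break;
        case BPF_LDX | BPF_W | BPF_MEM: /* X = scratch[k] */
            if (f->k >= BPF_MEMWORDS)
                return 0;
            x = mem[f->k];
            break;
        case BPF_JMP | BPF_JEQ | BPF_K: /* skip jt or jf instructions */
            pc += (a == f->k) ? f->jt : f->jf;
            break;
        case BPF_JMP | BPF_JEQ | BPF_X:
            pc += (a == x) ? f->jt : f->jf;
            break;
        case BPF_RET | BPF_K:           /* verdict */
            return f->k;
        default:
            return 0;
        }
    }
    return 0;                           /* ran off the end of the program */
}
```

Feeding it a hand-built packet is a cheap way to check a filter before attaching it to a real socket.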
So let’s filter the next set of values, we want to make sure that this netlink message is from the connector interface. Again we load the right “word” (32-bit) values (BPF_W) from the appropriate offsets and check them against constants.
BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
NLMSG_LENGTH (0) + offsetof (struct cn_msg, id)
+ offsetof (struct cb_id, idx)),
BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_K,
htonl (CN_IDX_PROC),
1, 0),
BPF_STMT (BPF_RET|BPF_K, 0xffffffff),
BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
NLMSG_LENGTH (0) + offsetof (struct cn_msg, id)
+ offsetof (struct cb_id, val)),
BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_K,
htonl (CN_VAL_PROC),
1, 0),
BPF_STMT (BPF_RET|BPF_K, 0xffffffff),
So after this filter code has executed, we know the packet contains a single netlink message from the proc connector. Now we want to make sure it’s a fork message; this is a bit different from before, because now we explicitly do filter out the other message types so the return case for non-equality is to return zero bytes.
BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
NLMSG_LENGTH (0) + offsetof (struct cn_msg, data)
+ offsetof (struct proc_event, what)),
BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_K,
htonl (PROC_EVENT_FORK),
1, 0),
BPF_STMT (BPF_RET|BPF_K, 0),
And now we can compare the pid and tgid values for the parent process and the child process fields. This is again slightly interesting because we can’t compare against an absolute offset with the jump instruction so we use the second index register instead (BPF_X in the jump instruction). Of course it would be too easy if we could load directly into that, so we have to do it via the scratch memory store instead; this requires loading into the accumulator (BPF_LD), storing into scratch memory (BPF_ST) and loading the index register (BPF_LDX) from scratch memory (BPF_MEM).
BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
NLMSG_LENGTH (0) + offsetof (struct cn_msg, data)
+ offsetof (struct proc_event, event_data)
+ offsetof (struct fork_proc_event, parent_pid)),
BPF_STMT (BPF_ST, 0),
BPF_STMT (BPF_LDX|BPF_W|BPF_MEM, 0),
Then we load the tgid value into the accumulator and we can compare and jump as before; if they are equal we want to continue, if they are unequal we want to filter the packet.
BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
NLMSG_LENGTH (0) + offsetof (struct cn_msg, data)
+ offsetof (struct proc_event, event_data)
+ offsetof (struct fork_proc_event, parent_tgid)),
BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_X,
0,
1, 0),
BPF_STMT (BPF_RET|BPF_K, 0),
Then we do the same for the child field.
BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
NLMSG_LENGTH (0) + offsetof (struct cn_msg, data)
+ offsetof (struct proc_event, event_data)
+ offsetof (struct fork_proc_event, child_pid)),
BPF_STMT (BPF_ST, 0),
BPF_STMT (BPF_LDX|BPF_W|BPF_MEM, 0),
BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
NLMSG_LENGTH (0) + offsetof (struct cn_msg, data)
+ offsetof (struct proc_event, event_data)
+ offsetof (struct fork_proc_event, child_tgid)),
BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_X,
0,
1, 0),
BPF_STMT (BPF_RET|BPF_K, 0),
After all that filter hurdling, we have a packet that we want to pass through to the process, so the final instruction is a return of the largest packet size.
BPF_STMT (BPF_RET|BPF_K, 0xffffffff),
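Since the filter was presented in pieces, here is a sketch of how the whole program might look in one place. The shorthand macros (CN_OFF, LOAD_W, PASS and friends) and the function build_fork_filter are my own inventions to keep the listing short; they are not kernel macros. The array is a local variable rather than a static one because htons and htonl are not guaranteed to be usable in a constant initializer.

```c
#include <arpa/inet.h>
#include <linux/cn_proc.h>
#include <linux/connector.h>
#include <linux/filter.h>
#include <linux/netlink.h>
#include <stddef.h>
#include <string.h>

/* Offsets of the nested structures from the start of the datagram. */
#define CN_OFF(m)   (NLMSG_LENGTH(0) + offsetof(struct cn_msg, m))
#define EV_OFF(m)   (CN_OFF(data) + offsetof(struct proc_event, m))
#define FORK_OFF(m) \
    (EV_OFF(event_data) + offsetof(struct fork_proc_event, m))

/* Shorthand for this sketch only. */
#define LOAD_H(off)     BPF_STMT(BPF_LD | BPF_H | BPF_ABS, (off))
#define LOAD_W(off)     BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (off))
#define SKIP_IF_EQ_K(v) BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (v), 1, 0)
#define SKIP_IF_EQ_X    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_X, 0, 1, 0)
#define A_TO_X \
    BPF_STMT(BPF_ST, 0), BPF_STMT(BPF_LDX | BPF_W | BPF_MEM, 0)
#define PASS            BPF_STMT(BPF_RET | BPF_K, 0xffffffff)
#define DROP            BPF_STMT(BPF_RET | BPF_K, 0)

/* Hypothetical helper: copies the program into out and returns the
 * instruction count, or 0 if out has room for fewer than that. */
size_t build_fork_filter(struct sock_filter *out, size_t max)
{
    const struct sock_filter filter[] = {
        /* Anything but a lone NLMSG_DONE message goes to userspace. */
        LOAD_H(offsetof(struct nlmsghdr, nlmsg_type)),
        SKIP_IF_EQ_K(htons(NLMSG_DONE)),
        PASS,
        /* Not from the proc connector: let userspace decide. */
        LOAD_W(CN_OFF(id) + offsetof(struct cb_id, idx)),
        SKIP_IF_EQ_K(htonl(CN_IDX_PROC)),
        PASS,
        LOAD_W(CN_OFF(id) + offsetof(struct cb_id, val)),
        SKIP_IF_EQ_K(htonl(CN_VAL_PROC)),
        PASS,
        /* Drop every event type other than fork. */
        LOAD_W(EV_OFF(what)),
        SKIP_IF_EQ_K(htonl(PROC_EVENT_FORK)),
        DROP,
        /* Drop unless parent_pid == parent_tgid. */
        LOAD_W(FORK_OFF(parent_pid)), A_TO_X,
        LOAD_W(FORK_OFF(parent_tgid)),
        SKIP_IF_EQ_X,
        DROP,
        /* Drop new threads, where child_pid != child_tgid. */
        LOAD_W(FORK_OFF(child_pid)), A_TO_X,
        LOAD_W(FORK_OFF(child_tgid)),
        SKIP_IF_EQ_X,
        DROP,
        PASS,
    };
    size_t n = sizeof filter / sizeof filter[0];

    if (n > max)
        return 0;
    memcpy(out, filter, sizeof filter);
    return n;
}
```

The returned count and the caller’s array then go into the len and filter members of a struct sock_fprog for the SO_ATTACH_FILTER call shown earlier.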
That’s it. Of course, what you do with this is up to you. One example could be a daemon that watches for excessive forks and kills fork bombs before they kill the machine. Since you get notification of changes of uid or gid, another example could be a security audit daemon, etc.
Upstart uses this interface for its own nefarious process tracking purposes.



Don’t forget about BPF_MISC+BPF_TAX and BPF_TXA to transfer between X and A.
Oh, wow.
That wasn’t in the manpage I’d read, but was in the one I linked to
Good stuff – just posted a link.
One question, though: in this day and age, do you really want to use these facilities to track forks, or would it be better just to use ftrace and turn on a tracepoint or two? The interface is rather simpler, and the filtering gets taken care of from the start…
Thanks Jonathan; I think it’d be certainly interesting to do a follow-up article looking at similar tracing using ftrace! I actually don’t use the proc connector in Upstart for tracking forks anymore, I only use it for watching for setsid() calls. I wanted to dump the older knowledge out of my brain somewhere so blogging seemed a good choice.
I’d actually thought about doing an ftrace blog post as well covering what ureadahead does to trace open()s and exec()s, you’ve tempted me to make that a follow-up to this one
Well, I don’t think it’s advisable trying to follow forks like this at all. Let the kernel do the tracking with stuff like named cgroups. I heard there’s a well-known init implementation doing just that.
Yup, it’s totally true that you can overflow this queue just like any kernel->userspace queue really.
But at least if you were, for example, checking for fork bombs – you could use the very fact that it overflowed as a good hint something bad was afoot.
Do you have any downloadable source code?
John: for this? The post is intended to document the interfaces for people writing their own code, rather than to be a generic tool
Nicely done!
*BSD had the kqueue/kevent infra for a while now, and finally Linux has this.
anonymous: actually this is quite an old interface in Linux
Yup — it’s been in the kernel for 5 years (since 2.6.15).
Great article!
Time spent writing and testing the code can skew what a developer considers obvious and the documentation can suffer as a result. Balancing utility, completeness, brevity, and keeping the attention of your audience is difficult to achieve — especially given the time constraints a developer is likely to encounter.
Consequently the involvement of others who are initially less familiar with the code yet take the time to suggest improvements to or write new documentation is extremely valuable. This is true even if it is an old interface with better modern alternatives — as Jonathan Corbet pointed out.
The problem with this and similar interfaces is that they’re not reliable. The queue can overflow, so you can never be sure that your model of system state is accurate. From a strict security perspective, logging is a stronger form of auditing because log messages aren’t discarded, and a process logging too fast for the log to be committed will simply block. In contrast, a fork() or setuid() isn’t going to block just because your netlink PID or UID message queue is overflowing; instead, you’ll just miss it. If you think that your netlink queue will never overflow because your code is so lean and tight, then you’re not thinking like an attacker.
To the guest who says this is not reliable :
Logging isn’t fool-proof either. Crashes (either of the log daemon at the wrong time or the system). There’s other possibilities too. I mean you can edit logs (say your system gets compromised – your system files can be tampered with then). Or what if someone finds a buffer overflow in the program or logging daemon and exploits it ? You think it’s going to log things completely accurate ? Maybe, but then again maybe not: memory bugs do modify memory and are among the ugliest bugs to deal with (unless you know where in the code it is and that’s not always the case). What was that about you aren’t thinking like an attacker?
So your idea is flawed too. Besides that, there’s more to the OP than logging or security. I can think of several examples. strace anyone ? Or what about things like installwatch ? Or soapbox (from Dag) ? Etc. This is monitoring system calls, too. And yes, one is over a netlink and others not. The thing that is the same however, is what it is doing: monitoring syscalls. And other very useful programs use netlinks, I might add. I’d rather they keep it that way.
Best of all, you do realize you CAN LOG over netlinks in addition to the syslog (yes, at the same time even!), right ? Thus, your entire point is actually rather (amusingly so) invalid – no code is 100% secure; no computer is 100% secure. Even if it’s off – there’s possibly physical access. And physical access = root access. You say you locked the computer with passwords and even locks? What if they steal a key, or .. etc. Possibilities are endless.
An approach is fatally flawed when it logically is insufficient to accomplish the task. All solutions should be sound presuming a computer–hardware and software–is capable of consistently answering 1 + 1 correctly. When I speak of an attacker breaking the system, I don’t mean inducing hardware failure; I mean increasing the chance of the patent flaw in the logic. The logging solution is sound with the sole constraint of logging space; but if there is no more space then the logical thing to do to maintain the security guarantee is to stop processing.
Whether a solution is robust in the face of non-deterministic behavior–such as hardware failure–is an entirely different problem. As is the issue of cost-benefit analysis of a security measure.
BTW, existing syscall tracing is fatally flawed. This is why systrace was deprecated in OpenBSD and elsewhere; because it’s susceptible to timing attacks. (The solution, copying all syscall arguments into kernel buffers before checking, seriously reduces performance.) So I’ll restate what I said: the only fool-proof (absent hardware failure) way to audit a system is through logging.
Just to be more clear, to break the netlink scheme an attacker needn’t necessarily find a logic flaw in any bit of code. It would be sufficient to exercise the system in a way that overflowed the message queue. That was my basic point. And a scheme which can be subverted without breaking the logic of the scheme per se is, obviously, fatally flawed. A logging scheme wherein syscalls were blocked in the kernel while attempting to log the information is on its face perfectly sound; something which cannot be said for the netlink scheme.
Of course, you could change netlink so that a syscall blocks until the consumers of the event receive the messages in user-space, and then are able to signal the kernel to fail the syscall if the consumers are unable to process. The problem there is that this would be horribly inefficient, especially if all you wanted to do was provide for auditing ex post facto. The best scheme, then, is simple logging to permanent storage, something which is currently possible (or at least possible with trivial kernel changes).
Well anything that traces something else obviously has issues – be it performance or something else (even memory debuggers and watchers have issues). Security as you point out is another one (that’s kind of obvious). That does not mean they don’t have uses, though. General debuggers have uses, too. Big uses.
Anyway – system call tracing was only an example. One example.
(When I’m referring to netlinks I’m referring to them generally speaking).
Whether netlinks are flawed in some ways or not, the fact is (as the example you use) logging has its own issues too at some times – rare or not. Most importantly, as I’m sure you’re aware – redundancy is a good thing in computing. Therefore, how is logging over a netlink AND to disk at the same time, bad ? That’s what I was getting at (well one of the things I was getting at).
Other than that, view it like you please. I accept it has flaws. But that’s because _everything_ has flaws (or issues or problems or any other word you prefer to use). Again, no such thing as 100% security. Perfect example: encryption algorithms that were deemed impossible to break. Then a few weeks later, cracked. Even a computer that’s off is not 100% secure if you consider physical possibilities. And while you dismiss hardware failure, it’s still an issue. Whether it’s part of the picture or not – it’s still a way logging could fail if not sent to another location too (like, by, say, netlinks).
Anyway – Yes, you make valid points but so do I.
And that’s about all I have to say, really.
Kind regards.
Nice article. After reading this article, checked out netlink even more. It is quite underused in linux userland. This article about connectors is quite nice. Connectors as I see now look like equivalent to inotify,fanotify events in filesystem domain, albeit much more generic and customizable. But I was not able to understand the BPF statements, I have to checkout them more to understand that part. Regarding language bindings, I see that Python has one but not sure whether all the things done above can be done in that.
I wonder whether this can be used in other places like BSD process accounting since this looks lighter.
It’s a really bad idea to use cn_proc to keep track of processes belonging to services started by an init system such as Upstart. If the netlink buffer runs over the init system will just get an ENOBUFS and the message is gone. That means that if a process wants to escape Upstart’s supervision it just needs to do bit of fork bombing to make the queue overrun and — whoops! — it can sneak out of the supervision.
Using cn_proc might be useful for debugging purposes and stuff. For example its use in bootchart is a good thing. But it’s not really useful for anything used in production, because it’s inherently unreliable.
Upstart doesn’t actually use PROC_EVENT_FORK, but it was the easiest one to demonstrate for purposes of the documentation/blog.
See if you can guess which one it does use (hint: it’s one I added to the kernel myself a while back)
I’m curious what it is. I try to only do POSIX-portable application programming so am not very familiar with Linux’isms. I’m inclined to think that you would need a way to globally gate the allocation or deallocation of PIDs because you wouldn’t want a PID reallocated before the tracker was able to update its process map.
Good stuff. One small quibble: I think “parent_pid” and “parent_tgid” in the second-last code block should be “child_pid” and “child_tgid”.
Really nice and insightful article.
Some old reference regarding proc connector:
[RFC] Process Events Connector (test program) by Matthew Helsley [2005]
It was slightly simplified and commented in:
linux process monitoring (exec, fork, exit, set*uid, set*gid) by bewareofgeek [2009]
Thanks for this, I’ve always wanted a tool that could do this. Except to actually get the names of the processes involved is more work. I’d like to think ftrace would be a better solution, but it never seems to be enabled on the kernels I’m working with. This is supported a lot further back.
I’ve watched for dying essential processes using the similar TASKSTATS netlink.
Great article!
I created a utility based on this to solve a problem with an otherwise great voip application.
If anyone finds it useful: http://github.com/pturmel/startmon
(I’ll add the socket filters eventually…)