The Proc Connector and Socket Filters

The proc connector is one of those interesting kernel features that most people rarely come across, and even more rarely find documentation on. Likewise the socket filter. This is a shame, because they’re both really quite useful interfaces that might serve a variety of purposes if they were better documented.

The proc connector allows you to receive notification of process events such as fork and exec calls, as well as changes to a process’s uid, gid or sid (session id). These are provided through a socket-based interface by reading instances of struct proc_event defined in the kernel header.

#include <linux/cn_proc.h>

The interface is built on the more generic connector API, which itself is built on the generic netlink API. These interfaces add some complexity as they are intended to provide bi-directional communication between the kernel and userspace; the connector API appears to have been largely forgotten as newer such socket interfaces simply declare their own first-class socket classes. So we need the headers for those too.

#include <linux/netlink.h>
#include <linux/connector.h>

(For brevity, I’ll omit the standard boilerplate you should already be used to: the headers you need for syscalls and library functions, function definitions, error checking, and so forth.)

Ok, now we’re ready to create the connector socket. This is straightforward enough; since we’re dealing with atomic messages rather than a stream, datagram is appropriate.

int sock;
sock = socket (PF_NETLINK, SOCK_DGRAM | SOCK_NONBLOCK | 
               SOCK_CLOEXEC, NETLINK_CONNECTOR);

To select the proc connector we bind the socket using a struct sockaddr_nl object.

struct sockaddr_nl addr;
addr.nl_family = AF_NETLINK;
addr.nl_pid = getpid ();
addr.nl_groups = CN_IDX_PROC;

bind (sock, (struct sockaddr *)&addr, sizeof addr);
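Since I’m skipping error checking in the snippets, here’s a rough sketch of the socket creation and bind with the checks put back in. The function name open_proc_connector is my own invention, and I’m assuming that on many kernels the bind fails with EPERM unless the process has CAP_NET_ADMIN, so check that on your own system.

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/connector.h>

/* Hypothetical helper: returns a socket bound to the proc connector
 * group, or -1 with errno set. */
int
open_proc_connector (void)
{
        struct sockaddr_nl addr;
        int sock;

        sock = socket (PF_NETLINK, SOCK_DGRAM | SOCK_NONBLOCK |
                       SOCK_CLOEXEC, NETLINK_CONNECTOR);
        if (sock < 0)
                return -1;

        /* Zero the whole structure so the padding field is defined. */
        memset (&addr, 0, sizeof addr);
        addr.nl_family = AF_NETLINK;
        addr.nl_pid = getpid ();
        addr.nl_groups = CN_IDX_PROC;

        if (bind (sock, (struct sockaddr *)&addr, sizeof addr) < 0) {
                int saved_errno = errno;

                close (sock);
                errno = saved_errno;
                return -1;
        }

        return sock;
}
```

The caller owns the returned descriptor and should close it when finished.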

Unfortunately that’s not quite enough yet; the proc connector socket is a bit of a firehose, so it doesn’t in fact send any messages until a process has subscribed to it. So we have to send a subscription message.

As I mentioned before, the proc connector is built on top of the generic connector, and that itself is on top of netlink, so sending that subscription message also involves embedding a message inside a message inside a message. If you understood Christopher Nolan’s Inception, you should do just fine.

Since we’re nesting a proc connector operation message inside a connector message inside a netlink message, it’s easiest to use a struct iovec for this kind of thing.

struct iovec iov[3];
char nlmsghdrbuf[NLMSG_LENGTH (0)];
struct nlmsghdr *nlmsghdr = (struct nlmsghdr *)nlmsghdrbuf;
struct cn_msg cn_msg;
enum proc_cn_mcast_op op;

nlmsghdr->nlmsg_len = NLMSG_LENGTH (sizeof cn_msg + sizeof op);
nlmsghdr->nlmsg_type = NLMSG_DONE;
nlmsghdr->nlmsg_flags = 0;
nlmsghdr->nlmsg_seq = 0;
nlmsghdr->nlmsg_pid = getpid ();

iov[0].iov_base = nlmsghdrbuf;
iov[0].iov_len = NLMSG_LENGTH (0);

cn_msg.id.idx = CN_IDX_PROC;
cn_msg.id.val = CN_VAL_PROC;
cn_msg.seq = 0;
cn_msg.ack = 0;
cn_msg.len = sizeof op;

iov[1].iov_base = &cn_msg;
iov[1].iov_len = sizeof cn_msg;

op = PROC_CN_MCAST_LISTEN;

iov[2].iov_base = &op;
iov[2].iov_len = sizeof op;

writev (sock, iov, 3);

The netlink message length is the combined length of the following connector and proc connector operation messages, and is otherwise simply a message from our process id with no following messages. However all of the interfaces to netlink take a lot of care to make sure the following structure in the message is aligned as wide as possible using the NLMSG_LENGTH macro, to avoid issues with platforms that have fixed alignment for data types, so we have to be careful of that too.

So we actually have a bit of padding between the struct nlmsghdr and the struct cn_msg. This is accomplished by using a character buffer of the right size for the first struct iovec element and accessing it through a struct nlmsghdr pointer.

The connector message indicates that it is relevant to the proc connector through the idx and val fields, and the length is the length of the proc connector operation message.

Finally the proc connector operation message (just an enum) says we want to subscribe. Why isn’t there padding between the connector and proc connector operation messages? Because the last element in struct cn_msg is a zero-width type, which results in the right padding; this interface is rather newer than netlink.

struct iovec stitches it all together so it’s sent as a single message. Visualized, the message looks like this:

[Figure: enum-proc_cn_mcast_op.png]

There’s a matching PROC_CN_MCAST_IGNORE message if you want to turn off the firehose without closing the socket.
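Since the subscribe and unsubscribe messages differ only in the operation, it can be handy to build both with one helper. Here’s a minimal sketch that lays out the same three layers in a single contiguous buffer instead of using struct iovec. The name build_proc_cn_op is my own and not part of any API, and I’m assuming the caller passes a buffer aligned for struct nlmsghdr.

```c
#include <string.h>
#include <unistd.h>
#include <linux/netlink.h>
#include <linux/connector.h>
#include <linux/cn_proc.h>

/* Hypothetical helper: write nlmsghdr + cn_msg + op into buf.
 * Returns the number of bytes to send, or 0 if buf is too small. */
size_t
build_proc_cn_op (void *buf, size_t size, enum proc_cn_mcast_op op)
{
        size_t total = NLMSG_LENGTH (sizeof (struct cn_msg) + sizeof op);
        struct nlmsghdr *nlh = buf;
        struct cn_msg *cn;

        if (size < total)
                return 0;

        /* Zeroing also takes care of flags, seq, ack and padding. */
        memset (buf, 0, total);

        nlh->nlmsg_len = total;
        nlh->nlmsg_type = NLMSG_DONE;
        nlh->nlmsg_pid = getpid ();

        /* NLMSG_DATA skips the aligned header to reach the payload. */
        cn = NLMSG_DATA (nlh);
        cn->id.idx = CN_IDX_PROC;
        cn->id.val = CN_VAL_PROC;
        cn->len = sizeof op;

        /* The operation follows struct cn_msg with no extra padding. */
        memcpy (cn->data, &op, sizeof op);

        return total;
}
```

You would then pass the buffer and the returned length to write or send on the connector socket, once with PROC_CN_MCAST_LISTEN and later with PROC_CN_MCAST_IGNORE.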

Ok, the firehose is on; now we need to read the stream of messages. Just like the message we sent, the messages we receive are actually netlink messages, and inside those netlink messages are connector messages, and inside those are proc connector messages.

Netlink allows for all sorts of things like multi-part messages. In reality we can ignore most of that since connector doesn’t use them, but it’s worth future-proofing ourselves and being liberal in what we accept.

struct msghdr msghdr;
struct sockaddr_nl addr;
struct iovec iov[1];
char buf[PAGE_SIZE];
ssize_t len;

msghdr.msg_name = &addr;
msghdr.msg_namelen = sizeof addr;
msghdr.msg_iov = iov;
msghdr.msg_iovlen = 1;
msghdr.msg_control = NULL;
msghdr.msg_controllen = 0;
msghdr.msg_flags = 0;

iov[0].iov_base = buf;
iov[0].iov_len = sizeof buf;

len = recvmsg (sock, &msghdr, 0);

Why do we use recvmsg rather than just read? Because netlink allows arbitrary processes to send messages to each other, so we need to make sure the message actually comes from the kernel; otherwise you have a potential security vulnerability. recvmsg lets us receive the sender address as well as the data.

if (addr.nl_pid != 0)
        continue;

(I’m assuming you’re reading in a loop there.)
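Put together, the receive step might look something like the sketch below. The helper name recv_from_kernel is mine, and the MSG_TRUNC check is an extra precaution I’ve assumed is worth having in case the buffer turns out too small for a datagram.

```c
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/netlink.h>

/* Hypothetical helper: receive one datagram into buf.
 * Returns its length, 0 if it should be skipped (not sent by the
 * kernel, or truncated), or -1 with errno set. */
ssize_t
recv_from_kernel (int sock, void *buf, size_t size)
{
        struct sockaddr_nl addr;
        struct iovec iov;
        struct msghdr msghdr;
        ssize_t len;

        memset (&addr, 0, sizeof addr);
        memset (&msghdr, 0, sizeof msghdr);

        iov.iov_base = buf;
        iov.iov_len = size;

        msghdr.msg_name = &addr;
        msghdr.msg_namelen = sizeof addr;
        msghdr.msg_iov = &iov;
        msghdr.msg_iovlen = 1;

        len = recvmsg (sock, &msghdr, 0);
        if (len < 0)
                return -1;

        /* The kernel always sends from netlink pid 0. */
        if (msghdr.msg_namelen != sizeof addr
            || addr.nl_family != AF_NETLINK
            || addr.nl_pid != 0)
                return 0;

        /* A truncated datagram can't be parsed safely. */
        if (msghdr.msg_flags & MSG_TRUNC)
                return 0;

        return len;
}
```

In the read loop, a return of 0 corresponds to the continue above, and -1 with errno set to EAGAIN means the non-blocking socket has nothing more for now.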

So now we have a netlink message package from the kernel. This may contain multiple individual netlink messages (it doesn’t, but it may), so we iterate over those.

for (struct nlmsghdr *nlmsghdr = (struct nlmsghdr *)buf;
     NLMSG_OK (nlmsghdr, len);
     nlmsghdr = NLMSG_NEXT (nlmsghdr, len))

And we should ignore error or no-op messages from netlink.

if ((nlmsghdr->nlmsg_type == NLMSG_ERROR)
    || (nlmsghdr->nlmsg_type == NLMSG_NOOP))
        continue;

Inside each individual netlink message is a connector message, we extract that and make sure it comes from the proc connector system.

struct cn_msg *cn_msg = NLMSG_DATA (nlmsghdr);

if ((cn_msg->id.idx != CN_IDX_PROC)
    || (cn_msg->id.val != CN_VAL_PROC))
        continue;

Now we can safely extract the proc connector message; this is a struct proc_event that we haven’t seen before. It’s quite a large structure definition so I won’t paste it here, since it contains a union for each of the different possible message types. Instead here’s code to actually print the relevant contents for an example message.

struct proc_event *ev = (struct proc_event *)cn_msg->data;

switch (ev->what) {
case PROC_EVENT_FORK:
        printf ("FORK %d/%d -> %d/%d\n",
                ev->event_data.fork.parent_pid,
                ev->event_data.fork.parent_tgid,
                ev->event_data.fork.child_pid,
                ev->event_data.fork.child_tgid);
        break;
/* more message types here */
}

As you can see, each message type has an associated member of union event_data containing the information fields for it. This also gives you information about each individual kernel task, not just the top-level processes you’re normally used to seeing. In other words, you see threads as well as processes.
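To tie the parsing steps together, here’s a sketch that walks one received datagram and counts only the forks that created a new process, where the child’s pid and tgid match. The helper name count_process_forks is made up for illustration. It copies the event out of the packet before reading it, on the assumption that the payload may not be suitably aligned for struct proc_event.

```c
#include <stddef.h>
#include <string.h>
#include <linux/netlink.h>
#include <linux/connector.h>
#include <linux/cn_proc.h>

/* Hypothetical helper: count fork events in one datagram whose child
 * is a new process rather than a new thread. */
int
count_process_forks (void *buf, size_t buflen)
{
        /* Only the event header and the fork member are needed. */
        const size_t need = offsetof (struct proc_event, event_data)
                + sizeof (((struct proc_event *)0)->event_data.fork);
        int len = (int)buflen;
        int count = 0;

        for (struct nlmsghdr *nlh = buf;
             NLMSG_OK (nlh, len);
             nlh = NLMSG_NEXT (nlh, len)) {
                struct cn_msg *cn = NLMSG_DATA (nlh);
                struct proc_event ev;

                if ((nlh->nlmsg_type == NLMSG_ERROR)
                    || (nlh->nlmsg_type == NLMSG_NOOP))
                        continue;

                /* Too short to hold a connector message and event. */
                if (nlh->nlmsg_len < NLMSG_LENGTH (sizeof *cn + need))
                        continue;

                if ((cn->id.idx != CN_IDX_PROC)
                    || (cn->id.val != CN_VAL_PROC))
                        continue;

                /* Copy out rather than cast, to avoid unaligned reads. */
                memset (&ev, 0, sizeof ev);
                memcpy (&ev, cn->data, need);

                if ((ev.what == PROC_EVENT_FORK)
                    && (ev.event_data.fork.child_pid
                        == ev.event_data.fork.child_tgid))
                        count++;
        }

        return count;
}
```

This is the same check done in userspace, so the process still wakes up for every event.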

Like I keep saying, it’s a firehose. It would be great if there was some way to filter the socket in the kernel so that our process doesn’t even get woken up for messages. Wake-ups are bad, especially in the embedded space.

Fortunately there is a way to filter sockets on the kernel-side, the kernel socket filter interface. Unfortunately this isn’t too well documented either; but let’s use this opportunity to document an example.

We’ll filter the socket so that we only receive fork notifications, discarding the other types of proc connector event type and most importantly discarding the messages that indicate new threads being created (those where the pid and tgid fields differ). One important part of filtering is that you should be careful so that only expected messages are filtered, and that unexpected messages are still passed through.

The filter machine consists of a set of machine language instructions added to the socket through a special socket option. Fortunately this machine language is copied from the Berkeley Packet Filter from BSD, so we can find documentation for it in the bpf(4) manual page there. Just ignore the structure definitions, because they are different on Linux.

So let’s get started with our example; first we need to add the right header.

#include <linux/filter.h>

And now we need to attach the filter to the socket; just after creating it, before the subscription message is sent, is usually a good place. On Linux the instructions are given as an array of struct sock_filter members which we can construct using the BPF_STMT and BPF_JUMP macros.

Just to make sure everything is working, we’ll create a simple “no-op” filter.

struct sock_filter filter[] = {
        BPF_STMT (BPF_RET|BPF_K, 0xffffffff),
};

struct sock_fprog fprog;
fprog.filter = filter;
fprog.len = sizeof filter / sizeof filter[0];

setsockopt (sock, SOL_SOCKET, SO_ATTACH_FILTER, &fprog,
            sizeof fprog);

Not very useful, but it means we can now concentrate on writing the filter code itself. This filter consists of a single statement, BPF_RET, that tells the kernel to deliver a number of bytes of the packet to the receiving process and to return from the filter. The BPF_K option means that we give the number of bytes as the argument to the statement, and in this case we give the largest possible value. In other words, this statement says to deliver the whole packet and return from the filter.

To filter everything and not wake up the process at all, we deliver no bytes and return from the filter.

BPF_STMT (BPF_RET|BPF_K, 0);

You may want to test that too.

Ok, now let’s actually do some examination of the packets to filter out the noise. Recall that we’re dealing with nested messages here, messages inside messages, inside messages. Visualizing this is really important to understanding what you’re dealing with.

[Figure: struct-proc_event.png]

The most basic filter code consists of three operations: load a value from the packet into the machine’s accumulator, compare that against a value and jump to a different instruction if equal (or not equal), and then possibly return or perform another operation.

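All of the following filter code replaces whatever you had in the filter[] array before. Each statement becomes one element of that array, so inside the initializer they are separated by commas rather than the semicolons shown here.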

So first we should examine the struct nlmsghdr at the start of the packet; we want to make sure that there is just one netlink message in this packet. If there are multiple, we just pass the whole packet to userspace to deal with. We check the nlmsg_type field to make sure it contains the value NLMSG_DONE.

BPF_STMT (BPF_LD|BPF_H|BPF_ABS,
          offsetof (struct nlmsghdr, nlmsg_type));
BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_K,
          htons (NLMSG_DONE),
          1, 0);
BPF_STMT (BPF_RET|BPF_K, 0xffffffff);

The first statement says to load (BPF_LD) a “halfword” (16-bit) value (BPF_H) from the absolute offset (BPF_ABS) equivalent to the position of the nlmsg_type member in struct nlmsghdr. Since we expect that structure to be the start of the message, this means the accumulator should now have that value.

The next statement is a jump (BPF_JMP), it says to compare the accumulator for equality (BPF_JEQ) against the constant argument (BPF_K). We only want to continue if this is the sole message, so the value we compare against is NLMSG_DONE—first remembering to deal with host and network ordering.

If true, the jump will jump one statement; if false the jump will not jump any statements. These are the third and fourth arguments to the BPF_JUMP macro.

Note that the error case is always to return the whole packet to the process, waking it up. And the success case is further processing of the packet. This makes sure that we don’t filter unexpected packets that userspace may really need to deal with. Don’t use the socket filter for security filtering; it’s for reducing wake-ups.
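The byte ordering point is easy to get wrong, so here’s a small userspace model of it. The filter machine’s absolute loads read the packet as big-endian, while netlink headers are written in host byte order, which is why the constant goes through htons. Both function names below are made up for illustration; this only mimics the halfword load, it isn’t how the kernel implements it.

```c
#include <stddef.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <linux/netlink.h>

/* Illustrative model of BPF_LD|BPF_H|BPF_ABS: read two bytes at `off`
 * and interpret them as a big-endian value. */
uint32_t
bpf_load_half (const void *pkt, size_t off)
{
        const unsigned char *p = pkt;

        return ((uint32_t)p[off] << 8) | p[off + 1];
}

/* Returns non-zero if a header with the given type would pass the
 * comparison against htons (NLMSG_DONE). */
int
type_check_passes (uint16_t type)
{
        /* The header is stored in host byte order, as netlink does. */
        struct nlmsghdr nlh = { .nlmsg_type = type };

        return bpf_load_half (&nlh, offsetof (struct nlmsghdr, nlmsg_type))
                == htons (NLMSG_DONE);
}
```

On a big-endian machine htons does nothing and the load already matches; on a little-endian machine both sides end up byte-swapped, so the comparison works either way.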

So let’s filter the next set of values, we want to make sure that this netlink message is from the connector interface. Again we load the right “word” (32-bit) values (BPF_W) from the appropriate offsets and check them against constants.

BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
          NLMSG_LENGTH (0) + offsetof (struct cn_msg, id)
          + offsetof (struct cb_id, idx));
BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_K,
          htonl (CN_IDX_PROC),
          1, 0);
BPF_STMT (BPF_RET|BPF_K, 0xffffffff);

BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
          NLMSG_LENGTH (0) + offsetof (struct cn_msg, id)
          + offsetof (struct cb_id, val));
BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_K,
          htonl (CN_VAL_PROC),
          1, 0);
BPF_STMT (BPF_RET|BPF_K, 0xffffffff);

So after this filter code has executed, we know the packet contains a single netlink message from the proc connector. Now we want to make sure it’s a fork message; this is a bit different from before, because now we explicitly do filter out the other message types so the return case for non-equality is to return zero bytes.

BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
          NLMSG_LENGTH (0) + offsetof (struct cn_msg, data)
          + offsetof (struct proc_event, what));
BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_K,
          htonl (PROC_EVENT_FORK),
          1, 0);
BPF_STMT (BPF_RET|BPF_K, 0);

And now we can compare the pid and tgid values for the parent process and the child process fields. This is again slightly interesting because we can’t compare against an absolute offset with the jump instruction so we use the second index register instead (BPF_X in the jump instruction). Of course it would be too easy if we could load directly into that, so we have to do it via the scratch memory store instead; this requires loading into the accumulator (BPF_LD), storing into scratch memory (BPF_ST) and loading the index register (BPF_LDX) from scratch memory (BPF_MEM).

BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
          NLMSG_LENGTH (0) + offsetof (struct cn_msg, data)
          + offsetof (struct proc_event, event_data)
          + offsetof (struct fork_proc_event, parent_pid));
BPF_STMT (BPF_ST, 0);
BPF_STMT (BPF_LDX|BPF_W|BPF_MEM, 0);

Then we load the tgid value into the accumulator and we can compare and jump as before; if they are equal we want to continue, if they are not equal we want to filter the packet.

BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
          NLMSG_LENGTH (0) + offsetof (struct cn_msg, data)
          + offsetof (struct proc_event, event_data)
          + offsetof (struct fork_proc_event, parent_tgid));
BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_X,
          0,
          1, 0);
BPF_STMT (BPF_RET|BPF_K, 0);

Then we do the same for the child field.

BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
          NLMSG_LENGTH (0) + offsetof (struct cn_msg, data)
          + offsetof (struct proc_event, event_data)
          + offsetof (struct fork_proc_event, child_pid));
BPF_STMT (BPF_ST, 0);
BPF_STMT (BPF_LDX|BPF_W|BPF_MEM, 0);

BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
          NLMSG_LENGTH (0) + offsetof (struct cn_msg, data)
          + offsetof (struct proc_event, event_data)
          + offsetof (struct fork_proc_event, child_tgid));
BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_X,
          0,
          1, 0);
BPF_STMT (BPF_RET|BPF_K, 0);

After all that filter hurdling, we have a packet that we want to pass through to the process, so the final instruction is a return of the largest packet size.

BPF_STMT (BPF_RET|BPF_K, 0xffffffff);
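For reference, here’s a sketch of the whole program assembled into one array, with the statements separated by commas as an array initializer needs. The macros CN_OFF, EV_OFF, PASS, DROP and LOAD_X and the function build_fork_filter are shorthand I’ve made up to keep it compact; they aren’t part of the kernel headers. The array is built inside a function because htons and htonl aren’t guaranteed to be constant expressions.

```c
#include <stddef.h>
#include <string.h>
#include <arpa/inet.h>
#include <linux/netlink.h>
#include <linux/connector.h>
#include <linux/cn_proc.h>
#include <linux/filter.h>

/* Packet offsets of connector and event fields (illustrative names). */
#define CN_OFF(m) (NLMSG_LENGTH (0) + offsetof (struct cn_msg, m))
#define EV_OFF(m) (CN_OFF (data) + offsetof (struct proc_event, m))

#define PASS BPF_STMT (BPF_RET|BPF_K, 0xffffffff)
#define DROP BPF_STMT (BPF_RET|BPF_K, 0)

/* Load the word at `off` into the index register via scratch slot 0. */
#define LOAD_X(off) \
        BPF_STMT (BPF_LD|BPF_W|BPF_ABS, off), \
        BPF_STMT (BPF_ST, 0), \
        BPF_STMT (BPF_LDX|BPF_W|BPF_MEM, 0)

/* Hypothetical helper: copy the program into out and return its
 * length in instructions, or 0 if out is too small. */
size_t
build_fork_filter (struct sock_filter *out, size_t max)
{
        struct sock_filter filter[] = {
                /* A single netlink message, else pass to userspace. */
                BPF_STMT (BPF_LD|BPF_H|BPF_ABS,
                          offsetof (struct nlmsghdr, nlmsg_type)),
                BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_K, htons (NLMSG_DONE), 1, 0),
                PASS,

                /* From the proc connector, else pass to userspace. */
                BPF_STMT (BPF_LD|BPF_W|BPF_ABS, CN_OFF (id.idx)),
                BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_K, htonl (CN_IDX_PROC), 1, 0),
                PASS,
                BPF_STMT (BPF_LD|BPF_W|BPF_ABS, CN_OFF (id.val)),
                BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_K, htonl (CN_VAL_PROC), 1, 0),
                PASS,

                /* A fork event, else drop. */
                BPF_STMT (BPF_LD|BPF_W|BPF_ABS, EV_OFF (what)),
                BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_K, htonl (PROC_EVENT_FORK), 1, 0),
                DROP,

                /* Parent pid equals parent tgid, else drop. */
                LOAD_X (EV_OFF (event_data.fork.parent_pid)),
                BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
                          EV_OFF (event_data.fork.parent_tgid)),
                BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_X, 0, 1, 0),
                DROP,

                /* Child pid equals child tgid, else drop. */
                LOAD_X (EV_OFF (event_data.fork.child_pid)),
                BPF_STMT (BPF_LD|BPF_W|BPF_ABS,
                          EV_OFF (event_data.fork.child_tgid)),
                BPF_JUMP (BPF_JMP|BPF_JEQ|BPF_X, 0, 1, 0),
                DROP,

                /* Everything checked out: deliver the whole packet. */
                PASS,
        };
        size_t n = sizeof filter / sizeof filter[0];

        if (max < n)
                return 0;

        memcpy (out, filter, sizeof filter);
        return n;
}
```

The returned count and the filled array go into the len and filter fields of struct sock_fprog, which is then handed to setsockopt with SO_ATTACH_FILTER just like the no-op filter.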

That’s it. Of course, what you do with this is up to you. One example could be a daemon that watches for excessive forks and kills fork bombs before they kill the machine. Since you get notification of changes of uid or gid, another example could be a security audit daemon, etc.

Upstart uses this interface for its own nefarious process tracking purposes.

 