Monday, June 15, 2015

Software Isolation in Linux

Starting from the assumption that software will always have bugs, we need a way to isolate and neutralize the effects of those bugs. One approach is to isolate the components of the software so that a breach in one component does not compromise another. In software engineering this approach became popular with the principle of privilege separation used by the OpenSSH server, and it is the application of an old idea to a new field. Isolation for the protection of assets is widespread in several other aspects of human activity; the most prominent example throughout history is quarantine, which isolates suspected carriers of epidemic diseases to protect the rest of the population. In this post, we briefly overview the tools available in the Linux kernel to offer isolation between software components. It is based on my FOSDEM 2015 presentation on software isolation in the security devroom, slightly enhanced.

Note that this text is intended for developers and software engineers looking for methods to isolate components in their software.

An important aspect of such an analysis is to clarify the threats that it targets. In this text I describe methods which protect against code injection attacks. Code injection gives the attacker a very strong tool: the ability to execute code, read arbitrary memory, etc.

In a typical Linux kernel based system, the tools we can use for such protection are the following.
  1. fork() + setuid() + exec()
  2. chroot()
  3. seccomp()
  4. prctl()
  5. SELinux
  6. Namespaces

The first allows for memory isolation by using different processes, ensuring that a forked process has no access to the parent's memory address space. The second allows a process to restrict itself to a part of the file system available in the operating system. These two are available in almost all POSIX systems, and are the oldest tools available, dating back to the first UNIX releases. The focus of this post is methods 3-6, which are fairly recent and specific to the Linux kernel.
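The memory isolation provided by fork() can be observed directly. The following is a minimal sketch, not taken from any particular project; the function name value_after_child_write() and the variable counter are hypothetical. The setuid() and exec() steps are left out because they need privileges and a second program.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int counter = 42;   /* lives in the parent's address space */

/* Forks a child which overwrites its own copy of `counter`, then
 * returns the value the parent still sees, or -1 on error. */
int value_after_child_write(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        counter = 1337;    /* changes only the child's private copy */
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return counter;        /* the child's write is not visible here */
}
```

The child starts with a copy of the parent's memory, so a real privilege-separated design would follow the fork() with exec() to discard that copy as well.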

Before proceeding, let's briefly review what a process is in a UNIX-like operating system. It is some code in memory which is scheduled to have a time slot on the CPU. It can access the virtual memory assigned to it, make calculations using the CPU user-mode instructions, and that's pretty much all. To do anything else, e.g., get additional memory, access files, or read/write the memory of other processes, it has to use the system calls provided by the operating system.
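To make the role of system calls concrete, here is a small sketch that asks the kernel for the process ID directly through syscall(2) instead of the libc getpid() wrapper. The function name raw_getpid() is hypothetical.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Enters the kernel directly, bypassing the libc wrapper getpid(). */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

Both paths end up in the same kernel entry point, which is the boundary that the methods below control.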

Let's now proceed and describe the available isolation methods.

Seccomp

After that introduction to processes, seccomp comes as the natural method to list first, since it is essentially a filter for the system calls available to a process. For example, a process can install a filter to allow read() and write() but nothing else. After such a filter is applied, any code which is injected into that process will not be able to execute any other system calls, reducing the attack impact to the allowed calls. In our particular example with read() and write(), only the data written and read by that process will be affected.

The simplest way to access seccomp is via the libseccomp library, which has a quite intuitive API. An example using that library, which creates a whitelist of three system calls, is shown below.
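The read()/write() scenario is close to what seccomp's original "strict" mode offers: according to seccomp(2), it permits only read(), write(), _exit() and sigreturn(), and kills the process with SIGKILL on anything else. The following is a minimal sketch of that behaviour, assuming a kernel with seccomp enabled; the function name run_strict_child() is hypothetical.

```c
#define _GNU_SOURCE
#include <linux/seccomp.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs a child in seccomp strict mode. If try_forbidden is nonzero the
 * child then issues getpid(), which is not on the allowed list.
 * Returns the child's wait status, or -1 on error. */
int run_strict_child(int try_forbidden)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(2);                    /* strict mode not available */

        /* a zero-length write still enters the kernel, and is allowed */
        ssize_t n = write(STDOUT_FILENO, "", 0);
        (void)n;

        if (try_forbidden)
            syscall(SYS_getpid);         /* not allowed: SIGKILL */

        /* libc's _exit() would use exit_group(), which is not allowed */
        syscall(SYS_exit, 0);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return status;
}
```

With try_forbidden set to 0 the child is expected to exit normally; with 1 it is expected to be terminated by SIGKILL. Strict mode is too rigid for most programs, which is why configurable filters are used in practice.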

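    #include <assert.h>
    #include <errno.h>      /* EPERM */
    #include <sys/ioctl.h>  /* SIOCGIFMTU */
    #include <seccomp.h>

    scmp_filter_ctx ctx;
    /* default action: fail any unlisted system call with EPERM */
    ctx = seccomp_init(SCMP_ACT_ERRNO(EPERM));
    assert(ctx != NULL);
    /* note: assert() is compiled out under NDEBUG; used here for brevity */
    assert(seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0) == 0);
    assert(seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0) == 0);
    /* allow ioctl() only when its second argument is SIOCGIFMTU */
    assert(seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(ioctl), 1, SCMP_A1(SCMP_CMP_EQ, (int)SIOCGIFMTU)) == 0);
    assert(seccomp_load(ctx) == 0);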

The example above installs a filter which allows the read(), write() and ioctl() system calls. The latter is only allowed if the second argument of ioctl() is SIOCGIFMTU. The first line of the filter setup instructs seccomp to make any system call outside this filter fail, returning -1 with errno set to EPERM.

The drawback of seccomp is the tedious process which a developer has to go through in order to figure out the system calls used by their application. Manual inspection of the code is required to discover the system calls, as well as inspection of run traces using the 'strace' tool. Manual inspection allows making a rough list which provides a starting point, but it will not be entirely accurate. The issue is that calls to libc functions may not correspond to the expected system call. For example, a call to exit() often results in an exit_group() system call.

The trace obtained using 'strace' will help clarify, restrict or extend the initial list. Note, however, that traces alone may miss the system calls used in error handling paths, and that different versions of libc may use different system calls for the same function. For example, the libc call select() uses the system call select() on the x86-64 architecture, but _newselect() on x86.

The performance cost of seccomp is the cost of executing the filter, and in most cases it is a fixed cost per system call. In the openconnect VPN server I estimated the impact of seccomp on a worker process at a 2-3% slowdown in transfer speed (from 624 to 607 Mbps). The measured server worker process executes read/send and recv/write in a tight loop.

prctl()

The PR_SET_DUMPABLE flag of the prctl() system call protects a process from other processes accessing its memory. That is, it prevents processes with the same privileges as the protected process from reading its memory via ptrace(). A usage example is shown below.

    #include <sys/prctl.h>

    prctl(PR_SET_DUMPABLE, 0);

While this approach doesn't protect against code injection in general, it may prove a useful tool with a low performance cost in several setups.
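Since a failed prctl() call would silently leave the process unprotected, it can be worth checking the result. A minimal sketch follows; the function name disable_dumpable() is hypothetical.

```c
#include <sys/prctl.h>

/* Clears the dumpable flag and returns the value the kernel reports
 * afterwards (0 means not dumpable), or -1 if a prctl() call fails. */
int disable_dumpable(void)
{
    if (prctl(PR_SET_DUMPABLE, 0, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_GET_DUMPABLE, 0, 0, 0, 0);
}
```

As far as I understand prctl(2), the kernel may reset this flag in some situations, for instance when the process changes its effective user ID, so it seems safest to call this after privileges have been dropped.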

SELinux

SELinux is an operating system mechanism to prevent processes from accessing various "objects" of the OS. The objects can be files, other processes, pipes, network interfaces etc. It may be used to enhance seccomp with more fine-grained access control. For example, one may set up a rule with seccomp to only allow read(), and enhance the rule with SELinux so that read() only accepts certain file descriptors.

On the other hand, a software engineer may not be able to rely too much on SELinux to provide isolation, because it is often not within the developer's control. It is typically used as an administrative tool, and the administrator may decide to turn it off, set it to non-enforcing mode, etc.

The way for a process to transition to a different SELinux ruleset is via exec() or via the setcon() call, and its cost is perceived to be high. However, I have no performance tests with software relying on it.

The drawbacks of this approach are the centralized nature of the system policy, meaning that individual applications can only apply the existing system policy and not update it; the obscure language (m4) in which a system policy needs to be written; and the fact that any policy written will not be portable across different Linux-based systems.

Linux Namespaces

One of the most recent additions to the Linux kernel is the Namespaces feature. Namespaces allow "virtualizing" certain Linux kernel subsystems for a process, and the result is often referred to as a container. The feature is available via the clone() and unshare() system calls and is documented in the unshare(2) and clone(2) man pages, but let's see some examples of the subsystems that can be isolated.

  • NEWPID: Prevents processes from "seeing" and accessing process IDs (PIDs) outside their namespace. That is, the first isolated process will believe it has PID 1 and will see only the processes it has forked.
  • NEWIPC: Prevents processes from accessing the main IPC subsystem (shared memory segments, message queues etc.). The processes will have access to their own IPC subsystem.
  • NEWNS: Provides filesystem isolation, in a way similar to a feature-rich chroot(). For example, it allows creating isolated mount points which exist only within a process.
  • NEWNET: Isolates processes from the main network subsystem. That is, it provides them with a separate networking stack, device interfaces, routing tables etc.

Let's see an example of an isolated process being created in its own PID namespace. The following code operates as fork() would do.

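    #define _GNU_SOURCE
    #include <sched.h>        /* CLONE_NEWPID */
    #include <signal.h>       /* SIGCHLD */
    #include <sys/syscall.h>  /* SYS_clone, SYS_getpid */
    #include <unistd.h>       /* syscall() */
    #if defined(__i386__) || defined(__arm__) || defined(__x86_64__) || defined(__mips__)
        long ret;
        int flags = SIGCHLD|CLONE_NEWPID;
        ret = syscall(SYS_clone, flags, 0, 0, 0);
        /* in the child (ret == 0), check that we really are PID 1 */
        if (ret == 0 && syscall(SYS_getpid) != 1)
                return -1;
        return ret;
    #endif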

This approach, of course, has a performance penalty in the time needed to create a new process. In my experiments with the openconnect VPN server, creating a process with the NEWPID, NEWNET and NEWIPC flags took 10 times longer than a call to fork().

Note, however, that the isolation subsystems available in clone() are by default reversible using the setns() system call. To ensure that these subsystems remain isolated even after code injection, seccomp must be used to eliminate calls to setns() (many thanks to the FOSDEM participant who brought that to my attention).

Furthermore, the approach of Namespaces follows the principle of blacklisting, allowing a developer to isolate a process from certain subsystems but not from every one available in the system. That is, one may enable all the isolation subsystems available in the clone() call, and then realize that the kernel keyring is still available to the isolated process. That is because there is no implementation of such an isolation mechanism for the kernel keyring so far.
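A seccomp whitelist that simply omits setns() already achieves this. For a process that cannot use a whitelist, the following sketch shows one possible way to reject only setns() with a raw seccomp BPF filter, without libseccomp. The function names deny_setns() and setns_is_blocked() are hypothetical, and as a simplification the filter does not check the architecture field of seccomp_data, which a production filter should do.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Installs a filter that fails setns() with EPERM and allows every
 * other system call. Simplification: no architecture check. */
int deny_setns(void)
{
    struct sock_filter insns[] = {
        /* load the system call number */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* setns: continue to the ERRNO return; otherwise skip to ALLOW */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_setns, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(insns) / sizeof(insns[0])),
        .filter = insns,
    };

    /* required for unprivileged processes to install a filter */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Installs the filter in a child and tries setns() there. Returns 1 if
 * the call failed with EPERM, 0 if it did not, -1 if the filter could
 * not be installed or the child could not be run. */
int setns_is_blocked(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (deny_setns() != 0)
            _exit(2);
        long r = syscall(SYS_setns, -1, 0);
        _exit((r == -1 && errno == EPERM) ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    if (WEXITSTATUS(status) == 2)
        return -1;
    return WEXITSTATUS(status) == 0;
}
```

Without the filter, the same setns() call on an invalid descriptor would be expected to fail with EBADF rather than EPERM, which is how the check distinguishes the two cases.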


In the following table I attempt to summarize the protection offerings of each of the described methods.

Method                Prevent killing    Prevent access to    Prevent access to    Prevent exploitation of
                      other processes    memory of other      shared memory        an unused system call bug
                                         processes
Seccomp               True               True                 True                 True
prctl(SET_DUMPABLE)   False              True                 False                False
SELinux               True               True                 True                 False
Namespaces            True               True                 True                 False

In my opinion, seccomp seems to be the best option to consider as an isolation mechanism when designing new software. Together with Linux Namespaces it can prevent access to other processes, shared memory and the filesystem, but the main distinguishing feature I see is that it can restrict access to unused system calls.

That is an important point, given the number of available system calls in a modern kernel. Not only does it reduce the overall attack surface by limiting them, but it also denies access to functionality that was never intended to be exposed (see setns() and the kernel keyring issue above).