
2020-09-30 11:10:03

Subject: For review: seccomp_user_notif(2) manual page

Hi Tycho, Sargun (and all),

I knew it would be a big ask, but below is kind of the manual page
I was hoping you might write [1] for the seccomp user-space notification
mechanism. Since you didn't (and because 5.9 adds various new pieces
such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
that also will need documenting [2]), I did :-). But of course I may
have made mistakes...

I've shown the rendered version of the page below, and would love
to receive review comments from you and others, and acks, etc.

There are a few FIXMEs sprinkled into the page, including one
that relates to what appears to me to be a misdesign (possibly
fixable) in the operation of the SECCOMP_IOCTL_NOTIF_RECV
operation. I would be especially interested in feedback on that
FIXME, and also of course the other FIXMEs.

The page includes an extensive (albeit slightly contrived)
example program, and I would be happy also to receive comments
on that program.

The page source currently sits in a branch (along with the text
that you sent me for the seccomp(2) page) at
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif

Thanks,

Michael

[1] https://lore.kernel.org/linux-man/[email protected]/#t
[2] Sargun, can you prepare something on SECCOMP_ADDFD_FLAG_SETFD
and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page?

=====

NAME
seccomp_user_notif - Seccomp user-space notification mechanism

SYNOPSIS
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <linux/audit.h>

int seccomp(unsigned int operation, unsigned int flags, void *args);

DESCRIPTION
This page describes the user-space notification mechanism pro‐
vided by the Secure Computing (seccomp) facility. As well as the
use of the SECCOMP_FILTER_FLAG_NEW_LISTENER flag, the SEC‐
COMP_RET_USER_NOTIF action value, and the SECCOMP_GET_NOTIF_SIZES
operation described in seccomp(2), this mechanism involves the
use of a number of related ioctl(2) operations (described below).

Overview
In conventional usage of a seccomp filter, the decision about how
to treat a particular system call is made by the filter itself.
The user-space notification mechanism allows the handling of the
system call to instead be handed off to a user-space process.
The advantages of doing this are that, by contrast with the sec‐
comp filter, which is running on a virtual machine inside the
kernel, the user-space process has access to information that is
unavailable to the seccomp filter and it can perform actions that
can't be performed from the seccomp filter.

In the discussion that follows, the process that has installed
the seccomp filter is referred to as the target, and the process
that is notified by the user-space notification mechanism is
referred to as the supervisor. An overview of the steps per‐
formed by these two processes is as follows:

1. The target process establishes a seccomp filter in the usual
manner, but with two differences:

· The seccomp(2) flags argument includes the flag SECCOMP_FIL‐
TER_FLAG_NEW_LISTENER. Consequently, the return value of
the (successful) seccomp(2) call is a new "listening" file
descriptor that can be used to receive notifications.

· In cases where it is appropriate, the seccomp filter returns
the action value SECCOMP_RET_USER_NOTIF. This return value
will trigger a notification event.

2. In order that the supervisor process can obtain notifications
using the listening file descriptor, (a duplicate of) that
file descriptor must be passed from the target process to the
supervisor process. One way in which this could be done is by
passing the file descriptor over a UNIX domain socket connec‐
tion between the two processes (using the SCM_RIGHTS ancillary
message type described in unix(7)). Another possibility is
that the supervisor might inherit the file descriptor via
fork(2).

3. The supervisor process will receive notification events on the
listening file descriptor. These events are returned as
structures of type seccomp_notif. Because this structure and
its size may evolve over kernel versions, the supervisor must
first determine the size of this structure using the sec‐
comp(2) SECCOMP_GET_NOTIF_SIZES operation, which returns a
structure of type seccomp_notif_sizes. The supervisor allo‐
cates a buffer of size seccomp_notif_sizes.seccomp_notif bytes
to receive notification events. In addition, the supervisor
allocates another buffer of size seccomp_notif_sizes.sec‐
comp_notif_resp bytes for the response (a struct sec‐
comp_notif_resp structure) that it will provide to the kernel
(and thus the target process).

4. The target process then performs its workload, which includes
system calls that will be controlled by the seccomp filter.
Whenever one of these system calls causes the filter to return
the SECCOMP_RET_USER_NOTIF action value, the kernel does not
execute the system call; instead, execution of the target
process is temporarily blocked inside the kernel and a notifi‐
cation event is generated on the listening file descriptor.

5. The supervisor process can now repeatedly monitor the listen‐
ing file descriptor for SECCOMP_RET_USER_NOTIF-triggered
events. To do this, the supervisor uses the SEC‐
COMP_IOCTL_NOTIF_RECV ioctl(2) operation to read information
about a notification event; this operation blocks until an
event is available. The operation returns a seccomp_notif
structure containing information about the system call that is
being attempted by the target process.

6. The seccomp_notif structure returned by the SEC‐
COMP_IOCTL_NOTIF_RECV operation includes the same information
(a seccomp_data structure) that was passed to the seccomp fil‐
ter. This information allows the supervisor to discover the
system call number and the arguments for the target process's
system call. In addition, the notification event contains the
PID of the target process.

The information in the notification can be used to discover
the values of pointer arguments for the target process's sys‐
tem call. (This is something that can't be done from within a
seccomp filter.) To do this (and assuming it has suitable
permissions), the supervisor opens the corresponding
/proc/[pid]/mem file, seeks to the memory location that corre‐
sponds to one of the pointer arguments whose value is supplied
in the notification event, and reads bytes from that location.
(The supervisor must be careful to avoid a race condition that
can occur when doing this; see the description of the SEC‐
COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.) In addi‐
tion, the supervisor can access other system information that
is visible in user space but which is not accessible from a
seccomp filter.

┌─────────────────────────────────────────────────────┐
│FIXME │
├─────────────────────────────────────────────────────┤
│Suppose we are reading a pathname from /proc/PID/mem │
│for a system call such as mkdir(). The pathname can │
│be an arbitrary length. How do we know how much (how │
│many pages) to read from /proc/PID/mem? │
└─────────────────────────────────────────────────────┘

7. Having obtained information as per the previous step, the
supervisor may then choose to perform an action in response to
the target process's system call (which, as noted above, is
not executed when the seccomp filter returns the SEC‐
COMP_RET_USER_NOTIF action value).

One example use case here relates to containers. The target
process may be located inside a container where it does not
have sufficient capabilities to mount a filesystem in the con‐
tainer's mount namespace. However, the supervisor may be a
more privileged process that does have sufficient capabilities
to perform the mount operation.

8. The supervisor then sends a response to the notification. The
information in this response is used by the kernel to con‐
struct a return value for the target process's system call and
provide a value that will be assigned to the errno variable of
the target process.

The response is sent using the SECCOMP_IOCTL_NOTIF_SEND
ioctl(2) operation, which is used to transmit a sec‐
comp_notif_resp structure to the kernel. This structure
includes a cookie value that the supervisor obtained in the
seccomp_notif structure returned by the SEC‐
COMP_IOCTL_NOTIF_RECV operation. This cookie value allows the
kernel to associate the response with the target process.

9. Once the response has been sent, the system call in the
target process unblocks, returning the information that was
provided by the supervisor in the notification response.

As a variation on the last two steps, the supervisor can send a
response that tells the kernel that it should execute the target
process's system call; see the discussion of SEC‐
COMP_USER_NOTIF_FLAG_CONTINUE, below.
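
The FIXME under step 6 asks how much to read from /proc/[pid]/mem when
fetching a pathname. One possible approach, offered here as a sketch
rather than as something the page prescribes, is to read in chunks that
never cross a page boundary and to stop at the first null byte. The
helper name readTargetString() and its return convention are
hypothetical, and the sketch assumes that pread(2) behaves on an open
/proc/[pid]/mem file descriptor as it does on a regular file.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy the null-terminated string that starts at
   offset 'addr' in the file referred to by 'memFd' (for example, an
   open /proc/PID/mem) into 'buf', which holds 'len' bytes. No single
   read crosses a page boundary, so an unmapped page that follows the
   string cannot cause the chunk containing the string to fail.
   Returns the string length, or -1 on error, on EOF, or if no
   terminator is found within 'len' bytes. */

static ssize_t
readTargetString(int memFd, off_t addr, char *buf, size_t len)
{
    long pageSize = sysconf(_SC_PAGESIZE);
    size_t used = 0;

    if (pageSize <= 0)
        return -1;

    while (used < len) {
        /* Bytes remaining between the current position and page end */
        size_t offInPage = (size_t) ((addr + (off_t) used) % pageSize);
        size_t chunk = (size_t) pageSize - offInPage;
        if (chunk > len - used)
            chunk = len - used;

        ssize_t nr = pread(memFd, buf + used, chunk, addr + (off_t) used);
        if (nr == -1 && errno == EINTR)
            continue;               /* Interrupted; retry this chunk */
        if (nr <= 0)
            return -1;              /* Error, or EOF (target is gone) */

        char *nul = memchr(buf + used, '\0', (size_t) nr);
        if (nul != NULL)
            return nul - buf;       /* Found the terminator */

        used += (size_t) nr;
    }

    errno = ENAMETOOLONG;           /* No terminator within 'len' bytes */
    return -1;
}
```

A supervisor might call this as readTargetString(procMemFd,
req->data.args[0], path, sizeof(path)) once the
SECCOMP_IOCTL_NOTIF_ID_VALID check has succeeded. Whether a PATH_MAX
limit is appropriate for a given system call is left to the caller.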

ioctl(2) operations
The following ioctl(2) operations are provided to support seccomp
user-space notification. For each of these operations, the first
(file descriptor) argument of ioctl(2) is the listening file
descriptor returned by a call to seccomp(2) with the SECCOMP_FIL‐
TER_FLAG_NEW_LISTENER flag.

SECCOMP_IOCTL_NOTIF_RECV
This operation is used to obtain a user-space notification
event. If no such event is currently pending, the opera‐
tion blocks until an event occurs. The third ioctl(2)
argument is a pointer to a structure of the following form
which contains information about the event. This struc‐
ture must be zeroed out before the call.

struct seccomp_notif {
__u64 id; /* Cookie */
__u32 pid; /* PID of target process */
__u32 flags; /* Currently unused (0) */
struct seccomp_data data; /* See seccomp(2) */
};

The fields in this structure are as follows:

id This is a cookie for the notification. Each such
cookie is guaranteed to be unique for the corre‐
sponding seccomp filter. In other words, this
cookie is unique for each notification event from
the target process. The cookie value has the fol‐
lowing uses:

· It can be used with the SEC‐
COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation to
verify that the target process is still alive.

· When returning a notification response to the
kernel, the supervisor must include the cookie
value in the seccomp_notif_resp structure that is
specified as the argument of the SEC‐
COMP_IOCTL_NOTIF_SEND operation.

pid This is the PID of the target process that trig‐
gered the notification event.

┌─────────────────────────────────────────────────────┐
│FIXME │
├─────────────────────────────────────────────────────┤
│This is a thread ID, rather than a PID, right? │
└─────────────────────────────────────────────────────┘

flags This is a bit mask of flags providing further
information on the event. In the current implemen‐
tation, this field is always zero.

data This is a seccomp_data structure containing infor‐
mation about the system call that triggered the
notification. This is the same structure that is
passed to the seccomp filter. See seccomp(2) for
details of this structure.
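
As an illustration of how these fields might be consumed, the following
sketch formats a received notification for logging. The function name
describeNotification() and the output format are hypothetical. Only the
first system call argument is printed, on the assumption that this is
the one of interest (as it is for mkdir(2)).

```c
#include <stdio.h>
#include <linux/seccomp.h>

/* Hypothetical helper: write a one-line summary of the notification
   '*req' into 'buf' (of 'len' bytes). Returns the value returned by
   snprintf(). */

static int
describeNotification(const struct seccomp_notif *req, char *buf,
                     size_t len)
{
    return snprintf(buf, len, "id=0x%llx pid=%u nr=%d arg0=0x%llx",
                    (unsigned long long) req->id,
                    (unsigned int) req->pid,
                    req->data.nr,
                    (unsigned long long) req->data.args[0]);
}
```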

On success, this operation returns 0; on failure, -1 is
returned, and errno is set to indicate the cause of the
error. This operation can fail with the following errors:

EINVAL (since Linux 5.5)
The seccomp_notif structure that was passed to the
call contained nonzero fields.

ENOENT The target process was killed by a signal as the
notification information was being generated.

┌─────────────────────────────────────────────────────┐
│FIXME │
├─────────────────────────────────────────────────────┤
│From my experiments, it appears that if a SEC‐ │
│COMP_IOCTL_NOTIF_RECV is done after the target │
│process terminates, then the ioctl() simply blocks │
│(rather than returning an error to indicate that the │
│target process no longer exists). │
│ │
│I found that surprising, and it required some con‐ │
│tortions in the example program. It was not possi‐ │
│ble to code my SIGCHLD handler (which reaps the zom‐ │
│bie when the worker/target process terminates) to │
│simply set a flag checked in the main handleNotifi‐ │
│cations() loop, since this created an unavoidable │
│race where the child might terminate just after I │
│had checked the flag, but before I blocked (for‐ │
│ever!) in the SECCOMP_IOCTL_NOTIF_RECV operation. │
│Instead, I had to code the signal handler to simply │
│call _exit(2) in order to terminate the parent │
│process (the supervisor). │
│ │
│Is this expected behavior? It seems to me rather │
│desirable that SECCOMP_IOCTL_NOTIF_RECV should give │
│an error if the target process has terminated. │
└─────────────────────────────────────────────────────┘
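
The requirement to zero the structure before each call could be wrapped
up as in the following sketch. The name recvNotification() is
hypothetical, and retrying when the call fails with EINTR is a design
choice made for this example, not something the page requires.

```c
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

/* Hypothetical wrapper: fetch the next notification into 'req', a
   buffer of 'reqSize' bytes (ideally the size reported by
   SECCOMP_GET_NOTIF_SIZES). Returns 0 on success, or -1 with errno
   set on failure. */

static int
recvNotification(int notifyFd, struct seccomp_notif *req, size_t reqSize)
{
    for (;;) {
        /* The structure must be zeroed before each call */
        memset(req, 0, reqSize);

        if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_RECV, req) == 0)
            return 0;

        if (errno != EINTR)
            return -1;
        /* Interrupted by a signal handler; try again */
    }
}
```

A caller would probably want to treat ENOENT (the target was killed
while the notification was being generated) as a reason to fetch the
next notification, not as a fatal error.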

SECCOMP_IOCTL_NOTIF_ID_VALID
This operation can be used to check that a notification ID
returned by an earlier SECCOMP_IOCTL_NOTIF_RECV operation
is still valid (i.e., that the target process still
exists).

The third ioctl(2) argument is a pointer to the cookie
(id) returned by the SECCOMP_IOCTL_NOTIF_RECV operation.

This operation is necessary to avoid race conditions that
can occur when the pid returned by the SEC‐
COMP_IOCTL_NOTIF_RECV operation terminates, and that
process ID is reused by another process. An example of
this kind of race is the following:

1. A notification is generated on the listening file
descriptor. The returned seccomp_notif contains the
PID of the target process.

2. The target process terminates.

3. Another process is created on the system that by chance
reuses the PID that was freed when the target process
terminates.

4. The supervisor open(2)s the /proc/[pid]/mem file for
the PID obtained in step 1, with the intention of (say)
inspecting the memory locations that contain the arguments
of the system call that triggered the notification in
step 1.

In the above scenario, the risk is that the supervisor may
try to access the memory of a process other than the tar‐
get. This race can be avoided by following the call to
open with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to ver‐
ify that the process that generated the notification is
still alive. (Note that if the target process subse‐
quently terminates, its PID won't be reused because there
remains an open reference to the /proc/[pid]/mem file; in
this case, a subsequent read(2) from the file will return
0, indicating end of file.)

On success (i.e., the notification ID is still valid),
this operation returns 0. On failure (i.e., the notification
ID is no longer valid), -1 is returned, and errno is set to
ENOENT.

SECCOMP_IOCTL_NOTIF_SEND
This operation is used to send a notification response
back to the kernel. The third ioctl(2) argument is a
pointer to a structure of the following form:

struct seccomp_notif_resp {
__u64 id; /* Cookie value */
__s64 val; /* Success return value */
__s32 error; /* 0 (success) or negative
error number */
__u32 flags; /* See below */
};

The fields of this structure are as follows:

id This is the cookie value that was obtained using
the SECCOMP_IOCTL_NOTIF_RECV operation. This
cookie value allows the kernel to correctly asso‐
ciate this response with the system call that trig‐
gered the user-space notification.

val This is the value that will be used for a spoofed
success return for the target process's system
call; see below.

error This is the value that will be used as the error
number (errno) for a spoofed error return for the
target process's system call; see below.

flags This is a bit mask that includes zero or more of
the following flags:

SECCOMP_USER_NOTIF_FLAG_CONTINUE (since Linux 5.5)
Tell the kernel to execute the target
process's system call.

Two kinds of response are possible:

· A response to the kernel telling it to execute the tar‐
get process's system call. In this case, the flags
field includes SECCOMP_USER_NOTIF_FLAG_CONTINUE and the
error and val fields must be zero.

This kind of response can be useful in cases where the
supervisor needs to do deeper analysis of the target's
system call than is possible from a seccomp filter
(e.g., examining the values of pointer arguments), and,
having verified that the system call is acceptable, the
supervisor wants to allow it to proceed.

· A spoofed return value for the target process's system
call. In this case, the kernel does not execute the
target process's system call, instead causing the system
call to return a spoofed value as specified by fields of
the seccomp_notif_resp structure. The supervisor should
set the fields of this structure as follows:

+ flags does not contain SECCOMP_USER_NOTIF_FLAG_CON‐
TINUE.

+ error is set either to 0 for a spoofed "success"
return or to a negative error number for a spoofed
"failure" return. In the former case, the kernel
causes the target process's system call to return the
value specified in the val field. In the latter case,
the kernel causes the target process's system call to
return -1, and errno is assigned the negated error
value.

+ val is set to a value that will be used as the return
value for a spoofed "success" return for the target
process's system call. The value in this field is
ignored if the error field contains a nonzero value.

On success, this operation returns 0; on failure, -1 is
returned, and errno is set to indicate the cause of the
error. This operation can fail with the following errors:

EINPROGRESS
A response to this notification has already been
sent.

EINVAL An invalid value was specified in the flags field.

EINVAL The flags field contained SEC‐
COMP_USER_NOTIF_FLAG_CONTINUE, and the error or val
field was not zero.

ENOENT The blocked system call in the target process has
been interrupted by a signal handler.
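
The three ways of filling in the response that are described above can
be sketched as follows. The helper names are hypothetical. Each helper
zeroes only the fields known to this version of the structure, so a
buffer sized using SECCOMP_GET_NOTIF_SIZES may need additional clearing
if the kernel's structure is larger.

```c
#include <errno.h>
#include <string.h>
#include <linux/seccomp.h>

/* Hypothetical helper: spoof a "success" return of 'val' */

static void
respSpoofSuccess(struct seccomp_notif_resp *resp, __u64 id, __s64 val)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->val = val;            /* 'error' stays 0, so 'val' is used */
}

/* Hypothetical helper: spoof a "failure" return; 'err' is a positive
   errno value such as EOPNOTSUPP */

static void
respSpoofError(struct seccomp_notif_resp *resp, __u64 id, int err)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->error = -err;         /* Kernel expects a negative number */
}

/* Hypothetical helper: ask the kernel to execute the system call */

static void
respContinue(struct seccomp_notif_resp *resp, __u64 id)
{
    memset(resp, 0, sizeof(*resp));     /* 'val' and 'error' must be 0 */
    resp->id = id;
    resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}
```

After one of these calls, the supervisor would pass the structure to
ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp) and check for the errors
listed above.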

NOTES
The file descriptor returned when seccomp(2) is employed with the
SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using
poll(2), epoll(7), and select(2). When a notification is pend‐
ing, these interfaces indicate that the file descriptor is read‐
able.

┌─────────────────────────────────────────────────────┐
│FIXME │
├─────────────────────────────────────────────────────┤
│Interestingly, after the event had been received, │
│the file descriptor indicates as writable (verified │
│from the source code and by experiment). How is this │
│useful? │
└─────────────────────────────────────────────────────┘
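
Monitoring the file descriptor with poll(2) might be sketched as below.
The function name waitReadable() is hypothetical and the code works on
any file descriptor. Whether a timeout-based loop of this kind is a
good way for a supervisor to notice that the target has gone away is an
assumption that this page does not confirm.

```c
#include <errno.h>
#include <poll.h>

/* Hypothetical helper: wait up to 'timeoutMs' milliseconds for 'fd' to
   become readable. Returns 1 if readable, 0 on timeout, or -1 on
   error (including the case where poll() reports only an error or
   hangup condition). */

static int
waitReadable(int fd, int timeoutMs)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    for (;;) {
        int ready = poll(&pfd, 1, timeoutMs);

        if (ready == -1 && errno == EINTR)
            continue;       /* Interrupted; restart with full timeout */
        if (ready <= 0)
            return ready;   /* 0: timed out; -1: error */

        return (pfd.revents & POLLIN) ? 1 : -1;
    }
}
```

A supervisor could call waitReadable(notifyFd, timeoutMs) before each
SECCOMP_IOCTL_NOTIF_RECV operation and use a timeout as an opportunity
to check other state, such as a flag set by a signal handler.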

EXAMPLES
The (somewhat contrived) program shown below demonstrates the use
of the interfaces described in this page. The program creates a
child process that serves as the "target" process. The child
process installs a seccomp filter that returns the SEC‐
COMP_RET_USER_NOTIF action value if a call is made to mkdir(2).
The child process then calls mkdir(2) once for each of the sup‐
plied command-line arguments, and reports the result returned by
the call. After processing all arguments, the child process ter‐
minates.

The parent process acts as the supervisor, listening for the
notifications that are generated when the target process calls
mkdir(2). When such a notification occurs, the supervisor exam‐
ines the memory of the target process (using /proc/[pid]/mem) to
discover the pathname argument that was supplied to the mkdir(2)
call, and performs one of the following actions:

· If the pathname begins with the prefix "/tmp/", then the super‐
visor attempts to create the specified directory, and then
spoofs a return for the target process based on the return
value of the supervisor's mkdir(2) call. In the event that
that call succeeds, the spoofed success return value is the
length of the pathname.

· If the pathname begins with "./" (i.e., it is a relative path‐
name), the supervisor sends a SECCOMP_USER_NOTIF_FLAG_CONTINUE
response to the kernel to say that kernel should execute the
target process's mkdir(2) call.

· If the pathname begins with some other prefix, the supervisor
spoofs an error return for the target process, so that the tar‐
get process's mkdir(2) call appears to fail with the error EOP‐
NOTSUPP ("Operation not supported"). Additionally, if the
specified pathname is exactly "/bye", then the supervisor ter‐
minates.

This program can be used to demonstrate various aspects of the
behavior of the seccomp user-space notification mechanism. To
aid such demonstrations, the program logs various messages
to show the operation of the target process (lines prefixed "T:")
and the supervisor (indented lines prefixed "S:").

In the following example, the target attempts to create the
directory /tmp/x. Upon receiving the notification, the supervi‐
sor creates the directory on the target's behalf, and spoofs a
success return to be received by the target process's mkdir(2)
call.

$ ./seccomp_unotify /tmp/x
T: PID = 23168

T: about to mkdir("/tmp/x")
S: got notification (ID 0x17445c4a0f4e0e3c) for PID 23168
S: executing: mkdir("/tmp/x", 0700)
S: success! spoofed return = 6
S: sending response (flags = 0; val = 6; error = 0)
T: SUCCESS: mkdir(2) returned 6

T: terminating
S: target has terminated; bye

In the above output, note that the spoofed return value seen by
the target process is 6 (the length of the pathname /tmp/x),
whereas a normal mkdir(2) call returns 0 on success.

In the next example, the target attempts to create a directory
using the relative pathname ./sub. Since this pathname starts
with "./", the supervisor sends a SECCOMP_USER_NOTIF_FLAG_CON‐
TINUE response to the kernel, and the kernel then (successfully)
executes the target process's mkdir(2) call.

$ ./seccomp_unotify ./sub
T: PID = 23204

T: about to mkdir("./sub")
S: got notification (ID 0xddb16abe25b4c12) for PID 23204
S: target can execute system call
S: sending response (flags = 0x1; val = 0; error = 0)
T: SUCCESS: mkdir(2) returned 0

T: terminating
S: target has terminated; bye

If the target process attempts to create a directory with a pathname
that begins with neither "./" nor the prefix "/tmp/", then the
supervisor spoofs an error return (EOPNOTSUPP, "Operation not
supported") for the target's mkdir(2) call
(which is not executed):

$ ./seccomp_unotify /xxx
T: PID = 23178

T: about to mkdir("/xxx")
S: got notification (ID 0xe7dc095d1c524e80) for PID 23178
S: spoofing error response (Operation not supported)
S: sending response (flags = 0; val = 0; error = -95)
T: ERROR: mkdir(2): Operation not supported

T: terminating
S: target has terminated; bye

In the next example, the target process attempts to create a
directory with the pathname /tmp/nosuchdir/b. Upon receiving the
notification, the supervisor attempts to create that directory,
but the mkdir(2) call fails because the directory /tmp/nosuchdir
does not exist. Consequently, the supervisor spoofs an error
return that passes the error that it received back to the target
process's mkdir(2) call.

$ ./seccomp_unotify /tmp/nosuchdir/b
T: PID = 23199

T: about to mkdir("/tmp/nosuchdir/b")
S: got notification (ID 0x8744454293506046) for PID 23199
S: executing: mkdir("/tmp/nosuchdir/b", 0700)
S: failure! (errno = 2; No such file or directory)
S: sending response (flags = 0; val = 0; error = -2)
T: ERROR: mkdir(2): No such file or directory

T: terminating
S: target has terminated; bye

If the supervisor receives a notification and sees that the argu‐
ment of the target's mkdir(2) is the string "/bye", then (as well
as spoofing an EOPNOTSUPP error), the supervisor terminates. If
the target process subsequently executes another mkdir(2) that
triggers its seccomp filter to return the SECCOMP_RET_USER_NOTIF
action value, then the kernel causes the target process's system
call to fail with the error ENOSYS ("Function not implemented").
This is demonstrated by the following example:

$ ./seccomp_unotify /bye /tmp/y
T: PID = 23185

T: about to mkdir("/bye")
S: got notification (ID 0xa81236b1d2f7b0f4) for PID 23185
S: spoofing error response (Operation not supported)
S: sending response (flags = 0; val = 0; error = -95)
S: terminating **********
T: ERROR: mkdir(2): Operation not supported

T: about to mkdir("/tmp/y")
T: ERROR: mkdir(2): Function not implemented

T: terminating

Program source
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/prctl.h>
#include <fcntl.h>
#include <limits.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>
#include <linux/audit.h>
#include <sys/syscall.h>
#include <sys/stat.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

#define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \
} while (0)

/* Send the file descriptor 'fd' over the connected UNIX domain socket
'sockfd'. Returns 0 on success, or -1 on error. */

static int
sendfd(int sockfd, int fd)
{
struct msghdr msgh;
struct iovec iov;
int data;
struct cmsghdr *cmsgp;

/* Allocate a char array of suitable size to hold the ancillary data.
However, since this buffer is in reality a 'struct cmsghdr', use a
union to ensure that it is suitably aligned. */
union {
char buf[CMSG_SPACE(sizeof(int))];
/* Space large enough to hold an 'int' */
struct cmsghdr align;
} controlMsg;

/* The 'msg_name' field can be used to specify the address of the
destination socket when sending a datagram. However, we do not
need to use this field because 'sockfd' is a connected socket. */

msgh.msg_name = NULL;
msgh.msg_namelen = 0;

/* On Linux, we must transmit at least one byte of real data in
order to send ancillary data. We transmit an arbitrary integer
whose value is ignored by recvfd(). */

msgh.msg_iov = &iov;
msgh.msg_iovlen = 1;
iov.iov_base = &data;
iov.iov_len = sizeof(int);
data = 12345;

/* Set 'msghdr' fields that describe ancillary data */

msgh.msg_control = controlMsg.buf;
msgh.msg_controllen = sizeof(controlMsg.buf);

/* Set up ancillary data describing file descriptor to send */

cmsgp = CMSG_FIRSTHDR(&msgh);
cmsgp->cmsg_level = SOL_SOCKET;
cmsgp->cmsg_type = SCM_RIGHTS;
cmsgp->cmsg_len = CMSG_LEN(sizeof(int));
memcpy(CMSG_DATA(cmsgp), &fd, sizeof(int));

/* Send real plus ancillary data */

if (sendmsg(sockfd, &msgh, 0) == -1)
return -1;

return 0;
}

/* Receive a file descriptor on a connected UNIX domain socket. Returns
the received file descriptor on success, or -1 on error. */

static int
recvfd(int sockfd)
{
struct msghdr msgh;
struct iovec iov;
int data, fd;
ssize_t nr;

/* Allocate a char buffer for the ancillary data. See the comments
in sendfd() */
union {
char buf[CMSG_SPACE(sizeof(int))];
struct cmsghdr align;
} controlMsg;
struct cmsghdr *cmsgp;

/* The 'msg_name' field can be used to obtain the address of the
sending socket. However, we do not need this information. */

msgh.msg_name = NULL;
msgh.msg_namelen = 0;

/* Specify buffer for receiving real data */

msgh.msg_iov = &iov;
msgh.msg_iovlen = 1;
iov.iov_base = &data; /* Real data is an 'int' */
iov.iov_len = sizeof(int);

/* Set 'msghdr' fields that describe ancillary data */

msgh.msg_control = controlMsg.buf;
msgh.msg_controllen = sizeof(controlMsg.buf);

/* Receive real plus ancillary data; real data is ignored */

nr = recvmsg(sockfd, &msgh, 0);
if (nr == -1)
return -1;

cmsgp = CMSG_FIRSTHDR(&msgh);

/* Check the validity of the 'cmsghdr' */

if (cmsgp == NULL ||
cmsgp->cmsg_len != CMSG_LEN(sizeof(int)) ||
cmsgp->cmsg_level != SOL_SOCKET ||
cmsgp->cmsg_type != SCM_RIGHTS) {
errno = EINVAL;
return -1;
}

/* Return the received file descriptor to our caller */

memcpy(&fd, CMSG_DATA(cmsgp), sizeof(int));
return fd;
}

static void
sigchldHandler(int sig)
{
char *msg = "\tS: target has terminated; bye\n";

write(STDOUT_FILENO, msg, strlen(msg));
_exit(EXIT_SUCCESS);
}

static int
seccomp(unsigned int operation, unsigned int flags, void *args)
{
return syscall(__NR_seccomp, operation, flags, args);
}

/* The following is the x86-64-specific BPF boilerplate code for checking
that the BPF program is running on the right architecture + ABI. At
completion of these instructions, the accumulator contains the system
call number. */

/* For the x32 ABI, all system call numbers have bit 30 set */

#define X32_SYSCALL_BIT 0x40000000

#define X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR \
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
(offsetof(struct seccomp_data, arch))), \
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 2), \
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
(offsetof(struct seccomp_data, nr))), \
BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, X32_SYSCALL_BIT, 0, 1), \
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS)

/* installNotifyFilter() installs a seccomp filter that generates
user-space notifications (SECCOMP_RET_USER_NOTIF) when the process
calls mkdir(2); the filter allows all other system calls.

The function return value is a file descriptor from which the
user-space notifications can be fetched. */

static int
installNotifyFilter(void)
{
struct sock_filter filter[] = {
X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR,

/* mkdir() triggers notification to user-space supervisor */

BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mkdir, 0, 1),
BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_USER_NOTIF),

/* Every other system call is allowed */

BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

struct sock_fprog prog = {
.len = sizeof(filter) / sizeof(filter[0]),
.filter = filter,
};

/* Install the filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER flag;
as a result, seccomp() returns a notification file descriptor. */

int notifyFd = seccomp(SECCOMP_SET_MODE_FILTER,
SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
if (notifyFd == -1)
errExit("seccomp-install-notify-filter");

return notifyFd;
}

/* Close a pair of sockets created by socketpair() */

static void
closeSocketPair(int sockPair[2])
{
if (close(sockPair[0]) == -1)
errExit("closeSocketPair-close-0");
if (close(sockPair[1]) == -1)
errExit("closeSocketPair-close-1");
}

/* Implementation of the target process; create a child process that:

(1) installs a seccomp filter with the
SECCOMP_FILTER_FLAG_NEW_LISTENER flag;
(2) writes the seccomp notification file descriptor returned from
the previous step onto the UNIX domain socket, 'sockPair[0]';
(3) calls mkdir(2) for each element of 'argv'.

The function return value in the parent is the PID of the child
process; the child does not return from this function. */

static pid_t
targetProcess(int sockPair[2], char *argv[])
{
pid_t targetPid = fork();
if (targetPid == -1)
errExit("fork");

if (targetPid > 0) /* In parent, return PID of child */
return targetPid;

/* Child falls through to here */

printf("T: PID = %ld\n", (long) getpid());

/* Install seccomp filter(s) */

if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
errExit("prctl");

int notifyFd = installNotifyFilter();

/* Pass the notification file descriptor to the supervisor process over
a UNIX domain socket */

if (sendfd(sockPair[0], notifyFd) == -1)
errExit("sendfd");

/* Notification and socket FDs are no longer needed in target */

if (close(notifyFd) == -1)
errExit("close-target-notify-fd");

closeSocketPair(sockPair);

/* Perform a mkdir() call for each of the command-line arguments */

for (char **ap = argv; *ap != NULL; ap++) {
printf("\nT: about to mkdir(\"%s\")\n", *ap);

int s = mkdir(*ap, 0700);
if (s == -1)
perror("T: ERROR: mkdir(2)");
else
printf("T: SUCCESS: mkdir(2) returned %d\n", s);
}

printf("\nT: terminating\n");
exit(EXIT_SUCCESS);
}

/* Check that the notification ID provided by a SECCOMP_IOCTL_NOTIF_RECV
operation is still valid. It will no longer be valid if the process
has terminated. This operation can be used when accessing /proc/PID
files in the target process in order to avoid TOCTOU race conditions
where the PID that is returned by SECCOMP_IOCTL_NOTIF_RECV terminates
and is reused by another process. */

static void
checkNotificationIdIsValid(int notifyFd, uint64_t id)
{
if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1) {
fprintf(stderr, "\tS: notification ID check: "
"target has terminated!!!\n");

exit(EXIT_FAILURE);
}
}

/* Access the memory of the target process in order to discover the
pathname that was given to mkdir() */

static void
getTargetPathname(struct seccomp_notif *req, int notifyFd,
char *path, size_t len)
{
char procMemPath[PATH_MAX];
snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", req->pid);

int procMemFd = open(procMemPath, O_RDONLY);
if (procMemFd == -1)
errExit("Supervisor: open");

/* Check that the process whose info we are accessing is still alive.
If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
in checkNotificationIdIsValid()) succeeds, we know that the
/proc/PID/mem file descriptor that we opened corresponds to the
process for which we received a notification. If that process
subsequently terminates, then read() on that file descriptor
will return 0 (EOF). */

checkNotificationIdIsValid(notifyFd, req->id);

/* Seek to the location containing the pathname argument (i.e., the
first argument) of the mkdir(2) call and read that pathname */

if (lseek(procMemFd, req->data.args[0], SEEK_SET) == -1)
errExit("Supervisor: lseek");

ssize_t s = read(procMemFd, path, len); /* 'len' is the size of 'path' */
if (s == -1)
errExit("read");

if (s == 0) {
fprintf(stderr, "\tS: read() of /proc/PID/mem "
"returned 0 (EOF)\n");
exit(EXIT_FAILURE);
}

if (close(procMemFd) == -1)
errExit("close-/proc/PID/mem");
}

/* Handle notifications that arrive via the SECCOMP_RET_USER_NOTIF file
descriptor, 'notifyFd'. */

static void
handleNotifications(int notifyFd)
{
struct seccomp_notif_sizes sizes;
char path[PATH_MAX];
/* For simplicity, we assume that the pathname given to mkdir()
is no more than PATH_MAX bytes; but this might not be true. */

/* Discover the sizes of the structures that are used to receive
notifications and send notification responses, and allocate
buffers of those sizes. */

if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes) == -1)
errExit("\tS: seccomp-SECCOMP_GET_NOTIF_SIZES");

struct seccomp_notif *req = malloc(sizes.seccomp_notif);
if (req == NULL)
errExit("\tS: malloc");

struct seccomp_notif_resp *resp = malloc(sizes.seccomp_notif_resp);
if (resp == NULL)
errExit("\tS: malloc");

/* Loop handling notifications */

for (;;) {
/* Wait for next notification, returning info in '*req' */

memset(req, 0, sizes.seccomp_notif);
if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_RECV, req) == -1) {
if (errno == EINTR)
continue;
errExit("Supervisor: ioctl-SECCOMP_IOCTL_NOTIF_RECV");
}

printf("\tS: got notification (ID %#llx) for PID %d\n",
req->id, req->pid);

/* The only system call that can generate a notification event
is mkdir(2). Nevertheless, we check that the notified system
call is indeed mkdir() as a kind of future-proofing of this
code in case the seccomp filter is later modified to
generate notifications for other system calls. */

if (req->data.nr != __NR_mkdir) {
printf("\tS: notification contained unexpected "
"system call number; bye!!!\n");
exit(EXIT_FAILURE);
}

getTargetPathname(req, notifyFd, path, sizeof(path));

/* Prepopulate some fields of the response */

resp->id = req->id; /* Response includes notification ID */
resp->flags = 0;
resp->val = 0;

/* If the directory is in /tmp, then create it on behalf of
the target; if the pathname starts with "./", tell the
kernel to let the target process execute the mkdir();
otherwise, give an error for a directory pathname in
any other location. */

if (strncmp(path, "/tmp/", strlen("/tmp/")) == 0) {
printf("\tS: executing: mkdir(\"%s\", %#llo)\n",
path, req->data.args[1]);

if (mkdir(path, req->data.args[1]) == 0) {
resp->error = 0; /* "Success" */
resp->val = strlen(path); /* Used as return value of
mkdir() in target */
printf("\tS: success! spoofed return = %lld\n",
resp->val);
} else {

/* If mkdir() failed in the supervisor, pass the error
back to the target */

resp->error = -errno;
printf("\tS: failure! (errno = %d; %s)\n", errno,
strerror(errno));
}
} else if (strncmp(path, "./", strlen("./")) == 0) {
resp->error = resp->val = 0;
resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
printf("\tS: target can execute system call\n");
} else {
resp->error = -EOPNOTSUPP;
printf("\tS: spoofing error response (%s)\n",
strerror(-resp->error));
}

/* Send a response to the notification */

printf("\tS: sending response "
"(flags = %#x; val = %lld; error = %d)\n",
resp->flags, resp->val, resp->error);

if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp) == -1) {
if (errno == ENOENT)
printf("\tS: response failed with ENOENT; "
"perhaps target process's syscall was "
"interrupted by signal?\n");
else
perror("ioctl-SECCOMP_IOCTL_NOTIF_SEND");
}

/* If the pathname is just "/bye", then the supervisor
terminates. This allows us to see what happens if the
target process makes further calls to mkdir(2). */

if (strcmp(path, "/bye") == 0) {
printf("\tS: terminating **********\n");
exit(EXIT_FAILURE);
}
}
}

/* Implementation of the supervisor process:

(1) obtains the notification file descriptor from 'sockPair[1]'
(2) handles notifications that arrive on that file descriptor. */

static void
supervisor(int sockPair[2])
{
int notifyFd = recvfd(sockPair[1]);
if (notifyFd == -1)
errExit("recvfd");

closeSocketPair(sockPair); /* We no longer need the socket pair */

handleNotifications(notifyFd);
}

int
main(int argc, char *argv[])
{
int sockPair[2];

setbuf(stdout, NULL);

if (argc < 2) {
fprintf(stderr, "At least one pathname argument is required\n");
exit(EXIT_FAILURE);
}

/* Create a UNIX domain socket that is used to pass the seccomp
notification file descriptor from the target process to the
supervisor process. */

if (socketpair(AF_UNIX, SOCK_STREAM, 0, sockPair) == -1)
errExit("socketpair");

/* Create a child process--the "target"--that installs seccomp
filtering. The target process writes the seccomp notification
file descriptor onto 'sockPair[0]' and then calls mkdir(2) for
each directory in the command-line arguments. */

(void) targetProcess(sockPair, &argv[optind]);

/* Catch SIGCHLD when the target terminates, so that the
supervisor can also terminate. */

struct sigaction sa;
sa.sa_handler = sigchldHandler;
sa.sa_flags = 0;
sigemptyset(&sa.sa_mask);
if (sigaction(SIGCHLD, &sa, NULL) == -1)
errExit("sigaction");

supervisor(sockPair);

exit(EXIT_SUCCESS);
}

SEE ALSO
ioctl(2), seccomp(2)

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2020-09-30 15:15:02

by Tycho Andersen

Subject: Re: For review: seccomp_user_notif(2) manual page

On Wed, Sep 30, 2020 at 09:03:36AM -0600, Tycho Andersen wrote:
> On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> > ┌─────────────────────────────────────────────────────┐
> > │FIXME │
> > ├─────────────────────────────────────────────────────┤
> > │Interestingly, after the event had been received, │
> > │the file descriptor indicates as writable (verified │
> > │from the source code and by experiment). How is this │
> > │useful? │
>
> You're saying it should just do EPOLLOUT and not EPOLLWRNORM? Seems
> reasonable.

If we make this change, I suppose we should also drop EPOLLRDNORM from
things which have not been received yet, since they're not really
readable.

Tycho

2020-09-30 15:17:53

by Tycho Andersen

Subject: Re: For review: seccomp_user_notif(2) manual page

On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> 2. In order that the supervisor process can obtain notifications
> using the listening file descriptor, (a duplicate of) that
> file descriptor must be passed from the target process to the
> supervisor process. One way in which this could be done is by
> passing the file descriptor over a UNIX domain socket connec‐
> tion between the two processes (using the SCM_RIGHTS ancillary
> message type described in unix(7)). Another possibility is
> that the supervisor might inherit the file descriptor via
> fork(2).

It is technically possible to inherit the fd via fork, but is it
really that useful? The child process wouldn't be able to actually do
the syscall in question, since it would have the same filter.

> The information in the notification can be used to discover
> the values of pointer arguments for the target process's sys‐
> tem call. (This is something that can't be done from within a
> seccomp filter.) To do this (and assuming it has suitable

s/To do this/One way to accomplish this/ perhaps, since there are
others.

> permissions), the supervisor opens the corresponding
> /proc/[pid]/mem file, seeks to the memory location that corre‐
> sponds to one of the pointer arguments whose value is supplied
> in the notification event, and reads bytes from that location.
> (The supervisor must be careful to avoid a race condition that
> can occur when doing this; see the description of the SEC‐
> COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.) In addi‐
> tion, the supervisor can access other system information that
> is visible in user space but which is not accessible from a
> seccomp filter.
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME │
> ├─────────────────────────────────────────────────────┤
> │Suppose we are reading a pathname from /proc/PID/mem │
> │for a system call such as mkdir(). The pathname can │
> │be an arbitrary length. How do we know how much (how │
> │many pages) to read from /proc/PID/mem? │
> └─────────────────────────────────────────────────────┘

PATH_MAX, I suppose.

> ┌─────────────────────────────────────────────────────┐
> │FIXME │
> ├─────────────────────────────────────────────────────┤
> │From my experiments, it appears that if a SEC‐ │
> │COMP_IOCTL_NOTIF_RECV is done after the target │
> │process terminates, then the ioctl() simply blocks │
> │(rather than returning an error to indicate that the │
> │target process no longer exists). │

Yeah, I think Christian wanted to fix this at some point, but it's a
bit sticky to do. Note that if you e.g. rely on fork() above, the
filter is shared with your current process, and this notification
would never be possible. Perhaps another reason to omit that from the
man page.
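
Until the kernel side changes, one way for a supervisor to avoid blocking indefinitely in SECCOMP_IOCTL_NOTIF_RECV is to poll(2) the listening file descriptor with a timeout and issue the ioctl() only once POLLIN is reported. What follows is a minimal sketch; the helper name waitForNotification() and its return convention are hypothetical. Whether POLLHUP is ever reported after the target has gone depends on the kernel version, so the sketch does not rely on it.

```c
#include <errno.h>
#include <poll.h>

/* Hypothetical helper: wait up to 'timeoutMs' milliseconds for
   'notifyFd' to become readable. Returns 1 if readable (POLLIN),
   0 on timeout, and -1 on error or if the descriptor reports
   POLLHUP/POLLERR/POLLNVAL without POLLIN. */

static int
waitForNotification(int notifyFd, int timeoutMs)
{
    struct pollfd pfd = { .fd = notifyFd, .events = POLLIN };

    for (;;) {
        int n = poll(&pfd, 1, timeoutMs);
        if (n == -1) {
            if (errno == EINTR)
                continue;       /* Signal (e.g., SIGCHLD); retry with
                                   the full timeout again */
            return -1;
        }
        if (n == 0)
            return 0;           /* Timed out; nothing pending */
        if (pfd.revents & POLLIN)
            return 1;
        return -1;              /* Hangup or error condition */
    }
}
```

This narrows the window rather than closing it: a notification that was pending when poll() returned may already be gone by the time the supervisor calls SECCOMP_IOCTL_NOTIF_RECV.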

> SECCOMP_IOCTL_NOTIF_ID_VALID
> This operation can be used to check that a notification ID
> returned by an earlier SECCOMP_IOCTL_NOTIF_RECV operation
> is still valid (i.e., that the target process still
> exists).
>
> The third ioctl(2) argument is a pointer to the cookie
> (id) returned by the SECCOMP_IOCTL_NOTIF_RECV operation.
>
> This operation is necessary to avoid race conditions that
> can occur when the pid returned by the SEC‐
> COMP_IOCTL_NOTIF_RECV operation terminates, and that
> process ID is reused by another process. An example of
> this kind of race is the following
>
> 1. A notification is generated on the listening file
> descriptor. The returned seccomp_notif contains the
> PID of the target process.
>
> 2. The target process terminates.
>
> 3. Another process is created on the system that by chance
> reuses the PID that was freed when the target process
> terminates.
>
> 4. The supervisor open(2)s the /proc/[pid]/mem file for
> the PID obtained in step 1, with the intention of (say)
> inspecting the memory locations that contain the
> arguments of the system call that triggered the
> notification in step 1.
>
> In the above scenario, the risk is that the supervisor may
> try to access the memory of a process other than the tar‐
> get. This race can be avoided by following the call to
> open with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to ver‐
> ify that the process that generated the notification is
> still alive. (Note that if the target process subse‐
> quently terminates, its PID won't be reused because there
> remains an open reference to the /proc/[pid]/mem file; in
> this case, a subsequent read(2) from the file will return
> 0, indicating end of file.)
>
> On success (i.e., the notification ID is still valid),
> this operation returns 0 On failure (i.e., the notifica‐
^ need a period?

> ┌─────────────────────────────────────────────────────┐
> │FIXME │
> ├─────────────────────────────────────────────────────┤
> │Interestingly, after the event had been received, │
> │the file descriptor indicates as writable (verified │
> │from the source code and by experiment). How is this │
> │useful? │

You're saying it should just do EPOLLOUT and not EPOLLWRNORM? Seems
reasonable.

>
> EXAMPLES
> The (somewhat contrived) program shown below demonstrates the use

May also be worth mentioning the example in
samples/seccomp/user-trap.c as well.

Tycho

2020-09-30 15:56:40