Subject: For review: seccomp_user_notif(2) manual page

Hi Tycho, Sargun (and all),

I knew it would be a big ask, but below is kind of the manual page
I was hoping you might write [1] for the seccomp user-space notification
mechanism. Since you didn't (and because 5.9 adds various new pieces
such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
that also will need documenting [2]), I did :-). But of course I may
have made mistakes...

I've shown the rendered version of the page below, and would love
to receive review comments from you and others, and acks, etc.

There are a few FIXMEs sprinkled into the page, including one
that relates to what appears to me to be a misdesign (possibly
fixable) in the operation of the SECCOMP_IOCTL_NOTIF_RECV
operation. I would be especially interested in feedback on that
FIXME, and also of course the other FIXMEs.

The page includes an extensive (albeit slightly contrived)
example program, and I would be happy also to receive comments
on that program.

The page source currently sits in a branch (along with the text
that you sent me for the seccomp(2) page) at
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif

Thanks,

Michael

[1] https://lore.kernel.org/linux-man/[email protected]/#t
[2] Sargun, can you prepare something on SECCOMP_ADDFD_FLAG_SETFD
and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page?

=====

NAME
seccomp_user_notif - Seccomp user-space notification mechanism

SYNOPSIS
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <linux/audit.h>

int seccomp(unsigned int operation, unsigned int flags, void *args);

DESCRIPTION
This page describes the user-space notification mechanism pro‐
vided by the Secure Computing (seccomp) facility. As well as the
use of the SECCOMP_FILTER_FLAG_NEW_LISTENER flag, the SEC‐
COMP_RET_USER_NOTIF action value, and the SECCOMP_GET_NOTIF_SIZES
operation described in seccomp(2), this mechanism involves the
use of a number of related ioctl(2) operations (described below).

Overview
In conventional usage of a seccomp filter, the decision about how
to treat a particular system call is made by the filter itself.
The user-space notification mechanism allows the handling of the
system call to instead be handed off to a user-space process.
The advantages of doing this are that, by contrast with the sec‐
comp filter, which is running on a virtual machine inside the
kernel, the user-space process has access to information that is
unavailable to the seccomp filter and it can perform actions that
can't be performed from the seccomp filter.

In the discussion that follows, the process that has installed
the seccomp filter is referred to as the target, and the process
that is notified by the user-space notification mechanism is
referred to as the supervisor. An overview of the steps per‐
formed by these two processes is as follows:

1. The target process establishes a seccomp filter in the usual
manner, but with two differences:

· The seccomp(2) flags argument includes the flag SECCOMP_FIL‐
TER_FLAG_NEW_LISTENER. Consequently, the return value of
the (successful) seccomp(2) call is a new "listening" file
descriptor that can be used to receive notifications.

· In cases where it is appropriate, the seccomp filter returns
the action value SECCOMP_RET_USER_NOTIF. This return value
will trigger a notification event.

2. In order that the supervisor process can obtain notifications
using the listening file descriptor, (a duplicate of) that
file descriptor must be passed from the target process to the
supervisor process. One way in which this could be done is by
passing the file descriptor over a UNIX domain socket connec‐
tion between the two processes (using the SCM_RIGHTS ancillary
message type described in unix(7)). Another possibility is
that the supervisor might inherit the file descriptor via
fork(2).

3. The supervisor process will receive notification events on the
listening file descriptor. These events are returned as
structures of type seccomp_notif. Because this structure and
its size may evolve over kernel versions, the supervisor must
first determine the size of this structure using the sec‐
comp(2) SECCOMP_GET_NOTIF_SIZES operation, which returns a
structure of type seccomp_notif_sizes. The supervisor allo‐
cates a buffer of size seccomp_notif_sizes.seccomp_notif bytes
to receive notification events. In addition, the supervisor
allocates another buffer of size seccomp_notif_sizes.sec‐
comp_notif_resp bytes for the response (a struct sec‐
comp_notif_resp structure) that it will provide to the kernel
(and thus the target process).

4. The target process then performs its workload, which includes
system calls that will be controlled by the seccomp filter.
Whenever one of these system calls causes the filter to return
the SECCOMP_RET_USER_NOTIF action value, the kernel does not
execute the system call; instead, execution of the target
process is temporarily blocked inside the kernel and a notifi‐
cation event is generated on the listening file descriptor.

5. The supervisor process can now repeatedly monitor the listen‐
ing file descriptor for SECCOMP_RET_USER_NOTIF-triggered
events. To do this, the supervisor uses the SEC‐
COMP_IOCTL_NOTIF_RECV ioctl(2) operation to read information
about a notification event; this operation blocks until an
event is available. The operation returns a seccomp_notif
structure containing information about the system call that is
being attempted by the target process.

6. The seccomp_notif structure returned by the SEC‐
COMP_IOCTL_NOTIF_RECV operation includes the same information
(a seccomp_data structure) that was passed to the seccomp fil‐
ter. This information allows the supervisor to discover the
system call number and the arguments for the target process's
system call. In addition, the notification event contains the
PID of the target process.

The information in the notification can be used to discover
the values of pointer arguments for the target process's sys‐
tem call. (This is something that can't be done from within a
seccomp filter.) To do this (and assuming it has suitable
permissions), the supervisor opens the corresponding
/proc/[pid]/mem file, seeks to the memory location that corre‐
sponds to one of the pointer arguments whose value is supplied
in the notification event, and reads bytes from that location.
(The supervisor must be careful to avoid a race condition that
can occur when doing this; see the description of the SEC‐
COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.) In addi‐
tion, the supervisor can access other system information that
is visible in user space but which is not accessible from a
seccomp filter.

┌─────────────────────────────────────────────────────┐
│FIXME │
├─────────────────────────────────────────────────────┤
│Suppose we are reading a pathname from /proc/PID/mem │
│for a system call such as mkdir(). The pathname can │
│be an arbitrary length. How do we know how much (how │
│many pages) to read from /proc/PID/mem? │
└─────────────────────────────────────────────────────┘

7. Having obtained information as per the previous step, the
supervisor may then choose to perform an action in response to
the target process's system call (which, as noted above, is
not executed when the seccomp filter returns the SEC‐
COMP_RET_USER_NOTIF action value).

One example use case here relates to containers. The target
process may be located inside a container where it does not
have sufficient capabilities to mount a filesystem in the con‐
tainer's mount namespace. However, the supervisor may be a
more privileged process that does have sufficient capabilities
to perform the mount operation.

8. The supervisor then sends a response to the notification. The
information in this response is used by the kernel to con‐
struct a return value for the target process's system call and
provide a value that will be assigned to the errno variable of
the target process.

The response is sent using the SECCOMP_IOCTL_NOTIF_SEND
ioctl(2) operation, which is used to transmit a sec‐
comp_notif_resp structure to the kernel. This structure
includes a cookie value that the supervisor obtained in the
seccomp_notif structure returned by the SEC‐
COMP_IOCTL_NOTIF_RECV operation. This cookie value allows the
kernel to associate the response with the target process.

9. Once the response has been sent, the system call in the
target process unblocks, returning the information that was
provided by the supervisor in the notification response.

As a variation on the last two steps, the supervisor can send a
response that tells the kernel that it should execute the target
process's system call; see the discussion of SEC‐
COMP_USER_NOTIF_FLAG_CONTINUE, below.
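
The /proc/[pid]/mem read described in step 6 might be sketched as
follows. This is a minimal illustration, not a definitive answer to
the question of how much to read: the helper names openTargetMem()
and readStringAt() are hypothetical, the caller chooses the maximum
number of bytes (for example PATH_MAX), and the result is accepted
only if a terminating null byte was found in the bytes actually
read. A real supervisor would also perform the
SECCOMP_IOCTL_NOTIF_ID_VALID check after opening the file.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open /proc/PID/mem read-only for the given
   PID. Returns a file descriptor, or -1 on error. */

int
openTargetMem(pid_t pid)
{
    char path[64];

    snprintf(path, sizeof(path), "/proc/%ld/mem", (long) pid);
    return open(path, O_RDONLY | O_CLOEXEC);
}

/* Hypothetical helper: read up to 'len' bytes from 'memFd' starting
   at offset 'addr' and check that they include a terminating null
   byte. Returns the string length on success, or -1 on error, on
   EOF, or if no null byte was found in the bytes that were read. */

ssize_t
readStringAt(int memFd, off_t addr, char *buf, size_t len)
{
    ssize_t nread = pread(memFd, buf, len, addr);
    if (nread <= 0)
        return -1;              /* Error, or EOF */

    /* A short read is acceptable as long as the terminator
       lies within the bytes that were returned */

    size_t slen = strnlen(buf, (size_t) nread);
    if (slen == (size_t) nread)
        return -1;              /* No null byte seen */

    return (ssize_t) slen;
}
```

A caller would pass the pointer argument from the notification
(for mkdir(2), data.args[0]) as 'addr', and treat a return of -1 as
a reason to spoof an error rather than act on a truncated pathname.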

ioctl(2) operations
The following ioctl(2) operations are provided to support seccomp
user-space notification. For each of these operations, the first
(file descriptor) argument of ioctl(2) is the listening file
descriptor returned by a call to seccomp(2) with the SECCOMP_FIL‐
TER_FLAG_NEW_LISTENER flag.

SECCOMP_IOCTL_NOTIF_RECV
This operation is used to obtain a user-space notification
event. If no such event is currently pending, the opera‐
tion blocks until an event occurs. The third ioctl(2)
argument is a pointer to a structure of the following form
which contains information about the event. This struc‐
ture must be zeroed out before the call.

struct seccomp_notif {
__u64 id; /* Cookie */
__u32 pid; /* PID of target process */
__u32 flags; /* Currently unused (0) */
struct seccomp_data data; /* See seccomp(2) */
};

The fields in this structure are as follows:

id This is a cookie for the notification. Each such
cookie is guaranteed to be unique for the corre‐
sponding seccomp filter. In other words, this
cookie is unique for each notification event from
the target process. The cookie value has the fol‐
lowing uses:

· It can be used with the SEC‐
COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation to
verify that the target process is still alive.

· When returning a notification response to the
kernel, the supervisor must include the cookie
value in the seccomp_notif_resp structure that is
specified as the argument of the SEC‐
COMP_IOCTL_NOTIF_SEND operation.

pid This is the PID of the target process that trig‐
gered the notification event.

┌─────────────────────────────────────────────────────┐
│FIXME │
├─────────────────────────────────────────────────────┤
│This is a thread ID, rather than a PID, right? │
└─────────────────────────────────────────────────────┘

flags This is a bit mask of flags providing further
information on the event. In the current implemen‐
tation, this field is always zero.

data This is a seccomp_data structure containing infor‐
mation about the system call that triggered the
notification. This is the same structure that is
passed to the seccomp filter. See seccomp(2) for
details of this structure.

On success, this operation returns 0; on failure, -1 is
returned, and errno is set to indicate the cause of the
error. This operation can fail with the following errors:

EINVAL (since Linux 5.5)
The seccomp_notif structure that was passed to the
call contained nonzero fields.

ENOENT The target process was killed by a signal as the
notification information was being generated.

┌─────────────────────────────────────────────────────┐
│FIXME │
├─────────────────────────────────────────────────────┤
│From my experiments, it appears that if a SEC‐ │
│COMP_IOCTL_NOTIF_RECV is done after the target │
│process terminates, then the ioctl() simply blocks │
│(rather than returning an error to indicate that the │
│target process no longer exists). │
│ │
│I found that surprising, and it required some con‐ │
│tortions in the example program. It was not possi‐ │
│ble to code my SIGCHLD handler (which reaps the zom‐ │
│bie when the worker/target process terminates) to │
│simply set a flag checked in the main handleNotifi‐ │
│cations() loop, since this created an unavoidable │
│race where the child might terminate just after I │
│had checked the flag, but before I blocked (for‐ │
│ever!) in the SECCOMP_IOCTL_NOTIF_RECV operation. │
│Instead, I had to code the signal handler to simply │
│call _exit(2) in order to terminate the parent │
│process (the supervisor). │
│ │
│Is this expected behavior? It seems to me rather │
│desirable that SECCOMP_IOCTL_NOTIF_RECV should give │
│an error if the target process has terminated. │
└─────────────────────────────────────────────────────┘

SECCOMP_IOCTL_NOTIF_ID_VALID
This operation can be used to check that a notification ID
returned by an earlier SECCOMP_IOCTL_NOTIF_RECV operation
is still valid (i.e., that the target process still
exists).

The third ioctl(2) argument is a pointer to the cookie
(id) returned by the SECCOMP_IOCTL_NOTIF_RECV operation.

This operation is necessary to avoid race conditions that
can occur when the pid returned by the SEC‐
COMP_IOCTL_NOTIF_RECV operation terminates, and that
process ID is reused by another process. An example of
this kind of race is the following:

1. A notification is generated on the listening file
descriptor. The returned seccomp_notif contains the
PID of the target process.

2. The target process terminates.

3. Another process is created on the system that by chance
reuses the PID that was freed when the target process
terminates.

4. The supervisor open(2)s the /proc/[pid]/mem file for
the PID obtained in step 1, with the intention of (say)
inspecting the memory locations that contain the arguments
of the system call that triggered the notification in
step 1.

In the above scenario, the risk is that the supervisor may
try to access the memory of a process other than the tar‐
get. This race can be avoided by following the call to
open with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to ver‐
ify that the process that generated the notification is
still alive. (Note that if the target process subse‐
quently terminates, its PID won't be reused because there
remains an open reference to the /proc/[pid]/mem file; in
this case, a subsequent read(2) from the file will return
0, indicating end of file.)

On success (i.e., the notification ID is still valid),
this operation returns 0. On failure (i.e., the notification
ID is no longer valid), -1 is returned, and errno is
set to ENOENT.
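
The open-then-validate ordering described above could be wrapped
up along the following lines. This is only a sketch: the function
name openValidatedMem() is hypothetical, and it assumes kernel
headers recent enough to define SECCOMP_IOCTL_NOTIF_ID_VALID.
The function reports failure to the caller and closes the file
descriptor rather than exiting.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/seccomp.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open /proc/PID/mem for the PID reported in a
   notification and only then confirm that the notification ID is
   still valid. Returns an open file descriptor on success, or -1 if
   the file could not be opened or the ID is no longer valid. */

int
openValidatedMem(int notifyFd, pid_t pid, uint64_t id)
{
    char path[64];

    snprintf(path, sizeof(path), "/proc/%ld/mem", (long) pid);

    int memFd = open(path, O_RDONLY | O_CLOEXEC);
    if (memFd == -1)
        return -1;

    /* The check must come after the open(): if it succeeds, the
       file we hold open refers to the process that triggered the
       notification, not to a later process that reused the PID */

    if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1) {
        close(memFd);
        return -1;
    }

    return memFd;
}
```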

SECCOMP_IOCTL_NOTIF_SEND
This operation is used to send a notification response
back to the kernel. The third ioctl(2) argument is a
pointer to a structure of the following form:

struct seccomp_notif_resp {
__u64 id; /* Cookie value */
__s64 val; /* Success return value */
__s32 error; /* 0 (success) or negative
error number */
__u32 flags; /* See below */
};

The fields of this structure are as follows:

id This is the cookie value that was obtained using
the SECCOMP_IOCTL_NOTIF_RECV operation. This
cookie value allows the kernel to correctly asso‐
ciate this response with the system call that trig‐
gered the user-space notification.

val This is the value that will be used for a spoofed
success return for the target process's system
call; see below.

error This is the value that will be used as the error
number (errno) for a spoofed error return for the
target process's system call; see below.

flags This is a bit mask that includes zero or more of
the following flags:

SECCOMP_USER_NOTIF_FLAG_CONTINUE (since Linux 5.5)
Tell the kernel to execute the target
process's system call.

Two kinds of response are possible:

· A response to the kernel telling it to execute the tar‐
get process's system call. In this case, the flags
field includes SECCOMP_USER_NOTIF_FLAG_CONTINUE and the
error and val fields must be zero.

This kind of response can be useful in cases where the
supervisor needs to do deeper analysis of the target's
system call than is possible from a seccomp filter
(e.g., examining the values of pointer arguments), and,
having verified that the system call is acceptable, the
supervisor wants to allow it to proceed.

· A spoofed return value for the target process's system
call. In this case, the kernel does not execute the
target process's system call, instead causing the system
call to return a spoofed value as specified by fields of
the seccomp_notif_resp structure. The supervisor should
set the fields of this structure as follows:

+ flags does not contain SECCOMP_USER_NOTIF_FLAG_CON‐
TINUE.

+ error is set either to 0 for a spoofed "success"
return or to a negative error number for a spoofed
"failure" return. In the former case, the kernel
causes the target process's system call to return the
value specified in the val field. In the latter case,
the kernel causes the target process's system call to
return -1, and errno is assigned the negated error
value.

+ val is set to a value that will be used as the return
value for a spoofed "success" return for the target
process's system call. The value in this field is
ignored if the error field contains a nonzero value.

On success, this operation returns 0; on failure, -1 is
returned, and errno is set to indicate the cause of the
error. This operation can fail with the following errors:

EINPROGRESS
A response to this notification has already been
sent.

EINVAL An invalid value was specified in the flags field.

EINVAL The flags field contained SEC‐
COMP_USER_NOTIF_FLAG_CONTINUE, and the error or val
field was not zero.

ENOENT The blocked system call in the target process has
been interrupted by a signal handler.
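
The three ways of filling in a seccomp_notif_resp structure
described above can be sketched as small helpers. The function
names are hypothetical (they do not appear elsewhere on this page),
and the sketch assumes kernel headers that define struct
seccomp_notif_resp and SECCOMP_USER_NOTIF_FLAG_CONTINUE. It also
assumes that the buffer passed in is at least
sizeof(struct seccomp_notif_resp) bytes.

```c
#include <errno.h>
#include <linux/seccomp.h>
#include <stdint.h>
#include <string.h>

/* Spoofed "success" return: the target's system call is not
   executed and appears to return 'val' */

void
respSpoofSuccess(struct seccomp_notif_resp *resp, uint64_t id,
                 int64_t val)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->val = val;            /* error stays 0; flags stays 0 */
}

/* Spoofed "failure" return: the target's system call is not
   executed and appears to fail with errno set to 'err'. Note that
   'err' is a positive errno value (e.g., EOPNOTSUPP); the
   structure holds its negation. */

void
respSpoofError(struct seccomp_notif_resp *resp, uint64_t id, int err)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->error = -err;         /* val is ignored in this case */
}

/* Tell the kernel to execute the target's system call; error and
   val must both be zero */

void
respContinue(struct seccomp_notif_resp *resp, uint64_t id)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}
```

After calling one of these helpers, the supervisor would pass the
structure to ioctl(2) with the SECCOMP_IOCTL_NOTIF_SEND operation.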

NOTES
The file descriptor returned when seccomp(2) is employed with the
SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using
poll(2), epoll(7), and select(2). When a notification is pend‐
ing, these interfaces indicate that the file descriptor is read‐
able.

┌─────────────────────────────────────────────────────┐
│FIXME │
├─────────────────────────────────────────────────────┤
│Interestingly, after the event has been received,   │
│the file descriptor indicates as writable (verified │
│from the source code and by experiment). How is this │
│useful? │
└─────────────────────────────────────────────────────┘
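
Monitoring the listening file descriptor with poll(2), as mentioned
at the start of this section, might look like the following sketch.
The function name waitForNotification() is hypothetical. Using a
timeout gives the supervisor a chance to do other checks (for
example, whether the target has terminated) between waits, instead
of blocking indefinitely in SECCOMP_IOCTL_NOTIF_RECV.

```c
#include <poll.h>

/* Hypothetical helper: wait up to 'timeoutMs' milliseconds for the
   listening file descriptor to become readable.

   Returns 1 if a notification is pending (POLLIN), 0 if the timeout
   expired, or -1 if poll() failed (including EINTR, so that the
   caller can react to a signal) or reported only some other event
   such as POLLHUP or POLLERR. */

int
waitForNotification(int notifyFd, int timeoutMs)
{
    struct pollfd pfd = { .fd = notifyFd, .events = POLLIN };

    int n = poll(&pfd, 1, timeoutMs);
    if (n <= 0)
        return n;               /* 0: timeout; -1: error */

    return (pfd.revents & POLLIN) ? 1 : -1;
}
```

When the function returns 1, a subsequent SECCOMP_IOCTL_NOTIF_RECV
operation should normally be able to fetch the pending event.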

EXAMPLES
The (somewhat contrived) program shown below demonstrates the use
of the interfaces described in this page. The program creates a
child process that serves as the "target" process. The child
process installs a seccomp filter that returns the SEC‐
COMP_RET_USER_NOTIF action value if a call is made to mkdir(2).
The child process then calls mkdir(2) once for each of the sup‐
plied command-line arguments, and reports the result returned by
the call. After processing all arguments, the child process ter‐
minates.

The parent process acts as the supervisor, listening for the
notifications that are generated when the target process calls
mkdir(2). When such a notification occurs, the supervisor exam‐
ines the memory of the target process (using /proc/[pid]/mem) to
discover the pathname argument that was supplied to the mkdir(2)
call, and performs one of the following actions:

· If the pathname begins with the prefix "/tmp/", then the super‐
visor attempts to create the specified directory, and then
spoofs a return for the target process based on the return
value of the supervisor's mkdir(2) call. In the event that
that call succeeds, the spoofed success return value is the
length of the pathname.

· If the pathname begins with "./" (i.e., it is a relative path‐
name), the supervisor sends a SECCOMP_USER_NOTIF_FLAG_CONTINUE
response to the kernel to say that the kernel should execute the
target process's mkdir(2) call.

· If the pathname begins with some other prefix, the supervisor
spoofs an error return for the target process, so that the tar‐
get process's mkdir(2) call appears to fail with the error EOP‐
NOTSUPP ("Operation not supported"). Additionally, if the
specified pathname is exactly "/bye", then the supervisor ter‐
minates.
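
The decision rules in the list above can be summarized in a small
classification function. This is an illustrative sketch only; the
enum and the name classifyPath() are hypothetical and need not
match how the example program itself is organized.

```c
#include <string.h>

enum pathAction {
    ACT_CREATE,             /* Supervisor performs the mkdir() */
    ACT_CONTINUE,           /* Kernel executes the target's call */
    ACT_REJECT,             /* Spoof an EOPNOTSUPP error */
    ACT_REJECT_AND_EXIT     /* As above, then supervisor exits */
};

/* Map the pathname given to mkdir() to the supervisor's action */

enum pathAction
classifyPath(const char *path)
{
    if (strncmp(path, "/tmp/", strlen("/tmp/")) == 0)
        return ACT_CREATE;

    if (strncmp(path, "./", strlen("./")) == 0)
        return ACT_CONTINUE;

    if (strcmp(path, "/bye") == 0)
        return ACT_REJECT_AND_EXIT;

    return ACT_REJECT;
}
```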

This program can be used to demonstrate various aspects of the
behavior of the seccomp user-space notification mechanism. To
aid such demonstrations, the program logs various messages
to show the operation of the target process (lines prefixed "T:")
and the supervisor (indented lines prefixed "S:").

In the following example, the target attempts to create the
directory /tmp/x. Upon receiving the notification, the supervi‐
sor creates the directory on the target's behalf, and spoofs a
success return to be received by the target process's mkdir(2)
call.

$ ./seccomp_unotify /tmp/x
T: PID = 23168

T: about to mkdir("/tmp/x")
S: got notification (ID 0x17445c4a0f4e0e3c) for PID 23168
S: executing: mkdir("/tmp/x", 0700)
S: success! spoofed return = 6
S: sending response (flags = 0; val = 6; error = 0)
T: SUCCESS: mkdir(2) returned 6

T: terminating
S: target has terminated; bye

In the above output, note that the spoofed return value seen by
the target process is 6 (the length of the pathname /tmp/x),
whereas a normal mkdir(2) call returns 0 on success.

In the next example, the target attempts to create a directory
using the relative pathname ./sub. Since this pathname starts
with "./", the supervisor sends a SECCOMP_USER_NOTIF_FLAG_CON‐
TINUE response to the kernel, and the kernel then (successfully)
executes the target process's mkdir(2) call.

$ ./seccomp_unotify ./sub
T: PID = 23204

T: about to mkdir("./sub")
S: got notification (ID 0xddb16abe25b4c12) for PID 23204
S: target can execute system call
S: sending response (flags = 0x1; val = 0; error = 0)
T: SUCCESS: mkdir(2) returned 0

T: terminating
S: target has terminated; bye

If the target process attempts to create a directory with a
pathname that doesn't start with "./" and doesn't begin with the
prefix "/tmp/", then the supervisor spoofs an error return
(EOPNOTSUPP, "Operation not supported") for the target's mkdir(2)
call (which is not executed):

$ ./seccomp_unotify /xxx
T: PID = 23178

T: about to mkdir("/xxx")
S: got notification (ID 0xe7dc095d1c524e80) for PID 23178
S: spoofing error response (Operation not supported)
S: sending response (flags = 0; val = 0; error = -95)
T: ERROR: mkdir(2): Operation not supported

T: terminating
S: target has terminated; bye

In the next example, the target process attempts to create a
directory with the pathname /tmp/nosuchdir/b. Upon receiving the
notification, the supervisor attempts to create that directory,
but the mkdir(2) call fails because the directory /tmp/nosuchdir
does not exist. Consequently, the supervisor spoofs an error
return that passes the error that it received back to the target
process's mkdir(2) call.

$ ./seccomp_unotify /tmp/nosuchdir/b
T: PID = 23199

T: about to mkdir("/tmp/nosuchdir/b")
S: got notification (ID 0x8744454293506046) for PID 23199
S: executing: mkdir("/tmp/nosuchdir/b", 0700)
S: failure! (errno = 2; No such file or directory)
S: sending response (flags = 0; val = 0; error = -2)
T: ERROR: mkdir(2): No such file or directory

T: terminating
S: target has terminated; bye

If the supervisor receives a notification and sees that the argu‐
ment of the target's mkdir(2) is the string "/bye", then (as well
as spoofing an EOPNOTSUPP error), the supervisor terminates. If
the target process subsequently executes another mkdir(2) that
triggers its seccomp filter to return the SECCOMP_RET_USER_NOTIF
action value, then the kernel causes the target process's system
call to fail with the error ENOSYS ("Function not implemented").
This is demonstrated by the following example:

$ ./seccomp_unotify /bye /tmp/y
T: PID = 23185

T: about to mkdir("/bye")
S: got notification (ID 0xa81236b1d2f7b0f4) for PID 23185
S: spoofing error response (Operation not supported)
S: sending response (flags = 0; val = 0; error = -95)
S: terminating **********
T: ERROR: mkdir(2): Operation not supported

T: about to mkdir("/tmp/y")
T: ERROR: mkdir(2): Function not implemented

T: terminating

Program source
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/prctl.h>
#include <fcntl.h>
#include <limits.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>
#include <linux/audit.h>
#include <sys/syscall.h>
#include <sys/stat.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

#define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \
} while (0)

/* Send the file descriptor 'fd' over the connected UNIX domain socket
'sockfd'. Returns 0 on success, or -1 on error. */

static int
sendfd(int sockfd, int fd)
{
struct msghdr msgh;
struct iovec iov;
int data;
struct cmsghdr *cmsgp;

/* Allocate a char array of suitable size to hold the ancillary data.
However, since this buffer is in reality a 'struct cmsghdr', use a
union to ensure that it is suitably aligned. */
union {
char buf[CMSG_SPACE(sizeof(int))];
/* Space large enough to hold an 'int' */
struct cmsghdr align;
} controlMsg;

/* The 'msg_name' field can be used to specify the address of the
destination socket when sending a datagram. However, we do not
need to use this field because 'sockfd' is a connected socket. */

msgh.msg_name = NULL;
msgh.msg_namelen = 0;

/* On Linux, we must transmit at least one byte of real data in
order to send ancillary data. We transmit an arbitrary integer
whose value is ignored by recvfd(). */

msgh.msg_iov = &iov;
msgh.msg_iovlen = 1;
iov.iov_base = &data;
iov.iov_len = sizeof(int);
data = 12345;

/* Set 'msghdr' fields that describe ancillary data */

msgh.msg_control = controlMsg.buf;
msgh.msg_controllen = sizeof(controlMsg.buf);

/* Set up ancillary data describing file descriptor to send */

cmsgp = CMSG_FIRSTHDR(&msgh);
cmsgp->cmsg_level = SOL_SOCKET;
cmsgp->cmsg_type = SCM_RIGHTS;
cmsgp->cmsg_len = CMSG_LEN(sizeof(int));
memcpy(CMSG_DATA(cmsgp), &fd, sizeof(int));

/* Send real plus ancillary data */

if (sendmsg(sockfd, &msgh, 0) == -1)
return -1;

return 0;
}

/* Receive a file descriptor on a connected UNIX domain socket. Returns
the received file descriptor on success, or -1 on error. */

static int
recvfd(int sockfd)
{
struct msghdr msgh;
struct iovec iov;
int data, fd;
ssize_t nr;

/* Allocate a char buffer for the ancillary data. See the comments
in sendfd() */
union {
char buf[CMSG_SPACE(sizeof(int))];
struct cmsghdr align;
} controlMsg;
struct cmsghdr *cmsgp;

/* The 'msg_name' field can be used to obtain the address of the
sending socket. However, we do not need this information. */

msgh.msg_name = NULL;
msgh.msg_namelen = 0;

/* Specify buffer for receiving real data */

msgh.msg_iov = &iov;
msgh.msg_iovlen = 1;
iov.iov_base = &data; /* Real data is an 'int' */
iov.iov_len = sizeof(int);

/* Set 'msghdr' fields that describe ancillary data */

msgh.msg_control = controlMsg.buf;
msgh.msg_controllen = sizeof(controlMsg.buf);

/* Receive real plus ancillary data; real data is ignored */

nr = recvmsg(sockfd, &msgh, 0);
if (nr == -1)
return -1;

cmsgp = CMSG_FIRSTHDR(&msgh);

/* Check the validity of the 'cmsghdr' */

if (cmsgp == NULL ||
cmsgp->cmsg_len != CMSG_LEN(sizeof(int)) ||
cmsgp->cmsg_level != SOL_SOCKET ||
cmsgp->cmsg_type != SCM_RIGHTS) {
errno = EINVAL;
return -1;
}

/* Return the received file descriptor to our caller */

memcpy(&fd, CMSG_DATA(cmsgp), sizeof(int));
return fd;
}

static void
sigchldHandler(int sig)
{
char *msg = "\tS: target has terminated; bye\n";

write(STDOUT_FILENO, msg, strlen(msg));
_exit(EXIT_SUCCESS);
}

static int
seccomp(unsigned int operation, unsigned int flags, void *args)
{
return syscall(__NR_seccomp, operation, flags, args);
}

/* The following is the x86-64-specific BPF boilerplate code for checking
that the BPF program is running on the right architecture + ABI. At
completion of these instructions, the accumulator contains the system
call number. */

/* For the x32 ABI, all system call numbers have bit 30 set */

#define X32_SYSCALL_BIT 0x40000000

#define X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR \
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
(offsetof(struct seccomp_data, arch))), \
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 2), \
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
(offsetof(struct seccomp_data, nr))), \
BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, X32_SYSCALL_BIT, 0, 1), \
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS)

/* installNotifyFilter() installs a seccomp filter that generates
user-space notifications (SECCOMP_RET_USER_NOTIF) when the process
calls mkdir(2); the filter allows all other system calls.

The function return value is a file descriptor from which the
user-space notifications can be fetched. */

static int
installNotifyFilter(void)
{
struct sock_filter filter[] = {
X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR,

/* mkdir() triggers notification to user-space supervisor */

BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mkdir, 0, 1),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),

/* Every other system call is allowed */

BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

struct sock_fprog prog = {
.len = sizeof(filter) / sizeof(filter[0]),
.filter = filter,
};

/* Install the filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER flag;
as a result, seccomp() returns a notification file descriptor. */

int notifyFd = seccomp(SECCOMP_SET_MODE_FILTER,
SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
if (notifyFd == -1)
errExit("seccomp-install-notify-filter");

return notifyFd;
}

/* Close a pair of sockets created by socketpair() */

static void
closeSocketPair(int sockPair[2])
{
if (close(sockPair[0]) == -1)
errExit("closeSocketPair-close-0");
if (close(sockPair[1]) == -1)
errExit("closeSocketPair-close-1");
}

/* Implementation of the target process; create a child process that:

(1) installs a seccomp filter with the
SECCOMP_FILTER_FLAG_NEW_LISTENER flag;
(2) writes the seccomp notification file descriptor returned from
the previous step onto the UNIX domain socket, 'sockPair[0]';
(3) calls mkdir(2) for each element of 'argv'.

The function return value in the parent is the PID of the child
process; the child does not return from this function. */

static pid_t
targetProcess(int sockPair[2], char *argv[])
{
pid_t targetPid = fork();
if (targetPid == -1)
errExit("fork");

if (targetPid > 0) /* In parent, return PID of child */
return targetPid;

/* Child falls through to here */

printf("T: PID = %ld\n", (long) getpid());

/* Install seccomp filter(s) */

if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
errExit("prctl");

int notifyFd = installNotifyFilter();

/* Pass the notification file descriptor to the supervisor process
over a UNIX domain socket */

if (sendfd(sockPair[0], notifyFd) == -1)
errExit("sendfd");

/* Notification and socket FDs are no longer needed in target */

if (close(notifyFd) == -1)
errExit("close-target-notify-fd");

closeSocketPair(sockPair);

/* Perform a mkdir() call for each of the command-line arguments */

for (char **ap = argv; *ap != NULL; ap++) {
printf("\nT: about to mkdir(\"%s\")\n", *ap);

int s = mkdir(*ap, 0700);
if (s == -1)
perror("T: ERROR: mkdir(2)");
else
printf("T: SUCCESS: mkdir(2) returned %d\n", s);
}

printf("\nT: terminating\n");
exit(EXIT_SUCCESS);
}

/* Check that the notification ID provided by a SECCOMP_IOCTL_NOTIF_RECV
operation is still valid. It will no longer be valid if the process
has terminated. This operation can be used when accessing /proc/PID
files in the target process in order to avoid TOCTOU race conditions
where the PID that is returned by SECCOMP_IOCTL_NOTIF_RECV terminates
and is reused by another process. */

static void
checkNotificationIdIsValid(int notifyFd, uint64_t id)
{
if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1) {
fprintf(stderr, "\tS: notification ID check: "
"target has terminated!!!\n");

exit(EXIT_FAILURE);
}
}

/* Access the memory of the target process in order to discover the
pathname that was given to mkdir() */

static void
getTargetPathname(struct seccomp_notif *req, int notifyFd,
char *path, size_t len)
{
char procMemPath[PATH_MAX];
snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", req->pid);

int procMemFd = open(procMemPath, O_RDONLY);
if (procMemFd == -1)
errExit("Supervisor: open");

/* Check that the process whose info we are accessing is still alive.
If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
in checkNotificationIdIsValid()) succeeds, we know that the
/proc/PID/mem file descriptor that we opened corresponds to the
process for which we received a notification. If that process
subsequently terminates, then read() on that file descriptor
will return 0 (EOF). */

checkNotificationIdIsValid(notifyFd, req->id);

/* Seek to the location containing the pathname argument (i.e., the
first argument) of the mkdir(2) call and read that pathname */

if (lseek(procMemFd, req->data.args[0], SEEK_SET) == -1)
errExit("Supervisor: lseek");

ssize_t s = read(procMemFd, path, len);
if (s == -1)
errExit("read");

if (s == 0) {
fprintf(stderr, "\tS: read() of /proc/PID/mem "
"returned 0 (EOF)\n");
exit(EXIT_FAILURE);
}

if (close(procMemFd) == -1)
errExit("close-/proc/PID/mem");
}

/* Handle notifications that arrive via the SECCOMP_RET_USER_NOTIF file
descriptor, 'notifyFd'. */

static void
handleNotifications(int notifyFd)
{
struct seccomp_notif_sizes sizes;
char path[PATH_MAX];
/* For simplicity, we assume that the pathname given to mkdir()
is no more than PATH_MAX bytes; but this might not be true. */

/* Discover the sizes of the structures that are used to receive
notifications and send notification responses, and allocate
buffers of those sizes. */

if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes) == -1)
errExit("\tS: seccomp-SECCOMP_GET_NOTIF_SIZES");

struct seccomp_notif *req = malloc(sizes.seccomp_notif);
if (req == NULL)
errExit("\tS: malloc");

struct seccomp_notif_resp *resp = malloc(sizes.seccomp_notif_resp);
if (resp == NULL)
errExit("\tS: malloc");

/* Loop handling notifications */

for (;;) {
/* Wait for next notification, returning info in '*req' */

memset(req, 0, sizes.seccomp_notif);
if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_RECV, req) == -1) {
if (errno == EINTR)
continue;
errExit("Supervisor: ioctl-SECCOMP_IOCTL_NOTIF_RECV");
}

printf("\tS: got notification (ID %#llx) for PID %d\n",
req->id, req->pid);

/* The only system call that can generate a notification event
is mkdir(2). Nevertheless, we check that the notified system
call is indeed mkdir() as kind of future-proofing of this
code in case the seccomp filter is later modified to
generate notifications for other system calls. */

if (req->data.nr != __NR_mkdir) {
printf("\tS: notification contained unexpected "
"system call number; bye!!!\n");
exit(EXIT_FAILURE);
}

getTargetPathname(req, notifyFd, path, sizeof(path));

/* Prepopulate some fields of the response */

resp->id = req->id; /* Response includes notification ID */
resp->flags = 0;
resp->val = 0;

/* If the directory is in /tmp, then create it on behalf of
the supervisor; if the pathname starts with '.', tell the
kernel to let the target process execute the mkdir();
otherwise, give an error for a directory pathname in
any other location. */

if (strncmp(path, "/tmp/", strlen("/tmp/")) == 0) {
printf("\tS: executing: mkdir(\"%s\", %#llo)\n",
path, req->data.args[1]);

if (mkdir(path, req->data.args[1]) == 0) {
resp->error = 0; /* "Success" */
resp->val = strlen(path); /* Used as return value of
mkdir() in target */
printf("\tS: success! spoofed return = %lld\n",
resp->val);
} else {

/* If mkdir() failed in the supervisor, pass the error
back to the target */

resp->error = -errno;
printf("\tS: failure! (errno = %d; %s)\n", errno,
strerror(errno));
}
} else if (strncmp(path, "./", strlen("./")) == 0) {
resp->error = resp->val = 0;
resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
printf("\tS: target can execute system call\n");
} else {
resp->error = -EOPNOTSUPP;
printf("\tS: spoofing error response (%s)\n",
strerror(-resp->error));
}

/* Send a response to the notification */

printf("\tS: sending response "
"(flags = %#x; val = %lld; error = %d)\n",
resp->flags, resp->val, resp->error);

if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp) == -1) {
if (errno == ENOENT)
printf("\tS: response failed with ENOENT; "
"perhaps target process's syscall was "
"interrupted by signal?\n");
else
perror("ioctl-SECCOMP_IOCTL_NOTIF_SEND");
}

/* If the pathname is just "/bye", then the supervisor
terminates. This allows us to see what happens if the
target process makes further calls to mkdir(2). */

if (strcmp(path, "/bye") == 0) {
printf("\tS: terminating **********\n");
exit(EXIT_FAILURE);
}
}
}

/* Implementation of the supervisor process:

(1) obtains the notification file descriptor from 'sockPair[1]'
(2) handles notifications that arrive on that file descriptor. */

static void
supervisor(int sockPair[2])
{
int notifyFd = recvfd(sockPair[1]);
if (notifyFd == -1)
errExit("recvfd");

closeSocketPair(sockPair); /* We no longer need the socket pair */

handleNotifications(notifyFd);
}

int
main(int argc, char *argv[])
{
int sockPair[2];

setbuf(stdout, NULL);

if (argc < 2) {
fprintf(stderr, "At least one pathname argument is required\n");
exit(EXIT_FAILURE);
}

/* Create a UNIX domain socket that is used to pass the seccomp
notification file descriptor from the target process to the
supervisor process. */

if (socketpair(AF_UNIX, SOCK_STREAM, 0, sockPair) == -1)
errExit("socketpair");

/* Create a child process--the "target"--that installs seccomp
filtering. The target process writes the seccomp notification
file descriptor onto 'sockPair[0]' and then calls mkdir(2) for
each directory in the command-line arguments. */

(void) targetProcess(sockPair, &argv[optind]);

/* Catch SIGCHLD when the target terminates, so that the
supervisor can also terminate. */

struct sigaction sa;
sa.sa_handler = sigchldHandler;
sa.sa_flags = 0;
sigemptyset(&sa.sa_mask);
if (sigaction(SIGCHLD, &sa, NULL) == -1)
errExit("sigaction");

supervisor(sockPair);

exit(EXIT_SUCCESS);
}

SEE ALSO
ioctl(2), seccomp(2)


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


2020-09-30 15:15:02

by Tycho Andersen

Subject: Re: For review: seccomp_user_notif(2) manual page

On Wed, Sep 30, 2020 at 09:03:36AM -0600, Tycho Andersen wrote:
> On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> > ┌─────────────────────────────────────────────────────┐
> > │FIXME │
> > ├─────────────────────────────────────────────────────┤
> > │Interestingly, after the event had been received, │
> > │the file descriptor indicates as writable (verified │
> > │from the source code and by experiment). How is this │
> > │useful? │
>
> You're saying it should just do EPOLLOUT and not EPOLLWRNORM? Seems
> reasonable.

If we make this change, I suppose we should also drop EPOLLRDNORM from
things which have not been received yet, since they're not really
readable.

Tycho

2020-09-30 15:17:53

by Tycho Andersen

Subject: Re: For review: seccomp_user_notif(2) manual page

On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> 2. In order that the supervisor process can obtain notifications
> using the listening file descriptor, (a duplicate of) that
> file descriptor must be passed from the target process to the
> supervisor process. One way in which this could be done is by
> passing the file descriptor over a UNIX domain socket connec‐
> tion between the two processes (using the SCM_RIGHTS ancillary
> message type described in unix(7)). Another possibility is
> that the supervisor might inherit the file descriptor via
> fork(2).

It is technically possible to inherit the fd via fork, but is it
really that useful? The child process wouldn't be able to actually do
the syscall in question, since it would have the same filter.

> The information in the notification can be used to discover
> the values of pointer arguments for the target process's sys‐
> tem call. (This is something that can't be done from within a
> seccomp filter.) To do this (and assuming it has suitable

s/To do this/One way to accomplish this/ perhaps, since there are
others.

> permissions), the supervisor opens the corresponding
> /proc/[pid]/mem file, seeks to the memory location that corre‐
> sponds to one of the pointer arguments whose value is supplied
> in the notification event, and reads bytes from that location.
> (The supervisor must be careful to avoid a race condition that
> can occur when doing this; see the description of the SEC‐
> COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.) In addi‐
> tion, the supervisor can access other system information that
> is visible in user space but which is not accessible from a
> seccomp filter.
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME │
> ├─────────────────────────────────────────────────────┤
> │Suppose we are reading a pathname from /proc/PID/mem │
> │for a system call such as mkdir(). The pathname can │
> │be an arbitrary length. How do we know how much (how │
> │many pages) to read from /proc/PID/mem? │
> └─────────────────────────────────────────────────────┘

PATH_MAX, I suppose.

> ┌─────────────────────────────────────────────────────┐
> │FIXME │
> ├─────────────────────────────────────────────────────┤
> │From my experiments, it appears that if a SEC‐ │
> │COMP_IOCTL_NOTIF_RECV is done after the target │
> │process terminates, then the ioctl() simply blocks │
> │(rather than returning an error to indicate that the │
> │target process no longer exists). │

Yeah, I think Christian wanted to fix this at some point, but it's a
bit sticky to do. Note that if you e.g. rely on fork() above, the
filter is shared with your current process, and this notification
would never be possible. Perhaps another reason to omit that from the
man page.

> SECCOMP_IOCTL_NOTIF_ID_VALID
> This operation can be used to check that a notification ID
> returned by an earlier SECCOMP_IOCTL_NOTIF_RECV operation
> is still valid (i.e., that the target process still
> exists).
>
> The third ioctl(2) argument is a pointer to the cookie
> (id) returned by the SECCOMP_IOCTL_NOTIF_RECV operation.
>
> This operation is necessary to avoid race conditions that
> can occur when the pid returned by the SEC‐
> COMP_IOCTL_NOTIF_RECV operation terminates, and that
> process ID is reused by another process. An example of
> this kind of race is the following:
>
> 1. A notification is generated on the listening file
> descriptor. The returned seccomp_notif contains the
> PID of the target process.
>
> 2. The target process terminates.
>
> 3. Another process is created on the system that by chance
> reuses the PID that was freed when the target process
> terminates.
>
> 4. The supervisor open(2)s the /proc/[pid]/mem file for
> the PID obtained in step 1, with the intention of (say)
> inspecting the memory locations that contain the arguments
> of the system call that triggered the notification in
> step 1.
>
> In the above scenario, the risk is that the supervisor may
> try to access the memory of a process other than the tar‐
> get. This race can be avoided by following the call to
> open with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to ver‐
> ify that the process that generated the notification is
> still alive. (Note that if the target process subse‐
> quently terminates, its PID won't be reused because there
> remains an open reference to the /proc/[pid]/mem file; in
> this case, a subsequent read(2) from the file will return
> 0, indicating end of file.)
>
> On success (i.e., the notification ID is still valid),
> this operation returns 0 On failure (i.e., the notifica‐
^ need a period?

> ┌─────────────────────────────────────────────────────┐
> │FIXME │
> ├─────────────────────────────────────────────────────┤
> │Interestingly, after the event had been received, │
> │the file descriptor indicates as writable (verified │
> │from the source code and by experiment). How is this │
> │useful? │

You're saying it should just do EPOLLOUT and not EPOLLWRNORM? Seems
reasonable.

>
> EXAMPLES
> The (somewhat contrived) program shown below demonstrates the use

May also be worth mentioning the example in
samples/seccomp/user-trap.c as well.

Tycho

2020-09-30 15:56:40

by Jann Horn

Subject: Re: For review: seccomp_user_notif(2) manual page

On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
<[email protected]> wrote:
> I knew it would be a big ask, but below is kind of the manual page
> I was hoping you might write [1] for the seccomp user-space notification
> mechanism. Since you didn't (and because 5.9 adds various new pieces
> such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
> that also will need documenting [2]), I did :-). But of course I may
> have made mistakes...
[...]
> NAME
> seccomp_user_notif - Seccomp user-space notification mechanism
>
> SYNOPSIS
> #include <linux/seccomp.h>
> #include <linux/filter.h>
> #include <linux/audit.h>
>
> int seccomp(unsigned int operation, unsigned int flags, void *args);

Should the ioctl() calls be listed here, similar to e.g. the SYNOPSIS
of the ioctl_* manpages?

> DESCRIPTION
> This page describes the user-space notification mechanism pro‐
> vided by the Secure Computing (seccomp) facility. As well as the
> use of the SECCOMP_FILTER_FLAG_NEW_LISTENER flag, the SEC‐
> COMP_RET_USER_NOTIF action value, and the SECCOMP_GET_NOTIF_SIZES
> operation described in seccomp(2), this mechanism involves the
> use of a number of related ioctl(2) operations (described below).
>
> Overview
> In conventional usage of a seccomp filter, the decision about how
> to treat a particular system call is made by the filter itself.
> The user-space notification mechanism allows the handling of the
> system call to instead be handed off to a user-space process.
> The advantages of doing this are that, by contrast with the sec‐
> comp filter, which is running on a virtual machine inside the
> kernel, the user-space process has access to information that is
> unavailable to the seccomp filter and it can perform actions that
> can't be performed from the seccomp filter.
>
> In the discussion that follows, the process that has installed
> the seccomp filter is referred to as the target, and the process

Technically, this definition of "target" is a bit inaccurate because:

- seccomp filters are inherited
- seccomp filters apply to threads, not processes
- seccomp filters can be semi-remotely installed via TSYNC

(I assume that in manpages, we should try to go for the "a task is a
thread and a thread group is a process" definition, right?)

Perhaps "the threads on which the seccomp filter is installed are
referred to as the target", or something like that would be better?

> that is notified by the user-space notification mechanism is
> referred to as the supervisor. An overview of the steps per‐
> formed by these two processes is as follows:
>
> 1. The target process establishes a seccomp filter in the usual
> manner, but with two differences:
>
> · The seccomp(2) flags argument includes the flag SECCOMP_FIL‐
> TER_FLAG_NEW_LISTENER. Consequently, the return value of
> the (successful) seccomp(2) call is a new "listening" file
> descriptor that can be used to receive notifications.
>
> · In cases where it is appropriate, the seccomp filter returns
> the action value SECCOMP_RET_USER_NOTIF. This return value
> will trigger a notification event.
>
> 2. In order that the supervisor process can obtain notifications
> using the listening file descriptor, (a duplicate of) that
> file descriptor must be passed from the target process to the
> supervisor process. One way in which this could be done is by
> passing the file descriptor over a UNIX domain socket connec‐
> tion between the two processes (using the SCM_RIGHTS ancillary
> message type described in unix(7)). Another possibility is
> that the supervisor might inherit the file descriptor via
> fork(2).

With the caveat that if the supervisor inherits the file descriptor
via fork(), that (more or less) implies that the supervisor is subject
to the same filter (although it could bypass the filter using a helper
thread that responds SECCOMP_USER_NOTIF_FLAG_CONTINUE, but I don't
expect any clean software to do that).

> 3. The supervisor process will receive notification events on the
> listening file descriptor. These events are returned as
> structures of type seccomp_notif. Because this structure and
> its size may evolve over kernel versions, the supervisor must
> first determine the size of this structure using the sec‐
> comp(2) SECCOMP_GET_NOTIF_SIZES operation, which returns a
> structure of type seccomp_notif_sizes. The supervisor allo‐
> cates a buffer of size seccomp_notif_sizes.seccomp_notif bytes
> to receive notification events. In addition, the supervisor
> allocates another buffer of size seccomp_notif_sizes.sec‐
> comp_notif_resp bytes for the response (a struct sec‐
> comp_notif_resp structure) that it will provide to the kernel
> (and thus the target process).
>
> 4. The target process then performs its workload, which includes
> system calls that will be controlled by the seccomp filter.
> Whenever one of these system calls causes the filter to return
> the SECCOMP_RET_USER_NOTIF action value, the kernel does not
> execute the system call; instead, execution of the target
> process is temporarily blocked inside the kernel and a notifi‐

where "blocked" refers to the interruptible, restartable kind - if the
child receives a signal with an SA_RESTART signal handler in the
meantime, it'll leave the syscall, go through the signal handler, then
restart the syscall again and send the same request to the supervisor
again. So the supervisor may see duplicate syscalls.

What's really gross here is that signal(7) promises that some syscalls
like epoll_wait(2) never restart, but seccomp doesn't know about that;
if userspace installs a filter that uses SECCOMP_RET_USER_NOTIF for a
non-restartable syscall, the result is that UAPI gets broken a little
bit. Luckily normal users of seccomp probably won't use
SECCOMP_RET_USER_NOTIF for non-restartable syscalls, but if someone does
want to do that, we might have to add some "suppress syscall
restarting" flag into the seccomp action value, or something like
that... yuck.

> cation event is generated on the listening file descriptor.
>
> 5. The supervisor process can now repeatedly monitor the listen‐
> ing file descriptor for SECCOMP_RET_USER_NOTIF-triggered
> events. To do this, the supervisor uses the SEC‐
> COMP_IOCTL_NOTIF_RECV ioctl(2) operation to read information
> about a notification event; this operation blocks until an

(interruptibly - but I guess that maybe doesn't have to be said
explicitly here?)

> event is available.

Maybe we should note here that you can use the multi-fd-polling APIs
(select/poll/epoll) instead, and that if the notification goes away
before you call SECCOMP_IOCTL_NOTIF_RECV, the ioctl will return
-ENOENT instead of blocking, and therefore as long as nobody else
reads from the same fd, you can assume that after the fd reports as
readable, you can call SECCOMP_IOCTL_NOTIF_RECV once without blocking.

Exceeeeept that this part looks broken:

if (mutex_lock_interruptible(&filter->notify_lock) < 0)
return EPOLLERR;

which I think means that we can have a race where a signal arrives
while poll() is trying to add itself to the waitqueue of the seccomp
fd, and then we'll get a spurious error condition reported on the fd.
That's a kernel bug, I'd say.
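
A minimal sketch of the "poll first, then SECCOMP_IOCTL_NOTIF_RECV"
idea, covering only the waiting half. The names waitForEvent() and
demoWaitForEvent() are hypothetical and not part of the page's example
program, and a pipe stands in for the listening file descriptor so that
the sketch runs without installing a filter. Given the possibly
spurious EPOLLERR noted above, the caller is expected to test for
POLLIN specifically rather than treat any event as a pending
notification.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: wait up to 'timeoutMs' milliseconds (-1 means
   wait indefinitely) for 'fd' to report an event. Returns the poll(2)
   'revents' mask (> 0) if an event was reported, 0 on timeout, or -1
   on error. A call interrupted by a signal handler fails with EINTR,
   which is left for the caller to handle. */

static int
waitForEvent(int fd, int timeoutMs)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };

    assert(fd >= 0);        /* poll() silently ignores negative FDs */

    int ready = poll(&pfd, 1, timeoutMs);
    if (ready <= 0)
        return ready;       /* 0: timeout; -1: error (errno is set) */

    return pfd.revents;
}

/* Usage sketch: a pipe stands in for the listening file descriptor.
   Returns 1 if the read end was idle before the write and reported
   POLLIN after it, 0 if not, or -1 if the pipe could not be set up. */

static int
demoWaitForEvent(void)
{
    int pipeFds[2];
    if (pipe(pipeFds) == -1)
        return -1;

    int before = waitForEvent(pipeFds[0], 0);   /* Nothing written yet */

    int after = -1;
    if (write(pipeFds[1], "x", 1) == 1)
        after = waitForEvent(pipeFds[0], 1000);

    close(pipeFds[0]);
    close(pipeFds[1]);

    return before == 0 && after > 0 && (after & POLLIN) != 0;
}
```

In a supervisor, a result with POLLIN set would be followed by a single
SECCOMP_IOCTL_NOTIF_RECV call, with ENOENT from that call treated as
"the notification went away in the meantime".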

> The operation returns a seccomp_notif
> structure containing information about the system call that is
> being attempted by the target process.
>
> 6. The seccomp_notif structure returned by the SEC‐
> COMP_IOCTL_NOTIF_RECV operation includes the same information
> (a seccomp_data structure) that was passed to the seccomp fil‐
> ter. This information allows the supervisor to discover the
> system call number and the arguments for the target process's
> system call. In addition, the notification event contains the
> PID of the target process.

That's a PIDTYPE_PID, which the manpages call a "thread ID".

> The information in the notification can be used to discover
> the values of pointer arguments for the target process's sys‐
> tem call. (This is something that can't be done from within a
> seccomp filter.) To do this (and assuming it has suitable
> permissions), the supervisor opens the corresponding
> /proc/[pid]/mem file,

... which means that here we might have to get into the weeds of how
actually /proc has invisible directories for every TID, even though
only the ones for PIDs are visible, and therefore you can just open
/proc/[tid]/mem and it'll work fine?

> seeks to the memory location that corre‐
> sponds to one of the pointer arguments whose value is supplied
> in the notification event, and reads bytes from that location.
> (The supervisor must be careful to avoid a race condition that
> can occur when doing this; see the description of the SEC‐
> COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.) In addi‐
> tion, the supervisor can access other system information that
> is visible in user space but which is not accessible from a
> seccomp filter.
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME │
> ├─────────────────────────────────────────────────────┤
> │Suppose we are reading a pathname from /proc/PID/mem │
> │for a system call such as mkdir(). The pathname can │
> │be an arbitrary length. How do we know how much (how │
> │many pages) to read from /proc/PID/mem? │
> └─────────────────────────────────────────────────────┘

It can't be an arbitrary length. While pathnames *returned* from the
kernel in some places can have different limits, strings supplied as
path arguments *to* the kernel AFAIK always have an upper limit of
PATH_MAX, else you get -ENAMETOOLONG. See getname_flags().
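
A minimal sketch of a bounded read that relies on that limit: read at
most 'len' bytes and insist on finding the terminating null byte within
them. The helper name readTargetString() and its error conventions
(ENAMETOOLONG when no terminator fits, EIO on end-of-file) are
assumptions made for illustration; they are not taken from the page.
The demo function uses an ordinary temporary file as a stand-in for
/proc/TID/mem.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read a null-terminated string of at most 'len'
   bytes (terminator included) starting at offset 'addr' of the file
   referred to by 'memFd'. Returns the string length on success.
   Returns -1 with errno set to ENAMETOOLONG if no terminator was found
   within 'len' bytes, or to EIO if end-of-file was reached first. */

static ssize_t
readTargetString(int memFd, off_t addr, char *buf, size_t len)
{
    size_t total = 0;

    assert(buf != NULL && len > 0);

    while (total < len) {
        ssize_t n = pread(memFd, buf + total, len - total,
                          addr + (off_t) total);
        if (n == -1) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0) {               /* EOF before any terminator */
            errno = EIO;
            return -1;
        }

        char *nul = memchr(buf + total, '\0', (size_t) n);
        if (nul != NULL)
            return nul - buf;       /* Length, excluding terminator */

        total += (size_t) n;
    }

    errno = ENAMETOOLONG;
    return -1;
}

/* Usage sketch: the string "/tmp/dir" is stored at offset 2 of a
   temporary file, and is read back into 'buf' (capacity 'len'). */

static ssize_t
demoReadTargetString(char *buf, size_t len)
{
    static const char contents[] = "xx/tmp/dir\0yy";
    char tmpl[] = "/tmp/readstrXXXXXX";

    int fd = mkstemp(tmpl);
    if (fd == -1)
        return -1;
    unlink(tmpl);

    ssize_t res = -1;
    if (write(fd, contents, sizeof(contents)) == (ssize_t) sizeof(contents))
        res = readTargetString(fd, 2, buf, len);

    int savedErrno = errno;         /* close() must not clobber errno */
    close(fd);
    errno = savedErrno;
    return res;
}
```

With a PATH_MAX-sized buffer, a supervisor would pass the /proc/TID/mem
file descriptor and the pointer argument from the notification in place
of the temporary file and the offset 2.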

> 7. Having obtained information as per the previous step, the
> supervisor may then choose to perform an action in response to
> the target process's system call (which, as noted above, is
> not executed when the seccomp filter returns the SEC‐
> COMP_RET_USER_NOTIF action value).

(unless SECCOMP_USER_NOTIF_FLAG_CONTINUE is used)

> One example use case here relates to containers. The target
> process may be located inside a container where it does not
> have sufficient capabilities to mount a filesystem in the con‐
> tainer's mount namespace. However, the supervisor may be a
> more privileged process that that does have sufficient capa‐

nit: s/that that/that/

> bilities to perform the mount operation.
>
> 8. The supervisor then sends a response to the notification. The
> information in this response is used by the kernel to con‐
> struct a return value for the target process's system call and
> provide a value that will be assigned to the errno variable of
> the target process.
>
> The response is sent using the SECCOMP_IOCTL_NOTIF_SEND
> ioctl(2) operation, which is used to transmit a sec‐
> comp_notif_resp structure to the kernel. This structure
> includes a cookie value that the supervisor obtained in the
> seccomp_notif structure returned by the SEC‐
> COMP_IOCTL_NOTIF_RECV operation. This cookie value allows the
> kernel to associate the response with the target process.

(unless the target thread entered a signal handler or was killed in
the meantime)

> 9. Once the notification has been sent, the system call in the
> target process unblocks, returning the information that was
> provided by the supervisor in the notification response.
>
> As a variation on the last two steps, the supervisor can send a
> response that tells the kernel that it should execute the target
> process's system call; see the discussion of SEC‐
> COMP_USER_NOTIF_FLAG_CONTINUE, below.
>
> ioctl(2) operations
> The following ioctl(2) operations are provided to support seccomp
> user-space notification. For each of these operations, the first
> (file descriptor) argument of ioctl(2) is the listening file
> descriptor returned by a call to seccomp(2) with the SECCOMP_FIL‐
> TER_FLAG_NEW_LISTENER flag.
>
> SECCOMP_IOCTL_NOTIF_RECV
> This operation is used to obtain a user-space notification
> event. If no such event is currently pending, the opera‐
> tion blocks until an event occurs.

Not necessarily; for every time a process entered a signal handler or
was killed while a notification was pending, a call to
SECCOMP_IOCTL_NOTIF_RECV will return -ENOENT.

> The third ioctl(2)
> argument is a pointer to a structure of the following form
> which contains information about the event. This struc‐
> ture must be zeroed out before the call.
>
> struct seccomp_notif {
> __u64 id; /* Cookie */
> __u32 pid; /* PID of target process */

(TID, not PID)

> __u32 flags; /* Currently unused (0) */
> struct seccomp_data data; /* See seccomp(2) */
> };
>
> The fields in this structure are as follows:
>
> id This is a cookie for the notification. Each such
> cookie is guaranteed to be unique for the corre‐
> sponding seccomp filter. In other words, this
> cookie is unique for each notification event from
> the target process.

That sentence about "target process" looks wrong to me. The cookies
are unique across notifications from the filter, but there can be
multiple filters per thread, and multiple threads per filter.

> The cookie value has the fol‐
> lowing uses:
>
> · It can be used with the SEC‐
> COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation to
> verify that the target process is still alive.
>
> · When returning a notification response to the
> kernel, the supervisor must include the cookie
> value in the seccomp_notif_resp structure that is
> specified as the argument of the SEC‐
> COMP_IOCTL_NOTIF_SEND operation.
>
> pid This is the PID of the target process that trig‐
> gered the notification event.
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME │
> ├─────────────────────────────────────────────────────┤
> │This is a thread ID, rather than a PID, right? │
> └─────────────────────────────────────────────────────┘

Yeah.

>
> flags This is a bit mask of flags providing further
> information on the event. In the current implemen‐
> tation, this field is always zero.
>
> data This is a seccomp_data structure containing infor‐
> mation about the system call that triggered the
> notification. This is the same structure that is
> passed to the seccomp filter. See seccomp(2) for
> details of this structure.
>
> On success, this operation returns 0; on failure, -1 is
> returned, and errno is set to indicate the cause of the
> error. This operation can fail with the following errors:
>
> EINVAL (since Linux 5.5)
> The seccomp_notif structure that was passed to the
> call contained nonzero fields.
>
> ENOENT The target process was killed by a signal as the
> notification information was being generated.

Not just killed, interruption with a signal handler has the same effect.

> ┌─────────────────────────────────────────────────────┐
> │FIXME │
> ├─────────────────────────────────────────────────────┤
> │From my experiments, it appears that if a SEC‐ │
> │COMP_IOCTL_NOTIF_RECV is done after the target │
> │process terminates, then the ioctl() simply blocks │
> │(rather than returning an error to indicate that the │
> │target process no longer exists). │
> │ │
> │I found that surprising, and it required some con‐ │
> │tortions in the example program. It was not possi‐ │
> │ble to code my SIGCHLD handler (which reaps the zom‐ │
> │bie when the worker/target process terminates) to │
> │simply set a flag checked in the main handleNotifi‐ │
> │cations() loop, since this created an unavoidable │
> │race where the child might terminate just after I │
> │had checked the flag, but before I blocked (for‐ │
> │ever!) in the SECCOMP_IOCTL_NOTIF_RECV operation. │
> │Instead, I had to code the signal handler to simply │
> │call _exit(2) in order to terminate the parent │
> │process (the supervisor). │
> │ │
> │Is this expected behavior? It seems to me rather │
> │desirable that SECCOMP_IOCTL_NOTIF_RECV should give │
> │an error if the target process has terminated. │
> └─────────────────────────────────────────────────────┘

You could poll() the fd first. But yeah, it'd probably be a good idea
to change that.

> SECCOMP_IOCTL_NOTIF_ID_VALID
[...]
> In the above scenario, the risk is that the supervisor may
> try to access the memory of a process other than the tar‐
> get. This race can be avoided by following the call to
> open with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to ver‐
> ify that the process that generated the notification is
> still alive. (Note that if the target process subse‐
> quently terminates, its PID won't be reused because there

That's wrong, the PID can be reused, but the /proc/$pid directory is
internally not associated with the numeric PID, but, conceptually
speaking, with a specific incarnation of the PID, or something like
that. (Actually, it is associated with the "struct pid", which is not
reused, instead of the numeric PID.)

> remains an open reference to the /proc/[pid]/mem file; in
> this case, a subsequent read(2) from the file will return
> 0, indicating end of file.)
>
> On success (i.e., the notification ID is still valid),
> this operation returns 0 On failure (i.e., the notifica‐

nit: s/returns 0/returns 0./

> tion ID is no longer valid), -1 is returned, and errno is
> set to ENOENT.
>
> SECCOMP_IOCTL_NOTIF_SEND
[...]
> Two kinds of response are possible:
>
> · A response to the kernel telling it to execute the tar‐
> get process's system call. In this case, the flags
> field includes SECCOMP_USER_NOTIF_FLAG_CONTINUE and the
> error and val fields must be zero.
>
> This kind of response can be useful in cases where the
> supervisor needs to do deeper analysis of the target's
> system call than is possible from a seccomp filter
> (e.g., examining the values of pointer arguments), and,
> having verified that the system call is acceptable, the
> supervisor wants to allow it to proceed.

"allow" sounds as if this is an access control thing, but this
mechanism should usually not be used for access control (unless the
"seccomp" syscall is blocked). Maybe reword as "having decided that
the system call does not require emulation by the supervisor, the
supervisor wants it to execute normally", or something like that?
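
A minimal sketch of how the kinds of response might be filled in,
following the rule quoted above that error and val must be zero when
SECCOMP_USER_NOTIF_FLAG_CONTINUE is set. The helper names respContinue(),
respSpoofValue() and respSpoofError() are hypothetical. The sketch
assumes kernel headers recent enough to define
SECCOMP_USER_NOTIF_FLAG_CONTINUE, and it sets only the four fields that
struct seccomp_notif_resp currently has.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>          /* Error numbers passed by callers */
#include <linux/seccomp.h>

/* Tell the kernel to execute the target's system call itself. The
   'error' and 'val' fields must be zero in this case. */

static void
respContinue(struct seccomp_notif_resp *resp, __u64 id)
{
    resp->id = id;
    resp->val = 0;
    resp->error = 0;
    resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}

/* Spoof a successful system call whose return value is 'val'. */

static void
respSpoofValue(struct seccomp_notif_resp *resp, __u64 id, __s64 val)
{
    resp->id = id;
    resp->val = val;
    resp->error = 0;
    resp->flags = 0;
}

/* Spoof a failed system call. 'err' is a positive errno value such as
   EOPNOTSUPP; it is stored negated, as in the example program. */

static void
respSpoofError(struct seccomp_notif_resp *resp, __u64 id, int err)
{
    assert(err > 0);

    resp->id = id;
    resp->val = 0;
    resp->error = -err;
    resp->flags = 0;
}
```

Each helper would be called with the cookie from the received
seccomp_notif structure (req->id) just before the
SECCOMP_IOCTL_NOTIF_SEND operation.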

[...]
> On success, this operation returns 0; on failure, -1 is
> returned, and errno is set to indicate the cause of the
> error. This operation can fail with the following errors:
>
> EINPROGRESS
> A response to this notification has already been
> sent.
>
> EINVAL An invalid value was specified in the flags field.
>
> EINVAL The flags field contained SEC‐
> COMP_USER_NOTIF_FLAG_CONTINUE, and the error or val
> field was not zero.
>
> ENOENT The blocked system call in the target process has
> been interrupted by a signal handler.

(you could also get this if a response has already been sent, instead
of EINPROGRESS - the only difference is whether the target thread has
picked up the response yet)
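
A small sketch of how a supervisor might sort the errno values listed
above after a failed SECCOMP_IOCTL_NOTIF_SEND. The enum and function
names are hypothetical, and the grouping simply restates the error list
and the remark above; it is not a statement of what the kernel
guarantees.

```c
#include <assert.h>
#include <errno.h>

enum sendOutcome {
    SEND_NOTIFICATION_GONE,     /* ENOENT: target's system call was
                                   interrupted by a signal handler, or
                                   a response was already sent and
                                   picked up by the target */
    SEND_ALREADY_ANSWERED,      /* EINPROGRESS: a response was already
                                   sent but not yet picked up */
    SEND_BAD_RESPONSE,          /* EINVAL: invalid flags, or CONTINUE
                                   with nonzero 'error' or 'val' */
    SEND_OTHER_ERROR            /* Anything else */
};

/* Map the errno value left by a failed SECCOMP_IOCTL_NOTIF_SEND to one
   of the outcomes above. */

static enum sendOutcome
classifySendErrno(int err)
{
    assert(err > 0);

    switch (err) {
    case ENOENT:
        return SEND_NOTIFICATION_GONE;
    case EINPROGRESS:
        return SEND_ALREADY_ANSWERED;
    case EINVAL:
        return SEND_BAD_RESPONSE;
    default:
        return SEND_OTHER_ERROR;
    }
}
```

Only SEND_BAD_RESPONSE points at a problem in the supervisor's own
response; the first two outcomes describe a notification that no longer
needs an answer.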

> NOTES
> The file descriptor returned when seccomp(2) is employed with the
> SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using
> poll(2), epoll(7), and select(2). When a notification is pend‐
> ing, these interfaces indicate that the file descriptor is read‐
> able.

We should probably also point out somewhere that, as
include/uapi/linux/seccomp.h says:

* Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF
* or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
* same syscall, the most recently added filter takes precedence. This means
* that the new SECCOMP_RET_USER_NOTIF filter can override any
* SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
* such filtered syscalls to be executed by sending the response
* SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally
* be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.

In other words, from a security perspective, you must assume that the
target process can bypass any SECCOMP_RET_USER_NOTIF (or
SECCOMP_RET_TRACE) filters unless it is completely prohibited from
calling seccomp(). This should also be noted over in the main
seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.


> EXAMPLES
[...]
> This program can used to demonstrate various aspects of the

nit: "can be used to demonstrate", or alternatively just "demonstrates"

> behavior of the seccomp user-space notification mechanism. To
> help aid such demonstrations, the program logs various messages
> to show the operation of the target process (lines prefixed "T:")
> and the supervisor (indented lines prefixed "S:").
[...]
> Program source
[...]
> #define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \
> } while (0)

Don't we have err() for this?
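
For reference, a minimal sketch of what that could look like with
err(3), a nonstandard BSD extension that glibc provides. xmalloc() is
a hypothetical helper used only to show the call; it is not part of
the program above.

```c
#include <err.h>
#include <stdlib.h>

/* Hypothetical helper: allocate memory or terminate. err() prints the
   program name, the given message, and strerror(errno) to stderr, and
   then calls exit() with the given status. */
void *
xmalloc(size_t size)
{
    void *p = malloc(size);
    if (p == NULL)
        err(EXIT_FAILURE, "malloc");
    return p;
}
```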

> /* Send the file descriptor 'fd' over the connected UNIX domain socket
> 'sockfd'. Returns 0 on success, or -1 on error. */
>
> static int
> sendfd(int sockfd, int fd)
> {
> struct msghdr msgh;
> struct iovec iov;
> int data;
> struct cmsghdr *cmsgp;
>
> /* Allocate a char array of suitable size to hold the ancillary data.
> However, since this buffer is in reality a 'struct cmsghdr', use a
> union to ensure that it is suitable aligned. */

nit: suitably

> union {
> char buf[CMSG_SPACE(sizeof(int))];
> /* Space large enough to hold an 'int' */
> struct cmsghdr align;
> } controlMsg;
>
> /* The 'msg_name' field can be used to specify the address of the
> destination socket when sending a datagram. However, we do not
> need to use this field because 'sockfd' is a connected socket. */
>
> msgh.msg_name = NULL;
> msgh.msg_namelen = 0;
>
> /* On Linux, we must transmit at least one byte of real data in
> order to send ancillary data. We transmit an arbitrary integer
> whose value is ignored by recvfd(). */
>
> msgh.msg_iov = &iov;
> msgh.msg_iovlen = 1;
> iov.iov_base = &data;
> iov.iov_len = sizeof(int);
> data = 12345;
>
> /* Set 'msghdr' fields that describe ancillary data */
>
> msgh.msg_control = controlMsg.buf;
> msgh.msg_controllen = sizeof(controlMsg.buf);
>
> /* Set up ancillary data describing file descriptor to send */
>
> cmsgp = CMSG_FIRSTHDR(&msgh);
> cmsgp->cmsg_level = SOL_SOCKET;
> cmsgp->cmsg_type = SCM_RIGHTS;
> cmsgp->cmsg_len = CMSG_LEN(sizeof(int));
> memcpy(CMSG_DATA(cmsgp), &fd, sizeof(int));
>
> /* Send real plus ancillary data */
>
> if (sendmsg(sockfd, &msgh, 0) == -1)
> return -1;
>
> return 0;
> }

Instead of using unix domain sockets to send the fd to the parent, I
think you could also use clone3() with flags==CLONE_FILES|SIGCHLD,
dup2() the seccomp fd to an fd that was reserved in the parent, call
unshare(CLONE_FILES) in the child after setting up the seccomp fd, and
wake up the parent with something like pthread_cond_signal()? I'm not
sure whether that'd look better or worse in the end though, so maybe
just ignore this comment.

[...]
> /* Access the memory of the target process in order to discover the
> pathname that was given to mkdir() */
>
> static void
> getTargetPathname(struct seccomp_notif *req, int notifyFd,
> char *path, size_t len)
> {
> char procMemPath[PATH_MAX];
> snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", req->pid);
>
> int procMemFd = open(procMemPath, O_RDONLY);

Should example code like this maybe use O_CLOEXEC unless the fd in
question actually has to be inheritable? I know it doesn't actually
matter here, but if this code was used in a multi-threaded context, it
might.

> if (procMemFd == -1)
> errExit("Supervisor: open");
>
> /* Check that the process whose info we are accessing is still alive.
> If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
> in checkNotificationIdIsValid()) succeeds, we know that the
> /proc/PID/mem file descriptor that we opened corresponds to the
> process for which we received a notification. If that process
> subsequently terminates, then read() on that file descriptor
> will return 0 (EOF). */
>
> checkNotificationIdIsValid(notifyFd, req->id);
>
> /* Seek to the location containing the pathname argument (i.e., the
> first argument) of the mkdir(2) call and read that pathname */
>
> if (lseek(procMemFd, req->data.args[0], SEEK_SET) == -1)
> errExit("Supervisor: lseek");
>
> ssize_t s = read(procMemFd, path, PATH_MAX);
> if (s == -1)
> errExit("read");

Why not pread() instead of lseek()+read()?

> if (s == 0) {
> fprintf(stderr, "\tS: read() of /proc/PID/mem "
> "returned 0 (EOF)\n");
> exit(EXIT_FAILURE);
> }
>
> if (close(procMemFd) == -1)
> errExit("close-/proc/PID/mem");

We should probably make sure here that the value we read is actually
NUL-terminated?

> }
>
> /* Handle notifications that arrive via the SECCOMP_RET_USER_NOTIF file
> descriptor, 'notifyFd'. */
>
> static void
> handleNotifications(int notifyFd)
> {
> struct seccomp_notif_sizes sizes;
> char path[PATH_MAX];
> /* For simplicity, we assume that the pathname given to mkdir()
> is no more than PATH_MAX bytes; but this might not be true. */

No, it has to be true, otherwise the kernel would fail the syscall if
it was executing normally.

> /* Discover the sizes of the structures that are used to receive
> notifications and send notification responses, and allocate
> buffers of those sizes. */
>
> if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes) == -1)
> errExit("\tS: seccomp-SECCOMP_GET_NOTIF_SIZES");
>
> struct seccomp_notif *req = malloc(sizes.seccomp_notif);
> if (req == NULL)
> errExit("\tS: malloc");
>
> struct seccomp_notif_resp *resp = malloc(sizes.seccomp_notif_resp);

This should probably do something like max(sizes.seccomp_notif_resp,
sizeof(struct seccomp_notif_resp)) in case the program was built
against new UAPI headers that make struct seccomp_notif_resp big, but
is running under an old kernel where that struct is still smaller?
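
A sketch of that sizing rule, assuming UAPI headers that define
struct seccomp_notif_sizes. The helper names respBufSize() and
allocResp() are hypothetical.

```c
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: pick the larger of the size reported by the
   running kernel and the size of the structure in the headers that
   the program was compiled against. */
size_t
respBufSize(const struct seccomp_notif_sizes *sizes)
{
    size_t kernelSize = sizes->seccomp_notif_resp;
    size_t headerSize = sizeof(struct seccomp_notif_resp);

    return (kernelSize > headerSize) ? kernelSize : headerSize;
}

/* Hypothetical helper: allocate a zeroed response buffer that is safe
   to access through 'struct seccomp_notif_resp *' and large enough
   for the kernel to read. */
struct seccomp_notif_resp *
allocResp(const struct seccomp_notif_sizes *sizes)
{
    return calloc(1, respBufSize(sizes));
}
```

The same rule would apply to the struct seccomp_notif buffer.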

> if (resp == NULL)
> errExit("\tS: malloc");
[...]
> } else {
>
> /* If mkdir() failed in the supervisor, pass the error
> back to the target */
>
> resp->error = -errno;
> printf("\tS: failure! (errno = %d; %s)\n", errno,
> strerror(errno));
> }
> } else if (strncmp(path, "./", strlen("./")) == 0) {

nit: indent messed up

> resp->error = resp->val = 0;
> resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
> printf("\tS: target can execute system call\n");
[...]

Subject: Re: For review: seccomp_user_notif(2) manual page

Hi Tycho,

Thanks for taking time to look at the page!

On 9/30/20 5:03 PM, Tycho Andersen wrote:
> On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
>> 2. In order that the supervisor process can obtain notifications
>> using the listening file descriptor, (a duplicate of) that
>> file descriptor must be passed from the target process to the
>> supervisor process. One way in which this could be done is by
>> passing the file descriptor over a UNIX domain socket connec‐
>> tion between the two processes (using the SCM_RIGHTS ancillary
>> message type described in unix(7)). Another possibility is
>> that the supervisor might inherit the file descriptor via
>> fork(2).
>
> It is technically possible to inherit the fd via fork, but is it
> really that useful? The child process wouldn't be able to actually do
> the syscall in question, since it would have the same filter.

D'oh! Yes, of course.

I think I was reaching because in an earlier conversation
you replied:

[[
> 3. The "target process" passes the "listening file descriptor"
> to the "monitoring process" via the UNIX domain socket.

or some other means, it doesn't have to be with SCM_RIGHTS.
]]

So, what other means?

Anyway, I removed the sentence mentioning fork().

>> The information in the notification can be used to discover
>> the values of pointer arguments for the target process's sys‐
>> tem call. (This is something that can't be done from within a
>> seccomp filter.) To do this (and assuming it has suitable
>
> s/To do this/One way to accomplish this/ perhaps, since there are
> others.

Yes, thanks, done.

>> permissions), the supervisor opens the corresponding
>> /proc/[pid]/mem file, seeks to the memory location that corre‐
>> sponds to one of the pointer arguments whose value is supplied
>> in the notification event, and reads bytes from that location.
>> (The supervisor must be careful to avoid a race condition that
>> can occur when doing this; see the description of the SEC‐
>> COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.) In addi‐
>> tion, the supervisor can access other system information that
>> is visible in user space but which is not accessible from a
>> seccomp filter.
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │Suppose we are reading a pathname from /proc/PID/mem │
>> │for a system call such as mkdir(). The pathname can │
>> │be an arbitrary length. How do we know how much (how │
>> │many pages) to read from /proc/PID/mem? │
>> └─────────────────────────────────────────────────────┘
>
> PATH_MAX, I suppose.

Yes, I misunderstood a fundamental detail here, as Jann
also confirmed.

>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │From my experiments, it appears that if a SEC‐ │
>> │COMP_IOCTL_NOTIF_RECV is done after the target │
>> │process terminates, then the ioctl() simply blocks │
>> │(rather than returning an error to indicate that the │
>> │target process no longer exists). │
>
> Yeah, I think Christian wanted to fix this at some point,

Do you have a pointer to that discussion? I could not find it with a
quick search.

> but it's a
> bit sticky to do.

Can you say a few words about the nature of the problem?

In the meantime, I think this merits a note under BUGS, and
I've added one.

> Note that if you e.g. rely on fork() above, the
> filter is shared with your current process, and this notification
> would never be possible. Perhaps another reason to omit that from the
> man page.

(Yes, as noted above, I removed that sentence.)

>> SECCOMP_IOCTL_NOTIF_ID_VALID
>> This operation can be used to check that a notification ID
>> returned by an earlier SECCOMP_IOCTL_NOTIF_RECV operation
>> is still valid (i.e., that the target process still
>> exists).
>>
>> The third ioctl(2) argument is a pointer to the cookie
>> (id) returned by the SECCOMP_IOCTL_NOTIF_RECV operation.
>>
>> This operation is necessary to avoid race conditions that
>> can occur when the pid returned by the SEC‐
>> COMP_IOCTL_NOTIF_RECV operation terminates, and that
>> process ID is reused by another process. An example of
>> this kind of race is the following
>>
>> 1. A notification is generated on the listening file
>> descriptor. The returned seccomp_notif contains the
>> PID of the target process.
>>
>> 2. The target process terminates.
>>
>> 3. Another process is created on the system that by chance
>> reuses the PID that was freed when the target process
>> terminates.
>>
>> 4. The supervisor open(2)s the /proc/[pid]/mem file for
>> the PID obtained in step 1, with the intention of (say)
>> inspecting the memory locations that contains the argu‐
>> ments of the system call that triggered the notifica‐
>> tion in step 1.
>>
>> In the above scenario, the risk is that the supervisor may
>> try to access the memory of a process other than the tar‐
>> get. This race can be avoided by following the call to
>> open with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to ver‐
>> ify that the process that generated the notification is
>> still alive. (Note that if the target process subse‐
>> quently terminates, its PID won't be reused because there
>> remains an open reference to the /proc[pid]/mem file; in
>> this case, a subsequent read(2) from the file will return
>> 0, indicating end of file.)
>>
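
The check described in the text above can be sketched as a small
wrapper. The name notificationIdIsValid() is an assumption for
illustration (the example program has a similar function that exits
on failure instead of returning a value).

```c
#include <linux/seccomp.h>
#include <stdbool.h>
#include <sys/ioctl.h>

/* Hypothetical helper: returns true if the notification ID obtained
   from SECCOMP_IOCTL_NOTIF_RECV is still valid. Intended to be called
   after open()ing /proc/PID/mem and before reading from it. */
bool
notificationIdIsValid(int notifyFd, __u64 id)
{
    return ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == 0;
}
```

If this returns false, the supervisor should close the /proc/PID/mem
descriptor without reading from it.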
>> On success (i.e., the notification ID is still valid),
>> this operation returns 0 On failure (i.e., the notifica‐
> ^ need a period?
>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │Interestingly, after the event had been received, │
>> │the file descriptor indicates as writable (verified │
>> │from the source code and by experiment). How is this │
>> │useful? │
>
> You're saying it should just do EPOLLOUT and not EPOLLWRNORM? Seems
> reasonable.

No, I'm saying something more fundamental: why is the FD indicating as
writable? Can you write something to it? If yes, what? If not, then
why do these APIs want to say that the FD is writable?

>> EXAMPLES
>> The (somewhat contrived) program shown below demonstrates the use
>
> May also be worth mentioning the example in
> samples/seccomp/user-trap.c as well.

Oh -- I meant to do that! Thanks for reminding me.

Thanks,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2020-10-01 00:29:17

by Tycho Andersen

Subject: Re: For review: seccomp_user_notif(2) manual page

On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
> Hi Tycho,
>
> Thanks for taking time to look at the page!
>
> On 9/30/20 5:03 PM, Tycho Andersen wrote:
> > On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> >> 2. In order that the supervisor process can obtain notifications
> >> using the listening file descriptor, (a duplicate of) that
> >> file descriptor must be passed from the target process to the
> >> supervisor process. One way in which this could be done is by
> >> passing the file descriptor over a UNIX domain socket connec‐
> >> tion between the two processes (using the SCM_RIGHTS ancillary
> >> message type described in unix(7)). Another possibility is
> >> that the supervisor might inherit the file descriptor via
> >> fork(2).
> >
> > It is technically possible to inherit the fd via fork, but is it
> > really that useful? The child process wouldn't be able to actually do
> > the syscall in question, since it would have the same filter.
>
> D'oh! Yes, of course.
>
> I think I was reaching because in an earlier conversation
> you replied:
>
> [[
> > 3. The "target process" passes the "listening file descriptor"
> > to the "monitoring process" via the UNIX domain socket.
>
> or some other means, it doesn't have to be with SCM_RIGHTS.
> ]]
>
> So, what other means?
>
> Anyway, I removed the sentence mentioning fork().

Whatever means people want :). fork() could work (it's how some of the
tests for this feature work, but it's not particularly useful I don't
think), clone(CLONE_FILES) is similar, seccomp_putfd, or maybe even
cloning it via some pidfd interface that might be invented for
re-opening files.

> >> ┌─────────────────────────────────────────────────────┐
> >> │FIXME │
> >> ├─────────────────────────────────────────────────────┤
> >> │From my experiments, it appears that if a SEC‐ │
> >> │COMP_IOCTL_NOTIF_RECV is done after the target │
> >> │process terminates, then the ioctl() simply blocks │
> >> │(rather than returning an error to indicate that the │
> >> │target process no longer exists). │
> >
> > Yeah, I think Christian wanted to fix this at some point,
>
> Do you have a pointer that discussion? I could not find it with a
> quick search.
>
> > but it's a
> > bit sticky to do.
>
> Can you say a few words about the nature of the problem?

I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
notify about unused filter"). So maybe there's a bug here?

> >> ┌─────────────────────────────────────────────────────┐
> >> │FIXME │
> >> ├─────────────────────────────────────────────────────┤
> >> │Interestingly, after the event had been received, │
> >> │the file descriptor indicates as writable (verified │
> >> │from the source code and by experiment). How is this │
> >> │useful? │
> >
> > You're saying it should just do EPOLLOUT and not EPOLLWRNORM? Seems
> > reasonable.
>
> No, I'm saying something more fundamental: why is the FD indicating as
> writable? Can you write something to it? If yes, what? If not, then
> why do these APIs want to say that the FD is writable?

You can't via read(2) or write(2), but conceptually NOTIFY_RECV and
NOTIFY_SEND are reading and writing events from the fd. I don't know
that much about the poll interface though -- is it possible to
indicate "here's a pseudo-read event"? It didn't look like it, so I
just (ab-)used POLLIN and POLLOUT, but probably that's wrong.

Tycho

2020-10-01 00:29:26

by Jann Horn

Subject: Re: For review: seccomp_user_notif(2) manual page

On Thu, Oct 1, 2020 at 1:03 AM Tycho Andersen <[email protected]> wrote:
> On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
> > On 9/30/20 5:03 PM, Tycho Andersen wrote:
> > > On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> > >> ┌─────────────────────────────────────────────────────┐
> > >> │FIXME │
> > >> ├─────────────────────────────────────────────────────┤
> > >> │From my experiments, it appears that if a SEC‐ │
> > >> │COMP_IOCTL_NOTIF_RECV is done after the target │
> > >> │process terminates, then the ioctl() simply blocks │
> > >> │(rather than returning an error to indicate that the │
> > >> │target process no longer exists). │
> > >
> > > Yeah, I think Christian wanted to fix this at some point,
> >
> > Do you have a pointer that discussion? I could not find it with a
> > quick search.
> >
> > > but it's a
> > > bit sticky to do.
> >
> > Can you say a few words about the nature of the problem?
>
> I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
> notify about unused filter"). So maybe there's a bug here?

That thing only notifies on ->poll, it doesn't unblock ioctls; and
Michael's sample code uses SECCOMP_IOCTL_NOTIF_RECV to wait. So that
commit doesn't have any effect on this kind of usage.
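
For comparison, a sketch of a supervisor that waits in poll(2) rather
than in the ioctl. It assumes that, with that commit, poll() reports
POLLHUP on the listening descriptor once the filter has no remaining
users; the enum and function names are hypothetical.

```c
#include <errno.h>
#include <poll.h>

enum notifWait {
    NOTIF_ERROR = -1,           /* poll() failed or unexpected event */
    NOTIF_TIMEOUT = 0,          /* Nothing happened within the timeout */
    NOTIF_READY = 1,            /* A notification can be received */
    NOTIF_FILTER_UNUSED = 2     /* Assumed: no task uses the filter */
};

/* Hypothetical helper: wait until the listening descriptor has a
   pending notification. The timeout is restarted if poll() is
   interrupted by a signal handler. */
enum notifWait
waitForNotification(int notifyFd, int timeoutMs)
{
    struct pollfd pfd = { .fd = notifyFd, .events = POLLIN };
    int n;

    do {
        n = poll(&pfd, 1, timeoutMs);
    } while (n == -1 && errno == EINTR);

    if (n == -1)
        return NOTIF_ERROR;
    if (n == 0)
        return NOTIF_TIMEOUT;
    if (pfd.revents & POLLIN)
        return NOTIF_READY;
    if (pfd.revents & POLLHUP)
        return NOTIF_FILTER_UNUSED;
    return NOTIF_ERROR;
}
```

Only after NOTIF_READY would the supervisor call
SECCOMP_IOCTL_NOTIF_RECV, so it does not block in the ioctl after the
target is gone.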

2020-10-01 01:09:23

by Kees Cook

Subject: Re: For review: seccomp_user_notif(2) manual page

On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> [...] I did :-)

Yay! Thank you!

> [...]
> Overview
> In conventional usage of a seccomp filter, the decision about how
> to treat a particular system call is made by the filter itself.
> The user-space notification mechanism allows the handling of the
> system call to instead be handed off to a user-space process.
> The advantages of doing this are that, by contrast with the sec‐
> comp filter, which is running on a virtual machine inside the
> kernel, the user-space process has access to information that is
> unavailable to the seccomp filter and it can perform actions that
> can't be performed from the seccomp filter.

I might clarify a bit with something like (though maybe the
target/supervisor paragraph needs to be moved to the start):

This is used for performing syscalls on behalf of the target,
rather than having the supervisor make security policy decisions
about the syscall, which would be inherently race-prone. The
target's syscall should either be handled by the supervisor or
allowed to continue normally in the kernel (where standard security
policies will be applied).
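
In terms of the response structure, those two outcomes could be
sketched as below, assuming UAPI headers recent enough to define
SECCOMP_USER_NOTIF_FLAG_CONTINUE. Both function names are
hypothetical.

```c
#include <errno.h>
#include <linux/seccomp.h>
#include <string.h>

/* Hypothetical helper: the supervisor performed the system call on
   behalf of the target. 'err' is 0 on success or an errno value such
   as EEXIST; the kernel expects it negated in the 'error' field. */
void
fillEmulatedResponse(struct seccomp_notif_resp *resp, __u64 id,
                     long ret, int err)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    if (err != 0)
        resp->error = -err;
    else
        resp->val = ret;
}

/* Hypothetical helper: let the system call execute normally in the
   kernel. The 'error' and 'val' fields must be zero in this case,
   otherwise SECCOMP_IOCTL_NOTIF_SEND fails with EINVAL. */
void
fillContinueResponse(struct seccomp_notif_resp *resp, __u64 id)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}
```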

I'll comment more later, but I've run out of time today and I didn't see
anyone mention this detail yet in the existing threads... :)

--
Kees Cook

2020-10-01 01:11:04

by Tycho Andersen

Subject: Re: For review: seccomp_user_notif(2) manual page

On Thu, Oct 01, 2020 at 01:11:33AM +0200, Jann Horn wrote:
> On Thu, Oct 1, 2020 at 1:03 AM Tycho Andersen <[email protected]> wrote:
> > On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
> > > On 9/30/20 5:03 PM, Tycho Andersen wrote:
> > > > On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> > > >> ┌─────────────────────────────────────────────────────┐
> > > >> │FIXME │
> > > >> ├─────────────────────────────────────────────────────┤
> > > >> │From my experiments, it appears that if a SEC‐ │
> > > >> │COMP_IOCTL_NOTIF_RECV is done after the target │
> > > >> │process terminates, then the ioctl() simply blocks │
> > > >> │(rather than returning an error to indicate that the │
> > > >> │target process no longer exists). │
> > > >
> > > > Yeah, I think Christian wanted to fix this at some point,
> > >
> > > Do you have a pointer that discussion? I could not find it with a
> > > quick search.
> > >
> > > > but it's a
> > > > bit sticky to do.
> > >
> > > Can you say a few words about the nature of the problem?
> >
> > I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
> > notify about unused filter"). So maybe there's a bug here?
>
> That thing only notifies on ->poll, it doesn't unblock ioctls; and
> Michael's sample code uses SECCOMP_IOCTL_NOTIF_RECV to wait. So that
> commit doesn't have any effect on this kind of usage.

Yes, thanks. And the ones stuck in RECV are waiting on a semaphore so
we don't have a count of all of them, unfortunately.

We could maybe look inside the wait_list, but that will probably make
people angry :)

Tycho

2020-10-01 01:59:27

by Jann Horn

Subject: Re: For review: seccomp_user_notif(2) manual page

On Thu, Oct 1, 2020 at 1:25 AM Tycho Andersen <[email protected]> wrote:
> On Thu, Oct 01, 2020 at 01:11:33AM +0200, Jann Horn wrote:
> > On Thu, Oct 1, 2020 at 1:03 AM Tycho Andersen <[email protected]> wrote:
> > > On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
> > > > On 9/30/20 5:03 PM, Tycho Andersen wrote:
> > > > > On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> > > > >> ┌─────────────────────────────────────────────────────┐
> > > > >> │FIXME │
> > > > >> ├─────────────────────────────────────────────────────┤
> > > > >> │From my experiments, it appears that if a SEC‐ │
> > > > >> │COMP_IOCTL_NOTIF_RECV is done after the target │
> > > > >> │process terminates, then the ioctl() simply blocks │
> > > > >> │(rather than returning an error to indicate that the │
> > > > >> │target process no longer exists). │
> > > > >
> > > > > Yeah, I think Christian wanted to fix this at some point,
> > > >
> > > > Do you have a pointer that discussion? I could not find it with a
> > > > quick search.
> > > >
> > > > > but it's a
> > > > > bit sticky to do.
> > > >
> > > > Can you say a few words about the nature of the problem?
> > >
> > > I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
> > > notify about unused filter"). So maybe there's a bug here?
> >
> > That thing only notifies on ->poll, it doesn't unblock ioctls; and
> > Michael's sample code uses SECCOMP_IOCTL_NOTIF_RECV to wait. So that
> > commit doesn't have any effect on this kind of usage.
>
> Yes, thanks. And the ones stuck in RECV are waiting on a semaphore so
> we don't have a count of all of them, unfortunately.
>
> We could maybe look inside the wait_list, but that will probably make
> people angry :)

The easiest way would probably be to open-code the semaphore-ish part,
and let the semaphore and poll share the waitqueue. The current code
kind of mirrors the semaphore's waitqueue in the wqh - open-coding the
entire semaphore would IMO be cleaner than that. And it's not like
semaphore semantics are even a good fit for this code anyway.

Let's see... if we didn't have the existing UAPI to worry about, I'd
do it as follows (*completely* untested). That way, the ioctl would
block exactly until either there actually is a request to deliver or
there are no more users of the filter. The problem is that if we just
apply this patch, existing users of SECCOMP_IOCTL_NOTIF_RECV that use
an event loop and don't set O_NONBLOCK will be screwed. So we'd
probably also have to add some stupid counter in place of the
semaphore's counter that we can use to preserve the old behavior of
returning -ENOENT once for each cancelled request. :(

I guess this is a nice point in favor of Michael's usual complaint
that if there are no man pages for a feature by the time the feature
lands upstream, there's a higher chance that the UAPI will suck
forever...



diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 676d4af62103..f0f4c68e0bc6 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -138,7 +138,6 @@ struct seccomp_kaddfd {
* @notifications: A list of struct seccomp_knotif elements.
*/
struct notification {
- struct semaphore request;
u64 next_id;
struct list_head notifications;
};
@@ -859,7 +858,6 @@ static int seccomp_do_user_notification(int this_syscall,
list_add(&n.list, &match->notif->notifications);
INIT_LIST_HEAD(&n.addfd);

- up(&match->notif->request);
wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);
mutex_unlock(&match->notify_lock);

@@ -1175,9 +1173,10 @@ find_notification(struct seccomp_filter *filter, u64 id)


static long seccomp_notify_recv(struct seccomp_filter *filter,
- void __user *buf)
+ void __user *buf, bool blocking)
{
struct seccomp_knotif *knotif = NULL, *cur;
+ DECLARE_WAITQUEUE(wait, current);
struct seccomp_notif unotif;
ssize_t ret;

@@ -1190,11 +1189,9 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,

memset(&unotif, 0, sizeof(unotif));

- ret = down_interruptible(&filter->notif->request);
- if (ret < 0)
- return ret;
-
mutex_lock(&filter->notify_lock);
+
+retry:
list_for_each_entry(cur, &filter->notif->notifications, list) {
if (cur->state == SECCOMP_NOTIFY_INIT) {
knotif = cur;
@@ -1202,14 +1199,32 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
}
}

- /*
- * If we didn't find a notification, it could be that the task was
- * interrupted by a fatal signal between the time we were woken and
- * when we were able to acquire the rw lock.
- */
if (!knotif) {
- ret = -ENOENT;
- goto out;
+ /* This has to happen before checking &filter->users. */
+ prepare_to_wait(&filter->wqh, &wait, TASK_INTERRUPTIBLE);
+
+ /*
+ * If all users of the filter are gone, throw an error instead
+ * of pointlessly continuing to block.
+ */
+ if (refcount_read(&filter->users) == 0) {
+ ret = -ENOTCONN;
+ goto out;
+ }
+ if (blocking) {
+ /* No notifications pending - wait for one, then retry. */
+ mutex_unlock(&filter->notify_lock);
+ schedule();
+ mutex_lock(&filter->notify_lock);
+ if (signal_pending(current)) {
+ ret = -EINTR;
+ goto out;
+ }
+ goto retry;
+ } else {
+ ret = -ENOENT;
+ goto out;
+ }
}

unotif.id = knotif->id;
@@ -1220,6 +1235,7 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
wake_up_poll(&filter->wqh, EPOLLOUT | EPOLLWRNORM);
ret = 0;
out:
+ finish_wait(&filter->wqh, &wait);
mutex_unlock(&filter->notify_lock);

if (ret == 0 && copy_to_user(buf, &unotif, sizeof(unotif))) {
@@ -1233,10 +1249,8 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
*/
mutex_lock(&filter->notify_lock);
knotif = find_notification(filter, unotif.id);
- if (knotif) {
+ if (knotif)
knotif->state = SECCOMP_NOTIFY_INIT;
- up(&filter->notif->request);
- }
mutex_unlock(&filter->notify_lock);
}

@@ -1412,11 +1426,12 @@ static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
{
struct seccomp_filter *filter = file->private_data;
void __user *buf = (void __user *)arg;
+ bool blocking = !(file->f_flags & O_NONBLOCK);

/* Fixed-size ioctls */
switch (cmd) {
case SECCOMP_IOCTL_NOTIF_RECV:
- return seccomp_notify_recv(filter, buf);
+ return seccomp_notify_recv(filter, buf, blocking);
case SECCOMP_IOCTL_NOTIF_SEND:
return seccomp_notify_send(filter, buf);
case SECCOMP_IOCTL_NOTIF_ID_VALID_WRONG_DIR:
@@ -1485,7 +1500,6 @@ static struct file *init_listener(struct seccomp_filter *filter)
if (!filter->notif)
goto out;

- sema_init(&filter->notif->request, 0);
filter->notif->next_id = get_random_u64();
INIT_LIST_HEAD(&filter->notif->notifications);

2020-10-01 02:17:44

by Jann Horn

Subject: Re: For review: seccomp_user_notif(2) manual page

On Thu, Oct 1, 2020 at 3:52 AM Jann Horn <[email protected]> wrote:
> On Thu, Oct 1, 2020 at 1:25 AM Tycho Andersen <[email protected]> wrote:
> > On Thu, Oct 01, 2020 at 01:11:33AM +0200, Jann Horn wrote:
> > > On Thu, Oct 1, 2020 at 1:03 AM Tycho Andersen <[email protected]> wrote:
> > > > On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
> > > > > On 9/30/20 5:03 PM, Tycho Andersen wrote:
> > > > > > On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> > > > > >> ┌─────────────────────────────────────────────────────┐
> > > > > >> │FIXME │
> > > > > >> ├─────────────────────────────────────────────────────┤
> > > > > >> │From my experiments, it appears that if a SEC‐ │
> > > > > >> │COMP_IOCTL_NOTIF_RECV is done after the target │
> > > > > >> │process terminates, then the ioctl() simply blocks │
> > > > > >> │(rather than returning an error to indicate that the │
> > > > > >> │target process no longer exists). │
> > > > > >
> > > > > > Yeah, I think Christian wanted to fix this at some point,
> > > > >
> > > > > Do you have a pointer that discussion? I could not find it with a
> > > > > quick search.
> > > > >
> > > > > > but it's a
> > > > > > bit sticky to do.
> > > > >
> > > > > Can you say a few words about the nature of the problem?
> > > >
> > > > I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
> > > > notify about unused filter"). So maybe there's a bug here?
> > >
> > > That thing only notifies on ->poll, it doesn't unblock ioctls; and
> > > Michael's sample code uses SECCOMP_IOCTL_NOTIF_RECV to wait. So that
> > > commit doesn't have any effect on this kind of usage.
> >
> > Yes, thanks. And the ones stuck in RECV are waiting on a semaphore so
> > we don't have a count of all of them, unfortunately.
> >
> > We could maybe look inside the wait_list, but that will probably make
> > people angry :)
>
> The easiest way would probably be to open-code the semaphore-ish part,
> and let the semaphore and poll share the waitqueue. The current code
> kind of mirrors the semaphore's waitqueue in the wqh - open-coding the
> entire semaphore would IMO be cleaner than that. And it's not like
> semaphore semantics are even a good fit for this code anyway.
>
> Let's see... if we didn't have the existing UAPI to worry about, I'd
> do it as follows (*completely* untested). That way, the ioctl would
> block exactly until either there actually is a request to deliver or
> there are no more users of the filter. The problem is that if we just
> apply this patch, existing users of SECCOMP_IOCTL_NOTIF_RECV that use
> an event loop and don't set O_NONBLOCK will be screwed. So we'd
> probably also have to add some stupid counter in place of the
> semaphore's counter that we can use to preserve the old behavior of
> returning -ENOENT once for each cancelled request. :(
>
> I guess this is a nice point in favor of Michael's usual complaint
> that if there are no man pages for a feature by the time the feature
> lands upstream, there's a higher chance that the UAPI will suck
> forever...

And I guess this would be the UAPI-compatible version - not actually
as terrible as I thought it might be. Do y'all want this? If so, feel
free to either turn this into a proper patch with Co-developed-by, or
tell me that I should do it and I'll try to get around to turning it
into something proper.

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 676d4af62103..d08c453fcc2c 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -138,7 +138,7 @@ struct seccomp_kaddfd {
* @notifications: A list of struct seccomp_knotif elements.
*/
struct notification {
- struct semaphore request;
+ bool canceled_reqs;
u64 next_id;
struct list_head notifications;
};
@@ -859,7 +859,6 @@ static int seccomp_do_user_notification(int this_syscall,
list_add(&n.list, &match->notif->notifications);
INIT_LIST_HEAD(&n.addfd);

- up(&match->notif->request);
wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);
mutex_unlock(&match->notify_lock);

@@ -901,8 +900,20 @@ static int seccomp_do_user_notification(int this_syscall,
* *reattach* to a notifier right now. If one is added, we'll need to
* keep track of the notif itself and make sure they match here.
*/
- if (match->notif)
+ if (match->notif) {
list_del(&n.list);
+
+ /*
+ * We are stuck with a UAPI that requires that after a spurious
+ * wakeup, SECCOMP_IOCTL_NOTIF_RECV must return immediately.
+ * This is the tracking for that, keeping track of whether we
+ * canceled a request after waking waiters, but before userspace
+ * picked up the notification.
+ */
+ if (n.state == SECCOMP_NOTIFY_INIT)
+ match->notif->canceled_reqs = true;
+ }
+
out:
mutex_unlock(&match->notify_lock);

@@ -1178,6 +1189,7 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
void __user *buf)
{
struct seccomp_knotif *knotif = NULL, *cur;
+ DECLARE_WAITQUEUE(wait, current);
struct seccomp_notif unotif;
ssize_t ret;

@@ -1190,11 +1202,9 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,

memset(&unotif, 0, sizeof(unotif));

- ret = down_interruptible(&filter->notif->request);
- if (ret < 0)
- return ret;
-
mutex_lock(&filter->notify_lock);
+
+retry:
list_for_each_entry(cur, &filter->notif->notifications, list) {
if (cur->state == SECCOMP_NOTIFY_INIT) {
knotif = cur;
@@ -1202,14 +1212,32 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
}
}

- /*
- * If we didn't find a notification, it could be that the task was
- * interrupted by a fatal signal between the time we were woken and
- * when we were able to acquire the rw lock.
- */
if (!knotif) {
- ret = -ENOENT;
- goto out;
+ /* This has to happen before checking &filter->users. */
+ prepare_to_wait(&filter->wqh, &wait, TASK_INTERRUPTIBLE);
+
+ /*
+ * If all users of the filter are gone, throw an error instead
+ * of pointlessly continuing to block.
+ */
+ if (refcount_read(&filter->users) == 0) {
+ ret = -ENOTCONN;
+ goto out;
+ }
+ if (filter->notif->canceled_reqs) {
+ ret = -ENOENT;
+ goto out;
+ } else {
+ /* No notifications pending - wait for one, then retry. */
+ mutex_unlock(&filter->notify_lock);
+ schedule();
+ mutex_lock(&filter->notify_lock);
+ if (signal_pending(current)) {
+ ret = -EINTR;
+ goto out;
+ }
+ goto retry;
+ }
}

unotif.id = knotif->id;
@@ -1220,6 +1248,8 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
wake_up_poll(&filter->wqh, EPOLLOUT | EPOLLWRNORM);
ret = 0;
out:
+ filter->notif->canceled_reqs = false;
+ finish_wait(&filter->wqh, &wait);
mutex_unlock(&filter->notify_lock);

if (ret == 0 && copy_to_user(buf, &unotif, sizeof(unotif))) {
@@ -1233,10 +1263,8 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
*/
mutex_lock(&filter->notify_lock);
knotif = find_notification(filter, unotif.id);
- if (knotif) {
+ if (knotif)
knotif->state = SECCOMP_NOTIFY_INIT;
- up(&filter->notif->request);
- }
mutex_unlock(&filter->notify_lock);
}

@@ -1485,7 +1513,6 @@ static struct file *init_listener(struct seccomp_filter *filter)
if (!filter->notif)
goto out;

- sema_init(&filter->notif->request, 0);
filter->notif->next_id = get_random_u64();
INIT_LIST_HEAD(&filter->notif->notifications);

Subject: Re: For review: seccomp_user_notif(2) manual page

On 10/1/20 1:03 AM, Tycho Andersen wrote:
> On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
>> Hi Tycho,
>>
>> Thanks for taking time to look at the page!
>>
>> On 9/30/20 5:03 PM, Tycho Andersen wrote:
>>> On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:

[...]

>>>> ┌─────────────────────────────────────────────────────┐
>>>> │FIXME │
>>>> ├─────────────────────────────────────────────────────┤
>>>> │Interestingly, after the event had been received, │
>>>> │the file descriptor indicates as writable (verified │
>>>> │from the source code and by experiment). How is this │
>>>> │useful? │
>>>
>>> You're saying it should just do EPOLLOUT and not EPOLLWRNORM? Seems
>>> reasonable.
>>
>> No, I'm saying something more fundamental: why is the FD indicating as
>> writable? Can you write something to it? If yes, what? If not, then
>> why do these APIs want to say that the FD is writable?
>
> You can't via read(2) or write(2), but conceptually NOTIFY_RECV and
> NOTIFY_SEND are reading and writing events from the fd. I don't know
> that much about the poll interface though -- is it possible to
> indicate "here's a pseudo-read event"? It didn't look like it, so I
> just (ab-)used POLLIN and POLLOUT, but probably that's wrong.

I think the POLLIN thing is fine.

So, I think maybe I now understand what you intended with setting
POLLOUT: the notification has been received ("read") and now the
FD can be used to NOTIFY_SEND ("write") a response. Right?

If that's correct, I don't have a problem with it. I just wonder:
is it useful? IOW: are there situations where the process doing the
NOTIFY_SEND might want to test for POLLOUT because it doesn't
know whether a NOTIFY_RECV has occurred?

Thanks,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Subject: Re: For review: seccomp_user_notif(2) manual page

On 10/1/20 3:52 AM, Jann Horn wrote:

[...]

> I guess this is a nice point in favor of Michael's usual complaint
> that if there are no man pages for a feature by the time the feature
> lands upstream, there's a higher chance that the UAPI will suck
> forever...

Thanks for saving me the trouble of saying that (again).

Cheers,

Michael


2020-10-01 12:40:19

by Christian Brauner

Subject: Re: For review: seccomp_user_notif(2) manual page

[I'm on vacation so I'll just give this a quick glance for now.]

On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> Hi Tycho, Sargun (and all),
>
> I knew it would be a big ask, but below is kind of the manual page
> I was hoping you might write [1] for the seccomp user-space notification
> mechanism. Since you didn't (and because 5.9 adds various new pieces
> such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
> that also will need documenting [2]), I did :-). But of course I may
> have made mistakes...
>
> I've shown the rendered version of the page below, and would love
> to receive review comments from you and others, and acks, etc.
>
> There are a few FIXMEs sprinkled into the page, including one
> that relates to what appears to me to be a misdesign (possibly
> fixable) in the operation of the SECCOMP_IOCTL_NOTIF_RECV
> operation. I would be especially interested in feedback on that
> FIXME, and also of course the other FIXMEs.
>
> The page includes an extensive (albeit slightly contrived)
> example program, and I would be happy also to receive comments
> on that program.
>
> The page source currently sits in a branch (along with the text
> that you sent me for the seccomp(2) page) at
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif
>
> Thanks,
>
> Michael
>
> [1] https://lore.kernel.org/linux-man/[email protected]/#t
> [2] Sargun, can you prepare something on SECCOMP_ADDFD_FLAG_SETFD
> and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page?
>
> =====
>
> NAME
> seccomp_user_notif - Seccomp user-space notification mechanism
>
> SYNOPSIS
> #include <linux/seccomp.h>
> #include <linux/filter.h>
> #include <linux/audit.h>
>
> int seccomp(unsigned int operation, unsigned int flags, void *args);
>
> DESCRIPTION
> This page describes the user-space notification mechanism pro‐
> vided by the Secure Computing (seccomp) facility. As well as the
> use of the SECCOMP_FILTER_FLAG_NEW_LISTENER flag, the SEC‐
> COMP_RET_USER_NOTIF action value, and the SECCOMP_GET_NOTIF_SIZES
> operation described in seccomp(2), this mechanism involves the
> use of a number of related ioctl(2) operations (described below).
>
> Overview
> In conventional usage of a seccomp filter, the decision about how
> to treat a particular system call is made by the filter itself.
> The user-space notification mechanism allows the handling of the
> system call to instead be handed off to a user-space process.

"In contrast, the user notification mechanism allows the handling of
a system call made by one process (the target) to be delegated to
another user-space process (the supervisor)."?

> The advantages of doing this are that, by contrast with the sec‐
> comp filter, which is running on a virtual machine inside the
> kernel, the user-space process has access to information that is
> unavailable to the seccomp filter and it can perform actions that
> can't be performed from the seccomp filter.

This section is a bit difficult to read imho:
"A suitably privileged supervisor can use the user notification
mechanism to perform actions in lieu of the target. The supervisor will
usually be able to retrieve information about the target and the
performed system call that the seccomp filter itself cannot."

>
> In the discussion that follows, the process that has installed
> the seccomp filter is referred to as the target, and the process
> that is notified by the user-space notification mechanism is
> referred to as the supervisor. An overview of the steps per‐
> formed by these two processes is as follows:
>
> 1. The target process establishes a seccomp filter in the usual
> manner, but with two differences:
>
> · The seccomp(2) flags argument includes the flag SECCOMP_FIL‐
> TER_FLAG_NEW_LISTENER. Consequently, the return value of
> the (successful) seccomp(2) call is a new "listening" file
> descriptor that can be used to receive notifications.

I think it would be good to mention that seccomp notify fds are
O_CLOEXEC by default somewhere.

>
> · In cases where it is appropriate, the seccomp filter returns
> the action value SECCOMP_RET_USER_NOTIF. This return value
> will trigger a notification event.
>
> 2. In order that the supervisor process can obtain notifications
> using the listening file descriptor, (a duplicate of) that
> file descriptor must be passed from the target process to the
> supervisor process. One way in which this could be done is by
> passing the file descriptor over a UNIX domain socket connec‐
> tion between the two processes (using the SCM_RIGHTS ancillary
> message type described in unix(7)). Another possibility is
> that the supervisor might inherit the file descriptor via
> fork(2).

I think a few people have already pointed out other ways of retrieving
an fd. :)
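
For reference, the SCM_RIGHTS route mentioned in the quoted step could look roughly like the sketch below. The helper names send_fd() and recv_fd() are invented for this illustration, and the socket is assumed to be an already-connected UNIX domain socket; this is not code from the page's example program.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control-message buffer, aligned suitably for a struct cmsghdr. */
union fd_ctl {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send descriptor 'fd' over the connected UNIX domain socket 'sock'.
   Returns 0 on success, -1 on error. (Hypothetical helper.) */
static int send_fd(int sock, int fd)
{
    char dummy = 'x';   /* One byte of ordinary data travels with the fd */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_ctl ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(sock >= 0 && fd >= 0);
    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from 'sock'. Returns the new descriptor,
   or -1 if none could be received. (Hypothetical helper.) */
static int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_ctl ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    assert(sock >= 0);
    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

In the target/supervisor setup, the target would call send_fd() with the listening descriptor returned by seccomp(2), and the supervisor would call recv_fd() on its end of the socket.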

>
> 3. The supervisor process will receive notification events on the
> listening file descriptor. These events are returned as
> structures of type seccomp_notif. Because this structure and
> its size may evolve over kernel versions, the supervisor must
> first determine the size of this structure using the sec‐
> comp(2) SECCOMP_GET_NOTIF_SIZES operation, which returns a
> structure of type seccomp_notif_sizes. The supervisor allo‐
> cates a buffer of size seccomp_notif_sizes.seccomp_notif bytes
> to receive notification events. In addition, the supervisor
> allocates another buffer of size seccomp_notif_sizes.sec‐
> comp_notif_resp bytes for the response (a struct sec‐
> comp_notif_resp structure) that it will provide to the kernel
> (and thus the target process).
>
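
A rough sketch of the size discovery and buffer allocation described in step 3 follows. The struct notif_bufs type and the alloc_notif_bufs() helper are invented for this illustration; the sketch also assumes that it is sensible never to allocate less than the structure sizes the program was compiled against.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/seccomp.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical container for the two buffers the supervisor needs. */
struct notif_bufs {
    struct seccomp_notif *req;          /* Filled by NOTIF_RECV */
    struct seccomp_notif_resp *resp;    /* Passed to NOTIF_SEND */
    size_t req_size;
    size_t resp_size;
};

/* Ask the kernel for the structure sizes and allocate zeroed buffers.
   Returns 0 on success, -1 on failure. */
static int alloc_notif_bufs(struct notif_bufs *b)
{
    struct seccomp_notif_sizes sizes;

    assert(b != NULL);

    /* glibc has no seccomp() wrapper, so use syscall(2) directly. */
    if (syscall(SYS_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, &sizes) == -1)
        return -1;

    /* Assumption: use the larger of the kernel's size and our
       compile-time size, so field accesses stay within the buffer. */
    b->req_size = sizes.seccomp_notif;
    if (b->req_size < sizeof(struct seccomp_notif))
        b->req_size = sizeof(struct seccomp_notif);

    b->resp_size = sizes.seccomp_notif_resp;
    if (b->resp_size < sizeof(struct seccomp_notif_resp))
        b->resp_size = sizeof(struct seccomp_notif_resp);

    b->req = calloc(1, b->req_size);
    b->resp = calloc(1, b->resp_size);
    if (b->req == NULL || b->resp == NULL) {
        free(b->req);
        free(b->resp);
        return -1;
    }
    return 0;
}
```

Because calloc(3) zeroes the memory, the request buffer satisfies the requirement that the structure be zeroed before the first SECCOMP_IOCTL_NOTIF_RECV; it would need to be cleared again before each later call.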
> 4. The target process then performs its workload, which includes
> system calls that will be controlled by the seccomp filter.
> Whenever one of these system calls causes the filter to return
> the SECCOMP_RET_USER_NOTIF action value, the kernel does not
> execute the system call; instead, execution of the target
> process is temporarily blocked inside the kernel and a notifi‐

Maybe mention that the task is killable when so blocked?

> cation event is generated on the listening file descriptor.
>
> 5. The supervisor process can now repeatedly monitor the listen‐
> ing file descriptor for SECCOMP_RET_USER_NOTIF-triggered
> events. To do this, the supervisor uses the SEC‐
> COMP_IOCTL_NOTIF_RECV ioctl(2) operation to read information
> about a notification event; this operation blocks until an
> event is available. The operation returns a seccomp_notif
> structure containing information about the system call that is
> being attempted by the target process.
>
> 6. The seccomp_notif structure returned by the SEC‐
> COMP_IOCTL_NOTIF_RECV operation includes the same information
> (a seccomp_data structure) that was passed to the seccomp fil‐
> ter. This information allows the supervisor to discover the
> system call number and the arguments for the target process's
> system call. In addition, the notification event contains the
> PID of the target process.

(Technically TID.)

>
> The information in the notification can be used to discover
> the values of pointer arguments for the target process's sys‐
> tem call. (This is something that can't be done from within a
> seccomp filter.) To do this (and assuming it has suitable
> permissions), the supervisor opens the corresponding
> /proc/[pid]/mem file, seeks to the memory location that corre‐
> sponds to one of the pointer arguments whose value is supplied
> in the notification event, and reads bytes from that location.
> (The supervisor must be careful to avoid a race condition that
> can occur when doing this; see the description of the SEC‐
> COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.) In addi‐
> tion, the supervisor can access other system information that
> is visible in user space but which is not accessible from a
> seccomp filter.
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME │
> ├─────────────────────────────────────────────────────┤
> │Suppose we are reading a pathname from /proc/PID/mem │
> │for a system call such as mkdir(). The pathname can │
> │be an arbitrary length. How do we know how much (how │
> │many pages) to read from /proc/PID/mem? │
> └─────────────────────────────────────────────────────┘

This has already been answered, I believe.
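
For readers of this message alone, one possible approach (an assumption on my side, not necessarily the answer given elsewhere in the thread) is to read up to PATH_MAX bytes in one go and accept the result only if a terminating NUL byte is among the bytes actually read. The helper name read_target_string() is invented for this sketch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Read a NUL-terminated string located at address 'addr' in the memory
   of process 'pid' into 'buf' (capacity 'bufsz' bytes). Returns the
   string length, or -1 if the memory could not be read or no NUL byte
   was found in the bytes that were read. (Hypothetical helper.) */
static ssize_t read_target_string(pid_t pid, unsigned long long addr,
                                  char *buf, size_t bufsz)
{
    char path[64];
    ssize_t n;
    int fd;

    assert(buf != NULL && bufsz > 0);

    snprintf(path, sizeof(path), "/proc/%ld/mem", (long) pid);
    fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd == -1)
        return -1;

    /* The read may return fewer than 'bufsz' bytes, for example if the
       string lies near the end of a mapping. That is acceptable as
       long as the terminator is within what was returned. */
    n = pread(fd, buf, bufsz, (off_t) addr);
    close(fd);
    if (n <= 0)
        return -1;

    if (memchr(buf, '\0', (size_t) n) == NULL)
        return -1;      /* Unterminated or too long: reject it */

    return (ssize_t) strlen(buf);
}
```

A supervisor would typically pass a buffer of PATH_MAX bytes and the pointer argument taken from the notification's seccomp_data. The notification cookie still has to be validated after opening the file, as discussed further down.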

>
> 7. Having obtained information as per the previous step, the
> supervisor may then choose to perform an action in response to
> the target process's system call (which, as noted above, is
> not executed when the seccomp filter returns the SEC‐
> COMP_RET_USER_NOTIF action value).

Nit: It is not _yet_ executed; it may very well be if the response is
"continue". This should either mention that when the fd becomes
_RECVable the system call is guaranteed to not have executed yet or
specify that it is not yet executed, I think.

>
> One example use case here relates to containers. The target
> process may be located inside a container where it does not
> have sufficient capabilities to mount a filesystem in the con‐
> tainer's mount namespace. However, the supervisor may be a
> more privileged process that does have sufficient capabilities
> to perform the mount operation.
>
> 8. The supervisor then sends a response to the notification. The
> information in this response is used by the kernel to con‐
> struct a return value for the target process's system call and
> provide a value that will be assigned to the errno variable of
> the target process.
>
> The response is sent using the SECCOMP_IOCTL_NOTIF_SEND
> ioctl(2) operation, which is used to transmit a sec‐
> comp_notif_resp structure to the kernel. This structure
> includes a cookie value that the supervisor obtained in the
> seccomp_notif structure returned by the SEC‐
> COMP_IOCTL_NOTIF_RECV operation. This cookie value allows the
> kernel to associate the response with the target process.

I think here or above you should mention that the id or "cookie" _must_
be used when a file descriptor to /proc/<pid>/mem or any /proc/<pid>/*
is opened:
fd = open(/proc/pid/*);
verify_via_cookie_that_pid_still_alive(cookie);
operate_on(fd)

Otherwise this is a potential security issue.
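
Spelled out in C, the open-then-verify ordering from the pseudocode above might look like the following sketch. The helper name open_target_mem() is invented here; notify_fd is assumed to be the listening descriptor and req a notification previously filled in by SECCOMP_IOCTL_NOTIF_RECV.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/seccomp.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Open /proc/<pid>/mem for the task named in 'req', then confirm via
   the notification cookie that the notification is still valid.
   Returns an open descriptor, or -1 if the file could not be opened
   or the cookie check failed. (Hypothetical helper.) */
static int open_target_mem(int notify_fd, const struct seccomp_notif *req)
{
    char path[64];
    __u64 id;
    int fd;

    assert(req != NULL);
    id = req->id;

    snprintf(path, sizeof(path), "/proc/%u/mem", req->pid);
    fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd == -1)
        return -1;

    /* The check must come AFTER the open: if the cookie is still
       valid at this point, the file we opened belongs to the task
       that triggered the notification, not to a recycled PID. */
    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1) {
        close(fd);
        return -1;
    }

    return fd;      /* Safe to operate on */
}
```

Only the descriptor returned by a successful call should be used for reading the target's memory; on failure, the supervisor should drop the notification rather than retry with the same PID.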

>
> 9. Once the notification has been sent, the system call in the
> target process unblocks, returning the information that was
> provided by the supervisor in the notification response.
>
> As a variation on the last two steps, the supervisor can send a
> response that tells the kernel that it should execute the target
> process's system call; see the discussion of SEC‐
> COMP_USER_NOTIF_FLAG_CONTINUE, below.
>
> ioctl(2) operations
> The following ioctl(2) operations are provided to support seccomp
> user-space notification. For each of these operations, the first
> (file descriptor) argument of ioctl(2) is the listening file
> descriptor returned by a call to seccomp(2) with the SECCOMP_FIL‐
> TER_FLAG_NEW_LISTENER flag.
>
> SECCOMP_IOCTL_NOTIF_RECV
> This operation is used to obtain a user-space notification
> event. If no such event is currently pending, the opera‐
> tion blocks until an event occurs. The third ioctl(2)
> argument is a pointer to a structure of the following form
> which contains information about the event. This struc‐
> ture must be zeroed out before the call.
>
> struct seccomp_notif {
> __u64 id; /* Cookie */
> __u32 pid; /* PID of target process */
> __u32 flags; /* Currently unused (0) */
> struct seccomp_data data; /* See seccomp(2) */
> };
>
> The fields in this structure are as follows:
>
> id This is a cookie for the notification. Each such
> cookie is guaranteed to be unique for the corre‐
> sponding seccomp filter. In other words, this
> cookie is unique for each notification event from
> the target process. The cookie value has the fol‐
> lowing uses:
>
> · It can be used with the SEC‐
> COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation to
> verify that the target process is still alive.
>
> · When returning a notification response to the
> kernel, the supervisor must include the cookie
> value in the seccomp_notif_resp structure that is
> specified as the argument of the SEC‐
> COMP_IOCTL_NOTIF_SEND operation.
>
> pid This is the PID of the target process that trig‐
> gered the notification event.
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME │
> ├─────────────────────────────────────────────────────┤
> │This is a thread ID, rather than a PID, right? │
> └─────────────────────────────────────────────────────┘

Yes.

>
> flags This is a bit mask of flags providing further
> information on the event. In the current implemen‐
> tation, this field is always zero.
>
> data This is a seccomp_data structure containing infor‐
> mation about the system call that triggered the
> notification. This is the same structure that is
> passed to the seccomp filter. See seccomp(2) for
> details of this structure.
>
> On success, this operation returns 0; on failure, -1 is
> returned, and errno is set to indicate the cause of the
> error. This operation can fail with the following errors:
>
> EINVAL (since Linux 5.5)
> The seccomp_notif structure that was passed to the
> call contained nonzero fields.
>
> ENOENT The target process was killed by a signal as the
> notification information was being generated.
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME │
> ├─────────────────────────────────────────────────────┤
> │From my experiments, it appears that if a SEC‐ │
> │COMP_IOCTL_NOTIF_RECV is done after the target │
> │process terminates, then the ioctl() simply blocks │
> │(rather than returning an error to indicate that the │
> │target process no longer exists). │
> │ │
> │I found that surprising, and it required some con‐ │
> │tortions in the example program. It was not possi‐ │
> │ble to code my SIGCHLD handler (which reaps the zom‐ │
> │bie when the worker/target process terminates) to │
> │simply set a flag checked in the main handleNotifi‐ │
> │cations() loop, since this created an unavoidable │
> │race where the child might terminate just after I │
> │had checked the flag, but before I blocked (for‐ │
> │ever!) in the SECCOMP_IOCTL_NOTIF_RECV operation. │
> │Instead, I had to code the signal handler to simply │
> │call _exit(2) in order to terminate the parent │
> │process (the supervisor). │
> │ │
> │Is this expected behavior? It seems to me rather │
> │desirable that SECCOMP_IOCTL_NOTIF_RECV should give │
> │an error if the target process has terminated. │
> └─────────────────────────────────────────────────────┘

This has been discussed later in the thread too, I believe. My patchset
fixed a different but related bug in ->poll() when a filter becomes
unused. I hadn't noticed this behavior since I'm always polling. (Pure
ioctls() feel a bit fishy to me. :) But obviously a valid use.)

>
> SECCOMP_IOCTL_NOTIF_ID_VALID
> This operation can be used to check that a notification ID
> returned by an earlier SECCOMP_IOCTL_NOTIF_RECV operation
> is still valid (i.e., that the target process still
> exists).
>
> The third ioctl(2) argument is a pointer to the cookie
> (id) returned by the SECCOMP_IOCTL_NOTIF_RECV operation.
>
> This operation is necessary to avoid race conditions that
> can occur when the pid returned by the SEC‐
> COMP_IOCTL_NOTIF_RECV operation terminates, and that
> process ID is reused by another process. An example of
> this kind of race is the following:
>
> 1. A notification is generated on the listening file
> descriptor. The returned seccomp_notif contains the
> PID of the target process.
>
> 2. The target process terminates.
>
> 3. Another process is created on the system that by chance
> reuses the PID that was freed when the target process
> terminates.
>
> 4. The supervisor open(2)s the /proc/[pid]/mem file for
> the PID obtained in step 1, with the intention of (say)
> inspecting the memory locations that contain the arguments
> of the system call that triggered the notification in
> step 1.
>
> In the above scenario, the risk is that the supervisor may
> try to access the memory of a process other than the tar‐
> get. This race can be avoided by following the call to
> open with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to ver‐
> ify that the process that generated the notification is
> still alive. (Note that if the target process subse‐
> quently terminates, its PID won't be reused because there
> remains an open reference to the /proc/[pid]/mem file; in
> this case, a subsequent read(2) from the file will return
> 0, indicating end of file.)
>
> On success (i.e., the notification ID is still valid),
> this operation returns 0 On failure (i.e., the notifica‐

Missing a ".", I think.

> tion ID is no longer valid), -1 is returned, and errno is
> set to ENOENT.
>
> SECCOMP_IOCTL_NOTIF_SEND
> This operation is used to send a notification response
> back to the kernel. The third ioctl(2) argument of this
> operation is a pointer to a structure of the following
> form:
>
> struct seccomp_notif_resp {
> __u64 id; /* Cookie value */
> __s64 val; /* Success return value */
> __s32 error; /* 0 (success) or negative
> error number */
> __u32 flags; /* See below */
> };
>
> The fields of this structure are as follows:
>
> id This is the cookie value that was obtained using
> the SECCOMP_IOCTL_NOTIF_RECV operation. This
> cookie value allows the kernel to correctly asso‐
> ciate this response with the system call that trig‐
> gered the user-space notification.
>
> val This is the value that will be used for a spoofed
> success return for the target process's system
> call; see below.
>
> error This is the value that will be used as the error
> number (errno) for a spoofed error return for the
> target process's system call; see below.

Nit: "val" is only used when "error" is not set.
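
To make the relationship between the fields concrete, here is a small sketch of how a supervisor might fill in the response for the three cases. The helper names are invented for this illustration, and the fallback definition of SECCOMP_USER_NOTIF_FLAG_CONTINUE is only there for older headers.

```c
#include <assert.h>
#include <errno.h>
#include <linux/seccomp.h>
#include <string.h>

#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE
#define SECCOMP_USER_NOTIF_FLAG_CONTINUE (1UL << 0)
#endif

/* Spoofed success: error is 0, so the kernel uses val as the
   system call's return value. (Hypothetical helper.) */
static void resp_spoof_success(struct seccomp_notif_resp *resp,
                               __u64 id, __s64 val)
{
    assert(resp != NULL);
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->val = val;
    resp->error = 0;
}

/* Spoofed failure: error holds the NEGATED errno value; val is
   ignored in this case, so it is left as 0. (Hypothetical helper.) */
static void resp_spoof_error(struct seccomp_notif_resp *resp,
                             __u64 id, int err)
{
    assert(resp != NULL && err > 0);
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->error = -err;
}

/* Continue: let the kernel execute the system call; error and val
   must both be zero. (Hypothetical helper.) */
static void resp_continue(struct seccomp_notif_resp *resp, __u64 id)
{
    assert(resp != NULL);
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}
```

The filled-in structure would then be handed to the kernel with SECCOMP_IOCTL_NOTIF_SEND on the listening descriptor.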

>
> flags This is a bit mask that includes zero or more of
> the following flags:
>
> SECCOMP_USER_NOTIF_FLAG_CONTINUE (since Linux 5.5)
> Tell the kernel to execute the target
> process's system call.
>
> Two kinds of response are possible:
>
> · A response to the kernel telling it to execute the tar‐
> get process's system call. In this case, the flags
> field includes SECCOMP_USER_NOTIF_FLAG_CONTINUE and the
> error and val fields must be zero.
>
> This kind of response can be useful in cases where the
> supervisor needs to do deeper analysis of the target's
> system call than is possible from a seccomp filter
> (e.g., examining the values of pointer arguments), and,
> having verified that the system call is acceptable, the
> supervisor wants to allow it to proceed.

I think Jann has pointed this out. This needs to come with a big warning
and I would explicitly put a:
"The user notification mechanism cannot be used to implement a syscall
security policy in user space!"
You might want to take a look at the seccomp.h header file where I
placed a giant warning about how to use this too.

>
> · A spoofed return value for the target process's system
> call. In this case, the kernel does not execute the
> target process's system call, instead causing the system
> call to return a spoofed value as specified by fields of
> the seccomp_notif_resp structure. The supervisor should
> set the fields of this structure as follows:
>
> + flags does not contain SECCOMP_USER_NOTIF_FLAG_CON‐
> TINUE.
>
> + error is set either to 0 for a spoofed "success"
> return or to a negative error number for a spoofed
> "failure" return. In the former case, the kernel
> causes the target process's system call to return the
> value specified in the val field. In the latter case,
> the kernel causes the target process's system call to
> return -1, and errno is assigned the negated error
> value.
>
> + val is set to a value that will be used as the return
> value for a spoofed "success" return for the target
> process's system call. The value in this field is
> ignored if the error field contains a nonzero value.
>
> On success, this operation returns 0; on failure, -1 is
> returned, and errno is set to indicate the cause of the
> error. This operation can fail with the following errors:
>
> EINPROGRESS
> A response to this notification has already been
> sent.
>
> EINVAL An invalid value was specified in the flags field.
>
> EINVAL The flags field contained SEC‐
> COMP_USER_NOTIF_FLAG_CONTINUE, and the error or val
> field was not zero.
>
> ENOENT The blocked system call in the target process has
> been interrupted by a signal handler.
>
> NOTES
> The file descriptor returned when seccomp(2) is employed with the
> SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using
> poll(2), epoll(7), and select(2). When a notification is pend‐
> ing, these interfaces indicate that the file descriptor is read‐
> able.

This should also note that when a filter becomes unused, i.e. the last
task using that filter in its filter hierarchy is dead (been
reaped/autoreaped) ->poll() will notify with (E)POLLHUP.
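
A polling supervisor could distinguish the two conditions roughly as in the sketch below. The enum and the wait_listener() helper are invented for this illustration; the code is plain poll(2) usage and works on any pollable descriptor.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical result codes for wait_listener(). */
enum listener_state {
    LISTENER_ERROR = -1,
    LISTENER_TIMEOUT = 0,
    LISTENER_READABLE = 1,  /* A notification can be received */
    LISTENER_HUP = 2        /* The filter is no longer in use */
};

/* Wait up to 'timeout_ms' milliseconds (-1 means forever) for the
   listening descriptor to become readable or to hang up. */
static enum listener_state wait_listener(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n;

    assert(fd >= 0);

    n = poll(&pfd, 1, timeout_ms);
    if (n == -1)
        return LISTENER_ERROR;
    if (n == 0)
        return LISTENER_TIMEOUT;

    /* Report pending input first; POLLHUP is always reported by
       poll(2) and does not need to be requested in 'events'. */
    if (pfd.revents & POLLIN)
        return LISTENER_READABLE;
    if (pfd.revents & POLLHUP)
        return LISTENER_HUP;

    return LISTENER_ERROR;
}
```

With this shape, the supervisor's main loop would call SECCOMP_IOCTL_NOTIF_RECV only after LISTENER_READABLE and exit cleanly on LISTENER_HUP, which avoids blocking forever once the target is gone.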

>
> ┌─────────────────────────────────────────────────────┐
> │FIXME │
> ├─────────────────────────────────────────────────────┤
> │Interestingly, after the event had been received, │
> │the file descriptor indicates as writable (verified │
> │from the source code and by experiment). How is this │
> │useful? │
> └─────────────────────────────────────────────────────┘
>
> EXAMPLES
> The (somewhat contrived) program shown below demonstrates the use
> of the interfaces described in this page. The program creates a
> child process that serves as the "target" process. The child
> process installs a seccomp filter that returns the SEC‐
> COMP_RET_USER_NOTIF action value if a call is made to mkdir(2).
> The child process then calls mkdir(2) once for each of the sup‐
> plied command-line arguments, and reports the result returned by
> the call. After processing all arguments, the child process ter‐
> minates.
>
> The parent process acts as the supervisor, listening for the
> notifications that are generated when the target process calls
> mkdir(2). When such a notification occurs, the supervisor exam‐
> ines the memory of the target process (using /proc/[pid]/mem) to
> discover the pathname argument that was supplied to the mkdir(2)
> call, and performs one of the following actions:
>
> · If the pathname begins with the prefix "/tmp/", then the super‐
> visor attempts to create the specified directory, and then
> spoofs a return for the target process based on the return
> value of the supervisor's mkdir(2) call. In the event that
> that call succeeds, the spoofed success return value is the
> length of the pathname.
>
> · If the pathname begins with "./" (i.e., it is a relative path‐
> name), the supervisor sends a SECCOMP_USER_NOTIF_FLAG_CONTINUE
> response to the kernel to say that kernel should execute the
> target process's mkdir(2) call.

Potentially problematic if the two processes have the same privilege
level and the supervisor intends _CONTINUE to mean "is safe to execute".
An attacker could try to re-write arguments afaict.
A good and easy example is usually mknod() in a user namespace. A
_CONTINUE is always safe since you can't create device nodes anyway.

Sorry, I can't review the rest in sufficient detail since I'm on
vacation still so I'm just going to shut up now. :)

Christian

2020-10-01 12:56:05

by Christian Brauner

Subject: Re: For review: seccomp_user_notif(2) manual page

On Wed, Sep 30, 2020 at 05:53:46PM +0200, Jann Horn via Containers wrote:
> On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> <[email protected]> wrote:
> > I knew it would be a big ask, but below is kind of the manual page
> > I was hoping you might write [1] for the seccomp user-space notification
> > mechanism. Since you didn't (and because 5.9 adds various new pieces
> > such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
> > that also will need documenting [2]), I did :-). But of course I may
> > have made mistakes...
> [...]
> > NAME
> > seccomp_user_notif - Seccomp user-space notification mechanism
> >
> > SYNOPSIS
> > #include <linux/seccomp.h>
> > #include <linux/filter.h>
> > #include <linux/audit.h>
> >
> > int seccomp(unsigned int operation, unsigned int flags, void *args);
>
> Should the ioctl() calls be listed here, similar to e.g. the SYNOPSIS
> of the ioctl_* manpages?
>
> > DESCRIPTION
> > This page describes the user-space notification mechanism pro‐
> > vided by the Secure Computing (seccomp) facility. As well as the
> > use of the SECCOMP_FILTER_FLAG_NEW_LISTENER flag, the SEC‐
> > COMP_RET_USER_NOTIF action value, and the SECCOMP_GET_NOTIF_SIZES
> > operation described in seccomp(2), this mechanism involves the
> > use of a number of related ioctl(2) operations (described below).
> >
> > Overview
> > In conventional usage of a seccomp filter, the decision about how
> > to treat a particular system call is made by the filter itself.
> > The user-space notification mechanism allows the handling of the
> > system call to instead be handed off to a user-space process.
> > The advantages of doing this are that, by contrast with the sec‐
> > comp filter, which is running on a virtual machine inside the
> > kernel, the user-space process has access to information that is
> > unavailable to the seccomp filter and it can perform actions that
> > can't be performed from the seccomp filter.
> >
> > In the discussion that follows, the process that has installed
> > the seccomp filter is referred to as the target, and the process
>
> Technically, this definition of "target" is a bit inaccurate because:
>
> - seccomp filters are inherited
> - seccomp filters apply to threads, not processes
> - seccomp filters can be semi-remotely installed via TSYNC
>
> (I assume that in manpages, we should try to go for the "a task is a
> thread and a thread group is a process" definition, right?)
>
> Perhaps "the threads on which the seccomp filter is installed are
> referred to as the target", or something like that would be better?
>
> > that is notified by the user-space notification mechanism is
> > referred to as the supervisor. An overview of the steps per‐
> > formed by these two processes is as follows:
> >
> > 1. The target process establishes a seccomp filter in the usual
> > manner, but with two differences:
> >
> > · The seccomp(2) flags argument includes the flag SECCOMP_FIL‐
> > TER_FLAG_NEW_LISTENER. Consequently, the return value of
> > the (successful) seccomp(2) call is a new "listening" file
> > descriptor that can be used to receive notifications.
> >
> > · In cases where it is appropriate, the seccomp filter returns
> > the action value SECCOMP_RET_USER_NOTIF. This return value
> > will trigger a notification event.
> >
> > 2. In order that the supervisor process can obtain notifications
> > using the listening file descriptor, (a duplicate of) that
> > file descriptor must be passed from the target process to the
> > supervisor process. One way in which this could be done is by
> > passing the file descriptor over a UNIX domain socket connec‐
> > tion between the two processes (using the SCM_RIGHTS ancillary
> > message type described in unix(7)). Another possibility is
> > that the supervisor might inherit the file descriptor via
> > fork(2).
>
> With the caveat that if the supervisor inherits the file descriptor
> via fork(), that (more or less) implies that the supervisor is subject
> to the same filter (although it could bypass the filter using a helper
> thread that responds SECCOMP_USER_NOTIF_FLAG_CONTINUE, but I don't
> expect any clean software to do that).
>
> > 3. The supervisor process will receive notification events on the
> > listening file descriptor. These events are returned as
> > structures of type seccomp_notif. Because this structure and
> > its size may evolve over kernel versions, the supervisor must
> > first determine the size of this structure using the sec‐
> > comp(2) SECCOMP_GET_NOTIF_SIZES operation, which returns a
> > structure of type seccomp_notif_sizes. The supervisor allo‐
> > cates a buffer of size seccomp_notif_sizes.seccomp_notif bytes
> > to receive notification events. In addition, the supervisor
> > allocates another buffer of size seccomp_notif_sizes.sec‐
> > comp_notif_resp bytes for the response (a struct sec‐
> > comp_notif_resp structure) that it will provide to the kernel
> > (and thus the target process).
> >
> > 4. The target process then performs its workload, which includes
> > system calls that will be controlled by the seccomp filter.
> > Whenever one of these system calls causes the filter to return
> > the SECCOMP_RET_USER_NOTIF action value, the kernel does not
> > execute the system call; instead, execution of the target
> > process is temporarily blocked inside the kernel and a notifi‐
>
> where "blocked" refers to the interruptible, restartable kind - if the
> child receives a signal with an SA_RESTART signal handler in the
> meantime, it'll leave the syscall, go through the signal handler, then
> restart the syscall again and send the same request to the supervisor
> again. so the supervisor may see duplicate syscalls.
>
> What's really gross here is that signal(7) promises that some syscalls
> like epoll_wait(2) never restart, but seccomp doesn't know about that;
> if userspace installs a filter that uses SECCOMP_RET_USER_NOTIF for a
> non-restartable syscall, the result is that UAPI gets broken a little
> bit. Luckily normal users of seccomp probably won't use
> SECCOMP_RET_USER_NOTIF for non-restartable syscalls, but if someone does
> want to do that, we might have to add some "suppress syscall
> restarting" flag into the seccomp action value, or something like
> that... yuck.
>
> > cation event is generated on the listening file descriptor.
> >
> > 5. The supervisor process can now repeatedly monitor the listen‐
> > ing file descriptor for SECCOMP_RET_USER_NOTIF-triggered
> > events. To do this, the supervisor uses the SEC‐
> > COMP_IOCTL_NOTIF_RECV ioctl(2) operation to read information
> > about a notification event; this operation blocks until an
>
> (interruptibly - but I guess that maybe doesn't have to be said
> explicitly here?)
>
> > event is available.
>
> Maybe we should note here that you can use the multi-fd-polling APIs
> (select/poll/epoll) instead, and that if the notification goes away
> before you call SECCOMP_IOCTL_NOTIF_RECV, the ioctl will return
> -ENOENT instead of blocking, and therefore as long as nobody else
> reads from the same fd, you can assume that after the fd reports as
> readable, you can call SECCOMP_IOCTL_NOTIF_RECV once without blocking.
>
> Exceeeeept that this part looks broken:
>
>     if (mutex_lock_interruptible(&filter->notify_lock) < 0)
>         return EPOLLERR;
>
> which I think means that we can have a race where a signal arrives
> while poll() is trying to add itself to the waitqueue of the seccomp
> fd, and then we'll get a spurious error condition reported on the fd.
> That's a kernel bug, I'd say.
>
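To make the poll-first pattern described above concrete, here is a minimal sketch. The helper name waitReadable() is hypothetical, and mapping POLLERR/POLLHUP/POLLNVAL to EIO is my own choice, not something the kernel or the draft page prescribes. Given the possible spurious error condition mentioned above, a caller might reasonably treat EIO as "poll again" rather than as fatal.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Wait until 'fd' (for example, the seccomp listening file descriptor)
   is reported as readable. Returns 1 if readable, 0 on timeout, or -1
   on error with errno set (EIO stands in for POLLERR, POLLHUP, and
   POLLNVAL; that mapping is an assumption of this sketch). */

static int
waitReadable(int fd, int timeoutMs)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    assert(fd >= 0);

    for (;;) {
        int n = poll(&pfd, 1, timeoutMs);

        if (n == -1) {
            if (errno == EINTR)
                continue;       /* Interrupted by a signal; poll again
                                   (note: this restarts the timeout) */
            return -1;
        }
        if (n == 0)
            return 0;           /* Timed out */
        if (pfd.revents & POLLIN)
            return 1;           /* A notification should be pending */

        errno = EIO;
        return -1;
    }
}
```

After waitReadable() returns 1, the supervisor would issue a single SECCOMP_IOCTL_NOTIF_RECV and be prepared for ENOENT.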
> > The operation returns a seccomp_notif
> > structure containing information about the system call that is
> > being attempted by the target process.
> >
> > 6. The seccomp_notif structure returned by the SEC‐
> > COMP_IOCTL_NOTIF_RECV operation includes the same information
> > (a seccomp_data structure) that was passed to the seccomp fil‐
> > ter. This information allows the supervisor to discover the
> > system call number and the arguments for the target process's
> > system call. In addition, the notification event contains the
> > PID of the target process.
>
> That's a PIDTYPE_PID, which the manpages call a "thread ID".
>
> > The information in the notification can be used to discover
> > the values of pointer arguments for the target process's sys‐
> > tem call. (This is something that can't be done from within a
> > seccomp filter.) To do this (and assuming it has suitable
> > permissions), the supervisor opens the corresponding
> > /proc/[pid]/mem file,
>
> ... which means that here we might have to get into the weeds of how
> actually /proc has invisible directories for every TID, even though
> only the ones for PIDs are visible, and therefore you can just open
> /proc/[tid]/mem and it'll work fine?
>
> > seeks to the memory location that corre‐
> > sponds to one of the pointer arguments whose value is supplied
> > in the notification event, and reads bytes from that location.
> > (The supervisor must be careful to avoid a race condition that
> > can occur when doing this; see the description of the SEC‐
> > COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.) In addi‐
> > tion, the supervisor can access other system information that
> > is visible in user space but which is not accessible from a
> > seccomp filter.
> >
> > ┌─────────────────────────────────────────────────────┐
> > │FIXME │
> > ├─────────────────────────────────────────────────────┤
> > │Suppose we are reading a pathname from /proc/PID/mem │
> > │for a system call such as mkdir(). The pathname can │
> > │be an arbitrary length. How do we know how much (how │
> > │many pages) to read from /proc/PID/mem? │
> > └─────────────────────────────────────────────────────┘
>
> It can't be an arbitrary length. While pathnames *returned* from the
> kernel in some places can have different limits, strings supplied as
> path arguments *to* the kernel AFAIK always have an upper limit of
> PATH_MAX, else you get -ENAMETOOLONG. See getname_flags().
>
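Given the PATH_MAX bound, reading a pathname argument can be sketched roughly as follows. readPathAt() is a hypothetical helper, not code from the page; it assumes a 64-bit off_t and reports ENAMETOOLONG both when no terminator is found within PATH_MAX bytes and when the file ends first. It loops because a read from /proc/TID/mem may return fewer bytes than requested, for example when the string sits near the end of a mapping.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Read a NUL-terminated pathname starting at offset 'addr' in the file
   referred to by 'memFd' (for example, an open /proc/TID/mem file).
   'path' must have room for PATH_MAX bytes. Returns the length of the
   string, or -1 with errno set. */

static ssize_t
readPathAt(int memFd, uint64_t addr, char *path)
{
    size_t have = 0;

    assert(path != NULL);

    while (have < PATH_MAX) {
        ssize_t n = pread(memFd, path + have, PATH_MAX - have,
                          (off_t) (addr + have));
        if (n == -1) {
            if (errno == EINTR)
                continue;
            return -1;          /* For example, EIO on an unmapped page */
        }
        if (n == 0)
            break;              /* End of file before any terminator */

        /* Only the newly read bytes need to be searched */
        char *nul = memchr(path + have, '\0', (size_t) n);
        if (nul != NULL)
            return nul - path;

        have += (size_t) n;
    }

    errno = ENAMETOOLONG;
    return -1;
}
```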
> > 7. Having obtained information as per the previous step, the
> > supervisor may then choose to perform an action in response to
> > the target process's system call (which, as noted above, is
> > not executed when the seccomp filter returns the SEC‐
> > COMP_RET_USER_NOTIF action value).
>
> (unless SECCOMP_USER_NOTIF_FLAG_CONTINUE is used)
>
> > One example use case here relates to containers. The target
> > process may be located inside a container where it does not
> > have sufficient capabilities to mount a filesystem in the con‐
> > tainer's mount namespace. However, the supervisor may be a
> > more privileged process that that does have sufficient capa‐
>
> nit: s/that that/that/
>
> > bilities to perform the mount operation.
> >
> > 8. The supervisor then sends a response to the notification. The
> > information in this response is used by the kernel to con‐
> > struct a return value for the target process's system call and
> > provide a value that will be assigned to the errno variable of
> > the target process.
> >
> > The response is sent using the SECCOMP_IOCTL_NOTIF_SEND
> > ioctl(2) operation, which is used to transmit a sec‐
> > comp_notif_resp structure to the kernel. This structure
> > includes a cookie value that the supervisor obtained in the
> > seccomp_notif structure returned by the SEC‐
> > COMP_IOCTL_NOTIF_RECV operation. This cookie value allows the
> > kernel to associate the response with the target process.
>
> (unless the target thread entered a signal handler or was killed in
> the meantime)
>
> > 9. Once the notification has been sent, the system call in the
> > target process unblocks, returning the information that was
> > provided by the supervisor in the notification response.
> >
> > As a variation on the last two steps, the supervisor can send a
> > response that tells the kernel that it should execute the target
> > process's system call; see the discussion of SEC‐
> > COMP_USER_NOTIF_FLAG_CONTINUE, below.
> >
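Step 3 of the overview (sizing the buffers via SECCOMP_GET_NOTIF_SIZES) could look something like the sketch below. The struct notifBufs type and allocNotifBufs() are hypothetical names. Rounding each size up to the size of the structure known at compile time is a defensive assumption on my part, for the case where the headers are newer than the running kernel.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/seccomp.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

struct notifBufs {
    struct seccomp_notif_sizes sizes;   /* As reported by the kernel */
    struct seccomp_notif *req;          /* Buffer for NOTIF_RECV */
    struct seccomp_notif_resp *resp;    /* Buffer for NOTIF_SEND */
};

/* Ask the kernel for the notification structure sizes and allocate
   zeroed buffers. Returns 0 on success, or -1 on error (in which case
   both buffer pointers are NULL). */

static int
allocNotifBufs(struct notifBufs *b)
{
    assert(b != NULL);
    b->req = NULL;
    b->resp = NULL;

    if (syscall(SYS_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, &b->sizes) == -1)
        return -1;

    size_t reqSize = b->sizes.seccomp_notif;
    if (reqSize < sizeof(struct seccomp_notif))
        reqSize = sizeof(struct seccomp_notif);

    size_t respSize = b->sizes.seccomp_notif_resp;
    if (respSize < sizeof(struct seccomp_notif_resp))
        respSize = sizeof(struct seccomp_notif_resp);

    b->req = calloc(1, reqSize);
    b->resp = calloc(1, respSize);
    if (b->req == NULL || b->resp == NULL) {
        free(b->req);
        free(b->resp);
        b->req = NULL;
        b->resp = NULL;
        return -1;
    }

    return 0;
}
```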
> > ioctl(2) operations
> > The following ioctl(2) operations are provided to support seccomp
> > user-space notification. For each of these operations, the first
> > (file descriptor) argument of ioctl(2) is the listening file
> > descriptor returned by a call to seccomp(2) with the SECCOMP_FIL‐
> > TER_FLAG_NEW_LISTENER flag.
> >
> > SECCOMP_IOCTL_NOTIF_RECV
> > This operation is used to obtain a user-space notification
> > event. If no such event is currently pending, the opera‐
> > tion blocks until an event occurs.
>
> Not necessarily; for every time a process entered a signal handler or
> was killed while a notification was pending, a call to
> SECCOMP_IOCTL_NOTIF_RECV will return -ENOENT.
>
> > The third ioctl(2)
> > argument is a pointer to a structure of the following form
> > which contains information about the event. This struc‐
> > ture must be zeroed out before the call.
> >
> > struct seccomp_notif {
> > __u64 id; /* Cookie */
> > __u32 pid; /* PID of target process */
>
> (TID, not PID)
>
> > __u32 flags; /* Currently unused (0) */
> > struct seccomp_data data; /* See seccomp(2) */
> > };
> >
> > The fields in this structure are as follows:
> >
> > id This is a cookie for the notification. Each such
> > cookie is guaranteed to be unique for the corre‐
> > sponding seccomp filter. In other words, this
> > cookie is unique for each notification event from
> > the target process.
>
> That sentence about "target process" looks wrong to me. The cookies
> are unique across notifications from the filter, but there can be
> multiple filters per thread, and multiple threads per filter.
>
> > The cookie value has the fol‐
> > lowing uses:
> >
> > · It can be used with the SEC‐
> > COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation to
> > verify that the target process is still alive.
> >
> > · When returning a notification response to the
> > kernel, the supervisor must include the cookie
> > value in the seccomp_notif_resp structure that is
> > specified as the argument of the SEC‐
> > COMP_IOCTL_NOTIF_SEND operation.
> >
> > pid This is the PID of the target process that trig‐
> > gered the notification event.
> >
> > ┌─────────────────────────────────────────────────────┐
> > │FIXME │
> > ├─────────────────────────────────────────────────────┤
> > │This is a thread ID, rather than a PID, right? │
> > └─────────────────────────────────────────────────────┘
>
> Yeah.
>
> >
> > flags This is a bit mask of flags providing further
> > information on the event. In the current implemen‐
> > tation, this field is always zero.
> >
> > data This is a seccomp_data structure containing infor‐
> > mation about the system call that triggered the
> > notification. This is the same structure that is
> > passed to the seccomp filter. See seccomp(2) for
> > details of this structure.
> >
> > On success, this operation returns 0; on failure, -1 is
> > returned, and errno is set to indicate the cause of the
> > error. This operation can fail with the following errors:
> >
> > EINVAL (since Linux 5.5)
> > The seccomp_notif structure that was passed to the
> > call contained nonzero fields.
> >
> > ENOENT The target process was killed by a signal as the
> > notification information was being generated.
>
> Not just killed, interruption with a signal handler has the same effect.
>
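Since ENOENT here just means "the notification went away", a supervisor loop probably wants to treat it as non-fatal. A minimal sketch, with recvNotif() as a hypothetical helper name and the three-way return convention as my own assumption:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/seccomp.h>
#include <string.h>
#include <sys/ioctl.h>

/* Fetch one notification into 'req', a buffer of 'reqSize' bytes.
   Returns 1 if a notification was received, 0 if the pending
   notification vanished (ENOENT: the target was killed or entered a
   signal handler) so that the caller should simply wait again, or -1
   on any other error. */

static int
recvNotif(int notifyFd, struct seccomp_notif *req, size_t reqSize)
{
    assert(req != NULL && reqSize >= sizeof(*req));

    for (;;) {
        /* The structure must be zeroed before each call */
        memset(req, 0, reqSize);

        if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_RECV, req) == 0)
            return 1;
        if (errno == EINTR)
            continue;           /* Supervisor was interrupted; retry */
        if (errno == ENOENT)
            return 0;           /* Nothing to handle any more */
        return -1;
    }
}
```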
> > ┌─────────────────────────────────────────────────────┐
> > │FIXME │
> > ├─────────────────────────────────────────────────────┤
> > │From my experiments, it appears that if a SEC‐ │
> > │COMP_IOCTL_NOTIF_RECV is done after the target │
> > │process terminates, then the ioctl() simply blocks │
> > │(rather than returning an error to indicate that the │
> > │target process no longer exists). │
> > │ │
> > │I found that surprising, and it required some con‐ │
> > │tortions in the example program. It was not possi‐ │
> > │ble to code my SIGCHLD handler (which reaps the zom‐ │
> > │bie when the worker/target process terminates) to │
> > │simply set a flag checked in the main handleNotifi‐ │
> > │cations() loop, since this created an unavoidable │
> > │race where the child might terminate just after I │
> > │had checked the flag, but before I blocked (for‐ │
> > │ever!) in the SECCOMP_IOCTL_NOTIF_RECV operation. │
> > │Instead, I had to code the signal handler to simply │
> > │call _exit(2) in order to terminate the parent │
> > │process (the supervisor). │
> > │ │
> > │Is this expected behavior? It seems to me rather │
> > │desirable that SECCOMP_IOCTL_NOTIF_RECV should give │
> > │an error if the target process has terminated. │
> > └─────────────────────────────────────────────────────┘
>
> You could poll() the fd first. But yeah, it'd probably be a good idea
> to change that.
>
> > SECCOMP_IOCTL_NOTIF_ID_VALID
> [...]
> > In the above scenario, the risk is that the supervisor may
> > try to access the memory of a process other than the tar‐
> > get. This race can be avoided by following the call to
> > open with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to ver‐
> > ify that the process that generated the notification is
> > still alive. (Note that if the target process subse‐
> > quently terminates, its PID won't be reused because there
>
> That's wrong, the PID can be reused, but the /proc/$pid directory is
> internally not associated with the numeric PID, but, conceptually
> speaking, with a specific incarnation of the PID, or something like
> that. (Actually, it is associated with the "struct pid", which is not
> reused, instead of the numeric PID.)
>
> > remains an open reference to the /proc/[pid]/mem file; in
> > this case, a subsequent read(2) from the file will return
> > 0, indicating end of file.)
> >
> > On success (i.e., the notification ID is still valid),
> > this operation returns 0 On failure (i.e., the notifica‐
>
> nit: s/returns 0/returns 0./
>
> > tion ID is no longer valid), -1 is returned, and errno is
> > set to ENOENT.
> >
> > SECCOMP_IOCTL_NOTIF_SEND
> [...]
> > Two kinds of response are possible:
> >
> > · A response to the kernel telling it to execute the tar‐
> > get process's system call. In this case, the flags
> > field includes SECCOMP_USER_NOTIF_FLAG_CONTINUE and the
> > error and val fields must be zero.
> >
> > This kind of response can be useful in cases where the
> > supervisor needs to do deeper analysis of the target's
> > system call than is possible from a seccomp filter
> > (e.g., examining the values of pointer arguments), and,
> > having verified that the system call is acceptable, the
> > supervisor wants to allow it to proceed.
>
> "allow" sounds as if this is an access control thing, but this
> mechanism should usually not be used for access control (unless the
> "seccomp" syscall is blocked). Maybe reword as "having decided that
> the system call does not require emulation by the supervisor, the
> supervisor wants it to execute normally", or something like that?
>
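For reference, the two kinds of response differ only in how the seccomp_notif_resp fields are filled in. The helpers below are hypothetical names for a sketch. The fallback #define is an assumption for building against headers older than Linux 5.5, and passing the error as a negated errno value follows what the example program in the draft appears to do.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/seccomp.h>
#include <string.h>

#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE
#define SECCOMP_USER_NOTIF_FLAG_CONTINUE (1UL << 0)  /* Since Linux 5.5 */
#endif

/* Response telling the kernel to execute the target's system call
   normally: 'error' and 'val' must both be zero. */

static void
fillContinueResp(struct seccomp_notif_resp *resp,
                 const struct seccomp_notif *req)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = req->id;
    resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}

/* Response for a system call that the supervisor emulated. 'err' is 0
   for success (in which case 'val' becomes the return value) or a
   positive errno value for failure. */

static void
fillEmulatedResp(struct seccomp_notif_resp *resp,
                 const struct seccomp_notif *req, long long val, int err)
{
    assert(err >= 0);

    memset(resp, 0, sizeof(*resp));
    resp->id = req->id;
    resp->error = -err;                 /* 0, or a negated errno value */
    resp->val = (err == 0) ? val : 0;
}
```

Either structure would then be handed to SECCOMP_IOCTL_NOTIF_SEND.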
> [...]
> > On success, this operation returns 0; on failure, -1 is
> > returned, and errno is set to indicate the cause of the
> > error. This operation can fail with the following errors:
> >
> > EINPROGRESS
> > A response to this notification has already been
> > sent.
> >
> > EINVAL An invalid value was specified in the flags field.
> >
> > EINVAL The flags field contained SEC‐
> > COMP_USER_NOTIF_FLAG_CONTINUE, and the error or val
> > field was not zero.
> >
> > ENOENT The blocked system call in the target process has
> > been interrupted by a signal handler.
>
> (you could also get this if a response has already been sent, instead
> of EINPROGRESS - the only difference is whether the target thread has
> picked up the response yet)
>
> > NOTES
> > The file descriptor returned when seccomp(2) is employed with the
> > SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using
> > poll(2), epoll(7), and select(2). When a notification is pend‐
> > ing, these interfaces indicate that the file descriptor is read‐
> > able.
>
> We should probably also point out somewhere that, as
> include/uapi/linux/seccomp.h says:
>
> * Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF
> * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
> * same syscall, the most recently added filter takes precedence. This means
> * that the new SECCOMP_RET_USER_NOTIF filter can override any
> * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
> * such filtered syscalls to be executed by sending the response
> * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally
> * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
>
> In other words, from a security perspective, you must assume that the
> target process can bypass any SECCOMP_RET_USER_NOTIF (or
> SECCOMP_RET_TRACE) filters unless it is completely prohibited from
> calling seccomp(). This should also be noted over in the main
> seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.

So I was actually wondering about this when I skimmed this a while
ago, but then forgot about it again... Afaict, you can only ever load a
single filter with SECCOMP_FILTER_FLAG_NEW_LISTENER set. If there
already is a filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER property
in the task's filter hierarchy then the kernel will refuse to load a new
one?

static struct file *init_listener(struct seccomp_filter *filter)
{
        struct file *ret = ERR_PTR(-EBUSY);
        struct seccomp_filter *cur;

        for (cur = current->seccomp.filter; cur; cur = cur->prev) {
                if (cur->notif)
                        goto out;
        }

shouldn't that be sufficient to guarantee that USER_NOTIF filters can't
override each other for the same task simply because there can only ever
be a single one?

>
>
> > EXAMPLES
> [...]
> > This program can used to demonstrate various aspects of the
>
> nit: "can be used to demonstrate", or alternatively just "demonstrates"
>
> > behavior of the seccomp user-space notification mechanism. To
> > help aid such demonstrations, the program logs various messages
> > to show the operation of the target process (lines prefixed "T:")
> > and the supervisor (indented lines prefixed "S:").
> [...]
> > Program source
> [...]
> > #define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \
> > } while (0)
>
> Don't we have err() for this?
>
> > /* Send the file descriptor 'fd' over the connected UNIX domain socket
> > 'sockfd'. Returns 0 on success, or -1 on error. */
> >
> > static int
> > sendfd(int sockfd, int fd)
> > {
> >     struct msghdr msgh;
> >     struct iovec iov;
> >     int data;
> >     struct cmsghdr *cmsgp;
> >
> >     /* Allocate a char array of suitable size to hold the ancillary data.
> >        However, since this buffer is in reality a 'struct cmsghdr', use a
> >        union to ensure that it is suitable aligned. */
>
> nit: suitably
>
> >     union {
> >         char buf[CMSG_SPACE(sizeof(int))];
> >                         /* Space large enough to hold an 'int' */
> >         struct cmsghdr align;
> >     } controlMsg;
> >
> >     /* The 'msg_name' field can be used to specify the address of the
> >        destination socket when sending a datagram. However, we do not
> >        need to use this field because 'sockfd' is a connected socket. */
> >
> >     msgh.msg_name = NULL;
> >     msgh.msg_namelen = 0;
> >
> >     /* On Linux, we must transmit at least one byte of real data in
> >        order to send ancillary data. We transmit an arbitrary integer
> >        whose value is ignored by recvfd(). */
> >
> >     msgh.msg_iov = &iov;
> >     msgh.msg_iovlen = 1;
> >     iov.iov_base = &data;
> >     iov.iov_len = sizeof(int);
> >     data = 12345;
> >
> >     /* Set 'msghdr' fields that describe ancillary data */
> >
> >     msgh.msg_control = controlMsg.buf;
> >     msgh.msg_controllen = sizeof(controlMsg.buf);
> >
> >     /* Set up ancillary data describing file descriptor to send */
> >
> >     cmsgp = CMSG_FIRSTHDR(&msgh);
> >     cmsgp->cmsg_level = SOL_SOCKET;
> >     cmsgp->cmsg_type = SCM_RIGHTS;
> >     cmsgp->cmsg_len = CMSG_LEN(sizeof(int));
> >     memcpy(CMSG_DATA(cmsgp), &fd, sizeof(int));
> >
> >     /* Send real plus ancillary data */
> >
> >     if (sendmsg(sockfd, &msgh, 0) == -1)
> >         return -1;
> >
> >     return 0;
> > }
>
> Instead of using unix domain sockets to send the fd to the parent, I
> think you could also use clone3() with flags==CLONE_FILES|SIGCHLD,
> dup2() the seccomp fd to an fd that was reserved in the parent, call
> unshare(CLONE_FILES) in the child after setting up the seccomp fd, and
> wake up the parent with something like pthread_cond_signal()? I'm not
> sure whether that'd look better or worse in the end though, so maybe
> just ignore this comment.

(If the target process exec's (rather fast) then VFORK can be useful.)
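For completeness, the receiving side of the SCM_RIGHTS approach might look like the sketch below. It mirrors the quoted sendfd(). The name recvfd() comes from the comment in that function, but this body is my own reconstruction rather than the code from the page, and the use of MSG_CMSG_CLOEXEC is an added assumption in line with the O_CLOEXEC discussion.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Receive a file descriptor sent as SCM_RIGHTS ancillary data over the
   connected UNIX domain socket 'sockfd'. Returns the new descriptor
   (with close-on-exec set), or -1 on error. */

static int
recvfd(int sockfd)
{
    struct msghdr msgh;
    struct iovec iov;
    int data, fd;

    /* Same union trick as the sender, to get suitable alignment */
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } controlMsg;

    assert(sockfd >= 0);

    memset(&msgh, 0, sizeof(msgh));

    /* Receive the (ignored) integer of real data */
    iov.iov_base = &data;
    iov.iov_len = sizeof(int);
    msgh.msg_iov = &iov;
    msgh.msg_iovlen = 1;

    msgh.msg_control = controlMsg.buf;
    msgh.msg_controllen = sizeof(controlMsg.buf);

    if (recvmsg(sockfd, &msgh, MSG_CMSG_CLOEXEC) == -1)
        return -1;

    /* Check that the ancillary data is exactly one file descriptor */
    struct cmsghdr *cmsgp = CMSG_FIRSTHDR(&msgh);
    if (cmsgp == NULL ||
            cmsgp->cmsg_len != CMSG_LEN(sizeof(int)) ||
            cmsgp->cmsg_level != SOL_SOCKET ||
            cmsgp->cmsg_type != SCM_RIGHTS) {
        errno = EINVAL;
        return -1;
    }

    memcpy(&fd, CMSG_DATA(cmsgp), sizeof(int));
    return fd;
}
```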

>
> [...]
> > /* Access the memory of the target process in order to discover the
> > pathname that was given to mkdir() */
> >
> > static void
> > getTargetPathname(struct seccomp_notif *req, int notifyFd,
> >                   char *path, size_t len)
> > {
> >     char procMemPath[PATH_MAX];
> >     snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", req->pid);
> >
> >     int procMemFd = open(procMemPath, O_RDONLY);
>
> Should example code like this maybe use O_CLOEXEC unless the fd in
> question actually has to be inheritable? I know it doesn't actually
> matter here, but if this code was used in a multi-threaded context, it
> might.

Agreed, about the O_CLOEXEC part.

>
> >     if (procMemFd == -1)
> >         errExit("Supervisor: open");
> >
> >     /* Check that the process whose info we are accessing is still alive.
> >        If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
> >        in checkNotificationIdIsValid()) succeeds, we know that the
> >        /proc/PID/mem file descriptor that we opened corresponds to the
> >        process for which we received a notification. If that process
> >        subsequently terminates, then read() on that file descriptor
> >        will return 0 (EOF). */
> >
> >     checkNotificationIdIsValid(notifyFd, req->id);
> >
> >     /* Seek to the location containing the pathname argument (i.e., the
> >        first argument) of the mkdir(2) call and read that pathname */
> >
> >     if (lseek(procMemFd, req->data.args[0], SEEK_SET) == -1)
> >         errExit("Supervisor: lseek");
> >
> >     ssize_t s = read(procMemFd, path, PATH_MAX);
> >     if (s == -1)
> >         errExit("read");
>
> Why not pread() instead of lseek()+read()?

With multiple arguments to be read process_vm_readv() should also be
considered.

2020-10-01 15:49:57

by Jann Horn

Subject: Re: For review: seccomp_user_notif(2) manual page

On Thu, Oct 1, 2020 at 2:54 PM Christian Brauner
<[email protected]> wrote:
> On Wed, Sep 30, 2020 at 05:53:46PM +0200, Jann Horn via Containers wrote:
> > On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> > <[email protected]> wrote:
> > > NOTES
> > > The file descriptor returned when seccomp(2) is employed with the
> > > SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using
> > > poll(2), epoll(7), and select(2). When a notification is pend‐
> > > ing, these interfaces indicate that the file descriptor is read‐
> > > able.
> >
> > We should probably also point out somewhere that, as
> > include/uapi/linux/seccomp.h says:
> >
> > * Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF
> > * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
> > * same syscall, the most recently added filter takes precedence. This means
> > * that the new SECCOMP_RET_USER_NOTIF filter can override any
> > * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
> > * such filtered syscalls to be executed by sending the response
> > * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally
> > * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
> >
> > In other words, from a security perspective, you must assume that the
> > target process can bypass any SECCOMP_RET_USER_NOTIF (or
> > SECCOMP_RET_TRACE) filters unless it is completely prohibited from
> > calling seccomp(). This should also be noted over in the main
> > seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.
>
> So I was actually wondering about this when I skimmed this and a while
> ago but forgot about this again... Afaict, you can only ever load a
> single filter with SECCOMP_FILTER_FLAG_NEW_LISTENER set. If there
> already is a filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER property
> in the tasks filter hierarchy then the kernel will refuse to load a new
> one?
>
> static struct file *init_listener(struct seccomp_filter *filter)
> {
> struct file *ret = ERR_PTR(-EBUSY);
> struct seccomp_filter *cur;
>
> for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> if (cur->notif)
> goto out;
> }
>
> shouldn't that be sufficient to guarantee that USER_NOTIF filters can't
> override each other for the same task simply because there can only ever
> be a single one?

Good point. Exceeeept that that check seems ineffective because this
happens before we take the locks that guard against TSYNC, and also
before we decide to which existing filter we want to chain the new
filter. So if two threads race with TSYNC, I think they'll be able to
chain two filters with listeners together.

I don't know whether we want to eternalize this "only one listener
across all the filters" restriction in the manpage though, or whether
the man page should just say that the kernel currently doesn't support
it but that security-wise you should assume that it might at some
point.

[...]
> > > if (procMemFd == -1)
> > > errExit("Supervisor: open");
> > >
> > > /* Check that the process whose info we are accessing is still alive.
> > > If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
> > > in checkNotificationIdIsValid()) succeeds, we know that the
> > > /proc/PID/mem file descriptor that we opened corresponds to the
> > > process for which we received a notification. If that process
> > > subsequently terminates, then read() on that file descriptor
> > > will return 0 (EOF). */
> > >
> > > checkNotificationIdIsValid(notifyFd, req->id);
> > >
> > > /* Seek to the location containing the pathname argument (i.e., the
> > > first argument) of the mkdir(2) call and read that pathname */
> > >
> > > if (lseek(procMemFd, req->data.args[0], SEEK_SET) == -1)
> > > errExit("Supervisor: lseek");
> > >
> > > ssize_t s = read(procMemFd, path, PATH_MAX);
> > > if (s == -1)
> > > errExit("read");
> >
> > Why not pread() instead of lseek()+read()?
>
> With multiple arguments to be read process_vm_readv() should also be
> considered.

process_vm_readv() can end up doing each read against a different
process, which is sort of weird semantically. You would end up taking
page faults at random addresses in unrelated processes, blocking on
their mmap locks, potentially triggering their userfaultfd notifiers,
and so on.

Whereas if you first open /proc/$tid/mem, then re-check
SECCOMP_IOCTL_NOTIF_ID_VALID, and then do the read, you know that
you're only taking page faults on the process where you intended to do
it.

So until there is a variant of process_vm_readv() that operates on
pidfds, I would not recommend using that here.
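The open-then-recheck ordering described above can be sketched as follows. openTargetMem() is a hypothetical helper; the only point it illustrates is that the SECCOMP_IOCTL_NOTIF_ID_VALID check comes after the open() and before any read, and that the descriptor is opened with O_CLOEXEC as suggested earlier in the thread.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/seccomp.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Open /proc/TID/mem for the thread that triggered 'req', then verify
   that the notification is still valid. Returns a file descriptor that
   is known to refer to the intended target, or -1 on error. */

static int
openTargetMem(int notifyFd, const struct seccomp_notif *req)
{
    char path[64];

    assert(req != NULL);
    snprintf(path, sizeof(path), "/proc/%u/mem", (unsigned int) req->pid);

    int memFd = open(path, O_RDONLY | O_CLOEXEC);
    if (memFd == -1)
        return -1;

    /* Only after the open: if the ID is still valid now, the file we
       opened cannot belong to some unrelated task that happens to have
       the same numeric ID */
    __u64 id = req->id;
    if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1) {
        int savedErrno = errno;
        close(memFd);
        errno = savedErrno;
        return -1;
    }

    return memFd;
}
```

Reads (for example with pread()) would then go through the returned descriptor only.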

2020-10-01 17:06:04

by Tycho Andersen

Subject: Re: For review: seccomp_user_notif(2) manual page

On Thu, Oct 01, 2020 at 05:47:54PM +0200, Jann Horn via Containers wrote:
> On Thu, Oct 1, 2020 at 2:54 PM Christian Brauner
> <[email protected]> wrote:
> > On Wed, Sep 30, 2020 at 05:53:46PM +0200, Jann Horn via Containers wrote:
> > > On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> > > <[email protected]> wrote:
> > > > NOTES
> > > > The file descriptor returned when seccomp(2) is employed with the
> > > > SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using
> > > > poll(2), epoll(7), and select(2). When a notification is pend‐
> > > > ing, these interfaces indicate that the file descriptor is read‐
> > > > able.
> > >
> > > We should probably also point out somewhere that, as
> > > include/uapi/linux/seccomp.h says:
> > >
> > > * Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF
> > > * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
> > > * same syscall, the most recently added filter takes precedence. This means
> > > * that the new SECCOMP_RET_USER_NOTIF filter can override any
> > > * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
> > > * such filtered syscalls to be executed by sending the response
> > > * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally
> > > * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
> > >
> > > In other words, from a security perspective, you must assume that the
> > > target process can bypass any SECCOMP_RET_USER_NOTIF (or
> > > SECCOMP_RET_TRACE) filters unless it is completely prohibited from
> > > calling seccomp(). This should also be noted over in the main
> > > seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.
> >
> > So I was actually wondering about this when I skimmed this and a while
> > ago but forgot about this again... Afaict, you can only ever load a
> > single filter with SECCOMP_FILTER_FLAG_NEW_LISTENER set. If there
> > already is a filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER property
> > in the tasks filter hierarchy then the kernel will refuse to load a new
> > one?
> >
> > static struct file *init_listener(struct seccomp_filter *filter)
> > {
> > struct file *ret = ERR_PTR(-EBUSY);
> > struct seccomp_filter *cur;
> >
> > for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> > if (cur->notif)
> > goto out;
> > }
> >
> > shouldn't that be sufficient to guarantee that USER_NOTIF filters can't
> > override each other for the same task simply because there can only ever
> > be a single one?
>
> Good point. Exceeeept that that check seems ineffective because this
> happens before we take the locks that guard against TSYNC, and also
> before we decide to which existing filter we want to chain the new
> filter. So if two threads race with TSYNC, I think they'll be able to
> chain two filters with listeners together.

Yep, seems the check needs to also be in seccomp_can_sync_threads() to
be totally effective.

> I don't know whether we want to eternalize this "only one listener
> across all the filters" restriction in the manpage though, or whether
> the man page should just say that the kernel currently doesn't support
> it but that security-wise you should assume that it might at some
> point.

This requirement originally came from Andy, arguing that the semantics
of this were/are confusing, which still makes sense to me. Perhaps we
should do something like the below?

Tycho


diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 3ee59ce0a323..7b107207c2b0 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -376,6 +376,18 @@ static int is_ancestor(struct seccomp_filter *parent,
return 0;
}

+static bool has_listener_parent(struct seccomp_filter *child)
+{
+ struct seccomp_filter *cur;
+
+ for (cur = child; cur; cur = cur->prev) {
+ if (cur->notif)
+ return true;
+ }
+
+ return false;
+}
+
/**
* seccomp_can_sync_threads: checks if all threads can be synchronized
*
@@ -385,7 +397,7 @@ static int is_ancestor(struct seccomp_filter *parent,
* either not in the correct seccomp mode or did not have an ancestral
* seccomp filter.
*/
-static inline pid_t seccomp_can_sync_threads(void)
+static inline pid_t seccomp_can_sync_threads(unsigned int flags)
{
struct task_struct *thread, *caller;

@@ -407,6 +419,11 @@ static inline pid_t seccomp_can_sync_threads(void)
caller->seccomp.filter)))
continue;

+ /* don't allow TSYNC to install multiple listeners */
+ if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER &&
+ !has_listener_parent(thread->seccomp.filter))
+ continue;
+
/* Return the first thread that cannot be synchronized. */
failed = task_pid_vnr(thread);
/* If the pid cannot be resolved, then return -ESRCH */
@@ -637,7 +654,7 @@ static long seccomp_attach_filter(unsigned int flags,
if (flags & SECCOMP_FILTER_FLAG_TSYNC) {
int ret;

- ret = seccomp_can_sync_threads();
+ ret = seccomp_can_sync_threads(flags);
if (ret) {
if (flags & SECCOMP_FILTER_FLAG_TSYNC_ESRCH)
return -ESRCH;
@@ -1462,12 +1479,9 @@ static const struct file_operations seccomp_notify_ops = {
static struct file *init_listener(struct seccomp_filter *filter)
{
struct file *ret = ERR_PTR(-EBUSY);
- struct seccomp_filter *cur;

- for (cur = current->seccomp.filter; cur; cur = cur->prev) {
- if (cur->notif)
- goto out;
- }
+ if (has_listener_parent(current->seccomp.filter))
+ goto out;

ret = ERR_PTR(-ENOMEM);
filter->notif = kzalloc(sizeof(*(filter->notif)), GFP_KERNEL);

2020-10-01 17:09:50

by Christian Brauner

Subject: Re: For review: seccomp_user_notif(2) manual page

On Thu, Oct 01, 2020 at 05:47:54PM +0200, Jann Horn wrote:
> On Thu, Oct 1, 2020 at 2:54 PM Christian Brauner
> <[email protected]> wrote:
> > On Wed, Sep 30, 2020 at 05:53:46PM +0200, Jann Horn via Containers wrote:
> > > On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> > > <[email protected]> wrote:
> > > > NOTES
> > > > The file descriptor returned when seccomp(2) is employed with the
> > > > SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using
> > > > poll(2), epoll(7), and select(2). When a notification is pend‐
> > > > ing, these interfaces indicate that the file descriptor is read‐
> > > > able.
> > >
> > > We should probably also point out somewhere that, as
> > > include/uapi/linux/seccomp.h says:
> > >
> > > * Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF
> > > * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
> > > * same syscall, the most recently added filter takes precedence. This means
> > > * that the new SECCOMP_RET_USER_NOTIF filter can override any
> > > * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
> > > * such filtered syscalls to be executed by sending the response
> > > * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally
> > > * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
> > >
> > > In other words, from a security perspective, you must assume that the
> > > target process can bypass any SECCOMP_RET_USER_NOTIF (or
> > > SECCOMP_RET_TRACE) filters unless it is completely prohibited from
> > > calling seccomp(). This should also be noted over in the main
> > > seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.
> >
> > So I was actually wondering about this when I skimmed this and a while
> > ago but forgot about this again... Afaict, you can only ever load a
> > single filter with SECCOMP_FILTER_FLAG_NEW_LISTENER set. If there
> > already is a filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER property
> > in the tasks filter hierarchy then the kernel will refuse to load a new
> > one?
> >
> > static struct file *init_listener(struct seccomp_filter *filter)
> > {
> > struct file *ret = ERR_PTR(-EBUSY);
> > struct seccomp_filter *cur;
> >
> > for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> > if (cur->notif)
> > goto out;
> > }
> >
> > shouldn't that be sufficient to guarantee that USER_NOTIF filters can't
> > override each other for the same task simply because there can only ever
> > be a single one?
>
> Good point. Exceeeept that that check seems ineffective because this
> happens before we take the locks that guard against TSYNC, and also
> before we decide to which existing filter we want to chain the new
> filter. So if two threads race with TSYNC, I think they'll be able to
> chain two filters with listeners together.

That's a bug, imho. I don't have source code in front of me right now
though.

>
> I don't know whether we want to eternalize this "only one listener
> across all the filters" restriction in the manpage though, or whether
> the man page should just say that the kernel currently doesn't support
> it but that security-wise you should assume that it might at some
> point.

Maybe. I would argue that it might be worth having at least a new
flag/option to indicate either "This is a non-overridable filter." or at
least for the seccomp notifier have an option to indicate that no other
notifier can be installed.

Christian

2020-10-01 17:17:57

by Christian Brauner

Subject: Re: For review: seccomp_user_notif(2) manual page

On Thu, Oct 01, 2020 at 10:58:50AM -0600, Tycho Andersen wrote:
> On Thu, Oct 01, 2020 at 05:47:54PM +0200, Jann Horn via Containers wrote:
> > On Thu, Oct 1, 2020 at 2:54 PM Christian Brauner
> > <[email protected]> wrote:
> > > On Wed, Sep 30, 2020 at 05:53:46PM +0200, Jann Horn via Containers wrote:
> > > > On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> > > > <[email protected]> wrote:
> > > > > NOTES
> > > > > The file descriptor returned when seccomp(2) is employed with the
> > > > > SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using
> > > > > poll(2), epoll(7), and select(2). When a notification is pend‐
> > > > > ing, these interfaces indicate that the file descriptor is read‐
> > > > > able.
> > > >
> > > > We should probably also point out somewhere that, as
> > > > include/uapi/linux/seccomp.h says:
> > > >
> > > > * Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF
> > > > * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
> > > > * same syscall, the most recently added filter takes precedence. This means
> > > > * that the new SECCOMP_RET_USER_NOTIF filter can override any
> > > > * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
> > > > * such filtered syscalls to be executed by sending the response
> > > > * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally
> > > > * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
> > > >
> > > > In other words, from a security perspective, you must assume that the
> > > > target process can bypass any SECCOMP_RET_USER_NOTIF (or
> > > > SECCOMP_RET_TRACE) filters unless it is completely prohibited from
> > > > calling seccomp(). This should also be noted over in the main
> > > > seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.
> > >
> > > So I was actually wondering about this when I skimmed this and a while
> > > ago but forgot about this again... Afaict, you can only ever load a
> > > single filter with SECCOMP_FILTER_FLAG_NEW_LISTENER set. If there
> > > already is a filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER property
> > > in the tasks filter hierarchy then the kernel will refuse to load a new
> > > one?
> > >
> > > static struct file *init_listener(struct seccomp_filter *filter)
> > > {
> > > struct file *ret = ERR_PTR(-EBUSY);
> > > struct seccomp_filter *cur;
> > >
> > > for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> > > if (cur->notif)
> > > goto out;
> > > }
> > >
> > > shouldn't that be sufficient to guarantee that USER_NOTIF filters can't
> > > override each other for the same task simply because there can only ever
> > > be a single one?
> >
> > Good point. Exceeeept that that check seems ineffective because this
> > happens before we take the locks that guard against TSYNC, and also
> > before we decide to which existing filter we want to chain the new
> > filter. So if two threads race with TSYNC, I think they'll be able to
> > chain two filters with listeners together.
>
> Yep, seems the check needs to also be in seccomp_can_sync_threads() to
> be totally effective,
>
> > I don't know whether we want to eternalize this "only one listener
> > across all the filters" restriction in the manpage though, or whether
> > the man page should just say that the kernel currently doesn't support
> > it but that security-wise you should assume that it might at some
> > point.
>
> This requirement originally came from Andy, arguing that the semantics
> of this were/are confusing, which still makes sense to me. Perhaps we
> should do something like the below?

I think we should either keep up this restriction and then cement it in
the manpage or add a flag to indicate that the notifier is
non-overridable.
I don't care about the default too much, i.e. whether it's overridable
by default and exclusive if opting in or the other way around doesn't
matter too much. But from a supervisor's perspective it'd be quite nice
to be able to be sure that a notifier can't be overridden by another
notifier.

I think having a flag would provide the greatest flexibility but I agree
that the semantics of multiple listeners are kinda odd.

Below looks sane to me, though again, I'm not sitting in front of source
code.

Christian

> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 3ee59ce0a323..7b107207c2b0 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -376,6 +376,18 @@ static int is_ancestor(struct seccomp_filter *parent,
> return 0;
> }
>
> +static bool has_listener_parent(struct seccomp_filter *child)
> +{
> + struct seccomp_filter *cur;
> +
> + for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> + if (cur->notif)
> + return true;
> + }
> +
> + return false;
> +}
> +
> /**
> * seccomp_can_sync_threads: checks if all threads can be synchronized
> *
> @@ -385,7 +397,7 @@ static int is_ancestor(struct seccomp_filter *parent,
> * either not in the correct seccomp mode or did not have an ancestral
> * seccomp filter.
> */
> -static inline pid_t seccomp_can_sync_threads(void)
> +static inline pid_t seccomp_can_sync_threads(unsigned int flags)
> {
> struct task_struct *thread, *caller;
>
> @@ -407,6 +419,11 @@ static inline pid_t seccomp_can_sync_threads(void)
> caller->seccomp.filter)))
> continue;
>
> + /* don't allow TSYNC to install multiple listeners */
> + if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER &&
> + !has_listener_parent(thread->seccomp.filter))
> + continue;
> +
> /* Return the first thread that cannot be synchronized. */
> failed = task_pid_vnr(thread);
> /* If the pid cannot be resolved, then return -ESRCH */
> @@ -637,7 +654,7 @@ static long seccomp_attach_filter(unsigned int flags,
> if (flags & SECCOMP_FILTER_FLAG_TSYNC) {
> int ret;
>
> - ret = seccomp_can_sync_threads();
> + ret = seccomp_can_sync_threads(flags);
> if (ret) {
> if (flags & SECCOMP_FILTER_FLAG_TSYNC_ESRCH)
> return -ESRCH;
> @@ -1462,12 +1479,9 @@ static const struct file_operations seccomp_notify_ops = {
> static struct file *init_listener(struct seccomp_filter *filter)
> {
> struct file *ret = ERR_PTR(-EBUSY);
> - struct seccomp_filter *cur;
>
> - for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> - if (cur->notif)
> - goto out;
> - }
> + if (has_listener_parent(current->seccomp.filter))
> + goto out;
>
> ret = ERR_PTR(-ENOMEM);
> filter->notif = kzalloc(sizeof(*(filter->notif)), GFP_KERNEL);

2020-10-01 18:23:06

by Jann Horn

Subject: Re: For review: seccomp_user_notif(2) manual page

On Thu, Oct 1, 2020 at 6:58 PM Tycho Andersen <[email protected]> wrote:
> On Thu, Oct 01, 2020 at 05:47:54PM +0200, Jann Horn via Containers wrote:
> > On Thu, Oct 1, 2020 at 2:54 PM Christian Brauner
> > <[email protected]> wrote:
> > > On Wed, Sep 30, 2020 at 05:53:46PM +0200, Jann Horn via Containers wrote:
> > > > On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> > > > <[email protected]> wrote:
> > > > > NOTES
> > > > > The file descriptor returned when seccomp(2) is employed with the
> > > > > SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using
> > > > > poll(2), epoll(7), and select(2). When a notification is pend‐
> > > > > ing, these interfaces indicate that the file descriptor is read‐
> > > > > able.
> > > >
> > > > We should probably also point out somewhere that, as
> > > > include/uapi/linux/seccomp.h says:
> > > >
> > > > * Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF
> > > > * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
> > > > * same syscall, the most recently added filter takes precedence. This means
> > > > * that the new SECCOMP_RET_USER_NOTIF filter can override any
> > > > * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
> > > > * such filtered syscalls to be executed by sending the response
> > > > * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally
> > > > * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
> > > >
> > > > In other words, from a security perspective, you must assume that the
> > > > target process can bypass any SECCOMP_RET_USER_NOTIF (or
> > > > SECCOMP_RET_TRACE) filters unless it is completely prohibited from
> > > > calling seccomp(). This should also be noted over in the main
> > > > seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.
> > >
> > > So I was actually wondering about this when I skimmed this and a while
> > > ago but forgot about this again... Afaict, you can only ever load a
> > > single filter with SECCOMP_FILTER_FLAG_NEW_LISTENER set. If there
> > > already is a filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER property
> > > in the tasks filter hierarchy then the kernel will refuse to load a new
> > > one?
> > >
> > > static struct file *init_listener(struct seccomp_filter *filter)
> > > {
> > > struct file *ret = ERR_PTR(-EBUSY);
> > > struct seccomp_filter *cur;
> > >
> > > for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> > > if (cur->notif)
> > > goto out;
> > > }
> > >
> > > shouldn't that be sufficient to guarantee that USER_NOTIF filters can't
> > > override each other for the same task simply because there can only ever
> > > be a single one?
> >
> > Good point. Exceeeept that that check seems ineffective because this
> > happens before we take the locks that guard against TSYNC, and also
> > before we decide to which existing filter we want to chain the new
> > filter. So if two threads race with TSYNC, I think they'll be able to
> > chain two filters with listeners together.
>
> Yep, seems the check needs to also be in seccomp_can_sync_threads() to
> be totally effective,
>
> > I don't know whether we want to eternalize this "only one listener
> > across all the filters" restriction in the manpage though, or whether
> > the man page should just say that the kernel currently doesn't support
> > it but that security-wise you should assume that it might at some
> > point.
>
> This requirement originally came from Andy, arguing that the semantics
> of this were/are confusing, which still makes sense to me. Perhaps we
> should do something like the below?
[...]
> +static bool has_listener_parent(struct seccomp_filter *child)
> +{
> + struct seccomp_filter *cur;
> +
> + for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> + if (cur->notif)
> + return true;
> + }
> +
> + return false;
> +}
[...]
> @@ -407,6 +419,11 @@ static inline pid_t seccomp_can_sync_threads(void)
[...]
> + /* don't allow TSYNC to install multiple listeners */
> + if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER &&
> + !has_listener_parent(thread->seccomp.filter))
> + continue;
[...]
> @@ -1462,12 +1479,9 @@ static const struct file_operations seccomp_notify_ops = {
> static struct file *init_listener(struct seccomp_filter *filter)
[...]
> - for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> - if (cur->notif)
> - goto out;
> - }
> + if (has_listener_parent(current->seccomp.filter))
> + goto out;

I dislike this because it combines a non-locked check and a locked
check. And I don't think this will work in the case where TSYNC and
non-TSYNC race - if the non-TSYNC call nests around the TSYNC filter
installation, the thread that called seccomp in non-TSYNC mode will
still end up with two notifying filters. How about the following?


diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 676d4af62103..c49ad8ba0bc1 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1475,11 +1475,6 @@ static struct file *init_listener(struct seccomp_filter *filter)
 	struct file *ret = ERR_PTR(-EBUSY);
 	struct seccomp_filter *cur;

-	for (cur = current->seccomp.filter; cur; cur = cur->prev) {
-		if (cur->notif)
-			goto out;
-	}
-
 	ret = ERR_PTR(-ENOMEM);
 	filter->notif = kzalloc(sizeof(*(filter->notif)), GFP_KERNEL);
 	if (!filter->notif)
@@ -1504,6 +1499,31 @@ static struct file *init_listener(struct seccomp_filter *filter)
 	return ret;
 }

+/*
+ * Does @new_child have a listener while an ancestor also has a listener?
+ * If so, we'll want to reject this filter.
+ * This only has to be tested for the current process, even in the TSYNC case,
+ * because TSYNC installs @child with the same parent on all threads.
+ * Note that @new_child is not hooked up to its parent at this point yet, so
+ * we use current->seccomp.filter.
+ */
+static bool has_duplicate_listener(struct seccomp_filter *new_child)
+{
+	struct seccomp_filter *cur;
+
+	/* must be protected against concurrent TSYNC */
+	lockdep_assert_held(&current->sighand->siglock);
+
+	if (!new_child->notif)
+		return false;
+	for (cur = current->seccomp.filter; cur; cur = cur->prev) {
+		if (cur->notif)
+			return true;
+	}
+
+	return false;
+}
+
 /**
  * seccomp_set_mode_filter: internal function for setting seccomp filter
  * @flags: flags to change filter behavior
@@ -1575,6 +1595,9 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	if (!seccomp_may_assign_mode(seccomp_mode))
 		goto out;

+	if (has_duplicate_listener(prepared))
+		goto out;
+
 	ret = seccomp_attach_filter(flags, prepared);
 	if (ret)
 		goto out;
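
For anyone following the locking argument, here is a minimal user-space
sketch of the idea in the patch above: test for an existing listener and
attach the new filter as one step. It is a model only, not kernel code.
The names struct model_filter, struct model_task,
model_has_duplicate_listener() and model_attach() are hypothetical, the
siglock is represented by a comment rather than a real lock, and
returning -EBUSY is this sketch's own choice (it mirrors the error the
old check in init_listener() used).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct seccomp_filter: only what the check needs. */
struct model_filter {
	struct model_filter *prev;	/* next-older filter in the chain */
	bool has_listener;		/* models filter->notif != NULL */
};

/* Hypothetical stand-in for the per-task seccomp state. */
struct model_task {
	struct model_filter *filter;	/* newest installed filter, or NULL */
};

/* True if new_child has a listener and some installed filter has one too. */
static bool model_has_duplicate_listener(const struct model_task *task,
					 const struct model_filter *new_child)
{
	const struct model_filter *cur;

	assert(task != NULL && new_child != NULL);

	if (!new_child->has_listener)
		return false;
	for (cur = task->filter; cur; cur = cur->prev) {
		if (cur->has_listener)
			return true;
	}
	return false;
}

/*
 * Check and attach in one step. In the kernel, both parts would have to
 * run with the siglock held so that a concurrent TSYNC cannot change
 * task->filter in between; this single-threaded model has no lock.
 */
static int model_attach(struct model_task *task, struct model_filter *new_child)
{
	if (model_has_duplicate_listener(task, new_child))
		return -EBUSY;	/* assumption: error code chosen for the sketch */
	new_child->prev = task->filter;
	task->filter = new_child;
	return 0;
}
```

In this model, attaching a second filter with has_listener set fails no
matter how many listener-less filters were stacked in between, which is
the property the unlocked check in init_listener() could not guarantee.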

2020-10-01 19:02:33

by Tycho Andersen

Subject: Re: For review: seccomp_user_notif(2) manual page

On Thu, Oct 01, 2020 at 08:18:49PM +0200, Jann Horn wrote:
> On Thu, Oct 1, 2020 at 6:58 PM Tycho Andersen <[email protected]> wrote:
> > On Thu, Oct 01, 2020 at 05:47:54PM +0200, Jann Horn via Containers wrote:
> > > On Thu, Oct 1, 2020 at 2:54 PM Christian Brauner
> > > <[email protected]> wrote:
> > > > On Wed, Sep 30, 2020 at 05:53:46PM +0200, Jann Horn via Containers wrote:
> > > > > On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> > > > > <[email protected]> wrote:
> > > > > > NOTES
> > > > > > The file descriptor returned when seccomp(2) is employed with the
> > > > > > SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using
> > > > > > poll(2), epoll(7), and select(2). When a notification is pend‐
> > > > > > ing, these interfaces indicate that the file descriptor is read‐
> > > > > > able.
> > > > >
> > > > > We should probably also point out somewhere that, as
> > > > > include/uapi/linux/seccomp.h says:
> > > > >
> > > > > * Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF
> > > > > * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
> > > > > * same syscall, the most recently added filter takes precedence. This means
> > > > > * that the new SECCOMP_RET_USER_NOTIF filter can override any
> > > > > * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
> > > > > * such filtered syscalls to be executed by sending the response
> > > > > * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally
> > > > > * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
> > > > >
> > > > > In other words, from a security perspective, you must assume that the
> > > > > target process can bypass any SECCOMP_RET_USER_NOTIF (or
> > > > > SECCOMP_RET_TRACE) filters unless it is completely prohibited from
> > > > > calling seccomp(). This should also be noted over in the main
> > > > > seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.
> > > >
> > > > So I was actually wondering about this when I skimmed this and a while
> > > > ago but forgot about this again... Afaict, you can only ever load a
> > > > single filter with SECCOMP_FILTER_FLAG_NEW_LISTENER set. If there
> > > > already is a filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER property
> > > > in the tasks filter hierarchy then the kernel will refuse to load a new
> > > > one?
> > > >
> > > > static struct file *init_listener(struct seccomp_filter *filter)
> > > > {
> > > > struct file *ret = ERR_PTR(-EBUSY);
> > > > struct seccomp_filter *cur;
> > > >
> > > > for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> > > > if (cur->notif)
> > > > goto out;
> > > > }
> > > >
> > > > shouldn't that be sufficient to guarantee that USER_NOTIF filters can't
> > > > override each other for the same task simply because there can only ever
> > > > be a single one?
> > >
> > > Good point. Exceeeept that that check seems ineffective because this
> > > happens before we take the locks that guard against TSYNC, and also
> > > before we decide to which existing filter we want to chain the new
> > > filter. So if two threads race with TSYNC, I think they'll be able to
> > > chain two filters with listeners together.
> >
> > Yep, seems the check needs to also be in seccomp_can_sync_threads() to
> > be totally effective,
> >
> > > I don't know whether we want to eternalize this "only one listener
> > > across all the filters" restriction in the manpage though, or whether
> > > the man page should just say that the kernel currently doesn't support
> > > it but that security-wise you should assume that it might at some
> > > point.
> >
> > This requirement originally came from Andy, arguing that the semantics
> > of this were/are confusing, which still makes sense to me. Perhaps we
> > should do something like the below?
> [...]
> > +static bool has_listener_parent(struct seccomp_filter *child)
> > +{
> > + struct seccomp_filter *cur;
> > +
> > + for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> > + if (cur->notif)
> > + return true;
> > + }
> > +
> > + return false;
> > +}
> [...]
> > @@ -407,6 +419,11 @@ static inline pid_t seccomp_can_sync_threads(void)
> [...]
> > + /* don't allow TSYNC to install multiple listeners */
> > + if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER &&
> > + !has_listener_parent(thread->seccomp.filter))
> > + continue;
> [...]
> > @@ -1462,12 +1479,9 @@ static const struct file_operations seccomp_notify_ops = {
> > static struct file *init_listener(struct seccomp_filter *filter)
> [...]
> > - for (cur = current->seccomp.filter; cur; cur = cur->prev) {
> > - if (cur->notif)
> > - goto out;
> > - }
> > + if (has_listener_parent(current->seccomp.filter))
> > + goto out;
>
> I dislike this because it combines a non-locked check and a locked
> check. And I don't think this will work in the case where TSYNC and
> non-TSYNC race - if the non-TSYNC call nests around the TSYNC filter
> installation, the thread that called seccomp in non-TSYNC mode will
> still end up with two notifying filters. How about the following?

Sure, you can add,

Reviewed-by: Tycho Andersen <[email protected]>

when you send it.

Tycho

2020-10-01 21:10:22

by Sargun Dhillon

Subject: Re: For review: seccomp_user_notif(2) manual page

On Wed, Sep 30, 2020 at 4:07 AM Michael Kerrisk (man-pages)
<[email protected]> wrote:
>
> Hi Tycho, Sargun (and all),
>
> I knew it would be a big ask, but below is kind of the manual page
> I was hoping you might write [1] for the seccomp user-space notification
> mechanism. Since you didn't (and because 5.9 adds various new pieces
> such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
> that also will need documenting [2]), I did :-). But of course I may
> have made mistakes...
>
> I've shown the rendered version of the page below, and would love
> to receive review comments from you and others, and acks, etc.
>
> There are a few FIXMEs sprinkled into the page, including one
> that relates to what appears to me to be a misdesign (possibly
> fixable) in the operation of the SECCOMP_IOCTL_NOTIF_RECV
> operation. I would be especially interested in feedback on that
> FIXME, and also of course the other FIXMEs.
>
> The page includes an extensive (albeit slightly contrived)
> example program, and I would be happy also to receive comments
> on that program.
>
> The page source currently sits in a branch (along with the text
> that you sent me for the seccomp(2) page) at
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif
>
> Thanks,
>
> Michael
>
> [1] https://lore.kernel.org/linux-man/[email protected]/#t
> [2] Sargun, can you prepare something on SECCOMP_ADDFD_FLAG_SETFD
> and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page?
>
> ====
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/

Should we consider the SECCOMP_GET_NOTIF_SIZES dance to be "deprecated" at
this point, given that the extensible ioctl mechanism works? If we add
new fields to the seccomp data structures, would we move them from
fixed-size ioctls to variable-sized ioctls that encode the data structure
size / length?

-- This is mostly a question for Kees and Tycho.

2020-10-01 23:25:07

by Tycho Andersen

Subject: Re: For review: seccomp_user_notif(2) manual page

On Thu, Oct 01, 2020 at 02:06:10PM -0700, Sargun Dhillon wrote:
> On Wed, Sep 30, 2020 at 4:07 AM Michael Kerrisk (man-pages)
> <[email protected]> wrote:
> >
> > Hi Tycho, Sargun (and all),
> >
> > I knew it would be a big ask, but below is kind of the manual page
> > I was hoping you might write [1] for the seccomp user-space notification
> > mechanism. Since you didn't (and because 5.9 adds various new pieces
> > such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
> > that also will need documenting [2]), I did :-). But of course I may
> > have made mistakes...
> >
> > I've shown the rendered version of the page below, and would love
> > to receive review comments from you and others, and acks, etc.
> >
> > There are a few FIXMEs sprinkled into the page, including one
> > that relates to what appears to me to be a misdesign (possibly
> > fixable) in the operation of the SECCOMP_IOCTL_NOTIF_RECV
> > operation. I would be especially interested in feedback on that
> > FIXME, and also of course the other FIXMEs.
> >
> > The page includes an extensive (albeit slightly contrived)
> > example program, and I would be happy also to receive comments
> > on that program.
> >
> > The page source currently sits in a branch (along with the text
> > that you sent me for the seccomp(2) page) at
> > https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif
> >
> > Thanks,
> >
> > Michael
> >
> > [1] https://lore.kernel.org/linux-man/[email protected]/#t
> > [2] Sargun, can you prepare something on SECCOMP_ADDFD_FLAG_SETFD
> > and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page?
> >
> > ====
> >
> > --
> > Michael Kerrisk
> > Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> > Linux/UNIX System Programming Training: http://man7.org/training/
>
> Should we consider the SECCOMP_GET_NOTIF_SIZES dance to be "deprecated" at
> this point, given that the extensible ioctl mechanism works? If we add
> new fields to the
> seccomp datastructures, we would move them from fixed-size ioctls, to
> variable sized
> ioctls that encode the datastructure size / length?
>
> -- This is mostly a question for Kees and Tycho.

It will tell you how big struct seccomp_data in the currently running
kernel is, so it still seems useful/necessary to me, unless there's
another way to figure that out.

But I agree, I don't think the intent is to add anything else to
struct seccomp_notif. (I don't know that it ever was.)

Tycho
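
As a concrete illustration of the point above, the sketch below asks the
running kernel for its structure sizes, including that of struct
seccomp_data. It assumes kernel headers recent enough to define
SECCOMP_GET_NOTIF_SIZES and struct seccomp_notif_sizes; the helper name
query_notif_sizes() is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Fill *sizes with the sizes the running kernel uses for
 * struct seccomp_notif, struct seccomp_notif_resp and struct seccomp_data.
 * Returns 0 on success, or -1 with errno set on failure.
 */
static int query_notif_sizes(struct seccomp_notif_sizes *sizes)
{
	assert(sizes != NULL);

	/* glibc provides no seccomp() wrapper, so go through syscall(2). */
	return (int) syscall(SYS_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, sizes);
}
```

A supervisor could call this once at startup and size its notification
and response buffers from sizes->seccomp_notif and
sizes->seccomp_notif_resp rather than from sizeof() at compile time.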

Subject: Re: For review: seccomp_user_notif(2) manual page

Hi Tycho,

Ping on the question below!

Thanks,

Michael

On 10/1/20 9:45 AM, Michael Kerrisk (man-pages) wrote:
> On 10/1/20 1:03 AM, Tycho Andersen wrote:
>> On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
>>> Hi Tycho,
>>>
>>> Thanks for taking time to look at the page!
>>>
>>> On 9/30/20 5:03 PM, Tycho Andersen wrote:
>>>> On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
>
> [...]
>
>>>>> ┌─────────────────────────────────────────────────────┐
>>>>> │FIXME │
>>>>> ├─────────────────────────────────────────────────────┤
>>>>> │Interestingly, after the event had been received, │
>>>>> │the file descriptor indicates as writable (verified │
>>>>> │from the source code and by experiment). How is this │
>>>>> │useful? │
>>>>
>>>> You're saying it should just do EPOLLOUT and not EPOLLWRNORM? Seems
>>>> reasonable.
>>>
>>> No, I'm saying something more fundamental: why is the FD indicating as
>>> writable? Can you write something to it? If yes, what? If not, then
>>> why do these APIs want to say that the FD is writable?
>>
>> You can't via read(2) or write(2), but conceptually NOTIFY_RECV and
>> NOTIFY_SEND are reading and writing events from the fd. I don't know
>> that much about the poll interface though -- is it possible to
>> indicate "here's a pseudo-read event"? It didn't look like it, so I
>> just (ab-)used POLLIN and POLLOUT, but probably that's wrong.
>
> I think the POLLIN thing is fine.
>
> So, I think maybe I now understand what you intended with setting
> POLLOUT: the notification has been received ("read") and now the
> FD can be used to NOTIFY_SEND ("write") a response. Right?
>
> If that's correct, I don't have a problem with it. I just wonder:
> is it useful? IOW: are there situations where the process doing the
> NOTIFY_SEND might want to test for POLLOUT because it doesn't
> know whether a NOTIFY_RECV has occurred?
>
> Thanks,
>
> Michael
>
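
To make the question concrete, here is a minimal sketch of how a
supervisor might look at both conditions in one poll(2) call. It assumes
fd is the listening descriptor returned by seccomp(2) with
SECCOMP_FILTER_FLAG_NEW_LISTENER, although the helper itself works on any
pollable descriptor; the name poll_notify_fd() is hypothetical. Per the
discussion above, POLLIN would mean a notification is pending and POLLOUT
that one has been received and is awaiting a response.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/*
 * Poll fd once for both readability and writability.
 * Returns the revents mask (0 if the timeout expired with nothing
 * ready), or -1 on error with errno set.
 */
static int poll_notify_fd(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN | POLLOUT };
	int n = poll(&pfd, 1, timeout_ms);

	if (n < 0)
		return -1;
	return n == 0 ? 0 : pfd.revents;
}
```

A caller would test the result against POLLIN before doing
SECCOMP_IOCTL_NOTIF_RECV, and could test it against POLLOUT before
SECCOMP_IOCTL_NOTIF_SEND if it has no other record of whether a
notification has been received.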


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Subject: Re: For review: seccomp_user_notif(2) manual page

On 10/1/20 7:12 PM, Christian Brauner wrote:
> On Thu, Oct 01, 2020 at 10:58:50AM -0600, Tycho Andersen wrote:
>> On Thu, Oct 01, 2020 at 05:47:54PM +0200, Jann Horn via Containers wrote:
>>> On Thu, Oct 1, 2020 at 2:54 PM Christian Brauner
>>> <[email protected]> wrote:
>>>> On Wed, Sep 30, 2020 at 05:53:46PM +0200, Jann Horn via Containers wrote:
>>>>> On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
>>>>> <[email protected]> wrote:
>>>>>> NOTES
>>>>>> The file descriptor returned when seccomp(2) is employed with the
>>>>>> SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using
>>>>>> poll(2), epoll(7), and select(2). When a notification is pend‐
>>>>>> ing, these interfaces indicate that the file descriptor is read‐
>>>>>> able.
>>>>>
>>>>> We should probably also point out somewhere that, as
>>>>> include/uapi/linux/seccomp.h says:
>>>>>
>>>>> * Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF
>>>>> * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
>>>>> * same syscall, the most recently added filter takes precedence. This means
>>>>> * that the new SECCOMP_RET_USER_NOTIF filter can override any
>>>>> * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
>>>>> * such filtered syscalls to be executed by sending the response
>>>>> * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally
>>>>> * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
>>>>>
>>>>> In other words, from a security perspective, you must assume that the
>>>>> target process can bypass any SECCOMP_RET_USER_NOTIF (or
>>>>> SECCOMP_RET_TRACE) filters unless it is completely prohibited from
>>>>> calling seccomp(). This should also be noted over in the main
>>>>> seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.
>>>>
>>>> So I was actually wondering about this when I skimmed this and a while
>>>> ago but forgot about this again... Afaict, you can only ever load a
>>>> single filter with SECCOMP_FILTER_FLAG_NEW_LISTENER set. If there
>>>> already is a filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER property
>>>> in the tasks filter hierarchy then the kernel will refuse to load a new
>>>> one?
>>>>
>>>> static struct file *init_listener(struct seccomp_filter *filter)
>>>> {
>>>> struct file *ret = ERR_PTR(-EBUSY);
>>>> struct seccomp_filter *cur;
>>>>
>>>> for (cur = current->seccomp.filter; cur; cur = cur->prev) {
>>>> if (cur->notif)
>>>> goto out;
>>>> }
>>>>
>>>> shouldn't that be sufficient to guarantee that USER_NOTIF filters can't
>>>> override each other for the same task simply because there can only ever
>>>> be a single one?
>>>
>>> Good point. Exceeeept that that check seems ineffective because this
>>> happens before we take the locks that guard against TSYNC, and also
>>> before we decide to which existing filter we want to chain the new
>>> filter. So if two threads race with TSYNC, I think they'll be able to
>>> chain two filters with listeners together.
>>
>> Yep, seems the check needs to also be in seccomp_can_sync_threads() to
>> be totally effective,
>>
>>> I don't know whether we want to eternalize this "only one listener
>>> across all the filters" restriction in the manpage though, or whether
>>> the man page should just say that the kernel currently doesn't support
>>> it but that security-wise you should assume that it might at some
>>> point.
>>
>> This requirement originally came from Andy, arguing that the semantics
>> of this were/are confusing, which still makes sense to me. Perhaps we
>> should do something like the below?
>
> I think we should either keep up this restriction and then cement it in
> the manpage or add a flag to indicate that the notifier is
> non-overridable.
> I don't care about the default too much, i.e. whether it's overridable
> by default and exclusive if opting in or the other way around doesn't
> matter too much. But from a supervisor's perspective it'd be quite nice
> to be able to be sure that a notifier can't be overridden by another
> notifier.
>
> I think having a flag would provide the greatest flexibility but I agree
> that the semantics of multiple listeners are kinda odd.

So, for now, I have applied the patch at the foot of this mail
to the pages. Does this seem correct?

> Below looks sane to me though again, I'm not sitting in front of source
> code.
[...]

Thanks,

Michael

PS Jann, if you see this, I'm still working through your (extensive
and very helpful) review comments. I will be sending a response.

======

diff --git a/man2/seccomp.2 b/man2/seccomp.2
index 9ab07f4ab..45a6984df 100644
--- a/man2/seccomp.2
+++ b/man2/seccomp.2
@@ -221,6 +221,11 @@ return a new user-space notification file descriptor.
When the filter returns
.BR SECCOMP_RET_USER_NOTIF
a notification will be sent to this file descriptor.
+.IP
+At most one seccomp filter using the
+.BR SECCOMP_FILTER_FLAG_NEW_LISTENER
+flag can be installed for a thread.
+.IP
See
.BR seccomp_user_notif (2)
for further details.
@@ -789,6 +794,12 @@ capability in its user namespace, or had not set
before using
.BR SECCOMP_SET_MODE_FILTER .
.TP
+.BR EBUSY
+While installing a new filter, the
+.BR SECCOMP_FILTER_FLAG_NEW_LISTENER
+flag was specified,
+but a previous filter had already been installed with that flag.
+.TP
.BR EFAULT
.IR args
was not a valid address.
diff --git a/man2/seccomp_user_notif.2 b/man2/seccomp_user_notif.2
index a6025e4d4..d1a406f46 100644
--- a/man2/seccomp_user_notif.2
+++ b/man2/seccomp_user_notif.2
@@ -92,6 +92,7 @@ Consequently, the return value of the (successful)
.BR seccomp (2)
call is a new "listening"
file descriptor that can be used to receive notifications.
+Only one such "listener" can be established.
.IP \(bu
In cases where it is appropriate, the seccomp filter returns the action value
.BR SECCOMP_RET_USER_NOTIF .

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Subject: Re: For review: seccomp_user_notif(2) manual page

Hi Jann,

So, first off, thank you for the detailed review. I really
appreciate it! I've changed various pieces, and still have
a few questions below.

On 9/30/20 5:53 PM, Jann Horn wrote:
> On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> <[email protected]> wrote:
>> I knew it would be a big ask, but below is kind of the manual page
>> I was hoping you might write [1] for the seccomp user-space notification
>> mechanism. Since you didn't (and because 5.9 adds various new pieces
>> such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
>> that also will need documenting [2]), I did :-). But of course I may
>> have made mistakes...
> [...]
>> NAME
>> seccomp_user_notif - Seccomp user-space notification mechanism
>>
>> SYNOPSIS
>> #include <linux/seccomp.h>
>> #include <linux/filter.h>
>> #include <linux/audit.h>
>>
>> int seccomp(unsigned int operation, unsigned int flags, void *args);
>
> Should the ioctl() calls be listed here, similar to e.g. the SYNOPSIS
> of the ioctl_* manpages?

Yes, good idea. I added:

int ioctl(int fd, SECCOMP_IOCTL_NOTIF_RECV,
struct seccomp_notif *req);
int ioctl(int fd, SECCOMP_IOCTL_NOTIF_SEND,
struct seccomp_notif_resp *req);
int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ID_VALID, __u64 *id);
>
>> DESCRIPTION
>> This page describes the user-space notification mechanism pro‐
>> vided by the Secure Computing (seccomp) facility. As well as the
>> use of the SECCOMP_FILTER_FLAG_NEW_LISTENER flag, the SEC‐
>> COMP_RET_USER_NOTIF action value, and the SECCOMP_GET_NOTIF_SIZES
>> operation described in seccomp(2), this mechanism involves the
>> use of a number of related ioctl(2) operations (described below).
>>
>> Overview
>> In conventional usage of a seccomp filter, the decision about how
>> to treat a particular system call is made by the filter itself.
>> The user-space notification mechanism allows the handling of the
>> system call to instead be handed off to a user-space process.
>> The advantages of doing this are that, by contrast with the sec‐
>> comp filter, which is running on a virtual machine inside the
>> kernel, the user-space process has access to information that is
>> unavailable to the seccomp filter and it can perform actions that
>> can't be performed from the seccomp filter.
>>
>> In the discussion that follows, the process that has installed
>> the seccomp filter is referred to as the target, and the process
>
> Technically, this definition of "target" is a bit inaccurate because:
>
> - seccomp filters are inherited
> - seccomp filters apply to threads, not processes
> - seccomp filters can be semi-remotely installed via TSYNC

(Nice summary.)

> (I assume that in manpages, we should try to go for the "a task is a
> thread and a thread group is a process" definition, right?)

Exactly.

> Perhaps "the threads on which the seccomp filter is installed are
> referred to as the target", or something like that would be better?

Thanks. It's always hugely helpful to get a suggested wording, even
if I still feel the need to rework it (which I don't in this case).
The sentence now reads:

In the discussion that follows, the thread(s) on which the seccomp
filter is installed are referred to as the target, and the process
that is notified by the user-space notification mechanism is
referred to as the supervisor.

>> that is notified by the user-space notification mechanism is
>> referred to as the supervisor. An overview of the steps per‐
>> formed by these two processes is as follows:
>>
>> 1. The target process establishes a seccomp filter in the usual
>> manner, but with two differences:
>>
>> · The seccomp(2) flags argument includes the flag SECCOMP_FIL‐
>> TER_FLAG_NEW_LISTENER. Consequently, the return value of
>> the (successful) seccomp(2) call is a new "listening" file
>> descriptor that can be used to receive notifications.
>>
>> · In cases where it is appropriate, the seccomp filter returns
>> the action value SECCOMP_RET_USER_NOTIF. This return value
>> will trigger a notification event.
>>
>> 2. In order that the supervisor process can obtain notifications
>> using the listening file descriptor, (a duplicate of) that
>> file descriptor must be passed from the target process to the
>> supervisor process. One way in which this could be done is by
>> passing the file descriptor over a UNIX domain socket connec‐
>> tion between the two processes (using the SCM_RIGHTS ancillary
>> message type described in unix(7)). Another possibility is
>> that the supervisor might inherit the file descriptor via
>> fork(2).
>
> With the caveat that if the supervisor inherits the file descriptor
> via fork(), that (more or less) implies that the supervisor is subject
> to the same filter (although it could bypass the filter using a helper
> thread that responds SECCOMP_USER_NOTIF_FLAG_CONTINUE, but I don't
> expect any clean software to do that).

It's a good thing no one ever writes unclean software...

Thanks for catching this; Tycho did also. It was a thinko on my part
to forget that if one used fork(), the supervisor would inherit the
filter. I've simply removed the sentence mentioning fork().


>> 3. The supervisor process will receive notification events on the
>> listening file descriptor. These events are returned as
>> structures of type seccomp_notif. Because this structure and
>> its size may evolve over kernel versions, the supervisor must
>> first determine the size of this structure using the sec‐
>> comp(2) SECCOMP_GET_NOTIF_SIZES operation, which returns a
>> structure of type seccomp_notif_sizes. The supervisor allo‐
>> cates a buffer of size seccomp_notif_sizes.seccomp_notif bytes
>> to receive notification events. In addition, the supervisor
>> allocates another buffer of size seccomp_notif_sizes.sec‐
>> comp_notif_resp bytes for the response (a struct sec‐
>> comp_notif_resp structure) that it will provide to the kernel
>> (and thus the target process).
>>
>> 4. The target process then performs its workload, which includes
>> system calls that will be controlled by the seccomp filter.
>> Whenever one of these system calls causes the filter to return
>> the SECCOMP_RET_USER_NOTIF action value, the kernel does not
>> execute the system call; instead, execution of the target
>> process is temporarily blocked inside the kernel and a notifi‐
>
> where "blocked" refers to the interruptible, restartable kind - if the
> child receives a signal with an SA_RESTART signal handler in the
> meantime, it'll leave the syscall, go through the signal handler, then
> restart the syscall again and send the same request to the supervisor
> again. so the supervisor may see duplicate syscalls.

So, I partially demonstrated what you describe here, for two example
system calls (epoll_wait() and pause()). But I could not exactly
demonstrate things as I understand you to be describing them. (So,
I'm not sure whether I have not understood you correctly, or
if things are not exactly as you describe them.)

Here's a scenario (A) that I tested:

1. Target installs seccomp filters for a blocking syscall
(epoll_wait() or pause(), both of which should never restart,
regardless of SA_RESTART)
2. Target installs SIGINT handler with SA_RESTART
3. Supervisor is sleeping (i.e., is not blocked in
SECCOMP_IOCTL_NOTIF_RECV operation).
4. Target makes a blocking system call (epoll_wait() or pause()).
5. SIGINT gets delivered to target; handler gets called;
***and syscall gets restarted by the kernel***

That last should never happen, of course, and is a result of the
combination of both the user-notify filter and the SA_RESTART flag.
If one or other is not present, then the system call is not
restarted.

So, as you note below, the UAPI gets broken a little.

However, from your description above I had understood that
something like the following scenario (B) could occur:

1. Target installs seccomp filters for a blocking syscall
(epoll_wait() or pause(), both of which should never restart,
regardless of SA_RESTART)
2. Target installs SIGINT handler with SA_RESTART
3. Supervisor performs SECCOMP_IOCTL_NOTIF_RECV operation (which
blocks).
4. Target makes a blocking system call (epoll_wait() or pause()).
5. Supervisor gets seccomp user-space notification (i.e.,
SECCOMP_IOCTL_NOTIF_RECV ioctl() returns)
6. SIGINT gets delivered to target; handler gets called;
and syscall gets restarted by the kernel
7. Supervisor performs another SECCOMP_IOCTL_NOTIF_RECV operation
which gets another notification for the restarted system call.

However, I don't observe such behavior. In step 6, the syscall
does not get restarted by the kernel, but instead returns -1/EINTR.
Perhaps I have misconstructed my experiment in the second case, or
perhaps I've misunderstood what you meant, or is it possibly the
case that things are not quite as you said?

> What's really gross here is that signal(7) promises that some syscalls
> like epoll_wait(2) never restart, but seccomp doesn't know about that;
> if userspace installs a filter that uses SECCOMP_RET_USER_NOTIF for a
> non-restartable syscall, the result is that UAPI gets broken a little
> bit. Luckily normal users of seccomp probably won't use
> SECCOMP_RET_USER_NOTIF for restartable syscalls, but if someone does
> want to do that, we might have to add some "suppress syscall
> restarting" flag into the seccomp action value, or something like
> that... yuck.

Yes, the UAPI breakage is a bit sad (although, likely to be rarely
encountered, as you note). I'm inclined to add a note about this in
BUGS, but beforehand I'm interested in hearing your thoughts on
scenario B above.

>> cation event is generated on the listening file descriptor.
>>
>> 5. The supervisor process can now repeatedly monitor the listen‐
>> ing file descriptor for SECCOMP_RET_USER_NOTIF-triggered
>> events. To do this, the supervisor uses the SEC‐
>> COMP_IOCTL_NOTIF_RECV ioctl(2) operation to read information
>> about a notification event; this operation blocks until an
>
> (interruptibly - but I guess that maybe doesn't have to be said
> explicitly here?)

Yes, I think so. The general assumption is that syscalls block
interruptibly, unless text in a manual page says
"uninterruptible". (Postscript: Christian made a similar comment,
so I decided to explicitly note that it's an interruptible sleep.)

>> event is available.
>
> Maybe we should note here that you can use the multi-fd-polling APIs
> (select/poll/epoll) instead, and that if the notification goes away
> before you call SECCOMP_IOCTL_NOTIF_RECV, the ioctl will return
> -ENOENT instead of blocking, and therefore as long as nobody else
> reads from the same fd, you can assume that after the fd reports as
> readable, you can call SECCOMP_IOCTL_NOTIF_RECV once without blocking.

I'd rather not add this info in the overview section, which is
already longer than I would like. But I did add some details
in NOTES:

[[
The file descriptor returned when seccomp(2) is employed with the
SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using
poll(2), epoll(7), and select(2). When a notification is pending,
these interfaces indicate that the file descriptor is readable.
Following such an indication, a subsequent SEC‐
COMP_IOCTL_NOTIF_RECV ioctl(2) will not block, returning either
information about a notification or else failing with the error
ENOENT if the target process has been killed by a signal or its
system call has been interrupted by a signal handler.
]]

Okay?
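
To make the polling part concrete, here is a minimal sketch of waiting for
readability before issuing SECCOMP_IOCTL_NOTIF_RECV. The helper names
wait_readable() and demo_with_pipe() are invented for this illustration,
and the demo uses a pipe as a stand-in for the listening file descriptor,
on the assumption that poll(2) reports POLLIN for both in the same way.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait until 'fd' is reported readable or 'timeout_ms' elapses.
 * Returns 1 if readable, 0 on timeout, and -1 on error or if only
 * POLLERR/POLLHUP/POLLNVAL was reported.
 */
int
wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n;

    /* Simplification: the full timeout starts over after EINTR */
    do {
        n = poll(&pfd, 1, timeout_ms);
    } while (n == -1 && errno == EINTR);

    if (n <= 0)
        return n;               /* 0: timeout; -1: error */

    return (pfd.revents & POLLIN) ? 1 : -1;
}

/* Exercise wait_readable() on a pipe standing in for the listener */
void
demo_with_pipe(void)
{
    int p[2], rc;
    ssize_t w;

    rc = pipe(p);
    assert(rc == 0);

    rc = wait_readable(p[0], 10);       /* Nothing written yet */
    assert(rc == 0);

    w = write(p[1], "x", 1);
    assert(w == 1);

    rc = wait_readable(p[0], 10);       /* One byte is now pending */
    assert(rc == 1);

    close(p[0]);
    close(p[1]);
}
```

A supervisor would call wait_readable() on the listening file descriptor
and perform the SECCOMP_IOCTL_NOTIF_RECV operation only after it returns 1.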

> Exceeeeept that this part looks broken:
>
> if (mutex_lock_interruptible(&filter->notify_lock) < 0)
> return EPOLLERR;
>
> which I think means that we can have a race where a signal arrives
> while poll() is trying to add itself to the waitqueue of the seccomp
> fd, and then we'll get a spurious error condition reported on the fd.
> That's a kernel bug, I'd say.

Sigh... Writing documentation helps find bugs. Who knew?

>> The operation returns a seccomp_notif
>> structure containing information about the system call that is
>> being attempted by the target process.
>>
>> 6. The seccomp_notif structure returned by the SEC‐
>> COMP_IOCTL_NOTIF_RECV operation includes the same information
>> (a seccomp_data structure) that was passed to the seccomp fil‐
>> ter. This information allows the supervisor to discover the
>> system call number and the arguments for the target process's
>> system call. In addition, the notification event contains the
>> PID of the target process.
>
> That's a PIDTYPE_PID, which the manpages call a "thread ID".

Yes. Fixed now. More generally, I've swept through the page replacing
various instances of "target process" with either "target thread", or
often just "target".

>> The information in the notification can be used to discover
>> the values of pointer arguments for the target process's sys‐
>> tem call. (This is something that can't be done from within a
>> seccomp filter.) To do this (and assuming it has suitable
>> permissions), the supervisor opens the corresponding
>> /proc/[pid]/mem file,
>
> ... which means that here we might have to get into the weeds of how
> actually /proc has invisible directories for every TID, even though
> only the ones for PIDs are visible, and therefore you can just open
> /proc/[tid]/mem and it'll work fine?

I myself was unaware of this for years until I *accidentally* made use
of the feature in one of my test programs and then a while later got to
asking myself "how come that worked?".

About two years ago, I added some text (@) to explain this in proc(5)
near the start of the page:

Overview
Underneath /proc, there are the following general groups of files
and subdirectories:

/proc/[pid] subdirectories
[...]
Underneath each of the /proc/[pid] directories, a task sub‐
directory contains subdirectories of the form task/[tid],
[...]

The /proc/[pid] subdirectories are visible when iterating
through /proc with getdents(2) (and thus are visible when
one uses ls(1) to view the contents of /proc).

/proc/[tid] subdirectories
@ Each one of these subdirectories contains files and subdi‐
@ rectories exposing information about the thread with the
@ corresponding thread ID. The contents of these directories
@ are the same as the corresponding /proc/[pid]/task/[tid]
@ directories.

@ The /proc/[tid] subdirectories are not visible when iterat‐
@ ing through /proc with getdents(2) (and thus are not visi‐
@ ble when one uses ls(1) to view the contents of /proc).

I think I'll just drop a cross reference to proc(5) into the text in
seccomp_user_notif.

>> seeks to the memory location that corre‐
>> sponds to one of the pointer arguments whose value is supplied
>> in the notification event, and reads bytes from that location.
>> (The supervisor must be careful to avoid a race condition that
>> can occur when doing this; see the description of the SEC‐
>> COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.) In addi‐
>> tion, the supervisor can access other system information that
>> is visible in user space but which is not accessible from a
>> seccomp filter.
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │Suppose we are reading a pathname from /proc/PID/mem │
>> │for a system call such as mkdir(). The pathname can │
>> │be an arbitrary length. How do we know how much (how │
>> │many pages) to read from /proc/PID/mem? │
>> └─────────────────────────────────────────────────────┘
>
> It can't be an arbitrary length. While pathnames *returned* from the
> kernel in some places can have different limits, strings supplied as
> path arguments *to* the kernel AFAIK always have an upper limit of
> PATH_MAX, else you get -ENAMETOOLONG. See getname_flags().

Yes, another thinko on my part. I removed this FIXME.

>> 7. Having obtained information as per the previous step, the
>> supervisor may then choose to perform an action in response to
>> the target process's system call (which, as noted above, is
>> not executed when the seccomp filter returns the SEC‐
>> COMP_RET_USER_NOTIF action value).
>
> (unless SECCOMP_USER_NOTIF_FLAG_CONTINUE is used)

As you probably saw, I give SECCOMP_USER_NOTIF_FLAG_CONTINUE a brief
mention a couple of paragraphs later, and then go into rather more
detail later in the page. (Or do you still think something needs
fixing?)

>> One example use case here relates to containers. The target
>> process may be located inside a container where it does not
>> have sufficient capabilities to mount a filesystem in the con‐
>> tainer's mount namespace. However, the supervisor may be a
>> more privileged process that that does have sufficient capa‐
>
> nit: s/that that/that/

Thanks. Fixed.

>> bilities to perform the mount operation.
>>
>> 8. The supervisor then sends a response to the notification. The
>> information in this response is used by the kernel to con‐
>> struct a return value for the target process's system call and
>> provide a value that will be assigned to the errno variable of
>> the target process.
>>
>> The response is sent using the SECCOMP_IOCTL_NOTIF_SEND
>> ioctl(2) operation, which is used to transmit a sec‐
>> comp_notif_resp structure to the kernel. This structure
>> includes a cookie value that the supervisor obtained in the
>> seccomp_notif structure returned by the SEC‐
>> COMP_IOCTL_NOTIF_RECV operation. This cookie value allows the
>> kernel to associate the response with the target process.
>
> (unless if the target thread entered a signal handler or was killed in
> the meantime)

Yes, but I think I have this adequately covered in the errors described
later in the page for SECCOMP_IOCTL_NOTIF_SEND. (I have now added the
target-process-terminated case to the error text.)

ENOENT The blocked system call in the target has been
interrupted by a signal handler or the target
process has terminated.

Is that sufficient?

>> 9. Once the notification has been sent, the system call in the
>> target process unblocks, returning the information that was
>> provided by the supervisor in the notification response.
>>
>> As a variation on the last two steps, the supervisor can send a
>> response that tells the kernel that it should execute the target
>> process's system call; see the discussion of SEC‐
>> COMP_USER_NOTIF_FLAG_CONTINUE, below.
>>
>> ioctl(2) operations
>> The following ioctl(2) operations are provided to support seccomp
>> user-space notification. For each of these operations, the first
>> (file descriptor) argument of ioctl(2) is the listening file
>> descriptor returned by a call to seccomp(2) with the SECCOMP_FIL‐
>> TER_FLAG_NEW_LISTENER flag.
>>
>> SECCOMP_IOCTL_NOTIF_RECV
>> This operation is used to obtain a user-space notification
>> event. If no such event is currently pending, the opera‐
>> tion blocks until an event occurs.
>
> Not necessarily; for every time a process entered a signal handler or
> was killed while a notification was pending, a call to
> SECCOMP_IOCTL_NOTIF_RECV will return -ENOENT.

Yes, but do you not consider this sufficiently covered by the
(updated) error text that appears later? (See below.)

>> The third ioctl(2)
>> argument is a pointer to a structure of the following form
>> which contains information about the event. This struc‐
>> ture must be zeroed out before the call.
>>
>> struct seccomp_notif {
>> __u64 id; /* Cookie */
>> __u32 pid; /* PID of target process */
>
> (TID, not PID)

Thanks. Fixed.

>> __u32 flags; /* Currently unused (0) */
>> struct seccomp_data data; /* See seccomp(2) */
>> };
>>
>> The fields in this structure are as follows:
>>
>> id This is a cookie for the notification. Each such
>> cookie is guaranteed to be unique for the corre‐
>> sponding seccomp filter. In other words, this
>> cookie is unique for each notification event from
>> the target process.
>
> That sentence about "target process" looks wrong to me. The cookies
> are unique across notifications from the filter, but there can be
> multiple filters per thread, and multiple threads per filter.

Thanks. I simply removed that last sentence.

>> The cookie value has the fol‐
>> lowing uses:
>>
>> · It can be used with the SEC‐
>> COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation to
>> verify that the target process is still alive.
>>
>> · When returning a notification response to the
>> kernel, the supervisor must include the cookie
>> value in the seccomp_notif_resp structure that is
>> specified as the argument of the SEC‐
>> COMP_IOCTL_NOTIF_SEND operation.
>>
>> pid This is the PID of the target process that trig‐
>> gered the notification event.
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │This is a thread ID, rather than a PID, right? │
>> └─────────────────────────────────────────────────────┘
>
> Yeah.

Thanks. I've made various fixes.

>> flags This is a bit mask of flags providing further
>> information on the event. In the current implemen‐
>> tation, this field is always zero.
>>
>> data This is a seccomp_data structure containing infor‐
>> mation about the system call that triggered the
>> notification. This is the same structure that is
>> passed to the seccomp filter. See seccomp(2) for
>> details of this structure.
>>
>> On success, this operation returns 0; on failure, -1 is
>> returned, and errno is set to indicate the cause of the
>> error. This operation can fail with the following errors:
>>
>> EINVAL (since Linux 5.5)
>> The seccomp_notif structure that was passed to the
>> call contained nonzero fields.
>>
>> ENOENT The target process was killed by a signal as the
>> notification information was being generated.
>
> Not just killed, interruption with a signal handler has the same effect.

Ah yes! Thanks. I added that as well.

[[
ENOENT The target thread was killed by a signal as the
notification information was being generated, or the
target's (blocked) system call was interrupted by a
signal handler.
]]

Okay?
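
In practice this means a supervisor probably wants to treat ENOENT from
SECCOMP_IOCTL_NOTIF_RECV as "nothing to do" rather than as a fatal error.
A rough sketch follows; recv_notification() and its return convention are
invented for illustration, and 'req_size' is assumed to be the
seccomp_notif size previously obtained via SECCOMP_GET_NOTIF_SIZES.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

/*
 * Fetch one notification into '*req'.
 * Returns 0 on success, 1 if the notification went away (ENOENT) and the
 * caller should just wait for the next one, or -1 on any other error.
 */
int
recv_notification(int notify_fd, struct seccomp_notif *req, size_t req_size)
{
    assert(req != NULL && req_size >= sizeof(*req));

    for (;;) {
        memset(req, 0, req_size);       /* Must be zeroed before each call */

        if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, req) == 0)
            return 0;

        if (errno == EINTR)
            continue;                   /* The supervisor was interrupted */

        return (errno == ENOENT) ? 1 : -1;
    }
}
```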

>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │From my experiments, it appears that if a SEC‐ │
>> │COMP_IOCTL_NOTIF_RECV is done after the target │
>> │process terminates, then the ioctl() simply blocks │
>> │(rather than returning an error to indicate that the │
>> │target process no longer exists). │
>> │ │
>> │I found that surprising, and it required some con‐ │
>> │tortions in the example program. It was not possi‐ │
>> │ble to code my SIGCHLD handler (which reaps the zom‐ │
>> │bie when the worker/target process terminates) to │
>> │simply set a flag checked in the main handleNotifi‐ │
>> │cations() loop, since this created an unavoidable │
>> │race where the child might terminate just after I │
>> │had checked the flag, but before I blocked (for‐ │
>> │ever!) in the SECCOMP_IOCTL_NOTIF_RECV operation. │
>> │Instead, I had to code the signal handler to simply │
>> │call _exit(2) in order to terminate the parent │
>> │process (the supervisor). │
>> │ │
>> │Is this expected behavior? It seems to me rather │
>> │desirable that SECCOMP_IOCTL_NOTIF_RECV should give │
>> │an error if the target process has terminated. │
>> └─────────────────────────────────────────────────────┘
>
> You could poll() the fd first. But yeah, it'd probably be a good idea
> to change that.

Ah! It was only after reading some comments from Christian that I
realized how poll() works here. I'll make some additions to the
page about the poll() details. (See my reply to Christian that should
land at about the same time as this mail.)

>> SECCOMP_IOCTL_NOTIF_ID_VALID
> [...]
>> In the above scenario, the risk is that the supervisor may
>> try to access the memory of a process other than the tar‐
>> get. This race can be avoided by following the call to
>> open with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to ver‐
>> ify that the process that generated the notification is
>> still alive. (Note that if the target process subse‐
>> quently terminates, its PID won't be reused because there
>
> That's wrong, the PID can be reused, but the /proc/$pid directory is
> internally not associated with the numeric PID, but, conceptually
> speaking, with a specific incarnation of the PID, or something like
> that. (Actually, it is associated with the "struct pid", which is not
> reused, instead of the numeric PID.)

Thanks. I simplified the last sentence of the paragraph:

In the above scenario, the risk is that the supervisor may
try to access the memory of a process other than the tar‐
get. This race can be avoided by following the call to
open(2) with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to
verify that the process that generated the notification is
still alive. (Note that if the target terminates after the
latter step, a subsequent read(2) from the file descriptor
will return 0, indicating end of file.)

I think that's probably enough detail.

>> remains an open reference to the /proc/[pid]/mem file; in
>> this case, a subsequent read(2) from the file will return
>> 0, indicating end of file.)
>>
>> On success (i.e., the notification ID is still valid),
>> this operation returns 0 On failure (i.e., the notifica‐
>
> nit: s/returns 0/returns 0./

Thanks. Fixed.

>> tion ID is no longer valid), -1 is returned, and errno is
>> set to ENOENT.
>>
>> SECCOMP_IOCTL_NOTIF_SEND
> [...]
>> Two kinds of response are possible:
>>
>> · A response to the kernel telling it to execute the tar‐
>> get process's system call. In this case, the flags
>> field includes SECCOMP_USER_NOTIF_FLAG_CONTINUE and the
>> error and val fields must be zero.
>>
>> This kind of response can be useful in cases where the
>> supervisor needs to do deeper analysis of the target's
>> system call than is possible from a seccomp filter
>> (e.g., examining the values of pointer arguments), and,
>> having verified that the system call is acceptable, the
>> supervisor wants to allow it to proceed.
>
> "allow" sounds as if this is an access control thing, but this
> mechanism should usually not be used for access control (unless the
> "seccomp" syscall is blocked).

Yes, Kees has also raised this point.

> Maybe reword as "having decided that
> the system call does not require emulation by the supervisor, the
> supervisor wants it to execute normally", or something like that?

Great! More suggested wordings! Thank you :-).

I tweaked slightly:

... having decided that the system call does not require emulation
by the supervisor, the supervisor wants the system call to
be executed normally in the target.
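
For illustration, the two kinds of response might be filled in as follows.
This is a sketch only: fill_continue_resp() and fill_error_resp() are
made-up helper names, the fallback #define is an assumption for headers
that predate Linux 5.5, and the use of a negated errno value in the error
field is my understanding of the kernel convention rather than something
stated in the text above.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <linux/seccomp.h>

#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE
#define SECCOMP_USER_NOTIF_FLAG_CONTINUE (1UL << 0) /* Pre-5.5 headers */
#endif

/* Response telling the kernel to execute the target's system call */
void
fill_continue_resp(const struct seccomp_notif *req,
                   struct seccomp_notif_resp *resp)
{
    assert(req != NULL && resp != NULL);

    memset(resp, 0, sizeof(*resp));     /* 'error' and 'val' must be zero */
    resp->id = req->id;                 /* Cookie from the notification */
    resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}

/* Response making the target's system call fail with error 'err' */
void
fill_error_resp(const struct seccomp_notif *req,
                struct seccomp_notif_resp *resp, int err)
{
    assert(req != NULL && resp != NULL && err > 0);

    memset(resp, 0, sizeof(*resp));
    resp->id = req->id;
    resp->error = -err;                 /* Assumption: negated errno value */
}
```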

> [...]
>> On success, this operation returns 0; on failure, -1 is
>> returned, and errno is set to indicate the cause of the
>> error. This operation can fail with the following errors:
>>
>> EINPROGRESS
>> A response to this notification has already been
>> sent.
>>
>> EINVAL An invalid value was specified in the flags field.
>>
>> EINVAL The flags field contained SEC‐
>> COMP_USER_NOTIF_FLAG_CONTINUE, and the error or val
>> field was not zero.
>>
>> ENOENT The blocked system call in the target process has
>> been interrupted by a signal handler.
>
> (you could also get this if a response has already been sent, instead
> of EINPROGRESS - the only difference is whether the target thread has
> picked up the response yet)

Got it. I don't think I'll try to work that detail into the page
(unless you really think I should, but since you made this a
parenthetical comment, perhaps you don't think it's necessary).
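
That said, for anyone following along, a supervisor that wants to cope with both errors might simply treat them alike. A rough sketch, where sendResponse() and its return convention are my invention for illustration, and treating EINPROGRESS the same way as ENOENT is an assumption rather than anything the page says:

```c
#include <errno.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>

/* Hypothetical helper: send a notification response.
   Returns 0 if the response was sent, 1 if the notification is
   stale (ENOENT or EINPROGRESS: the target was interrupted, or a
   response was already sent), or -1 on any other error. */

static int
sendResponse(int notifyFd, struct seccomp_notif_resp *resp)
{
    if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp) == 0)
        return 0;

    if (errno == ENOENT || errno == EINPROGRESS)
        return 1;               /* Nothing more to do for this event */

    return -1;                  /* Caller should inspect errno */
}
```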

>> NOTES
>> The file descriptor returned when seccomp(2) is employed with the
>> SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using
>> poll(2), epoll(7), and select(2). When a notification is pend‐
>> ing, these interfaces indicate that the file descriptor is read‐
>> able.
>
> We should probably also point out somewhere that, as
> include/uapi/linux/seccomp.h says:
>
> * Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF
> * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
> * same syscall, the most recently added filter takes precedence. This means
> * that the new SECCOMP_RET_USER_NOTIF filter can override any
> * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all

My takeaway from Christian's comments is that this comment in the kernel
source is partially wrong, since it is not possible to install multiple
filters with SECCOMP_RET_USER_NOTIF, right?

> * such filtered syscalls to be executed by sending the response
> * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally
> * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
>
> In other words, from a security perspective, you must assume that the
> target process can bypass any SECCOMP_RET_USER_NOTIF (or
> SECCOMP_RET_TRACE) filters unless it is completely prohibited from
> calling seccomp().

Drawing on text from Christian's comment in seccomp.h and Kees's mail,
I added the following in NOTES:

Design goals; use of SECCOMP_USER_NOTIF_FLAG_CONTINUE
The intent of the user-space notification feature is to allow sys‐
tem calls to be performed on behalf of the target. The target's
system call should either be handled by the supervisor or allowed
to continue normally in the kernel (where standard security poli‐
cies will be applied).

Note well: this mechanism must not be used to make security policy
decisions about the system call, which would be inherently race-
prone for reasons described next.

The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag must be used with cau‐
tion. If set by the supervisor, the target's system call will
continue. However, there is a time-of-check, time-of-use race
here, since an attacker could exploit the interval of time where
the target is blocked waiting on the "continue" response to do
things such as rewriting the system call arguments.

Note furthermore that a user-space notifier can be bypassed if the
existing filters allow the use of seccomp(2) or prctl(2) to
install a filter that returns an action value with a higher prece‐
dence than SECCOMP_RET_USER_NOTIF (see seccomp(2)).

It should thus be absolutely clear that the seccomp user-space
notification mechanism cannot be used to implement a security
policy! It should only ever be used in scenarios where a more
privileged process supervises the system calls of a lesser privi‐
leged target to get around kernel-enforced security restrictions
when the supervisor deems this safe. In other words, in order to
continue a system call, the supervisor should be sure that another
security mechanism or the kernel itself will sufficiently block
the system call if its arguments are rewritten to something
unsafe.

Seem okay?

> This should also be noted over in the main
> seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.

I added some words in seccomp(2) to emphasize this.

>> EXAMPLES
> [...]
>> This program can used to demonstrate various aspects of the
>
> nit: "can be used to demonstrate", or alternatively just "demonstrates"

Thanks. Fixed (added "be").

>> behavior of the seccomp user-space notification mechanism. To
>> help aid such demonstrations, the program logs various messages
>> to show the operation of the target process (lines prefixed "T:")
>> and the supervisor (indented lines prefixed "S:").
> [...]
>> Program source
> [...]
>> #define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \
>> } while (0)
>
> Don't we have err() for this?

I tend to avoid the use of err() because it's a nonstandard BSDism.
Perhaps by this point this is as much a habit as anything rational.

>> /* Send the file descriptor 'fd' over the connected UNIX domain socket
>> 'sockfd'. Returns 0 on success, or -1 on error. */
>>
>> static int
>> sendfd(int sockfd, int fd)
>> {
>> struct msghdr msgh;
>> struct iovec iov;
>> int data;
>> struct cmsghdr *cmsgp;
>>
>> /* Allocate a char array of suitable size to hold the ancillary data.
>> However, since this buffer is in reality a 'struct cmsghdr', use a
>> union to ensure that it is suitable aligned. */
>
> nit: suitably

Thanks. Fixed.

>> union {
>> char buf[CMSG_SPACE(sizeof(int))];
>> /* Space large enough to hold an 'int' */
>> struct cmsghdr align;
>> } controlMsg;
>>
>> /* The 'msg_name' field can be used to specify the address of the
>> destination socket when sending a datagram. However, we do not
>> need to use this field because 'sockfd' is a connected socket. */
>>
>> msgh.msg_name = NULL;
>> msgh.msg_namelen = 0;
>>
>> /* On Linux, we must transmit at least one byte of real data in
>> order to send ancillary data. We transmit an arbitrary integer
>> whose value is ignored by recvfd(). */
>>
>> msgh.msg_iov = &iov;
>> msgh.msg_iovlen = 1;
>> iov.iov_base = &data;
>> iov.iov_len = sizeof(int);
>> data = 12345;
>>
>> /* Set 'msghdr' fields that describe ancillary data */
>>
>> msgh.msg_control = controlMsg.buf;
>> msgh.msg_controllen = sizeof(controlMsg.buf);
>>
>> /* Set up ancillary data describing file descriptor to send */
>>
>> cmsgp = CMSG_FIRSTHDR(&msgh);
>> cmsgp->cmsg_level = SOL_SOCKET;
>> cmsgp->cmsg_type = SCM_RIGHTS;
>> cmsgp->cmsg_len = CMSG_LEN(sizeof(int));
>> memcpy(CMSG_DATA(cmsgp), &fd, sizeof(int));
>>
>> /* Send real plus ancillary data */
>>
>> if (sendmsg(sockfd, &msgh, 0) == -1)
>> return -1;
>>
>> return 0;
>> }
>
> Instead of using unix domain sockets to send the fd to the parent, I
> think you could also use clone3() with flags==CLONE_FILES|SIGCHLD,
> dup2() the seccomp fd to an fd that was reserved in the parent, call
> unshare(CLONE_FILES) in the child after setting up the seccomp fd, and
> wake up the parent with something like pthread_cond_signal()? I'm not
> sure whether that'd look better or worse in the end though, so maybe
> just ignore this comment.

Ahh -- nice. That answers in detail a question I also had for Tycho.
I won't make any changes to the page (since I'm not sure it would
look better), but I will add that detail in a comment in the page
source. Perhaps I'll do something with that in the future.

> [...]
>> /* Access the memory of the target process in order to discover the
>> pathname that was given to mkdir() */
>>
>> static void
>> getTargetPathname(struct seccomp_notif *req, int notifyFd,
>> char *path, size_t len)
>> {
>> char procMemPath[PATH_MAX];
>> snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", req->pid);
>>
>> int procMemFd = open(procMemPath, O_RDONLY);
>
> Should example code like this maybe use O_CLOEXEC unless the fd in
> question actually has to be inheritable? I know it doesn't actually
> matter here, but if this code was used in a multi-threaded context, it
> might.

Yes, good point. I changed this.

>> if (procMemFd == -1)
>> errExit("Supervisor: open");
>>
>> /* Check that the process whose info we are accessing is still alive.
>> If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
>> in checkNotificationIdIsValid()) succeeds, we know that the
>> /proc/PID/mem file descriptor that we opened corresponds to the
>> process for which we received a notification. If that process
>> subsequently terminates, then read() on that file descriptor
>> will return 0 (EOF). */
>>
>> checkNotificationIdIsValid(notifyFd, req->id);
>>
>> /* Seek to the location containing the pathname argument (i.e., the
>> first argument) of the mkdir(2) call and read that pathname */
>>
>> if (lseek(procMemFd, req->data.args[0], SEEK_SET) == -1)
>> errExit("Supervisor: lseek");
>>
>> ssize_t s = read(procMemFd, path, PATH_MAX);
>> if (s == -1)
>> errExit("read");
>
> Why not pread() instead of lseek()+read()?

No good reason! I changed it to:

/* Read bytes at the location containing the pathname argument
(i.e., the first argument) of the mkdir(2) call */

ssize_t s = pread(procMemFd, path, PATH_MAX, req->data.args[0]);
if (s == -1)
errExit("pread");

if (s == 0) {
fprintf(stderr, "\tS: pread() of /proc/PID/mem "
"returned 0 (EOF)\n");
exit(EXIT_FAILURE);
}

Thanks!

>> if (s == 0) {
>> fprintf(stderr, "\tS: read() of /proc/PID/mem "
>> "returned 0 (EOF)\n");
>> exit(EXIT_FAILURE);
>> }
>>
>> if (close(procMemFd) == -1)
>> errExit("close-/proc/PID/mem");
>
> We should probably make sure here that the value we read is actually
> NUL-terminated?

So, I was curious about that point also. But, (why) are we not
guaranteed that it will be NUL-terminated?
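
If a check does turn out to be needed, I imagine it would be something like the following sketch. The helper isNulTerminated() is hypothetical, and the sketch assumes the concern is that the bytes returned by pread() (which may be fewer than PATH_MAX) need not contain a terminator:

```c
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: return nonzero if the first 'nread' bytes
   of 'buf' (as returned by pread()) include a terminating NUL. */

static int
isNulTerminated(const char *buf, ssize_t nread)
{
    return nread > 0 && memchr(buf, '\0', (size_t) nread) != NULL;
}
```

The supervisor would call this with the return value of pread() and refuse to treat 'path' as a string if the check fails.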

>> }
>>
>> /* Handle notifications that arrive via the SECCOMP_RET_USER_NOTIF file
>> descriptor, 'notifyFd'. */
>>
>> static void
>> handleNotifications(int notifyFd)
>> {
>> struct seccomp_notif_sizes sizes;
>> char path[PATH_MAX];
>> /* For simplicity, we assume that the pathname given to mkdir()
>> is no more than PATH_MAX bytes; but this might not be true. */
>
> No, it has to be true, otherwise the kernel would fail the syscall if
> it was executing normally.

Yes. I removed that comment.

>> /* Discover the sizes of the structures that are used to receive
>> notifications and send notification responses, and allocate
>> buffers of those sizes. */
>>
>> if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes) == -1)
>> errExit("\tS: seccomp-SECCOMP_GET_NOTIF_SIZES");
>>
>> struct seccomp_notif *req = malloc(sizes.seccomp_notif);
>> if (req == NULL)
>> errExit("\tS: malloc");
>>
>> struct seccomp_notif_resp *resp = malloc(sizes.seccomp_notif_resp);
>
> This should probably do something like max(sizes.seccomp_notif_resp,
> sizeof(struct seccomp_notif_resp)) in case the program was built
> against new UAPI headers that make struct seccomp_notif_resp big, but
> is running under an old kernel where that struct is still smaller?

I'm confused. Why? I mean, if the running kernel says that it expects
a buffer of a certain size, and we allocate a buffer of that size,
what's the problem?
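
For concreteness, the suggestion as I read it amounts to something like the sketch below. The helper name respBufferSize() is hypothetical; the idea would be that code compiled against newer headers never touches fields beyond the end of a buffer sized by an older kernel:

```c
#include <stddef.h>

/* Hypothetical helper: choose the allocation size for a response
   buffer as the larger of the size reported by the running kernel
   and the size of the structure the program was compiled against. */

static size_t
respBufferSize(size_t kernelSize, size_t compiledSize)
{
    return (kernelSize > compiledSize) ? kernelSize : compiledSize;
}
```

In the example program, this would be used as malloc(respBufferSize(sizes.seccomp_notif_resp, sizeof(struct seccomp_notif_resp))).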

>> if (resp == NULL)
>> errExit("\tS: malloc");
> [...]
>> } else {
>>
>> /* If mkdir() failed in the supervisor, pass the error
>> back to the target */
>>
>> resp->error = -errno;
>> printf("\tS: failure! (errno = %d; %s)\n", errno,
>> strerror(errno));
>> }
>> } else if (strncmp(path, "./", strlen("./")) == 0) {
>
> nit: indent messed up

Thanks. Fixed.

And thanks again for the detailed review, Jann.

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Subject: Re: For review: seccomp_user_notif(2) manual page

Hello Kees,

On 10/1/20 1:39 AM, Kees Cook wrote:
> On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
>> [...] I did :-)
>
> Yay! Thank you!

You're welcome :-)

>> [...]
>> Overview
>> In conventional usage of a seccomp filter, the decision about how
>> to treat a particular system call is made by the filter itself.
>> The user-space notification mechanism allows the handling of the
>> system call to instead be handed off to a user-space process.
>> The advantages of doing this are that, by contrast with the sec‐
>> comp filter, which is running on a virtual machine inside the
>> kernel, the user-space process has access to information that is
>> unavailable to the seccomp filter and it can perform actions that
>> can't be performed from the seccomp filter.
>
> I might clarify a bit with something like (though maybe the
> target/supervisor paragraph needs to be moved to the start):
>
> This is used for performing syscalls on behalf of the target,
> rather than having the supervisor make security policy decisions
> about the syscall, which would be inherently race-prone. The
> target's syscall should either be handled by the supervisor or
> allowed to continue normally in the kernel (where standard security
> policies will be applied).

You, Christian, and Jann all pulled me up on this point. And thanks;
I'm going to use some of your words above. See my reply to Jann, sent
at about the same time as this reply. Please take a look at the text
in my reply to Jann, and let me know what you think.

> I'll comment more later, but I've run out of time today and I didn't see
> anyone mention this detail yet in the existing threads... :)

Later never came :-). But, I hope you may have comments for the
next draft, which I will send out soon.

Thanks,

Michael


Subject: Re: For review: seccomp_user_notif(2) manual page

Hello Christian,

On 10/1/20 2:36 PM, Christian Brauner wrote:
> [I'm on vacation so I'll just give this a quick glance for now.]
>
> On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
>> [...]
>>
>> =====
>>
>> NAME
>> seccomp_user_notif - Seccomp user-space notification mechanism
>>
>> SYNOPSIS
>> #include <linux/seccomp.h>
>> #include <linux/filter.h>
>> #include <linux/audit.h>
>>
>> int seccomp(unsigned int operation, unsigned int flags, void *args);
>>
>> DESCRIPTION
>> This page describes the user-space notification mechanism pro‐
>> vided by the Secure Computing (seccomp) facility. As well as the
>> use of the SECCOMP_FILTER_FLAG_NEW_LISTENER flag, the SEC‐
>> COMP_RET_USER_NOTIF action value, and the SECCOMP_GET_NOTIF_SIZES
>> operation described in seccomp(2), this mechanism involves the
>> use of a number of related ioctl(2) operations (described below).
>>
>> Overview
>> In conventional usage of a seccomp filter, the decision about how
>> to treat a particular system call is made by the filter itself.
>> The user-space notification mechanism allows the handling of the
>> system call to instead be handed off to a user-space process.
>
> "In contrast, the user notification mechanism allows to delegate the
> handling of the system call of one process (target) to another
> user-space process (supervisor)."?

Thanks. I've reworded similarly to what you suggest.

>> The advantages of doing this are that, by contrast with the sec‐
>> comp filter, which is running on a virtual machine inside the
>> kernel, the user-space process has access to information that is
>> unavailable to the seccomp filter and it can perform actions that
>> can't be performed from the seccomp filter.
>
> This section reads a bit difficult imho:
> "A suitably privileged supervisor can use the user notification
> mechanism to perform actions in lieu of the target. The supervisor will
> usually be able to retrieve information about the target and the
> performed system call that the seccomp filter itself cannot."

Thanks. Again I've done some rewording.

>> In the discussion that follows, the process that has installed
>> the seccomp filter is referred to as the target, and the process
>> that is notified by the user-space notification mechanism is
>> referred to as the supervisor. An overview of the steps per‐
>> formed by these two processes is as follows:

After the various rewordings, the opening paragraphs now read:

In conventional usage of a seccomp filter, the decision about how
to treat a system call is made by the filter itself. By contrast,
the user-space notification mechanism allows the seccomp filter to
delegate the handling of the system call to another user-space
process.

In the discussion that follows, the thread(s) on which the seccomp
filter is installed is (are) referred to as the target, and the
process that is notified by the user-space notification mechanism
is referred to as the supervisor.

A suitably privileged supervisor can use the user-space notifica‐
tion mechanism to perform actions on behalf of the target. The
advantage of the user-space notification mechanism is that the
supervisor will usually be able to retrieve information about the
target and the performed system call that the seccomp filter
itself cannot. (A seccomp filter is limited in the information it
can obtain and the actions that it can perform because it is run‐
ning on a virtual machine inside the kernel.)

An overview of the steps performed by the target and the supervi‐
sor is as follows:

>> 1. The target process establishes a seccomp filter in the usual
>> manner, but with two differences:
>>
>> · The seccomp(2) flags argument includes the flag SECCOMP_FIL‐
>> TER_FLAG_NEW_LISTENER. Consequently, the return value of
>> the (successful) seccomp(2) call is a new "listening" file
>> descriptor that can be used to receive notifications.
>
> I think it would be good to mention that seccomp notify fds are
> O_CLOEXEC by default somewhere.

Yep. This is already noted in seccomp(2).

>> · In cases where it is appropriate, the seccomp filter returns
>> the action value SECCOMP_RET_USER_NOTIF. This return value
>> will trigger a notification event.
>>
>> 2. In order that the supervisor process can obtain notifications
>> using the listening file descriptor, (a duplicate of) that
>> file descriptor must be passed from the target process to the
>> supervisor process. One way in which this could be done is by
>> passing the file descriptor over a UNIX domain socket connec‐
>> tion between the two processes (using the SCM_RIGHTS ancillary
>> message type described in unix(7)). Another possibility is
>> that the supervisor might inherit the file descriptor via
>> fork(2).
>
> I think a few people have already pointed out other ways of retrieving
> an fd. :)

Yup.

>> 3. The supervisor process will receive notification events on the
>> listening file descriptor. These events are returned as
>> structures of type seccomp_notif. Because this structure and
>> its size may evolve over kernel versions, the supervisor must
>> first determine the size of this structure using the sec‐
>> comp(2) SECCOMP_GET_NOTIF_SIZES operation, which returns a
>> structure of type seccomp_notif_sizes. The supervisor allo‐
>> cates a buffer of size seccomp_notif_sizes.seccomp_notif bytes
>> to receive notification events. In addition, the supervisor
>> allocates another buffer of size seccomp_notif_sizes.sec‐
>> comp_notif_resp bytes for the response (a struct sec‐
>> comp_notif_resp structure) that it will provide to the kernel
>> (and thus the target process).
>>
>> 4. The target process then performs its workload, which includes
>> system calls that will be controlled by the seccomp filter.
>> Whenever one of these system calls causes the filter to return
>> the SECCOMP_RET_USER_NOTIF action value, the kernel does not
>> execute the system call; instead, execution of the target
>> process is temporarily blocked inside the kernel and a notifi‐
>
> Maybe mention that the task is killable when so blocked?

Jann also noted this, and I thought it could be presumed, and so was
not thinking to add anything to the text. But, since you mention it too,
I've added some words to note that the sleep state is interruptible by
signals.

>> cation event is generated on the listening file descriptor.
>>
>> 5. The supervisor process can now repeatedly monitor the listen‐
>> ing file descriptor for SECCOMP_RET_USER_NOTIF-triggered
>> events. To do this, the supervisor uses the SEC‐
>> COMP_IOCTL_NOTIF_RECV ioctl(2) operation to read information
>> about a notification event; this operation blocks until an
>> event is available. The operation returns a seccomp_notif
>> structure containing information about the system call that is
>> being attempted by the target process.
>>
>> 6. The seccomp_notif structure returned by the SEC‐
>> COMP_IOCTL_NOTIF_RECV operation includes the same information
>> (a seccomp_data structure) that was passed to the seccomp fil‐
>> ter. This information allows the supervisor to discover the
>> system call number and the arguments for the target process's
>> system call. In addition, the notification event contains the
>> PID of the target process.
>
> (Technically TID.)

Yep. I've already made various fixes after comments from Jann.

>> The information in the notification can be used to discover
>> the values of pointer arguments for the target process's sys‐
>> tem call. (This is something that can't be done from within a
>> seccomp filter.) To do this (and assuming it has suitable
>> permissions), the supervisor opens the corresponding
>> /proc/[pid]/mem file, seeks to the memory location that corre‐
>> sponds to one of the pointer arguments whose value is supplied
>> in the notification event, and reads bytes from that location.
>> (The supervisor must be careful to avoid a race condition that
>> can occur when doing this; see the description of the SEC‐
>> COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.) In addi‐
>> tion, the supervisor can access other system information that
>> is visible in user space but which is not accessible from a
>> seccomp filter.
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │Suppose we are reading a pathname from /proc/PID/mem │
>> │for a system call such as mkdir(). The pathname can │
>> │be an arbitrary length. How do we know how much (how │
>> │many pages) to read from /proc/PID/mem? │
>> └─────────────────────────────────────────────────────┘
>
> This has already been answered, I believe.

Yep.

>>
>> 7. Having obtained information as per the previous step, the
>> supervisor may then choose to perform an action in response to
>> the target process's system call (which, as noted above, is
>> not executed when the seccomp filter returns the SEC‐
>> COMP_RET_USER_NOTIF action value).
>
> Nit: It is not _yet_ executed; it may very well be if the response is
> "continue".

Okay. I've added the word "yet" in point 4. I already elaborate on
the "continue" details later.

> This should either mention that when the fd becomes
> _RECVable the system call is guaranteed to not have executed yet or
> specify that it is not yet executed, I think.

I'm not sure that I understand your point here. I mean, doesn't the
arrival of the notification already imply that the system call hasn't
yet been executed? You seem to be drawing some distinction between
the notification vs FD being RECVable, but I don't understand what
that distinction is. Can you elaborate please...

>> One example use case here relates to containers. The target
>> process may be located inside a container where it does not
>> have sufficient capabilities to mount a filesystem in the con‐
>> tainer's mount namespace. However, the supervisor may be a
>> more privileged process that does have sufficient
>> capabilities to perform the mount operation.
>>
>> 8. The supervisor then sends a response to the notification. The
>> information in this response is used by the kernel to con‐
>> struct a return value for the target process's system call and
>> provide a value that will be assigned to the errno variable of
>> the target process.
>>
>> The response is sent using the SECCOMP_IOCTL_NOTIF_SEND
>> ioctl(2) operation, which is used to transmit a sec‐
>> comp_notif_resp structure to the kernel. This structure
>> includes a cookie value that the supervisor obtained in the
>> seccomp_notif structure returned by the SEC‐
>> COMP_IOCTL_NOTIF_RECV operation. This cookie value allows the
>> kernel to associate the response with the target process.
>
> I think here or above you should mention that the id or "cookie" _must_
> be used when a file descriptor to /proc/<pid>/mem or any /proc/<pid>/*
> is opened:
> fd = open(/proc/pid/*);
> verify_via_cookie_that_pid_still_alive(cookie);
> operate_on(fd)
>
> Otherwise this is a potential security issue.

Yes, but already in point 6 above I say:

(The supervisor must be careful to avoid a race condition that
can occur when doing this; see the description of the SEC‐
COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.) In addi‐

And then I say more about the ioctl() later. So, I think that I've
covered this point sufficiently (?). Maybe you missed some of
that text. Or do you think there's still something I should add?
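
In case it helps the discussion, the ordering you describe (open, then verify the cookie, then operate on the file descriptor) could be sketched as follows. The function readTargetMemory() is hypothetical and only loosely modeled on getTargetPathname() in the example program:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <limits.h>
#include <linux/seccomp.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read up to 'len' bytes at address 'addr' in
   the target's memory. Returns the number of bytes read (0 means
   EOF, i.e., the target has gone away), or -1 on error, including
   the case where the notification ID is no longer valid. */

static ssize_t
readTargetMemory(int notifyFd, const struct seccomp_notif *req,
                 __u64 addr, char *buf, size_t len)
{
    char procMemPath[PATH_MAX];
    snprintf(procMemPath, sizeof(procMemPath), "/proc/%u/mem", req->pid);

    /* Step 1: open the file first */

    int procMemFd = open(procMemPath, O_RDONLY | O_CLOEXEC);
    if (procMemFd == -1)
        return -1;

    /* Step 2: only now check the cookie; if it is still valid, the
       file descriptor refers to the process that was notified */

    if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &req->id) == -1) {
        close(procMemFd);
        return -1;
    }

    /* Step 3: operate on the file descriptor */

    ssize_t s = pread(procMemFd, buf, len, (off_t) addr);
    close(procMemFd);
    return s;
}
```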

>> 9. Once the notification response has been sent, the system call
>> in the target process unblocks, returning the information that was
>> provided by the supervisor in the notification response.
>>
>> As a variation on the last two steps, the supervisor can send a
>> response that tells the kernel that it should execute the target
>> process's system call; see the discussion of SEC‐
>> COMP_USER_NOTIF_FLAG_CONTINUE, below.
>>
>> ioctl(2) operations
>> The following ioctl(2) operations are provided to support seccomp
>> user-space notification. For each of these operations, the first
>> (file descriptor) argument of ioctl(2) is the listening file
>> descriptor returned by a call to seccomp(2) with the SECCOMP_FIL‐
>> TER_FLAG_NEW_LISTENER flag.
>>
>> SECCOMP_IOCTL_NOTIF_RECV
>> This operation is used to obtain a user-space notification
>> event. If no such event is currently pending, the opera‐
>> tion blocks until an event occurs. The third ioctl(2)
>> argument is a pointer to a structure of the following form
>> which contains information about the event. This struc‐
>> ture must be zeroed out before the call.
>>
>> struct seccomp_notif {
>> __u64 id; /* Cookie */
>> __u32 pid; /* PID of target process */
>> __u32 flags; /* Currently unused (0) */
>> struct seccomp_data data; /* See seccomp(2) */
>> };
>>
>> The fields in this structure are as follows:
>>
>> id This is a cookie for the notification. Each such
>> cookie is guaranteed to be unique for the corre‐
>> sponding seccomp filter. In other words, this
>> cookie is unique for each notification event from
>> the target process. The cookie value has the fol‐
>> lowing uses:
>>
>> · It can be used with the SEC‐
>> COMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation to
>> verify that the target process is still alive.
>>
>> · When returning a notification response to the
>> kernel, the supervisor must include the cookie
>> value in the seccomp_notif_resp structure that is
>> specified as the argument of the SEC‐
>> COMP_IOCTL_NOTIF_SEND operation.
>>
>> pid This is the PID of the target process that trig‐
>> gered the notification event.
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │This is a thread ID, rather than a PID, right? │
>> └─────────────────────────────────────────────────────┘
>
> Yes.
>
>>
>> flags This is a bit mask of flags providing further
>> information on the event. In the current implemen‐
>> tation, this field is always zero.
>>
>> data This is a seccomp_data structure containing infor‐
>> mation about the system call that triggered the
>> notification. This is the same structure that is
>> passed to the seccomp filter. See seccomp(2) for
>> details of this structure.
>>
>> On success, this operation returns 0; on failure, -1 is
>> returned, and errno is set to indicate the cause of the
>> error. This operation can fail with the following errors:
>>
>> EINVAL (since Linux 5.5)
>> The seccomp_notif structure that was passed to the
>> call contained nonzero fields.
>>
>> ENOENT The target process was killed by a signal as the
>> notification information was being generated.
>>
>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │From my experiments, it appears that if a SEC‐ │
>> │COMP_IOCTL_NOTIF_RECV is done after the target │
>> │process terminates, then the ioctl() simply blocks │
>> │(rather than returning an error to indicate that the │
>> │target process no longer exists). │
>> │ │
>> │I found that surprising, and it required some con‐ │
>> │tortions in the example program. It was not possi‐ │
>> │ble to code my SIGCHLD handler (which reaps the zom‐ │
>> │bie when the worker/target process terminates) to │
>> │simply set a flag checked in the main handleNotifi‐ │
>> │cations() loop, since this created an unavoidable │
>> │race where the child might terminate just after I │
>> │had checked the flag, but before I blocked (for‐ │
>> │ever!) in the SECCOMP_IOCTL_NOTIF_RECV operation. │
>> │Instead, I had to code the signal handler to simply │
>> │call _exit(2) in order to terminate the parent │
>> │process (the supervisor). │
>> │ │
>> │Is this expected behavior? It seems to me rather │
>> │desirable that SECCOMP_IOCTL_NOTIF_RECV should give │
>> │an error if the target process has terminated. │
>> └─────────────────────────────────────────────────────┘
>
> This has been discussed later in the thread too, I believe. My patchset
> fixed a different but related bug in ->poll() when a filter becomes
> unused. I hadn't noticed this behavior since I'm always polling. (Pure
> ioctls() feel a bit fishy to me. :) But obviously a valid use.)

Yes, I hope the ioctl() can be fixed.

>> SECCOMP_IOCTL_NOTIF_ID_VALID
>> This operation can be used to check that a notification ID
>> returned by an earlier SECCOMP_IOCTL_NOTIF_RECV operation
>> is still valid (i.e., that the target process still
>> exists).
>>
>> The third ioctl(2) argument is a pointer to the cookie
>> (id) returned by the SECCOMP_IOCTL_NOTIF_RECV operation.
>>
>> This operation is necessary to avoid race conditions that
>> can occur when the pid returned by the SEC‐
>> COMP_IOCTL_NOTIF_RECV operation terminates, and that
>> process ID is reused by another process. An example of
>> this kind of race is the following:
>>
>> 1. A notification is generated on the listening file
>> descriptor. The returned seccomp_notif contains the
>> PID of the target process.
>>
>> 2. The target process terminates.
>>
>> 3. Another process is created on the system that by chance
>> reuses the PID that was freed when the target process
>> terminates.
>>
>> 4. The supervisor open(2)s the /proc/[pid]/mem file for
>> the PID obtained in step 1, with the intention of (say)
>> inspecting the memory locations that contain the
>> arguments of the system call that triggered the
>> notification in step 1.
>>
>> In the above scenario, the risk is that the supervisor may
>> try to access the memory of a process other than the tar‐
>> get. This race can be avoided by following the call to
>> open with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to ver‐
>> ify that the process that generated the notification is
>> still alive. (Note that if the target process subse‐
>> quently terminates, its PID won't be reused because there
>> remains an open reference to the /proc/[pid]/mem file; in
>> this case, a subsequent read(2) from the file will return
>> 0, indicating end of file.)
>>
>> On success (i.e., the notification ID is still valid),
>> this operation returns 0 On failure (i.e., the notifica‐
>
> Missing a ".", I think.

(Yup. Already fixed.)

>> tion ID is no longer valid), -1 is returned, and errno is
>> set to ENOENT.
>>
>> SECCOMP_IOCTL_NOTIF_SEND
>> This operation is used to send a notification response
>> back to the kernel. The third ioctl(2) argument of this
>> structure is a pointer to a structure of the following
>> form:
>>
>> struct seccomp_notif_resp {
>> __u64 id; /* Cookie value */
>> __s64 val; /* Success return value */
>> __s32 error; /* 0 (success) or negative
>> error number */
>> __u32 flags; /* See below */
>> };
>>
>> The fields of this structure are as follows:
>>
>> id This is the cookie value that was obtained using
>> the SECCOMP_IOCTL_NOTIF_RECV operation. This
>> cookie value allows the kernel to correctly asso‐
>> ciate this response with the system call that trig‐
>> gered the user-space notification.
>>
>> val This is the value that will be used for a spoofed
>> success return for the target process's system
>> call; see below.
>>
>> error This is the value that will be used as the error
>> number (errno) for a spoofed error return for the
>> target process's system call; see below.
>
> Nit: "val" is only used when "error" is not set.

Yes. I note that below. I don't want to clutter this part of the page with
too many details.

>> flags This is a bit mask that includes zero or more of
>> the following flags:
>>
>> SECCOMP_USER_NOTIF_FLAG_CONTINUE (since Linux 5.5)
>> Tell the kernel to execute the target
>> process's system call.
>>
>> Two kinds of response are possible:
>>
>> · A response to the kernel telling it to execute the tar‐
>> get process's system call. In this case, the flags
>> field includes SECCOMP_USER_NOTIF_FLAG_CONTINUE and the
>> error and val fields must be zero.
>>
>> This kind of response can be useful in cases where the
>> supervisor needs to do deeper analysis of the target's
>> system call than is possible from a seccomp filter
>> (e.g., examining the values of pointer arguments), and,
>> having verified that the system call is acceptable, the
>> supervisor wants to allow it to proceed.
>
> I think Jann has pointed this out. This needs to come with a big warning
> and I would explicitly put a:
> "The user notification mechanism cannot be used to implement a syscall
> security policy in user space!"
> You might want to take a look at the seccomp.h header file where I
> placed a giant warning about how to use this too.

Yes. Kees also raised this. See my reply to Jann (who pasted in a copy
of part of your comment from seccomp.h). I'm going to freely reuse the
text from your comment. Please take a look at the text in my reply to Jann,
and let me know what you think.

>> · A spoofed return value for the target process's system
>> call. In this case, the kernel does not execute the
>> target process's system call, instead causing the system
>> call to return a spoofed value as specified by fields of
>> the seccomp_notif_resp structure. The supervisor should
>> set the fields of this structure as follows:
>>
>> + flags does not contain SECCOMP_USER_NOTIF_FLAG_CON‐
>> TINUE.
>>
>> + error is set either to 0 for a spoofed "success"
>> return or to a negative error number for a spoofed
>> "failure" return. In the former case, the kernel
>> causes the target process's system call to return the
>> value specified in the val field. In the latter case,
>> the kernel causes the target process's system call to
>> return -1, and errno is assigned the negated error
>> value.
>>
>> + val is set to a value that will be used as the return
>> value for a spoofed "success" return for the target
>> process's system call. The value in this field is
>> ignored if the error field contains a nonzero value.
>>
>> On success, this operation returns 0; on failure, -1 is
>> returned, and errno is set to indicate the cause of the
>> error. This operation can fail with the following errors:
>>
>> EINPROGRESS
>> A response to this notification has already been
>> sent.
>>
>> EINVAL An invalid value was specified in the flags field.
>>
>> EINVAL The flags field contained SEC‐
>> COMP_USER_NOTIF_FLAG_CONTINUE, and the error or val
>> field was not zero.
>>
>> ENOENT The blocked system call in the target process has
>> been interrupted by a signal handler.
>>
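To make the two kinds of response concrete, here is a minimal sketch of how a supervisor might fill in struct seccomp_notif_resp before issuing SECCOMP_IOCTL_NOTIF_SEND. It is not taken from the page: the helper names resp_spoof_success(), resp_spoof_error() and resp_continue() are made up for illustration. It assumes UAPI headers that provide struct seccomp_notif_resp, and defines SECCOMP_USER_NOTIF_FLAG_CONTINUE itself if the headers predate Linux 5.5.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <linux/seccomp.h>

/* Headers older than Linux 5.5 lack this flag; the value below is the
   one used in include/uapi/linux/seccomp.h. */
#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE
#define SECCOMP_USER_NOTIF_FLAG_CONTINUE (1UL << 0)
#endif

/* Hypothetical helper: spoof a successful return of 'val'. */
void resp_spoof_success(struct seccomp_notif_resp *resp, __u64 id, __s64 val)
{
    assert(resp != NULL);
    memset(resp, 0, sizeof(*resp));
    resp->id = id;          /* cookie from SECCOMP_IOCTL_NOTIF_RECV */
    resp->val = val;        /* used because 'error' stays 0 */
}

/* Hypothetical helper: spoof a failure. 'err' is a positive errno value
   (e.g., EPERM); the kernel expects it negated in the 'error' field. */
void resp_spoof_error(struct seccomp_notif_resp *resp, __u64 id, int err)
{
    assert(resp != NULL && err > 0);
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->error = -err;     /* 'val' is ignored when 'error' is nonzero */
}

/* Hypothetical helper: ask the kernel to execute the target's system
   call. 'error' and 'val' must both be zero in this case. */
void resp_continue(struct seccomp_notif_resp *resp, __u64 id)
{
    assert(resp != NULL);
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}
```

Zeroing the whole structure first is a deliberate choice in this sketch: it keeps the "error and val must be zero" rule for the CONTINUE case from being violated by stale data left over from a previous response.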
>> NOTES
>> The file descriptor returned when seccomp(2) is employed with the
>> SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using
>> poll(2), epoll(7), and select(2). When a notification is pend‐
>> ing, these interfaces indicate that the file descriptor is read‐
>> able.
>
> This should also note that when a filter becomes unused, i.e. the last
> task using that filter in its filter hierarchy is dead (been
> reaped/autoreaped) ->poll() will notify with (E)POLLHUP.

Ahh! Now I understand. I was unaware of this. Jann commented that
poll() could be used as well, but you provided enough detail that
now I understand how this works. I added the following in NOTES
where poll/select/epoll are described:

· After the last thread using the filter has terminated and been
reaped using waitpid(2) (or similar), the file descriptor indi‐
cates an end-of-file condition (readable in select(2); POLL‐
HUP/EPOLLHUP in poll(2)/epoll_wait(2)).
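
As a rough illustration of how a supervisor might tell these cases apart, here is a small sketch built on poll(2). The enum and the function name wait_notif_fd() are invented for this example, and treating POLLERR/POLLNVAL as a generic error is an assumption of the sketch, not something the page specifies.

```c
#include <assert.h>
#include <poll.h>

/* Hypothetical classification of what poll(2) reported for the
   listening file descriptor. */
enum notif_fd_state {
    NOTIF_FD_TIMEOUT,   /* nothing happened within the timeout */
    NOTIF_FD_READABLE,  /* a notification is pending */
    NOTIF_FD_HANGUP,    /* the filter is no longer in use */
    NOTIF_FD_ERROR      /* poll() failed, or POLLERR/POLLNVAL was set */
};

/* Wait up to 'timeout_ms' milliseconds (-1 means forever) for an event
   on the listening file descriptor 'fd'. */
enum notif_fd_state wait_notif_fd(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);

    if (n == -1)
        return NOTIF_FD_ERROR;
    if (n == 0)
        return NOTIF_FD_TIMEOUT;
    if (pfd.revents & POLLIN)
        return NOTIF_FD_READABLE;   /* safe to do one NOTIF_RECV */
    if (pfd.revents & POLLHUP)
        return NOTIF_FD_HANGUP;     /* supervisor can close fd and exit */
    return NOTIF_FD_ERROR;
}
```

A supervisor loop would typically call SECCOMP_IOCTL_NOTIF_RECV once after NOTIF_FD_READABLE and stop listening after NOTIF_FD_HANGUP.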

>> ┌─────────────────────────────────────────────────────┐
>> │FIXME │
>> ├─────────────────────────────────────────────────────┤
>> │Interestingly, after the event had been received, │
>> │the file descriptor indicates as writable (verified │
>> │from the source code and by experiment). How is this │
>> │useful? │
>> └─────────────────────────────────────────────────────┘
>>
>> EXAMPLES
>> The (somewhat contrived) program shown below demonstrates the use
>> of the interfaces described in this page. The program creates a
>> child process that serves as the "target" process. The child
>> process installs a seccomp filter that returns the SEC‐
>> COMP_RET_USER_NOTIF action value if a call is made to mkdir(2).
>> The child process then calls mkdir(2) once for each of the sup‐
>> plied command-line arguments, and reports the result returned by
>> the call. After processing all arguments, the child process ter‐
>> minates.
>>
>> The parent process acts as the supervisor, listening for the
>> notifications that are generated when the target process calls
>> mkdir(2). When such a notification occurs, the supervisor exam‐
>> ines the memory of the target process (using /proc/[pid]/mem) to
>> discover the pathname argument that was supplied to the mkdir(2)
>> call, and performs one of the following actions:
>>
>> · If the pathname begins with the prefix "/tmp/", then the super‐
>> visor attempts to create the specified directory, and then
>> spoofs a return for the target process based on the return
>> value of the supervisor's mkdir(2) call. In the event that
>> that call succeeds, the spoofed success return value is the
>> length of the pathname.
>>
>> · If the pathname begins with "./" (i.e., it is a relative path‐
>> name), the supervisor sends a SECCOMP_USER_NOTIF_FLAG_CONTINUE
>> response to the kernel to say that the kernel should execute the
>> target process's mkdir(2) call.
>
> Potentially problematic if the two processes have the same privilege
> level and the supervisor intends _CONTINUE to mean "is safe to execute".

Understood. But I think that needs to be clarified elsewhere in the
page, since it's essentially the same point as "The user notification
mechanism cannot be used to implement a syscall security policy in
user space!" See my reply to Jann.

> An attacker could try to re-write arguments afaict.

By an attacker, I presume you mean a malign supervisor, right?
Sure, it looks to me as though rewriting arguments could be
possible. But, if you had privilege to do that, you'd presumably
have privileges for any number of other nefarious activities, right?
(So, I don't think anything special needs to be said here; let me
know if you feel something does need to be said.)

> A good and easy example is usually mknod() in a user namespace. A
> _CONTINUE is always safe since you can't create device nodes anyway.

Okay -- but I wanted to provide an example (admittedly very
contrived) to show how the supervisor could either do the syscall
on behalf of the target, or leave things to the target to execute
the system call. Do you feel that the example is leading people
astray?

> Sorry, I can't review the rest in sufficient detail since I'm on
> vacation still so I'm just going to shut up now. :)

Well, thanks already, because your comments were already very
useful! I will send out a new draft shortly :-).

Thanks,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2020-10-15 23:15:24

by Jann Horn

Subject: Re: For review: seccomp_user_notif(2) manual page

On Thu, Oct 15, 2020 at 1:24 PM Michael Kerrisk (man-pages)
<[email protected]> wrote:
> On 9/30/20 5:53 PM, Jann Horn wrote:
> > On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> > <[email protected]> wrote:
> >> I knew it would be a big ask, but below is kind of the manual page
> >> I was hoping you might write [1] for the seccomp user-space notification
> >> mechanism. Since you didn't (and because 5.9 adds various new pieces
> >> such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
> >> that also will need documenting [2]), I did :-). But of course I may
> >> have made mistakes...
[...]
> >> 3. The supervisor process will receive notification events on the
> >> listening file descriptor. These events are returned as
> >> structures of type seccomp_notif. Because this structure and
> >> its size may evolve over kernel versions, the supervisor must
> >> first determine the size of this structure using the sec‐
> >> comp(2) SECCOMP_GET_NOTIF_SIZES operation, which returns a
> >> structure of type seccomp_notif_sizes. The supervisor allo‐
> >> cates a buffer of size seccomp_notif_sizes.seccomp_notif bytes
> >> to receive notification events. In addition, the supervisor
> >> allocates another buffer of size seccomp_notif_sizes.sec‐
> >> comp_notif_resp bytes for the response (a struct sec‐
> >> comp_notif_resp structure) that it will provide to the kernel
> >> (and thus the target process).
> >>
> >> 4. The target process then performs its workload, which includes
> >> system calls that will be controlled by the seccomp filter.
> >> Whenever one of these system calls causes the filter to return
> >> the SECCOMP_RET_USER_NOTIF action value, the kernel does not
> >> execute the system call; instead, execution of the target
> >> process is temporarily blocked inside the kernel and a notifi‐
> >
> > where "blocked" refers to the interruptible, restartable kind - if the
> > child receives a signal with an SA_RESTART signal handler in the
> > meantime, it'll leave the syscall, go through the signal handler, then
> > restart the syscall again and send the same request to the supervisor
> > again. so the supervisor may see duplicate syscalls.
>
> So, I partially demonstrated what you describe here, for two example
> system calls (epoll_wait() and pause()). But I could not exactly
> demonstrate things as I understand you to be describing them. (So,
> I'm not sure whether I have not understood you correctly, or
> if things are not exactly as you describe them.)
>
> Here's a scenario (A) that I tested:
>
> 1. Target installs seccomp filters for a blocking syscall
> (epoll_wait() or pause(), both of which should never restart,
> regardless of SA_RESTART)
> 2. Target installs SIGINT handler with SA_RESTART
> 3. Supervisor is sleeping (i.e., is not blocked in
> SECCOMP_IOCTL_NOTIF_RECV operation).
> 4. Target makes a blocking system call (epoll_wait() or pause()).
> 5. SIGINT gets delivered to target; handler gets called;
> ***and syscall gets restarted by the kernel***
>
> That last should never happen, of course, and is a result of the
> combination of both the user-notify filter and the SA_RESTART flag.
> If one or other is not present, then the system call is not
> restarted.
>
> So, as you note below, the UAPI gets broken a little.
>
> However, from your description above I had understood that
> something like the following scenario (B) could occur:
>
> 1. Target installs seccomp filters for a blocking syscall
> (epoll_wait() or pause(), both of which should never restart,
> regardless of SA_RESTART)
> 2. Target installs SIGINT handler with SA_RESTART
> 3. Supervisor performs SECCOMP_IOCTL_NOTIF_RECV operation (which
> blocks).
> 4. Target makes a blocking system call (epoll_wait() or pause()).
> 5. Supervisor gets seccomp user-space notification (i.e.,
> SECCOMP_IOCTL_NOTIF_RECV ioctl() returns)
> 6. SIGINT gets delivered to target; handler gets called;
> and syscall gets restarted by the kernel
> 7. Supervisor performs another SECCOMP_IOCTL_NOTIF_RECV operation
> which gets another notification for the restarted system call.
>
> However, I don't observe such behavior. In step 6, the syscall
> does not get restarted by the kernel, but instead returns -1/EINTR.
> Perhaps I have misconstructed my experiment in the second case, or
> perhaps I've misunderstood what you meant, or is it possibly the
> case that things are not quite as you said?

user@vm:~/test/seccomp-notify-interrupt$ cat seccomp-notify-interrupt.c
#define _GNU_SOURCE
#include <stdio.h>
#include <signal.h>
#include <err.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <sched.h>
#include <stddef.h>
#include <limits.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/prctl.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <linux/futex.h>

struct {
  int seccomp_fd;
} *shared;

static void handle_signal(int sig, siginfo_t *info, void *uctx) {
  printf("signal handler invoked\n");
}

int main(void) {
  setbuf(stdout, NULL);

  shared = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE,
                MAP_ANONYMOUS|MAP_SHARED, -1, 0);
  if (shared == MAP_FAILED)
    err(1, "mmap");
  shared->seccomp_fd = -1;

  /* glibc's clone() wrapper doesn't support fork()-style usage */
  pid_t child = syscall(__NR_clone, CLONE_FILES|SIGCHLD,
                        NULL, NULL, NULL, 0);
  if (child == -1) err(1, "clone");
  if (child == 0) {
    /* don't outlive the parent */
    prctl(PR_SET_PDEATHSIG, SIGKILL);
    if (getppid() == 1) exit(0);

    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
    struct sock_filter insns[] = {
      BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)),
      BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_pause, 0, 1),
      BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_USER_NOTIF),
      BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW)
    };
    struct sock_fprog prog = {
      .len = sizeof(insns)/sizeof(insns[0]),
      .filter = insns
    };
    int seccomp_ret = syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
                              SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
    if (seccomp_ret < 0)
      err(1, "install");
    printf("installed seccomp: fd %d\n", seccomp_ret);

    __atomic_store(&shared->seccomp_fd, &seccomp_ret, __ATOMIC_RELEASE);
    int futex_ret = syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAKE,
                            INT_MAX, NULL, NULL, 0);
    printf("woke %d waiters\n", futex_ret);

    struct sigaction act = {
      .sa_sigaction = handle_signal,
      .sa_flags = SA_RESTART|SA_SIGINFO
    };
    if (sigaction(SIGUSR1, &act, NULL))
      err(1, "sigaction");

    pause();
    perror("pause returned");
    exit(0);
  }

  int futex_ret = syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAIT,
                          -1, NULL, NULL, 0);
  if (futex_ret == -1 && errno != EAGAIN)
    err(1, "futex wait");
  int fd = __atomic_load_n(&shared->seccomp_fd, __ATOMIC_ACQUIRE);
  printf("child installed seccomp fd %d\n", fd);

  sleep(1);
  printf("going to send SIGUSR1...\n");
  kill(child, SIGUSR1);
  sleep(1);

  exit(0);
}
user@vm:~/test/seccomp-notify-interrupt$ gcc -o seccomp-notify-interrupt seccomp-notify-interrupt.c -Wall
user@vm:~/test/seccomp-notify-interrupt$ strace -f ./seccomp-notify-interrupt >/dev/null
execve("./seccomp-notify-interrupt", ["./seccomp-notify-interrupt"],
0x7ffcb31a0d08 /* 42 vars */) = 0
brk(NULL) = 0x5565864b2000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=89296, ...}) = 0
mmap(NULL, 89296, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f7e688e7000
close(3) = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260A\2\0\0\0\0\0"...,
832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1824496, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x7f7e688e5000
mmap(NULL, 1837056, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f7e68724000
mprotect(0x7f7e68746000, 1658880, PROT_NONE) = 0
mmap(0x7f7e68746000, 1343488, PROT_READ|PROT_EXEC,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x22000) = 0x7f7e68746000
mmap(0x7f7e6888e000, 311296, PROT_READ,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x16a000) = 0x7f7e6888e000
mmap(0x7f7e688db000, 24576, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b6000) = 0x7f7e688db000
mmap(0x7f7e688e1000, 14336, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f7e688e1000
close(3) = 0
arch_prctl(ARCH_SET_FS, 0x7f7e688e6500) = 0
mprotect(0x7f7e688db000, 16384, PROT_READ) = 0
mprotect(0x556585183000, 4096, PROT_READ) = 0
mprotect(0x7f7e68924000, 4096, PROT_READ) = 0
munmap(0x7f7e688e7000, 89296) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, -1,
0) = 0x7f7e688fc000
clone(child_stack=NULL, flags=CLONE_FILES|SIGCHLD) = 2558
futex(0x7f7e688fc000, FUTEX_WAIT, 4294967295, NULLstrace: Process 2558 attached
<unfinished ...>
[pid 2558] prctl(PR_SET_PDEATHSIG, SIGKILL) = 0
[pid 2558] getppid() = 2557
[pid 2558] prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) = 0
[pid 2558] seccomp(SECCOMP_SET_MODE_FILTER, 0x8 /*
SECCOMP_FILTER_FLAG_??? */, {len=4, filter=0x7ffdf7cc9b50}) = 3
[pid 2558] write(1, "installed seccomp: fd 3\n", 24) = 24
[pid 2558] futex(0x7f7e688fc000, FUTEX_WAKE, 2147483647 <unfinished ...>
[pid 2557] <... futex resumed> ) = 0
[pid 2558] <... futex resumed> ) = 1
[pid 2558] write(1, "woke 1 waiters\n", 15) = 15
[pid 2557] write(1, "child installed seccomp fd 3\n", 29) = 29
[pid 2558] rt_sigaction(SIGUSR1, {sa_handler=0x556585181215,
sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART|SA_SIGINFO,
sa_restorer=0x7f7e6875b840}, NULL, 8) = 0
[pid 2557] nanosleep({tv_sec=1, tv_nsec=0}, <unfinished ...>
[pid 2558] pause( <unfinished ...>
[pid 2557] <... nanosleep resumed> 0x7ffdf7cc9b10) = 0
[pid 2557] write(1, "going to send SIGUSR1...", 24) = 24
[pid 2557] write(1, "\n", 1) = 1
[pid 2557] kill(2558, SIGUSR1) = 0
[pid 2557] nanosleep({tv_sec=1, tv_nsec=0}, <unfinished ...>
[pid 2558] <... pause resumed> ) = ? ERESTARTSYS (To be
restarted if SA_RESTART is set)
[pid 2558] --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER,
si_pid=2557, si_uid=1000} ---
[pid 2558] write(1, "signal handler invoked", 22) = 22
[pid 2558] write(1, "\n", 1) = 1
[pid 2558] rt_sigreturn({mask=[]}) = 34
[pid 2558] pause( <unfinished ...>
[pid 2557] <... nanosleep resumed> 0x7ffdf7cc9b10) = 0
[pid 2557] exit_group(0) = ?
[pid 2557] +++ exited with 0 +++
<... pause resumed>) = ?
+++ killed by SIGKILL +++
user@vm:~/test/seccomp-notify-interrupt$


[...]
> >> event is available.
> >
> > Maybe we should note here that you can use the multi-fd-polling APIs
> > (select/poll/epoll) instead, and that if the notification goes away
> > before you call SECCOMP_IOCTL_NOTIF_RECV, the ioctl will return
> > -ENOENT instead of blocking, and therefore as long as nobody else
> > reads from the same fd, you can assume that after the fd reports as
> > readable, you can call SECCOMP_IOCTL_NOTIF_RECV once without blocking.
>
> I'd rather not add this info in the overview section, which is
> already longer than I would like. But I did add some details
> in NOTES:
>
> [[
> The file descriptor returned when seccomp(2) is employed with the
> SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using
> poll(2), epoll(7), and select(2). When a notification is pending,
> these interfaces indicate that the file descriptor is readable.
> Following such an indication, a subsequent SEC‐
> COMP_IOCTL_NOTIF_RECV ioctl(2) will not block, returning either
> information about a notification or else failing with the error
> ENOENT if the target process has been killed by a signal or its
> system call has been interrupted by a signal handler.
> ]]
>
> Okay?

Sounds good.

[...]
> >> bilities to perform the mount operation.
> >>
> >> 8. The supervisor then sends a response to the notification. The
> >> information in this response is used by the kernel to con‐
> >> struct a return value for the target process's system call and
> >> provide a value that will be assigned to the errno variable of
> >> the target process.
> >>
> >> The response is sent using the SECCOMP_IOCTL_NOTIF_SEND
> >> ioctl(2) operation, which is used to transmit a sec‐
> >> comp_notif_resp structure to the kernel. This structure
> >> includes a cookie value that the supervisor obtained in the
> >> seccomp_notif structure returned by the SEC‐
> >> COMP_IOCTL_NOTIF_RECV operation. This cookie value allows the
> >> kernel to associate the response with the target process.
> >
> > (unless if the target thread entered a signal handler or was killed in
> > the meantime)
>
> Yes, but I think I have this adequately covered in the errors described
> later in the page for SECCOMP_IOCTL_NOTIF_SEND. (I have now added the
> target-process-terminated case to the error text.)
>
> ENOENT The blocked system call in the target has been
> interrupted by a signal handler or the target
> process has terminated.
>
> Is that sufficient?

Ah, right.

[...]
> >> ENOENT The target process was killed by a signal as the
> >> notification information was being generated.
> >
> > Not just killed, interruption with a signal handler has the same effect.
>
> Ah yes! Thanks. I added that as well.
>
> [[
> ENOENT The target thread was killed by a signal as the
> notification information was being generated, or the
> target's (blocked) system call was interrupted by a
> signal handler.
> ]]
>
> Okay?

Yeah, sounds good.

[...]
> >> In the above scenario, the risk is that the supervisor may
> >> try to access the memory of a process other than the tar‐
> >> get. This race can be avoided by following the call to
> >> open with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to ver‐
> >> ify that the process that generated the notification is
> >> still alive. (Note that if the target process subse‐
> >> quently terminates, its PID won't be reused because there
> >
> > That's wrong, the PID can be reused, but the /proc/$pid directory is
> > internally not associated with the numeric PID, but, conceptually
> > speaking, with a specific incarnation of the PID, or something like
> > that. (Actually, it is associated with the "struct pid", which is not
> > reused, instead of the numeric PID.)
>
> Thanks. I simplified the last sentence of the paragraph:
>
> In the above scenario, the risk is that the supervisor may
> try to access the memory of a process other than the tar‐
> get. This race can be avoided by following the call to
> open(2) with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to
> verify that the process that generated the notification is
> still alive. (Note that if the target terminates after the
> latter step, a subsequent read(2) from the file descriptor
> will return 0, indicating end of file.)
>
> I think that's probably enough detail.

Maybe make that "may return 0" instead of "will return 0" - reading
from /proc/$pid/mem can only return 0 in the following cases AFAICS:

1. task->mm was already gone at open() time
2. mm->mm_users has dropped to zero (the mm only has lazytlb users;
page tables and VMAs are being blown away or have been blown away)
3. the syscall was called with length 0

When a process has gone away, normally mm->mm_users will drop to zero,
but someone else could theoretically still be holding a reference to
the mm (e.g. someone else in the middle of accessing /proc/$pid/mem).
(Such references should normally not be very long-lived though.)

Additionally, in the unlikely case that the OOM killer just chomped
through the page tables of the target process, I think the read will
return -EIO (same error as if the address was simply unmapped) if the
address is within a non-shared mapping. (Maybe that's something procfs
could do better...)
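
Putting the above together, a supervisor's read path might look roughly like the sketch below. This is only an illustration under stated assumptions: the function name read_target_mem() and its parameter layout are invented here, and the caller is expected to treat both a 0 return and a -1 return (for example with errno set to EIO) as "the data could not be read".

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/seccomp.h>

/* Hypothetical helper: read 'len' bytes at address 'addr' in the target
   identified by 'pid', checking the notification cookie 'id' after the
   open() so that a recycled PID is detected.

   Returns the result of pread(2): a byte count, 0 (the target's memory
   is gone), or -1 with errno set. Also returns -1 if the open or the
   SECCOMP_IOCTL_NOTIF_ID_VALID check fails. */
ssize_t read_target_mem(int notify_fd, __u64 id, pid_t pid,
                        off_t addr, void *buf, size_t len)
{
    char path[64];
    ssize_t n;
    int mem_fd, saved_errno;

    assert(buf != NULL);
    snprintf(path, sizeof(path), "/proc/%ld/mem", (long) pid);

    mem_fd = open(path, O_RDONLY | O_CLOEXEC);
    if (mem_fd == -1)
        return -1;

    /* Only after the file is open is it meaningful to ask whether the
       notification (and hence the target) is still alive. */
    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1) {
        saved_errno = errno;
        close(mem_fd);
        errno = saved_errno;
        return -1;
    }

    n = pread(mem_fd, buf, len, addr);
    saved_errno = errno;
    close(mem_fd);
    errno = saved_errno;
    return n;
}
```

Returning the raw pread(2) result, instead of collapsing 0 and -1 into one error, leaves the caller free to log the two cases differently.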

[...]
> >> NOTES
> >> The file descriptor returned when seccomp(2) is employed with the
> >> SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using
> >> poll(2), epoll(7), and select(2). When a notification is pend‐
> >> ing, these interfaces indicate that the file descriptor is read‐
> >> able.
> >
> > We should probably also point out somewhere that, as
> > include/uapi/linux/seccomp.h says:
> >
> > * Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF
> > * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
> > * same syscall, the most recently added filter takes precedence. This means
> > * that the new SECCOMP_RET_USER_NOTIF filter can override any
> > * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
>
> My takeaway from Christian's comments is that this comment in the kernel
> source is partially wrong, since it is not possible to install multiple
> filters with SECCOMP_RET_USER_NOTIF, right?

Yeah. (Well, AFAICS technically, you can add more filters that return
SECCOMP_RET_USER_NOTIF, but when a filter returns that without having
a notifier fd attached, seccomp blocks the syscall with -ENOSYS; it
won't use the notifier fd attached to a different filter in the
chain.)

> > * such filtered syscalls to be executed by sending the response
> > * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can equally
> > * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
> >
> > In other words, from a security perspective, you must assume that the
> > target process can bypass any SECCOMP_RET_USER_NOTIF (or
> > SECCOMP_RET_TRACE) filters unless it is completely prohibited from
> > calling seccomp().
>
> Drawing on text from Christian's comment in seccomp.h and Kees's mail,
> I added the following in NOTES:
>
> Design goals; use of SECCOMP_USER_NOTIF_FLAG_CONTINUE
> The intent of the user-space notification feature is to allow sys‐
> tem calls to be performed on behalf of the target. The target's
> system call should either be handled by the supervisor or allowed
> to continue normally in the kernel (where standard security poli‐
> cies will be applied).
>
> Note well: this mechanism must not be used to make security policy
> decisions about the system call, which would be inherently race-
> prone for reasons described next.
>
> The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag must be used with cau‐
> tion. If set by the supervisor, the target's system call will
> continue. However, there is a time-of-check, time-of-use race
> here, since an attacker could exploit the interval of time where
> the target is blocked waiting on the "continue" response to do
> things such as rewriting the system call arguments.
>
> Note furthermore that a user-space notifier can be bypassed if the
> existing filters allow the use of seccomp(2) or prctl(2) to
> install a filter that returns an action value with a higher prece‐
> dence than SECCOMP_RET_USER_NOTIF (see seccomp(2)).
>
> It should thus be absolutely clear that the seccomp user-space
> notification mechanism can not be used to implement a security
> policy! It should only ever be used in scenarios where a more
> privileged process supervises the system calls of a lesser privi‐
> leged target to get around kernel-enforced security restrictions
> when the supervisor deems this safe. In other words, in order to
> continue a system call, the supervisor should be sure that another
> security mechanism or the kernel itself will sufficiently block
> the system call if its arguments are rewritten to something
> unsafe.
>
> Seem okay?

Yeah, sounds good.

[...]
> >> if (s == 0) {
> >> fprintf(stderr, "\tS: read() of /proc/PID/mem "
> >> "returned 0 (EOF)\n");
> >> exit(EXIT_FAILURE);
> >> }
> >>
> >> if (close(procMemFd) == -1)
> >> errExit("close-/proc/PID/mem");
> >
> > We should probably make sure here that the value we read is actually
> > NUL-terminated?
>
> So, I was curious about that point also. But, (why) are we not
> guaranteed that it will be NUL-terminated?

Because it's random memory filled by another process, which we don't
necessarily trust. While seccomp notifiers aren't usable for applying
*extra* security restrictions, the supervisor will still often be more
privileged than the supervised process.
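
A check along these lines could be as small as the following sketch. The function name path_is_terminated() is made up for illustration, and the example program's buffer and variable names are not assumed here.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: return true only if the 'nread' bytes that were
   read from the target's memory into 'buf' contain a terminating NUL,
   so that 'buf' can safely be treated as a C string. */
bool path_is_terminated(const char *buf, ssize_t nread)
{
    assert(buf != NULL);
    if (nread <= 0)
        return false;
    return memchr(buf, '\0', (size_t) nread) != NULL;
}
```

The supervisor would call this right after the read(2) from /proc/PID/mem and refuse to handle the notification (for example, by spoofing an error) if it returns false.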

[...]
> >> /* Discover the sizes of the structures that are used to receive
> >> notifications and send notification responses, and allocate
> >> buffers of those sizes. */
> >>
> >> if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes) == -1)
> >> errExit("\tS: seccomp-SECCOMP_GET_NOTIF_SIZES");
> >>
> >> struct seccomp_notif *req = malloc(sizes.seccomp_notif);
> >> if (req == NULL)
> >> errExit("\tS: malloc");
> >>
> >> struct seccomp_notif_resp *resp = malloc(sizes.seccomp_notif_resp);
> >
> > This should probably do something like max(sizes.seccomp_notif_resp,
> > sizeof(struct seccomp_notif_resp)) in case the program was built
> > against new UAPI headers that make struct seccomp_notif_resp big, but
> > is running under an old kernel where that struct is still smaller?
>
> I'm confused. Why? I mean, if the running kernel says that it expects
> a buffer of a certain size, and we allocate a buffer of that size,
> what's the problem?

Because in userspace, we cast the result of malloc() to a "struct
seccomp_notif_resp *". If the kernel tells us that it expects a size
smaller than sizeof(struct seccomp_notif_resp), then we end up with a
pointer to a struct that consists partly of allocated memory, partly
of out-of-bounds memory, which is generally a bad idea - I'm not sure
whether the C standard permits that. And if userspace then e.g.
decides to access some member of that struct that is beyond what the
kernel thinks is the struct size, we get actual OOB memory accesses.
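
Concretely, something like the following sketch is what I have in mind. The names resp_alloc_size() and alloc_resp() are invented for this mail, and using calloc() to get a zeroed buffer is just one reasonable choice.

```c
#include <assert.h>
#include <stdlib.h>
#include <linux/seccomp.h>

/* Hypothetical helper: pick a buffer size that satisfies both the
   running kernel (sizes->seccomp_notif_resp) and the struct definition
   this program was compiled against. */
size_t resp_alloc_size(const struct seccomp_notif_sizes *sizes)
{
    size_t kernel_size, header_size;

    assert(sizes != NULL);
    kernel_size = sizes->seccomp_notif_resp;
    header_size = sizeof(struct seccomp_notif_resp);
    return kernel_size > header_size ? kernel_size : header_size;
}

/* Hypothetical helper: allocate a zeroed response buffer; every member
   of struct seccomp_notif_resp is then inside the allocation. */
struct seccomp_notif_resp *alloc_resp(const struct seccomp_notif_sizes *sizes)
{
    return calloc(1, resp_alloc_size(sizes));
}
```

The same max() treatment would apply to the struct seccomp_notif buffer used for receiving notifications.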

Subject: Re: For review: seccomp_user_notif(2) manual page

Hello Jann,

Thanks for your reply!

On 10/15/20 10:32 PM, Jann Horn wrote:
> On Thu, Oct 15, 2020 at 1:24 PM Michael Kerrisk (man-pages)
> <[email protected]> wrote:
>> On 9/30/20 5:53 PM, Jann Horn wrote:
>>> On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
>>> <[email protected]> wrote:
>>>> I knew it would be a big ask, but below is kind of the manual page
>>>> I was hoping you might write [1] for the seccomp user-space notification
>>>> mechanism. Since you didn't (and because 5.9 adds various new pieces
>>>> such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
>>>> that also will need documenting [2]), I did :-). But of course I may
>>>> have made mistakes...
> [...]
>>>> 3. The supervisor process will receive notification events on the
>>>> listening file descriptor. These events are returned as
>>>> structures of type seccomp_notif. Because this structure and
>>>> its size may evolve over kernel versions, the supervisor must
>>>> first determine the size of this structure using the sec‐
>>>> comp(2) SECCOMP_GET_NOTIF_SIZES operation, which returns a
>>>> structure of type seccomp_notif_sizes. The supervisor allo‐
>>>> cates a buffer of size seccomp_notif_sizes.seccomp_notif bytes
>>>> to receive notification events. In addition, the supervisor
>>>> allocates another buffer of size seccomp_notif_sizes.sec‐
>>>> comp_notif_resp bytes for the response (a struct sec‐
>>>> comp_notif_resp structure) that it will provide to the kernel
>>>> (and thus the target process).
>>>>
>>>> 4. The target process then performs its workload, which includes
>>>> system calls that will be controlled by the seccomp filter.
>>>> Whenever one of these system calls causes the filter to return
>>>> the SECCOMP_RET_USER_NOTIF action value, the kernel does not
>>>> execute the system call; instead, execution of the target
>>>> process is temporarily blocked inside the kernel and a notifi‐
>>>
>>> where "blocked" refers to the interruptible, restartable kind - if the
>>> child receives a signal with an SA_RESTART signal handler in the
>>> meantime, it'll leave the syscall, go through the signal handler, then
>>> restart the syscall again and send the same request to the supervisor
>>> again. so the supervisor may see duplicate syscalls.
>>
>> So, I partially demonstrated what you describe here, for two example
>> system calls (epoll_wait() and pause()). But I could not exactly
>> demonstrate things as I understand you to be describing them. (So,
>> I'm not sure whether I have not understood you correctly, or
>> if things are not exactly as you describe them.)
>>
>> Here's a scenario (A) that I tested:
>>
>> 1. Target installs seccomp filters for a blocking syscall
>> (epoll_wait() or pause(), both of which should never restart,
>> regardless of SA_RESTART)
>> 2. Target installs SIGINT handler with SA_RESTART
>> 3. Supervisor is sleeping (i.e., is not blocked in
>> SECCOMP_IOCTL_NOTIF_RECV operation).
>> 4. Target makes a blocking system call (epoll_wait() or pause()).
>> 5. SIGINT gets delivered to target; handler gets called;
>> ***and syscall gets restarted by the kernel***
>>
>> That last should never happen, of course, and is a result of the
>> combination of both the user-notify filter and the SA_RESTART flag.
>> If one or other is not present, then the system call is not
>> restarted.
>>
>> So, as you note below, the UAPI gets broken a little.
>>
>> However, from your description above I had understood that
>> something like the following scenario (B) could occur:
>>
>> 1. Target installs seccomp filters for a blocking syscall
>> (epoll_wait() or pause(), both of which should never restart,
>> regardless of SA_RESTART)
>> 2. Target installs SIGINT handler with SA_RESTART
>> 3. Supervisor performs SECCOMP_IOCTL_NOTIF_RECV operation (which
>> blocks).
>> 4. Target makes a blocking system call (epoll_wait() or pause()).
>> 5. Supervisor gets seccomp user-space notification (i.e.,
>> SECCOMP_IOCTL_NOTIF_RECV ioctl() returns)
>> 6. SIGINT gets delivered to target; handler gets called;
>> and syscall gets restarted by the kernel
>> 7. Supervisor performs another SECCOMP_IOCTL_NOTIF_RECV operation
>> which gets another notification for the restarted system call.
>>
>> However, I don't observe such behavior. In step 6, the syscall
>> does not get restarted by the kernel, but instead returns -1/EINTR.
>> Perhaps I have misconstructed my experiment in the second case, or
>> perhaps I've misunderstood what you meant, or is it possibly the
>> case that things are not quite as you said?

Thanks for the code, Jann (including the demo of the CLONE_FILES
technique to pass the notification FD to the supervisor).

But I think your code just demonstrates what I described in
scenario A. So, it seems that I both understood what you
meant (because my code demonstrates the same thing) and
also misunderstood what you said (because I thought you
were meaning something more like scenario B).

I'm not sure if I should write anything about this small UAPI
breakage in BUGS, or not. Your thoughts?

> user@vm:~/test/seccomp-notify-interrupt$ cat seccomp-notify-interrupt.c
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <signal.h>
> #include <err.h>
> #include <errno.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <sched.h>
> #include <stddef.h>
> #include <limits.h>
> #include <sys/mman.h>
> #include <sys/syscall.h>
> #include <sys/prctl.h>
> #include <linux/seccomp.h>
> #include <linux/filter.h>
> #include <linux/futex.h>
>
> struct {
> int seccomp_fd;
> } *shared;
>
> static void handle_signal(int sig, siginfo_t *info, void *uctx) {
> printf("signal handler invoked\n");
> }
>
> int main(void) {
> setbuf(stdout, NULL);
>
> shared = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE,
> MAP_ANONYMOUS|MAP_SHARED, -1, 0);
> if (shared == MAP_FAILED)
> err(1, "mmap");
> shared->seccomp_fd = -1;
>
> /* glibc's clone() wrapper doesn't support fork()-style usage */
> pid_t child = syscall(__NR_clone, CLONE_FILES|SIGCHLD,
> NULL, NULL, NULL, 0);
> if (child == -1) err(1, "clone");
> if (child == 0) {
> /* don't outlive the parent */
> prctl(PR_SET_PDEATHSIG, SIGKILL);
> if (getppid() == 1) exit(0);
>
> prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
> struct sock_filter insns[] = {
> BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)),
> BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_pause, 0, 1),
> BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_USER_NOTIF),
> BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW)
> };
> struct sock_fprog prog = {
> .len = sizeof(insns)/sizeof(insns[0]),
> .filter = insns
> };
> int seccomp_ret = syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
> SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
> if (seccomp_ret < 0)
> err(1, "install");
> printf("installed seccomp: fd %d\n", seccomp_ret);
>
> __atomic_store(&shared->seccomp_fd, &seccomp_ret, __ATOMIC_RELEASE);
> int futex_ret = syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAKE,
> INT_MAX, NULL, NULL, 0);
> printf("woke %d waiters\n", futex_ret);
>
> struct sigaction act = {
> .sa_sigaction = handle_signal,
> .sa_flags = SA_RESTART|SA_SIGINFO
> };
> if (sigaction(SIGUSR1, &act, NULL))
> err(1, "sigaction");
>
> pause();
> perror("pause returned");
> exit(0);
> }
>
> int futex_ret = syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAIT,
> -1, NULL, NULL, 0);
> if (futex_ret == -1 && errno != EAGAIN)
> err(1, "futex wait");
> int fd = __atomic_load_n(&shared->seccomp_fd, __ATOMIC_ACQUIRE);
> printf("child installed seccomp fd %d\n", fd);
>
> sleep(1);
> printf("going to send SIGUSR1...\n");
> kill(child, SIGUSR1);
> sleep(1);
>
> exit(0);
> }
> user@vm:~/test/seccomp-notify-interrupt$ gcc -o
> seccomp-notify-interrupt seccomp-notify-interrupt.c -Wall
> user@vm:~/test/seccomp-notify-interrupt$ strace -f
> ./seccomp-notify-interrupt >/dev/null
> execve("./seccomp-notify-interrupt", ["./seccomp-notify-interrupt"],
> 0x7ffcb31a0d08 /* 42 vars */) = 0
> brk(NULL) = 0x5565864b2000
> access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
> openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
> fstat(3, {st_mode=S_IFREG|0644, st_size=89296, ...}) = 0
> mmap(NULL, 89296, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f7e688e7000
> close(3) = 0
> openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
> read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260A\2\0\0\0\0\0"...,
> 832) = 832
> fstat(3, {st_mode=S_IFREG|0755, st_size=1824496, ...}) = 0
> mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
> 0) = 0x7f7e688e5000
> mmap(NULL, 1837056, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f7e68724000
> mprotect(0x7f7e68746000, 1658880, PROT_NONE) = 0
> mmap(0x7f7e68746000, 1343488, PROT_READ|PROT_EXEC,
> MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x22000) = 0x7f7e68746000
> mmap(0x7f7e6888e000, 311296, PROT_READ,
> MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x16a000) = 0x7f7e6888e000
> mmap(0x7f7e688db000, 24576, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b6000) = 0x7f7e688db000
> mmap(0x7f7e688e1000, 14336, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f7e688e1000
> close(3) = 0
> arch_prctl(ARCH_SET_FS, 0x7f7e688e6500) = 0
> mprotect(0x7f7e688db000, 16384, PROT_READ) = 0
> mprotect(0x556585183000, 4096, PROT_READ) = 0
> mprotect(0x7f7e68924000, 4096, PROT_READ) = 0
> munmap(0x7f7e688e7000, 89296) = 0
> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, -1,
> 0) = 0x7f7e688fc000
> clone(child_stack=NULL, flags=CLONE_FILES|SIGCHLD) = 2558
> futex(0x7f7e688fc000, FUTEX_WAIT, 4294967295, NULLstrace: Process 2558 attached
> <unfinished ...>
> [pid 2558] prctl(PR_SET_PDEATHSIG, SIGKILL) = 0
> [pid 2558] getppid() = 2557
> [pid 2558] prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) = 0
> [pid 2558] seccomp(SECCOMP_SET_MODE_FILTER, 0x8 /*
> SECCOMP_FILTER_FLAG_??? */, {len=4, filter=0x7ffdf7cc9b50}) = 3
> [pid 2558] write(1, "installed seccomp: fd 3\n", 24) = 24
> [pid 2558] futex(0x7f7e688fc000, FUTEX_WAKE, 2147483647 <unfinished ...>
> [pid 2557] <... futex resumed> ) = 0
> [pid 2558] <... futex resumed> ) = 1
> [pid 2558] write(1, "woke 1 waiters\n", 15) = 15
> [pid 2557] write(1, "child installed seccomp fd 3\n", 29) = 29
> [pid 2558] rt_sigaction(SIGUSR1, {sa_handler=0x556585181215,
> sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART|SA_SIGINFO,
> sa_restorer=0x7f7e6875b840}, NULL, 8) = 0
> [pid 2557] nanosleep({tv_sec=1, tv_nsec=0}, <unfinished ...>
> [pid 2558] pause( <unfinished ...>
> [pid 2557] <... nanosleep resumed> 0x7ffdf7cc9b10) = 0
> [pid 2557] write(1, "going to send SIGUSR1...", 24) = 24
> [pid 2557] write(1, "\n", 1) = 1
> [pid 2557] kill(2558, SIGUSR1) = 0
> [pid 2557] nanosleep({tv_sec=1, tv_nsec=0}, <unfinished ...>
> [pid 2558] <... pause resumed> ) = ? ERESTARTSYS (To be
> restarted if SA_RESTART is set)
> [pid 2558] --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER,
> si_pid=2557, si_uid=1000} ---
> [pid 2558] write(1, "signal handler invoked", 22) = 22
> [pid 2558] write(1, "\n", 1) = 1
> [pid 2558] rt_sigreturn({mask=[]}) = 34
> [pid 2558] pause( <unfinished ...>
> [pid 2557] <... nanosleep resumed> 0x7ffdf7cc9b10) = 0
> [pid 2557] exit_group(0) = ?
> [pid 2557] +++ exited with 0 +++
> <... pause resumed>) = ?
> +++ killed by SIGKILL +++
> user@vm:~/test/seccomp-notify-interrupt$

[...]

>>>> In the above scenario, the risk is that the supervisor may
>>>> try to access the memory of a process other than the tar‐
>>>> get. This race can be avoided by following the call to
>>>> open with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to ver‐
>>>> ify that the process that generated the notification is
>>>> still alive. (Note that if the target process subse‐
>>>> quently terminates, its PID won't be reused because there
>>>
>>> That's wrong, the PID can be reused, but the /proc/$pid directory is
>>> internally not associated with the numeric PID, but, conceptually
>>> speaking, with a specific incarnation of the PID, or something like
>>> that. (Actually, it is associated with the "struct pid", which is not
>>> reused, instead of the numeric PID.)
>>
>> Thanks. I simplified the last sentence of the paragraph:
>>
>> In the above scenario, the risk is that the supervisor may
>> try to access the memory of a process other than the tar‐
>> get. This race can be avoided by following the call to
>> open(2) with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to
>> verify that the process that generated the notification is
>> still alive. (Note that if the target terminates after the
>> latter step, a subsequent read(2) from the file descriptor
>> will return 0, indicating end of file.)
>>
>> I think that's probably enough detail.
>
> Maybe make that "may return 0" instead of "will return 0" - reading
> from /proc/$pid/mem can only return 0 in the following cases AFAICS:
>
> 1. task->mm was already gone at open() time
> 2. mm->mm_users has dropped to zero (the mm only has lazytlb users;
> page tables and VMAs are being blown away or have been blown away)
> 3. the syscall was called with length 0
>
> When a process has gone away, normally mm->mm_users will drop to zero,
> but someone else could theoretically still be holding a reference to
> the mm (e.g. someone else in the middle of accessing /proc/$pid/mem).
> (Such references should normally not be very long-lived though.)
>
> Additionally, in the unlikely case that the OOM killer just chomped
> through the page tables of the target process, I think the read will
> return -EIO (same error as if the address was simply unmapped) if the
> address is within a non-shared mapping. (Maybe that's something procfs
> could do better...)

Thanks for all the detail! I changed the text to say "may"
instead of "will".

> [...]
>>>> NOTES
>>>> The file descriptor returned when seccomp(2) is employed with the
>>>> SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using
>>>> poll(2), epoll(7), and select(2). When a notification is pend‐
>>>> ing, these interfaces indicate that the file descriptor is read‐
>>>> able.
>>>
>>> We should probably also point out somewhere that, as
>>> include/uapi/linux/seccomp.h says:
>>>
>>> * Similar precautions should be applied when stacking SECCOMP_RET_USER_NOTIF
>>> * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on the
>>> * same syscall, the most recently added filter takes precedence. This means
>>> * that the new SECCOMP_RET_USER_NOTIF filter can override any
>>> * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing all
>>
>> My takeaway from Christian's comments is that this comment in the kernel
>> source is partially wrong, since it is not possible to install multiple
>> filters with SECCOMP_RET_USER_NOTIF, right?
>
> Yeah. (Well, AFAICS technically, you can add more filters that return
> SECCOMP_RET_USER_NOTIF, but when a filter returns that without having
> a notifier fd attached, seccomp blocks the syscall with -ENOSYS; it
> won't use the notifier fd attached to a different filter in the
> chain.)

Ah yes. I misspoke. I meant to say that only one filter can be installed
with SECCOMP_FILTER_FLAG_NEW_LISTENER (and that's what seccomp(2)
currently says). Also, I just checked, and I have already added the
detail about ENOSYS in seccomp(2).

SECCOMP_RET_USER_NOTIF (since Linux 5.0)
...
If there is no attached supervisor (either because the
filter was not installed with the SECCOMP_FIL‐
TER_FLAG_NEW_LISTENER flag or because the file descriptor
was closed), the filter returns ENOSYS (similar to what
happens when a filter returns SECCOMP_RET_TRACE and there
is no tracer). See seccomp_user_notif(2) for further
details.

[...]

>>>> if (s == 0) {
>>>> fprintf(stderr, "\tS: read() of /proc/PID/mem "
>>>> "returned 0 (EOF)\n");
>>>> exit(EXIT_FAILURE);
>>>> }
>>>>
>>>> if (close(procMemFd) == -1)
>>>> errExit("close-/proc/PID/mem");
>>>
>>> We should probably make sure here that the value we read is actually
>>> NUL-terminated?
>>
>> So, I was curious about that point also. But, (why) are we not
>> guaranteed that it will be NUL-terminated?
>
> Because it's random memory filled by another process, which we don't
> necessarily trust. While seccomp notifiers aren't usable for applying
> *extra* security restrictions, the supervisor will still often be more
> privileged than the supervised process.

D'oh! Yes, I see that I failed my Security Engineering 101 exam.

How about:

/* We have no guarantees about what was in the memory of the target
process. Therefore, we ensure that 'path' is null-terminated. Such
precautions are particularly important in cases where (as is
common) the supervisor is running at a higher privilege level
than the target. */

// 'len' is size of buffer; 's' is return value from pread()
int zeroIdx = len - 1;
if (s < zeroIdx)
zeroIdx = s;
path[zeroIdx] = '\0';

Or just simply:

path[len - 1] = '\0';

?

>>>> /* Discover the sizes of the structures that are used to receive
>>>> notifications and send notification responses, and allocate
>>>> buffers of those sizes. */
>>>>
>>>> if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes) == -1)
>>>> errExit("\tS: seccomp-SECCOMP_GET_NOTIF_SIZES");
>>>>
>>>> struct seccomp_notif *req = malloc(sizes.seccomp_notif);
>>>> if (req == NULL)
>>>> errExit("\tS: malloc");
>>>>
>>>> struct seccomp_notif_resp *resp = malloc(sizes.seccomp_notif_resp);
>>>
>>> This should probably do something like max(sizes.seccomp_notif_resp,
>>> sizeof(struct seccomp_notif_resp)) in case the program was built
>>> against new UAPI headers that make struct seccomp_notif_resp big, but
>>> is running under an old kernel where that struct is still smaller?
>>
>> I'm confused. Why? I mean, if the running kernel says that it expects
>> a buffer of a certain size, and we allocate a buffer of that size,
>> what's the problem?
>
> Because in userspace, we cast the result of malloc() to a "struct
> seccomp_notif_resp *". If the kernel tells us that it expects a size
> smaller than sizeof(struct seccomp_notif_resp), then we end up with a
> pointer to a struct that consists partly of allocated memory, partly
> of out-of-bounds memory, which is generally a bad idea - I'm not sure
> whether the C standard permits that. And if userspace then e.g.
> decides to access some member of that struct that is beyond what the
> kernel thinks is the struct size, we get actual OOB memory accesses.

Thanks. Got it. (But gosh, this seems like a fragile API mess.)

I added the following to the code:

/* When allocating the response buffer, we must allow for the fact
that the user-space binary may have been built with user-space
headers where 'struct seccomp_notif_resp' is bigger than the
response buffer expected by the (older) kernel. Therefore, we
allocate a buffer that is the maximum of the two sizes. This
ensures that if the supervisor places bytes into the response
structure that are past the response size that the kernel expects,
then the supervisor is not touching an invalid memory location. */

size_t resp_size = sizes.seccomp_notif_resp;
if (sizeof(struct seccomp_notif_resp) > resp_size)
resp_size = sizeof(struct seccomp_notif_resp);

struct seccomp_notif_resp *resp = malloc(resp_size);

Okay?

Thanks,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2020-10-17 06:08:09

by Jann Horn

Subject: Re: For review: seccomp_user_notif(2) manual page

On Fri, Oct 16, 2020 at 8:29 PM Michael Kerrisk (man-pages)
<[email protected]> wrote:
> On 10/15/20 10:32 PM, Jann Horn wrote:
> > On Thu, Oct 15, 2020 at 1:24 PM Michael Kerrisk (man-pages)
> > <[email protected]> wrote:
> >> On 9/30/20 5:53 PM, Jann Horn wrote:
> >>> On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> >>> <[email protected]> wrote:
> >>>> I knew it would be a big ask, but below is kind of the manual page
> >>>> I was hoping you might write [1] for the seccomp user-space notification
> >>>> mechanism. Since you didn't (and because 5.9 adds various new pieces
> >>>> such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
> >>>> that also will need documenting [2]), I did :-). But of course I may
> >>>> have made mistakes...
> > [...]
> >>>> 3. The supervisor process will receive notification events on the
> >>>> listening file descriptor. These events are returned as
> >>>> structures of type seccomp_notif. Because this structure and
> >>>> its size may evolve over kernel versions, the supervisor must
> >>>> first determine the size of this structure using the sec‐
> >>>> comp(2) SECCOMP_GET_NOTIF_SIZES operation, which returns a
> >>>> structure of type seccomp_notif_sizes. The supervisor allo‐
> >>>> cates a buffer of size seccomp_notif_sizes.seccomp_notif bytes
> >>>> to receive notification events. In addition, the supervisor
> >>>> allocates another buffer of size seccomp_notif_sizes.sec‐
> >>>> comp_notif_resp bytes for the response (a struct sec‐
> >>>> comp_notif_resp structure) that it will provide to the kernel
> >>>> (and thus the target process).
> >>>>
> >>>> 4. The target process then performs its workload, which includes
> >>>> system calls that will be controlled by the seccomp filter.
> >>>> Whenever one of these system calls causes the filter to return
> >>>> the SECCOMP_RET_USER_NOTIF action value, the kernel does not
> >>>> execute the system call; instead, execution of the target
> >>>> process is temporarily blocked inside the kernel and a notifi‐
> >>>
> >>> where "blocked" refers to the interruptible, restartable kind - if the
> >>> child receives a signal with an SA_RESTART signal handler in the
> >>> meantime, it'll leave the syscall, go through the signal handler, then
> >>> restart the syscall again and send the same request to the supervisor
> >>> again. so the supervisor may see duplicate syscalls.
> >>
> >> So, I partially demonstrated what you describe here, for two example
> >> system calls (epoll_wait() and pause()). But I could not exactly
> >> demonstrate things as I understand you to be describing them. (So,
> >> I'm not sure whether I have not understood you correctly, or
> >> if things are not exactly as you describe them.)
> >>
> >> Here's a scenario (A) that I tested:
> >>
> >> 1. Target installs seccomp filters for a blocking syscall
> >> (epoll_wait() or pause(), both of which should never restart,
> >> regardless of SA_RESTART)
> >> 2. Target installs SIGINT handler with SA_RESTART
> >> 3. Supervisor is sleeping (i.e., is not blocked in
> >> SECCOMP_IOCTL_NOTIF_RECV operation).
> >> 4. Target makes a blocking system call (epoll_wait() or pause()).
> >> 5. SIGINT gets delivered to target; handler gets called;
> >> ***and syscall gets restarted by the kernel***
> >>
> >> That last should never happen, of course, and is a result of the
> >> combination of both the user-notify filter and the SA_RESTART flag.
> >> If one or other is not present, then the system call is not
> >> restarted.
> >>
> >> So, as you note below, the UAPI gets broken a little.
> >>
> >> However, from your description above I had understood that
> >> something like the following scenario (B) could occur:
> >>
> >> 1. Target installs seccomp filters for a blocking syscall
> >> (epoll_wait() or pause(), both of which should never restart,
> >> regardless of SA_RESTART)
> >> 2. Target installs SIGINT handler with SA_RESTART
> >> 3. Supervisor performs SECCOMP_IOCTL_NOTIF_RECV operation (which
> >> blocks).
> >> 4. Target makes a blocking system call (epoll_wait() or pause()).
> >> 5. Supervisor gets seccomp user-space notification (i.e.,
> >> SECCOMP_IOCTL_NOTIF_RECV ioctl() returns)
> >> 6. SIGINT gets delivered to target; handler gets called;
> >> and syscall gets restarted by the kernel
> >> 7. Supervisor performs another SECCOMP_IOCTL_NOTIF_RECV operation
> >> which gets another notification for the restarted system call.
> >>
> >> However, I don't observe such behavior. In step 6, the syscall
> >> does not get restarted by the kernel, but instead returns -1/EINTR.
> >> Perhaps I have misconstructed my experiment in the second case, or
> >> perhaps I've misunderstood what you meant, or is it possibly the
> >> case that things are not quite as you said?
>
> Thanks for the code, Jann (including the demo of the CLONE_FILES
> technique to pass the notification FD to the supervisor).
>
> But I think your code just demonstrates what I described in
> scenario A. So, it seems that I both understood what you
> meant (because my code demonstrates the same thing) and
> also misunderstood what you said (because I thought you
> were meaning something more like scenario B).

Ahh, sorry, I should've read your mail more carefully. Indeed, that
testcase only shows scenario A. But the following shows scenario B...



user@vm:~/test/seccomp-notify-interrupt$ cat seccomp-notify-interrupt-b.c
#define _GNU_SOURCE
#include <stdio.h>
#include <signal.h>
#include <err.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <sched.h>
#include <stddef.h>
#include <string.h>
#include <limits.h>
#include <inttypes.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <sys/prctl.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <linux/futex.h>

struct {
  int seccomp_fd;
} *shared;

static void handle_signal(int sig, siginfo_t *info, void *uctx) {
  const char *msg = "signal handler invoked\n";
  write(1, msg, strlen(msg));
}

static size_t max_size(size_t a, size_t b) {
  return (a > b) ? a : b;
}

int main(void) {
  setbuf(stdout, NULL);

  shared = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE,
                MAP_ANONYMOUS|MAP_SHARED, -1, 0);
  if (shared == MAP_FAILED)
    err(1, "mmap");
  shared->seccomp_fd = -1;

  /* glibc's clone() wrapper doesn't support fork()-style usage */
  pid_t child = syscall(__NR_clone, CLONE_FILES|SIGCHLD,
                        NULL, NULL, NULL, 0);
  if (child == -1) err(1, "clone");
  if (child == 0) {
    /* don't outlive the parent */
    prctl(PR_SET_PDEATHSIG, SIGKILL);
    if (getppid() == 1) exit(0);

    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
    struct sock_filter insns[] = {
      BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)),
      BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_pause, 0, 1),
      BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_USER_NOTIF),
      BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW)
    };
    struct sock_fprog prog = {
      .len = sizeof(insns)/sizeof(insns[0]),
      .filter = insns
    };
    int seccomp_ret = syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
                              SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
    if (seccomp_ret < 0)
      err(1, "install");
    printf("installed seccomp: fd %d\n", seccomp_ret);

    __atomic_store(&shared->seccomp_fd, &seccomp_ret, __ATOMIC_RELEASE);
    int futex_ret = syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAKE,
                            INT_MAX, NULL, NULL, 0);
    printf("woke %d waiters\n", futex_ret);

    struct sigaction act = {
      .sa_sigaction = handle_signal,
      .sa_flags = SA_RESTART|SA_SIGINFO
    };
    if (sigaction(SIGUSR1, &act, NULL))
      err(1, "sigaction");

    pause();
    perror("pause returned");
    exit(0);
  }

  int futex_ret = syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAIT,
                          -1, NULL, NULL, 0);
  if (futex_ret == -1 && errno != EAGAIN)
    err(1, "futex wait");
  int fd = __atomic_load_n(&shared->seccomp_fd, __ATOMIC_ACQUIRE);
  printf("child installed seccomp fd %d\n", fd);

  struct seccomp_notif_sizes sizes;
  if (syscall(__NR_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, &sizes))
    err(1, "notif_sizes");
  struct seccomp_notif *notif = malloc(max_size(
    sizeof(struct seccomp_notif),
    sizes.seccomp_notif
  ));
  if (!notif)
    err(1, "malloc");
  for (int i=0; i<4; i++) {
    memset(notif, '\0', sizes.seccomp_notif);
    if (ioctl(fd, SECCOMP_IOCTL_NOTIF_RECV, notif))
      err(1, "notif_recv");
    printf("got notif: id=%" PRIu64 " pid=%u nr=%d\n",
           notif->id, notif->pid, notif->data.nr);
    sleep(1);
    printf("going to send SIGUSR1...\n");
    kill(child, SIGUSR1);
  }
  sleep(1);

  exit(0);
}
user@vm:~/test/seccomp-notify-interrupt$ gcc -o seccomp-notify-interrupt-b seccomp-notify-interrupt-b.c
user@vm:~/test/seccomp-notify-interrupt$ ./seccomp-notify-interrupt-b
installed seccomp: fd 3
woke 1 waiters
child installed seccomp fd 3
got notif: id=4490537653766950251 pid=2641 nr=34
going to send SIGUSR1...
signal handler invoked
got notif: id=4490537653766950252 pid=2641 nr=34
going to send SIGUSR1...
signal handler invoked
got notif: id=4490537653766950253 pid=2641 nr=34
going to send SIGUSR1...
signal handler invoked
got notif: id=4490537653766950254 pid=2641 nr=34
going to send SIGUSR1...
signal handler invoked
user@vm:~/test/seccomp-notify-interrupt$
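
For illustration only (this is not part of the test program above): the
output shows the supervisor receiving a fresh notification ID each time
the target's pause() is restarted, so an ID it is still holding may
already be stale. One way a supervisor might detect that is to ask the
kernel whether the ID is still live before acting on it. A minimal
sketch, assuming the documented behaviour that
SECCOMP_IOCTL_NOTIF_ID_VALID fails with ENOENT once the ID is no longer
valid; the helper name notif_id_still_valid() is hypothetical:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

/* Hypothetical helper, not taken from the thread.
   Returns 1 if the notification 'id' is still pending on 'notify_fd',
   0 if the kernel reports ENOENT (the target left the system call,
   for example because a signal handler ran, or the target died),
   and -1 on any other ioctl() failure, with errno left as set. */
int notif_id_still_valid(int notify_fd, uint64_t id)
{
    __u64 kernel_id = id;

    /* Debug-build precondition: caller passes a plausible descriptor. */
    assert(notify_fd >= 0);

    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &kernel_id) == 0)
        return 1;
    return (errno == ENOENT) ? 0 : -1;
}
```

A supervisor would presumably call this after any slow step (such as
reading /proc/PID/mem) and simply drop the request when it returns 0,
since the restarted system call shows up as a new notification anyway.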



> I'm not sure if I should write anything about this small UAPI
> breakage in BUGS, or not. Your thoughts?

Thinking about it a bit more: Any code that relies on pause() or
epoll_wait() not restarting is buggy anyway, right? Because a signal
could also arrive directly before entering the syscall, while
userspace code is still executing? So one could argue that we're just
enlarging a preexisting race. (Unless the signal handler checks the
interrupted register state to figure out whether we already entered
syscall handling?)

If userspace relies on non-restarting behavior, it should be using
something like epoll_pwait(). And that stuff only unblocks signals
after we've already passed the seccomp checks on entry. (I guess this
also means that anything that uses pause() properly effectively has to
either run pause() in a loop with nothing else [iow, not care whether
pause() restarts] or siglongjmp() out of the signal handler [iow,
unwind through the signal frame]?)

So we should probably document the restarting behavior as something
the supervisor has to deal with in the manpage; but for the
"non-restarting syscalls can restart from the target's perspective"
aspect, it might be enough to document this as quirky behavior that
can't actually break correct code? (Or not document it at all. Dunno.)
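
To make the "correct code" pattern concrete, here is a minimal sketch of
a race-free wait. It uses sigsuspend() rather than epoll_pwait(), but
the idea is the same: the signal stays blocked until the waiting call
itself unblocks it, so it cannot slip in beforehand. The function names
and the choice of SIGUSR1 are assumptions for illustration, not
something from the man page draft:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>

static volatile sig_atomic_t got_sigusr1;

static void on_sigusr1(int sig)
{
    (void)sig;
    got_sigusr1 = 1;
}

/* Install the handler and block SIGUSR1, saving the previous mask in
   *old. From here on the signal can only be delivered while the
   caller is inside sigsuspend(). Returns 0 on success, -1 on error. */
int prepare_wait(sigset_t *old)
{
    struct sigaction act = { .sa_handler = on_sigusr1 };
    sigset_t block;

    sigemptyset(&act.sa_mask);
    if (sigaction(SIGUSR1, &act, NULL) == -1)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    return sigprocmask(SIG_BLOCK, &block, old);
}

/* Wait until the handler has run at least once. The loop on the flag
   means it does not matter whether the wait is interrupted, restarted,
   or entered after the signal was already pending. Returns 0. */
int wait_for_sigusr1(const sigset_t *old)
{
    sigset_t wait_mask = *old;

    sigdelset(&wait_mask, SIGUSR1);
    while (!got_sigusr1) {
        int r = sigsuspend(&wait_mask);
        /* sigsuspend() only ever returns -1 with EINTR. */
        assert(r == -1 && errno == EINTR);
        (void)r;
    }
    return 0;
}
```

Code written this way never depends on whether a particular blocking
call restarts, which is why the extra restart introduced by the
user-notification path should be harmless to it.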

[...]
> >>>> if (s == 0) {
> >>>> fprintf(stderr, "\tS: read() of /proc/PID/mem "
> >>>> "returned 0 (EOF)\n");
> >>>> exit(EXIT_FAILURE);
> >>>> }
> >>>>
> >>>> if (close(procMemFd) == -1)
> >>>> errExit("close-/proc/PID/mem");
> >>>
> >>> We should probably make sure here that the value we read is actually
> >>> NUL-terminated?
> >>
> >> So, I was curious about that point also. But, (why) are we not
> >> guaranteed that it will be NUL-terminated?
> >
> > Because it's random memory filled by another process, which we don't
> > necessarily trust. While seccomp notifiers aren't usable for applying
> > *extra* security restrictions, the supervisor will still often be more
> > privileged than the supervised process.
>
> D'oh! Yes, I see that I failed my Security Engineering 101 exam.
>
> How about:
>
> /* We have no guarantees about what was in the memory of the target
> process. Therefore, we ensure that 'path' is null-terminated. Such
> precautions are particularly important in cases where (as is
> common) the supervisor is running at a higher privilege level
> than the target. */
>
> // 'len' is size of buffer; 's' is return value from pread()
> int zeroIdx = len - 1;
> if (s < zeroIdx)
> zeroIdx = s;
> path[zeroIdx] = '\0';
>
> Or just simply:
>
> path[len - 1] = '\0';
>
> ?

I'd either do "path[s-1] = '\0'" or bail out if "path[s - 1] != '\0'".
Especially if we haven't NUL-terminated the buffer before reading into
it, we shouldn't write a nullbyte to path[len - 1], since the bytes in
front of that will stay uninitialized.
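
A minimal sketch of the "bail out" variant, for illustration. Instead of
looking only at path[s - 1], it accepts the buffer if a NUL byte appears
anywhere within the 's' bytes that were actually read, and never looks
at the uninitialized tail. The helper name checked_path_len() is
hypothetical:

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper, not from the example program.
   'buf' holds 'nread' bytes returned by read()/pread() from the
   target's memory. Returns the string length if a NUL terminator lies
   within those bytes, or -1 if the read failed, returned nothing, or
   the data is not terminated (in which case the caller should refuse
   to use the buffer as a C string). */
ssize_t checked_path_len(const char *buf, ssize_t nread)
{
    const char *nul;

    assert(buf != NULL);

    if (nread <= 0)
        return -1;

    /* Search only the bytes that were really filled in. */
    nul = memchr(buf, '\0', (size_t)nread);
    return nul ? (ssize_t)(nul - buf) : -1;
}
```

The supervisor would call it as checked_path_len(path, s) right after
the read and treat -1 as an untrustworthy request.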

(Oh, by the way: In general, reading path buffers like this (with the
read potentially going beyond the end of the actual buffer) can
have... interesting interactions with userfaultfd. If the path is
stored in one page, starting at a non-zero offset inside the page, our
read will always overlap into the second page. That second page might
belong to a completely different VMA. If that VMA has a userfaultfd
handler, we'll take a userfaultfd fault and wait for the userfaultfd
handler to service the fault. Normally that's fine-ish; but if the
target thread is supposed to *be* the thread handling userfaultfd
faults in its process (and it never intentionally accesses any
userfaultfd regions, only other threads do that), userspace will
deadlock, because the thread waiting for userfaultfd fault resolution
is the same one that's blocked on the userfaultfd. But this is not
special to seccomp; there are syscalls that do the same thing,
although their over-reads are typically smaller. E.g.
do_strncpy_from_user() over-reads by up to 7 bytes. But when this came
up in a discussion with Linus Torvalds, he said it was a theoretical
concern; so I guess if the kernel seems fine with doing that in
practice, we probably don't care too much here either.)

> >>>> /* Discover the sizes of the structures that are used to receive
> >>>> notifications and send notification responses, and allocate
> >>>> buffers of those sizes. */
> >>>>
> >>>> if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes) == -1)
> >>>> errExit("\tS: seccomp-SECCOMP_GET_NOTIF_SIZES");
> >>>>
> >>>> struct seccomp_notif *req = malloc(sizes.seccomp_notif);
> >>>> if (req == NULL)
> >>>> errExit("\tS: malloc");
> >>>>
> >>>> struct seccomp_notif_resp *resp = malloc(sizes.seccomp_notif_resp);
> >>>
> >>> This should probably do something like max(sizes.seccomp_notif_resp,
> >>> sizeof(struct seccomp_notif_resp)) in case the program was built
> >>> against new UAPI headers that make struct seccomp_notif_resp big, but
> >>> is running under an old kernel where that struct is still smaller?
> >>
> >> I'm confused. Why? I mean, if the running kernel says that it expects
> >> a buffer of a certain size, and we allocate a buffer of that size,
> >> what's the problem?
> >
> > Because in userspace, we cast the result of malloc() to a "struct
> > seccomp_notif_resp *". If the kernel tells us that it expects a size
> > smaller than sizeof(struct seccomp_notif_resp), then we end up with a
> > pointer to a struct that consists partly of allocated memory, partly
> > of out-of-bounds memory, which is generally a bad idea - I'm not sure
> > whether the C standard permits that. And if userspace then e.g.
> > decides to access some member of that struct that is beyond what the
> > kernel thinks is the struct size, we get actual OOB memory accesses.
>
> Thanks. Got it. (But gosh, this seems like a fragile API mess.)
>
> I added the following to the code:
>
> /* When allocating the response buffer, we must allow for the fact
>    that the user-space binary may have been built with user-space
>    headers where 'struct seccomp_notif_resp' is bigger than the
>    response buffer expected by the (older) kernel. Therefore, we
>    allocate a buffer that is the maximum of the two sizes. This
>    ensures that if the supervisor places bytes into the response
>    structure that are past the response size that the kernel expects,
>    then the supervisor is not touching an invalid memory location. */
>
> size_t resp_size = sizes.seccomp_notif_resp;
> if (sizeof(struct seccomp_notif_resp) > resp_size)
>     resp_size = sizeof(struct seccomp_notif_resp);
>
> struct seccomp_notif_resp *resp = malloc(resp_size);
>
> Okay?

Looks good.

Subject: Re: For review: seccomp_user_notif(2) manual page

Hello Jann,

On 10/17/20 2:25 AM, Jann Horn wrote:
> On Fri, Oct 16, 2020 at 8:29 PM Michael Kerrisk (man-pages)
> <[email protected]> wrote:
>> On 10/15/20 10:32 PM, Jann Horn wrote:
>>> On Thu, Oct 15, 2020 at 1:24 PM Michael Kerrisk (man-pages)
>>> <[email protected]> wrote:
>>>> On 9/30/20 5:53 PM, Jann Horn wrote:
>>>>> On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
>>>>> <[email protected]> wrote:
>>>>>> I knew it would be a big ask, but below is kind of the manual page
>>>>>> I was hoping you might write [1] for the seccomp user-space notification
>>>>>> mechanism. Since you didn't (and because 5.9 adds various new pieces
>>>>>> such as SECCOMP_ADDFD_FLAG_SETFD and SECCOMP_IOCTL_NOTIF_ADDFD
>>>>>> that also will need documenting [2]), I did :-). But of course I may
>>>>>> have made mistakes...
>>> [...]
>>>>>> 3. The supervisor process will receive notification events on the
>>>>>> listening file descriptor. These events are returned as
>>>>>> structures of type seccomp_notif. Because this structure and
>>>>>> its size may evolve over kernel versions, the supervisor must
>>>>>> first determine the size of this structure using the sec‐
>>>>>> comp(2) SECCOMP_GET_NOTIF_SIZES operation, which returns a
>>>>>> structure of type seccomp_notif_sizes. The supervisor allo‐
>>>>>> cates a buffer of size seccomp_notif_sizes.seccomp_notif bytes
> >>>>>> to receive notification events. In addition, the supervisor
>>>>>> allocates another buffer of size seccomp_notif_sizes.sec‐
>>>>>> comp_notif_resp bytes for the response (a struct sec‐
>>>>>> comp_notif_resp structure) that it will provide to the kernel
>>>>>> (and thus the target process).
>>>>>>
>>>>>> 4. The target process then performs its workload, which includes
>>>>>> system calls that will be controlled by the seccomp filter.
>>>>>> Whenever one of these system calls causes the filter to return
>>>>>> the SECCOMP_RET_USER_NOTIF action value, the kernel does not
>>>>>> execute the system call; instead, execution of the target
>>>>>> process is temporarily blocked inside the kernel and a notifi‐
>>>>>
>>>>> where "blocked" refers to the interruptible, restartable kind - if the
>>>>> child receives a signal with an SA_RESTART signal handler in the
>>>>> meantime, it'll leave the syscall, go through the signal handler, then
>>>>> restart the syscall again and send the same request to the supervisor
>>>>> again. so the supervisor may see duplicate syscalls.
>>>>
>>>> So, I partially demonstrated what you describe here, for two example
>>>> system calls (epoll_wait() and pause()). But I could not exactly
>>>> demonstrate things as I understand you to be describing them. (So,
>>>> I'm not sure whether I have not understood you correctly, or
>>>> if things are not exactly as you describe them.)
>>>>
>>>> Here's a scenario (A) that I tested:
>>>>
>>>> 1. Target installs seccomp filters for a blocking syscall
>>>> (epoll_wait() or pause(), both of which should never restart,
>>>> regardless of SA_RESTART)
>>>> 2. Target installs SIGINT handler with SA_RESTART
>>>> 3. Supervisor is sleeping (i.e., is not blocked in
>>>> SECCOMP_IOCTL_NOTIF_RECV operation).
>>>> 4. Target makes a blocking system call (epoll_wait() or pause()).
>>>> 5. SIGINT gets delivered to target; handler gets called;
>>>> ***and syscall gets restarted by the kernel***
>>>>
>>>> That last should never happen, of course, and is a result of the
>>>> combination of both the user-notify filter and the SA_RESTART flag.
>>>> If one or other is not present, then the system call is not
>>>> restarted.
>>>>
>>>> So, as you note below, the UAPI gets broken a little.
>>>>
>>>> However, from your description above I had understood that
>>>> something like the following scenario (B) could occur:
>>>>
>>>> 1. Target installs seccomp filters for a blocking syscall
>>>> (epoll_wait() or pause(), both of which should never restart,
>>>> regardless of SA_RESTART)
>>>> 2. Target installs SIGINT handler with SA_RESTART
>>>> 3. Supervisor performs SECCOMP_IOCTL_NOTIF_RECV operation (which
>>>> blocks).
>>>> 4. Target makes a blocking system call (epoll_wait() or pause()).
>>>> 5. Supervisor gets seccomp user-space notification (i.e.,
> >>>> SECCOMP_IOCTL_NOTIF_RECV ioctl() returns)
>>>> 6. SIGINT gets delivered to target; handler gets called;
>>>> and syscall gets restarted by the kernel
>>>> 7. Supervisor performs another SECCOMP_IOCTL_NOTIF_RECV operation
>>>> which gets another notification for the restarted system call.
>>>>
>>>> However, I don't observe such behavior. In step 6, the syscall
>>>> does not get restarted by the kernel, but instead returns -1/EINTR.
>>>> Perhaps I have misconstructed my experiment in the second case, or
>>>> perhaps I've misunderstood what you meant, or is it possibly the
>>>> case that things are not quite as you said?
>>
>> Thanks for the code, Jann (including the demo of the CLONE_FILES
>> technique to pass the notification FD to the supervisor).
>>
>> But I think your code just demonstrates what I described in
>> scenario A. So, it seems that I both understood what you
>> meant (because my code demonstrates the same thing) and
>> also misunderstood what you said (because I thought you
>> were meaning something more like scenario B).
>
> Ahh, sorry, I should've read your mail more carefully. Indeed, that
> testcase only shows scenario A. But the following shows scenario B...
>
> user@vm:~/test/seccomp-notify-interrupt$ cat seccomp-notify-interrupt-b.c
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <signal.h>
> #include <err.h>
> #include <errno.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <sched.h>
> #include <stddef.h>
> #include <string.h>
> #include <limits.h>
> #include <inttypes.h>
> #include <sys/mman.h>
> #include <sys/syscall.h>
> #include <sys/ioctl.h>
> #include <sys/prctl.h>
> #include <linux/seccomp.h>
> #include <linux/filter.h>
> #include <linux/futex.h>
>
> struct {
>   int seccomp_fd;
> } *shared;
>
> static void handle_signal(int sig, siginfo_t *info, void *uctx) {
>   const char *msg = "signal handler invoked\n";
>   write(1, msg, strlen(msg));
> }
>
> static size_t max_size(size_t a, size_t b) {
>   return (a > b) ? a : b;
> }
>
> int main(void) {
>   setbuf(stdout, NULL);
>
>   shared = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE,
>                 MAP_ANONYMOUS|MAP_SHARED, -1, 0);
>   if (shared == MAP_FAILED)
>     err(1, "mmap");
>   shared->seccomp_fd = -1;
>
>   /* glibc's clone() wrapper doesn't support fork()-style usage */
>   pid_t child = syscall(__NR_clone, CLONE_FILES|SIGCHLD,
>                         NULL, NULL, NULL, 0);
>   if (child == -1) err(1, "clone");
>   if (child == 0) {
>     /* don't outlive the parent */
>     prctl(PR_SET_PDEATHSIG, SIGKILL);
>     if (getppid() == 1) exit(0);
>
>     prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
>     struct sock_filter insns[] = {
>       BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)),
>       BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_pause, 0, 1),
>       BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_USER_NOTIF),
>       BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW)
>     };
>     struct sock_fprog prog = {
>       .len = sizeof(insns)/sizeof(insns[0]),
>       .filter = insns
>     };
>     int seccomp_ret = syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
>                               SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
>     if (seccomp_ret < 0)
>       err(1, "install");
>     printf("installed seccomp: fd %d\n", seccomp_ret);
>
>     __atomic_store(&shared->seccomp_fd, &seccomp_ret, __ATOMIC_RELEASE);
>     int futex_ret = syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAKE,
>                             INT_MAX, NULL, NULL, 0);
>     printf("woke %d waiters\n", futex_ret);
>
>     struct sigaction act = {
>       .sa_sigaction = handle_signal,
>       .sa_flags = SA_RESTART|SA_SIGINFO
>     };
>     if (sigaction(SIGUSR1, &act, NULL))
>       err(1, "sigaction");
>
>     pause();
>     perror("pause returned");
>     exit(0);
>   }
>
>   int futex_ret = syscall(__NR_futex, &shared->seccomp_fd, FUTEX_WAIT,
>                           -1, NULL, NULL, 0);
>   if (futex_ret == -1 && errno != EAGAIN)
>     err(1, "futex wait");
>   int fd = __atomic_load_n(&shared->seccomp_fd, __ATOMIC_ACQUIRE);
>   printf("child installed seccomp fd %d\n", fd);
>
>   struct seccomp_notif_sizes sizes;
>   if (syscall(__NR_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, &sizes))
>     err(1, "notif_sizes");
>   struct seccomp_notif *notif = malloc(max_size(
>     sizeof(struct seccomp_notif),
>     sizes.seccomp_notif
>   ));
>   if (!notif)
>     err(1, "malloc");
>   for (int i=0; i<4; i++) {
>     memset(notif, '\0', sizes.seccomp_notif);
>     if (ioctl(fd, SECCOMP_IOCTL_NOTIF_RECV, notif))
>       err(1, "notif_recv");
>     printf("got notif: id=%" PRIu64 " pid=%u nr=%d\n",
>            notif->id, notif->pid, notif->data.nr);
>     sleep(1);
>     printf("going to send SIGUSR1...\n");
>     kill(child, SIGUSR1);
>   }
>   sleep(1);
>
>   exit(0);
> }
> user@vm:~/test/seccomp-notify-interrupt$ gcc -o seccomp-notify-interrupt-b seccomp-notify-interrupt-b.c
> user@vm:~/test/seccomp-notify-interrupt$ ./seccomp-notify-interrupt-b
> installed seccomp: fd 3
> woke 1 waiters
> child installed seccomp fd 3
> got notif: id=4490537653766950251 pid=2641 nr=34
> going to send SIGUSR1...
> signal handler invoked
> got notif: id=4490537653766950252 pid=2641 nr=34
> going to send SIGUSR1...
> signal handler invoked
> got notif: id=4490537653766950253 pid=2641 nr=34
> going to send SIGUSR1...
> signal handler invoked
> got notif: id=4490537653766950254 pid=2641 nr=34
> going to send SIGUSR1...
> signal handler invoked
> user@vm:~/test/seccomp-notify-interrupt$

Thanks for that! Clearly I must have messed up something when
I tried to construct the code to test that scenario.

>> I'm not sure if I should write anything about this small UAPI
>> breakage in BUGS, or not. Your thoughts?
>
> Thinking about it a bit more: Any code that relies on pause() or
> epoll_wait() not restarting is buggy anyway, right? Because a signal
> could also arrive directly before entering the syscall, while
> userspace code is still executing? So one could argue that we're just
> enlarging a preexisting race. (Unless the signal handler checks the
> interrupted register state to figure out whether we already entered
> syscall handling?)

Yes, that all makes sense.

> If userspace relies on non-restarting behavior, it should be using
> something like epoll_pwait(). And that stuff only unblocks signals
> after we've already passed the seccomp checks on entry.

Thanks for elaborating that detail, since as soon as you talked
about "enlarging a preexisting race" above, I immediately wondered
about sigsuspend(), pselect(), etc.

(Mind you, I still wonder about the effect on system calls that
are normally nonrestartable because they have timeouts. My
understanding is that the kernel doesn't restart those system
calls because it's impossible for the kernel to restart the call
with the right timeout value. I wonder what happens when those
system calls are restarted in the scenario we're discussing.)

Anyway, returning to your point... So, to be clear (and to
quickly remind myself in case I one day reread this thread),
there is not a problem with sigsuspend(), pselect(), ppoll(),
and epoll_pwait() since:

* Before the syscall, signals are blocked in the target.
* Inside the syscall, signals are still blocked at the time
  the check is made for seccomp filters.
* If a seccomp user-space notification event kicks in, the target
  is put to sleep with the signals still blocked.
* The signal will only get delivered after the supervisor either
  triggers a spoofed success/failure return in the target or
  sends a CONTINUE response to the kernel telling it to execute
  the target's system call. Either way, there won't be any
  restarting of the target's system call (and the supervisor
  thus won't see multiple notifications).

(Right?)

> (I guess this
> also means that anything that uses pause() properly effectively has to
> either run pause() in a loop with nothing else [iow, not care whether
> pause() restarts] or siglongjmp() out of the signal handler [iow,
> unwind through the signal frame]?)

Yes, that's my understanding. Simple pause() (vs sigsuspend())
is always racy.

> So we should probably document the restarting behavior as something
> the supervisor has to deal with in the manpage; but for the
> "non-restarting syscalls can restart from the target's perspective"
> aspect, it might be enough to document this as quirky behavior that
> can't actually break correct code? (Or not document it at all. Dunno.)

So, I've added the following to the page:

Interaction with SA_RESTART signal handlers
Consider the following scenario:

· The target process has used sigaction(2) to install a signal
handler with the SA_RESTART flag.

· The target has made a system call that triggered a seccomp user-
space notification and the target is currently blocked until the
supervisor sends a notification response.

· A signal is delivered to the target and the signal handler is
executed.

· When (if) the supervisor attempts to send a notification
response, the SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation will
fail with the ENOENT error.

In this scenario, the kernel will restart the target's system
call. Consequently, the supervisor will receive another user-
space notification. Thus, depending on how many times the blocked
system call is interrupted by a signal handler, the supervisor may
receive multiple notifications for the same system call in the
target.

One oddity is that system call restarting as described in this
scenario will occur even for the blocking system calls listed in
signal(7) that would never normally be restarted by the SA_RESTART
flag.

Does that seem okay?

In addition, I've queued a cross-reference in signal(7):

In certain circumstances, the seccomp(2) user-space notifi‐
cation feature can lead to restarting of system calls that
would otherwise never be restarted by SA_RESTART; for
details, see seccomp_user_notif(2).

> [...]
>>>>>> if (s == 0) {
>>>>>> fprintf(stderr, "\tS: read() of /proc/PID/mem "
>>>>>> "returned 0 (EOF)\n");
>>>>>> exit(EXIT_FAILURE);
>>>>>> }
>>>>>>
>>>>>> if (close(procMemFd) == -1)
>>>>>> errExit("close-/proc/PID/mem");
>>>>>
>>>>> We should probably make sure here that the value we read is actually
>>>>> NUL-terminated?
>>>>
>>>> So, I was curious about that point also. But, (why) are we not
>>>> guaranteed that it will be NUL-terminated?
>>>
>>> Because it's random memory filled by another process, which we don't
>>> necessarily trust. While seccomp notifiers aren't usable for applying
>>> *extra* security restrictions, the supervisor will still often be more
>>> privileged than the supervised process.
>>
>> D'oh! Yes, I see that I failed my Security Engineering 101 exam.
>>
>> How about:
>>
>> /* We have no guarantees about what was in the memory of the target
>>    process. Therefore, we ensure that 'path' is null-terminated. Such
>>    precautions are particularly important in cases where (as is
>>    common) the supervisor is running at a higher privilege level
>>    than the target. */
>>
>> // 'len' is size of buffer; 's' is return value from pread()
>> int zeroIdx = len - 1;
>> if (s < zeroIdx)
>>     zeroIdx = s;
>> path[zeroIdx] = '\0';
>>
>> Or just simply:
>>
>> path[len - 1] = '\0';
>>
>> ?
>
> I'd either do "path[s-1] = '\0'" or bail out if "path[s - 1] != '\0'".
> Especially if we haven't NUL-terminated the buffer before reading into
> it, we shouldn't write a nullbyte to path[len - 1], since the bytes in
> front of that will stay uninitialized.

I realized, by the way, that I made a thinko. In the usual case,
read(fd, buf, PATH_MAX) will return PATH_MAX bytes that include
trailing garbage after the pathname. So the right check, I think, is
to scan from the start of the buffer to see if there's a NUL, and
error out if there is not; that's how I have modified the example
program.
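
The check described above can be sketched roughly as follows. This is not the exact code from the example program; the helper name path_is_terminated() is invented here, and 'nread' stands for the return value of the read()/pread() from /proc/PID/mem.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>

/* Return true if a NUL byte occurs within the first 'nread' bytes of
   'buf', that is, within the part of the buffer that was actually
   filled from the target's memory. Bytes past 'nread' are never
   examined, since they were not initialized by the read. */
static bool
path_is_terminated(const char *buf, ssize_t nread)
{
    assert(buf != NULL);

    if (nread <= 0)             /* Error or EOF: nothing to trust */
        return false;

    return memchr(buf, '\0', (size_t) nread) != NULL;
}
```

If this returns false, the supervisor should refuse to treat the buffer as a pathname; if it returns true, the string can be used as is, and any garbage after the first NUL is simply ignored.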

> (Oh, by the way: In general, reading path buffers like this (with the
> read potentially going beyond the end of the actual buffer) can
> have... interesting interactions with userfaultfd. If the path is
> stored in one page, starting at a non-zero offset inside the page, our
> read will always overlap into the second page. That second page might
> belong to a completely different VMA. If that VMA has a userfaultfd
> handler, we'll take a userfaultfd fault and wait for the userfaultfd
> handler to service the fault. Normally that's fine-ish; but if the
> target thread is supposed to *be* the thread handling userfaultfd
> faults in its process (and it never intentionally accesses any
> userfaultfd regions, only other threads do that), userspace will
> deadlock, because the thread waiting for userfaultfd fault resolution
> is the same one that's blocked on the userfaultfd. But this is not
> special to seccomp; there are syscalls that do the same thing,
> although their over-reads are typically smaller. E.g.
> do_strncpy_from_user() over-reads by up to 7 bytes. But when this came
> up in a discussion with Linus Torvalds, he said it was a theoretical
> concern; so I guess if the kernel seems fine with doing that in
> practice, we probably don't care too much here either.)

Thanks for the background info. Indeed there are some
bizarre corner cases...
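
To make the over-read arithmetic concrete, here is a small sketch that tests whether a read of 'len' bytes starting at 'addr' touches more than one page. The function name read_crosses_page() is made up, and the figures in the note below assume 4096-byte pages and PATH_MAX equal to 4096, which is typical on Linux but not guaranteed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return true if the byte range [addr, addr + len) spans more than
   one page. 'page_size' must be a power of two. Address wraparound
   is not handled in this sketch. */
static bool
read_crosses_page(uintptr_t addr, size_t len, size_t page_size)
{
    uintptr_t mask, first_page, last_page;

    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    if (len == 0)
        return false;

    mask = ~(uintptr_t) (page_size - 1);
    first_page = addr & mask;
    last_page = (addr + len - 1) & mask;

    return first_page != last_page;
}
```

Under those assumptions, read_crosses_page(addr, 4096, 4096) is false only when 'addr' is page-aligned, which matches the observation that a PATH_MAX-sized read of a path starting at a non-zero page offset always runs into the following page. The same function shows why an 8-byte word read crosses a page only when it starts in the last 7 bytes of one.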

[...]

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Subject: Re: For review: seccomp_user_notif(2) manual page

Hi Jann,

On 10/1/20 4:14 AM, Jann Horn wrote:
> On Thu, Oct 1, 2020 at 3:52 AM Jann Horn <[email protected]> wrote:
>> On Thu, Oct 1, 2020 at 1:25 AM Tycho Andersen <[email protected]> wrote:
>>> On Thu, Oct 01, 2020 at 01:11:33AM +0200, Jann Horn wrote:
>>>> On Thu, Oct 1, 2020 at 1:03 AM Tycho Andersen <[email protected]> wrote:
>>>>> On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
>>>>>> On 9/30/20 5:03 PM, Tycho Andersen wrote:
>>>>>>> On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
>>>>>>>> ┌─────────────────────────────────────────────────────┐
>>>>>>>> │FIXME │
>>>>>>>> ├─────────────────────────────────────────────────────┤
>>>>>>>> │From my experiments, it appears that if a SEC‐ │
>>>>>>>> │COMP_IOCTL_NOTIF_RECV is done after the target │
>>>>>>>> │process terminates, then the ioctl() simply blocks │
>>>>>>>> │(rather than returning an error to indicate that the │
>>>>>>>> │target process no longer exists). │
>>>>>>>
>>>>>>> Yeah, I think Christian wanted to fix this at some point,
>>>>>>
>>>>>> Do you have a pointer to that discussion? I could not find it with a
>>>>>> quick search.
>>>>>>
>>>>>>> but it's a
>>>>>>> bit sticky to do.
>>>>>>
>>>>>> Can you say a few words about the nature of the problem?
>>>>>
>>>>> I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
>>>>> notify about unused filter"). So maybe there's a bug here?
>>>>
>>>> That thing only notifies on ->poll, it doesn't unblock ioctls; and
>>>> Michael's sample code uses SECCOMP_IOCTL_NOTIF_RECV to wait. So that
>>>> commit doesn't have any effect on this kind of usage.
>>>
>>> Yes, thanks. And the ones stuck in RECV are waiting on a semaphore so
>>> we don't have a count of all of them, unfortunately.
>>>
>>> We could maybe look inside the wait_list, but that will probably make
>>> people angry :)
>>
>> The easiest way would probably be to open-code the semaphore-ish part,
>> and let the semaphore and poll share the waitqueue. The current code
>> kind of mirrors the semaphore's waitqueue in the wqh - open-coding the
>> entire semaphore would IMO be cleaner than that. And it's not like
>> semaphore semantics are even a good fit for this code anyway.
>>
>> Let's see... if we didn't have the existing UAPI to worry about, I'd
>> do it as follows (*completely* untested). That way, the ioctl would
>> block exactly until either there actually is a request to deliver or
>> there are no more users of the filter. The problem is that if we just
>> apply this patch, existing users of SECCOMP_IOCTL_NOTIF_RECV that use
>> an event loop and don't set O_NONBLOCK will be screwed. So we'd
>> probably also have to add some stupid counter in place of the
>> semaphore's counter that we can use to preserve the old behavior of
>> returning -ENOENT once for each cancelled request. :(
>>
>> I guess this is a nice point in favor of Michael's usual complaint
>> that if there are no man pages for a feature by the time the feature
>> lands upstream, there's a higher chance that the UAPI will suck
>> forever...
>
> And I guess this would be the UAPI-compatible version - not actually
> as terrible as I thought it might be. Do y'all want this? If so, feel
> free to either turn this into a proper patch with Co-developed-by, or
> tell me that I should do it and I'll try to get around to turning it
> into something proper.

Thanks for taking a shot at this.

I tried applying the patch below to vanilla 5.9.0.
(There's one typo: s/ENOTCON/ENOTCONN).

It seems not to work though; when I send a signal to my test
target process that is sleeping waiting for the notification
response, the process enters the uninterruptible D state.
Any thoughts?

Thanks,

Michael

> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 676d4af62103..d08c453fcc2c 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -138,7 +138,7 @@ struct seccomp_kaddfd {
> * @notifications: A list of struct seccomp_knotif elements.
> */
> struct notification {
> - struct semaphore request;
> + bool canceled_reqs;
> u64 next_id;
> struct list_head notifications;
> };
> @@ -859,7 +859,6 @@ static int seccomp_do_user_notification(int this_syscall,
> list_add(&n.list, &match->notif->notifications);
> INIT_LIST_HEAD(&n.addfd);
>
> - up(&match->notif->request);
> wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);
> mutex_unlock(&match->notify_lock);
>
> @@ -901,8 +900,20 @@ static int seccomp_do_user_notification(int this_syscall,
> * *reattach* to a notifier right now. If one is added, we'll need to
> * keep track of the notif itself and make sure they match here.
> */
> - if (match->notif)
> + if (match->notif) {
> list_del(&n.list);
> +
> + /*
> + * We are stuck with a UAPI that requires that after a spurious
> + * wakeup, SECCOMP_IOCTL_NOTIF_RECV must return immediately.
> + * This is the tracking for that, keeping track of whether we
> + * canceled a request after waking waiters, but before userspace
> + * picked up the notification.
> + */
> + if (n.state == SECCOMP_NOTIFY_INIT)
> + match->notif->canceled_reqs = true;
> + }
> +
> out:
> mutex_unlock(&match->notify_lock);
>
> @@ -1178,6 +1189,7 @@ static long seccomp_notify_recv(struct
> seccomp_filter *filter,
> void __user *buf)
> {
> struct seccomp_knotif *knotif = NULL, *cur;
> + DECLARE_WAITQUEUE(wait, current);
> struct seccomp_notif unotif;
> ssize_t ret;
>
> @@ -1190,11 +1202,9 @@ static long seccomp_notify_recv(struct
> seccomp_filter *filter,
>
> memset(&unotif, 0, sizeof(unotif));
>
> - ret = down_interruptible(&filter->notif->request);
> - if (ret < 0)
> - return ret;
> -
> mutex_lock(&filter->notify_lock);
> +
> +retry:
> list_for_each_entry(cur, &filter->notif->notifications, list) {
> if (cur->state == SECCOMP_NOTIFY_INIT) {
> knotif = cur;
> @@ -1202,14 +1212,32 @@ static long seccomp_notify_recv(struct
> seccomp_filter *filter,
> }
> }
>
> - /*
> - * If we didn't find a notification, it could be that the task was
> - * interrupted by a fatal signal between the time we were woken and
> - * when we were able to acquire the rw lock.
> - */
> if (!knotif) {
> - ret = -ENOENT;
> - goto out;
> + /* This has to happen before checking &filter->users. */
> + prepare_to_wait(&filter->wqh, &wait, TASK_INTERRUPTIBLE);
> +
> + /*
> + * If all users of the filter are gone, throw an error instead
> + * of pointlessly continuing to block.
> + */
> + if (refcount_read(&filter->users) == 0) {
> + ret = -ENOTCON;
> + goto out;
> + }
> + if (filter->notif->canceled_reqs) {
> + ret = -ENOENT;
> + goto out;
> + } else {
> + /* No notifications pending - wait for one,
> then retry. */
> + mutex_unlock(&filter->notify_lock);
> + schedule();
> + mutex_lock(&filter->notify_lock);
> + if (signal_pending(current)) {
> + ret = -EINTR;
> + goto out;
> + }
> + goto retry;
> + }
> }
>
> unotif.id = knotif->id;
> @@ -1220,6 +1248,8 @@ static long seccomp_notify_recv(struct
> seccomp_filter *filter,
> wake_up_poll(&filter->wqh, EPOLLOUT | EPOLLWRNORM);
> ret = 0;
> out:
> + filter->notif->canceled_reqs = false;
> + finish_wait(&filter->wqh, &wait);
> mutex_unlock(&filter->notify_lock);
>
> if (ret == 0 && copy_to_user(buf, &unotif, sizeof(unotif))) {
> @@ -1233,10 +1263,8 @@ static long seccomp_notify_recv(struct
> seccomp_filter *filter,
> */
> mutex_lock(&filter->notify_lock);
> knotif = find_notification(filter, unotif.id);
> - if (knotif) {
> + if (knotif)
> knotif->state = SECCOMP_NOTIFY_INIT;
> - up(&filter->notif->request);
> - }
> mutex_unlock(&filter->notify_lock);
> }
>
> @@ -1485,7 +1513,6 @@ static struct file *init_listener(struct
> seccomp_filter *filter)
> if (!filter->notif)
> goto out;
>
> - sema_init(&filter->notif->request, 0);
> filter->notif->next_id = get_random_u64();
> INIT_LIST_HEAD(&filter->notif->notifications);
>



2020-10-26 03:07:26

by Kees Cook

Subject: Re: For review: seccomp_user_notif(2) manual page

On Thu, Oct 01, 2020 at 03:52:02AM +0200, Jann Horn wrote:
> On Thu, Oct 1, 2020 at 1:25 AM Tycho Andersen <[email protected]> wrote:
> > On Thu, Oct 01, 2020 at 01:11:33AM +0200, Jann Horn wrote:
> > > On Thu, Oct 1, 2020 at 1:03 AM Tycho Andersen <[email protected]> wrote:
> > > > On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
> > > > > On 9/30/20 5:03 PM, Tycho Andersen wrote:
> > > > > > On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> > > > > >> ┌─────────────────────────────────────────────────────┐
> > > > > >> │FIXME │
> > > > > >> ├─────────────────────────────────────────────────────┤
> > > > > >> │From my experiments, it appears that if a SEC‐ │
> > > > > >> │COMP_IOCTL_NOTIF_RECV is done after the target │
> > > > > >> │process terminates, then the ioctl() simply blocks │
> > > > > >> │(rather than returning an error to indicate that the │
> > > > > >> │target process no longer exists). │
> > > > > >
> > > > > > Yeah, I think Christian wanted to fix this at some point,
> > > > >
> > > > > Do you have a pointer to that discussion? I could not find it with a
> > > > > quick search.
> > > > >
> > > > > > but it's a
> > > > > > bit sticky to do.
> > > > >
> > > > > Can you say a few words about the nature of the problem?
> > > >
> > > > I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
> > > > notify about unused filter"). So maybe there's a bug here?
> > >
> > > That thing only notifies on ->poll, it doesn't unblock ioctls; and
> > > Michael's sample code uses SECCOMP_IOCTL_NOTIF_RECV to wait. So that
> > > commit doesn't have any effect on this kind of usage.
> >
> > Yes, thanks. And the ones stuck in RECV are waiting on a semaphore so
> > we don't have a count of all of them, unfortunately.
> >
> > We could maybe look inside the wait_list, but that will probably make
> > people angry :)
>
> The easiest way would probably be to open-code the semaphore-ish part,
> and let the semaphore and poll share the waitqueue. The current code
> kind of mirrors the semaphore's waitqueue in the wqh - open-coding the
> entire semaphore would IMO be cleaner than that. And it's not like
> semaphore semantics are even a good fit for this code anyway.
>
> Let's see... if we didn't have the existing UAPI to worry about, I'd
> do it as follows (*completely* untested). That way, the ioctl would
> block exactly until either there actually is a request to deliver or
> there are no more users of the filter. The problem is that if we just
> apply this patch, existing users of SECCOMP_IOCTL_NOTIF_RECV that use
> an event loop and don't set O_NONBLOCK will be screwed. So we'd

Wait, why? Do you mean an ioctl-calling loop (rather than a poll event
loop)? I think poll would be fine, but a "try calling RECV and expect to
return ENOENT" loop would change. But I don't think anyone would do this
exactly because it _currently_ acts like O_NONBLOCK, yes?
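
For comparison, a poll-style wait might look roughly like the sketch below. It works on any pollable file descriptor; the assumption (from my reading of 99cdb8b9a573, not verified here) is that the seccomp listener fd reports a hang-up condition once the filter has no more users, which a supervisor blocked directly in SECCOMP_IOCTL_NOTIF_RECV never sees. The names wait_for_event() and enum wait_result are invented for this example, and <unistd.h> is included only for the pipe-based demonstration mentioned afterwards.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch. */
enum wait_result {
    WAIT_READY   = 1,   /* Something is readable: go and do RECV */
    WAIT_GONE    = 0,   /* Hang-up or error condition on the fd */
    WAIT_TIMEOUT = 2,   /* Nothing happened within 'timeout_ms' */
    WAIT_ERROR   = -1   /* poll() itself failed; errno is set */
};

/* Wait until 'fd' becomes readable or reports a hang-up, retrying if
   poll() is interrupted by a signal handler. */
static enum wait_result
wait_for_event(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n;

    assert(fd >= 0);

    do {
        n = poll(&pfd, 1, timeout_ms);
    } while (n == -1 && errno == EINTR);

    if (n == -1)
        return WAIT_ERROR;
    if (n == 0)
        return WAIT_TIMEOUT;
    if (pfd.revents & POLLIN)
        return WAIT_READY;
    if (pfd.revents & (POLLHUP | POLLERR | POLLNVAL))
        return WAIT_GONE;

    return WAIT_ERROR;
}
```

The behavior is easy to try with a pipe standing in for the listener fd: an empty pipe gives WAIT_TIMEOUT, writing a byte gives WAIT_READY, and closing the write end of a drained pipe gives WAIT_GONE.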

> probably also have to add some stupid counter in place of the
> semaphore's counter that we can use to preserve the old behavior of
> returning -ENOENT once for each cancelled request. :(

I only see this in Debian Code Search:
https://sources.debian.org/src/crun/0.15+dfsg-1/src/libcrun/seccomp_notify.c/?hl=166#L166
which is using epoll_wait():
https://sources.debian.org/src/crun/0.15+dfsg-1/src/libcrun/container.c/?hl=1326#L1326

I expect LXC is using it. :)

Let's change it ASAP! ;)

-Kees

>
> I guess this is a nice point in favor of Michael's usual complaint
> that if there are no man pages for a feature by the time the feature
> lands upstream, there's a higher chance that the UAPI will suck
> forever...
>
>
>
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 676d4af62103..f0f4c68e0bc6 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -138,7 +138,6 @@ struct seccomp_kaddfd {
> * @notifications: A list of struct seccomp_knotif elements.
> */
> struct notification {
> - struct semaphore request;
> u64 next_id;
> struct list_head notifications;
> };
> @@ -859,7 +858,6 @@ static int seccomp_do_user_notification(int this_syscall,
> list_add(&n.list, &match->notif->notifications);
> INIT_LIST_HEAD(&n.addfd);
>
> - up(&match->notif->request);
> wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);
> mutex_unlock(&match->notify_lock);
>
> @@ -1175,9 +1173,10 @@ find_notification(struct seccomp_filter *filter, u64 id)
>
>
> static long seccomp_notify_recv(struct seccomp_filter *filter,
> - void __user *buf)
> + void __user *buf, bool blocking)
> {
> struct seccomp_knotif *knotif = NULL, *cur;
> + DECLARE_WAITQUEUE(wait, current);
> struct seccomp_notif unotif;
> ssize_t ret;
>
> @@ -1190,11 +1189,9 @@ static long seccomp_notify_recv(struct
> seccomp_filter *filter,
>
> memset(&unotif, 0, sizeof(unotif));
>
> - ret = down_interruptible(&filter->notif->request);
> - if (ret < 0)
> - return ret;
> -
> mutex_lock(&filter->notify_lock);
> +
> +retry:
> list_for_each_entry(cur, &filter->notif->notifications, list) {
> if (cur->state == SECCOMP_NOTIFY_INIT) {
> knotif = cur;
> @@ -1202,14 +1199,32 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
> }
> }
>
> - /*
> - * If we didn't find a notification, it could be that the task was
> - * interrupted by a fatal signal between the time we were woken and
> - * when we were able to acquire the rw lock.
> - */
> if (!knotif) {
> - ret = -ENOENT;
> - goto out;
> + /* This has to happen before checking &filter->users. */
> + prepare_to_wait(&filter->wqh, &wait, TASK_INTERRUPTIBLE);
> +
> + /*
> + * If all users of the filter are gone, throw an error instead
> + * of pointlessly continuing to block.
> + */
> + if (refcount_read(&filter->users) == 0) {
> + ret = -ENOTCONN;
> + goto out;
> + }
> + if (blocking) {
> + /* No notifications pending - wait for one, then retry. */
> + mutex_unlock(&filter->notify_lock);
> + schedule();
> + mutex_lock(&filter->notify_lock);
> + if (signal_pending(current)) {
> + ret = -EINTR;
> + goto out;
> + }
> + goto retry;
> + } else {
> + ret = -ENOENT;
> + goto out;
> + }
> }
>
> unotif.id = knotif->id;
> @@ -1220,6 +1235,7 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
> wake_up_poll(&filter->wqh, EPOLLOUT | EPOLLWRNORM);
> ret = 0;
> out:
> + finish_wait(&filter->wqh, &wait);
> mutex_unlock(&filter->notify_lock);
>
> if (ret == 0 && copy_to_user(buf, &unotif, sizeof(unotif))) {
> @@ -1233,10 +1249,8 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
> */
> mutex_lock(&filter->notify_lock);
> knotif = find_notification(filter, unotif.id);
> - if (knotif) {
> + if (knotif)
> knotif->state = SECCOMP_NOTIFY_INIT;
> - up(&filter->notif->request);
> - }
> mutex_unlock(&filter->notify_lock);
> }
>
> @@ -1412,11 +1426,12 @@ static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
> {
> struct seccomp_filter *filter = file->private_data;
> void __user *buf = (void __user *)arg;
> + bool blocking = !(file->f_flags & O_NONBLOCK);
>
> /* Fixed-size ioctls */
> switch (cmd) {
> case SECCOMP_IOCTL_NOTIF_RECV:
> - return seccomp_notify_recv(filter, buf);
> + return seccomp_notify_recv(filter, buf, blocking);
> case SECCOMP_IOCTL_NOTIF_SEND:
> return seccomp_notify_send(filter, buf);
> case SECCOMP_IOCTL_NOTIF_ID_VALID_WRONG_DIR:
> @@ -1485,7 +1500,6 @@ static struct file *init_listener(struct seccomp_filter *filter)
> if (!filter->notif)
> goto out;
>
> - sema_init(&filter->notif->request, 0);
> filter->notif->next_id = get_random_u64();
> INIT_LIST_HEAD(&filter->notif->notifications);

--
Kees Cook

2020-10-26 04:15:47

by Kees Cook

Subject: Re: For review: seccomp_user_notif(2) manual page

On Thu, Oct 15, 2020 at 01:24:03PM +0200, Michael Kerrisk (man-pages) wrote:
> On 10/1/20 1:39 AM, Kees Cook wrote:
> > I'll comment more later, but I've run out of time today and I didn't see
> > anyone mention this detail yet in the existing threads... :)
>
> Later never came :-). But, I hope you may have comments for the
> next draft, which I will send out soon.

Later is now, and Soon approaches!

I finally caught up and read through this whole thread. Thank you all
for the bug fix[1], and I'm looking forward to more[2]. :)

For my reply I figured I'd base it on the current draft, so here's a
simulated quote based on the seccomp_user_notif branch of
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git
through commit 71101158fe330af5a26552447a0bb433b69e15b7
$ COLUMNS=75 man --nh --nj man2/seccomp_user_notif.2 | sed 's/^/> /'

On Sun, Oct 25, 2020 at 01:54:05PM +0100, Michael Kerrisk (man-pages) wrote:
> SECCOMP_USER_NOTIF(2) Linux Programmer's Manual SECCOMP_USER_NOTIF(2)
>
> NAME
> seccomp_user_notif - Seccomp user-space notification mechanism
>
> SYNOPSIS
> #include <linux/seccomp.h>
> #include <linux/filter.h>
> #include <linux/audit.h>
>
> int seccomp(unsigned int operation, unsigned int flags, void *args);
>
> #include <sys/ioctl.h>
>
> int ioctl(int fd, SECCOMP_IOCTL_NOTIF_RECV,
> struct seccomp_notif *req);
> int ioctl(int fd, SECCOMP_IOCTL_NOTIF_SEND,
> struct seccomp_notif_resp *resp);
> int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ID_VALID, __u64 *id);
>
> DESCRIPTION
> This page describes the user-space notification mechanism provided
> by the Secure Computing (seccomp) facility. As well as the use of
> the SECCOMP_FILTER_FLAG_NEW_LISTENER flag, the
> SECCOMP_RET_USER_NOTIF action value, and the
> SECCOMP_GET_NOTIF_SIZES operation described in seccomp(2), this
> mechanism involves the use of a number of related ioctl(2)
> operations (described below).
>
> Overview
> In conventional usage of a seccomp filter, the decision about how
> to treat a system call is made by the filter itself. By contrast,
> the user-space notification mechanism allows the seccomp filter to
> delegate the handling of the system call to another user-space
> process. Note that this mechanism is explicitly not intended as a
> method for implementing security policy; see NOTES.
>
> In the discussion that follows, the thread(s) on which the seccomp
> filter is installed is (are) referred to as the target, and the
> process that is notified by the user-space notification mechanism
> is referred to as the supervisor.
>
> A suitably privileged supervisor can use the user-space
> notification mechanism to perform actions on behalf of the target.
> The advantage of the user-space notification mechanism is that the
> supervisor will usually be able to retrieve information about the
> target and the performed system call that the seccomp filter
> itself cannot. (A seccomp filter is limited in the information it
> can obtain and the actions that it can perform because it is
> running on a virtual machine inside the kernel.)
>
> An overview of the steps performed by the target and the
> supervisor is as follows:
>
> 1. The target establishes a seccomp filter in the usual manner,
> but with two differences:
>
> • The seccomp(2) flags argument includes the flag
> SECCOMP_FILTER_FLAG_NEW_LISTENER. Consequently, the return
> value of the (successful) seccomp(2) call is a new

nit: extra space

> "listening" file descriptor that can be used to receive
> notifications. Only one "listening" seccomp filter can be
> installed for a thread.

I like this limitation, but I expect that it'll need to change in the
future. Even with LSMs, we see the need for arbitrary stacking, and the
idea of there being only 1 supervisor will eventually break down. Right
now there is only 1 because only container managers are using this
feature. But if some daemon starts using it to isolate some thread,
suddenly it might break if a container manager is trying to listen to it
too, etc. I expect it won't be needed soon, but I do think it'll change.

>
> • In cases where it is appropriate, the seccomp filter returns
> the action value SECCOMP_RET_USER_NOTIF. This return value
> will trigger a notification event.
>
> 2. In order that the supervisor can obtain notifications using the
> listening file descriptor, (a duplicate of) that file
> descriptor must be passed from the target to the supervisor.

Yet another reason to have an "activate on exec" mode for seccomp. With
no_new_privs _not_ being delayed in such a way, I think it'd be safe to
add. The supervisor would get the fd immediately, and then once it
fork/execed suddenly the whole thing would activate, and no fd passing
needed.

The "on exec" boundary is really only needed for oblivious targets. For
a coordinated target, I've thought it might be nice to have an arbitrary
"go" point, where the target could call seccomp() with something like a
SECCOMP_ACTIVATE_DELAYED_FILTERS operation. This lets any process
initialization happen that might need to do things that would be blocked
by filters, etc.

Before:

fork
install some filters that don't block initialization
exec
do some initialization
install more filters, maybe block exec, seccomp
run

After:

fork
install delayed filters
exec
do some initialization
activate delayed filters
run

In practice, the two-stage filter application has been fine, if
sometimes a bit complex (e.g. for user_notif, "do some initialization"
includes figuring out how to pass the fd back to the supervisor, etc).

> One way in which this could be done is by passing the file
> descriptor over a UNIX domain socket connection between the
> target and the supervisor (using the SCM_RIGHTS ancillary
> message type described in unix(7)).
>
> 3. The supervisor will receive notification events on the
> listening file descriptor. These events are returned as
> structures of type seccomp_notif. Because this structure and
> its size may evolve over kernel versions, the supervisor must
> first determine the size of this structure using the seccomp(2)
> SECCOMP_GET_NOTIF_SIZES operation, which returns a structure of
> type seccomp_notif_sizes. The supervisor allocates a buffer of
> size seccomp_notif_sizes.seccomp_notif bytes to receive
> notification events. In addition, the supervisor allocates
> another buffer of size seccomp_notif_sizes.seccomp_notif_resp
> bytes for the response (a struct seccomp_notif_resp structure)
> that it will provide to the kernel (and thus the target).
>
> 4. The target then performs its workload, which includes system
> calls that will be controlled by the seccomp filter. Whenever
> one of these system calls causes the filter to return the
> SECCOMP_RET_USER_NOTIF action value, the kernel does not (yet)
> execute the system call; instead, execution of the target is
> temporarily blocked inside the kernel (in a sleep state that is
> interruptible by signals) and a notification event is generated
> on the listening file descriptor.
>
> 5. The supervisor can now repeatedly monitor the listening file
> descriptor for SECCOMP_RET_USER_NOTIF-triggered events. To do
> this, the supervisor uses the SECCOMP_IOCTL_NOTIF_RECV ioctl(2)
> operation to read information about a notification event; this
> operation blocks until an event is available. The operation
> returns a seccomp_notif structure containing information about
> the system call that is being attempted by the target.
>
> 6. The seccomp_notif structure returned by the
> SECCOMP_IOCTL_NOTIF_RECV operation includes the same
> information (a seccomp_data structure) that was passed to the
> seccomp filter. This information allows the supervisor to
> discover the system call number and the arguments for the
> target's system call. In addition, the notification event
> contains the ID of the thread that triggered the notification.

Should "cookie" be at least named here, just to provide a bit more
context for when it is mentioned in 8 below? E.g.:

... In addition, the notification event
contains the triggering thread's ID and a unique cookie to be
used in subsequent SECCOMP_IOCTL_NOTIF_ID_VALID and
SECCOMP_IOCTL_NOTIF_SEND operations.

>
> The information in the notification can be used to discover the
> values of pointer arguments for the target's system call.
> (This is something that can't be done from within a seccomp
> filter.) One way in which the supervisor can do this is to
> open the corresponding /proc/[tid]/mem file (see proc(5)) and
> read bytes from the location that corresponds to one of the
> pointer arguments whose value is supplied in the notification
> event. (The supervisor must be careful to avoid a race
> condition that can occur when doing this; see the description
> of the SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.)
> In addition, the supervisor can access other system information
> that is visible in user space but which is not accessible from
> a seccomp filter.
>
> 7. Having obtained information as per the previous step, the
> supervisor may then choose to perform an action in response to
> the target's system call (which, as noted above, is not
> executed when the seccomp filter returns the
> SECCOMP_RET_USER_NOTIF action value).
>
> One example use case here relates to containers. The target
> may be located inside a container where it does not have
> sufficient capabilities to mount a filesystem in the
> container's mount namespace. However, the supervisor may be a
> more privileged process that does have sufficient capabilities
> to perform the mount operation.
>
> 8. The supervisor then sends a response to the notification. The
> information in this response is used by the kernel to construct
> a return value for the target's system call and provide a value
> that will be assigned to the errno variable of the target.
>
> The response is sent using the SECCOMP_IOCTL_NOTIF_SEND
> ioctl(2) operation, which is used to transmit a
> seccomp_notif_resp structure to the kernel. This structure
> includes a cookie value that the supervisor obtained in the
> seccomp_notif structure returned by the
> SECCOMP_IOCTL_NOTIF_RECV operation. This cookie value allows
> the kernel to associate the response with the target.

Describing where the cookie came from seems like it should live in 6
above. A reader would have to take this new info and figure out where
SECCOMP_IOCTL_NOTIF_RECV was described and piece it together. With the
suggestion to 6 above, maybe:

... This structure
must include the cookie value that the supervisor obtained in
the seccomp_notif structure returned by the
SECCOMP_IOCTL_NOTIF_RECV operation, which allows the kernel
to associate the response with the target.

>
> 9. Once the notification has been sent, the system call in the
> target thread unblocks, returning the information that was
> provided by the supervisor in the notification response.
>
> As a variation on the last two steps, the supervisor can send a
> response that tells the kernel that it should execute the target
> thread's system call; see the discussion of
> SECCOMP_USER_NOTIF_FLAG_CONTINUE, below.
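
Since step 2 above only gestures at SCM_RIGHTS, here is a rough sketch of what the fd passing could look like, assuming the target and supervisor already share a connected UNIX domain socket. The helper names send_fd() and recv_fd() are mine and not part of any seccomp API; the pattern follows cmsg(3) and unix(7).

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send descriptor fd over the connected UNIX domain socket sock.
   Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
    char dummy = 'x';           /* at least one byte of real data is sent */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {                     /* union guarantees cmsghdr alignment */
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctl;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;

    memset(&ctl, 0, sizeof(ctl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from sock.
   Returns the new descriptor, or -1 on error. */
int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctl;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;
    int fd;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    /* Accept only a single SCM_RIGHTS message carrying exactly one fd. */
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The target would call send_fd() with the descriptor returned by seccomp(2), and the supervisor would use the result of recv_fd() as the listening descriptor in the ioctl(2) operations.
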
>
> ioctl(2) operations
> The following ioctl(2) operations are provided to support seccomp
> user-space notification. For each of these operations, the first
> (file descriptor) argument of ioctl(2) is the listening file
> descriptor returned by a call to seccomp(2) with the
> SECCOMP_FILTER_FLAG_NEW_LISTENER flag.
>
> SECCOMP_IOCTL_NOTIF_RECV
> This operation is used to obtain a user-space notification
> event. If no such event is currently pending, the
> operation blocks until an event occurs. The third ioctl(2)
> argument is a pointer to a structure of the following form
> which contains information about the event. This structure
> must be zeroed out before the call.
>
> struct seccomp_notif {
> __u64 id; /* Cookie */
> __u32 pid; /* TID of target thread */

Should we rename this variable from pid to tid? Yes it's UAPI, but yay for
anonymous unions:

struct seccomp_notif {
__u64 id; /* Cookie */
union {
__u32 pid;
__u32 tid; /* TID of target thread */
};
__u32 flags; /* Currently unused (0) */
struct seccomp_data data; /* See seccomp(2) */
};

> __u32 flags; /* Currently unused (0) */
> struct seccomp_data data; /* See seccomp(2) */
> };
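
A quick sanity check of the union idea, as a sketch only: the two structs below are simplified stand-ins that I made up for the current and proposed layouts (struct seccomp_data is left out, and the names notif_current and notif_proposed are hypothetical). The point is just that an anonymous union should leave the size and the field offsets untouched.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the current layout (seccomp_data omitted). */
struct notif_current {
    uint64_t id;
    uint32_t pid;
    uint32_t flags;
};

/* Stand-in for the proposed layout with pid/tid aliased. */
struct notif_proposed {
    uint64_t id;
    union {
        uint32_t pid;
        uint32_t tid;           /* same storage as pid */
    };
    uint32_t flags;
};

/* Compile-time checks: the alias must not change the layout. */
_Static_assert(sizeof(struct notif_proposed) == sizeof(struct notif_current),
               "size must not change");
_Static_assert(offsetof(struct notif_proposed, tid) ==
               offsetof(struct notif_current, pid),
               "tid must alias pid");
_Static_assert(offsetof(struct notif_proposed, flags) ==
               offsetof(struct notif_current, flags),
               "flags must not move");

/* Code that fills in .pid stays compatible with code that reads .tid. */
uint32_t tid_seen_by_new_code(uint32_t pid_written_by_old_code)
{
    struct notif_proposed n = { 0 };

    n.pid = pid_written_by_old_code;
    return n.tid;
}
```
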
>
> The fields in this structure are as follows:
>
> id This is a cookie for the notification. Each such
> cookie is guaranteed to be unique for the
> corresponding seccomp filter.
>
> • It can be used with the
> SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation to
> verify that the target is still alive.
>
> • When returning a notification response to the
> kernel, the supervisor must include the cookie
> value in the seccomp_notif_resp structure that is
> specified as the argument of the
> SECCOMP_IOCTL_NOTIF_SEND operation.
>
> pid This is the thread ID of the target thread that
> triggered the notification event.
>
> flags This is a bit mask of flags providing further
> information on the event. In the current
> implementation, this field is always zero.
>
> data This is a seccomp_data structure containing
> information about the system call that triggered the
> notification. This is the same structure that is
> passed to the seccomp filter. See seccomp(2) for
> details of this structure.
>
> On success, this operation returns 0; on failure, -1 is
> returned, and errno is set to indicate the cause of the
> error. This operation can fail with the following errors:
>
> EINVAL (since Linux 5.5)
> The seccomp_notif structure that was passed to the
> call contained nonzero fields.
>
> ENOENT The target thread was killed by a signal as the
> notification information was being generated, or the
> target's (blocked) system call was interrupted by a
> signal handler.
>
> SECCOMP_IOCTL_NOTIF_ID_VALID
> This operation can be used to check that a notification ID
> returned by an earlier SECCOMP_IOCTL_NOTIF_RECV operation
> is still valid (i.e., that the target still exists).

Maybe clarify a bit more, since it's covering more than just "is the
target still alive", but also "is that syscall still waiting for a
response":

is still valid (i.e., that the target still exists and
the syscall is still blocked waiting for a response).


>
> The third ioctl(2) argument is a pointer to the cookie (id)
> returned by the SECCOMP_IOCTL_NOTIF_RECV operation.
>
> This operation is necessary to avoid race conditions that
> can occur when the pid returned by the
> SECCOMP_IOCTL_NOTIF_RECV operation terminates, and that
> process ID is reused by another process. An example of
> this kind of race is the following:
>
> 1. A notification is generated on the listening file
> descriptor. The returned seccomp_notif contains the TID
> of the target thread (in the pid field of the
> structure).
>
> 2. The target terminates.
>
> 3. Another thread or process is created on the system that
> by chance reuses the TID that was freed when the target
> terminated.
>
> 4. The supervisor open(2)s the /proc/[tid]/mem file for the
> TID obtained in step 1, with the intention of (say)
> inspecting the memory location(s) that contain the
> argument(s) of the system call that triggered the
> notification in step 1.
>
> In the above scenario, the risk is that the supervisor may
> try to access the memory of a process other than the
> target. This race can be avoided by following the call to
> open(2) with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to
> verify that the process that generated the notification is
> still alive. (Note that if the target terminates after the
> latter step, a subsequent read(2) from the file descriptor
> may return 0, indicating end of file.)
>
> On success (i.e., the notification ID is still valid), this
> operation returns 0. On failure (i.e., the notification ID
> is no longer valid), -1 is returned, and errno is set to
> ENOENT.
>
> SECCOMP_IOCTL_NOTIF_SEND
> This operation is used to send a notification response back
> to the kernel. The third ioctl(2) argument is a pointer to
> a structure of the following form:
>
> struct seccomp_notif_resp {
> __u64 id; /* Cookie value */
> __s64 val; /* Success return value */
> __s32 error; /* 0 (success) or negative
> error number */
> __u32 flags; /* See below */
> };
>
> The fields of this structure are as follows:
>
> id This is the cookie value that was obtained using the
> SECCOMP_IOCTL_NOTIF_RECV operation. This cookie
> value allows the kernel to correctly associate this
> response with the system call that triggered the
> user-space notification.
>
> val This is the value that will be used for a spoofed
> success return for the target's system call; see
> below.
>
> error This is the value that will be used as the error
> number (errno) for a spoofed error return for the
> target's system call; see below.
>
> flags This is a bit mask that includes zero or more of the
> following flags:
>
> SECCOMP_USER_NOTIF_FLAG_CONTINUE (since Linux 5.5)
> Tell the kernel to execute the target's
> system call.
>
> Two kinds of response are possible:
>
> • A response to the kernel telling it to execute the
> target's system call. In this case, the flags field
> includes SECCOMP_USER_NOTIF_FLAG_CONTINUE and the error
> and val fields must be zero.
>
> This kind of response can be useful in cases where the
> supervisor needs to do deeper analysis of the target's
> system call than is possible from a seccomp filter (e.g.,
> examining the values of pointer arguments), and, having
> decided that the system call does not require emulation
> by the supervisor, the supervisor wants the system call
> to be executed normally in the target.
>
> The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag should be used
> with caution; see NOTES.
>
> • A spoofed return value for the target's system call. In
> this case, the kernel does not execute the target's
> system call, instead causing the system call to return a
> spoofed value as specified by fields of the
> seccomp_notif_resp structure. The supervisor should set
> the fields of this structure as follows:
>
> + flags does not contain
> SECCOMP_USER_NOTIF_FLAG_CONTINUE.
>
> + error is set either to 0 for a spoofed "success"
> return or to a negative error number for a spoofed
> "failure" return. In the former case, the kernel
> causes the target's system call to return the value
> specified in the val field. In the latter case, the
> kernel causes the target's system call to return -1,
> and errno is assigned the negated error value.
>
> + val is set to a value that will be used as the return
> value for a spoofed "success" return for the target's
> system call. The value in this field is ignored if
> the error field contains a nonzero value.

Strictly speaking, this is architecture specific, but all architectures
do it this way. Should seccomp enforce val == 0 when error != 0?
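
To illustrate the two kinds of response, here is a small sketch of how a supervisor might fill in struct seccomp_notif_resp. The helper names are invented for this mail, and resp_is_well_formed() only mirrors the EINVAL rules as documented in the draft; it is not a copy of the kernel's checks. The error helper leaves val at 0, so it would stay valid even if seccomp started enforcing that.

```c
#include <linux/seccomp.h>
#include <string.h>

#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE
#define SECCOMP_USER_NOTIF_FLAG_CONTINUE (1UL << 0)  /* Linux 5.5+ headers */
#endif

/* "Let the kernel execute the system call": error and val stay zero. */
void resp_continue(struct seccomp_notif_resp *resp, __u64 id)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}

/* Spoofed success: the target's system call returns val. */
void resp_success(struct seccomp_notif_resp *resp, __u64 id, __s64 val)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->val = val;
}

/* Spoofed failure: err is a positive errno value such as EOPNOTSUPP. */
void resp_error(struct seccomp_notif_resp *resp, __u64 id, int err)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->error = -err;         /* the kernel expects a negative number */
}

/* Returns 1 if resp obeys the documented flag rules, 0 otherwise. */
int resp_is_well_formed(const struct seccomp_notif_resp *resp)
{
    if (resp->flags & ~SECCOMP_USER_NOTIF_FLAG_CONTINUE)
        return 0;               /* unknown flag bits */
    if ((resp->flags & SECCOMP_USER_NOTIF_FLAG_CONTINUE) &&
        (resp->error != 0 || resp->val != 0))
        return 0;               /* CONTINUE requires error == val == 0 */
    return 1;
}
```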

>
> On success, this operation returns 0; on failure, -1 is
> returned, and errno is set to indicate the cause of the
> error. This operation can fail with the following errors:
>
> EINPROGRESS
> A response to this notification has already been
> sent.
>
> EINVAL An invalid value was specified in the flags field.
>
> EINVAL The flags field contained
> SECCOMP_USER_NOTIF_FLAG_CONTINUE, and the error or
> val field was not zero.
>
> ENOENT The blocked system call in the target has been
> interrupted by a signal handler or the target has
> terminated.
>
> NOTES
> select()/poll()/epoll semantics
> The file descriptor returned when seccomp(2) is employed with the
> SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using
> poll(2), epoll(7), and select(2). These interfaces indicate that
> the file descriptor is ready as follows:
>
> • When a notification is pending, these interfaces indicate that
> the file descriptor is readable. Following such an indication,
> a subsequent SECCOMP_IOCTL_NOTIF_RECV ioctl(2) will not block,
> returning either information about a notification or else
> failing with the error EINTR if the target has been killed by a
> signal or its system call has been interrupted by a signal
> handler.
>
> • After the notification has been received (i.e., by the
> SECCOMP_IOCTL_NOTIF_RECV ioctl(2) operation), these interfaces
> indicate that the file descriptor is writable, meaning that a
> notification response can be sent using the
> SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation.
>
> • After the last thread using the filter has terminated and been
> reaped using waitpid(2) (or similar), the file descriptor
> indicates an end-of-file condition (readable in select(2);
> POLLHUP/EPOLLHUP in poll(2)/epoll_wait(2)).

I'll reply separately about the "ioctl() does not terminate when all
filters have terminated" case.
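
In the meantime, the three readiness cases quoted above could be turned into a dispatch decision along these lines. This is only a sketch: the enum and function names are made up, and giving POLLHUP priority over POLLIN is my own design choice (on the assumption that once every user of the filter is gone there is nothing left worth receiving).

```c
#include <poll.h>

enum listener_state {
    LISTENER_IDLE,          /* nothing to do yet */
    LISTENER_RECV_READY,    /* a notification is pending: issue RECV */
    LISTENER_SEND_READY,    /* a received notification awaits SEND */
    LISTENER_CLOSED         /* all threads using the filter are gone */
};

/* Map the revents reported by poll(2) to the next supervisor action. */
enum listener_state classify_revents(short revents)
{
    if (revents & (POLLHUP | POLLERR | POLLNVAL))
        return LISTENER_CLOSED;
    if (revents & POLLIN)
        return LISTENER_RECV_READY;
    if (revents & POLLOUT)
        return LISTENER_SEND_READY;
    return LISTENER_IDLE;
}
```

Checking for LISTENER_CLOSED before calling RECV is also one way a supervisor can avoid blocking forever after the target has gone away.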

>
> Design goals; use of SECCOMP_USER_NOTIF_FLAG_CONTINUE
> The intent of the user-space notification feature is to allow
> system calls to be performed on behalf of the target. The
> target's system call should either be handled by the supervisor or
> allowed to continue normally in the kernel (where standard
> security policies will be applied).
>
> Note well: this mechanism must not be used to make security policy
> decisions about the system call, which would be inherently race-
> prone for reasons described next.
>
> The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag must be used with
> caution. If set by the supervisor, the target's system call will
> continue. However, there is a time-of-check, time-of-use race
> here, since an attacker could exploit the interval of time where
> the target is blocked waiting on the "continue" response to do
> things such as rewriting the system call arguments.
>
> Note furthermore that a user-space notifier can be bypassed if the
> existing filters allow the use of seccomp(2) or prctl(2) to
> install a filter that returns an action value with a higher
> precedence than SECCOMP_RET_USER_NOTIF (see seccomp(2)).
>
> It should thus be absolutely clear that the seccomp user-space
> notification mechanism cannot be used to implement a security
> policy! It should only ever be used in scenarios where a more
> privileged process supervises the system calls of a lesser
> privileged target to get around kernel-enforced security
> restrictions when the supervisor deems this safe. In other words,
> in order to continue a system call, the supervisor should be sure
> that another security mechanism or the kernel itself will
> sufficiently block the system call if its arguments are rewritten
> to something unsafe.
>
> Interaction with SA_RESTART signal handlers
> Consider the following scenario:
>
> • The target process has used sigaction(2) to install a signal
> handler with the SA_RESTART flag.
>
> • The target has made a system call that triggered a seccomp user-
> space notification and the target is currently blocked until the
> supervisor sends a notification response.
>
> • A signal is delivered to the target and the signal handler is
> executed.
>
> • When (if) the supervisor attempts to send a notification
> response, the SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation will
> fail with the ENOENT error.
>
> In this scenario, the kernel will restart the target's system
> call. Consequently, the supervisor will receive another user-
> space notification. Thus, depending on how many times the blocked
> system call is interrupted by a signal handler, the supervisor may
> receive multiple notifications for the same system call in the

maybe "... for the same instance of a system call in the target." for
clarity?

> target.
>
> One oddity is that system call restarting as described in this
> scenario will occur even for the blocking system calls listed in
> signal(7) that would never normally be restarted by the SA_RESTART
> flag.

Does this need fixing? I imagine the correct behavior for this case
would be a response to _SEND of EINPROGRESS and the target would see
EINTR normally?

I mean, it's not like seccomp doesn't already expose weirdness with
syscall restarts. Not even arm64 compat agrees[3] with arm32 in this
regard. :(
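
Whatever we decide, a supervisor today has to cope with SEND failing with ENOENT in this scenario. A hedged sketch of how it might sort the documented errors follows; the enum and helper names are hypothetical, and treating ENOENT as "drop this notification and go back to RECV" is an assumption based on the draft's error list.

```c
#include <errno.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>

enum send_outcome {
    SEND_OK,            /* response delivered */
    SEND_STALE,         /* ENOENT: target interrupted or gone; drop it */
    SEND_DUPLICATE,     /* EINPROGRESS: a response was already sent */
    SEND_FAILED         /* anything else: a real error */
};

/* Sort an errno value from SECCOMP_IOCTL_NOTIF_SEND into an outcome. */
enum send_outcome classify_send_errno(int err)
{
    switch (err) {
    case ENOENT:
        return SEND_STALE;
    case EINPROGRESS:
        return SEND_DUPLICATE;
    default:
        return SEND_FAILED;
    }
}

/* Send resp on the listening descriptor and report what happened. */
enum send_outcome send_response(int notify_fd, struct seccomp_notif_resp *resp)
{
    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, resp) == 0)
        return SEND_OK;
    return classify_send_errno(errno);
}
```

On SEND_STALE the supervisor would simply loop back to RECV, where it may see a fresh notification for the restarted system call.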

> BUGS
> If a SECCOMP_IOCTL_NOTIF_RECV ioctl(2) operation is performed
> after the target terminates, then the ioctl(2) call simply blocks
> (rather than returning an error to indicate that the target no
> longer exists).

I want this fixed. It caused me no end of pain when building the
selftests, and ended up prompting me to implement a global test timeout
in kselftest. :P Before the usage counter refactor, there was no sane
way to deal with this, but now I think we're close[2]. I'll reply
separately about this.

>
> EXAMPLES
> The (somewhat contrived) program shown below demonstrates the use
> of the interfaces described in this page. The program creates a
> child process that serves as the "target" process. The child
> process installs a seccomp filter that returns the
> SECCOMP_RET_USER_NOTIF action value if a call is made to mkdir(2).
> The child process then calls mkdir(2) once for each of the
> supplied command-line arguments, and reports the result returned
> by the call. After processing all arguments, the child process
> terminates.
>
> The parent process acts as the supervisor, listening for the
> notifications that are generated when the target process calls
> mkdir(2). When such a notification occurs, the supervisor
> examines the memory of the target process (using /proc/[pid]/mem)
> to discover the pathname argument that was supplied to the
> mkdir(2) call, and performs one of the following actions:

I like this example! It's simple enough to be understandable and complex
enough to show the purpose of user_notif. :)

>
> • If the pathname begins with the prefix "/tmp/", then the
> supervisor attempts to create the specified directory, and then
> spoofs a return for the target process based on the return value
> of the supervisor's mkdir(2) call. In the event that that call
> succeeds, the spoofed success return value is the length of the
> pathname.
>
> • If the pathname begins with "./" (i.e., it is a relative
> pathname), the supervisor sends a
> SECCOMP_USER_NOTIF_FLAG_CONTINUE response to the kernel to say
> that the kernel should execute the target process's mkdir(2)
> call.
>
> • If the pathname begins with some other prefix, the supervisor
> spoofs an error return for the target process, so that the
> target process's mkdir(2) call appears to fail with the error
> EOPNOTSUPP ("Operation not supported"). Additionally, if the
> specified pathname is exactly "/bye", then the supervisor
> terminates.
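
The three cases above boil down to a small decision on the pathname prefix. As a sketch (the enum and function names are mine, not taken from the program source below), it could be factored out like this:

```c
#include <string.h>

enum action {
    ACT_EMULATE,        /* supervisor performs mkdir(2) itself */
    ACT_CONTINUE,       /* reply with SECCOMP_USER_NOTIF_FLAG_CONTINUE */
    ACT_DENY,           /* spoof an EOPNOTSUPP failure */
    ACT_DENY_AND_EXIT   /* spoof EOPNOTSUPP, then the supervisor exits */
};

/* Decide how the supervisor treats a mkdir(2) on the given pathname. */
enum action choose_action(const char *path)
{
    if (strncmp(path, "/tmp/", 5) == 0)
        return ACT_EMULATE;
    if (strncmp(path, "./", 2) == 0)
        return ACT_CONTINUE;
    if (strcmp(path, "/bye") == 0)
        return ACT_DENY_AND_EXIT;
    return ACT_DENY;
}
```
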
>
> This program can be used to demonstrate various aspects of the
> behavior of the seccomp user-space notification mechanism. To
> aid such demonstrations, the program logs various messages to
> show the operation of the target process (lines prefixed "T:") and
> the supervisor (indented lines prefixed "S:").
>
> In the following example, the target attempts to create the
> directory /tmp/x. Upon receiving the notification, the supervisor
> creates the directory on the target's behalf, and spoofs a success
> return to be received by the target process's mkdir(2) call.
>
> $ ./seccomp_unotify /tmp/x
> T: PID = 23168
>
> T: about to mkdir("/tmp/x")
> S: got notification (ID 0x17445c4a0f4e0e3c) for PID 23168
> S: executing: mkdir("/tmp/x", 0700)
> S: success! spoofed return = 6
> S: sending response (flags = 0; val = 6; error = 0)
> T: SUCCESS: mkdir(2) returned 6
>
> T: terminating
> S: target has terminated; bye
>
> In the above output, note that the spoofed return value seen by
> the target process is 6 (the length of the pathname /tmp/x),
> whereas a normal mkdir(2) call returns 0 on success.
>
> In the next example, the target attempts to create a directory
> using the relative pathname ./sub. Since this pathname starts
> with "./", the supervisor sends a SECCOMP_USER_NOTIF_FLAG_CONTINUE
> response to the kernel, and the kernel then (successfully)
> executes the target process's mkdir(2) call.
>
> $ ./seccomp_unotify ./sub
> T: PID = 23204
>
> T: about to mkdir("./sub")
> S: got notification (ID 0xddb16abe25b4c12) for PID 23204
> S: target can execute system call
> S: sending response (flags = 0x1; val = 0; error = 0)
> T: SUCCESS: mkdir(2) returned 0
>
> T: terminating
> S: target has terminated; bye
>
> If the target process attempts to create a directory with a
> pathname that doesn't start with "." and doesn't begin with the
> prefix "/tmp/", then the supervisor spoofs an error return
> (EOPNOTSUPP, "Operation not supported") for the target's mkdir(2)
> call (which is not executed):
>
> $ ./seccomp_unotify /xxx
> T: PID = 23178
>
> T: about to mkdir("/xxx")
> S: got notification (ID 0xe7dc095d1c524e80) for PID 23178
> S: spoofing error response (Operation not supported)
> S: sending response (flags = 0; val = 0; error = -95)
> T: ERROR: mkdir(2): Operation not supported
>
> T: terminating
> S: target has terminated; bye
>
> In the next example, the target process attempts to create a
> directory with the pathname /tmp/nosuchdir/b. Upon receiving the
> notification, the supervisor attempts to create that directory,
> but the mkdir(2) call fails because the directory /tmp/nosuchdir
> does not exist. Consequently, the supervisor spoofs an error
> return that passes the error that it received back to the target
> process's mkdir(2) call.
>
> $ ./seccomp_unotify /tmp/nosuchdir/b
> T: PID = 23199
>
> T: about to mkdir("/tmp/nosuchdir/b")
> S: got notification (ID 0x8744454293506046) for PID 23199
> S: executing: mkdir("/tmp/nosuchdir/b", 0700)
> S: failure! (errno = 2; No such file or directory)
> S: sending response (flags = 0; val = 0; error = -2)
> T: ERROR: mkdir(2): No such file or directory
>
> T: terminating
> S: target has terminated; bye
>
> If the supervisor receives a notification and sees that the
> argument of the target's mkdir(2) is the string "/bye", then (as
> well as spoofing an EOPNOTSUPP error), the supervisor terminates.
> If the target process subsequently executes another mkdir(2) that
> triggers its seccomp filter to return the SECCOMP_RET_USER_NOTIF
> action value, then the kernel causes the target process's system
> call to fail with the error ENOSYS ("Function not implemented").
> This is demonstrated by the following example:
>
> $ ./seccomp_unotify /bye /tmp/y
> T: PID = 23185
>
> T: about to mkdir("/bye")
> S: got notification (ID 0xa81236b1d2f7b0f4) for PID 23185
> S: spoofing error response (Operation not supported)
> S: sending response (flags = 0; val = 0; error = -95)
> S: terminating **********
> T: ERROR: mkdir(2): Operation not supported
>
> T: about to mkdir("/tmp/y")
> T: ERROR: mkdir(2): Function not implemented
>
> T: terminating
>
> Program source
> #define _GNU_SOURCE
> #include <sys/types.h>
> #include <sys/prctl.h>
> #include <fcntl.h>
> #include <limits.h>
> #include <signal.h>
> #include <stddef.h>
> #include <stdint.h>
> #include <stdbool.h>
> #include <linux/audit.h>
> #include <sys/syscall.h>
> #include <sys/stat.h>
> #include <linux/filter.h>
> #include <linux/seccomp.h>
> #include <sys/ioctl.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <errno.h>
> #include <sys/socket.h>
> #include <sys/un.h>
>
> #define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \
> } while (0)

Because I love macros, you can expand this to make it take a format
string:

#define errExit(fmt, ...) do { \
char __err[64]; \
strerror_r(errno, __err, sizeof(__err)); \
fprintf(stderr, fmt ": %s\n", ##__VA_ARGS__, __err); \
exit(EXIT_FAILURE); \
} while (0)
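
One caveat with the macro above: the example program defines
_GNU_SOURCE, and in that case glibc exposes the GNU strerror_r(), which
returns a char * and need not fill the caller's buffer. A minimal sketch
that sidesteps this by using plain strerror() (fine here, since the
program is single-threaded) could look like the following. The helper
name formatErrMsg() and the buffer size are assumptions made for
illustration, not something from the page:

```c
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: build "<formatted message>: <strerror(err)>" in
   'buf' (truncating if necessary). Returns 0 on success, or -1 if
   'len' is 0 or the format could not be processed. */

static int
formatErrMsg(char *buf, size_t len, int err, const char *fmt, ...)
{
    va_list ap;

    if (len == 0)
        return -1;

    va_start(ap, fmt);
    int n = vsnprintf(buf, len, fmt, ap);
    va_end(ap);
    if (n < 0)
        return -1;

    /* Append the error text only if the first part was not truncated */

    if ((size_t) n < len)
        snprintf(buf + n, len - n, ": %s", strerror(err));

    return 0;
}

/* Format-string variant of errExit(); 'errno' is read before any
   other library call can change it */

#define errExit(...) do {                                          \
        char errMsg_[256];                                         \
        formatErrMsg(errMsg_, sizeof(errMsg_), errno, __VA_ARGS__); \
        fprintf(stderr, "%s\n", errMsg_);                          \
        exit(EXIT_FAILURE);                                        \
    } while (0)
```

With this, existing calls such as errExit("fork") keep working, and a
call like errExit("open %s", procMemPath) becomes possible.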

>
> /* Send the file descriptor 'fd' over the connected UNIX domain socket
> 'sockfd'. Returns 0 on success, or -1 on error. */
>
> static int
> sendfd(int sockfd, int fd)
> {
> struct msghdr msgh;
> struct iovec iov;
> int data;
> struct cmsghdr *cmsgp;
>
> /* Allocate a char array of suitable size to hold the ancillary data.
> However, since this buffer is in reality a 'struct cmsghdr', use a
> union to ensure that it is suitably aligned. */
> union {
> char buf[CMSG_SPACE(sizeof(int))];
> /* Space large enough to hold an 'int' */
> struct cmsghdr align;
> } controlMsg;
>
> /* The 'msg_name' field can be used to specify the address of the
> destination socket when sending a datagram. However, we do not
> need to use this field because 'sockfd' is a connected socket. */
>
> msgh.msg_name = NULL;
> msgh.msg_namelen = 0;
>
> /* On Linux, we must transmit at least one byte of real data in
> order to send ancillary data. We transmit an arbitrary integer
> whose value is ignored by recvfd(). */
>
> msgh.msg_iov = &iov;
> msgh.msg_iovlen = 1;
> iov.iov_base = &data;
> iov.iov_len = sizeof(int);
> data = 12345;
>
> /* Set 'msghdr' fields that describe ancillary data */
>
> msgh.msg_control = controlMsg.buf;
> msgh.msg_controllen = sizeof(controlMsg.buf);
>
> /* Set up ancillary data describing file descriptor to send */
>
> cmsgp = CMSG_FIRSTHDR(&msgh);
> cmsgp->cmsg_level = SOL_SOCKET;
> cmsgp->cmsg_type = SCM_RIGHTS;
> cmsgp->cmsg_len = CMSG_LEN(sizeof(int));
> memcpy(CMSG_DATA(cmsgp), &fd, sizeof(int));
>
> /* Send real plus ancillary data */
>
> if (sendmsg(sockfd, &msgh, 0) == -1)
> return -1;
>
> return 0;
> }
>
> /* Receive a file descriptor on a connected UNIX domain socket. Returns
> the received file descriptor on success, or -1 on error. */
>
> static int
> recvfd(int sockfd)
> {
> struct msghdr msgh;
> struct iovec iov;
> int data, fd;
> ssize_t nr;
>
> /* Allocate a char buffer for the ancillary data. See the comments
> in sendfd() */
> union {
> char buf[CMSG_SPACE(sizeof(int))];
> struct cmsghdr align;
> } controlMsg;
> struct cmsghdr *cmsgp;
>
> /* The 'msg_name' field can be used to obtain the address of the
> sending socket. However, we do not need this information. */
>
> msgh.msg_name = NULL;
> msgh.msg_namelen = 0;
>
> /* Specify buffer for receiving real data */
>
> msgh.msg_iov = &iov;
> msgh.msg_iovlen = 1;
> iov.iov_base = &data; /* Real data is an 'int' */
> iov.iov_len = sizeof(int);
>
> /* Set 'msghdr' fields that describe ancillary data */
>
> msgh.msg_control = controlMsg.buf;
> msgh.msg_controllen = sizeof(controlMsg.buf);
>
> /* Receive real plus ancillary data; real data is ignored */
>
> nr = recvmsg(sockfd, &msgh, 0);
> if (nr == -1)
> return -1;
>
> cmsgp = CMSG_FIRSTHDR(&msgh);
>
> /* Check the validity of the 'cmsghdr' */
>
> if (cmsgp == NULL ||
> cmsgp->cmsg_len != CMSG_LEN(sizeof(int)) ||
> cmsgp->cmsg_level != SOL_SOCKET ||
> cmsgp->cmsg_type != SCM_RIGHTS) {
> errno = EINVAL;
> return -1;
> }
>
> /* Return the received file descriptor to our caller */
>
> memcpy(&fd, CMSG_DATA(cmsgp), sizeof(int));
> return fd;
> }
>
> static void
> sigchldHandler(int sig)
> {
> char *msg = "\tS: target has terminated; bye\n";
>
> write(STDOUT_FILENO, msg, strlen(msg));

white space nit: extra space before "="
efficiency nit: strlen isn't needed, since it can be done with
compile-time constants:

char msg[] = "\tS: target has terminated; bye\n";
write(STDOUT_FILENO, msg, sizeof(msg) - 1);

(some optimization levels may already replace the strlen with a sizeof - 1)
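
As a self-contained sketch of that nit (the names termMsg and
writeTermMsg() are made up for illustration): the trick only works when
the message is an array, because sizeof on a 'char *' would give the
pointer size instead.

```c
#include <string.h>
#include <unistd.h>

/* Declared as an array (not a pointer), so sizeof yields the string
   length plus one for the terminating null byte */

static const char termMsg[] = "\tS: target has terminated; bye\n";

/* Write the message to 'fd' without calling strlen(); returns the
   result of write(2) */

static ssize_t
writeTermMsg(int fd)
{
    return write(fd, termMsg, sizeof(termMsg) - 1);
}
```

In the signal handler this would be called as
writeTermMsg(STDOUT_FILENO) just before _exit().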

> _exit(EXIT_SUCCESS);
> }
>
> static int
> seccomp(unsigned int operation, unsigned int flags, void *args)
> {
> return syscall(__NR_seccomp, operation, flags, args);
> }
>
> /* The following is the x86-64-specific BPF boilerplate code for checking
> that the BPF program is running on the right architecture + ABI. At
> completion of these instructions, the accumulator contains the system
> call number. */
>
> /* For the x32 ABI, all system call numbers have bit 30 set */
>
> #define X32_SYSCALL_BIT 0x40000000
>
> #define X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR \
> BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
> (offsetof(struct seccomp_data, arch))), \
> BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 2), \
> BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
> (offsetof(struct seccomp_data, nr))), \
> BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, X32_SYSCALL_BIT, 0, 1), \
> BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS)
>
> /* installNotifyFilter() installs a seccomp filter that generates
> user-space notifications (SECCOMP_RET_USER_NOTIF) when the process
> calls mkdir(2); the filter allows all other system calls.
>
> The function return value is a file descriptor from which the
> user-space notifications can be fetched. */
>
> static int
> installNotifyFilter(void)
> {
> struct sock_filter filter[] = {
> X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR,
>
> /* mkdir() triggers notification to user-space supervisor */
>
> BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mkdir, 0, 1),
> BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_USER_NOTIF),
>
> /* Every other system call is allowed */
>
> BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
> };
>
> struct sock_fprog prog = {
> .len = sizeof(filter) / sizeof(filter[0]),
> .filter = filter,
> };
>
> /* Install the filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER flag;
> as a result, seccomp() returns a notification file descriptor. */
>
> int notifyFd = seccomp(SECCOMP_SET_MODE_FILTER,
> SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
> if (notifyFd == -1)
> errExit("seccomp-install-notify-filter");
>
> return notifyFd;
> }
>
> /* Close a pair of sockets created by socketpair() */
>
> static void
> closeSocketPair(int sockPair[2])
> {
> if (close(sockPair[0]) == -1)
> errExit("closeSocketPair-close-0");
> if (close(sockPair[1]) == -1)
> errExit("closeSocketPair-close-1");
> }
>
> /* Implementation of the target process; create a child process that:
>
> (1) installs a seccomp filter with the
> SECCOMP_FILTER_FLAG_NEW_LISTENER flag;
> (2) writes the seccomp notification file descriptor returned from
> the previous step onto the UNIX domain socket, 'sockPair[0]';
> (3) calls mkdir(2) for each element of 'argv'.
>
> The function return value in the parent is the PID of the child
> process; the child does not return from this function. */
>
> static pid_t
> targetProcess(int sockPair[2], char *argv[])
> {
> pid_t targetPid = fork();
> if (targetPid == -1)
> errExit("fork");
>
> if (targetPid > 0) /* In parent, return PID of child */
> return targetPid;
>
> /* Child falls through to here */
>
> printf("T: PID = %ld\n", (long) getpid());
>
> /* Install seccomp filter(s) */
>
> if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
> errExit("prctl");
>
> int notifyFd = installNotifyFilter();
>
> /* Pass the notification file descriptor to the tracing process over
> a UNIX domain socket */
>
> if (sendfd(sockPair[0], notifyFd) == -1)
> errExit("sendfd");
>
> /* Notification and socket FDs are no longer needed in target */
>
> if (close(notifyFd) == -1)
> errExit("close-target-notify-fd");
>
> closeSocketPair(sockPair);
>
> /* Perform a mkdir() call for each of the command-line arguments */
>
> for (char **ap = argv; *ap != NULL; ap++) {
> printf("\nT: about to mkdir(\"%s\")\n", *ap);
>
> int s = mkdir(*ap, 0700);
> if (s == -1)
> perror("T: ERROR: mkdir(2)");
> else
> printf("T: SUCCESS: mkdir(2) returned %d\n", s);
> }
>
> printf("\nT: terminating\n");
> exit(EXIT_SUCCESS);
> }
>
> /* Check that the notification ID provided by a SECCOMP_IOCTL_NOTIF_RECV
> operation is still valid. It will no longer be valid if the process
> has terminated. This operation can be used when accessing /proc/PID
> files in the target process in order to avoid TOCTOU race conditions
> where the PID that is returned by SECCOMP_IOCTL_NOTIF_RECV terminates
> and is reused by another process. */
>
> static void
> checkNotificationIdIsValid(int notifyFd, uint64_t id)
> {
> if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1) {
> fprintf(stderr, "\tS: notification ID check: "
> "target has terminated!!!\n");
>
> exit(EXIT_FAILURE);

And now you can do:

errExit("\tS: notification ID check: "
"target has terminated! ioctl");

;)

> }
> }
>
> /* Access the memory of the target process in order to discover the
> pathname that was given to mkdir() */
>
> static bool
> getTargetPathname(struct seccomp_notif *req, int notifyFd,
> char *path, size_t len)
> {
> char procMemPath[PATH_MAX];
>
> snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", req->pid);
>
> int procMemFd = open(procMemPath, O_RDONLY);
> if (procMemFd == -1)
> errExit("Supervisor: open");
>
> /* Check that the process whose info we are accessing is still alive.
> If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
> in checkNotificationIdIsValid()) succeeds, we know that the
> /proc/PID/mem file descriptor that we opened corresponds to the
> process for which we received a notification. If that process
> subsequently terminates, then read() on that file descriptor
> will return 0 (EOF). */
>
> checkNotificationIdIsValid(notifyFd, req->id);
>
> /* Read bytes at the location containing the pathname argument
> (i.e., the first argument) of the mkdir(2) call */
>
> ssize_t nread = pread(procMemFd, path, len, req->data.args[0]);
> if (nread == -1)
> errExit("pread");
>
> if (nread == 0) {
> fprintf(stderr, "\tS: pread() of /proc/PID/mem "
> "returned 0 (EOF)\n");
> exit(EXIT_FAILURE);
> }
>
> if (close(procMemFd) == -1)
> errExit("close-/proc/PID/mem");
>
> /* We have no guarantees about what was in the memory of the target
> process. We therefore treat the buffer returned by pread() as
> untrusted input. The buffer should be terminated by a null byte;
> if not, then we will trigger an error for the target process. */
>
> for (int j = 0; j < nread; j++)
> if (path[j] == ' ')

This rendering typo (' ' vs '\0') ends up manifesting badly. ;) The man
source shows:

if (path[j] == \(aq\0\(aq)

I think this needs to be \\0 ?

Or it could also be tested as:

if (strnlen(path, nread) < nread)
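
Spelled out as a helper, that check might look like the sketch below.
The function name isNullTerminated() is hypothetical, and the casts are
my addition: 'nread' is an ssize_t while strnlen() works in size_t, so
the sketch rejects non-positive counts before comparing.

```c
#define _GNU_SOURCE
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>

/* Return true if the first 'nread' bytes of 'buf' contain a null byte,
   that is, if 'buf' holds a complete C string */

static bool
isNullTerminated(const char *buf, ssize_t nread)
{
    if (nread <= 0)
        return false;

    /* strnlen() never looks beyond 'nread' bytes; a result smaller
       than 'nread' means a terminator was found inside the buffer */

    return strnlen(buf, (size_t) nread) < (size_t) nread;
}
```

getTargetPathname() could then end with
"return isNullTerminated(path, nread);" in place of the loop.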

> return true;
>
> return false;
> }
>
> /* Handle notifications that arrive via the SECCOMP_RET_USER_NOTIF file
> descriptor, 'notifyFd'. */
>
> static void
> handleNotifications(int notifyFd)
> {
> struct seccomp_notif_sizes sizes;
> char path[PATH_MAX];
>
> /* Discover the sizes of the structures that are used to receive
> notifications and send notification responses, and allocate
> buffers of those sizes. */
>
> if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes) == -1)
> errExit("\tS: seccomp-SECCOMP_GET_NOTIF_SIZES");
>
> struct seccomp_notif *req = malloc(sizes.seccomp_notif);
> if (req == NULL)
> errExit("\tS: malloc");
>
> /* When allocating the response buffer, we must allow for the fact
> that the user-space binary may have been built with user-space
> headers where 'struct seccomp_notif_resp' is bigger than the
> response buffer expected by the (older) kernel. Therefore, we
> allocate a buffer that is the maximum of the two sizes. This
> ensures that if the supervisor places bytes into the response
> structure that are past the response size that the kernel expects,
> then the supervisor is not touching an invalid memory location. */
>
> size_t resp_size = sizes.seccomp_notif_resp;
> if (sizeof(struct seccomp_notif_resp) > resp_size)
> resp_size = sizeof(struct seccomp_notif_resp);
>
> struct seccomp_notif_resp *resp = malloc(resp_size);
> if (resp == NULL)
> errExit("\tS: malloc");
>
> /* Loop handling notifications */
>
> for (;;) {
> /* Wait for next notification, returning info in '*req' */
>
> memset(req, 0, sizes.seccomp_notif);
> if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_RECV, req) == -1) {
> if (errno == EINTR)
> continue;
> errExit("Supervisor: ioctl-SECCOMP_IOCTL_NOTIF_RECV");
> }
>
> printf("\tS: got notification (ID %#llx) for PID %d\n",
> req->id, req->pid);
>
> /* The only system call that can generate a notification event
> is mkdir(2). Nevertheless, we check that the notified system
> call is indeed mkdir() as kind of future-proofing of this
> code in case the seccomp filter is later modified to
> generate notifications for other system calls. */
>
> if (req->data.nr != __NR_mkdir) {
> printf("\tS: notification contained unexpected "
> "system call number; bye!!!\n");
> exit(EXIT_FAILURE);
> }
>
> bool pathOK = getTargetPathname(req, notifyFd, path,
> sizeof(path));
>
> /* Prepopulate some fields of the response */
>
> resp->id = req->id; /* Response includes notification ID */
> resp->flags = 0;
> resp->val = 0;
>
> /* If the target pathname was not valid, trigger an EINVAL error;
> if the directory is in /tmp, then create it on behalf of the
> supervisor; if the pathname starts with '.', tell the kernel
> to let the target process execute the mkdir(); otherwise, give
> an error for a directory pathname in any other location. */
>
> if (!pathOK) {
> resp->error = -EINVAL;
> printf("\tS: spoofing error for invalid pathname (%s)\n",
> strerror(-resp->error));
> } else if (strncmp(path, "/tmp/", strlen("/tmp/")) == 0) {
> printf("\tS: executing: mkdir(\"%s\", %#llo)\n",
> path, req->data.args[1]);
>
> if (mkdir(path, req->data.args[1]) == 0) {
> resp->error = 0; /* "Success" */
> resp->val = strlen(path); /* Used as return value of
> mkdir() in target */
> printf("\tS: success! spoofed return = %lld\n",
> resp->val);
> } else {
>
> /* If mkdir() failed in the supervisor, pass the error
> back to the target */
>
> resp->error = -errno;
> printf("\tS: failure! (errno = %d; %s)\n", errno,
> strerror(errno));
> }
> } else if (strncmp(path, "./", strlen("./")) == 0) {
> resp->error = resp->val = 0;
> resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
> printf("\tS: target can execute system call\n");
> } else {
> resp->error = -EOPNOTSUPP;
> printf("\tS: spoofing error response (%s)\n",
> strerror(-resp->error));
> }
>
> /* Send a response to the notification */
>
> printf("\tS: sending response "
> "(flags = %#x; val = %lld; error = %d)\n",
> resp->flags, resp->val, resp->error);
>
> if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp) == -1) {
> if (errno == ENOENT)
> printf("\tS: response failed with ENOENT; "
> "perhaps target process's syscall was "
> "interrupted by a signal?\n");
> else
> perror("ioctl-SECCOMP_IOCTL_NOTIF_SEND");
> }
>
> /* If the pathname is just "/bye", then the supervisor
> terminates. This allows us to see what happens if the
> target process makes further calls to mkdir(2). */
>
> if (strcmp(path, "/bye") == 0) {
> printf("\tS: terminating **********\n");
> exit(EXIT_FAILURE);
> }
> }
> }
>
> /* Implementation of the supervisor process:
>
> (1) obtains the notification file descriptor from 'sockPair[1]'
> (2) handles notifications that arrive on that file descriptor. */
>
> static void
> supervisor(int sockPair[2])
> {
> int notifyFd = recvfd(sockPair[1]);
> if (notifyFd == -1)
> errExit("recvfd");
>
> closeSocketPair(sockPair); /* We no longer need the socket pair */
>
> handleNotifications(notifyFd);
> }
>
> int
> main(int argc, char *argv[])
> {
> int sockPair[2];
>
> setbuf(stdout, NULL);
>
> if (argc < 2) {
> fprintf(stderr, "At least one pathname argument is required\n");
> exit(EXIT_FAILURE);
> }
>
> /* Create a UNIX domain socket that is used to pass the seccomp
> notification file descriptor from the target process to the
> supervisor process. */
>
> if (socketpair(AF_UNIX, SOCK_STREAM, 0, sockPair) == -1)
> errExit("socketpair");
>
> /* Create a child process--the "target"--that installs seccomp
> filtering. The target process writes the seccomp notification
> file descriptor onto 'sockPair[0]' and then calls mkdir(2) for
> each directory in the command-line arguments. */
>
> (void) targetProcess(sockPair, &argv[optind]);
>
> /* Catch SIGCHLD when the target terminates, so that the
> supervisor can also terminate. */
>
> struct sigaction sa;
> sa.sa_handler = sigchldHandler;
> sa.sa_flags = 0;
> sigemptyset(&sa.sa_mask);
> if (sigaction(SIGCHLD, &sa, NULL) == -1)
> errExit("sigaction");
>
> supervisor(sockPair);
>
> exit(EXIT_SUCCESS);
> }
>
> SEE ALSO
> ioctl(2), seccomp(2)
>
> A further example program can be found in the kernel source file
> samples/seccomp/user-trap.c.
>
> Linux 2020-10-01 SECCOMP_USER_NOTIF(2)

Thank you so much for this documentation and example! :)

-Kees

[1] https://git.kernel.org/linus/dfe719fef03d752f1682fa8aeddf30ba501c8555
[2] https://lore.kernel.org/lkml/CAG48ez3kpEDO1x_HfvOM2R9M78Ach9O_4+Pjs-vLLfqvZL+13A@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CAGXu5jKzif=vp6gn5ZtrTx-JTN367qFphobnt9s=awbaafwoUw@mail.gmail.com/

--
Kees Cook

2020-10-26 10:16:30

by Jann Horn

Subject: Re: For review: seccomp_user_notif(2) manual page

On Sat, Oct 24, 2020 at 2:53 PM Michael Kerrisk (man-pages)
<[email protected]> wrote:
> On 10/17/20 2:25 AM, Jann Horn wrote:
> > On Fri, Oct 16, 2020 at 8:29 PM Michael Kerrisk (man-pages)
> > <[email protected]> wrote:
[...]
> >> I'm not sure if I should write anything about this small UAPI
> >> breakage in BUGS, or not. Your thoughts?
> >
> > Thinking about it a bit more: Any code that relies on pause() or
> > epoll_wait() not restarting is buggy anyway, right? Because a signal
> > could also arrive directly before entering the syscall, while
> > userspace code is still executing? So one could argue that we're just
> > enlarging a preexisting race. (Unless the signal handler checks the
> > interrupted register state to figure out whether we already entered
> > syscall handling?)
>
> Yes, that all makes sense.
>
> > If userspace relies on non-restarting behavior, it should be using
> > something like epoll_pwait(). And that stuff only unblocks signals
> > after we've already passed the seccomp checks on entry.
>
> Thanks for elaborating that detail, since as soon as you talked
> about "enlarging a preexisting race" above, I immediately wondered
> about sigsuspend(), pselect(), etc.
>
> (Mind you, I still wonder about the effect on system calls that
> are normally nonrestartable because they have timeouts. My
> understanding is that the kernel doesn't restart those system
> calls because it's impossible for the kernel to restart the call
> with the right timeout value. I wonder what happens when those
> system calls are restarted in the scenario we're discussing.)

Ah, that's an interesting edge case...

> Anyway, returning to your point... So, to be clear (and to
> quickly remind myself in case I one day reread this thread),
> there is not a problem with sigsuspend(), pselect(), ppoll(),
> and epoll_pwait() since:
>
> * Before the syscall, signals are blocked in the target.
> * Inside the syscall, signals are still blocked at the time
> the check is made for seccomp filters.
> * If a seccomp user-space notification event kicks in, the target
> is put to sleep with the signals still blocked.
> * The signal will only get delivered after the supervisor either
> triggers a spoofed success/failure return in the target or the
> supervisor sends a CONTINUE response to the kernel telling it
> to execute the target's system call. Either way, there won't be
> any restarting of the target's system call (and the supervisor
> thus won't see multiple notifications).
>
> (Right?)

Yeah.
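
For reference, the user-space pattern those bullets rely on can be
sketched as follows. This does not involve seccomp at all; it only
shows the "block the signal, then unblock it atomically inside the
waiting call" idiom, using sigsuspend(). The function and handler names
are invented for the example, and raise() merely stands in for a signal
that arrives just before the wait starts.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>

static volatile sig_atomic_t gotSignal = 0;

static void
usr1Handler(int sig)
{
    (void) sig;
    gotSignal = 1;
}

/* Returns 1 once SIGUSR1 has been handled, or -1 on error */

static int
waitForUsr1(void)
{
    struct sigaction sa;
    sigset_t blockSet, origMask, waitMask;

    sa.sa_handler = usr1Handler;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) == -1)
        return -1;

    /* Block SIGUSR1, so that it cannot be delivered before we wait */

    sigemptyset(&blockSet);
    sigaddset(&blockSet, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &blockSet, &origMask) == -1)
        return -1;

    /* Stand-in for a signal that arrives "directly before entering
       the syscall": because it is blocked, it just stays pending */

    if (raise(SIGUSR1) != 0)
        return -1;

    /* sigsuspend() unblocks SIGUSR1 and waits as one atomic step, so
       the pending signal is delivered inside the call, not before it */

    waitMask = origMask;
    sigdelset(&waitMask, SIGUSR1);
    while (!gotSignal)
        sigsuspend(&waitMask);

    if (sigprocmask(SIG_SETMASK, &origMask, NULL) == -1)
        return -1;

    return 1;
}
```

The same shape applies to ppoll(), pselect() and epoll_pwait(), which
take the temporary signal mask as an argument.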

[...]
> > So we should probably document the restarting behavior as something
> > the supervisor has to deal with in the manpage; but for the
> > "non-restarting syscalls can restart from the target's perspective"
> > aspect, it might be enough to document this as quirky behavior that
> > can't actually break correct code? (Or not document it at all. Dunno.)
>
> So, I've added the following to the page:
>
> Interaction with SA_RESTART signal handlers
> Consider the following scenario:
>
> · The target process has used sigaction(2) to install a signal
> handler with the SA_RESTART flag.
>
> · The target has made a system call that triggered a seccomp user-
> space notification and the target is currently blocked until the
> supervisor sends a notification response.
>
> · A signal is delivered to the target and the signal handler is
> executed.
>
> · When (if) the supervisor attempts to send a notification
> response, the SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation will
> fail with the ENOENT error.
>
> In this scenario, the kernel will restart the target's system
> call. Consequently, the supervisor will receive another user-
> space notification. Thus, depending on how many times the blocked
> system call is interrupted by a signal handler, the supervisor may
> receive multiple notifications for the same system call in the
> target.
>
> One oddity is that system call restarting as described in this
> scenario will occur even for the blocking system calls listed in
> signal(7) that would never normally be restarted by the SA_RESTART
> flag.
>
> Does that seem okay?

Sounds good to me.
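
On the supervisor side, this suggests treating ENOENT from
SECCOMP_IOCTL_NOTIF_SEND as "the notification went stale" rather than
as a fatal error, and simply going back to wait for the next (possibly
repeated) notification. A rough sketch, in which the enum and the
function name sendResponse() are my own invention:

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

enum sendResult {
    SEND_OK = 0,        /* Response was delivered to the target */
    SEND_STALE = 1,     /* ENOENT: notification is no longer valid */
    SEND_FAILED = -1    /* Any other error; 'errno' is left set */
};

/* Send a response on 'notifyFd', classifying the outcome so that the
   caller can keep looping when the target's system call was
   interrupted (and will perhaps be restarted) */

static enum sendResult
sendResponse(int notifyFd, struct seccomp_notif_resp *resp)
{
    if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp) == 0)
        return SEND_OK;

    if (errno == ENOENT)
        return SEND_STALE;

    return SEND_FAILED;
}
```

A supervisor loop would then 'continue' on both SEND_OK and SEND_STALE,
keeping in mind that after SEND_STALE any side effects it already
performed on behalf of the target may be requested a second time.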

> In addition, I've queued a cross-reference in signal(7):
>
> In certain circumstances, the seccomp(2) user-space notifi‐
> cation feature can lead to restarting of system calls that
> would otherwise never be restarted by SA_RESTART; for
> details, see seccomp_user_notif(2).

Subject: Re: For review: seccomp_user_notif(2) manual page

Hi Jann,

On 10/26/20 10:32 AM, Jann Horn wrote:
> On Sat, Oct 24, 2020 at 2:53 PM Michael Kerrisk (man-pages)
> <[email protected]> wrote:
>> On 10/17/20 2:25 AM, Jann Horn wrote:
>>> On Fri, Oct 16, 2020 at 8:29 PM Michael Kerrisk (man-pages)
>>> <[email protected]> wrote:
> [...]
>>>> I'm not sure if I should write anything about this small UAPI
>>>> breakage in BUGS, or not. Your thoughts?
>>>
>>> Thinking about it a bit more: Any code that relies on pause() or
>>> epoll_wait() not restarting is buggy anyway, right? Because a signal
>>> could also arrive directly before entering the syscall, while
>>> userspace code is still executing? So one could argue that we're just
>>> enlarging a preexisting race. (Unless the signal handler checks the
>>> interrupted register state to figure out whether we already entered
>>> syscall handling?)
>>
>> Yes, that all makes sense.
>>
>>> If userspace relies on non-restarting behavior, it should be using
>>> something like epoll_pwait(). And that stuff only unblocks signals
>>> after we've already passed the seccomp checks on entry.
>>
>> Thanks for elaborating that detail, since as soon as you talked
>> about "enlarging a preexisting race" above, I immediately wondered
>> about sigsuspend(), pselect(), etc.
>>
>> (Mind you, I still wonder about the effect on system calls that
>> are normally nonrestartable because they have timeouts. My
>> understanding is that the kernel doesn't restart those system
>> calls because it's impossible for the kernel to restart the call
>> with the right timeout value. I wonder what happens when those
>> system calls are restarted in the scenario we're discussing.)
>
> Ah, that's an interesting edge case...

I'm going to drop a FIXME into the page source so that
there's a reminder of this issue in the next draft of
the page, which I'm about to send out.

[...]

Thanks for checking the other pieces, Jann.

Cheers,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2020-10-26 11:21:19

by Jann Horn

Subject: Re: For review: seccomp_user_notif(2) manual page

On Mon, Oct 26, 2020 at 1:32 AM Kees Cook <[email protected]> wrote:
> On Thu, Oct 01, 2020 at 03:52:02AM +0200, Jann Horn wrote:
> > On Thu, Oct 1, 2020 at 1:25 AM Tycho Andersen <[email protected]> wrote:
> > > On Thu, Oct 01, 2020 at 01:11:33AM +0200, Jann Horn wrote:
> > > > On Thu, Oct 1, 2020 at 1:03 AM Tycho Andersen <[email protected]> wrote:
> > > > > On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
> > > > > > On 9/30/20 5:03 PM, Tycho Andersen wrote:
> > > > > > > On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> > > > > > >> ┌─────────────────────────────────────────────────────┐
> > > > > > >> │FIXME │
> > > > > > >> ├─────────────────────────────────────────────────────┤
> > > > > > >> │From my experiments, it appears that if a SEC‐ │
> > > > > > >> │COMP_IOCTL_NOTIF_RECV is done after the target │
> > > > > > >> │process terminates, then the ioctl() simply blocks │
> > > > > > >> │(rather than returning an error to indicate that the │
> > > > > > >> │target process no longer exists). │
> > > > > > >
> > > > > > > Yeah, I think Christian wanted to fix this at some point,
> > > > > >
> > > > > > Do you have a pointer to that discussion? I could not find it with a
> > > > > > quick search.
> > > > > >
> > > > > > > but it's a
> > > > > > > bit sticky to do.
> > > > > >
> > > > > > Can you say a few words about the nature of the problem?
> > > > >
> > > > > I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
> > > > > notify about unused filter"). So maybe there's a bug here?
> > > >
> > > > That thing only notifies on ->poll, it doesn't unblock ioctls; and
> > > > Michael's sample code uses SECCOMP_IOCTL_NOTIF_RECV to wait. So that
> > > > commit doesn't have any effect on this kind of usage.
> > >
> > > Yes, thanks. And the ones stuck in RECV are waiting on a semaphore so
> > > we don't have a count of all of them, unfortunately.
> > >
> > > We could maybe look inside the wait_list, but that will probably make
> > > people angry :)
> >
> > The easiest way would probably be to open-code the semaphore-ish part,
> > and let the semaphore and poll share the waitqueue. The current code
> > kind of mirrors the semaphore's waitqueue in the wqh - open-coding the
> > entire semaphore would IMO be cleaner than that. And it's not like
> > semaphore semantics are even a good fit for this code anyway.
> >
> > Let's see... if we didn't have the existing UAPI to worry about, I'd
> > do it as follows (*completely* untested). That way, the ioctl would
> > block exactly until either there actually is a request to deliver or
> > there are no more users of the filter. The problem is that if we just
> > apply this patch, existing users of SECCOMP_IOCTL_NOTIF_RECV that use
> > an event loop and don't set O_NONBLOCK will be screwed. So we'd
>
> Wait, why? Do you mean an ioctl calling loop (rather than a poll event
> loop)?

No, I'm talking about poll event loops.

> I think poll would be fine, but a "try calling RECV and expect to
> return ENOENT" loop would change. But I don't think anyone would do this
> exactly because it _currently_ acts like O_NONBLOCK, yes?
>
> > probably also have to add some stupid counter in place of the
> > semaphore's counter that we can use to preserve the old behavior of
> > returning -ENOENT once for each cancelled request. :(
>
> I only see this in Debian Code Search:
> https://sources.debian.org/src/crun/0.15+dfsg-1/src/libcrun/seccomp_notify.c/?hl=166#L166
> which is using epoll_wait():
> https://sources.debian.org/src/crun/0.15+dfsg-1/src/libcrun/container.c/?hl=1326#L1326
>
> I expect LXC is using it. :)

The problem is the scenario where a process is interrupted while it's
waiting for the supervisor to reply.

Consider the following scenario (with supervisor "S" and target "T"; S
wants to wait for events on two file descriptors seccomp_fd and
other_fd):

S: starts poll() to wait for events on seccomp_fd and other_fd
T: performs a syscall that's filtered with RET_USER_NOTIF
S: poll() returns and signals readiness of seccomp_fd
T: receives signal SIGUSR1
T: syscall aborts, enters signal handler
T: signal handler blocks on unfiltered syscall (e.g. write())
S: starts SECCOMP_IOCTL_NOTIF_RECV
S: blocks because no syscalls are pending

Depending on what other_fd is, this could in a worst case even lead to
a deadlock (if e.g. the signal handler wants to write to stdout, but
the stdout fd is hooked up to other_fd in the supervisor, but the
supervisor can't consume the data written because it's stuck in
seccomp handling).

So we have to ensure that when existing code (like that crun code you
linked to) triggers this case, SECCOMP_IOCTL_NOTIF_RECV returns
immediately instead of blocking.

(Oh, but by the way, that crun code looks broken anyway, because
AFAICS it treats all error returns from SECCOMP_IOCTL_NOTIF_RECV
equally by bailing out; and it kinda looks like that bailout path then
nukes the container, or something? So that needs to be fixed either
way.)
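
To make the error-handling point concrete, here is a sketch of the two
pieces of such an event loop: waiting on both descriptors with poll(),
and a receive step that treats EINTR and ENOENT as "nothing to handle
right now" instead of bailing out. All names below are made up for the
example. Note that this does not remove the blocking problem in the
scenario above (the RECV ioctl can still block after poll() reported
readiness); it only shows how the error returns could be told apart.

```c
#include <errno.h>
#include <poll.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

#define READY_SECCOMP 1     /* seccomp_fd has an event */
#define READY_OTHER   2     /* other_fd has an event */

/* Wait up to 'timeoutMs' for events on either descriptor. Returns a
   mask of READY_* bits (0 on timeout), or -1 on error. */

static int
waitForEvents(int seccompFd, int otherFd, int timeoutMs)
{
    struct pollfd fds[2] = {
        { .fd = seccompFd, .events = POLLIN },
        { .fd = otherFd,   .events = POLLIN },
    };
    int n;

    do {
        n = poll(fds, 2, timeoutMs);
    } while (n == -1 && errno == EINTR);

    if (n == -1)
        return -1;

    int ready = 0;
    if (fds[0].revents != 0)
        ready |= READY_SECCOMP;
    if (fds[1].revents != 0)
        ready |= READY_OTHER;
    return ready;
}

/* Fetch one notification into '*req' (a buffer of 'reqSize' bytes).
   Returns 1 if '*req' was filled in, 0 if there is nothing to handle
   (EINTR, or ENOENT because the target's call was cancelled), or -1
   for any other error. */

static int
recvNotification(int notifyFd, struct seccomp_notif *req, size_t reqSize)
{
    memset(req, 0, reqSize);

    if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_RECV, req) == 0)
        return 1;

    if (errno == EINTR || errno == ENOENT)
        return 0;

    return -1;
}
```

Only a -1 return from recvNotification() would justify tearing
anything down; a 0 return just sends the loop back to waitForEvents().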

Subject: Re: For review: seccomp_user_notif(2) manual page

Hello Kees,

On 10/26/20 1:19 AM, Kees Cook wrote:
> On Thu, Oct 15, 2020 at 01:24:03PM +0200, Michael Kerrisk (man-pages) wrote:
>> On 10/1/20 1:39 AM, Kees Cook wrote:
>>> I'll comment more later, but I've run out of time today and I didn't see
>>> anyone mention this detail yet in the existing threads... :)
>>
>> Later never came :-). But, I hope you may have comments for the
>> next draft, which I will send out soon.
>
> Later is now, and Soon approaches!
>
> I finally caught up and read through this whole thread. Thank you all
> for the bug fix[1], and I'm looking forward to more[2]. :)


> For my reply I figured I'd base it on the current draft, so here's a
> simulated quote based on the seccomp_user_notif branch of
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git
> through commit 71101158fe330af5a26552447a0bb433b69e15b7
> $ COLUMNS=75 man --nh --nj man2/seccomp_user_notif.2 | sed 's/^/> /'

Thanks for reviewing the latest version!

> On Sun, Oct 25, 2020 at 01:54:05PM +0100, Michael Kerrisk (man-pages) wrote:
>> SECCOMP_USER_NOTIF(2) Linux Programmer's Manual SECCOMP_USER_NOTIF(2)
>>
>> NAME
>> seccomp_user_notif - Seccomp user-space notification mechanism
>>
>> SYNOPSIS
>> #include <linux/seccomp.h>
>> #include <linux/filter.h>
>> #include <linux/audit.h>
>>
>> int seccomp(unsigned int operation, unsigned int flags, void *args);
>>
>> #include <sys/ioctl.h>
>>
>> int ioctl(int fd, SECCOMP_IOCTL_NOTIF_RECV,
>> struct seccomp_notif *req);
>> int ioctl(int fd, SECCOMP_IOCTL_NOTIF_SEND,
>> struct seccomp_notif_resp *resp);
>> int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ID_VALID, __u64 *id);
>>
>> DESCRIPTION
>> This page describes the user-space notification mechanism provided
>> by the Secure Computing (seccomp) facility. As well as the use of
>> the SECCOMP_FILTER_FLAG_NEW_LISTENER flag, the
>> SECCOMP_RET_USER_NOTIF action value, and the
>> SECCOMP_GET_NOTIF_SIZES operation described in seccomp(2), this
>> mechanism involves the use of a number of related ioctl(2)
>> operations (described below).
>>
>> Overview
>> In conventional usage of a seccomp filter, the decision about how
>> to treat a system call is made by the filter itself. By contrast,
>> the user-space notification mechanism allows the seccomp filter to
>> delegate the handling of the system call to another user-space
>> process. Note that this mechanism is explicitly not intended as a
>> method for implementing security policy; see NOTES.
>>
>> In the discussion that follows, the thread(s) on which the seccomp
>> filter is installed is (are) referred to as the target, and the
>> process that is notified by the user-space notification mechanism
>> is referred to as the supervisor.
>>
>> A suitably privileged supervisor can use the user-space
>> notification mechanism to perform actions on behalf of the target.
>> The advantage of the user-space notification mechanism is that the
>> supervisor will usually be able to retrieve information about the
>> target and the performed system call that the seccomp filter
>> itself cannot. (A seccomp filter is limited in the information it
>> can obtain and the actions that it can perform because it is
>> running on a virtual machine inside the kernel.)
>>
>> An overview of the steps performed by the target and the
>> supervisor is as follows:
>>
>> 1. The target establishes a seccomp filter in the usual manner,
>> but with two differences:
>>
>> • The seccomp(2) flags argument includes the flag
>> SECCOMP_FILTER_FLAG_NEW_LISTENER. Consequently, the return
>> value of the (successful) seccomp(2) call is a new
>
> nit: extra space

Thanks. Fixed.

>> "listening" file descriptor that can be used to receive
>> notifications. Only one "listening" seccomp filter can be
>> installed for a thread.
>
> I like this limitation, but I expect that it'll need to change in the
> future. Even with LSMs, we see the need for arbitrary stacking, and the
> idea of there being only 1 supervisor will eventually break down. Right
> now there is only 1 because only container managers are using this
> feature. But if some daemon starts using it to isolate some thread,
> suddenly it might break if a container manager is trying to listen to it
> too, etc. I expect it won't be needed soon, but I do think it'll change.

Thanks for the background. (I added your text in a comment in the
page, just for my own reference in the future.)

>> • In cases where it is appropriate, the seccomp filter returns
>> the action value SECCOMP_RET_USER_NOTIF. This return value
>> will trigger a notification event.
>>
>> 2. In order that the supervisor can obtain notifications using the
>> listening file descriptor, (a duplicate of) that file
>> descriptor must be passed from the target to the supervisor.
>
> Yet another reason to have an "activate on exec" mode for seccomp. With

Funnily enough, I was having an in-person conversation just last week
with someone else who was interested in "activate-on-exec".

> no_new_privs _not_ being delayed in such a way, I think it'd be safe to
> add. The supervisor would get the fd immediately, and then once it
> fork/execed suddenly the whole thing would activate, and no fd passing
> needed.
>
> The "on exec" boundary is really only needed for oblivious targets. For
> a coordinated target, I've thought it might be nice to have an arbitrary
> "go" point, where the target could call seccomp() with something like a
> SECCOMP_ACTIVATE_DELAYED_FILTERS operation. This lets any process
> initialization happen that might need to do things that would be blocked
> by filters, etc.
>
> Before:
>
> fork
> install some filters that don't block initialization
> exec
> do some initialization
> install more filters, maybe block exec, seccomp
> run
>
> After:
>
> fork
> install delayed filters
> exec
> do some initialization
> activate delayed filters
> run
>
> In practice, the two-stage filter application has been fine, if
> sometimes a bit complex (e.g. for user_notif, "do some initialization"
> includes figuring out how to pass the fd back to the supervisor, etc).

Yes, something like what you describe above would certainly make some
uses easier. Activate-on-exec seems to me the most compelling need
though..

>> One way in which this could be done is by passing the file
>> descriptor over a UNIX domain socket connection between the
>> target and the supervisor (using the SCM_RIGHTS ancillary
>> message type described in unix(7)).
>>
>> 3. The supervisor will receive notification events on the
>> listening file descriptor. These events are returned as
>> structures of type seccomp_notif. Because this structure and
>> its size may evolve over kernel versions, the supervisor must
>> first determine the size of this structure using the seccomp(2)
>> SECCOMP_GET_NOTIF_SIZES operation, which returns a structure of
>> type seccomp_notif_sizes. The supervisor allocates a buffer of
>> size seccomp_notif_sizes.seccomp_notif bytes to receive
>> notification events. In addition, the supervisor allocates
>> another buffer of size seccomp_notif_sizes.seccomp_notif_resp
>> bytes for the response (a struct seccomp_notif_resp structure)
>> that it will provide to the kernel (and thus the target).
>>
>> 4. The target then performs its workload, which includes system
>> calls that will be controlled by the seccomp filter. Whenever
>> one of these system calls causes the filter to return the
>> SECCOMP_RET_USER_NOTIF action value, the kernel does not (yet)
>> execute the system call; instead, execution of the target is
>> temporarily blocked inside the kernel (in a sleep state that is
>> interruptible by signals) and a notification event is generated
>> on the listening file descriptor.
>>
>> 5. The supervisor can now repeatedly monitor the listening file
>> descriptor for SECCOMP_RET_USER_NOTIF-triggered events. To do
>> this, the supervisor uses the SECCOMP_IOCTL_NOTIF_RECV ioctl(2)
>> operation to read information about a notification event; this
>> operation blocks until an event is available. The operation
>> returns a seccomp_notif structure containing information about
>> the system call that is being attempted by the target.
>>
>> 6. The seccomp_notif structure returned by the
>> SECCOMP_IOCTL_NOTIF_RECV operation includes the same
>> information (a seccomp_data structure) that was passed to the
>> seccomp filter. This information allows the supervisor to
>> discover the system call number and the arguments for the
>> target's system call. In addition, the notification event
>> contains the ID of the thread that triggered the notification.
>
> Should "cookie" be at least named here, just to provide a bit more
> context for when it is mentioned in 8 below? E.g.:
>
> ... In addition, the notification event
> contains the triggering thread's ID and a unique cookie to be
> used in subsequent SECCOMP_IOCTL_NOTIF_ID_VALID and
> SECCOMP_IOCTL_NOTIF_SEND operations.

Good catch! Changed as you suggest. (And thanks so much for all
your suggested rewordings; that makes things *much* easier for me.)

>> The information in the notification can be used to discover the
>> values of pointer arguments for the target's system call.
>> (This is something that can't be done from within a seccomp
>> filter.) One way in which the supervisor can do this is to
>> open the corresponding /proc/[tid]/mem file (see proc(5)) and
>> read bytes from the location that corresponds to one of the
>> pointer arguments whose value is supplied in the notification
>> event. (The supervisor must be careful to avoid a race
>> condition that can occur when doing this; see the description
>> of the SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.)
>> In addition, the supervisor can access other system information
>> that is visible in user space but which is not accessible from
>> a seccomp filter.
>>
>> 7. Having obtained information as per the previous step, the
>> supervisor may then choose to perform an action in response to
>> the target's system call (which, as noted above, is not
>> executed when the seccomp filter returns the
>> SECCOMP_RET_USER_NOTIF action value).
>>
>> One example use case here relates to containers. The target
>> may be located inside a container where it does not have
>> sufficient capabilities to mount a filesystem in the
>> container's mount namespace. However, the supervisor may be a
>> more privileged process that does have sufficient capabilities
>> to perform the mount operation.
>>
>> 8. The supervisor then sends a response to the notification. The
>> information in this response is used by the kernel to construct
>> a return value for the target's system call and provide a value
>> that will be assigned to the errno variable of the target.
>>
>> The response is sent using the SECCOMP_IOCTL_NOTIF_SEND
>> ioctl(2) operation, which is used to transmit a
>> seccomp_notif_resp structure to the kernel. This structure
>> includes a cookie value that the supervisor obtained in the
>> seccomp_notif structure returned by the
>> SECCOMP_IOCTL_NOTIF_RECV operation. This cookie value allows
>> the kernel to associate the response with the target.
>
> Describing where the cookie came from seems like it should live in 6
> above. A reader would have to take this new info and figure out where
> SECCOMP_IOCTL_NOTIF_RECV was described and piece it together.

Yeah. I hate it when the documentation loses the reader like that :-}.

> With the
> suggestion to 6 above, maybe:
>
> ... This structure
> must include the cookie value that the supervisor obtained in
> the seccomp_notif structure returned by the
> SECCOMP_IOCTL_NOTIF_RECV operation, which allows the kernel
> to associate the response with the target.

Great! Changed.

>> 9. Once the notification has been sent, the system call in the
>> target thread unblocks, returning the information that was
>> provided by the supervisor in the notification response.
>>
>> As a variation on the last two steps, the supervisor can send a
>> response that tells the kernel that it should execute the target
>> thread's system call; see the discussion of
>> SECCOMP_USER_NOTIF_FLAG_CONTINUE, below.
>>
>> ioctl(2) operations
>> The following ioctl(2) operations are provided to support seccomp
>> user-space notification. For each of these operations, the first
>> (file descriptor) argument of ioctl(2) is the listening file
>> descriptor returned by a call to seccomp(2) with the
>> SECCOMP_FILTER_FLAG_NEW_LISTENER flag.
>>
>> SECCOMP_IOCTL_NOTIF_RECV
>> This operation is used to obtain a user-space notification
>> event. If no such event is currently pending, the
>> operation blocks until an event occurs. The third ioctl(2)
>> argument is a pointer to a structure of the following form
>> which contains information about the event. This structure
>> must be zeroed out before the call.
>>
>> struct seccomp_notif {
>> __u64 id; /* Cookie */
>> __u32 pid; /* TID of target thread */
>
> Should we rename this variable from pid to tid? Yes it's UAPI, but yay for
> anonymous unions:
>
> struct seccomp_notif {
> __u64 id; /* Cookie */
> union {
> __u32 pid;
> __u32 tid; /* TID of target thread */
> };
> __u32 flags; /* Currently unused (0) */
> struct seccomp_data data; /* See seccomp(2) */
> };

Yes, it would be nice to make this change. But, already there
are so many places in the UAPI where the pid/tid is messed up :-(.
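
For what it's worth, the aliasing that the anonymous union would give
can be sketched in isolation. The structure below is a hypothetical
stand-in, not the kernel's definition (the trailing seccomp_data member
is left out, and fixed-width <stdint.h> types replace __u64/__u32); it
only illustrates that old code using pid and new code using tid would
see the same storage and the same layout:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct seccomp_notif (NOT the UAPI struct) */
struct notif_sketch {
    uint64_t id;            /* Cookie */
    union {
        uint32_t pid;       /* Existing name */
        uint32_t tid;       /* Proposed, more accurate name */
    };
    uint32_t flags;         /* Currently unused (0) */
};

/* Both names refer to the same bytes, and the union does not change
   the overall size of the structure */
_Static_assert(offsetof(struct notif_sketch, pid) ==
               offsetof(struct notif_sketch, tid),
               "pid and tid must alias");
_Static_assert(sizeof(struct notif_sketch) == 16,
               "anonymous union must not change the layout");

/* Read the thread ID through the new name */
uint32_t
notif_tid(const struct notif_sketch *n)
{
    assert(n != NULL);
    return n->tid;
}
```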

>> __u32 flags; /* Currently unused (0) */
>> struct seccomp_data data; /* See seccomp(2) */
>> };
>>
>> The fields in this structure are as follows:
>>
>> id This is a cookie for the notification. Each such
>> cookie is guaranteed to be unique for the
>> corresponding seccomp filter.
>>
>> • It can be used with the
>> SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation to
>> verify that the target is still alive.
>>
>> • When returning a notification response to the
>> kernel, the supervisor must include the cookie
>> value in the seccomp_notif_resp structure that is
>> specified as the argument of the
>> SECCOMP_IOCTL_NOTIF_SEND operation.
>>
>> pid This is the thread ID of the target thread that
>> triggered the notification event.
>>
>> flags This is a bit mask of flags providing further
>> information on the event. In the current
>> implementation, this field is always zero.
>>
>> data This is a seccomp_data structure containing
>> information about the system call that triggered the
>> notification. This is the same structure that is
>> passed to the seccomp filter. See seccomp(2) for
>> details of this structure.
>>
>> On success, this operation returns 0; on failure, -1 is
>> returned, and errno is set to indicate the cause of the
>> error. This operation can fail with the following errors:
>>
>> EINVAL (since Linux 5.5)
>> The seccomp_notif structure that was passed to the
>> call contained nonzero fields.
>>
>> ENOENT The target thread was killed by a signal as the
>> notification information was being generated, or the
>> target's (blocked) system call was interrupted by a
>> signal handler.
>>
>> SECCOMP_IOCTL_NOTIF_ID_VALID
>> This operation can be used to check that a notification ID
>> returned by an earlier SECCOMP_IOCTL_NOTIF_RECV operation
>> is still valid (i.e., that the target still exists).
>
> Maybe clarify a bit more, since it's covering more than just "is the
> target still alive", but also "is that syscall still waiting for a
> response":
>
> is still valid (i.e., that the target still exists and
> the syscall is still blocked waiting for a response).

Thanks. I made it:

(i.e., that the target still exists and its system call
is still blocked waiting for a response).

>> The third ioctl(2) argument is a pointer to the cookie (id)
>> returned by the SECCOMP_IOCTL_NOTIF_RECV operation.
>>
>> This operation is necessary to avoid race conditions that
>> can occur when the pid returned by the
>> SECCOMP_IOCTL_NOTIF_RECV operation terminates, and that
>> process ID is reused by another process. An example of
>> this kind of race is the following:
>>
>> 1. A notification is generated on the listening file
>> descriptor. The returned seccomp_notif contains the TID
>> of the target thread (in the pid field of the
>> structure).
>>
>> 2. The target terminates.
>>
>> 3. Another thread or process is created on the system that
>> by chance reuses the TID that was freed when the target
>> terminated.
>>
>> 4. The supervisor open(2)s the /proc/[tid]/mem file for the
>> TID obtained in step 1, with the intention of (say)
>> inspecting the memory location(s) containing the
>> argument(s) of the system call that triggered the
>> notification in step 1.
>>
>> In the above scenario, the risk is that the supervisor may
>> try to access the memory of a process other than the
>> target. This race can be avoided by following the call to
>> open(2) with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to
>> verify that the process that generated the notification is
>> still alive. (Note that if the target terminates after the
>> latter step, a subsequent read(2) from the file descriptor
>> may return 0, indicating end of file.)
>>
>> On success (i.e., the notification ID is still valid), this
>> operation returns 0. On failure (i.e., the notification ID
>> is no longer valid), -1 is returned, and errno is set to
>> ENOENT.
>>
>> SECCOMP_IOCTL_NOTIF_SEND
>> This operation is used to send a notification response back
>> to the kernel. The third ioctl(2) argument of this
>> operation is a pointer to a structure of the following
>> form:
>>
>> struct seccomp_notif_resp {
>> __u64 id; /* Cookie value */
>> __s64 val; /* Success return value */
>> __s32 error; /* 0 (success) or negative
>> error number */
>> __u32 flags; /* See below */
>> };
>>
>> The fields of this structure are as follows:
>>
>> id This is the cookie value that was obtained using the
>> SECCOMP_IOCTL_NOTIF_RECV operation. This cookie
>> value allows the kernel to correctly associate this
>> response with the system call that triggered the
>> user-space notification.
>>
>> val This is the value that will be used for a spoofed
>> success return for the target's system call; see
>> below.
>>
>> error This is the value that will be used as the error
>> number (errno) for a spoofed error return for the
>> target's system call; see below.
>>
>> flags This is a bit mask that includes zero or more of the
>> following flags:
>>
>> SECCOMP_USER_NOTIF_FLAG_CONTINUE (since Linux 5.5)
>> Tell the kernel to execute the target's
>> system call.
>>
>> Two kinds of response are possible:
>>
>> • A response to the kernel telling it to execute the
>> target's system call. In this case, the flags field
>> includes SECCOMP_USER_NOTIF_FLAG_CONTINUE and the error
>> and val fields must be zero.
>>
>> This kind of response can be useful in cases where the
>> supervisor needs to do deeper analysis of the target's
>> system call than is possible from a seccomp filter (e.g.,
>> examining the values of pointer arguments), and, having
>> decided that the system call does not require emulation
>> by the supervisor, the supervisor wants the system call
>> to be executed normally in the target.
>>
>> The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag should be used
>> with caution; see NOTES.
>>
>> • A spoofed return value for the target's system call. In
>> this case, the kernel does not execute the target's
>> system call, instead causing the system call to return a
>> spoofed value as specified by fields of the
>> seccomp_notif_resp structure. The supervisor should set
>> the fields of this structure as follows:
>>
>> + flags does not contain
>> SECCOMP_USER_NOTIF_FLAG_CONTINUE.
>>
>> + error is set either to 0 for a spoofed "success"
>> return or to a negative error number for a spoofed
>> "failure" return. In the former case, the kernel
>> causes the target's system call to return the value
>> specified in the val field. In the latter case, the
>> kernel causes the target's system call to return -1,
>> and errno is assigned the negated error value.
>>
>> + val is set to a value that will be used as the return
>> value for a spoofed "success" return for the target's
>> system call. The value in this field is ignored if
>> the error field contains a nonzero value.
>
> Strictly speaking, this is architecture specific, but all architectures
> do it this way. Should seccomp enforce val == 0 when err != 0 ?

That seems a reasonable check to add. Initially, I found the absence of
such a check confusing, since it left me wondering: have I understood
the kernel code correctly?
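
As a cross-check of my reading, the two kinds of response described in
the draft can be sketched as three small helpers. The function names are
hypothetical, and the sketch assumes that resp points to a buffer of at
least sizeof(struct seccomp_notif_resp) bytes (a real supervisor would
size the buffer using SECCOMP_GET_NOTIF_SIZES). The fallback #define is
only there for older UAPI headers that predate Linux 5.5:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <linux/seccomp.h>

#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE
#define SECCOMP_USER_NOTIF_FLAG_CONTINUE (1UL << 0)
#endif

/* Spoofed "success": the target's system call returns 'val' */
void
resp_spoof_success(struct seccomp_notif_resp *resp, __u64 id, __s64 val)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;          /* Cookie from SECCOMP_IOCTL_NOTIF_RECV */
    resp->val = val;
}

/* Spoofed "failure": the target sees -1 with errno set to 'err'.
   'err' is a positive errno value; the kernel expects it negated. */
void
resp_spoof_error(struct seccomp_notif_resp *resp, __u64 id, int err)
{
    assert(err > 0);
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->error = -err;     /* 'val' stays 0 */
}

/* "Continue": the kernel executes the target's system call itself;
   'error' and 'val' must both be zero in this case */
void
resp_continue(struct seccomp_notif_resp *resp, __u64 id)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}
```

Keeping val at zero in the error case means these helpers would keep
working even if the kernel later enforces the check discussed above.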

>> On success, this operation returns 0; on failure, -1 is
>> returned, and errno is set to indicate the cause of the
>> error. This operation can fail with the following errors:
>>
>> EINPROGRESS
>> A response to this notification has already been
>> sent.
>>
>> EINVAL An invalid value was specified in the flags field.
>>
>> EINVAL The flags field contained
>> SECCOMP_USER_NOTIF_FLAG_CONTINUE, and the error or
>> val field was not zero.
>>
>> ENOENT The blocked system call in the target has been
>> interrupted by a signal handler or the target has
>> terminated.
>>
>> NOTES
>> select()/poll()/epoll semantics
>> The file descriptor returned when seccomp(2) is employed with the
>> SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using
>> poll(2), epoll(7), and select(2). These interfaces indicate that
>> the file descriptor is ready as follows:
>>
>> • When a notification is pending, these interfaces indicate that
>> the file descriptor is readable. Following such an indication,
>> a subsequent SECCOMP_IOCTL_NOTIF_RECV ioctl(2) will not block,
>> returning either information about a notification or else
>> failing with the error EINTR if the target has been killed by a
>> signal or its system call has been interrupted by a signal
>> handler.
>>
>> • After the notification has been received (i.e., by the
>> SECCOMP_IOCTL_NOTIF_RECV ioctl(2) operation), these interfaces
>> indicate that the file descriptor is writable, meaning that a
>> notification response can be sent using the
>> SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation.
>>
>> • After the last thread using the filter has terminated and been
>> reaped using waitpid(2) (or similar), the file descriptor
>> indicates an end-of-file condition (readable in select(2);
>> POLLHUP/EPOLLHUP in poll(2)/epoll_wait(2)).
>
> I'll reply separately about the "ioctl() does not terminate when all
> filters have terminated" case.

Okay.
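
In the meantime, the readiness rules in the three bullet points above
could be sketched from the supervisor's side as follows. The enum and
function names are hypothetical, and the priority order (hang-up first,
then readable, then writable) is my own choice for the sketch rather
than anything the kernel mandates:

```c
#include <assert.h>
#include <poll.h>

/* Hypothetical summary of what poll(2) reported for the listening fd */
enum listener_state {
    LISTENER_IDLE,          /* Nothing to do yet */
    LISTENER_RECV_READY,    /* SECCOMP_IOCTL_NOTIF_RECV will not block */
    LISTENER_SEND_READY,    /* A response can be sent with NOTIF_SEND */
    LISTENER_CLOSED         /* Last thread using the filter is gone */
};

/* Map the 'revents' field returned by poll(2) to a state. POLLHUP is
   checked first so that a dead filter is never mistaken for a pending
   notification. */
enum listener_state
listener_state_from_revents(short revents)
{
    assert(!(revents & POLLNVAL));  /* Caller passed a bad fd */

    if (revents & POLLHUP)
        return LISTENER_CLOSED;
    if (revents & POLLIN)
        return LISTENER_RECV_READY;
    if (revents & POLLOUT)
        return LISTENER_SEND_READY;
    return LISTENER_IDLE;
}
```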

>> Design goals; use of SECCOMP_USER_NOTIF_FLAG_CONTINUE
>> The intent of the user-space notification feature is to allow
>> system calls to be performed on behalf of the target. The
>> target's system call should either be handled by the supervisor or
>> allowed to continue normally in the kernel (where standard
>> security policies will be applied).
>>
>> Note well: this mechanism must not be used to make security policy
>> decisions about the system call, which would be inherently race-
>> prone for reasons described next.
>>
>> The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag must be used with
>> caution. If set by the supervisor, the target's system call will
>> continue. However, there is a time-of-check, time-of-use race
>> here, since an attacker could exploit the interval of time where
>> the target is blocked waiting on the "continue" response to do
>> things such as rewriting the system call arguments.
>>
>> Note furthermore that a user-space notifier can be bypassed if the
>> existing filters allow the use of seccomp(2) or prctl(2) to
>> install a filter that returns an action value with a higher
>> precedence than SECCOMP_RET_USER_NOTIF (see seccomp(2)).
>>
>> It should thus be absolutely clear that the seccomp user-space
>> notification mechanism can not be used to implement a security
>> policy! It should only ever be used in scenarios where a more
>> privileged process supervises the system calls of a lesser
>> privileged target to get around kernel-enforced security
>> restrictions when the supervisor deems this safe. In other words,
>> in order to continue a system call, the supervisor should be sure
>> that another security mechanism or the kernel itself will
>> sufficiently block the system call if its arguments are rewritten
>> to something unsafe.
>>
>> Interaction with SA_RESTART signal handlers
>> Consider the following scenario:
>>
>> • The target process has used sigaction(2) to install a signal
>> handler with the SA_RESTART flag.
>>
>> • The target has made a system call that triggered a seccomp user-
>> space notification and the target is currently blocked until the
>> supervisor sends a notification response.
>>
>> • A signal is delivered to the target and the signal handler is
>> executed.
>>
>> • When (if) the supervisor attempts to send a notification
>> response, the SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation will
>> fail with the ENOENT error.
>>
>> In this scenario, the kernel will restart the target's system
>> call. Consequently, the supervisor will receive another user-
>> space notification. Thus, depending on how many times the blocked
>> system call is interrupted by a signal handler, the supervisor may
>> receive multiple notifications for the same system call in the
>
> maybe "... for the same instance of a system call in the target." for
> clarity?

Yes, that's a nice clarification.

>> target.
>>
>> One oddity is that system call restarting as described in this
>> scenario will occur even for the blocking system calls listed in
>> signal(7) that would never normally be restarted by the SA_RESTART
>> flag.
>
> Does this need fixing? I imagine the correct behavior for this case
> would be a response to _SEND of EINPROGRESS and the target would see
> EINTR normally?

That sounds reasonable.

> I mean, it's not like seccomp doesn't already expose weirdness with
> syscall restarts. Not even arm64 compat agrees[3] with arm32 in this
> regard. :(

I've added the above comments as a FIXME in the page.

>> BUGS
>> If a SECCOMP_IOCTL_NOTIF_RECV ioctl(2) operation is performed
>> after the target terminates, then the ioctl(2) call simply blocks
>> (rather than returning an error to indicate that the target no
>> longer exists).
>
> I want this fixed. It caused me no end of pain when building the
> selftests, and ended up spawning my implementing a global test timeout
> in kselftest. :P Before the usage counter refactor, there was no sane
> way to deal with this, but now I think we're close[2]. I'll reply
> separately about this.

Also added as FIXME comment in the page :-).

The behavior here is surprising, and caused me some
confusion until I worked out what was going on.

>> EXAMPLES
>> The (somewhat contrived) program shown below demonstrates the use
>> of the interfaces described in this page. The program creates a
>> child process that serves as the "target" process. The child
>> process installs a seccomp filter that returns the
>> SECCOMP_RET_USER_NOTIF action value if a call is made to mkdir(2).
>> The child process then calls mkdir(2) once for each of the
>> supplied command-line arguments, and reports the result returned
>> by the call. After processing all arguments, the child process
>> terminates.
>>
>> The parent process acts as the supervisor, listening for the
>> notifications that are generated when the target process calls
>> mkdir(2). When such a notification occurs, the supervisor
>> examines the memory of the target process (using /proc/[pid]/mem)
>> to discover the pathname argument that was supplied to the
>> mkdir(2) call, and performs one of the following actions:
>
> I like this example! It's simple enough to be understandable and complex
> enough to show the purpose of user_notif. :)

Precisely my aim. Thank you for noticing and appreciating :-).

>> • If the pathname begins with the prefix "/tmp/", then the
>> supervisor attempts to create the specified directory, and then
>> spoofs a return for the target process based on the return value
>> of the supervisor's mkdir(2) call. In the event that that call
>> succeeds, the spoofed success return value is the length of the
>> pathname.
>>
>> • If the pathname begins with "./" (i.e., it is a relative
>> pathname), the supervisor sends a
>> SECCOMP_USER_NOTIF_FLAG_CONTINUE response to the kernel to say
>> that the kernel should execute the target process's mkdir(2)
>> call.
>>
>> • If the pathname begins with some other prefix, the supervisor
>> spoofs an error return for the target process, so that the
>> target process's mkdir(2) call appears to fail with the error
>> EOPNOTSUPP ("Operation not supported"). Additionally, if the
>> specified pathname is exactly "/bye", then the supervisor
>> terminates.
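
The three cases above boil down to a prefix test on the pathname. A
minimal sketch of just that decision follows; the enum and function
names are hypothetical and are not the ones used in the example
program, and the special handling of "/bye" (terminating the
supervisor) is deliberately not modeled:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical names for the three supervisor actions */
enum mkdir_action {
    ACTION_EMULATE,     /* Supervisor calls mkdir(2), spoofs the result */
    ACTION_CONTINUE,    /* SECCOMP_USER_NOTIF_FLAG_CONTINUE response */
    ACTION_DENY         /* Spoofed failure with EOPNOTSUPP */
};

/* Choose an action from the (already validated, null-terminated)
   pathname that the target passed to mkdir(2) */
enum mkdir_action
choose_mkdir_action(const char *path)
{
    assert(path != NULL);

    if (strncmp(path, "/tmp/", strlen("/tmp/")) == 0)
        return ACTION_EMULATE;
    if (strncmp(path, "./", strlen("./")) == 0)
        return ACTION_CONTINUE;
    return ACTION_DENY;
}
```

Note that "/tmpx" falls into the last case, because the prefix that is
tested includes the trailing slash.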

[...]

>> Program source
>> #define _GNU_SOURCE
>> #include <sys/types.h>
>> #include <sys/prctl.h>
>> #include <fcntl.h>
>> #include <limits.h>
>> #include <signal.h>
>> #include <stddef.h>
>> #include <stdint.h>
>> #include <stdbool.h>
>> #include <linux/audit.h>
>> #include <sys/syscall.h>
>> #include <sys/stat.h>
>> #include <linux/filter.h>
>> #include <linux/seccomp.h>
>> #include <sys/ioctl.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <unistd.h>
>> #include <errno.h>
>> #include <sys/socket.h>
>> #include <sys/un.h>
>>
>> #define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \
>> } while (0)
>
> Because I love macros, you can expand this to make it take a format
> string:
>
> #define errExit(fmt, ...) do { \
> char __err[64]; \
> strerror_r(errno, __err, sizeof(__err)); \
> fprintf(stderr, fmt ": %s\n", ##__VA_ARGS__, __err); \
> exit(EXIT_FAILURE); \
> } while (0)

I'm a bit divided about this. I don't want to distract the reader by
requiring them to understand the macro. I'll leave this for the moment.

[...]

>> static void
>> sigchldHandler(int sig)
>> {
>> char *msg = "\tS: target has terminated; bye\n";
>>
>> write(STDOUT_FILENO, msg, strlen(msg));
>
> white space nit: extra space before "="

Thanks!

> efficiency nit: strlen isn't needed, since it can be done with
> compile-time constants:
>
> char msg[] = "\tS: target has terminated; bye\n";
> write(STDOUT_FILENO, msg, sizeof(msg) - 1);
>
> (some optimization levels may already replace the strlen with a sizeof - 1)

Changed as you suggest. Thanks!

>> _exit(EXIT_SUCCESS);
>> }

[...]

>> static void
>> checkNotificationIdIsValid(int notifyFd, uint64_t id)
>> {
>> if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1) {
>> fprintf(stderr, "\tS: notification ID check: "
>> "target has terminated!!!\n");
>>
>> exit(EXIT_FAILURE);
>
> And now you can do:
>
> errExit("\tS: notification ID check: "
> "target has terminated! ioctl");
>
> ;)

Thanks. Changed as you suggest.

>> }
>> }
>>
>> /* Access the memory of the target process in order to discover the
>> pathname that was given to mkdir() */
>>
>> static bool
>> getTargetPathname(struct seccomp_notif *req, int notifyFd,
>> char *path, size_t len)
>> {
>> char procMemPath[PATH_MAX];
>>
>> snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", req->pid);
>>
>> int procMemFd = open(procMemPath, O_RDONLY);
>> if (procMemFd == -1)
>> errExit("Supervisor: open");
>>
>> /* Check that the process whose info we are accessing is still alive.
>> If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed
>> in checkNotificationIdIsValid()) succeeds, we know that the
>> /proc/PID/mem file descriptor that we opened corresponds to the
>> process for which we received a notification. If that process
>> subsequently terminates, then read() on that file descriptor
>> will return 0 (EOF). */
>>
>> checkNotificationIdIsValid(notifyFd, req->id);
>>
>> /* Read bytes at the location containing the pathname argument
>> (i.e., the first argument) of the mkdir(2) call */
>>
>> ssize_t nread = pread(procMemFd, path, len, req->data.args[0]);
>> if (nread == -1)
>> errExit("pread");
>>
>> if (nread == 0) {
>> fprintf(stderr, "\tS: pread() of /proc/PID/mem "
>> "returned 0 (EOF)\n");
>> exit(EXIT_FAILURE);
>> }
>>
>> if (close(procMemFd) == -1)
>> errExit("close-/proc/PID/mem");
>>
>> /* We have no guarantees about what was in the memory of the target
>> process. We therefore treat the buffer returned by pread() as
>> untrusted input. The buffer should be terminated by a null byte;
>> if not, then we will trigger an error for the target process. */
>>
>> for (int j = 0; j < nread; j++)
>> if (path[j] == ' ')
>
> This rendering typo (' ' vs '\0') ends up manifesting badly. ;) The man
> source shows:
>
> if (path[j] == \(aq\0\(aq)
>
> I think this needs to be \\0 ?

Yes, that was the intention.

> Or it could also be tested as:
>
> if (strnlen(path, nread) < nread)
>

Good point. Changed to:

if (strnlen(path, nread) < nread)
return true;
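
Pulled out on its own, the check looks like the sketch below. The
function name is hypothetical, and the guard for nread <= 0 is an
addition of mine (in the page's program, those cases are handled before
this point); the idea is simply that strnlen() stops at the first null
byte, so a result smaller than nread proves that a terminator lies
inside the bytes actually read:

```c
#define _POSIX_C_SOURCE 200809L     /* For strnlen() */
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>

/* Return true if the first 'nread' bytes of 'buf' (untrusted data read
   from the target's memory) contain a terminating null byte */
bool
buffer_has_terminator(const char *buf, ssize_t nread)
{
    assert(buf != NULL);

    if (nread <= 0)
        return false;       /* Nothing was read: no usable string */

    return strnlen(buf, (size_t) nread) < (size_t) nread;
}
```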

[...]

>
> Thank you so much for this documentation and example! :)

You're welcome. It's been "interesting" uncovering the glitches :-).

Cheers,

Michael


> [1] https://git.kernel.org/linus/dfe719fef03d752f1682fa8aeddf30ba501c8555
> [2] https://lore.kernel.org/lkml/CAG48ez3kpEDO1x_HfvOM2R9M78Ach9O_4+Pjs-vLLfqvZL+13A@mail.gmail.com/
> [3] https://lore.kernel.org/lkml/CAGXu5jKzif=vp6gn5ZtrTx-JTN367qFphobnt9s=awbaafwoUw@mail.gmail.com/
>


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2020-10-26 12:03:24

by Jann Horn

Subject: Re: For review: seccomp_user_notif(2) manual page

On Mon, Oct 26, 2020 at 10:51 AM Jann Horn <[email protected]> wrote:
> On Mon, Oct 26, 2020 at 1:32 AM Kees Cook <[email protected]> wrote:
> > On Thu, Oct 01, 2020 at 03:52:02AM +0200, Jann Horn wrote:
> > > On Thu, Oct 1, 2020 at 1:25 AM Tycho Andersen <[email protected]> wrote:
> > > > On Thu, Oct 01, 2020 at 01:11:33AM +0200, Jann Horn wrote:
> > > > > On Thu, Oct 1, 2020 at 1:03 AM Tycho Andersen <[email protected]> wrote:
> > > > > > On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
> > > > > > > On 9/30/20 5:03 PM, Tycho Andersen wrote:
> > > > > > > > On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> > > > > > > >> ┌─────────────────────────────────────────────────────┐
> > > > > > > >> │FIXME │
> > > > > > > >> ├─────────────────────────────────────────────────────┤
> > > > > > > >> │From my experiments, it appears that if a SEC‐ │
> > > > > > > >> │COMP_IOCTL_NOTIF_RECV is done after the target │
> > > > > > > >> │process terminates, then the ioctl() simply blocks │
> > > > > > > >> │(rather than returning an error to indicate that the │
> > > > > > > >> │target process no longer exists). │
> > > > > > > >
> > > > > > > > Yeah, I think Christian wanted to fix this at some point,
> > > > > > >
> > > > > > > > Do you have a pointer to that discussion? I could not find it with a
> > > > > > > quick search.
> > > > > > >
> > > > > > > > but it's a
> > > > > > > > bit sticky to do.
> > > > > > >
> > > > > > > Can you say a few words about the nature of the problem?
> > > > > >
> > > > > > I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
> > > > > > notify about unused filter"). So maybe there's a bug here?
> > > > >
> > > > > That thing only notifies on ->poll, it doesn't unblock ioctls; and
> > > > > Michael's sample code uses SECCOMP_IOCTL_NOTIF_RECV to wait. So that
> > > > > commit doesn't have any effect on this kind of usage.
> > > >
> > > > Yes, thanks. And the ones stuck in RECV are waiting on a semaphore so
> > > > we don't have a count of all of them, unfortunately.
> > > >
> > > > We could maybe look inside the wait_list, but that will probably make
> > > > people angry :)
> > >
> > > The easiest way would probably be to open-code the semaphore-ish part,
> > > and let the semaphore and poll share the waitqueue. The current code
> > > kind of mirrors the semaphore's waitqueue in the wqh - open-coding the
> > > entire semaphore would IMO be cleaner than that. And it's not like
> > > semaphore semantics are even a good fit for this code anyway.
> > >
> > > Let's see... if we didn't have the existing UAPI to worry about, I'd
> > > do it as follows (*completely* untested). That way, the ioctl would
> > > block exactly until either there actually is a request to deliver or
> > > there are no more users of the filter. The problem is that if we just
> > > apply this patch, existing users of SECCOMP_IOCTL_NOTIF_RECV that use
> > > an event loop and don't set O_NONBLOCK will be screwed. So we'd
> >
> > Wait, why? Do you mean an ioctl-calling loop (rather than a poll event
> > loop)?
>
> No, I'm talking about poll event loops.
>
> > I think poll would be fine, but a "try calling RECV and expect to
> > return ENOENT" loop would change. But I don't think anyone would do this
> > exactly because it _currently_ acts like O_NONBLOCK, yes?
> >
> > > probably also have to add some stupid counter in place of the
> > > semaphore's counter that we can use to preserve the old behavior of
> > > returning -ENOENT once for each cancelled request. :(
> >
> > I only see this in Debian Code Search:
> > https://sources.debian.org/src/crun/0.15+dfsg-1/src/libcrun/seccomp_notify.c/?hl=166#L166
> > which is using epoll_wait():
> > https://sources.debian.org/src/crun/0.15+dfsg-1/src/libcrun/container.c/?hl=1326#L1326
> >
> > I expect LXC is using it. :)
>
> The problem is the scenario where a process is interrupted while it's
> waiting for the supervisor to reply.
>
> Consider the following scenario (with supervisor "S" and target "T"; S
> wants to wait for events on two file descriptors seccomp_fd and
> other_fd):
>
> S: starts poll() to wait for events on seccomp_fd and other_fd
> T: performs a syscall that's filtered with RET_USER_NOTIF
> S: poll() returns and signals readiness of seccomp_fd
> T: receives signal SIGUSR1
> T: syscall aborts, enters signal handler
> T: signal handler blocks on unfiltered syscall (e.g. write())
> S: starts SECCOMP_IOCTL_NOTIF_RECV
> S: blocks because no syscalls are pending
>
> Depending on what other_fd is, this could in a worst case even lead to
> a deadlock (if e.g. the signal handler wants to write to stdout, but
> the stdout fd is hooked up to other_fd in the supervisor, but the
> supervisor can't consume the data written because it's stuck in
> seccomp handling).
>
> So we have to ensure that when existing code (like that crun code you
> linked to) triggers this case, SECCOMP_IOCTL_NOTIF_RECV returns
> immediately instead of blocking.

Or I guess we could also just set O_NONBLOCK on the fd by default?
Since the one existing user is eventloop-based...
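
As a point of reference, this is how a supervisor would opt in to non-blocking mode itself with fcntl(2). It is a minimal sketch: setNonBlocking() is a hypothetical helper name, and whether SECCOMP_IOCTL_NOTIF_RECV actually honors O_NONBLOCK on the notification file descriptor is exactly what this thread is debating, so that part is an assumption that depends on the kernel version.

```c
#include <assert.h>
#include <fcntl.h>

/* Add O_NONBLOCK to the open file status flags of 'fd', leaving the
   other flags unchanged. Returns 0 on success, or -1 with errno set
   by fcntl(). */
static int
setNonBlocking(int fd)
{
    assert(fd >= 0);

    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;

    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}
```

Setting the flag by default in the kernel, as suggested above, would spare event-loop users this call but would change behavior for callers that rely on a blocking ioctl().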

2020-10-26 20:54:00

by Jann Horn

Subject: Re: For review: seccomp_user_notif(2) manual page

On Sun, Oct 25, 2020 at 5:32 PM Michael Kerrisk (man-pages)
<[email protected]> wrote:
> On 10/1/20 4:14 AM, Jann Horn wrote:
> > On Thu, Oct 1, 2020 at 3:52 AM Jann Horn <[email protected]> wrote:
> >> On Thu, Oct 1, 2020 at 1:25 AM Tycho Andersen <[email protected]> wrote:
> >>> On Thu, Oct 01, 2020 at 01:11:33AM +0200, Jann Horn wrote:
> >>>> On Thu, Oct 1, 2020 at 1:03 AM Tycho Andersen <[email protected]> wrote:
> >>>>> On Wed, Sep 30, 2020 at 10:34:51PM +0200, Michael Kerrisk (man-pages) wrote:
> >>>>>> On 9/30/20 5:03 PM, Tycho Andersen wrote:
> >>>>>>> On Wed, Sep 30, 2020 at 01:07:38PM +0200, Michael Kerrisk (man-pages) wrote:
> >>>>>>>> ┌─────────────────────────────────────────────────────┐
> >>>>>>>> │FIXME │
> >>>>>>>> ├─────────────────────────────────────────────────────┤
> >>>>>>>> │From my experiments, it appears that if a SEC‐ │
> >>>>>>>> │COMP_IOCTL_NOTIF_RECV is done after the target │
> >>>>>>>> │process terminates, then the ioctl() simply blocks │
> >>>>>>>> │(rather than returning an error to indicate that the │
> >>>>>>>> │target process no longer exists). │
> >>>>>>>
> >>>>>>> Yeah, I think Christian wanted to fix this at some point,
> >>>>>>
> >>>>>> Do you have a pointer to that discussion? I could not find it with a
> >>>>>> quick search.
> >>>>>>
> >>>>>>> but it's a
> >>>>>>> bit sticky to do.
> >>>>>>
> >>>>>> Can you say a few words about the nature of the problem?
> >>>>>
> >>>>> I remembered wrong, it's actually in the tree: 99cdb8b9a573 ("seccomp:
> >>>>> notify about unused filter"). So maybe there's a bug here?
> >>>>
> >>>> That thing only notifies on ->poll, it doesn't unblock ioctls; and
> >>>> Michael's sample code uses SECCOMP_IOCTL_NOTIF_RECV to wait. So that
> >>>> commit doesn't have any effect on this kind of usage.
> >>>
> >>> Yes, thanks. And the ones stuck in RECV are waiting on a semaphore so
> >>> we don't have a count of all of them, unfortunately.
> >>>
> >>> We could maybe look inside the wait_list, but that will probably make
> >>> people angry :)
> >>
> >> The easiest way would probably be to open-code the semaphore-ish part,
> >> and let the semaphore and poll share the waitqueue. The current code
> >> kind of mirrors the semaphore's waitqueue in the wqh - open-coding the
> >> entire semaphore would IMO be cleaner than that. And it's not like
> >> semaphore semantics are even a good fit for this code anyway.
> >>
> >> Let's see... if we didn't have the existing UAPI to worry about, I'd
> >> do it as follows (*completely* untested). That way, the ioctl would
> >> block exactly until either there actually is a request to deliver or
> >> there are no more users of the filter. The problem is that if we just
> >> apply this patch, existing users of SECCOMP_IOCTL_NOTIF_RECV that use
> >> an event loop and don't set O_NONBLOCK will be screwed. So we'd
> >> probably also have to add some stupid counter in place of the
> >> semaphore's counter that we can use to preserve the old behavior of
> >> returning -ENOENT once for each cancelled request. :(
> >>
> >> I guess this is a nice point in favor of Michael's usual complaint
> >> that if there are no man pages for a feature by the time the feature
> >> lands upstream, there's a higher chance that the UAPI will suck
> >> forever...
> >
> > And I guess this would be the UAPI-compatible version - not actually
> > as terrible as I thought it might be. Do y'all want this? If so, feel
> > free to either turn this into a proper patch with Co-developed-by, or
> > tell me that I should do it and I'll try to get around to turning it
> > into something proper.
>
> Thanks for taking a shot at this.
>
> I tried applying the patch below to vanilla 5.9.0.
> (There's one typo: s/ENOTCON/ENOTCONN).
>
> It seems not to work though; when I send a signal to my test
> target process that is sleeping waiting for the notification
> response, the process enters the uninterruptible D state.
> Any thoughts?

Ah, yeah, I think I was completely misusing the wait API. I'll go change that.

(Btw, in general, for reports about hangs like that, it can be helpful
to have the contents of /proc/$pid/stack. And for cases where CPUs are
spinning, the relevant part from the output of the "L" sysrq, or
something like that.)

Also, I guess we can probably break this part of UAPI after all, since
the only user of this interface seems to currently be completely
broken in this case anyway? So I think we want the other
implementation without the ->canceled_reqs logic after all.

I'm a bit on the fence now on whether non-blocking mode should use
ENOTCONN or not... I guess if we returned ENOENT even when there are
no more listeners, you'd have to disambiguate through the poll()
revents, which would be kinda ugly?

I'll try to turn this into a proper patch submission...

Subject: Re: For review: seccomp_user_notif(2) manual page

On 10/26/20 4:54 PM, Jann Horn wrote:
> On Sun, Oct 25, 2020 at 5:32 PM Michael Kerrisk (man-pages)
> <[email protected]> wrote:
[...]
>> I tried applying the patch below to vanilla 5.9.0.
>> (There's one typo: s/ENOTCON/ENOTCONN).
>>
>> It seems not to work though; when I send a signal to my test
>> target process that is sleeping waiting for the notification
>> response, the process enters the uninterruptible D state.
>> Any thoughts?
>
> Ah, yeah, I think I was completely misusing the wait API. I'll go change that.
>
> (Btw, in general, for reports about hangs like that, it can be helpful
> to have the contents of /proc/$pid/stack. And for cases where CPUs are
> spinning, the relevant part from the output of the "L" sysrq, or
> something like that.)

Thanks for the tips!

> Also, I guess we can probably break this part of UAPI after all, since
> the only user of this interface seems to currently be completely
> broken in this case anyway? So I think we want the other
> implementation without the ->canceled_reqs logic after all.

Okay.

> I'm a bit on the fence now on whether non-blocking mode should use
> ENOTCONN or not... I guess if we returned ENOENT even when there are
> no more listeners, you'd have to disambiguate through the poll()
> revents, which would be kinda ugly?

I must confess, I'm not quite clear on which two cases you
are trying to distinguish. Can you elaborate?

> I'll try to turn this into a proper patch submission...

Thank you!!

Cheers,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2020-10-28 06:50:25

by Jann Horn

Subject: Re: For review: seccomp_user_notif(2) manual page

On Tue, Oct 27, 2020 at 7:14 AM Michael Kerrisk (man-pages)
<[email protected]> wrote:
> On 10/26/20 4:54 PM, Jann Horn wrote:
> > I'm a bit on the fence now on whether non-blocking mode should use
> > ENOTCONN or not... I guess if we returned ENOENT even when there are
> > no more listeners, you'd have to disambiguate through the poll()
> > revents, which would be kinda ugly?
>
> I must confess, I'm not quite clear on which two cases you
> are trying to distinguish. Can you elaborate?

Let's say someone writes a program whose responsibilities are just to
handle seccomp events and to listen on some other fd for commands. And
this is implemented with an event loop. Then once all the target
processes are gone (including zombie reaping), we'll start getting
EPOLLERR.

If NOTIF_RECV starts returning -ENOTCONN at this point, the event loop
can just call into the seccomp logic without any arguments; it can
just call NOTIF_RECV one more time, see the -ENOTCONN, and terminate.
The downside is that there's one more error code userspace has to
special-case.
This would be more consistent with what we'd be doing in the blocking case.

If NOTIF_RECV keeps returning -ENOENT, the event loop has to also tell
the seccomp logic what the revents are.

I guess it probably doesn't really matter much.
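
To make the two options concrete, here is a sketch of how the supervisor's seccomp logic might classify the result of a non-blocking NOTIF_RECV under each proposal. Everything here is hypothetical: the enum, the function names, and the errno semantics are modeled on the proposals in this thread, not on any released kernel behavior.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>

enum recvOutcome {
    RECV_OK,        /* A notification was received */
    RECV_RETRY,     /* Nothing to handle right now; keep polling */
    RECV_DONE,      /* No targets left; stop watching the fd */
    RECV_ERROR      /* Unexpected failure */
};

/* Proposal 1 (assumed semantics): the kernel returns ENOTCONN once
   the filter has no users. 'ret' and 'err' are the ioctl() return
   value and errno. No poll() information is needed. */
static enum recvOutcome
classifyWithEnotconn(int ret, int err)
{
    if (ret == 0)
        return RECV_OK;
    if (err == ENOTCONN)
        return RECV_DONE;
    if (err == ENOENT || err == EINTR)
        return RECV_RETRY;
    return RECV_ERROR;
}

/* Proposal 2 (assumed semantics): the kernel keeps returning ENOENT,
   so the event loop must also pass in the poll() 'revents' to tell
   "nothing pending" apart from "no targets left". */
static enum recvOutcome
classifyWithRevents(int ret, int err, short revents)
{
    if (ret == 0)
        return RECV_OK;
    if (err == ENOENT)
        return (revents & POLLERR) ? RECV_DONE : RECV_RETRY;
    if (err == EINTR)
        return RECV_RETRY;
    return RECV_ERROR;
}
```

The difference in the signatures is the whole trade-off: the first variant needs one more errno case, the second needs the event loop to forward revents.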

2020-10-29 00:35:01

by Jann Horn

Subject: Re: For review: seccomp_user_notif(2) manual page

On Wed, Oct 28, 2020 at 6:44 PM Sargun Dhillon <[email protected]> wrote:
> On Wed, Oct 28, 2020 at 2:43 AM Jann Horn <[email protected]> wrote:
> > On Wed, Oct 28, 2020 at 7:32 AM Sargun Dhillon <[email protected]> wrote:
> > > On Tue, Oct 27, 2020 at 3:28 AM Jann Horn <[email protected]> wrote:
> > > > On Tue, Oct 27, 2020 at 7:14 AM Michael Kerrisk (man-pages)
> > > > <[email protected]> wrote:
> > > > > On 10/26/20 4:54 PM, Jann Horn wrote:
> > > > > > I'm a bit on the fence now on whether non-blocking mode should use
> > > > > > ENOTCONN or not... I guess if we returned ENOENT even when there are
> > > > > > no more listeners, you'd have to disambiguate through the poll()
> > > > > > revents, which would be kinda ugly?
> > > > >
> > > > > I must confess, I'm not quite clear on which two cases you
> > > > > are trying to distinguish. Can you elaborate?
> > > >
> > > > Let's say someone writes a program whose responsibilities are just to
> > > > handle seccomp events and to listen on some other fd for commands. And
> > > > this is implemented with an event loop. Then once all the target
> > > > processes are gone (including zombie reaping), we'll start getting
> > > > EPOLLERR.
> > > >
> > > > If NOTIF_RECV starts returning -ENOTCONN at this point, the event loop
> > > > can just call into the seccomp logic without any arguments; it can
> > > > just call NOTIF_RECV one more time, see the -ENOTCONN, and terminate.
> > > > The downside is that there's one more error code userspace has to
> > > > special-case.
> > > > This would be more consistent with what we'd be doing in the blocking case.
> > > >
> > > > If NOTIF_RECV keeps returning -ENOENT, the event loop has to also tell
> > > > the seccomp logic what the revents are.
> > > >
> > > > I guess it probably doesn't really matter much.
> > >
> > > So, in practice, if you're emulating a blocking syscall (such as open,
> > > perf_event_open, or any of a number of other syscalls), you probably
> > > have to do it on a separate thread in the supervisor because you want
> > > to continue to be able to receive new notifications if any other process
> > > generates a seccomp notification event that you need to handle.
> > >
> > > In addition to that, some of these syscalls are preemptible, so you need
> > > to poll SECCOMP_IOCTL_NOTIF_ID_VALID to make sure that the program
> > > under supervision hasn't left the syscall.
> > >
> > > If we're to implement a mechanism that makes the seccomp ioctl receive
> > > non-blocking, it would be valuable to address this problem as well (getting
> > > a notification when the supervisor is processing a syscall and needs to
> > > preempt it). In the best case, this can be a minor inconvenience, and
> > > in the worst case this can result in weird errors where you're keeping
> > > resources open that the container expects to be closed.
> >
> > Does "a notification" mean signals? Or would you want to have a second
> > thread in userspace that poll()s for cancellation events on the
> > seccomp fd and then somehow takes care of interrupting the first
> > thread, or something like that?
>
> I would be reluctant to be prescriptive in that it be a signal. Right
> now, it's implemented as a second thread in userspace that does an
> ioctl(...) and checks if the notification is valid / alive, and does
> what's required if the notification has died (interrupting the first
> thread).
>
> >
> > Either way, I think your proposal goes beyond the scope of patching
> > the existing weirdness, and should be a separate patch.
>
> I agree it should be a separate patch, but I think that it'd be nice if there
> was a way to do something like:
> * opt-in to getting another message after receiving the notification
> that indicates the program has left the syscall

I guess to do that cleanly, we'd want something like an array
associated with the seccomp filter that has a size N that's determined
when the filter is set up... and then when a received but unanswered
notification is cancelled, we'd insert its identifier into that array.
And if we enforce that the supervisor can never have more than N
pending messages (by just not delivering new ones if there are N old
ones pending), we'll know that any possible cancellation will always
fit, and we don't need to worry about dynamic memory allocation.

And we could raise EPOLLPRI on the file descriptor when the array is
non-empty, so that if userspace doesn't currently want to handle new
notifications (because it's already dealing with a bunch of them),
userspace can do that, too.
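
A userspace model of that bookkeeping, purely as a sketch: all names and the value of MAX_PENDING are hypothetical, and the model assumes that a cancelled notification keeps occupying its slot until the supervisor has collected the cancellation. That assumption is what guarantees the array can never overflow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 4   /* "N"; hypothetical value fixed at filter setup */

struct cancelQueue {
    uint64_t cancelled[MAX_PENDING]; /* IDs of cancelled notifications */
    size_t   numCancelled;           /* Used entries in cancelled[] */
    size_t   inFlight;               /* Delivered, and neither answered
                                        nor collected as cancelled */
};

/* Deliver a new notification only if a slot is free. */
static bool
deliver(struct cancelQueue *q)
{
    if (q->inFlight >= MAX_PENDING)
        return false;               /* Hold the notification back */
    q->inFlight++;
    return true;
}

/* The supervisor answered a live notification: release its slot. */
static void
answer(struct cancelQueue *q)
{
    assert(q->inFlight > q->numCancelled);
    q->inFlight--;
}

/* The target abandoned a delivered but unanswered syscall. There is
   always room, because numCancelled < inFlight <= MAX_PENDING. */
static void
cancel(struct cancelQueue *q, uint64_t id)
{
    assert(q->numCancelled < q->inFlight);
    q->cancelled[q->numCancelled++] = id;
}

/* Corresponds to the condition under which EPOLLPRI would be raised. */
static bool
hasCancellations(const struct cancelQueue *q)
{
    return q->numCancelled > 0;
}

/* The supervisor collects one cancellation (most recent first) and
   thereby frees the slot. Returns false if none is queued. */
static bool
consumeCancellation(struct cancelQueue *q, uint64_t *id)
{
    if (q->numCancelled == 0)
        return false;
    *id = q->cancelled[--q->numCancelled];
    q->inFlight--;
    return true;
}
```

No function here allocates memory, which is the property the design is after.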

> * when you do the RECV, you can specify a flag or some such asking
> that you get signaled / notified about the program leaving the syscall

I think filter setup time is easier to deal with than RECV time.

> * a multiplexed receive that can say if an existing notification in progress
> has left the valid state.

Or alternatively a separate ioctl for receiving cancellation messages,
which you'd only call on EPOLLPRI.

2020-10-29 08:46:11

by Jann Horn

Subject: Re: For review: seccomp_user_notif(2) manual page

On Wed, Oct 28, 2020 at 11:56 PM Kees Cook <[email protected]> wrote:
> On Mon, Oct 26, 2020 at 11:31:01AM +0100, Jann Horn wrote:
> > Or I guess we could also just set O_NONBLOCK on the fd by default?
> > Since the one existing user is eventloop-based...
>
> I thought about that initially, but it rubs me the wrong way: it
> violates least-surprise for me. File descriptors are expected to be
> default-blocking. It *is* a special fd, though, so maybe it could work.
> The only case I can think of that it would break is the ioctl-loop case,
> which is already buggy in that it doesn't handle non-zero returns?

We don't have any actual users that use the API that way outside of
the kernel's selftest/sample code, right?

2020-10-29 08:47:09

by Jann Horn

Subject: Re: For review: seccomp_user_notif(2) manual page

On Wed, Oct 28, 2020 at 11:53 PM Kees Cook <[email protected]> wrote:
> On Mon, Oct 26, 2020 at 10:51:02AM +0100, Jann Horn wrote:
> > The problem is the scenario where a process is interrupted while it's
> > waiting for the supervisor to reply.
> >
> > Consider the following scenario (with supervisor "S" and target "T"; S
> > wants to wait for events on two file descriptors seccomp_fd and
> > other_fd):
> >
> > S: starts poll() to wait for events on seccomp_fd and other_fd
> > T: performs a syscall that's filtered with RET_USER_NOTIF
> > S: poll() returns and signals readiness of seccomp_fd
> > T: receives signal SIGUSR1
> > T: syscall aborts, enters signal handler
> > T: signal handler blocks on unfiltered syscall (e.g. write())
> > S: starts SECCOMP_IOCTL_NOTIF_RECV
> > S: blocks because no syscalls are pending
>
> Oooh, yes, ew. Thanks for the illustration.
>
> Thinking about this from userspace's least-surprise view, I would expect
> the "recv" to stay "queued", in the sense we'd see this:
>
> S: starts poll() to wait for events on seccomp_fd and other_fd
> T: performs a syscall that's filtered with RET_USER_NOTIF
> S: poll() returns and signals readiness of seccomp_fd
> T: receives signal SIGUSR1
> T: syscall aborts, enters signal handler
> T: signal handler blocks on unfiltered syscall (e.g. write())
> S: starts SECCOMP_IOCTL_NOTIF_RECV
> S: gets (stale) seccomp_notif from seccomp_fd
> S: sends seccomp_notif_resp, receives ENOENT (or some better errno?)
>
> This is not at all how things are designed internally right now, but
> that behavior would work, yes?

It would be really ugly, but it could theoretically be made to work,
to some degree.


The first bit of trouble is that currently the notification lives on
the stack of the target process. If we want to be able to show
userspace the stale notification, we'd have to store it elsewhere. And
since we really don't want to start randomly throwing -ENOMEM in any
of this stuff, we'd basically have to store it in pre-allocated memory
inside the filter.


The second bit of trouble is that if the supervisor is so oblivious
that it doesn't realize that syscalls can be interrupted, it'll run
into other problems. Let's say the target process does something like
this:

int func(void) {
char pathbuf[4096];
sprintf(pathbuf, "/tmp/blah.%d", some_number);
mount("foo", pathbuf, ...);
}

and mount() is handled with a notification. If the supervisor just
reads the path string and immediately passes it into the real mount()
syscall, something like this can happen:

target: starts mount()
target: receives signal, aborts mount()
target: runs signal handler, returns from signal handler
target: returns out of func()
supervisor: receives notification
supervisor: reads path from remote buffer
supervisor: calls mount()

but because the stack allocation has already been freed by the time
the supervisor reads it, the supervisor just reads random garbage, and
beautiful fireworks ensue.

So the supervisor *fundamentally* has to be written to expect that at
*any* time, the target can abandon a syscall. And every read of remote
memory has to be separated from uses of that remote memory by a
notification ID recheck.

And at that point, I think it's reasonable to expect the supervisor to
also be able to handle that a syscall can be aborted before the
notification is delivered.
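
The "read, then recheck, then use" rule can be sketched as follows. This is an illustration under stated assumptions, not the manual page's code: readTargetPath(), openTargetMem(), idCheckFn and stubIdCheck() are hypothetical names, and the validity check is passed in as a callback so that the sketch can be run without a seccomp filter. In a real supervisor the callback would wrap ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &req->id).

```c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns true if the notification is still alive. */
typedef bool (*idCheckFn)(void *ctx);

/* Stand-in check for experiments: 'ctx' points to a bool. A real
   supervisor would issue SECCOMP_IOCTL_NOTIF_ID_VALID here. */
static bool
stubIdCheck(void *ctx)
{
    return *(bool *) ctx;
}

/* Open /proc/PID/mem of the target for reading; -1 on error. */
static int
openTargetMem(pid_t pid)
{
    char path[64];

    snprintf(path, sizeof(path), "/proc/%ld/mem", (long) pid);
    return open(path, O_RDONLY | O_CLOEXEC);
}

/* Copy 'len' bytes at 'addr' in the target into 'buf'. Returns true
   only if the read succeeded, the notification was still valid AFTER
   the read, and the bytes read contain a null terminator. Only then
   may the caller use 'buf' as a pathname. */
static bool
readTargetPath(int memFd, uint64_t addr, char *buf, size_t len,
               idCheckFn idStillValid, void *ctx)
{
    assert(buf != NULL && len > 0);

    ssize_t nread = pread(memFd, buf, len, (off_t) addr);
    if (nread <= 0)
        return false;

    /* The target may have abandoned the syscall while we were
       reading, in which case 'buf' may hold garbage. */
    if (!idStillValid(ctx))
        return false;

    return strnlen(buf, (size_t) nread) < (size_t) nread;
}
```

The check sits between the read and the first use of the data, so a stale stack buffer such as pathbuf in the example above is rejected rather than passed to mount().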

2020-10-29 09:00:07

by Jann Horn

Subject: Re: For review: seccomp_user_notif(2) manual page

On Thu, Oct 29, 2020 at 3:13 AM Tycho Andersen <[email protected]> wrote:
> > > Consider the following scenario (with supervisor "S" and target "T"; S
> > > wants to wait for events on two file descriptors seccomp_fd and
> > > other_fd):
> > >
> > > S: starts poll() to wait for events on seccomp_fd and other_fd
> > > T: performs a syscall that's filtered with RET_USER_NOTIF
> > > S: poll() returns and signals readiness of seccomp_fd
> > > T: receives signal SIGUSR1
> > > T: syscall aborts, enters signal handler
> > > T: signal handler blocks on unfiltered syscall (e.g. write())
> > > S: starts SECCOMP_IOCTL_NOTIF_RECV
> > > S: blocks because no syscalls are pending
> > >
> > > Depending on what other_fd is, this could in a worst case even lead to
> > > a deadlock (if e.g. the signal handler wants to write to stdout, but
> > > the stdout fd is hooked up to other_fd in the supervisor, but the
> > > supervisor can't consume the data written because it's stuck in
> > > seccomp handling).
> > >
> > > So we have to ensure that when existing code (like that crun code you
> > > linked to) triggers this case, SECCOMP_IOCTL_NOTIF_RECV returns
> > > immediately instead of blocking.
> >
> > Or I guess we could also just set O_NONBLOCK on the fd by default?
> > Since the one existing user is eventloop-based...
>
> I feel like it's ok to return an error from the RECV ioctl() if
> there's never going to be any more events on the fd; was there
> something fundamentally wrong with your patch here:
> https://lore.kernel.org/bpf/CAG48ez2xn+_KznEztJ-eVTsTzkbf9CVgPqaAk7TpRNAqbdaRoA@mail.gmail.com/
> ?

No, I have a new version of that about 80% done and hope to send it
out soonish. (There's some stuff around tests that I still need to
cobble together).

2020-10-29 09:59:32

by Kees Cook

Subject: Re: For review: seccomp_user_notif(2) manual page

On Mon, Oct 26, 2020 at 10:51:02AM +0100, Jann Horn wrote:
> The problem is the scenario where a process is interrupted while it's
> waiting for the supervisor to reply.
>
> Consider the following scenario (with supervisor "S" and target "T"; S
> wants to wait for events on two file descriptors seccomp_fd and
> other_fd):
>
> S: starts poll() to wait for events on seccomp_fd and other_fd
> T: performs a syscall that's filtered with RET_USER_NOTIF
> S: poll() returns and signals readiness of seccomp_fd
> T: receives signal SIGUSR1
> T: syscall aborts, enters signal handler
> T: signal handler blocks on unfiltered syscall (e.g. write())
> S: starts SECCOMP_IOCTL_NOTIF_RECV
> S: blocks because no syscalls are pending

Oooh, yes, ew. Thanks for the illustration.

Thinking about this from userspace's least-surprise view, I would expect
the "recv" to stay "queued", in the sense we'd see this:

S: starts poll() to wait for events on seccomp_fd and other_fd
T: performs a syscall that's filtered with RET_USER_NOTIF
S: poll() returns and signals readiness of seccomp_fd
T: receives signal SIGUSR1
T: syscall aborts, enters signal handler
T: signal handler blocks on unfiltered syscall (e.g. write())
S: starts SECCOMP_IOCTL_NOTIF_RECV
S: gets (stale) seccomp_notif from seccomp_fd
S: sends seccomp_notif_resp, receives ENOENT (or some better errno?)

This is not at all how things are designed internally right now, but
that behavior would work, yes?

--
Kees Cook

2020-10-29 09:59:46

by Kees Cook

Subject: Re: For review: seccomp_user_notif(2) manual page

On Mon, Oct 26, 2020 at 11:31:01AM +0100, Jann Horn wrote:
> Or I guess we could also just set O_NONBLOCK on the fd by default?
> Since the one existing user is eventloop-based...

I thought about that initially, but it rubs me the wrong way: it
violates least-surprise for me. File descriptors are expected to be
default-blocking. It *is* a special fd, though, so maybe it could work.
The only case I can think of that it would break is the ioctl-loop case,
which is already buggy in that it doesn't handle non-zero returns?

--
Kees Cook

2020-10-29 10:00:12

by Jann Horn

Subject: Re: For review: seccomp_user_notif(2) manual page

On Wed, Oct 28, 2020 at 7:32 AM Sargun Dhillon <[email protected]> wrote:
> On Tue, Oct 27, 2020 at 3:28 AM Jann Horn <[email protected]> wrote:
> > On Tue, Oct 27, 2020 at 7:14 AM Michael Kerrisk (man-pages)
> > <[email protected]> wrote:
> > > On 10/26/20 4:54 PM, Jann Horn wrote:
> > > > I'm a bit on the fence now on whether non-blocking mode should use
> > > > ENOTCONN or not... I guess if we returned ENOENT even when there are
> > > > no more listeners, you'd have to disambiguate through the poll()
> > > > revents, which would be kinda ugly?
> > >
> > > I must confess, I'm not quite clear on which two cases you
> > > are trying to distinguish. Can you elaborate?
> >
> > Let's say someone writes a program whose responsibilities are just to
> > handle seccomp events and to listen on some other fd for commands. And
> > this is implemented with an event loop. Then once all the target
> > processes are gone (including zombie reaping), we'll start getting
> > EPOLLERR.
> >
> > If NOTIF_RECV starts returning -ENOTCONN at this point, the event loop
> > can just call into the seccomp logic without any arguments; it can
> > just call NOTIF_RECV one more time, see the -ENOTCONN, and terminate.
> > The downside is that there's one more error code userspace has to
> > special-case.
> > This would be more consistent with what we'd be doing in the blocking case.
> >
> > If NOTIF_RECV keeps returning -ENOENT, the event loop has to also tell
> > the seccomp logic what the revents are.
> >
> > I guess it probably doesn't really matter much.
>
> So, in practice, if you're emulating a blocking syscall (such as open,
> perf_event_open, or any of a number of other syscalls), you probably
> have to do it on a separate thread in the supervisor because you want
> to continue to be able to receive new notifications if any other process
> generates a seccomp notification event that you need to handle.
>
> In addition to that, some of these syscalls are preemptible, so you need
> to poll SECCOMP_IOCTL_NOTIF_ID_VALID to make sure that the program
> under supervision hasn't left the syscall.
>
> If we're to implement a mechanism that makes the seccomp ioctl receive
> non-blocking, it would be valuable to address this problem as well (getting
> a notification when the supervisor is processing a syscall and needs to
> preempt it). In the best case, this can be a minor inconvenience, and
> in the worst case this can result in weird errors where you're keeping
> resources open that the container expects to be closed.

Does "a notification" mean signals? Or would you want to have a second
thread in userspace that poll()s for cancellation events on the
seccomp fd and then somehow takes care of interrupting the first
thread, or something like that?

Either way, I think your proposal goes beyond the scope of patching
the existing weirdness, and should be a separate patch.

2020-10-29 10:00:44

by Sargun Dhillon

Subject: Re: For review: seccomp_user_notif(2) manual page

On Tue, Oct 27, 2020 at 3:28 AM Jann Horn <[email protected]> wrote:
>
> On Tue, Oct 27, 2020 at 7:14 AM Michael Kerrisk (man-pages)
> <[email protected]> wrote:
> > On 10/26/20 4:54 PM, Jann Horn wrote:
> > > I'm a bit on the fence now on whether non-blocking mode should use
> > > ENOTCONN or not... I guess if we returned ENOENT even when there are
> > > no more listeners, you'd have to disambiguate through the poll()
> > > revents, which would be kinda ugly?
> >
> > I must confess, I'm not quite clear on which two cases you
> > are trying to distinguish. Can you elaborate?
>
> Let's say someone writes a program whose responsibilities are just to
> handle seccomp events and to listen on some other fd for commands. And
> this is implemented with an event loop. Then once all the target
> processes are gone (including zombie reaping), we'll start getting
> EPOLLERR.
>
> If NOTIF_RECV starts returning -ENOTCONN at this point, the event loop
> can just call into the seccomp logic without any arguments; it can
> just call NOTIF_RECV one more time, see the -ENOTCONN, and terminate.
> The downside is that there's one more error code userspace has to
> special-case.
> This would be more consistent with what we'd be doing in the blocking case.
>
> If NOTIF_RECV keeps returning -ENOENT, the event loop has to also tell
> the seccomp logic what the revents are.
>
> I guess it probably doesn't really matter much.

So, in practice, if you're emulating a blocking syscall (such as open,
perf_event_open, or any of a number of other syscalls), you probably
have to do it on a separate thread in the supervisor because you want
to continue to be able to receive new notifications if any other process
generates a seccomp notification event that you need to handle.

In addition to that, some of these syscalls are preemptible, so you need
to poll SECCOMP_IOCTL_NOTIF_ID_VALID to make sure that the program
under supervision hasn't left the syscall.

If we're to implement a mechanism that makes the seccomp ioctl receive
non-blocking, it would be valuable to address this problem as well (getting
a notification when the supervisor is processing a syscall and needs to
preempt it). In the best case, this can be a minor inconvenience, and
in the worst case this can result in weird errors where you're keeping
resources open that the container expects to be closed.

2020-10-29 10:01:21

by Sargun Dhillon

Subject: Re: For review: seccomp_user_notif(2) manual page

On Wed, Oct 28, 2020 at 2:43 AM Jann Horn <[email protected]> wrote:
>
> On Wed, Oct 28, 2020 at 7:32 AM Sargun Dhillon <[email protected]> wrote:
> > On Tue, Oct 27, 2020 at 3:28 AM Jann Horn <[email protected]> wrote:
> > > On Tue, Oct 27, 2020 at 7:14 AM Michael Kerrisk (man-pages)
> > > <[email protected]> wrote:
> > > > On 10/26/20 4:54 PM, Jann Horn wrote:
> > > > > I'm a bit on the fence now on whether non-blocking mode should use
> > > > > ENOTCONN or not... I guess if we returned ENOENT even when there are
> > > > > no more listeners, you'd have to disambiguate through the poll()
> > > > > revents, which would be kinda ugly?
> > > >
> > > > I must confess, I'm not quite clear on which two cases you
> > > > are trying to distinguish. Can you elaborate?
> > >
> > > Let's say someone writes a program whose responsibilities are just to
> > > handle seccomp events and to listen on some other fd for commands. And
> > > this is implemented with an event loop. Then once all the target
> > > processes are gone (including zombie reaping), we'll start getting
> > > EPOLLERR.
> > >
> > > If NOTIF_RECV starts returning -ENOTCONN at this point, the event loop
> > > can just call into the seccomp logic without any arguments; it can
> > > just call NOTIF_RECV one more time, see the -ENOTCONN, and terminate.
> > > The downside is that there's one more error code userspace has to
> > > special-case.
> > > This would be more consistent with what we'd be doing in the blocking case.
> > >
> > > If NOTIF_RECV keeps returning -ENOENT, the event loop has to also tell
> > > the seccomp logic what the revents are.
> > >
> > > I guess it probably doesn't really matter much.
> >
> > So, in practice, if you're emulating a blocking syscall (such as open,
> > perf_event_open, or any of a number of other syscalls), you probably
> > have to do it on a separate thread in the supervisor because you want
> > to continue to be able to receive new notifications if any other process
> > generates a seccomp notification event that you need to handle.
> >
> > In addition to that, some of these syscalls are preemptible, so you need
> > to poll SECCOMP_IOCTL_NOTIF_ID_VALID to make sure that the program
> > under supervision hasn't left the syscall.
> >
> > If we're to implement a mechanism that makes the seccomp ioctl receive
> > non-blocking, it would be valuable to address this problem as well (getting
> > a notification when the supervisor is processing a syscall and needs to
> > preempt it). In the best case, this can be a minor inconvenience, and
> > in the worst case this can result in weird errors where you're keeping
> > resources open that the container expects to be closed.
>
> Does "a notification" mean signals? Or would you want to have a second
> thread in userspace that poll()s for cancellation events on the
> seccomp fd and then somehow takes care of interrupting the first
> thread, or something like that?

I would be reluctant to prescribe that it be a signal. Right now, it's
implemented as a second thread in userspace that does an ioctl(...) to
check whether the notification is still valid / alive, and does what's
required if the notification has died (interrupting the first thread).
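
A minimal sketch of what the body of such a watcher thread might look like,
assuming the supervisor already holds the notification fd and the ID from
SECCOMP_IOCTL_NOTIF_RECV. The names notif_check(), notif_watch(), and the
done / cancel flags are hypothetical and only for illustration; how the
first thread is actually interrupted is left to the caller.

```c
#include <errno.h>
#include <linux/seccomp.h>
#include <poll.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>

enum notif_state { NOTIF_ERROR = -1, NOTIF_GONE = 0, NOTIF_ALIVE = 1 };

/* Ask the kernel whether the notification with this ID is still pending.
 * ENOENT means the target has left the syscall (or has died). */
static enum notif_state notif_check(int notify_fd, uint64_t id)
{
    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == 0)
        return NOTIF_ALIVE;
    return errno == ENOENT ? NOTIF_GONE : NOTIF_ERROR;
}

/* Hypothetical watcher-thread body: re-check the ID every interval_ms
 * until the worker sets *done. If the ID stops being valid (or the check
 * fails), set *cancel so the worker can abandon the emulated syscall. */
static enum notif_state notif_watch(int notify_fd, uint64_t id,
                                    atomic_bool *done, atomic_bool *cancel,
                                    int interval_ms)
{
    while (!atomic_load(done)) {
        enum notif_state st = notif_check(notify_fd, id);

        if (st != NOTIF_ALIVE) {
            atomic_store(cancel, true);
            return st;
        }
        /* poll() with no fds is used here purely as a millisecond sleep. */
        poll(NULL, 0, interval_ms);
    }
    return NOTIF_ALIVE;
}
```

In this sketch the worker would set done after sending its response and
would look at cancel at whatever points it can safely bail out; a real
supervisor might instead use pthread_kill() or an eventfd to wake a worker
that is blocked in the emulated syscall.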

>
> Either way, I think your proposal goes beyond the scope of patching
> the existing weirdness, and should be a separate patch.

I agree it should be a separate patch, but I think that it'd be nice if there
was a way to do something like:
* opt-in to getting another message after receiving the notification
  that indicates the program has left the syscall
* when you do the RECV, you can specify a flag or some such asking
  that you get signaled / notified about the program leaving the syscall
* a multiplexed receive that can say if an existing notification in progress
  has left the valid state.

---
The reason I bring this up as part of this current thread / discussion is that
I think that they may be related in terms of how we want the behaviour to act.

I would love to hear how people think this should work, or better suggestions
than the second thread approach above, or the alternative approach of
polling all the notifications in progress on some interval [and relying on
epoll timeout to trigger that interval].
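
For comparison, the interval-polling alternative could be sketched roughly
as follows, assuming the notification fd is already registered with an
epoll instance and the supervisor keeps its own table of in-progress
notifications. struct inflight, sweep_inflight(), and wait_or_sweep() are
hypothetical names, not part of any existing API.

```c
#include <errno.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/ioctl.h>

/* Hypothetical bookkeeping for one notification being handled. */
struct inflight {
    uint64_t id;     /* ID returned by SECCOMP_IOCTL_NOTIF_RECV */
    int cancelled;   /* set once the target has left the syscall */
};

/* Check every in-progress ID and mark the ones the kernel reports as
 * no longer valid (ENOENT). Other ioctl errors are ignored in this
 * sketch. Returns the number of entries newly marked. */
static size_t sweep_inflight(int notify_fd, struct inflight *tab, size_t n)
{
    size_t gone = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t id = tab[i].id;

        if (tab[i].cancelled)
            continue;
        if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1 &&
            errno == ENOENT) {
            tab[i].cancelled = 1;
            gone++;
        }
    }
    return gone;
}

/* One event-loop iteration. Returns 1 if an fd is ready (the caller
 * would then do NOTIF_RECV), 0 if the timeout expired and a sweep was
 * run, or -1 on error (including EINTR, which the caller may retry). */
static int wait_or_sweep(int epfd, int notify_fd,
                         struct inflight *tab, size_t n, int interval_ms)
{
    struct epoll_event ev;
    int r = epoll_wait(epfd, &ev, 1, interval_ms);

    if (r == 0)
        sweep_inflight(notify_fd, tab, n);
    return r;
}
```

The cost of this approach is latency: a target that leaves the syscall is
only noticed at the next timeout, and a busy notification fd can postpone
the sweep indefinitely unless the loop also tracks elapsed time.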