2021-04-30 20:50:55

by Sargun Dhillon

[permalink] [raw]
Subject: [PATCH v2 0/5] Handle seccomp notification preemption

This patchset addresses a race condition we've dealt with recently with
seccomp. Specifically programs interrupting syscalls while they're in
progress. This was exacerbated by Golang's recent adoption of "async
preemption", in which they try to interrupt any syscall that's been
running for more than 10ms during GC. During certain syscalls, it's
non-trivial to write them in a reetrant manner in userspace (mount).

This has a couple semantic changes, and relaxes a check on seccomp_data, and
changes the semantics with ordering of how addfd and notification replies
in the supervisor are handled.

I'm resending after rebasing and testing on v5.12. It turns out this change
also fixed a bug Rodrigo found that could occur with addfd around certain
race conditions[2].

It also follows up on the original proposal from Tycho[3] to allow
for adding an FD and returning that value atomically.

Changes since v1:
* Change from a flag, to an ioctl that explicitly sets the killable
state after reading the notification.
* Adds some more tests to make sure that fatal signals are handled.

Changes since RFC[1]:
* Fix some documentation
* Add Rata's patches to allow for direct return from addfd

[1]: https://lore.kernel.org/lkml/[email protected]/
[2]: https://lore.kernel.org/lkml/[email protected]/
[3]: https://lore.kernel.org/lkml/202012011322.26DCBC64F2@keescook/

Rodrigo Campos (2):
seccomp: Support atomic "addfd + send reply"
selftests/seccomp: Add test for atomic addfd+send

Sargun Dhillon (3):
seccomp: Refactor notification handler to prepare for new semantics
seccomp: Add wait_killable semantic to seccomp user notifier
selftests/seccomp: Add test for wait killable notifier

.../userspace-api/seccomp_filter.rst | 22 +-
include/uapi/linux/seccomp.h | 3 +
kernel/seccomp.c | 146 ++++++++++++--
tools/testing/selftests/seccomp/seccomp_bpf.c | 190 ++++++++++++++++++
4 files changed, 335 insertions(+), 26 deletions(-)

--
2.25.1


2021-04-30 20:51:09

by Sargun Dhillon

[permalink] [raw]
Subject: [PATCH v2 5/5] selftests/seccomp: Add test for atomic addfd+send

From: Rodrigo Campos <[email protected]>

This just adds a test to verify that when using the new introduced flag
to ADDFD, a valid fd is added and returned as the syscall result.

Signed-off-by: Rodrigo Campos <[email protected]>
Signed-off-by: Sargun Dhillon <[email protected]>
---
tools/testing/selftests/seccomp/seccomp_bpf.c | 38 +++++++++++++++++++
1 file changed, 38 insertions(+)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 9a72ba8bb4f7..be911809212f 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -240,6 +240,10 @@ struct seccomp_notif_addfd {
#define SECCOMP_IOCTL_NOTIF_SET_WAIT_KILLABLE SECCOMP_IOW(4, __u64)
#endif

+#ifndef SECCOMP_ADDFD_FLAG_SEND
+#define SECCOMP_ADDFD_FLAG_SEND (1UL << 1) /* Addfd and return it, atomically */
+#endif
+
struct seccomp_notif_addfd_small {
__u64 id;
char weird[4];
@@ -3981,8 +3985,14 @@ TEST(user_notification_addfd)
ASSERT_GE(pid, 0);

if (pid == 0) {
+ /* fds will be added and this value is expected */
if (syscall(__NR_getppid) != USER_NOTIF_MAGIC)
exit(1);
+
+ /* Atomic addfd+send is received here. Check it is a valid fd */
+ if (fcntl(syscall(__NR_getppid), F_GETFD) == -1)
+ exit(1);
+
exit(syscall(__NR_getppid) != USER_NOTIF_MAGIC);
}

@@ -4061,6 +4071,30 @@ TEST(user_notification_addfd)
ASSERT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
ASSERT_EQ(addfd.id, req.id);

+ /* Verify we can do an atomic addfd and send */
+ addfd.newfd = 0;
+ addfd.flags = SECCOMP_ADDFD_FLAG_SEND;
+ fd = ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);
+
+ /* Child has fds 0-6 and 42 used, we expect the lower fd available: 7 */
+ EXPECT_EQ(fd, 7);
+ EXPECT_EQ(filecmp(getpid(), pid, memfd, fd), 0);
+
+ /*
+ * This sets the ID of the ADD FD to the last request plus 1. The
+ * notification ID increments 1 per notification.
+ */
+ addfd.id = req.id + 1;
+
+ /* This spins until the underlying notification is generated */
+ while (ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd) != -1 &&
+ errno != -EINPROGRESS)
+ nanosleep(&delay, NULL);
+
+ memset(&req, 0, sizeof(req));
+ ASSERT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+ ASSERT_EQ(addfd.id, req.id);
+
resp.id = req.id;
resp.error = 0;
resp.val = USER_NOTIF_MAGIC;
@@ -4121,6 +4155,10 @@ TEST(user_notification_addfd_rlimit)
EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
EXPECT_EQ(errno, EMFILE);

+ addfd.flags = SECCOMP_ADDFD_FLAG_SEND;
+ EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
+ EXPECT_EQ(errno, EMFILE);
+
addfd.newfd = 100;
addfd.flags = SECCOMP_ADDFD_FLAG_SETFD;
EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
--
2.25.1

2021-04-30 20:51:23

by Sargun Dhillon

[permalink] [raw]
Subject: [PATCH v2 4/5] seccomp: Support atomic "addfd + send reply"

From: Rodrigo Campos <[email protected]>

Alban Crequy reported a race condition userspace faces when we want to
add some fds and make the syscall return them[1] using seccomp notify.

The problem is that currently two different ioctl() calls are needed by
the process handling the syscalls (agent) for another userspace process
(target): SECCOMP_IOCTL_NOTIF_ADDFD to allocate the fd and
SECCOMP_IOCTL_NOTIF_SEND to return that value. Therefore, it is possible
for the agent to do the first ioctl to add a file descriptor but the
target is interrupted (EINTR) before the agent does the second ioctl()
call.

Other patches in this series add a way to block signals when a syscall
is put to wait by seccomp. However, that might be a big hammer for some
cases, as the golang runtime uses SIGURG to interrupt threads for GC
collection. Sometimes we just don't want to interfere with the GC, for
example, and just either add the fd and return it or fail the syscall.
With no leaking fds added inadvertly to the target process.

This patch adds a flag to the ADDFD ioctl() so it adds the fd and
returns that value atomically to the target program, as suggested by
Kees Cook[2]. This is done by simply allowing
seccomp_do_user_notification() to add the fd and return it in this case.
Therefore, in this case the target wakes up from the wait in
seccomp_do_user_notification() either to interrupt the syscall or to add
the fd and return it.

This "allocate an fd and return" functionality is useful for syscalls
that return a file descriptor only, like connect(2). Other syscalls that
return a file descriptor but not as return value (or return more than
one fd), like socketpair(), pipe(), recvmsg with SCM_RIGHTs, will not
work with this flag. The way to go to emulate those in cases where a
signal might interrupt is to use the functionality to block signals.

The struct seccomp_notif_resp, used when doing SECCOMP_IOCTL_NOTIF_SEND
ioctl() to send a response to the target, has three more fields that we
don't allow to set when doing the addfd ioctl() to also return. The
reasons to disallow each field are:
* val: This will be set to the new allocated fd. No point taking it
from userspace in this case.
* error: If this is non-zero, the value is ignored. Therefore,
it is pointless in this case as we want to return the value.
* flags: The only flag is to let userspace continue to execute the
syscall. This seems pointless, as we want the syscall to return the
allocated fd.

This is why those fields are not possible to set when using this new
flag.

[1]: https://lore.kernel.org/lkml/CADZs7q4sw71iNHmV8EOOXhUKJMORPzF7thraxZYddTZsxta-KQ@mail.gmail.com/
[2]: https://lore.kernel.org/lkml/202012011322.26DCBC64F2@keescook/

Signed-off-by: Rodrigo Campos <[email protected]>
Signed-off-by: Sargun Dhillon <[email protected]>
---
include/uapi/linux/seccomp.h | 1 +
kernel/seccomp.c | 49 +++++++++++++++++++++++++++++++++---
2 files changed, 46 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 71dbc1f7889f..6a14c39a6e05 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -115,6 +115,7 @@ struct seccomp_notif_resp {

/* valid flags for seccomp_notif_addfd */
#define SECCOMP_ADDFD_FLAG_SETFD (1UL << 0) /* Specify remote fd */
+#define SECCOMP_ADDFD_FLAG_SEND (1UL << 1) /* Addfd and return it, atomically */

/**
* struct seccomp_notif_addfd
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 7ac1cea6e7f0..eed3294f3df8 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -109,6 +109,7 @@ struct seccomp_knotif {
* installing process should allocate the fd as normal.
* @flags: The flags for the new file descriptor. At the moment, only O_CLOEXEC
* is allowed.
+ * @ioctl_flags: The flags used for the seccomp_addfd ioctl.
* @ret: The return value of the installing process. It is set to the fd num
* upon success (>= 0).
* @completion: Indicates that the installing process has completed fd
@@ -120,6 +121,7 @@ struct seccomp_kaddfd {
struct file *file;
int fd;
unsigned int flags;
+ __u32 ioctl_flags;

/* To only be set on reply */
int ret;
@@ -1064,14 +1066,35 @@ static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
return filter->notif->next_id++;
}

-static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd)
+static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd, struct seccomp_knotif *n)
{
+ int fd;
+
/*
* Remove the notification, and reset the list pointers, indicating
* that it has been handled.
*/
list_del_init(&addfd->list);
- addfd->ret = receive_fd_replace(addfd->fd, addfd->file, addfd->flags);
+ fd = receive_fd_replace(addfd->fd, addfd->file, addfd->flags);
+
+ addfd->ret = fd;
+
+ if (addfd->ioctl_flags & SECCOMP_ADDFD_FLAG_SEND) {
+ /* If we fail reset and return an error to the notifier */
+ if (fd < 0) {
+ n->state = SECCOMP_NOTIFY_SENT;
+ } else {
+ /* Return the FD we just added */
+ n->flags = 0;
+ n->error = 0;
+ n->val = fd;
+ }
+ }
+
+ /*
+ * Mark the notification as completed. From this point, addfd mem
+ * might be invalidated and we can't safely read it anymore.
+ */
complete(&addfd->completion);
}

@@ -1143,7 +1166,7 @@ static int seccomp_do_user_notification(int this_syscall,
struct seccomp_kaddfd, list);
/* Check if we were woken up by a addfd message */
if (addfd)
- seccomp_handle_addfd(addfd);
+ seccomp_handle_addfd(addfd, &n);

} while (n.state != SECCOMP_NOTIFY_REPLIED);

@@ -1610,7 +1633,7 @@ static long seccomp_notify_addfd(struct seccomp_filter *filter,
if (addfd.newfd_flags & ~O_CLOEXEC)
return -EINVAL;

- if (addfd.flags & ~SECCOMP_ADDFD_FLAG_SETFD)
+ if (addfd.flags & ~(SECCOMP_ADDFD_FLAG_SETFD | SECCOMP_ADDFD_FLAG_SEND))
return -EINVAL;

if (addfd.newfd && !(addfd.flags & SECCOMP_ADDFD_FLAG_SETFD))
@@ -1620,6 +1643,7 @@ static long seccomp_notify_addfd(struct seccomp_filter *filter,
if (!kaddfd.file)
return -EBADF;

+ kaddfd.ioctl_flags = addfd.flags;
kaddfd.flags = addfd.newfd_flags;
kaddfd.fd = (addfd.flags & SECCOMP_ADDFD_FLAG_SETFD) ?
addfd.newfd : -1;
@@ -1645,6 +1669,23 @@ static long seccomp_notify_addfd(struct seccomp_filter *filter,
goto out_unlock;
}

+ if (addfd.flags & SECCOMP_ADDFD_FLAG_SEND) {
+ /*
+ * Disallow queuing an atomic addfd + send reply while there are
+ * some addfd requests still to process.
+ *
+ * There is no clear reason to support it and allows us to keep
+ * the loop on the other side straight-forward.
+ */
+ if (!list_empty(&knotif->addfd)) {
+ ret = -EBUSY;
+ goto out_unlock;
+ }
+
+ /* Allow exactly only one reply */
+ knotif->state = SECCOMP_NOTIFY_REPLIED;
+ }
+
list_add(&kaddfd.list, &knotif->addfd);
complete(&knotif->ready);
mutex_unlock(&filter->notify_lock);
--
2.25.1

2021-04-30 20:52:04

by Sargun Dhillon

[permalink] [raw]
Subject: [PATCH v2 1/5] seccomp: Refactor notification handler to prepare for new semantics

This refactors the user notification code to have a do / while loop around
the completion condition. This has a small change in semantic, in that
previously we ignored addfd calls upon wakeup if the notification had been
responded to, but instead with the new change we check for an outstanding
addfd calls prior to returning to userspace.

Rodrigo Campos also identified a bug that can result in addfd causing
an early return, when the supervisor didn't actually handle the syscall [1].

[1]: https://lore.kernel.org/lkml/[email protected]/

Fixes: 7cf97b125455 ("seccomp: Introduce addfd ioctl to seccomp user notifier")
Signed-off-by: Sargun Dhillon <[email protected]>
---
kernel/seccomp.c | 30 ++++++++++++++++--------------
1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 1d60fc2c9987..93684cc63285 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1098,28 +1098,30 @@ static int seccomp_do_user_notification(int this_syscall,

up(&match->notif->request);
wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);
- mutex_unlock(&match->notify_lock);

/*
* This is where we wait for a reply from userspace.
*/
-wait:
- err = wait_for_completion_interruptible(&n.ready);
- mutex_lock(&match->notify_lock);
- if (err == 0) {
- /* Check if we were woken up by a addfd message */
+ do {
+ mutex_unlock(&match->notify_lock);
+ err = wait_for_completion_interruptible(&n.ready);
+ mutex_lock(&match->notify_lock);
+ if (err != 0)
+ goto interrupted;
+
addfd = list_first_entry_or_null(&n.addfd,
struct seccomp_kaddfd, list);
- if (addfd && n.state != SECCOMP_NOTIFY_REPLIED) {
+ /* Check if we were woken up by a addfd message */
+ if (addfd)
seccomp_handle_addfd(addfd);
- mutex_unlock(&match->notify_lock);
- goto wait;
- }
- ret = n.val;
- err = n.error;
- flags = n.flags;
- }

+ } while (n.state != SECCOMP_NOTIFY_REPLIED);
+
+ ret = n.val;
+ err = n.error;
+ flags = n.flags;
+
+interrupted:
/* If there were any pending addfd calls, clear them out */
list_for_each_entry_safe(addfd, tmp, &n.addfd, list) {
/* The process went away before we got a chance to handle it */
--
2.25.1

2021-04-30 20:52:04

by Sargun Dhillon

[permalink] [raw]
Subject: [PATCH v2 3/5] selftests/seccomp: Add test for wait killable notifier

This adds a test for the positive case of the wait killable notifier,
in testing that when the feature is activated the process acts as
expected -- in not terminating on a non-fatal signal, and instead
queueing it up. There is already a test case for normal handlers
and preemption.

Signed-off-by: Sargun Dhillon <[email protected]>
---
tools/testing/selftests/seccomp/seccomp_bpf.c | 152 ++++++++++++++++++
1 file changed, 152 insertions(+)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 98c3b647f54d..9a72ba8bb4f7 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -235,6 +235,11 @@ struct seccomp_notif_addfd {
};
#endif

+#ifndef SECCOMP_IOCTL_NOTIF_SET_WAIT_KILLABLE
+/* Set flag to prevent non-fatal signal preemption */
+#define SECCOMP_IOCTL_NOTIF_SET_WAIT_KILLABLE SECCOMP_IOW(4, __u64)
+#endif
+
struct seccomp_notif_addfd_small {
__u64 id;
char weird[4];
@@ -4135,6 +4140,153 @@ TEST(user_notification_addfd_rlimit)
close(memfd);
}

+/* Parses /proc/$PID/status, and stores the state and shdpnd */
+static void parse_status(pid_t pid, char *state, long long *shdpnd)
+{
+ char *line = NULL;
+ char proc_path[100] = {0};
+ size_t len = 0;
+ FILE *f;
+
+ snprintf(proc_path, sizeof(proc_path), "/proc/%d/status", pid);
+ f = fopen(proc_path, "r");
+ if (f == NULL)
+ ksft_exit_fail_msg("%s - Could not open %s\n",
+ strerror(errno), proc_path);
+
+ while (getline(&line, &len, f) != -1) {
+ if (strstr(line, "State")) {
+ if (sscanf(line, "State:\t%c", state) != 1)
+ ksft_exit_fail_msg("Couldn't parse state %s\n",
+ line);
+ }
+
+ if (strstr(line, "ShdPnd")) {
+ if (sscanf(line, "ShdPnd:\t%llx", shdpnd) != 1)
+ ksft_exit_fail_msg("Couldn't parse shdpnd %s\n",
+ line);
+ }
+ }
+
+ free(line);
+ fclose(f);
+}
+
+TEST(user_notification_signal_wait_killable)
+{
+ struct seccomp_notif_resp resp = {};
+ struct seccomp_notif req = {};
+ int status, listener;
+ long long sigpnd;
+ char state;
+ pid_t pid;
+ long ret;
+
+ ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+ ASSERT_EQ(0, ret) {
+ TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+ }
+
+ listener = user_notif_syscall(__NR_gettid,
+ SECCOMP_FILTER_FLAG_NEW_LISTENER);
+ ASSERT_GE(listener, 0);
+
+ pid = fork();
+ ASSERT_GE(pid, 0);
+
+ if (pid == 0) {
+ sigset_t mask;
+
+ sigemptyset(&mask);
+ sigaddset(&mask, SIGUSR1);
+
+ EXPECT_EQ(sigprocmask(SIG_BLOCK, &mask, NULL), 0);
+
+ /* Check once for wait_killable */
+ if (syscall(__NR_gettid) != 42)
+ exit(1);
+
+ EXPECT_EQ(SIGUSR1, sigwaitinfo(&mask, NULL));
+
+ exit(42);
+ }
+
+ ASSERT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+
+ /* Wait interruptible should be in sleep */
+ parse_status(pid, &state, &sigpnd);
+ ASSERT_EQ(state, 'S');
+
+ ASSERT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SET_WAIT_KILLABLE, &req.id), 0);
+ EXPECT_EQ(kill(pid, SIGUSR1), 0);
+
+ /* Check for transition to "disk sleep" -- async (or timeout) */
+ do {
+ usleep(100);
+ parse_status(pid, &state, &sigpnd);
+ } while (state != 'D');
+ EXPECT_EQ(sigpnd, (1 << (SIGUSR1 - 1)));
+
+ /* Make sure idempotency is handled correctly */
+ EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SET_WAIT_KILLABLE, &req.id), -1);
+ EXPECT_EQ(errno, EALREADY);
+
+ resp.id = req.id;
+ resp.error = 0;
+ resp.val = 42;
+
+ EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+ EXPECT_EQ(waitpid(pid, &status, 0), pid);
+ EXPECT_EQ(true, WIFEXITED(status));
+ EXPECT_EQ(42, WEXITSTATUS(status));
+}
+
+
+TEST(user_notification_signal_wait_killable_kill)
+{
+ struct seccomp_notif req = {};
+ int status, listener;
+ long long sigpnd;
+ char state;
+ pid_t pid;
+ long ret;
+
+ ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+ ASSERT_EQ(0, ret) {
+ TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+ }
+
+ listener = user_notif_syscall(__NR_gettid,
+ SECCOMP_FILTER_FLAG_NEW_LISTENER);
+ ASSERT_GE(listener, 0);
+
+ pid = fork();
+ ASSERT_GE(pid, 0);
+
+ if (pid == 0) {
+ syscall(__NR_gettid);
+ exit(1);
+ }
+
+ ASSERT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+ ASSERT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SET_WAIT_KILLABLE, &req.id), 0);
+ do {
+ usleep(100);
+ parse_status(pid, &state, &sigpnd);
+ } while (state != 'D');
+
+ EXPECT_EQ(kill(pid, SIGKILL), 0);
+
+ EXPECT_EQ(waitpid(pid, &status, 0), pid);
+ EXPECT_EQ(true, WIFSIGNALED(status));
+ EXPECT_EQ(9, WTERMSIG(status));
+
+ /* Make sure the notification is invalidated properly */
+ EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ID_VALID, &req.id), -1);
+}
+
+
/*
* TODO:
* - expand NNP testing
--
2.25.1

2021-04-30 20:52:17

by Sargun Dhillon

[permalink] [raw]
Subject: [PATCH v2 2/5] seccomp: Add wait_killable semantic to seccomp user notifier

The user notifier feature allows for filtering of seccomp notifications in
userspace. While the user notifier is handling the syscall, the notifying
process can be preempted, thus ending the notification. This has become a
growing problem, as Golang has adopted signal based async preemption[1]. In
this, it will preempt every 10ms, thus leaving the supervisor less than
10ms to respond to a given notification. If the syscall require I/O (mount,
connect) on behalf of the process, it can easily take 10ms.

This allows the supervisor to set a flag that moves the process into a
state where it is only killable by terminating signals as opposed to all
signals. The process can still be terminated before the supervisor receives
the notification.

Signed-off-by: Sargun Dhillon <[email protected]>

[1]: https://github.com/golang/go/issues/24543
---
.../userspace-api/seccomp_filter.rst | 22 +++---
include/uapi/linux/seccomp.h | 2 +
kernel/seccomp.c | 71 ++++++++++++++++++-
3 files changed, 85 insertions(+), 10 deletions(-)

diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
index bd9165241b6c..6ba4d39baf2a 100644
--- a/Documentation/userspace-api/seccomp_filter.rst
+++ b/Documentation/userspace-api/seccomp_filter.rst
@@ -250,14 +250,20 @@ Users can read via ``ioctl(SECCOMP_IOCTL_NOTIF_RECV)`` (or ``poll()``) on a
seccomp notification fd to receive a ``struct seccomp_notif``, which contains
five members: the input length of the structure, a unique-per-filter ``id``,
the ``pid`` of the task which triggered this request (which may be 0 if the
-task is in a pid ns not visible from the listener's pid namespace), a ``flags``
-member which for now only has ``SECCOMP_NOTIF_FLAG_SIGNALED``, representing
-whether or not the notification is a result of a non-fatal signal, and the
-``data`` passed to seccomp. Userspace can then make a decision based on this
-information about what to do, and ``ioctl(SECCOMP_IOCTL_NOTIF_SEND)`` a
-response, indicating what should be returned to userspace. The ``id`` member of
-``struct seccomp_notif_resp`` should be the same ``id`` as in ``struct
-seccomp_notif``.
+task is in a pid ns not visible from the listener's pid namespace). The
+notification also contains the ``data`` passed to seccomp, and a currently
+unused filters flag.
+
+Upon receiving the notification ``ioctl(SECCOMP_IOCTL_NOTIF_SET_WAIT_KILLABLE)``
+can be used to make the notifying task only respond to fatal signals. Once the
+notification has been set as wait_killable, it cannot be unset. If the
+notification is already in the wait_killable state, EALREADY will be returned by
+the ioctl.
+
+Userspace can then make a decision based on this information about what to do,
+and ``ioctl(SECCOMP_IOCTL_NOTIF_SEND)`` a response, indicating what should be
+returned to userspace. The ``id`` member of ``struct seccomp_notif_resp`` should
+be the same ``id`` as in ``struct seccomp_notif``.

It is worth noting that ``struct seccomp_data`` contains the values of register
arguments to the syscall, but does not contain pointers to memory. The task's
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 6ba18b82a02e..71dbc1f7889f 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -146,5 +146,7 @@ struct seccomp_notif_addfd {
/* On success, the return value is the remote process's added fd number */
#define SECCOMP_IOCTL_NOTIF_ADDFD SECCOMP_IOW(3, \
struct seccomp_notif_addfd)
+/* Set flag to prevent non-fatal signal preemption */
+#define SECCOMP_IOCTL_NOTIF_SET_WAIT_KILLABLE SECCOMP_IOW(4, __u64)

#endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 93684cc63285..7ac1cea6e7f0 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -97,6 +97,8 @@ struct seccomp_knotif {

/* outstanding addfd requests */
struct list_head addfd;
+
+ bool wait_killable;
};

/**
@@ -1073,6 +1075,12 @@ static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd)
complete(&addfd->completion);
}

+/* Must be called with notify_lock held */
+static inline bool notification_wait_killable(struct seccomp_knotif *n)
+{
+ return n->state == SECCOMP_NOTIFY_SENT && n->wait_killable;
+}
+
static int seccomp_do_user_notification(int this_syscall,
struct seccomp_filter *match,
const struct seccomp_data *sd)
@@ -1103,11 +1111,33 @@ static int seccomp_do_user_notification(int this_syscall,
* This is where we wait for a reply from userspace.
*/
do {
+ int wait_killable = notification_wait_killable(&n);
+
mutex_unlock(&match->notify_lock);
- err = wait_for_completion_interruptible(&n.ready);
+ if (wait_killable)
+ err = wait_for_completion_killable(&n.ready);
+ else
+ err = wait_for_completion_interruptible(&n.ready);
mutex_lock(&match->notify_lock);
- if (err != 0)
+ if (err != 0) {
+ /*
+ * If the SECCOMP_IOCTL_NOTIF_SET_WAIT_KILLABLE was
+ * used to change the state of the handler to
+ * wait_killable, but a non-fatal signal arrived
+ * before we could complete the transition, we could
+ * erroneously finish waiting early.
+ *
+ * If we were previously in not in the wait_killable
+ * state and we've transitioned to the wait_killable
+ * state, retry waiting. wait_for_completion_killable
+ * will check again if there is a fatal signal waiting,
+ * or another completion (addfd).
+ */
+ if (!wait_killable && notification_wait_killable(&n))
+ continue;
+
goto interrupted;
+ }

addfd = list_first_entry_or_null(&n.addfd,
struct seccomp_kaddfd, list);
@@ -1477,6 +1507,12 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
if (knotif) {
knotif->state = SECCOMP_NOTIFY_INIT;
up(&filter->notif->request);
+
+ /* Wake the task to reset its state */
+ if (knotif->wait_killable) {
+ knotif->wait_killable = false;
+ complete(&knotif->ready);
+ }
}
mutex_unlock(&filter->notify_lock);
}
@@ -1648,6 +1684,35 @@ static long seccomp_notify_addfd(struct seccomp_filter *filter,
return ret;
}

+static long seccomp_notify_set_wait_killable(struct seccomp_filter *filter,
+ void __user *buf)
+{
+ struct seccomp_knotif *knotif;
+ u64 id;
+ long ret;
+
+ if (copy_from_user(&id, buf, sizeof(id)))
+ return -EFAULT;
+
+ ret = mutex_lock_interruptible(&filter->notify_lock);
+ if (ret < 0)
+ return ret;
+
+ knotif = find_notification(filter, id);
+ if (!knotif || knotif->state != SECCOMP_NOTIFY_SENT) {
+ ret = -ENOENT;
+ } else if (knotif->wait_killable) {
+ ret = -EALREADY;
+ } else {
+ ret = 0;
+ knotif->wait_killable = true;
+ complete(&knotif->ready);
+ }
+
+ mutex_unlock(&filter->notify_lock);
+ return ret;
+}
+
static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
unsigned long arg)
{
@@ -1663,6 +1728,8 @@ static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
case SECCOMP_IOCTL_NOTIF_ID_VALID_WRONG_DIR:
case SECCOMP_IOCTL_NOTIF_ID_VALID:
return seccomp_notify_id_valid(filter, buf);
+ case SECCOMP_IOCTL_NOTIF_SET_WAIT_KILLABLE:
+ return seccomp_notify_set_wait_killable(filter, buf);
}

/* Extensible Argument ioctls */
--
2.25.1

2021-04-30 23:24:27

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] seccomp: Add wait_killable semantic to seccomp user notifier

On Fri, Apr 30, 2021 at 1:49 PM Sargun Dhillon <[email protected]> wrote:
>
> The user notifier feature allows for filtering of seccomp notifications in
> userspace. While the user notifier is handling the syscall, the notifying
> process can be preempted, thus ending the notification. This has become a
> growing problem, as Golang has adopted signal based async preemption[1]. In
> this, it will preempt every 10ms, thus leaving the supervisor less than
> 10ms to respond to a given notification. If the syscall require I/O (mount,
> connect) on behalf of the process, it can easily take 10ms.
>
> This allows the supervisor to set a flag that moves the process into a
> state where it is only killable by terminating signals as opposed to all
> signals. The process can still be terminated before the supervisor receives
> the notification.

This is still racy, right? If a signal arrives after the syscall
enters the seccomp code but before the supervisor gets around to
issuing the new ioctl, the syscall will erroneously return -EINTR,
right?

Can we please just fully fix this instead of piling a racy partial fix
on top of an incorrect design?

--Andy

2021-05-01 00:11:07

by Sargun Dhillon

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] seccomp: Add wait_killable semantic to seccomp user notifier

On Fri, Apr 30, 2021 at 4:23 PM Andy Lutomirski <[email protected]> wrote:
>
> On Fri, Apr 30, 2021 at 1:49 PM Sargun Dhillon <[email protected]> wrote:
> >
> > The user notifier feature allows for filtering of seccomp notifications in
> > userspace. While the user notifier is handling the syscall, the notifying
> > process can be preempted, thus ending the notification. This has become a
> > growing problem, as Golang has adopted signal based async preemption[1]. In
> > this, it will preempt every 10ms, thus leaving the supervisor less than
> > 10ms to respond to a given notification. If the syscall require I/O (mount,
> > connect) on behalf of the process, it can easily take 10ms.
> >
> > This allows the supervisor to set a flag that moves the process into a
> > state where it is only killable by terminating signals as opposed to all
> > signals. The process can still be terminated before the supervisor receives
> > the notification.
>
> This is still racy, right? If a signal arrives after the syscall
> enters the seccomp code but before the supervisor gets around to
> issuing the new ioctl, the syscall will erroneously return -EINTR,
> right?
>
> Can we please just fully fix this instead of piling a racy partial fix
> on top of an incorrect design?
>
> --Andy

I thought that you were fine with this approach. Sorry.

Maybe this is a dumb question, what's wrong with returning an EINTR if the
syscall was never observed by the supervisor?

I think that the only other reasonable design is that we add data to the
existing action which makes it sleep in wait_killable state.

2021-05-06 20:06:14

by Rodrigo Campos

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] seccomp: Add wait_killable semantic to seccomp user notifier

On Sat, May 1, 2021 at 2:10 AM Sargun Dhillon <[email protected]> wrote:
>
> On Fri, Apr 30, 2021 at 4:23 PM Andy Lutomirski <[email protected]> wrote:
> >
> > On Fri, Apr 30, 2021 at 1:49 PM Sargun Dhillon <[email protected]> wrote:
> > >
> > > The user notifier feature allows for filtering of seccomp notifications in
> > > userspace. While the user notifier is handling the syscall, the notifying
> > > process can be preempted, thus ending the notification. This has become a
> > > growing problem, as Golang has adopted signal based async preemption[1]. In
> > > this, it will preempt every 10ms, thus leaving the supervisor less than
> > > 10ms to respond to a given notification. If the syscall require I/O (mount,
> > > connect) on behalf of the process, it can easily take 10ms.
> > >
> > > This allows the supervisor to set a flag that moves the process into a
> > > state where it is only killable by terminating signals as opposed to all
> > > signals. The process can still be terminated before the supervisor receives
> > > the notification.
> >
> > This is still racy, right? If a signal arrives after the syscall
> > enters the seccomp code but before the supervisor gets around to
> > issuing the new ioctl, the syscall will erroneously return -EINTR,
> > right?
> >
> > Can we please just fully fix this instead of piling a racy partial fix
> > on top of an incorrect design?
> >
> > --Andy
>
> I thought that you were fine with this approach. Sorry.
>
> Maybe this is a dumb question, what's wrong with returning an EINTR if the
> syscall was never observed by the supervisor?

IIUC It is partially observed by the supervisor.

If you are polling on the seccomp-fd, you _will see_ the POLLIN,
right? That is sent a few lines before the wait, here[1]. And it can
happen that after seeing the POLLIN, by the time we run the recv ioctl
we hang up waiting _for a new syscall to be issued_. Because the
previous syscall could have been interrupted by a signal before we
issued the recv ioctl. IOW, we have spurious wake-ups and no way to
tell when there is something to receive or not for the recv ioctl.
Seems fishy, especially since this is supposed to be a "recv that
blocks signals" ioctl but it can be interrupted by a signal anyways if
they happen before we issue the ioctl.

Another problem is that the container/target will see EINTR on a
syscall that maybe doesn't return EINTR. The program might not handle
that case, as in fact shouldn't ever happen. This is a problem today
too, though, but it is what I understand Andy asks to improve. For
sure it is not proper syscall emulation if the emulated returns things
that should/are never be returned by a regular syscall.

Also, a third problem: if the go runtime gives like ~10ms in some
cases between signals, it is not clear how many times we will issue
the recv ioctl before the runtime sends a signal and interrupts the
container waiting. Furthermore the race is still there, just the
window is smaller.

The main point being the second issue I mentioned: it will never be
correct emulation if an emulated syscall returns different things than
a real one (and things that can't be returned at all by some
syscalls). Additionally, the emulated syscall behaviour should also be
compatible with the real, etc.

> I think that the only other reasonable design is that we add data to the existing action which makes it sleep in wait_killable state.

I don't think the wait should be doing a wait_killable under Andy's idea.

We want to be woken-up in seccomp_do_user_notification() so we know we
received a signal and notify to the supervisor about it. IMHO, we want
an additional "if" here[2], like (haven't checked the retun codes of
functions) "if err != 0" and do a "wake_up_poll(&match->wqh, EPOLLIN
| EPOLLRDNORM);" or POLLPRI or something, and then "goto wait". IOW,
we only return when the supervisor tells us to, if a signal arrives we
just let the supervisor know and continue waiting for a response. The
proper response might be "I wrote partial data, return only 3 bytes",
etc. This way we can properly emulate it.

What is not clear to me, and I'd like others to comment is:

1. How should this new "the supervisor should handle signals" mode be enabled?
a. If we use an extra ioctl, as Andy suggested, I think we have a
race too: from the moment the seccomp syscall is issued until the new
ioctl is called there is a race. If the container does a syscall in
that window, it _can_ incorrectly return EINTR as it does now. This
race can be very small and the ioctl can make _all the next syscalls_
wait in this new mode, so maybe the race is so small that we don't
care in practice.
b. But if we want to really fully solve the issue, the other option
that comes to mind is adding a new flag for the seccomp syscall, to
signal that the supervisor will handle signals and should be
forwarded. This way, if the container does a new syscall it will be
put to wait in this new mode (we have the information that the new
mode should be used already). Something like
SECCOMP_FILTER_FLAG_NEW_LISTENER_SOMETHING (today we have
SECCOMP_FILTER_FLAG_NEW_LISTENER). Ideally shorter :D

2. What should we notify to the supervisor? Only that a signal arrived
or also which signal it was? If we do the EPOLLPRI thingy, as Andy
mentioned in a previous thread, IIUC we will only notify that _a_
signal arrived, but not _which_. To notify which signal arrived might
be complex, though, (not sure, I haven't explored how that should be).

3. How should we notify whatever we want to notify? If we only notify
that a signal arrived, that is probably simple. But if we want to
notify which signal arrived it might be tricker and would like to know
what others think about how that interface should be (adjust the
struct seccomp_notif I guess?), if there are strong opinions, etc.


I'll be away for a few weeks now, but I'll catch up with the discussion ASAP.


Best,
Rodrigo

[1]: https://github.com/torvalds/linux/blob/8404c9fbc84b741f66cff7d4934a25dd2c344452/kernel/seccomp.c#L1107
[2]: https://github.com/torvalds/linux/blob/8404c9fbc84b741f66cff7d4934a25dd2c344452/kernel/seccomp.c#L1116

2021-05-07 01:17:46

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] seccomp: Add wait_killable semantic to seccomp user notifier

On Thu, May 6, 2021 at 12:35 PM Rodrigo Campos <[email protected]> wrote:

> We want to be woken-up in seccomp_do_user_notification() so we know we
> received a signal and notify to the supervisor about it. IMHO, we want
> an additional "if" here[2], like (haven't checked the retun codes of
> functions) "if err != 0" and do a "wake_up_poll(&match->wqh, EPOLLIN
> | EPOLLRDNORM);" or POLLPRI or something, and then "goto wait". IOW,
> we only return when the supervisor tells us to, if a signal arrives we
> just let the supervisor know and continue waiting for a response. The
> proper response might be "I wrote partial data, return only 3 bytes",
> etc. This way we can properly emulate it.
>
> What is not clear to me, and I'd like others to comment is:
>
> 1. How should this new "the supervisor should handle signals" mode be enabled?
> a. If we use an extra ioctl, as Andy suggested, I think we have a
> race too: from the moment the seccomp syscall is issued until the new
> ioctl is called there is a race. If the container does a syscall in
> that window, it _can_ incorrectly return EINTR as it does now. This
> race can be very small and the ioctl can make _all the next syscalls_
> wait in this new mode, so maybe the race is so small that we don't
> care in practice.

If the ioctl is sticky, it can presumably avoid the race entirely by
having whoever calls seccomp() immediately do a dummy syscall or
otherwise wait for the notifier to confirm that it has done the ioctl.
Admittedly, this may be annoying.

> b. But if we want to really fully solve the issue, the other option
> that comes to mind is adding a new flag for the seccomp syscall, to
> signal that the supervisor will handle signals and should be
> forwarded. This way, if the container does a new syscall it will be
> put to wait in this new mode (we have the information that the new
> mode should be used already). Something like
> SECCOMP_FILTER_FLAG_NEW_LISTENER_SOMETHING (today we have
> SECCOMP_FILTER_FLAG_NEW_LISTENER). Ideally shorter :D

I like that better.

>
> 2. What should we notify to the supervisor? Only that a signal arrived
> or also which signal it was? If we do the EPOLLPRI thingy, as Andy
> mentioned in a previous thread, IIUC we will only notify that _a_
> signal arrived, but not _which_. To notify which signal arrived might
> be complex, though, (not sure, I haven't explored how that should be).

Are there any examples in the kernel of syscalls that care *which*
signal came in? As far as I know, there are really only three states
that matter: no signal is pending, a kill is pending, or a regular
signal is pending. (The latter means that there's a signal that is
unmasked, etc and will should therefore interrupt a syscall if the
syscall can be interrupted.

2021-05-07 07:56:56

by Sargun Dhillon

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] seccomp: Add wait_killable semantic to seccomp user notifier

On Thu, May 6, 2021 at 4:59 PM Andy Lutomirski <[email protected]> wrote:
>
> On Thu, May 6, 2021 at 12:35 PM Rodrigo Campos <[email protected]> wrote:
>
> > We want to be woken-up in seccomp_do_user_notification() so we know we
> > received a signal and notify to the supervisor about it. IMHO, we want
> > an additional "if" here[2], like (haven't checked the retun codes of
> > functions) "if err != 0" and do a "wake_up_poll(&match->wqh, EPOLLIN
> > | EPOLLRDNORM);" or POLLPRI or something, and then "goto wait". IOW,
> > we only return when the supervisor tells us to, if a signal arrives we
> > just let the supervisor know and continue waiting for a response. The
> > proper response might be "I wrote partial data, return only 3 bytes",
> > etc. This way we can properly emulate it.
> >
> > What is not clear to me, and I'd like others to comment is:
> >
> > 1. How should this new "the supervisor should handle signals" mode be enabled?
> > a. If we use an extra ioctl, as Andy suggested, I think we have a
> > race too: from the moment the seccomp syscall is issued until the new
> > ioctl is called there is a race. If the container does a syscall in
> > that window, it _can_ incorrectly return EINTR as it does now. This
> > race can be very small and the ioctl can make _all the next syscalls_
> > wait in this new mode, so maybe the race is so small that we don't
> > care in practice.
>
> If the ioctl is sticky, it can presumably avoid the race entirely by
> having whoever calls seccomp() immediately do a dummy syscall or
> otherwise wait for the notifier to confirm that it has done the ioctl.
> Admittedly, this may be annoying.
>
> > b. But if we want to really fully solve the issue, the other option
> > that comes to mind is adding a new flag for the seccomp syscall, to
> > signal that the supervisor will handle signals and should be
> > forwarded. This way, if the container does a new syscall it will be
> > put to wait in this new mode (we have the information that the new
> > mode should be used already). Something like
> > SECCOMP_FILTER_FLAG_NEW_LISTENER_SOMETHING (today we have
> > SECCOMP_FILTER_FLAG_NEW_LISTENER). Ideally shorter :D
>
> I like that better.
>
> >
> > 2. What should we notify to the supervisor? Only that a signal arrived
> > or also which signal it was? If we do the EPOLLPRI thingy, as Andy
> > mentioned in a previous thread, IIUC we will only notify that _a_
> > signal arrived, but not _which_. To notify which signal arrived might
> > be complex, though, (not sure, I haven't explored how that should be).
>
> Are there any examples in the kernel of syscalls that care *which*
> signal came in? As far as I know, there are really only three states
> that matter: no signal is pending, a kill is pending, or a regular
> signal is pending. (The latter means that there's a signal that is
> unmasked, etc and will should therefore interrupt a syscall if the
> syscall can be interrupted.

Andy,
I do not believe we want full preemption on all syscalls. I think we want
the ability to have "safe" (for some definition of safe) preemption on some
syscalls.

For example, a syscall like connect should allow for arbitrary
preemption, even if the supervisor isn't able to get to it in time (using
alarm here is a thing I've seen as a common pattern). If a connect
starts, and is then preempted, you get an EINTR. glibc will retry the
connect, and if it was already in progress, you should get an
EINPROGRESS, otherwise no error. This is "correct" behaviour,
but not all syscalls do this.

I think we can take a slightly lighterweight approach in in the filter
action, we can include a flag in the data of the action that the filter
return which signifies whether or not the call should be preemptible
before being seen by the supervisor, and another flag which indicates
that once the supervisor has RECV'd the notification, it should not
allow for interruption.

I agree though, that if the call is in the non-preemptible state we need to
have a mechanism to notify the supervisor to cancel work, and
return to the notifying process. I think that if we do the POLLPRI thing,
then there should be a RECV_SIGNAL ioctl to read POLLPRI notifications,
that indicate a particular notification has been preempted (signaled). At this
point, the supervisor can do whatever it wants.

There is also a slight other race condition:
1. Notification is generated wait non-preemptible flag in data
2. Program returns signal
3. At this point, should the kernel do POLLPRI or wait for
the supervisor to first read the notification. Should we add
some metadata to the notification indicating it was attempted
to be preempted?

Would the following work:
1. Add a two flags to the data, indicating pre-notification preemption being
allowed and return -EINTR, and once the notification is picked up by
the supervisor transitioning into an -EINTR state
2. Make it so if an in-progress notification is interrupted via a
signal, it will
send a POLLPRI, which is read via RECV_SIGNAL.