2021-04-26 18:07:43

by Sargun Dhillon

[permalink] [raw]
Subject: [PATCH RESEND 0/5] Handle seccomp notification preemption

This patchset addresses a race condition we've dealt with recently with
seccomp. Specifically programs interrupting syscalls while they're in
progress. This was exacerbated by Golang's recent adoption of "async
preemption", in which they try to interrupt any syscall that's been
running for more than 10ms during GC. During certain syscalls, it's
non-trivial to write them in a reetrant manner in userspace (mount).

This has a couple semantic changes, and relaxes a check on seccomp_data, and
changes the semantics with ordering of how addfd and notification replies
in the supervisor are handled.

I'm resending after rebasing and testing on v5.12. It turns out this change
also fixed a bug Rodrigo found that could occur with addfd around certain
race conditions[2].

It also follows up on the original proposal from Tycho[3] to allow
for adding an FD and returning that value atomically.

Changes since v1[1]:
* Fix some documentation
* Add Rata's patches to allow for direct return from addfd

[1]: https://lore.kernel.org/lkml/[email protected]/
[2]: https://lore.kernel.org/lkml/[email protected]/
[3]: https://lore.kernel.org/lkml/202012011322.26DCBC64F2@keescook/


Rodrigo Campos (2):
seccomp: Support atomic "addfd + send reply"
selftests/seccomp: Add test for atomic addfd+send

Sargun Dhillon (3):
seccomp: Refactor notification handler to prepare for new semantics
seccomp: Add wait_killable semantic to seccomp user notifier
selftests/seccomp: Add test for wait killable notifier

.../userspace-api/seccomp_filter.rst | 15 +-
include/uapi/linux/seccomp.h | 4 +
kernel/seccomp.c | 129 ++++++++++++++----
tools/testing/selftests/seccomp/seccomp_bpf.c | 102 ++++++++++++++
4 files changed, 220 insertions(+), 30 deletions(-)


base-commit: 9f4ad9e425a1d3b6a34617b8ea226d56a119a717
--
2.25.1


2021-04-26 18:07:58

by Sargun Dhillon

[permalink] [raw]
Subject: [PATCH RESEND 1/5] seccomp: Refactor notification handler to prepare for new semantics

This refactors the user notification code to have a do / while loop around
the completion condition. This has a small change in semantic, in that
previously we ignored addfd calls upon wakeup if the notification had been
responded to, but instead with the new change we check for an outstanding
addfd calls prior to returning to userspace.

Signed-off-by: Sargun Dhillon <[email protected]>
---
kernel/seccomp.c | 30 ++++++++++++++++--------------
1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 1d60fc2c9987..93684cc63285 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1098,28 +1098,30 @@ static int seccomp_do_user_notification(int this_syscall,

up(&match->notif->request);
wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);
- mutex_unlock(&match->notify_lock);

/*
* This is where we wait for a reply from userspace.
*/
-wait:
- err = wait_for_completion_interruptible(&n.ready);
- mutex_lock(&match->notify_lock);
- if (err == 0) {
- /* Check if we were woken up by a addfd message */
+ do {
+ mutex_unlock(&match->notify_lock);
+ err = wait_for_completion_interruptible(&n.ready);
+ mutex_lock(&match->notify_lock);
+ if (err != 0)
+ goto interrupted;
+
addfd = list_first_entry_or_null(&n.addfd,
struct seccomp_kaddfd, list);
- if (addfd && n.state != SECCOMP_NOTIFY_REPLIED) {
+ /* Check if we were woken up by a addfd message */
+ if (addfd)
seccomp_handle_addfd(addfd);
- mutex_unlock(&match->notify_lock);
- goto wait;
- }
- ret = n.val;
- err = n.error;
- flags = n.flags;
- }

+ } while (n.state != SECCOMP_NOTIFY_REPLIED);
+
+ ret = n.val;
+ err = n.error;
+ flags = n.flags;
+
+interrupted:
/* If there were any pending addfd calls, clear them out */
list_for_each_entry_safe(addfd, tmp, &n.addfd, list) {
/* The process went away before we got a chance to handle it */
--
2.25.1

2021-04-26 18:08:11

by Sargun Dhillon

[permalink] [raw]
Subject: [PATCH RESEND 2/5] seccomp: Add wait_killable semantic to seccomp user notifier

The user notifier feature allows for filtering of seccomp notifications in
userspace. While the user notifier is handling the syscall, the notifying
process can be preempted, thus ending the notification. This has become a
growing problem, as Golang has adopted signal based async preemption[1]. In
this, it will preempt every 10ms, thus leaving the supervisor less than
10ms to respond to a given notification. If the syscall require I/O (mount,
connect) on behalf of the process, it can easily take 10ms.

This allows the supervisor to set a flag that moves the process into a
state where it is only killable by terminating signals as opposed to all
signals. The process can still be terminated before the supervisor receives
the notification.

Signed-off-by: Sargun Dhillon <[email protected]>

[1]: https://github.com/golang/go/issues/24543
---
.../userspace-api/seccomp_filter.rst | 15 +++---
include/uapi/linux/seccomp.h | 3 ++
kernel/seccomp.c | 54 ++++++++++++++++---
3 files changed, 58 insertions(+), 14 deletions(-)

diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
index bd9165241b6c..75de9400d56a 100644
--- a/Documentation/userspace-api/seccomp_filter.rst
+++ b/Documentation/userspace-api/seccomp_filter.rst
@@ -251,13 +251,14 @@ seccomp notification fd to receive a ``struct seccomp_notif``, which contains
five members: the input length of the structure, a unique-per-filter ``id``,
the ``pid`` of the task which triggered this request (which may be 0 if the
task is in a pid ns not visible from the listener's pid namespace), a ``flags``
-member which for now only has ``SECCOMP_NOTIF_FLAG_SIGNALED``, representing
-whether or not the notification is a result of a non-fatal signal, and the
-``data`` passed to seccomp. Userspace can then make a decision based on this
-information about what to do, and ``ioctl(SECCOMP_IOCTL_NOTIF_SEND)`` a
-response, indicating what should be returned to userspace. The ``id`` member of
-``struct seccomp_notif_resp`` should be the same ``id`` as in ``struct
-seccomp_notif``.
+member and the ``data`` passed to seccomp. Upon receiving the notification,
+the ``SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE`` flag may be set, which will
+try to put the task into a state where it will only respond to fatal signals.
+
+Userspace can then make a decision based on this information about what to do,
+and ``ioctl(SECCOMP_IOCTL_NOTIF_SEND)`` a response, indicating what should be
+returned to userspace. The ``id`` member of ``struct seccomp_notif_resp`` should
+be the same ``id`` as in ``struct seccomp_notif``.

It is worth noting that ``struct seccomp_data`` contains the values of register
arguments to the syscall, but does not contain pointers to memory. The task's
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 6ba18b82a02e..bc7fc8b04749 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -70,6 +70,9 @@ struct seccomp_notif_sizes {
__u16 seccomp_data;
};

+/* Valid flags for struct seccomp_notif */
+#define SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE (1UL << 0) /* Prevent task from being interrupted */
+
struct seccomp_notif {
__u64 id;
__u32 pid;
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 93684cc63285..b852e8617004 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -97,6 +97,8 @@ struct seccomp_knotif {

/* outstanding addfd requests */
struct list_head addfd;
+
+ bool wait_killable;
};

/**
@@ -1073,6 +1075,11 @@ static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd)
complete(&addfd->completion);
}

+static bool notification_interruptible(struct seccomp_knotif *n)
+{
+ return !(n->state == SECCOMP_NOTIFY_SENT && n->wait_killable);
+}
+
static int seccomp_do_user_notification(int this_syscall,
struct seccomp_filter *match,
const struct seccomp_data *sd)
@@ -1082,6 +1089,7 @@ static int seccomp_do_user_notification(int this_syscall,
long ret = 0;
struct seccomp_knotif n = {};
struct seccomp_kaddfd *addfd, *tmp;
+ bool interruptible = true;

mutex_lock(&match->notify_lock);
err = -ENOSYS;
@@ -1103,11 +1111,31 @@ static int seccomp_do_user_notification(int this_syscall,
* This is where we wait for a reply from userspace.
*/
do {
+ interruptible = notification_interruptible(&n);
+
mutex_unlock(&match->notify_lock);
- err = wait_for_completion_interruptible(&n.ready);
+ if (interruptible)
+ err = wait_for_completion_interruptible(&n.ready);
+ else
+ err = wait_for_completion_killable(&n.ready);
mutex_lock(&match->notify_lock);
- if (err != 0)
+
+ if (err != 0) {
+ /*
+ * There is a race condition here where if the
+ * notification was received with the
+ * SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE flag, but a
+ * non-fatal signal was received before we could
+ * transition we could erroneously end our wait early.
+ *
+ * The next wait for completion will ensure the signal
+ * was not fatal.
+ */
+ if (interruptible && !notification_interruptible(&n))
+ continue;
+
goto interrupted;
+ }

addfd = list_first_entry_or_null(&n.addfd,
struct seccomp_kaddfd, list);
@@ -1422,14 +1450,16 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
struct seccomp_notif unotif;
ssize_t ret;

+ ret = copy_from_user(&unotif, buf, sizeof(unotif));
+ if (ret)
+ return -EFAULT;
+
/* Verify that we're not given garbage to keep struct extensible. */
- ret = check_zeroed_user(buf, sizeof(unotif));
- if (ret < 0)
- return ret;
- if (!ret)
+ if (unotif.flags & ~(SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE))
return -EINVAL;

- memset(&unotif, 0, sizeof(unotif));
+ if (unotif.id || unotif.pid)
+ return -EINVAL;

ret = down_interruptible(&filter->notif->request);
if (ret < 0)
@@ -1457,6 +1487,12 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
unotif.pid = task_pid_vnr(knotif->task);
unotif.data = *(knotif->data);

+ if (unotif.flags & SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE) {
+ knotif->wait_killable = true;
+ complete(&knotif->ready);
+ }
+
+
knotif->state = SECCOMP_NOTIFY_SENT;
wake_up_poll(&filter->wqh, EPOLLOUT | EPOLLWRNORM);
ret = 0;
@@ -1477,6 +1513,10 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
if (knotif) {
knotif->state = SECCOMP_NOTIFY_INIT;
up(&filter->notif->request);
+
+ /* Wake the task to reset its state */
+ if (knotif->wait_killable)
+ complete(&knotif->ready);
}
mutex_unlock(&filter->notify_lock);
}
--
2.25.1

2021-04-26 18:08:54

by Sargun Dhillon

[permalink] [raw]
Subject: [PATCH RESEND 5/5] selftests/seccomp: Add test for atomic addfd+send

From: Rodrigo Campos <[email protected]>

This just adds a test to verify that when using the new introduced flag
to ADDFD, a valid fd is added and returned as the syscall result.

Signed-off-by: Rodrigo Campos <[email protected]>
Signed-off-by: Sargun Dhillon <[email protected]>
---
tools/testing/selftests/seccomp/seccomp_bpf.c | 38 +++++++++++++++++++
1 file changed, 38 insertions(+)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 34140ce2ab21..b3bfe76c6e90 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -239,6 +239,10 @@ struct seccomp_notif_addfd {
#define SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE (1UL << 0) /* Prevent task from being interrupted */
#endif

+#ifndef SECCOMP_ADDFD_FLAG_SEND
+#define SECCOMP_ADDFD_FLAG_SEND (1UL << 1) /* Addfd and return it, atomically */
+#endif
+
struct seccomp_notif_addfd_small {
__u64 id;
char weird[4];
@@ -3980,8 +3984,14 @@ TEST(user_notification_addfd)
ASSERT_GE(pid, 0);

if (pid == 0) {
+ /* fds will be added and this value is expected */
if (syscall(__NR_getppid) != USER_NOTIF_MAGIC)
exit(1);
+
+ /* Atomic addfd+send is received here. Check it is a valid fd */
+ if (fcntl(syscall(__NR_getppid), F_GETFD) == -1)
+ exit(1);
+
exit(syscall(__NR_getppid) != USER_NOTIF_MAGIC);
}

@@ -4060,6 +4070,30 @@ TEST(user_notification_addfd)
ASSERT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
ASSERT_EQ(addfd.id, req.id);

+ /* Verify we can do an atomic addfd and send */
+ addfd.newfd = 0;
+ addfd.flags = SECCOMP_ADDFD_FLAG_SEND;
+ fd = ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);
+
+ /* Child has fds 0-6 and 42 used, we expect the lower fd available: 7 */
+ EXPECT_EQ(fd, 7);
+ EXPECT_EQ(filecmp(getpid(), pid, memfd, fd), 0);
+
+ /*
+ * This sets the ID of the ADD FD to the last request plus 1. The
+ * notification ID increments 1 per notification.
+ */
+ addfd.id = req.id + 1;
+
+ /* This spins until the underlying notification is generated */
+ while (ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd) != -1 &&
+ errno != -EINPROGRESS)
+ nanosleep(&delay, NULL);
+
+ memset(&req, 0, sizeof(req));
+ ASSERT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+ ASSERT_EQ(addfd.id, req.id);
+
resp.id = req.id;
resp.error = 0;
resp.val = USER_NOTIF_MAGIC;
@@ -4120,6 +4154,10 @@ TEST(user_notification_addfd_rlimit)
EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
EXPECT_EQ(errno, EMFILE);

+ addfd.flags = SECCOMP_ADDFD_FLAG_SEND;
+ EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
+ EXPECT_EQ(errno, EMFILE);
+
addfd.newfd = 100;
addfd.flags = SECCOMP_ADDFD_FLAG_SETFD;
EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
--
2.25.1

2021-04-26 18:09:02

by Sargun Dhillon

[permalink] [raw]
Subject: [PATCH RESEND 4/5] seccomp: Support atomic "addfd + send reply"

From: Rodrigo Campos <[email protected]>

Alban Crequy reported a race condition userspace faces when we want to
add some fds and make the syscall return them[1] using seccomp notify.

The problem is that currently two different ioctl() calls are needed by
the process handling the syscalls (agent) for another userspace process
(target): SECCOMP_IOCTL_NOTIF_ADDFD to allocate the fd and
SECCOMP_IOCTL_NOTIF_SEND to return that value. Therefore, it is possible
for the agent to do the first ioctl to add a file descriptor but the
target is interrupted (EINTR) before the agent does the second ioctl()
call.

Other patches in this series add a way to block signals when a syscall
is put to wait by seccomp. However, that might be a big hammer for some
cases, as the golang runtime uses SIGURG to interrupt threads for GC
collection. Sometimes we just don't want to interfere with the GC, for
example, and just either add the fd and return it or fail the syscall.
With no leaking fds added inadvertly to the target process.

This patch adds a flag to the ADDFD ioctl() so it adds the fd and
returns that value atomically to the target program, as suggested by
Kees Cook[2]. This is done by simply allowing
seccomp_do_user_notification() to add the fd and return it in this case.
Therefore, in this case the target wakes up from the wait in
seccomp_do_user_notification() either to interrupt the syscall or to add
the fd and return it.

This "allocate an fd and return" functionality is useful for syscalls
that return a file descriptor only, like connect(2). Other syscalls that
return a file descriptor but not as return value (or return more than
one fd), like socketpair(), pipe(), recvmsg with SCM_RIGHTs, will not
work with this flag. The way to go to emulate those in cases where a
signal might interrupt is to use the functionality to block signals.

The struct seccomp_notif_resp, used when doing SECCOMP_IOCTL_NOTIF_SEND
ioctl() to send a response to the target, has three more fields that we
don't allow to set when doing the addfd ioctl() to also return. The
reasons to disallow each field are:
* val: This will be set to the new allocated fd. No point taking it
from userspace in this case.
* error: If this is non-zero, the value is ignored. Therefore,
it is pointless in this case as we want to return the value.
* flags: The only flag is to let userspace continue to execute the
syscall. This seems pointless, as we want the syscall to return the
allocated fd.

This is why those fields are not possible to set when using this new
flag.

[1]: https://lore.kernel.org/lkml/CADZs7q4sw71iNHmV8EOOXhUKJMORPzF7thraxZYddTZsxta-KQ@mail.gmail.com/
[2]: https://lore.kernel.org/lkml/202012011322.26DCBC64F2@keescook/

Signed-off-by: Rodrigo Campos <[email protected]>
Signed-off-by: Sargun Dhillon <[email protected]>
---
include/uapi/linux/seccomp.h | 1 +
kernel/seccomp.c | 49 +++++++++++++++++++++++++++++++++---
2 files changed, 46 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index bc7fc8b04749..95dd9bab73c6 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -118,6 +118,7 @@ struct seccomp_notif_resp {

/* valid flags for seccomp_notif_addfd */
#define SECCOMP_ADDFD_FLAG_SETFD (1UL << 0) /* Specify remote fd */
+#define SECCOMP_ADDFD_FLAG_SEND (1UL << 1) /* Addfd and return it, atomically */

/**
* struct seccomp_notif_addfd
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index b852e8617004..b9a56f33ce1a 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -109,6 +109,7 @@ struct seccomp_knotif {
* installing process should allocate the fd as normal.
* @flags: The flags for the new file descriptor. At the moment, only O_CLOEXEC
* is allowed.
+ * @ioctl_flags: The flags used for the seccomp_addfd ioctl.
* @ret: The return value of the installing process. It is set to the fd num
* upon success (>= 0).
* @completion: Indicates that the installing process has completed fd
@@ -120,6 +121,7 @@ struct seccomp_kaddfd {
struct file *file;
int fd;
unsigned int flags;
+ __u32 ioctl_flags;

/* To only be set on reply */
int ret;
@@ -1064,14 +1066,35 @@ static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
return filter->notif->next_id++;
}

-static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd)
+static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd, struct seccomp_knotif *n)
{
+ int fd;
+
/*
* Remove the notification, and reset the list pointers, indicating
* that it has been handled.
*/
list_del_init(&addfd->list);
- addfd->ret = receive_fd_replace(addfd->fd, addfd->file, addfd->flags);
+ fd = receive_fd_replace(addfd->fd, addfd->file, addfd->flags);
+
+ addfd->ret = fd;
+
+ if (addfd->ioctl_flags & SECCOMP_ADDFD_FLAG_SEND) {
+ /* If we fail reset and return an error to the notifier */
+ if (fd < 0) {
+ n->state = SECCOMP_NOTIFY_SENT;
+ } else {
+ /* Return the FD we just added */
+ n->flags = 0;
+ n->error = 0;
+ n->val = fd;
+ }
+ }
+
+ /*
+ * Mark the notification as completed. From this point, addfd mem
+ * might be invalidated and we can't safely read it anymore.
+ */
complete(&addfd->completion);
}

@@ -1141,7 +1164,7 @@ static int seccomp_do_user_notification(int this_syscall,
struct seccomp_kaddfd, list);
/* Check if we were woken up by a addfd message */
if (addfd)
- seccomp_handle_addfd(addfd);
+ seccomp_handle_addfd(addfd, &n);

} while (n.state != SECCOMP_NOTIFY_REPLIED);

@@ -1614,7 +1637,7 @@ static long seccomp_notify_addfd(struct seccomp_filter *filter,
if (addfd.newfd_flags & ~O_CLOEXEC)
return -EINVAL;

- if (addfd.flags & ~SECCOMP_ADDFD_FLAG_SETFD)
+ if (addfd.flags & ~(SECCOMP_ADDFD_FLAG_SETFD | SECCOMP_ADDFD_FLAG_SEND))
return -EINVAL;

if (addfd.newfd && !(addfd.flags & SECCOMP_ADDFD_FLAG_SETFD))
@@ -1624,6 +1647,7 @@ static long seccomp_notify_addfd(struct seccomp_filter *filter,
if (!kaddfd.file)
return -EBADF;

+ kaddfd.ioctl_flags = addfd.flags;
kaddfd.flags = addfd.newfd_flags;
kaddfd.fd = (addfd.flags & SECCOMP_ADDFD_FLAG_SETFD) ?
addfd.newfd : -1;
@@ -1649,6 +1673,23 @@ static long seccomp_notify_addfd(struct seccomp_filter *filter,
goto out_unlock;
}

+ if (addfd.flags & SECCOMP_ADDFD_FLAG_SEND) {
+ /*
+ * Disallow queuing an atomic addfd + send reply while there are
+ * some addfd requests still to process.
+ *
+ * There is no clear reason to support it and allows us to keep
+ * the loop on the other side straight-forward.
+ */
+ if (!list_empty(&knotif->addfd)) {
+ ret = -EBUSY;
+ goto out_unlock;
+ }
+
+ /* Allow exactly only one reply */
+ knotif->state = SECCOMP_NOTIFY_REPLIED;
+ }
+
list_add(&kaddfd.list, &knotif->addfd);
complete(&knotif->ready);
mutex_unlock(&filter->notify_lock);
--
2.25.1

2021-04-26 18:09:15

by Sargun Dhillon

[permalink] [raw]
Subject: [PATCH RESEND 3/5] selftests/seccomp: Add test for wait killable notifier

This adds a test for the positive case of the wait killable notifier,
in testing that when the feature is activated the process acts as
expected -- in not terminating on a non-fatal signal, and instead
queueing it up. There is already a test case for normal handlers
and preemption.

Signed-off-by: Sargun Dhillon <[email protected]>
---
tools/testing/selftests/seccomp/seccomp_bpf.c | 64 +++++++++++++++++++
1 file changed, 64 insertions(+)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 98c3b647f54d..34140ce2ab21 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -235,6 +235,10 @@ struct seccomp_notif_addfd {
};
#endif

+#ifndef SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE
+#define SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE (1UL << 0) /* Prevent task from being interrupted */
+#endif
+
struct seccomp_notif_addfd_small {
__u64 id;
char weird[4];
@@ -4135,6 +4139,66 @@ TEST(user_notification_addfd_rlimit)
close(memfd);
}

+TEST(user_notification_signal_wait_killable)
+{
+ pid_t pid;
+ long ret;
+ int status, listener, sk_pair[2];
+ struct seccomp_notif req = {
+ .flags = SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE,
+ };
+ struct seccomp_notif_resp resp = {};
+ char c;
+
+ ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+ ASSERT_EQ(0, ret) {
+ TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+ }
+
+ ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+ ASSERT_EQ(fcntl(sk_pair[0], F_SETFL, O_NONBLOCK), 0);
+
+ listener = user_notif_syscall(__NR_gettid,
+ SECCOMP_FILTER_FLAG_NEW_LISTENER);
+ ASSERT_GE(listener, 0);
+
+ pid = fork();
+ ASSERT_GE(pid, 0);
+
+ if (pid == 0) {
+ close(sk_pair[0]);
+ handled = sk_pair[1];
+ if (signal(SIGUSR1, signal_handler) == SIG_ERR) {
+ perror("signal");
+ exit(1);
+ }
+
+ ret = syscall(__NR_gettid);
+ exit(!(ret == 42));
+ }
+ close(sk_pair[1]);
+
+ EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+ EXPECT_EQ(kill(pid, SIGUSR1), 0);
+ /* Make sure we didn't get a signal */
+ EXPECT_EQ(read(sk_pair[0], &c, 1), -1);
+ /* Make sure the notification is still alive */
+ EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ID_VALID, &req.id), 0);
+
+ resp.id = req.id;
+ resp.error = 0;
+ resp.val = 42;
+
+ EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+ EXPECT_EQ(waitpid(pid, &status, 0), pid);
+ EXPECT_EQ(true, WIFEXITED(status));
+ EXPECT_EQ(0, WEXITSTATUS(status));
+ /* Check we eventually received the signal */
+ EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
+}
+
+
/*
* TODO:
* - expand NNP testing
--
2.25.1

2021-04-26 19:11:26

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH RESEND 3/5] selftests/seccomp: Add test for wait killable notifier

On Mon, Apr 26, 2021 at 11:06:08AM -0700, Sargun Dhillon wrote:
> +TEST(user_notification_signal_wait_killable)
> +{
> + pid_t pid;
> + long ret;
> + int status, listener, sk_pair[2];
> + struct seccomp_notif req = {
> + .flags = SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE,
> + };
> + struct seccomp_notif_resp resp = {};
> + char c;
> +
> + ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
> + ASSERT_EQ(0, ret) {
> + TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
> + }
> +
> + ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
> + ASSERT_EQ(fcntl(sk_pair[0], F_SETFL, O_NONBLOCK), 0);
> +
> + listener = user_notif_syscall(__NR_gettid,
> + SECCOMP_FILTER_FLAG_NEW_LISTENER);
> + ASSERT_GE(listener, 0);
> +
> + pid = fork();
> + ASSERT_GE(pid, 0);
> +
> + if (pid == 0) {
> + close(sk_pair[0]);
> + handled = sk_pair[1];
> + if (signal(SIGUSR1, signal_handler) == SIG_ERR) {

I think here you want a write(handled, "x", 1), right?

Tycho

2021-04-26 19:38:07

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH RESEND 2/5] seccomp: Add wait_killable semantic to seccomp user notifier

On Mon, Apr 26, 2021 at 11:06:07AM -0700, Sargun Dhillon wrote:
> @@ -1103,11 +1111,31 @@ static int seccomp_do_user_notification(int this_syscall,
> * This is where we wait for a reply from userspace.
> */
> do {
> + interruptible = notification_interruptible(&n);
> +
> mutex_unlock(&match->notify_lock);
> - err = wait_for_completion_interruptible(&n.ready);
> + if (interruptible)
> + err = wait_for_completion_interruptible(&n.ready);
> + else
> + err = wait_for_completion_killable(&n.ready);
> mutex_lock(&match->notify_lock);
> - if (err != 0)
> +
> + if (err != 0) {
> + /*
> + * There is a race condition here where if the
> + * notification was received with the
> + * SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE flag, but a
> + * non-fatal signal was received before we could
> + * transition we could erroneously end our wait early.
> + *
> + * The next wait for completion will ensure the signal
> + * was not fatal.
> + */
> + if (interruptible && !notification_interruptible(&n))
> + continue;

I'm trying to understand how one would hit this race,

> @@ -1457,6 +1487,12 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
> unotif.pid = task_pid_vnr(knotif->task);
> unotif.data = *(knotif->data);
>
> + if (unotif.flags & SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE) {
> + knotif->wait_killable = true;
> + complete(&knotif->ready);
> + }
> +
> +
> knotif->state = SECCOMP_NOTIFY_SENT;
> wake_up_poll(&filter->wqh, EPOLLOUT | EPOLLWRNORM);
> ret = 0;

Seems like the idea is that if someone does a ioctl(RECV, ...) twice
they'll hit it? But doesn't the test for NOTIFY_INIT and return
-ENOENT above this hunk prevent that?

Thanks,

Tycho

2021-04-26 22:16:57

by Sargun Dhillon

[permalink] [raw]
Subject: Re: [PATCH RESEND 2/5] seccomp: Add wait_killable semantic to seccomp user notifier

On Mon, Apr 26, 2021 at 01:02:29PM -0600, Tycho Andersen wrote:
> On Mon, Apr 26, 2021 at 11:06:07AM -0700, Sargun Dhillon wrote:
> > @@ -1103,11 +1111,31 @@ static int seccomp_do_user_notification(int this_syscall,
> > * This is where we wait for a reply from userspace.
> > */
> > do {
> > + interruptible = notification_interruptible(&n);
> > +
> > mutex_unlock(&match->notify_lock);
> > - err = wait_for_completion_interruptible(&n.ready);
> > + if (interruptible)
> > + err = wait_for_completion_interruptible(&n.ready);
> > + else
> > + err = wait_for_completion_killable(&n.ready);
> > mutex_lock(&match->notify_lock);
> > - if (err != 0)
> > +
> > + if (err != 0) {
> > + /*
> > + * There is a race condition here where if the
> > + * notification was received with the
> > + * SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE flag, but a
> > + * non-fatal signal was received before we could
> > + * transition we could erroneously end our wait early.
> > + *
> > + * The next wait for completion will ensure the signal
> > + * was not fatal.
> > + */
> > + if (interruptible && !notification_interruptible(&n))
> > + continue;
>
> I'm trying to understand how one would hit this race,
>

I'm thinking:
P: Process that "generates" notification
S: Supervisor
U: User

P: Generated notification
S: ioctl(RECV...) // With wait_killable flag.
...complete is called in the supervisor, but the P may not be woken up...
U: kill -SIGTERM $P
...signal gets delivered to p and causes wakeup and
wait_for_completion_interruptible returns 1...

Then you need to check the race
> > @@ -1457,6 +1487,12 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
> > unotif.pid = task_pid_vnr(knotif->task);
> > unotif.data = *(knotif->data);
> >
> > + if (unotif.flags & SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE) {
> > + knotif->wait_killable = true;
> > + complete(&knotif->ready);
> > + }
> > +
> > +
> > knotif->state = SECCOMP_NOTIFY_SENT;
> > wake_up_poll(&filter->wqh, EPOLLOUT | EPOLLWRNORM);
> > ret = 0;
>
> Seems like the idea is that if someone does a ioctl(RECV, ...) twice
> they'll hit it? But doesn't the test for NOTIFY_INIT and return
> -ENOENT above this hunk prevent that?
>
> Thanks,
>
> Tycho

2021-04-27 13:53:35

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH RESEND 2/5] seccomp: Add wait_killable semantic to seccomp user notifier

On Mon, Apr 26, 2021 at 10:15:28PM +0000, Sargun Dhillon wrote:
> On Mon, Apr 26, 2021 at 01:02:29PM -0600, Tycho Andersen wrote:
> > On Mon, Apr 26, 2021 at 11:06:07AM -0700, Sargun Dhillon wrote:
> > > @@ -1103,11 +1111,31 @@ static int seccomp_do_user_notification(int this_syscall,
> > > * This is where we wait for a reply from userspace.
> > > */
> > > do {
> > > + interruptible = notification_interruptible(&n);
> > > +
> > > mutex_unlock(&match->notify_lock);
> > > - err = wait_for_completion_interruptible(&n.ready);
> > > + if (interruptible)
> > > + err = wait_for_completion_interruptible(&n.ready);
> > > + else
> > > + err = wait_for_completion_killable(&n.ready);
> > > mutex_lock(&match->notify_lock);
> > > - if (err != 0)
> > > +
> > > + if (err != 0) {
> > > + /*
> > > + * There is a race condition here where if the
> > > + * notification was received with the
> > > + * SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE flag, but a
> > > + * non-fatal signal was received before we could
> > > + * transition we could erroneously end our wait early.
> > > + *
> > > + * The next wait for completion will ensure the signal
> > > + * was not fatal.
> > > + */
> > > + if (interruptible && !notification_interruptible(&n))
> > > + continue;
> >
> > I'm trying to understand how one would hit this race,
> >
>
> I'm thinking:
> P: Process that "generates" notification
> S: Supervisor
> U: User
>
> P: Generated notification
> S: ioctl(RECV...) // With wait_killable flag.
> ...complete is called in the supervisor, but the P may not be woken up...
> U: kill -SIGTERM $P
> ...signal gets delivered to p and causes wakeup and
> wait_for_completion_interruptible returns 1...
>
> Then you need to check the race

I see, thanks. This seems like a consequence of having the flag be
per-RECV-call vs. per-filter. Seems like it might be simpler to have
it be per-filter?

Tycho

2021-04-27 16:27:37

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH RESEND 2/5] seccomp: Add wait_killable semantic to seccomp user notifier

On Tue, Apr 27, 2021 at 6:48 AM Tycho Andersen <[email protected]> wrote:
>
> On Mon, Apr 26, 2021 at 10:15:28PM +0000, Sargun Dhillon wrote:
> > On Mon, Apr 26, 2021 at 01:02:29PM -0600, Tycho Andersen wrote:
> > > On Mon, Apr 26, 2021 at 11:06:07AM -0700, Sargun Dhillon wrote:
> > > > @@ -1103,11 +1111,31 @@ static int seccomp_do_user_notification(int this_syscall,
> > > > * This is where we wait for a reply from userspace.
> > > > */
> > > > do {
> > > > + interruptible = notification_interruptible(&n);
> > > > +
> > > > mutex_unlock(&match->notify_lock);
> > > > - err = wait_for_completion_interruptible(&n.ready);
> > > > + if (interruptible)
> > > > + err = wait_for_completion_interruptible(&n.ready);
> > > > + else
> > > > + err = wait_for_completion_killable(&n.ready);
> > > > mutex_lock(&match->notify_lock);
> > > > - if (err != 0)
> > > > +
> > > > + if (err != 0) {
> > > > + /*
> > > > + * There is a race condition here where if the
> > > > + * notification was received with the
> > > > + * SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE flag, but a
> > > > + * non-fatal signal was received before we could
> > > > + * transition we could erroneously end our wait early.
> > > > + *
> > > > + * The next wait for completion will ensure the signal
> > > > + * was not fatal.
> > > > + */
> > > > + if (interruptible && !notification_interruptible(&n))
> > > > + continue;
> > >
> > > I'm trying to understand how one would hit this race,
> > >
> >
> > I'm thinking:
> > P: Process that "generates" notification
> > S: Supervisor
> > U: User
> >
> > P: Generated notification
> > S: ioctl(RECV...) // With wait_killable flag.
> > ...complete is called in the supervisor, but the P may not be woken up...
> > U: kill -SIGTERM $P
> > ...signal gets delivered to p and causes wakeup and
> > wait_for_completion_interruptible returns 1...
> >
> > Then you need to check the race
>
> I see, thanks. This seems like a consequence of having the flag be
> per-RECV-call vs. per-filter. Seems like it might be simpler to have
> it be per-filter?
>

Backing up a minute, how is the current behavior not a serious
correctness issue? I can think of two scenarios that seem entirely
broken right now:

1. Process makes a syscall that is not permitted to return -EINTR. It
gets a signal and returns -EINTR when user notifiers are in use.

2. Process makes a syscall that is permitted to return -EINTR. But
-EINTR for IO means "I got interrupted and *did not do the IO*".
Nevertheless, the syscall returns -EINTR and the IO is done.

ISTM the current behavior is severely broken, and the new behavior
isn't *that* much better since it simply ignores signals and can't
emulate -EINTR (or all the various restart modes, sigh). Surely the
right behavior is to have the seccomped process notice that it got a
signal and inform the monitor of that fact so that the monitor can
take appropriate action.

IOW, I don't think that the current behavior *or* the patched opt-in
behavior is great. I think we would do better to have the filter
indicate that it is signal-aware and to document that non-signal-aware
filters cannot behave correctly with respect to signals.

--Andy

2021-04-27 16:37:06

by Sargun Dhillon

[permalink] [raw]
Subject: Re: [PATCH RESEND 2/5] seccomp: Add wait_killable semantic to seccomp user notifier

On Tue, Apr 27, 2021 at 6:48 AM Tycho Andersen <[email protected]> wrote:
>
> On Mon, Apr 26, 2021 at 10:15:28PM +0000, Sargun Dhillon wrote:
> > On Mon, Apr 26, 2021 at 01:02:29PM -0600, Tycho Andersen wrote:
> > > On Mon, Apr 26, 2021 at 11:06:07AM -0700, Sargun Dhillon wrote:
> > > > @@ -1103,11 +1111,31 @@ static int seccomp_do_user_notification(int this_syscall,
> > > > * This is where we wait for a reply from userspace.
> > > > */
> > > > do {
> > > > + interruptible = notification_interruptible(&n);
> > > > +
> > > > mutex_unlock(&match->notify_lock);
> > > > - err = wait_for_completion_interruptible(&n.ready);
> > > > + if (interruptible)
> > > > + err = wait_for_completion_interruptible(&n.ready);
> > > > + else
> > > > + err = wait_for_completion_killable(&n.ready);
> > > > mutex_lock(&match->notify_lock);
> > > > - if (err != 0)
> > > > +
> > > > + if (err != 0) {
> > > > + /*
> > > > + * There is a race condition here where if the
> > > > + * notification was received with the
> > > > + * SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE flag, but a
> > > > + * non-fatal signal was received before we could
> > > > + * transition we could erroneously end our wait early.
> > > > + *
> > > > + * The next wait for completion will ensure the signal
> > > > + * was not fatal.
> > > > + */
> > > > + if (interruptible && !notification_interruptible(&n))
> > > > + continue;
> > >
> > > I'm trying to understand how one would hit this race,
> > >
> >
> > I'm thinking:
> > P: Process that "generates" notification
> > S: Supervisor
> > U: User
> >
> > P: Generated notification
> > S: ioctl(RECV...) // With wait_killable flag.
> > ...complete is called in the supervisor, but the P may not be woken up...
> > U: kill -SIGTERM $P
> > ...signal gets delivered to p and causes wakeup and
> > wait_for_completion_interruptible returns 1...
> >
> > Then you need to check the race
>
> I see, thanks. This seems like a consequence of having the flag be
> per-RECV-call vs. per-filter. Seems like it might be simpler to have
> it be per-filter?
>
> Tycho

You're right.

I think an alternative solution would be to make it on a per-action
basis, and in the filter have a different action for non-preemptible
user notifications.

Since you can only install one filter, I do not think we want to make
it so we do it on a entire filter basis, in case a filter handles a combination
of preemptible and non-preemptible syscalls. For example if you mix
mount and accept.

2021-04-27 17:23:37

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH RESEND 2/5] seccomp: Add wait_killable semantic to seccomp user notifier

On Tue, Apr 27, 2021 at 09:23:42AM -0700, Andy Lutomirski wrote:
> On Tue, Apr 27, 2021 at 6:48 AM Tycho Andersen <[email protected]> wrote:
> >
> > On Mon, Apr 26, 2021 at 10:15:28PM +0000, Sargun Dhillon wrote:
> > > On Mon, Apr 26, 2021 at 01:02:29PM -0600, Tycho Andersen wrote:
> > > > On Mon, Apr 26, 2021 at 11:06:07AM -0700, Sargun Dhillon wrote:
> > > > > @@ -1103,11 +1111,31 @@ static int seccomp_do_user_notification(int this_syscall,
> > > > > * This is where we wait for a reply from userspace.
> > > > > */
> > > > > do {
> > > > > + interruptible = notification_interruptible(&n);
> > > > > +
> > > > > mutex_unlock(&match->notify_lock);
> > > > > - err = wait_for_completion_interruptible(&n.ready);
> > > > > + if (interruptible)
> > > > > + err = wait_for_completion_interruptible(&n.ready);
> > > > > + else
> > > > > + err = wait_for_completion_killable(&n.ready);
> > > > > mutex_lock(&match->notify_lock);
> > > > > - if (err != 0)
> > > > > +
> > > > > + if (err != 0) {
> > > > > + /*
> > > > > + * There is a race condition here where if the
> > > > > + * notification was received with the
> > > > > + * SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE flag, but a
> > > > > + * non-fatal signal was received before we could
> > > > > + * transition we could erroneously end our wait early.
> > > > > + *
> > > > > + * The next wait for completion will ensure the signal
> > > > > + * was not fatal.
> > > > > + */
> > > > > + if (interruptible && !notification_interruptible(&n))
> > > > > + continue;
> > > >
> > > > I'm trying to understand how one would hit this race,
> > > >
> > >
> > > I'm thinking:
> > > P: Process that "generates" notification
> > > S: Supervisor
> > > U: User
> > >
> > > P: Generated notification
> > > S: ioctl(RECV...) // With wait_killable flag.
> > > ...complete is called in the supervisor, but the P may not be woken up...
> > > U: kill -SIGTERM $P
> > > ...signal gets delivered to p and causes wakeup and
> > > wait_for_completion_interruptible returns 1...
> > >
> > > Then you need to check the race
> >
> > I see, thanks. This seems like a consequence of having the flag be
> > per-RECV-call vs. per-filter. Seems like it might be simpler to have
> > it be per-filter?
> >
>
> Backing up a minute, how is the current behavior not a serious
> correctness issue? I can think of two scenarios that seem entirely
> broken right now:
>
> 1. Process makes a syscall that is not permitted to return -EINTR. It
> gets a signal and returns -EINTR when user notifiers are in use.
>
> 2. Process makes a syscall that is permitted to return -EINTR. But
> -EINTR for IO means "I got interrupted and *did not do the IO*".
> Nevertheless, the syscall returns -EINTR and the IO is done.
>
> ISTM the current behavior is severely broken, and the new behavior
> isn't *that* much better since it simply ignores signals and can't
> emulate -EINTR (or all the various restart modes, sigh). Surely the
> right behavior is to have the seccomped process notice that it got a
> signal and inform the monitor of that fact so that the monitor can
> take appropriate action.

This doesn't help your case (2) though, since the IO could be done
before the supervisor gets the notification.

> IOW, I don't think that the current behavior *or* the patched opt-in
> behavior is great. I think we would do better to have the filter
> indicate that it is signal-aware and to document that non-signal-aware
> filters cannot behave correctly with respect to signals.

I think it would be hard to make a signal-aware filter, it really does
feel like the only thing to do is a killable wait.

Tycho

2021-04-27 22:13:00

by Sargun Dhillon

[permalink] [raw]
Subject: Re: [PATCH RESEND 2/5] seccomp: Add wait_killable semantic to seccomp user notifier

On Tue, Apr 27, 2021 at 11:07:53AM -0600, Tycho Andersen wrote:
> On Tue, Apr 27, 2021 at 09:23:42AM -0700, Andy Lutomirski wrote:
> > On Tue, Apr 27, 2021 at 6:48 AM Tycho Andersen <[email protected]> wrote:
> > >
> > > On Mon, Apr 26, 2021 at 10:15:28PM +0000, Sargun Dhillon wrote:
> > > > On Mon, Apr 26, 2021 at 01:02:29PM -0600, Tycho Andersen wrote:
> > > > > On Mon, Apr 26, 2021 at 11:06:07AM -0700, Sargun Dhillon wrote:
> > > > > > @@ -1103,11 +1111,31 @@ static int seccomp_do_user_notification(int this_syscall,
> > > > > > * This is where we wait for a reply from userspace.
> > > > > > */
> > > > > > do {
> > > > > > + interruptible = notification_interruptible(&n);
> > > > > > +
> > > > > > mutex_unlock(&match->notify_lock);
> > > > > > - err = wait_for_completion_interruptible(&n.ready);
> > > > > > + if (interruptible)
> > > > > > + err = wait_for_completion_interruptible(&n.ready);
> > > > > > + else
> > > > > > + err = wait_for_completion_killable(&n.ready);
> > > > > > mutex_lock(&match->notify_lock);
> > > > > > - if (err != 0)
> > > > > > +
> > > > > > + if (err != 0) {
> > > > > > + /*
> > > > > > + * There is a race condition here where if the
> > > > > > + * notification was received with the
> > > > > > + * SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE flag, but a
> > > > > > + * non-fatal signal was received before we could
> > > > > > + * transition we could erroneously end our wait early.
> > > > > > + *
> > > > > > + * The next wait for completion will ensure the signal
> > > > > > + * was not fatal.
> > > > > > + */
> > > > > > + if (interruptible && !notification_interruptible(&n))
> > > > > > + continue;
> > > > >
> > > > > I'm trying to understand how one would hit this race,
> > > > >
> > > >
> > > > I'm thinking:
> > > > P: Process that "generates" notification
> > > > S: Supervisor
> > > > U: User
> > > >
> > > > P: Generated notification
> > > > S: ioctl(RECV...) // With wait_killable flag.
> > > > ...complete is called in the supervisor, but the P may not be woken up...
> > > > U: kill -SIGTERM $P
> > > > ...signal gets delivered to p and causes wakeup and
> > > > wait_for_completion_interruptible returns 1...
> > > >
> > > > Then you need to check the race
> > >
> > > I see, thanks. This seems like a consequence of having the flag be
> > > per-RECV-call vs. per-filter. Seems like it might be simpler to have
> > > it be per-filter?
> > >
I agree that it is hard / impossible to guarantee correctness *after* the fact.
> >
> > Backing up a minute, how is the current behavior not a serious
> > correctness issue? I can think of two scenarios that seem entirely
> > broken right now:
> >
> > 1. Process makes a syscall that is not permitted to return -EINTR. It
> > gets a signal and returns -EINTR when user notifiers are in use.
> >
Yes, there's a whole host of problems here. Things like fsmount should not
be interruptible.

> > 2. Process makes a syscall that is permitted to return -EINTR. But
> > -EINTR for IO means "I got interrupted and *did not do the IO*".
> > Nevertheless, the syscall returns -EINTR and the IO is done.
In general, I think that the idea is to do as little side-effect I/O
as possible. The use cases we've looked at all have nice ways to unwind
them (perf_event_open, BPF, accept), but others are less good for unwinding
(mount). There are some middle ground calls like connect, but they're
less bad.

> >
> > ISTM the current behavior is severely broken, and the new behavior
> > isn't *that* much better since it simply ignores signals and can't
> > emulate -EINTR (or all the various restart modes, sigh). Surely the
> > right behavior is to have the seccomped process notice that it got a
> > signal and inform the monitor of that fact so that the monitor can
> > take appropriate action.
>
> This doesn't help your case (2) though, since the IO could be done
> before the supervisor gets the notification.
>
I think for something like mount, if it fails (gets interrupted) via a
fatal signal, that's grounds for terminating the container.

> > IOW, I don't think that the current behavior *or* the patched opt-in
> > behavior is great. I think we would do better to have the filter
> > indicate that it is signal-aware and to document that non-signal-aware
> > filters cannot behave correctly with respect to signals.
>
> I think it would be hard to make a signal-aware filter, it really does
> feel like the only thing to do is a killable wait.
>
> Tycho

There are plenty of scenarios where the syscall can be handled in an interruptible
fashion. I like to use accept as an example. I think Jann Horn had put together
a patchset on how the supervisor could be notified (as opposed to background
polling). If the call is interrupted, you can just "finish" the accept on restart
of the sycall by handing the FD over.

I see a handful of paths forward:

* We add a new action USER_NOTIF_KILLABLE which requires a fatal signal
in order to be interrupted
* We add a chunk of data to the USER_NOTIF return code (say, WAIT_KILLABLE)
from the BPF filter that indicates what kind of wait should happen
* (what is happening now) An ioctl flag to say pickup the notification
and put it into a wait_killable state
* An ioctl "command" that puts an existing notifcation in progress into
the wait killable state.

2021-04-27 23:21:25

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH RESEND 2/5] seccomp: Add wait_killable semantic to seccomp user notifier

On Tue, Apr 27, 2021 at 3:10 PM Sargun Dhillon <[email protected]> wrote:
>
> On Tue, Apr 27, 2021 at 11:07:53AM -0600, Tycho Andersen wrote:
> > On Tue, Apr 27, 2021 at 09:23:42AM -0700, Andy Lutomirski wrote:
> > > On Tue, Apr 27, 2021 at 6:48 AM Tycho Andersen <[email protected]> wrote:
> > > >
> > > > On Mon, Apr 26, 2021 at 10:15:28PM +0000, Sargun Dhillon wrote:
> > >
> > > ISTM the current behavior is severely broken, and the new behavior
> > > isn't *that* much better since it simply ignores signals and can't
> > > emulate -EINTR (or all the various restart modes, sigh). Surely the
> > > right behavior is to have the seccomped process notice that it got a
> > > signal and inform the monitor of that fact so that the monitor can
> > > take appropriate action.
> >
> > This doesn't help your case (2) though, since the IO could be done
> > before the supervisor gets the notification.

Tycho, I disagree. Here's how native syscalls work:

1. Entry work is done and the syscall hander does whatever it does at
the beginning of the function. This is entirely non-interruptible.

2. The syscall handler decides it wants to wait, interruptibly,
killably or otherwise.

3. It gets signaled. It takes appropriate action. Appropriate action
does *not* mean -EINTR. It means that something that is correct *for
that syscall* happens. For nanosleep(), this involves the restart
block (and I don't think we need to support the restart block). For
accept(), it mostly seems to mean that the syscall completes as usual.
For write(2), it means that, depending on file type and whether any IO
has occured, either -EINTR is returned and no IO happens, or fewer
bytes than requested are transferred, or the syscall completes. (Or,
if it's a KILL, the process dies early and partial IO is ignored.)
For some syscalls (some AF_UNIX writes, for example, or ptrace()), the
syscall indeed gets interrupted, but it uses one of the -ERESTART
mecahnisms.

User notifiers should allow correct emulation. Right now, it doesn't,
but there is no reason it can't.

> >
> I think for something like mount, if it fails (gets interrupted) via a
> fatal signal, that's grounds for terminating the container.

That would be quite obnoxious if it happens after the container
starts. Ctrl-C-ing a fusermount call should not be grounds for
outright destruction.
>
> I see a handful of paths forward:
>
> * We add a new action USER_NOTIF_KILLABLE which requires a fatal signal
> in order to be interrupted
> * We add a chunk of data to the USER_NOTIF return code (say, WAIT_KILLABLE)
> from the BPF filter that indicates what kind of wait should happen
> * (what is happening now) An ioctl flag to say pickup the notification
> and put it into a wait_killable state
> * An ioctl "command" that puts an existing notifcation in progress into
> the wait killable state.

In the simplest correct API, I think that kills should always kill the
task. Non-kill signals should not kill the syscall but the user
notifier should be informed that a signal happened. (Whether this
requires POLLPRI or some other handshaking mechanism is an open
question.)

The only real downside I see is that genuinely interruptible syscalls
will end up with higher than necessary signal latency if they occur
during the interruptible period. An extended API could allow the
filter to return an indication that an interrupt should result in a
specified error code (-EINTR, ERESTARTxyz, etc), and the monitor could
do a new ioctl() to tell the kernel that the syscall should stop being
interruptible. That ioctl() itself would return a status saying
whether the syscall was already interrupted. One nice feature of this
approach is that the existing model is equivalent to the filter saying
"interruptible with EINTR" and the monitor simply forgetting to do the
new ioctl.

--Andy

2021-04-28 00:25:39

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH RESEND 2/5] seccomp: Add wait_killable semantic to seccomp user notifier

On Tue, Apr 27, 2021 at 04:19:54PM -0700, Andy Lutomirski wrote:
> On Tue, Apr 27, 2021 at 3:10 PM Sargun Dhillon <[email protected]> wrote:
> >
> > On Tue, Apr 27, 2021 at 11:07:53AM -0600, Tycho Andersen wrote:
> > > On Tue, Apr 27, 2021 at 09:23:42AM -0700, Andy Lutomirski wrote:
> > > > On Tue, Apr 27, 2021 at 6:48 AM Tycho Andersen <[email protected]> wrote:
> > > > >
> > > > > On Mon, Apr 26, 2021 at 10:15:28PM +0000, Sargun Dhillon wrote:
> > > >
> > > > ISTM the current behavior is severely broken, and the new behavior
> > > > isn't *that* much better since it simply ignores signals and can't
> > > > emulate -EINTR (or all the various restart modes, sigh). Surely the
> > > > right behavior is to have the seccomped process notice that it got a
> > > > signal and inform the monitor of that fact so that the monitor can
> > > > take appropriate action.
> > >
> > > This doesn't help your case (2) though, since the IO could be done
> > > before the supervisor gets the notification.
>
> Tycho, I disagree. Here's how native syscalls work:
>
> 1. Entry work is done and the syscall hander does whatever it does at
> the beginning of the function. This is entirely non-interruptible.
>
> 2. The syscall handler decides it wants to wait, interruptibly,
> killably or otherwise.
>
> 3. It gets signaled. It takes appropriate action. Appropriate action
> does *not* mean -EINTR. It means that something that is correct *for
> that syscall* happens. For nanosleep(), this involves the restart
> block (and I don't think we need to support the restart block). For
> accept(), it mostly seems to mean that the syscall completes as usual.
> For write(2), it means that, depending on file type and whether any IO
> has occured, either -EINTR is returned and no IO happens, or fewer
> bytes than requested are transferred, or the syscall completes. (Or,
> if it's a KILL, the process dies early and partial IO is ignored.)
> For some syscalls (some AF_UNIX writes, for example, or ptrace()), the
> syscall indeed gets interrupted, but it uses one of the -ERESTART
> mecahnisms.
>
> User notifiers should allow correct emulation. Right now, it doesn't,
> but there is no reason it can't.

Thanks for the explanation.

Consider fsmount, which has a,

ret = mutex_lock_interruptible(&fc->uapi_mutex);
if (ret < 0)
goto err_fsfd;

If a regular task is interrupted during that wait, it return -EINTR
or whatever back to userspace.

Suppose that we intercept fsmount. The supervisor decides the mount is
OK, does the fsmount, injects the mount fd into the container, and
then the tracee receives a signal. At this point, the mount fd is
visible inside the container. The supervisor gets a notification about
the signal and revokes the mount fd, but there was some time where it
was exposed in the container, whereas with the interrupt in the native
syscall there was never any exposure.

How does the supervisor correctly emulate this syscall? Maybe this
doesn't matter and this is just "completes as usual"? But then
handling signals is just a latency issue, not a correctness one. Or
most likely I'm misunderstanding something else :)

Tycho

2021-04-28 03:22:11

by Sargun Dhillon

[permalink] [raw]
Subject: Re: [PATCH RESEND 2/5] seccomp: Add wait_killable semantic to seccomp user notifier

On Tue, Apr 27, 2021 at 04:19:54PM -0700, Andy Lutomirski wrote:
> On Tue, Apr 27, 2021 at 3:10 PM Sargun Dhillon <[email protected]> wrote:
> >
> > On Tue, Apr 27, 2021 at 11:07:53AM -0600, Tycho Andersen wrote:
> > > On Tue, Apr 27, 2021 at 09:23:42AM -0700, Andy Lutomirski wrote:
> > > > On Tue, Apr 27, 2021 at 6:48 AM Tycho Andersen <[email protected]> wrote:
> > > > >
> > > > > On Mon, Apr 26, 2021 at 10:15:28PM +0000, Sargun Dhillon wrote:
> > > >
> > > > ISTM the current behavior is severely broken, and the new behavior
> > > > isn't *that* much better since it simply ignores signals and can't
> > > > emulate -EINTR (or all the various restart modes, sigh). Surely the
> > > > right behavior is to have the seccomped process notice that it got a
> > > > signal and inform the monitor of that fact so that the monitor can
> > > > take appropriate action.
> > >
> > > This doesn't help your case (2) though, since the IO could be done
> > > before the supervisor gets the notification.
>
> Tycho, I disagree. Here's how native syscalls work:
>
> 1. Entry work is done and the syscall hander does whatever it does at
> the beginning of the function. This is entirely non-interruptible.
>
> 2. The syscall handler decides it wants to wait, interruptibly,
> killably or otherwise.
>
> 3. It gets signaled. It takes appropriate action. Appropriate action
> does *not* mean -EINTR. It means that something that is correct *for
> that syscall* happens. For nanosleep(), this involves the restart
> block (and I don't think we need to support the restart block). For
> accept(), it mostly seems to mean that the syscall completes as usual.
> For write(2), it means that, depending on file type and whether any IO
> has occured, either -EINTR is returned and no IO happens, or fewer
> bytes than requested are transferred, or the syscall completes. (Or,
> if it's a KILL, the process dies early and partial IO is ignored.)
> For some syscalls (some AF_UNIX writes, for example, or ptrace()), the
> syscall indeed gets interrupted, but it uses one of the -ERESTART
> mecahnisms.
>
> User notifiers should allow correct emulation. Right now, it doesn't,
> but there is no reason it can't.
>
> > >
> > I think for something like mount, if it fails (gets interrupted) via a
> > fatal signal, that's grounds for terminating the container.
>
> That would be quite obnoxious if it happens after the container
> starts. Ctrl-C-ing a fusermount call should not be grounds for
> outright destruction.
> >
> > I see a handful of paths forward:
> >
> > * We add a new action USER_NOTIF_KILLABLE which requires a fatal signal
> > in order to be interrupted
> > * We add a chunk of data to the USER_NOTIF return code (say, WAIT_KILLABLE)
> > from the BPF filter that indicates what kind of wait should happen
> > * (what is happening now) An ioctl flag to say pickup the notification
> > and put it into a wait_killable state
> > * An ioctl "command" that puts an existing notifcation in progress into
> > the wait killable state.
>
> In the simplest correct API, I think that kills should always kill the
> task. Non-kill signals should not kill the syscall but the user
> notifier should be informed that a signal happened. (Whether this
> requires POLLPRI or some other handshaking mechanism is an open
> question.)
>
This is pretty easy to do fortunately. I do think that we should have
the ability to do this from the filter as certain syscalls have no
assumptions about how to handle preemption (and while it is waiting for
the supervisor to pick it up, it can be interrupted). We do not want the
intermediate window to act differently.

I wonder what Kees would prefer.

> The only real downside I see is that genuinely interruptible syscalls
> will end up with higher than necessary signal latency if they occur
> during the interruptible period. An extended API could allow the
> filter to return an indication that an interrupt should result in a
> specified error code (-EINTR, ERESTARTxyz, etc), and the monitor could
> do a new ioctl() to tell the kernel that the syscall should stop being
> interruptible. That ioctl() itself would return a status saying
Yeah,
Personally, I'd rather do (in order):
1. ioctl to make syscalls uninterruptible
2. ioctl to re-make syscalls re-interruptible
3. action that upon action handling makes syscall non-interruptible.

I think then I can maybe pickup on Jann's patches to make it so the
supervisor is notified of the interruption if the syscall is interrupted,
but to do this in a non-terrible way is more difficult. I really liked
how Tycho handled memory management in the current notifier, and moving
tracking logic for the in-flight notifications elsewhere feels "icky"
to me, but that's a topic for another day IMHO.

> whether the syscall was already interrupted. One nice feature of this
> approach is that the existing model is equivalent to the filter saying
> "interruptible with EINTR" and the monitor simply forgetting to do the
> new ioctl.
>
> --Andy

2021-04-28 13:46:22

by Rodrigo Campos

[permalink] [raw]
Subject: Re: [PATCH RESEND 2/5] seccomp: Add wait_killable semantic to seccomp user notifier

On Wed, Apr 28, 2021 at 2:22 AM Tycho Andersen <[email protected]> wrote:
>
> On Tue, Apr 27, 2021 at 04:19:54PM -0700, Andy Lutomirski wrote:
> > User notifiers should allow correct emulation. Right now, it doesn't,
> > but there is no reason it can't.
>
> Thanks for the explanation.
>
> Consider fsmount, which has a,
>
> ret = mutex_lock_interruptible(&fc->uapi_mutex);
> if (ret < 0)
> goto err_fsfd;
>
> If a regular task is interrupted during that wait, it return -EINTR
> or whatever back to userspace.
>
> Suppose that we intercept fsmount. The supervisor decides the mount is
> OK, does the fsmount, injects the mount fd into the container, and
> then the tracee receives a signal. At this point, the mount fd is
> visible inside the container. The supervisor gets a notification about
> the signal and revokes the mount fd, but there was some time where it
> was exposed in the container, whereas with the interrupt in the native
> syscall there was never any exposure.

IIUC, this is solved by my patch, patch 4 of the series. The
supervisor should do the addfd with the flag added in that patch
(SECCOMP_ADDFD_FLAG_SEND) for an atomic "addfd + send".

That means when using the atomic "addfd+send" what happens is: either
we add the fd _and_ the added fd value is returned to the syscall or
the fd is not added at all and the container sees the syscall as
interrupted. Therefore, the fd is only visible to the container when
it should.


Best,
Rodrigo

2021-04-28 14:54:24

by Rodrigo Campos

[permalink] [raw]
Subject: Re: [PATCH RESEND 2/5] seccomp: Add wait_killable semantic to seccomp user notifier

On Wed, Apr 28, 2021 at 1:10 PM Rodrigo Campos <[email protected]> wrote:
>
> On Wed, Apr 28, 2021 at 2:22 AM Tycho Andersen <[email protected]> wrote:
> >
> > On Tue, Apr 27, 2021 at 04:19:54PM -0700, Andy Lutomirski wrote:
> > > User notifiers should allow correct emulation. Right now, it doesn't,
> > > but there is no reason it can't.
> >
> > Thanks for the explanation.
> >
> > Consider fsmount, which has a,
> >
> > ret = mutex_lock_interruptible(&fc->uapi_mutex);
> > if (ret < 0)
> > goto err_fsfd;
> >
> > If a regular task is interrupted during that wait, it return -EINTR
> > or whatever back to userspace.
> >
> > Suppose that we intercept fsmount. The supervisor decides the mount is
> > OK, does the fsmount, injects the mount fd into the container, and
> > then the tracee receives a signal. At this point, the mount fd is
> > visible inside the container. The supervisor gets a notification about
> > the signal and revokes the mount fd, but there was some time where it
> > was exposed in the container, whereas with the interrupt in the native
> > syscall there was never any exposure.
>
> IIUC, this is solved by my patch, patch 4 of the series. The
> supervisor should do the addfd with the flag added in that patch
> (SECCOMP_ADDFD_FLAG_SEND) for an atomic "addfd + send".

Well, under Andy's proposal handling that is even simpler. If the
signal is delivered after we added the fd (note that the container
syscall does not return when the signal arrives, as it happens today,
it just signals the notifier and continues to wait), we can just
ignore the signal and return that (if that is the appropriate thing
for that syscall, but I guess after adding an fd there isn't any other
reasonable thing to do).



Best,
Rodrigo

2021-04-28 17:35:17

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH RESEND 2/5] seccomp: Add wait_killable semantic to seccomp user notifier

On Wed, Apr 28, 2021 at 03:20:02PM +0200, Rodrigo Campos wrote:
> On Wed, Apr 28, 2021 at 1:10 PM Rodrigo Campos <[email protected]> wrote:
> >
> > On Wed, Apr 28, 2021 at 2:22 AM Tycho Andersen <[email protected]> wrote:
> > >
> > > On Tue, Apr 27, 2021 at 04:19:54PM -0700, Andy Lutomirski wrote:
> > > > User notifiers should allow correct emulation. Right now, it doesn't,
> > > > but there is no reason it can't.
> > >
> > > Thanks for the explanation.
> > >
> > > Consider fsmount, which has a,
> > >
> > > ret = mutex_lock_interruptible(&fc->uapi_mutex);
> > > if (ret < 0)
> > > goto err_fsfd;
> > >
> > > If a regular task is interrupted during that wait, it return -EINTR
> > > or whatever back to userspace.
> > >
> > > Suppose that we intercept fsmount. The supervisor decides the mount is
> > > OK, does the fsmount, injects the mount fd into the container, and
> > > then the tracee receives a signal. At this point, the mount fd is
> > > visible inside the container. The supervisor gets a notification about
> > > the signal and revokes the mount fd, but there was some time where it
> > > was exposed in the container, whereas with the interrupt in the native
> > > syscall there was never any exposure.
> >
> > IIUC, this is solved by my patch, patch 4 of the series. The
> > supervisor should do the addfd with the flag added in that patch
> > (SECCOMP_ADDFD_FLAG_SEND) for an atomic "addfd + send".
>
> Well, under Andy's proposal handling that is even simpler. If the
> signal is delivered after we added the fd (note that the container
> syscall does not return when the signal arrives, as it happens today,
> it just signals the notifier and continues to wait), we can just
> ignore the signal and return that (if that is the appropriate thing
> for that syscall, but I guess after adding an fd there isn't any other
> reasonable thing to do).

Yes, agreed. After thinking about this more, my example is bogus: the
kernel doesn't sleep after it installs the fd, so it would ignore any
signals too.

Even if the kernel *did* sleep after installing the fd, it would still
be correct emulation to install it and then do whatever the kernel did
during that sleep. So I withdraw my objection :)

Thanks,

Tycho

2021-04-28 20:53:52

by Sargun Dhillon

[permalink] [raw]
Subject: Re: [PATCH RESEND 2/5] seccomp: Add wait_killable semantic to seccomp user notifier

On Wed, Apr 28, 2021 at 7:08 AM Tycho Andersen <[email protected]> wrote:
>
> On Wed, Apr 28, 2021 at 03:20:02PM +0200, Rodrigo Campos wrote:
> > On Wed, Apr 28, 2021 at 1:10 PM Rodrigo Campos <[email protected]> wrote:
> > >
> > > On Wed, Apr 28, 2021 at 2:22 AM Tycho Andersen <[email protected]> wrote:
> > > >
> > > > On Tue, Apr 27, 2021 at 04:19:54PM -0700, Andy Lutomirski wrote:
> > > > > User notifiers should allow correct emulation. Right now, it doesn't,
> > > > > but there is no reason it can't.
> > > >
> > > > Thanks for the explanation.
> > > >
> > > > Consider fsmount, which has a,
> > > >
> > > > ret = mutex_lock_interruptible(&fc->uapi_mutex);
> > > > if (ret < 0)
> > > > goto err_fsfd;
> > > >
> > > > If a regular task is interrupted during that wait, it return -EINTR
> > > > or whatever back to userspace.
> > > >
> > > > Suppose that we intercept fsmount. The supervisor decides the mount is
> > > > OK, does the fsmount, injects the mount fd into the container, and
> > > > then the tracee receives a signal. At this point, the mount fd is
> > > > visible inside the container. The supervisor gets a notification about
> > > > the signal and revokes the mount fd, but there was some time where it
> > > > was exposed in the container, whereas with the interrupt in the native
> > > > syscall there was never any exposure.
> > >
> > > IIUC, this is solved by my patch, patch 4 of the series. The
> > > supervisor should do the addfd with the flag added in that patch
> > > (SECCOMP_ADDFD_FLAG_SEND) for an atomic "addfd + send".
> >
> > Well, under Andy's proposal handling that is even simpler. If the
> > signal is delivered after we added the fd (note that the container
> > syscall does not return when the signal arrives, as it happens today,
> > it just signals the notifier and continues to wait), we can just
> > ignore the signal and return that (if that is the appropriate thing
> > for that syscall, but I guess after adding an fd there isn't any other
> > reasonable thing to do).
>
> Yes, agreed. After thinking about this more, my example is bogus: the
> kernel doesn't sleep after it installs the fd, so it would ignore any
> signals too.
>
> Even if the kernel *did* sleep after installing the fd, it would still
> be correct emulation to install it and then do whatever the kernel did
> during that sleep. So I withdraw my objection :)
>
> Thanks,
>
> Tycho

Great.

I'll respin the series and add a
SECCOMP_IOCTL_NOTIF_SET_WAIT_KILLABLE command.

We can do the other aforementioned optimizations above when
specific use cases come up. I would like to work on preemption
notification after this lands though.