This adds the capability for seccomp notifier listeners to add file
descriptors in response to a seccomp notification. This is useful for
syscalls in which the previous capabilities were not sufficient. The
current mechanism works well for syscalls that either have side effects
that are system / namespace wide (mount), or that operate on a specific
set of registers (reboot, mknod), and don't require dereferencing pointers.
The problem with derefencing pointers in a supervisor is that it leaves
us vulnerable to TOC-TOU [1] style attacks. For syscalls that had a direct
effect on file descriptors pidfd_getfd was added, allowing for those file
descriptors to be directly operated upon by the supervisor [2].
Unfortunately, this leaves system calls which return file descriptors
out of the picture. These are fairly common syscalls, such as openat,
socket, and perf_event_open that return file descriptors, and have
arguments that are pointers. These require that the supervisor is able to
verify the arguments, make the call on behalf of the process on hand,
and pass back the resulting file descriptor. This is where addfd comes
into play.
There is an additional flag that allows you to "set" an FD, rather than
add it with an arbitrary number. This has dup2 style semantics, and
installs the new file at that file descriptor, and atomically closes
the old one if it existed. This is useful for a particular use case
that we have, in which we want to swap out AF_INET sockets for AF_UNIX,
AF_INET6, and sockets in another namespace when doing "upconversion".
My specific usecase at Netflix is to enable our IPv4-IPv6 transition
mechanism, in which we our namespaces have no real IPv4 reachability,
and when it comes time to do a connect(2), we get a socket from a
namespace with global IPv4 reachability.
In addition, we intend to use it for our servicemesh, and where our
service mesh needs to intercept traffic ingress traffic, the addfd
capability will act as a mechanism to do socket activation.
Addfd is not implemented as a separate syscall, a la pidfd_getfd, as
VFS makes some optimizations in regards to the fdtable, and assumes
that they are not modified by external processes. Although a mechanism
that scheduled something in the context of the task could work, it is
somewhat simpler to do it in the context of the ioctl as we control
the task while in kernel.
There is an additional flag (move) that was added to enable cgroup
v1 controllers (netprio, classid), and moving sockets, as a socket
can only be associated with one cgroup at a time.
[1]: https://lore.kernel.org/lkml/[email protected]/
[2]: https://lore.kernel.org/lkml/[email protected]/
Sargun Dhillon (5):
seccomp: Add find_notification helper
seccomp: Introduce addfd ioctl to seccomp user notifier
selftests/seccomp: Test SECCOMP_IOCTL_NOTIF_ADDFD
seccomp: Add SECCOMP_ADDFD_FLAG_MOVE flag to add fd ioctl
selftests/seccomp: Add test for addfd move semantics
include/uapi/linux/seccomp.h | 33 +++
kernel/seccomp.c | 228 +++++++++++++++--
tools/testing/selftests/seccomp/seccomp_bpf.c | 235 ++++++++++++++++++
3 files changed, 479 insertions(+), 17 deletions(-)
--
2.25.1
This adds a helper which can iterate through a seccomp_filter to
find a notification matching an ID. It removes several replicated
chunks of code.
Signed-off-by: Sargun Dhillon <[email protected]>
Cc: Matt Denton <[email protected]>
Cc: Kees Cook <[email protected]>,
Cc: Jann Horn <[email protected]>,
Cc: Robert Sesek <[email protected]>,
Cc: Chris Palmer <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Tycho Andersen <[email protected]>
---
kernel/seccomp.c | 38 +++++++++++++++++++++-----------------
1 file changed, 21 insertions(+), 17 deletions(-)
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 55a6184f5990..f6ce94b7a167 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1021,10 +1021,25 @@ static int seccomp_notify_release(struct inode *inode, struct file *file)
return 0;
}
+/* must be called with notif_lock held */
+static inline struct seccomp_knotif *
+find_notification(struct seccomp_filter *filter, u64 id)
+{
+ struct seccomp_knotif *cur;
+
+ list_for_each_entry(cur, &filter->notif->notifications, list) {
+ if (cur->id == id)
+ return cur;
+ }
+
+ return NULL;
+}
+
+
static long seccomp_notify_recv(struct seccomp_filter *filter,
void __user *buf)
{
- struct seccomp_knotif *knotif = NULL, *cur;
+ struct seccomp_knotif *knotif, *cur;
struct seccomp_notif unotif;
ssize_t ret;
@@ -1078,14 +1093,8 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
* may have died when we released the lock, so we need to make
* sure it's still around.
*/
- knotif = NULL;
mutex_lock(&filter->notify_lock);
- list_for_each_entry(cur, &filter->notif->notifications, list) {
- if (cur->id == unotif.id) {
- knotif = cur;
- break;
- }
- }
+ knotif = find_notification(filter, unotif.id);
if (knotif) {
knotif->state = SECCOMP_NOTIFY_INIT;
@@ -1150,7 +1159,7 @@ static long seccomp_notify_send(struct seccomp_filter *filter,
static long seccomp_notify_id_valid(struct seccomp_filter *filter,
void __user *buf)
{
- struct seccomp_knotif *knotif = NULL;
+ struct seccomp_knotif *knotif;
u64 id;
long ret;
@@ -1162,15 +1171,10 @@ static long seccomp_notify_id_valid(struct seccomp_filter *filter,
return ret;
ret = -ENOENT;
- list_for_each_entry(knotif, &filter->notif->notifications, list) {
- if (knotif->id == id) {
- if (knotif->state == SECCOMP_NOTIFY_SENT)
- ret = 0;
- goto out;
- }
- }
+ knotif = find_notification(filter, id);
+ if (knotif && knotif->state == SECCOMP_NOTIFY_SENT)
+ ret = 0;
-out:
mutex_unlock(&filter->notify_lock);
return ret;
}
--
2.25.1
This adds a seccomp notifier ioctl which allows for the listener to "add"
file descriptors to a process which originated a seccomp user
notification. This allows calls like mount, and mknod to be "implemented",
as the return value, and the arguments are data in memory. On the other
hand, calls like connect can be "implemented" using pidfd_getfd.
Unfortunately, there are calls which return file descriptors, like
open, which are vulnerable to TOC-TOU attacks, and require that the
more privileged supervisor can inspect the argument, and perform the
syscall on behalf of the process generating the notifiation. This
allows the file descriptor generated from that open call to be
returned to the calling process.
In addition, there is funcitonality to allow for replacement of
specific file descriptors, following dup2-like semantics.
Signed-off-by: Sargun Dhillon <[email protected]>
Suggested-by: Matt Denton <[email protected]>
Cc: Kees Cook <[email protected]>,
Cc: Jann Horn <[email protected]>,
Cc: Robert Sesek <[email protected]>,
Cc: Chris Palmer <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Tycho Andersen <[email protected]>
---
include/uapi/linux/seccomp.h | 25 ++++++
kernel/seccomp.c | 169 ++++++++++++++++++++++++++++++++++-
2 files changed, 193 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index c1735455bc53..7d450a9e4c29 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -113,6 +113,27 @@ struct seccomp_notif_resp {
__u32 flags;
};
+/* valid flags for seccomp_notif_addfd */
+#define SECCOMP_ADDFD_FLAG_SETFD (1UL << 0) /* Specify remote fd */
+
+/**
+ * struct seccomp_notif_addfd
+ * @size: The size of the seccomp_notif_addfd datastructure
+ * @fd: The local fd number
+ * @id: The ID of the seccomp notification
+ * @fd_flags: Flags the remote FD should be allocated under
+ * @remote_fd: Optional remote FD number if SETFD option is set, otherwise 0.
+ * @flags: SECCOMP_ADDFD_FLAG_*
+ */
+struct seccomp_notif_addfd {
+ __u32 size;
+ __u32 fd;
+ __u64 id;
+ __u32 fd_flags;
+ __u32 remote_fd;
+ __u64 flags;
+};
+
#define SECCOMP_IOC_MAGIC '!'
#define SECCOMP_IO(nr) _IO(SECCOMP_IOC_MAGIC, nr)
#define SECCOMP_IOR(nr, type) _IOR(SECCOMP_IOC_MAGIC, nr, type)
@@ -124,4 +145,8 @@ struct seccomp_notif_resp {
#define SECCOMP_IOCTL_NOTIF_SEND SECCOMP_IOWR(1, \
struct seccomp_notif_resp)
#define SECCOMP_IOCTL_NOTIF_ID_VALID SECCOMP_IOR(2, __u64)
+/* On success, the return value is the remote process's added fd number */
+#define SECCOMP_IOCTL_NOTIF_ADDFD SECCOMP_IOR(3, \
+ struct seccomp_notif_addfd)
+
#endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index f6ce94b7a167..88940eeabaee 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -77,10 +77,42 @@ struct seccomp_knotif {
long val;
u32 flags;
- /* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
+ /*
+ * Signals when this has changed states, such as the listener
+ * dying, a new seccomp addfd message, or changing to REPLIED
+ */
struct completion ready;
struct list_head list;
+
+ /* outstanding addfd requests */
+ struct list_head addfd;
+};
+
+/**
+ * struct seccomp_kaddfd - contianer for seccomp_addfd ioctl messages
+ *
+ * @file: A reference to the file to install in the other task
+ * @fd: The fd number to install it at. If the fd number is -1, it means the
+ * installing process should allocate the fd as normal.
+ * @flags: The flags for the new file descriptor. At the moment, only O_CLOEXEC
+ * is allowed.
+ * @ret: The return value of the installing process. It is set to the fd num
+ * upon success (>= 0).
+ * @completion: Indicates that the installing process has completed fd
+ * installation, or gone away (either due to successful
+ * reply, or signal)
+ *
+ */
+struct seccomp_kaddfd {
+ struct file *file;
+ int fd;
+ unsigned int flags;
+
+ /* To only be set on reply */
+ int ret;
+ struct completion completion;
+ struct list_head list;
};
/**
@@ -735,6 +767,35 @@ static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
return filter->notif->next_id++;
}
+static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd)
+{
+ int ret;
+
+ /*
+ * Remove the notification, and reset the list pointers, indicating
+ * that it has been handled.
+ */
+ list_del_init(&addfd->list);
+
+ ret = security_file_receive(addfd->file);
+ if (ret)
+ goto out;
+
+ if (addfd->fd >= 0) {
+ ret = replace_fd(addfd->fd, addfd->file, addfd->flags);
+ if (ret >= 0)
+ fput(addfd->file);
+ } else {
+ ret = get_unused_fd_flags(addfd->flags);
+ if (ret >= 0)
+ fd_install(ret, addfd->file);
+ }
+
+out:
+ addfd->ret = ret;
+ complete(&addfd->completion);
+}
+
static int seccomp_do_user_notification(int this_syscall,
struct seccomp_filter *match,
const struct seccomp_data *sd)
@@ -743,6 +804,7 @@ static int seccomp_do_user_notification(int this_syscall,
u32 flags = 0;
long ret = 0;
struct seccomp_knotif n = {};
+ struct seccomp_kaddfd *addfd, *tmp;
mutex_lock(&match->notify_lock);
err = -ENOSYS;
@@ -755,6 +817,7 @@ static int seccomp_do_user_notification(int this_syscall,
n.id = seccomp_next_notify_id(match);
init_completion(&n.ready);
list_add(&n.list, &match->notif->notifications);
+ INIT_LIST_HEAD(&n.addfd);
up(&match->notif->request);
wake_up_poll(&match->notif->wqh, EPOLLIN | EPOLLRDNORM);
@@ -763,14 +826,31 @@ static int seccomp_do_user_notification(int this_syscall,
/*
* This is where we wait for a reply from userspace.
*/
+wait:
err = wait_for_completion_interruptible(&n.ready);
mutex_lock(&match->notify_lock);
if (err == 0) {
+ /* Check if we were woken up by a addfd message */
+ addfd = list_first_entry_or_null(&n.addfd,
+ struct seccomp_kaddfd, list);
+ if (addfd && n.state != SECCOMP_NOTIFY_REPLIED) {
+ seccomp_handle_addfd(addfd);
+ mutex_unlock(&match->notify_lock);
+ goto wait;
+ }
ret = n.val;
err = n.error;
flags = n.flags;
}
+ /* If there were any pending addfd calls, clear them out */
+ list_for_each_entry_safe(addfd, tmp, &n.addfd, list) {
+ /* The process went away before we got a chance to handle it */
+ addfd->ret = -ENOENT;
+ list_del_init(&addfd->list);
+ complete(&addfd->completion);
+ }
+
/*
* Note that it's possible the listener died in between the time when
* we were notified of a respons (or a signal) and when we were able to
@@ -1179,6 +1259,91 @@ static long seccomp_notify_id_valid(struct seccomp_filter *filter,
return ret;
}
+static long seccomp_notify_addfd(struct seccomp_filter *filter,
+ struct seccomp_notif_addfd __user *uaddfd)
+{
+ struct seccomp_notif_addfd addfd;
+ struct seccomp_knotif *knotif;
+ struct seccomp_kaddfd kaddfd;
+ u32 size;
+ int ret;
+
+ ret = get_user(size, &uaddfd->size);
+ if (ret)
+ return ret;
+
+ ret = copy_struct_from_user(&addfd, sizeof(addfd), uaddfd, size);
+ if (ret)
+ return ret;
+
+ if (addfd.fd_flags & (~O_CLOEXEC))
+ return -EINVAL;
+
+ if (addfd.flags & ~(SECCOMP_ADDFD_FLAG_SETFD))
+ return -EINVAL;
+
+ if (addfd.remote_fd && !(addfd.flags & SECCOMP_ADDFD_FLAG_SETFD))
+ return -EINVAL;
+
+ kaddfd.file = fget(addfd.fd);
+ if (!kaddfd.file)
+ return -EBADF;
+
+ kaddfd.flags = addfd.fd_flags;
+ kaddfd.fd = (addfd.flags & SECCOMP_ADDFD_FLAG_SETFD) ?
+ addfd.remote_fd : -1;
+ init_completion(&kaddfd.completion);
+
+ ret = mutex_lock_interruptible(&filter->notify_lock);
+ if (ret)
+ goto out;
+
+ knotif = find_notification(filter, addfd.id);
+ /*
+ * We do not want to allow for FD injection to occur before the
+ * notification has been picked up by a userspace handler, or after
+ * the notification has been replied to.
+ */
+ if (!knotif || knotif->state != SECCOMP_NOTIFY_SENT) {
+ ret = -ENOENT;
+ goto out_unlock;
+ }
+
+ list_add(&kaddfd.list, &knotif->addfd);
+ complete(&knotif->ready);
+ mutex_unlock(&filter->notify_lock);
+
+ /* Now we wait for it to be processed */
+ ret = wait_for_completion_interruptible(&kaddfd.completion);
+ if (ret == 0) {
+ /*
+ * We had a successful completion. The other side has already
+ * removed us from the addfd queue, and
+ * wait_for_completion_interruptible has a memory barrier.
+ */
+ ret = kaddfd.ret;
+ goto out;
+ }
+
+ mutex_lock(&filter->notify_lock);
+ /*
+ * Even though we were woken up by a signal, and not a successful
+ * completion, a completion may have happened in the mean time.
+ */
+ if (list_empty(&kaddfd.list))
+ ret = kaddfd.ret;
+ else
+ list_del(&kaddfd.list);
+
+out_unlock:
+ mutex_unlock(&filter->notify_lock);
+out:
+ if (ret < 0)
+ fput(kaddfd.file);
+
+ return ret;
+}
+
static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
unsigned long arg)
{
@@ -1192,6 +1357,8 @@ static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
return seccomp_notify_send(filter, buf);
case SECCOMP_IOCTL_NOTIF_ID_VALID:
return seccomp_notify_id_valid(filter, buf);
+ case SECCOMP_IOCTL_NOTIF_ADDFD:
+ return seccomp_notify_addfd(filter, buf);
default:
return -EINVAL;
}
--
2.25.1
Certain files, when moved to another process have metadata changed, such
as netprioidx, and classid. This is the default behaviour in sending
sockets with SCM_RIGHTS over unix sockets. Depending on the usecase,
this may or may not be desirable with the addfd ioctl. This allows
the user to opt-in.
Signed-off-by: Sargun Dhillon <[email protected]>
Suggested-by: Tycho Andersen <[email protected]>
Cc: Matt Denton <[email protected]>
Cc: Kees Cook <[email protected]>,
Cc: Jann Horn <[email protected]>,
Cc: Robert Sesek <[email protected]>,
Cc: Chris Palmer <[email protected]>
Cc: Christian Brauner <[email protected]>
---
include/uapi/linux/seccomp.h | 8 ++++++++
kernel/seccomp.c | 31 +++++++++++++++++++++++++++----
2 files changed, 35 insertions(+), 4 deletions(-)
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 7d450a9e4c29..ccd1c960372a 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -115,6 +115,14 @@ struct seccomp_notif_resp {
/* valid flags for seccomp_notif_addfd */
#define SECCOMP_ADDFD_FLAG_SETFD (1UL << 0) /* Specify remote fd */
+/*
+ * Certain file descriptors are behave differently depending on the process
+ * they are created in. Specifcally, sockets, and their interactions with the
+ * net_cls and net_prio cgroup v1 controllers. This "moves" the file descriptor
+ * so that it takes on the cgroup controller's configuration in the process
+ * that the file descriptor is being added to.
+ */
+#define SECCOMP_ADDFD_FLAG_MOVE (1UL << 1)
/**
* struct seccomp_notif_addfd
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 88940eeabaee..2e649f3cb10e 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -41,6 +41,9 @@
#include <linux/tracehook.h>
#include <linux/uaccess.h>
#include <linux/anon_inodes.h>
+#include <net/netprio_cgroup.h>
+#include <net/sock.h>
+#include <net/cls_cgroup.h>
enum notify_state {
SECCOMP_NOTIFY_INIT,
@@ -108,6 +111,7 @@ struct seccomp_kaddfd {
struct file *file;
int fd;
unsigned int flags;
+ bool move;
/* To only be set on reply */
int ret;
@@ -769,7 +773,8 @@ static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd)
{
- int ret;
+ struct socket *sock;
+ int err, ret;
/*
* Remove the notification, and reset the list pointers, indicating
@@ -785,12 +790,29 @@ static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd)
ret = replace_fd(addfd->fd, addfd->file, addfd->flags);
if (ret >= 0)
fput(addfd->file);
+ else
+ goto out;
} else {
ret = get_unused_fd_flags(addfd->flags);
if (ret >= 0)
fd_install(ret, addfd->file);
+ else
+ goto out;
}
+ if (addfd->move) {
+ sock = sock_from_file(addfd->file, &err);
+ if (sock) {
+ sock_update_netprioidx(&sock->sk->sk_cgrp_data);
+ sock_update_classid(&sock->sk->sk_cgrp_data);
+ }
+ }
+ /*
+ * An extra reference is taken on the ioctl side, so upon success, we
+ * must consume all references (and on failure, none).
+ */
+ fput(addfd->file);
+
out:
addfd->ret = ret;
complete(&addfd->completion);
@@ -1279,16 +1301,17 @@ static long seccomp_notify_addfd(struct seccomp_filter *filter,
if (addfd.fd_flags & (~O_CLOEXEC))
return -EINVAL;
- if (addfd.flags & ~(SECCOMP_ADDFD_FLAG_SETFD))
+ if (addfd.flags & ~(SECCOMP_ADDFD_FLAG_SETFD|SECCOMP_ADDFD_FLAG_MOVE))
return -EINVAL;
if (addfd.remote_fd && !(addfd.flags & SECCOMP_ADDFD_FLAG_SETFD))
return -EINVAL;
- kaddfd.file = fget(addfd.fd);
+ kaddfd.file = fget_many(addfd.fd, 2);
if (!kaddfd.file)
return -EBADF;
+ kaddfd.move = (addfd.flags & SECCOMP_ADDFD_FLAG_MOVE);
kaddfd.flags = addfd.fd_flags;
kaddfd.fd = (addfd.flags & SECCOMP_ADDFD_FLAG_SETFD) ?
addfd.remote_fd : -1;
@@ -1339,7 +1362,7 @@ static long seccomp_notify_addfd(struct seccomp_filter *filter,
mutex_unlock(&filter->notify_lock);
out:
if (ret < 0)
- fput(kaddfd.file);
+ fput_many(kaddfd.file, 2);
return ret;
}
--
2.25.1
Test whether we can add file descriptors in response to notifications.
This injects the file descriptors via notifications, and then uses
kcmp to determine whether or not it has been successful.
It also includes some basic sanity checking for arguments.
Signed-off-by: Sargun Dhillon <[email protected]>
Cc: Matt Denton <[email protected]>
Cc: Kees Cook <[email protected]>,
Cc: Jann Horn <[email protected]>,
Cc: Robert Sesek <[email protected]>,
Cc: Chris Palmer <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Tycho Andersen <[email protected]>
---
tools/testing/selftests/seccomp/seccomp_bpf.c | 227 ++++++++++++++++++
1 file changed, 227 insertions(+)
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index c0aa46ce14f6..1ec43fef2b93 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -45,6 +45,7 @@
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <linux/kcmp.h>
+#include <sys/resource.h>
#include <unistd.h>
#include <sys/syscall.h>
@@ -181,6 +182,12 @@ struct seccomp_metadata {
#define SECCOMP_IOCTL_NOTIF_SEND SECCOMP_IOWR(1, \
struct seccomp_notif_resp)
#define SECCOMP_IOCTL_NOTIF_ID_VALID SECCOMP_IOR(2, __u64)
+/* On success, the return value is the remote process's added fd number */
+#define SECCOMP_IOCTL_NOTIF_ADDFD SECCOMP_IOR(3, \
+ struct seccomp_notif_addfd)
+
+/* valid flags for seccomp_notif_addfd */
+#define SECCOMP_ADDFD_FLAG_SETFD (1UL << 0) /* Specify remote fd */
struct seccomp_notif {
__u64 id;
@@ -201,6 +208,15 @@ struct seccomp_notif_sizes {
__u16 seccomp_notif_resp;
__u16 seccomp_data;
};
+
+struct seccomp_notif_addfd {
+ __u32 size;
+ __u32 fd;
+ __u64 id;
+ __u32 fd_flags;
+ __u32 remote_fd;
+ __u64 flags;
+};
#endif
#ifndef PTRACE_EVENTMSG_SYSCALL_ENTRY
@@ -3686,6 +3702,217 @@ TEST(user_notification_continue)
}
}
+TEST(user_notification_sendfd)
+{
+ pid_t pid;
+ long ret;
+ int status, listener, memfd;
+ struct seccomp_notif_addfd addfd = {};
+ struct seccomp_notif req = {};
+ struct seccomp_notif_resp resp = {};
+
+ memfd = memfd_create("test", 0);
+ ASSERT_GE(memfd, 0);
+
+ ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+ ASSERT_EQ(0, ret) {
+ TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+ }
+
+ /* Check that the basic notification machinery works */
+ listener = user_trap_syscall(__NR_getppid,
+ SECCOMP_FILTER_FLAG_NEW_LISTENER);
+ ASSERT_GE(listener, 0);
+
+ pid = fork();
+ ASSERT_GE(pid, 0);
+
+ if (pid == 0)
+ exit(syscall(__NR_getppid) != USER_NOTIF_MAGIC);
+
+ ASSERT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+
+ addfd.size = sizeof(addfd);
+ addfd.fd = memfd;
+ addfd.fd_flags = O_CLOEXEC;
+ addfd.remote_fd = 0;
+ addfd.id = req.id;
+ addfd.flags = 0xff;
+
+ /* Verify bad flags cannot be set */
+ EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
+ EXPECT_EQ(errno, EINVAL);
+
+ /* Verify that remote_fd cannot be set without setting flags */
+ addfd.flags = 0;
+ addfd.remote_fd = 1;
+ EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
+ EXPECT_EQ(errno, EINVAL);
+
+ /* Verify we can set an arbitrary remote fd */
+ addfd.remote_fd = 0;
+
+ ret = ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);
+ EXPECT_GE(ret, 0);
+ EXPECT_EQ(filecmp(getpid(), pid, memfd, ret), 0);
+
+ /* Verify we can set a specific remote fd */
+ addfd.remote_fd = 42;
+ addfd.flags = SECCOMP_ADDFD_FLAG_SETFD;
+
+ EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), 42);
+ EXPECT_EQ(filecmp(getpid(), pid, memfd, 42), 0);
+
+ resp.id = req.id;
+ resp.error = 0;
+ resp.val = USER_NOTIF_MAGIC;
+
+ EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+
+ EXPECT_EQ(waitpid(pid, &status, 0), pid);
+ EXPECT_EQ(true, WIFEXITED(status));
+ EXPECT_EQ(0, WEXITSTATUS(status));
+
+ close(memfd);
+}
+
+TEST(user_notification_sendfd_goaway)
+{
+ pid_t pid, pid2;
+ long ret;
+ int status, listener, memfd;
+ struct seccomp_notif_addfd addfd = {};
+ struct seccomp_notif req = {};
+ struct seccomp_notif_resp resp = {};
+
+ memfd = memfd_create("test", 0);
+ ASSERT_GE(memfd, 0);
+
+ ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+ ASSERT_EQ(0, ret) {
+ TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+ }
+
+ /* Check that the basic notification machinery works */
+ listener = user_trap_syscall(__NR_getppid,
+ SECCOMP_FILTER_FLAG_NEW_LISTENER);
+ ASSERT_GE(listener, 0);
+
+ pid = fork();
+ ASSERT_GE(pid, 0);
+
+ if (pid == 0) {
+ syscall(__NR_getppid);
+ exit(0);
+ }
+
+ ASSERT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+
+
+ ASSERT_EQ(kill(pid, SIGSTOP), 0);
+
+ resp.id = req.id;
+ resp.error = 0;
+ resp.val = USER_NOTIF_MAGIC;
+
+ ASSERT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+ pid2 = fork();
+ if (pid2 == 0) {
+ sleep(1);
+ exit(kill(pid, SIGCONT) != 0);
+ }
+
+ /* Add FD should return ENOENT */
+ addfd.size = sizeof(addfd);
+ addfd.fd = memfd;
+ addfd.fd_flags = O_CLOEXEC;
+ addfd.remote_fd = 0;
+ addfd.id = req.id;
+ addfd.flags = 0;
+
+ EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
+ EXPECT_EQ(errno, ENOENT);
+
+ EXPECT_EQ(waitpid(pid2, &status, 0), pid2);
+ EXPECT_EQ(true, WIFEXITED(status));
+ EXPECT_EQ(0, WEXITSTATUS(status));
+
+ EXPECT_EQ(waitpid(pid, &status, 0), pid);
+ EXPECT_EQ(true, WIFEXITED(status));
+ EXPECT_EQ(0, WEXITSTATUS(status));
+
+ close(memfd);
+}
+
+TEST(user_notification_sendfd_rlimit)
+{
+ pid_t pid;
+ long ret;
+ int status, listener, memfd;
+ struct seccomp_notif_addfd addfd = {};
+ struct seccomp_notif req = {};
+ struct seccomp_notif_resp resp = {};
+ const struct rlimit lim = {
+ .rlim_cur = 0,
+ .rlim_max = 0,
+ };
+
+ memfd = memfd_create("test", 0);
+ ASSERT_GE(memfd, 0);
+
+ ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+ ASSERT_EQ(0, ret) {
+ TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+ }
+
+ /* Check that the basic notification machinery works */
+ listener = user_trap_syscall(__NR_getppid,
+ SECCOMP_FILTER_FLAG_NEW_LISTENER);
+ ASSERT_GE(listener, 0);
+
+ pid = fork();
+ ASSERT_GE(pid, 0);
+
+ if (pid == 0)
+ exit(syscall(__NR_getppid) != USER_NOTIF_MAGIC);
+
+
+ ASSERT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+
+ ASSERT_EQ(prlimit(pid, RLIMIT_NOFILE, &lim, NULL), 0);
+
+ addfd.size = sizeof(addfd);
+ addfd.fd = memfd;
+ addfd.fd_flags = O_CLOEXEC;
+ addfd.remote_fd = 0;
+ addfd.id = req.id;
+ addfd.flags = 0;
+
+ /* Should probably spot check /proc/sys/fs/file-nr */
+ EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
+ EXPECT_EQ(errno, EMFILE);
+
+ addfd.remote_fd = 100;
+ addfd.flags = SECCOMP_ADDFD_FLAG_SETFD;
+ EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
+ EXPECT_EQ(errno, EBADF);
+
+ resp.id = req.id;
+ resp.error = 0;
+ resp.val = USER_NOTIF_MAGIC;
+
+ EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+
+ EXPECT_EQ(waitpid(pid, &status, 0), pid);
+ EXPECT_EQ(true, WIFEXITED(status));
+ EXPECT_EQ(0, WEXITSTATUS(status));
+
+ close(memfd);
+}
+
/*
* TODO:
* - add microbenchmarks
--
2.25.1
On Sun, May 24, 2020 at 04:39:39PM -0700, Sargun Dhillon wrote:
> +static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd)
> +{
> + int ret;
> +
> + /*
> + * Remove the notification, and reset the list pointers, indicating
> + * that it has been handled.
> + */
> + list_del_init(&addfd->list);
> +
> + ret = security_file_receive(addfd->file);
> + if (ret)
> + goto out;
> +
> + if (addfd->fd >= 0) {
> + ret = replace_fd(addfd->fd, addfd->file, addfd->flags);
> + if (ret >= 0)
> + fput(addfd->file);
> + } else {
> + ret = get_unused_fd_flags(addfd->flags);
> + if (ret >= 0)
> + fd_install(ret, addfd->file);
> + }
> +
> +out:
> + addfd->ret = ret;
> + complete(&addfd->completion);
> +}
My previous comment about SCM_RIGHTS still applies, right? That is, we
should do,
sock = sock_from_file(fp[i], &err);
if (sock) {
sock_update_netprioidx(&sock->sk->sk_cgrp_data);
sock_update_classid(&sock->sk->sk_cgrp_data);
}
and perhaps lift that into a helper.
Tycho
On Sun, May 24, 2020 at 04:39:39PM -0700, Sargun Dhillon wrote:
> +static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd)
> +{
> + int ret;
> +
> + /*
> + * Remove the notification, and reset the list pointers, indicating
> + * that it has been handled.
> + */
> + list_del_init(&addfd->list);
> +
> + ret = security_file_receive(addfd->file);
> + if (ret)
> + goto out;
> +
> + if (addfd->fd >= 0) {
> + ret = replace_fd(addfd->fd, addfd->file, addfd->flags);
> + if (ret >= 0)
> + fput(addfd->file);
> + } else {
> + ret = get_unused_fd_flags(addfd->flags);
> + if (ret >= 0)
> + fd_install(ret, addfd->file);
Bad refcounting rules. *IF* we go with anything of that sort (and I'm not
convinced that the entire series makes sense), it's better to have more
uniform rules re reference consumption/disposal.
Make the destructor of addfd *ALWAYS* drop its reference. And have this
function go
if (addfd->fd >= 0) {
ret = replace_fd(addfd->fd, addfd->file, addfd->flags);
} else {
ret = get_unused_fd_flags(addfd->flags);
if (ret >= 0)
fd_install(ret, get_file(addfd->file));
}
This introduces another call to addfd, in which the move flag is set. It
may make sense to setup a cgroup v1 hierarchy, and check that the
netprioidx is changed.
Signed-off-by: Sargun Dhillon <[email protected]>
Cc: Matt Denton <[email protected]>
Cc: Kees Cook <[email protected]>,
Cc: Jann Horn <[email protected]>,
Cc: Robert Sesek <[email protected]>,
Cc: Chris Palmer <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Tycho Andersen <[email protected]>
---
tools/testing/selftests/seccomp/seccomp_bpf.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 1ec43fef2b93..f4b50cbbde42 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -188,6 +188,8 @@ struct seccomp_metadata {
/* valid flags for seccomp_notif_addfd */
#define SECCOMP_ADDFD_FLAG_SETFD (1UL << 0) /* Specify remote fd */
+#define SECCOMP_ADDFD_FLAG_MOVE (1UL << 1)
+
struct seccomp_notif {
__u64 id;
@@ -3756,6 +3758,12 @@ TEST(user_notification_sendfd)
EXPECT_GE(ret, 0);
EXPECT_EQ(filecmp(getpid(), pid, memfd, ret), 0);
+ /* Move the FD */
+ addfd.flags = SECCOMP_ADDFD_FLAG_MOVE;
+ ret = ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);
+ EXPECT_GE(ret, 0);
+ EXPECT_EQ(filecmp(getpid(), pid, memfd, ret), 0);
+
/* Verify we can set a specific remote fd */
addfd.remote_fd = 42;
addfd.flags = SECCOMP_ADDFD_FLAG_SETFD;
--
2.25.1
On Sun, May 24, 2020 at 04:39:38PM -0700, Sargun Dhillon wrote:
> This adds a helper which can iterate through a seccomp_filter to
> find a notification matching an ID. It removes several replicated
> chunks of code.
>
> Signed-off-by: Sargun Dhillon <[email protected]>
> Cc: Matt Denton <[email protected]>
> Cc: Kees Cook <[email protected]>,
> Cc: Jann Horn <[email protected]>,
> Cc: Robert Sesek <[email protected]>,
> Cc: Chris Palmer <[email protected]>
> Cc: Christian Brauner <[email protected]>
> Cc: Tycho Andersen <[email protected]>
> ---
> kernel/seccomp.c | 38 +++++++++++++++++++++-----------------
> 1 file changed, 21 insertions(+), 17 deletions(-)
>
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 55a6184f5990..f6ce94b7a167 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -1021,10 +1021,25 @@ static int seccomp_notify_release(struct inode *inode, struct file *file)
> return 0;
> }
>
> +/* must be called with notif_lock held */
> +static inline struct seccomp_knotif *
> +find_notification(struct seccomp_filter *filter, u64 id)
> +{
> + struct seccomp_knotif *cur;
> +
> + list_for_each_entry(cur, &filter->notif->notifications, list) {
> + if (cur->id == id)
> + return cur;
> + }
> +
> + return NULL;
> +}
I think there's also an instance of this in _send() that we can change
to use find_notification() as well.
Tycho
On Sun, May 24, 2020 at 05:57:32PM -0600, Tycho Andersen wrote:
> On Sun, May 24, 2020 at 04:39:39PM -0700, Sargun Dhillon wrote:
> > +static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd)
> > +{
> > + int ret;
> > +
> > + /*
> > + * Remove the notification, and reset the list pointers, indicating
> > + * that it has been handled.
> > + */
> > + list_del_init(&addfd->list);
> > +
> > + ret = security_file_receive(addfd->file);
> > + if (ret)
> > + goto out;
> > +
> > + if (addfd->fd >= 0) {
> > + ret = replace_fd(addfd->fd, addfd->file, addfd->flags);
> > + if (ret >= 0)
> > + fput(addfd->file);
> > + } else {
> > + ret = get_unused_fd_flags(addfd->flags);
> > + if (ret >= 0)
> > + fd_install(ret, addfd->file);
> > + }
> > +
> > +out:
> > + addfd->ret = ret;
> > + complete(&addfd->completion);
> > +}
>
> My previous comment about SCM_RIGHTS still applies, right? That is, we
> should do,
>
> sock = sock_from_file(fp[i], &err);
> if (sock) {
> sock_update_netprioidx(&sock->sk->sk_cgrp_data);
> sock_update_classid(&sock->sk->sk_cgrp_data);
> }
>
> and perhaps lift that into a helper.
Oh, and now I see the later patch. But is there a reason to separate
these? I can't think of one.
Tycho
On Sun, May 24, 2020 at 5:05 PM Al Viro <[email protected]> wrote:
>
> On Sun, May 24, 2020 at 04:39:39PM -0700, Sargun Dhillon wrote:
>
> Bad refcounting rules. *IF* we go with anything of that sort (and I'm not
> convinced that the entire series makes sense), it's better to have more
> uniform rules re reference consumption/disposal.
>
> Make the destructor of addfd *ALWAYS* drop its reference. And have this
> function go
Are you suggesting the in both the error, and non-error cases the ioctl
invoker side is responsible for fput'ing the final reference in both the
success and non-success cases? Would we take an extra reference
prior to fd_install?
>
> if (addfd->fd >= 0) {
> ret = replace_fd(addfd->fd, addfd->file, addfd->flags);
> } else {
> ret = get_unused_fd_flags(addfd->flags);
> if (ret >= 0)
> fd_install(ret, get_file(addfd->file));
> }
>
Wouldn't this result in consumption of reference in one case (fd_install),
and the fd still having a reference in the replace_fd case?
On Sun, May 24, 2020 at 05:27:58PM -0700, Sargun Dhillon wrote:
> > if (addfd->fd >= 0) {
> > ret = replace_fd(addfd->fd, addfd->file, addfd->flags);
> > } else {
> > ret = get_unused_fd_flags(addfd->flags);
> > if (ret >= 0)
> > fd_install(ret, get_file(addfd->file));
^^^^^^^^
> Wouldn't this result in consumption of reference in one case (fd_install),
> and the fd still having a reference in the replace_fd case?
On Sun, May 24, 2020 at 04:39:38PM -0700, Sargun Dhillon wrote:
> This adds a helper which can iterate through a seccomp_filter to
> find a notification matching an ID. It removes several replicated
> chunks of code.
>
> Signed-off-by: Sargun Dhillon <[email protected]>
> Cc: Matt Denton <[email protected]>
> Cc: Kees Cook <[email protected]>,
> Cc: Jann Horn <[email protected]>,
> Cc: Robert Sesek <[email protected]>,
> Cc: Chris Palmer <[email protected]>
> Cc: Christian Brauner <[email protected]>
> Cc: Tycho Andersen <[email protected]>
> ---
> kernel/seccomp.c | 38 +++++++++++++++++++++-----------------
> 1 file changed, 21 insertions(+), 17 deletions(-)
>
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 55a6184f5990..f6ce94b7a167 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -1021,10 +1021,25 @@ static int seccomp_notify_release(struct inode *inode, struct file *file)
> return 0;
> }
>
> +/* must be called with notif_lock held */
> +static inline struct seccomp_knotif *
> +find_notification(struct seccomp_filter *filter, u64 id)
> +{
> + struct seccomp_knotif *cur;
> +
> + list_for_each_entry(cur, &filter->notif->notifications, list) {
> + if (cur->id == id)
> + return cur;
> + }
> +
> + return NULL;
> +}
> +
> +
> static long seccomp_notify_recv(struct seccomp_filter *filter,
> void __user *buf)
> {
> - struct seccomp_knotif *knotif = NULL, *cur;
> + struct seccomp_knotif *knotif, *cur;
> struct seccomp_notif unotif;
> ssize_t ret;
>
> @@ -1078,14 +1093,8 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
> * may have died when we released the lock, so we need to make
> * sure it's still around.
> */
> - knotif = NULL;
> mutex_lock(&filter->notify_lock);
> - list_for_each_entry(cur, &filter->notif->notifications, list) {
> - if (cur->id == unotif.id) {
> - knotif = cur;
> - break;
> - }
> - }
> + knotif = find_notification(filter, unotif.id);
>
> if (knotif) {
> knotif->state = SECCOMP_NOTIFY_INIT;
> @@ -1150,7 +1159,7 @@ static long seccomp_notify_send(struct seccomp_filter *filter,
> static long seccomp_notify_id_valid(struct seccomp_filter *filter,
> void __user *buf)
> {
> - struct seccomp_knotif *knotif = NULL;
> + struct seccomp_knotif *knotif;
> u64 id;
> long ret;
>
> @@ -1162,15 +1171,10 @@ static long seccomp_notify_id_valid(struct seccomp_filter *filter,
> return ret;
>
> ret = -ENOENT;
> - list_for_each_entry(knotif, &filter->notif->notifications, list) {
> - if (knotif->id == id) {
> - if (knotif->state == SECCOMP_NOTIFY_SENT)
> - ret = 0;
> - goto out;
> - }
> - }
> + knotif = find_notification(filter, id);
> + if (knotif && knotif->state == SECCOMP_NOTIFY_SENT)
> + ret = 0;
Coul be a little nicer to have this be:
if (knotif && knotif->state == SECCOMP_NOTIFY_SENT)
ret = 0;
else
ret = -ENOENT;
or, if you want to keep the assignment out of the lock:
ret = -ENOENT;
ret = mutex_lock_interruptible(&filter->notify_lock);
if (ret < 0)
return ret;
knotif = find_notification(filter, id);
if (knotif && knotif->state == SECCOMP_NOTIFY_SENT)
ret = 0;
otherwise looks like a good cleanup to me.
On Sun, May 24, 2020 at 04:39:39PM -0700, Sargun Dhillon wrote:
> This adds a seccomp notifier ioctl which allows for the listener to "add"
> file descriptors to a process which originated a seccomp user
> notification. This allows calls like mount, and mknod to be "implemented",
> as the return value, and the arguments are data in memory. On the other
> hand, calls like connect can be "implemented" using pidfd_getfd.
>
> Unfortunately, there are calls which return file descriptors, like
> open, which are vulnerable to TOC-TOU attacks, and require that the
> more privileged supervisor can inspect the argument, and perform the
> syscall on behalf of the process generating the notifiation. This
> allows the file descriptor generated from that open call to be
> returned to the calling process.
>
> In addition, there is funcitonality to allow for replacement of
> specific file descriptors, following dup2-like semantics.
>
> Signed-off-by: Sargun Dhillon <[email protected]>
> Suggested-by: Matt Denton <[email protected]>
> Cc: Kees Cook <[email protected]>,
> Cc: Jann Horn <[email protected]>,
> Cc: Robert Sesek <[email protected]>,
> Cc: Chris Palmer <[email protected]>
> Cc: Christian Brauner <[email protected]>
> Cc: Tycho Andersen <[email protected]>
> ---
> include/uapi/linux/seccomp.h | 25 ++++++
> kernel/seccomp.c | 169 ++++++++++++++++++++++++++++++++++-
> 2 files changed, 193 insertions(+), 1 deletion(-)
>
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index c1735455bc53..7d450a9e4c29 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -113,6 +113,27 @@ struct seccomp_notif_resp {
> __u32 flags;
> };
>
> +/* valid flags for seccomp_notif_addfd */
> +#define SECCOMP_ADDFD_FLAG_SETFD (1UL << 0) /* Specify remote fd */
> +
> +/**
> + * struct seccomp_notif_addfd
> + * @size: The size of the seccomp_notif_addfd datastructure
> + * @fd: The local fd number
> + * @id: The ID of the seccomp notification
> + * @fd_flags: Flags the remote FD should be allocated under
> + * @remote_fd: Optional remote FD number if SETFD option is set, otherwise 0.
> + * @flags: SECCOMP_ADDFD_FLAG_*
> + */
> +struct seccomp_notif_addfd {
> + __u32 size;
> + __u32 fd;
> + __u64 id;
> + __u32 fd_flags;
> + __u32 remote_fd;
> + __u64 flags;
> +};
This was a little confusing to me at first. So fd is the fd from which
we take the struct file and remote_fd is either -1 at which point we
just allocate the next free fd number and if it is not we
allocate/replace a specific one. Maybe it would be clearer if we did:
struct seccomp_notif_addfd {
__u32 size;
__u64 id;
__u64 flags;
__u32 srcfd;
__u32 newfd;
__u32 newfd_flags;
};
No need to hide in the name that this is remote_dup2().
> +
> #define SECCOMP_IOC_MAGIC '!'
> #define SECCOMP_IO(nr) _IO(SECCOMP_IOC_MAGIC, nr)
> #define SECCOMP_IOR(nr, type) _IOR(SECCOMP_IOC_MAGIC, nr, type)
> @@ -124,4 +145,8 @@ struct seccomp_notif_resp {
> #define SECCOMP_IOCTL_NOTIF_SEND SECCOMP_IOWR(1, \
> struct seccomp_notif_resp)
> #define SECCOMP_IOCTL_NOTIF_ID_VALID SECCOMP_IOR(2, __u64)
> +/* On success, the return value is the remote process's added fd number */
> +#define SECCOMP_IOCTL_NOTIF_ADDFD SECCOMP_IOR(3, \
> + struct seccomp_notif_addfd)
> +
> #endif /* _UAPI_LINUX_SECCOMP_H */
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index f6ce94b7a167..88940eeabaee 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -77,10 +77,42 @@ struct seccomp_knotif {
> long val;
> u32 flags;
>
> - /* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
> + /*
> + * Signals when this has changed states, such as the listener
> + * dying, a new seccomp addfd message, or changing to REPLIED
> + */
> struct completion ready;
>
> struct list_head list;
> +
> + /* outstanding addfd requests */
> + struct list_head addfd;
> +};
> +
> +/**
> + * struct seccomp_kaddfd - contianer for seccomp_addfd ioctl messages
^^^ typo
> + *
> + * @file: A reference to the file to install in the other task
> + * @fd: The fd number to install it at. If the fd number is -1, it means the
> + * installing process should allocate the fd as normal.
> + * @flags: The flags for the new file descriptor. At the moment, only O_CLOEXEC
> + * is allowed.
> + * @ret: The return value of the installing process. It is set to the fd num
> + * upon success (>= 0).
> + * @completion: Indicates that the installing process has completed fd
> + * installation, or gone away (either due to successful
> + * reply, or signal)
> + *
> + */
> +struct seccomp_kaddfd {
> + struct file *file;
> + int fd;
> + unsigned int flags;
> +
> + /* To only be set on reply */
> + int ret;
> + struct completion completion;
> + struct list_head list;
> };
>
> /**
> @@ -735,6 +767,35 @@ static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
> return filter->notif->next_id++;
> }
>
> +static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd)
> +{
> + int ret;
> +
> + /*
> + * Remove the notification, and reset the list pointers, indicating
> + * that it has been handled.
> + */
> + list_del_init(&addfd->list);
> +
> + ret = security_file_receive(addfd->file);
> + if (ret)
> + goto out;
> +
> + if (addfd->fd >= 0) {
> + ret = replace_fd(addfd->fd, addfd->file, addfd->flags);
> + if (ret >= 0)
> + fput(addfd->file);
> + } else {
> + ret = get_unused_fd_flags(addfd->flags);
> + if (ret >= 0)
> + fd_install(ret, addfd->file);
> + }
> +
> +out:
> + addfd->ret = ret;
> + complete(&addfd->completion);
> +}
> +
> static int seccomp_do_user_notification(int this_syscall,
> struct seccomp_filter *match,
> const struct seccomp_data *sd)
> @@ -743,6 +804,7 @@ static int seccomp_do_user_notification(int this_syscall,
> u32 flags = 0;
> long ret = 0;
> struct seccomp_knotif n = {};
> + struct seccomp_kaddfd *addfd, *tmp;
>
> mutex_lock(&match->notify_lock);
> err = -ENOSYS;
> @@ -755,6 +817,7 @@ static int seccomp_do_user_notification(int this_syscall,
> n.id = seccomp_next_notify_id(match);
> init_completion(&n.ready);
> list_add(&n.list, &match->notif->notifications);
> + INIT_LIST_HEAD(&n.addfd);
>
> up(&match->notif->request);
> wake_up_poll(&match->notif->wqh, EPOLLIN | EPOLLRDNORM);
> @@ -763,14 +826,31 @@ static int seccomp_do_user_notification(int this_syscall,
> /*
> * This is where we wait for a reply from userspace.
> */
> +wait:
> err = wait_for_completion_interruptible(&n.ready);
> mutex_lock(&match->notify_lock);
> if (err == 0) {
> + /* Check if we were woken up by a addfd message */
> + addfd = list_first_entry_or_null(&n.addfd,
> + struct seccomp_kaddfd, list);
> + if (addfd && n.state != SECCOMP_NOTIFY_REPLIED) {
> + seccomp_handle_addfd(addfd);
> + mutex_unlock(&match->notify_lock);
> + goto wait;
> + }
> ret = n.val;
> err = n.error;
> flags = n.flags;
> }
>
> + /* If there were any pending addfd calls, clear them out */
> + list_for_each_entry_safe(addfd, tmp, &n.addfd, list) {
> + /* The process went away before we got a chance to handle it */
> + addfd->ret = -ENOENT;
Looks like it should be -ESRCH?
> + list_del_init(&addfd->list);
> + complete(&addfd->completion);
> + }
> +
> /*
> * Note that it's possible the listener died in between the time when
> * we were notified of a respons (or a signal) and when we were able to
> @@ -1179,6 +1259,91 @@ static long seccomp_notify_id_valid(struct seccomp_filter *filter,
> return ret;
> }
>
> +static long seccomp_notify_addfd(struct seccomp_filter *filter,
> + struct seccomp_notif_addfd __user *uaddfd)
> +{
> + struct seccomp_notif_addfd addfd;
> + struct seccomp_knotif *knotif;
> + struct seccomp_kaddfd kaddfd;
> + u32 size;
> + int ret;
> +
> + ret = get_user(size, &uaddfd->size);
> + if (ret)
> + return ret;
> +
> + ret = copy_struct_from_user(&addfd, sizeof(addfd), uaddfd, size);
> + if (ret)
> + return ret;
> +
> + if (addfd.fd_flags & (~O_CLOEXEC))
> + return -EINVAL;
> +
> + if (addfd.flags & ~(SECCOMP_ADDFD_FLAG_SETFD))
> + return -EINVAL;
I'd suggest to remove the brackets and just do
if (addfd.flags & ~SECCOMP_ADDFD_FLAG_SETFD)
that's more common in the kernel.
> +
> + if (addfd.remote_fd && !(addfd.flags & SECCOMP_ADDFD_FLAG_SETFD))
> + return -EINVAL;
> +
> + kaddfd.file = fget(addfd.fd);
> + if (!kaddfd.file)
> + return -EBADF;
> +
> + kaddfd.flags = addfd.fd_flags;
> + kaddfd.fd = (addfd.flags & SECCOMP_ADDFD_FLAG_SETFD) ?
> + addfd.remote_fd : -1;
> + init_completion(&kaddfd.completion);
> +
> + ret = mutex_lock_interruptible(&filter->notify_lock);
> + if (ret)
The patter in all other places in seccomp is to compare against < 0, so
if (ret < 0)
goto out;
I think.
> + goto out;
> +
> + knotif = find_notification(filter, addfd.id);
> + /*
> + * We do not want to allow for FD injection to occur before the
> + * notification has been picked up by a userspace handler, or after
> + * the notification has been replied to.
> + */
> + if (!knotif || knotif->state != SECCOMP_NOTIFY_SENT) {
> + ret = -ENOENT;
> + goto out_unlock;
> + }
if (!knotif) {
ret = -ESRCH;
goto out_unlock;
}
if (knotif->state != SECCOMP_NOTIFY_SENT) {
ret = -EBUSY/-EINVAL;
goto out_unlock;
}
so we don't overload errors?
> +
> + list_add(&kaddfd.list, &knotif->addfd);
> + complete(&knotif->ready);
> + mutex_unlock(&filter->notify_lock);
> +
> + /* Now we wait for it to be processed */
> + ret = wait_for_completion_interruptible(&kaddfd.completion);
> + if (ret == 0) {
> + /*
> + * We had a successful completion. The other side has already
> + * removed us from the addfd queue, and
> + * wait_for_completion_interruptible has a memory barrier.
> + */
> + ret = kaddfd.ret;
> + goto out;
> + }
> +
> + mutex_lock(&filter->notify_lock);
> + /*
> + * Even though we were woken up by a signal, and not a successful
> + * completion, a completion may have happened in the mean time.
> + */
> + if (list_empty(&kaddfd.list))
> + ret = kaddfd.ret;
> + else
> + list_del(&kaddfd.list);
> +
> +out_unlock:
> + mutex_unlock(&filter->notify_lock);
> +out:
> + if (ret < 0)
> + fput(kaddfd.file);
> +
> + return ret;
> +}
> +
> static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
> unsigned long arg)
> {
> @@ -1192,6 +1357,8 @@ static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
> return seccomp_notify_send(filter, buf);
> case SECCOMP_IOCTL_NOTIF_ID_VALID:
> return seccomp_notify_id_valid(filter, buf);
> + case SECCOMP_IOCTL_NOTIF_ADDFD:
> + return seccomp_notify_addfd(filter, buf);
> default:
> return -EINVAL;
> }
> --
> 2.25.1
>
On Sun, May 24, 2020 at 04:39:41PM -0700, Sargun Dhillon wrote:
> Certain files, when moved to another process have metadata changed, such
> as netprioidx, and classid. This is the default behaviour in sending
> sockets with SCM_RIGHTS over unix sockets. Depending on the usecase,
> this may or may not be desirable with the addfd ioctl. This allows
> the user to opt-in.
>
> Signed-off-by: Sargun Dhillon <[email protected]>
> Suggested-by: Tycho Andersen <[email protected]>
> Cc: Matt Denton <[email protected]>
> Cc: Kees Cook <[email protected]>,
> Cc: Jann Horn <[email protected]>,
> Cc: Robert Sesek <[email protected]>,
> Cc: Chris Palmer <[email protected]>
> Cc: Christian Brauner <[email protected]>
> ---
> include/uapi/linux/seccomp.h | 8 ++++++++
> kernel/seccomp.c | 31 +++++++++++++++++++++++++++----
> 2 files changed, 35 insertions(+), 4 deletions(-)
>
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index 7d450a9e4c29..ccd1c960372a 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -115,6 +115,14 @@ struct seccomp_notif_resp {
>
> /* valid flags for seccomp_notif_addfd */
> #define SECCOMP_ADDFD_FLAG_SETFD (1UL << 0) /* Specify remote fd */
> +/*
> + * Certain file descriptors are behave differently depending on the process
"do behave"?
> + * they are created in. Specifcally, sockets, and their interactions with the
> + * net_cls and net_prio cgroup v1 controllers. This "moves" the file descriptor
> + * so that it takes on the cgroup controller's configuration in the process
> + * that the file descriptor is being added to.
> + */
> +#define SECCOMP_ADDFD_FLAG_MOVE (1UL << 1)
I'm not happy about the name because "moving" has much more to do with
transferring ownership than what we are doing here. After a "move" the
fd shouldn't be valid anymore. But that might just be my thinking.
But why make this opt-in and not do it exactly like when you send around
fds and make this mandatory?
> > + * they are created in. Specifcally, sockets, and their interactions with the
> > + * net_cls and net_prio cgroup v1 controllers. This "moves" the file descriptor
> > + * so that it takes on the cgroup controller's configuration in the process
> > + * that the file descriptor is being added to.
> > + */
> > +#define SECCOMP_ADDFD_FLAG_MOVE (1UL << 1)
>
> I'm not happy about the name because "moving" has much more to do with
> transferring ownership than what we are doing here. After a "move" the
> fd shouldn't be valid anymore. But that might just be my thinking.
>
> But why make this opt-in and not do it exactly like when you send around
> fds and make this mandatory?
Based upon Tycho's comments in an offline thread, I'm going to make
this the default
(setting the cgroup metadata) to mirror what SCM_RIGHTS does, and then
if we come
up with a good use case where we need to preserve *cgroup v1*
metadata, then we can
add an opt-out flag in the future.
On Mon, May 25, 2020 at 6:50 AM Christian Brauner
<[email protected]> wrote:
>
> On Sun, May 24, 2020 at 04:39:39PM -0700, Sargun Dhillon wrote:
> > This adds a seccomp notifier ioctl which allows for the listener to "add"
> > file descriptors to a process which originated a seccomp user
> > notification. This allows calls like mount, and mknod to be "implemented",
> > as the return value, and the arguments are data in memory. On the other
> > hand, calls like connect can be "implemented" using pidfd_getfd.
> >
> > Unfortunately, there are calls which return file descriptors, like
> > open, which are vulnerable to TOC-TOU attacks, and require that the
> > more privileged supervisor can inspect the argument, and perform the
> > syscall on behalf of the process generating the notifiation. This
> > allows the file descriptor generated from that open call to be
> > returned to the calling process.
> >
> > In addition, there is funcitonality to allow for replacement of
> > specific file descriptors, following dup2-like semantics.
> >
> > Signed-off-by: Sargun Dhillon <[email protected]>
> > Suggested-by: Matt Denton <[email protected]>
> > Cc: Kees Cook <[email protected]>,
> > Cc: Jann Horn <[email protected]>,
> > Cc: Robert Sesek <[email protected]>,
> > Cc: Chris Palmer <[email protected]>
> > Cc: Christian Brauner <[email protected]>
> > Cc: Tycho Andersen <[email protected]>
> > ---
> > include/uapi/linux/seccomp.h | 25 ++++++
> > kernel/seccomp.c | 169 ++++++++++++++++++++++++++++++++++-
> > 2 files changed, 193 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> > index c1735455bc53..7d450a9e4c29 100644
> > --- a/include/uapi/linux/seccomp.h
> > +++ b/include/uapi/linux/seccomp.h
> > @@ -113,6 +113,27 @@ struct seccomp_notif_resp {
> > __u32 flags;
> > };
> >
> > +/* valid flags for seccomp_notif_addfd */
> > +#define SECCOMP_ADDFD_FLAG_SETFD (1UL << 0) /* Specify remote fd */
> > +
> > +/**
> > + * struct seccomp_notif_addfd
> > + * @size: The size of the seccomp_notif_addfd datastructure
> > + * @fd: The local fd number
> > + * @id: The ID of the seccomp notification
> > + * @fd_flags: Flags the remote FD should be allocated under
> > + * @remote_fd: Optional remote FD number if SETFD option is set, otherwise 0.
> > + * @flags: SECCOMP_ADDFD_FLAG_*
> > + */
> > +struct seccomp_notif_addfd {
> > + __u32 size;
> > + __u32 fd;
> > + __u64 id;
> > + __u32 fd_flags;
> > + __u32 remote_fd;
> > + __u64 flags;
> > +};
>
> This was a little confusing to me at first. So fd is the fd from which
> we take the struct file and remote_fd is either -1 at which point we
> just allocate the next free fd number and if it is not we
> allocate/replace a specific one. Maybe it would be clearer if we did:
>
> struct seccomp_notif_addfd {
> __u32 size;
> __u64 id;
> __u64 flags;
> __u32 srcfd;
> __u32 newfd;
> __u32 newfd_flags;
> };
>
> No need to hide in the name that this is remote_dup2().
>
> > +
> > #define SECCOMP_IOC_MAGIC '!'
> > #define SECCOMP_IO(nr) _IO(SECCOMP_IOC_MAGIC, nr)
> > #define SECCOMP_IOR(nr, type) _IOR(SECCOMP_IOC_MAGIC, nr, type)
> > @@ -124,4 +145,8 @@ struct seccomp_notif_resp {
> > #define SECCOMP_IOCTL_NOTIF_SEND SECCOMP_IOWR(1, \
> > struct seccomp_notif_resp)
> > #define SECCOMP_IOCTL_NOTIF_ID_VALID SECCOMP_IOR(2, __u64)
> > +/* On success, the return value is the remote process's added fd number */
> > +#define SECCOMP_IOCTL_NOTIF_ADDFD SECCOMP_IOR(3, \
> > + struct seccomp_notif_addfd)
> > +
> > #endif /* _UAPI_LINUX_SECCOMP_H */
> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > index f6ce94b7a167..88940eeabaee 100644
> > --- a/kernel/seccomp.c
> > +++ b/kernel/seccomp.c
> > @@ -77,10 +77,42 @@ struct seccomp_knotif {
> > long val;
> > u32 flags;
> >
> > - /* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
> > + /*
> > + * Signals when this has changed states, such as the listener
> > + * dying, a new seccomp addfd message, or changing to REPLIED
> > + */
> > struct completion ready;
> >
> > struct list_head list;
> > +
> > + /* outstanding addfd requests */
> > + struct list_head addfd;
> > +};
> > +
> > +/**
> > + * struct seccomp_kaddfd - contianer for seccomp_addfd ioctl messages
>
> ^^^ typo
>
> > + *
> > + * @file: A reference to the file to install in the other task
> > + * @fd: The fd number to install it at. If the fd number is -1, it means the
> > + * installing process should allocate the fd as normal.
> > + * @flags: The flags for the new file descriptor. At the moment, only O_CLOEXEC
> > + * is allowed.
> > + * @ret: The return value of the installing process. It is set to the fd num
> > + * upon success (>= 0).
> > + * @completion: Indicates that the installing process has completed fd
> > + * installation, or gone away (either due to successful
> > + * reply, or signal)
> > + *
> > + */
> > +struct seccomp_kaddfd {
> > + struct file *file;
> > + int fd;
> > + unsigned int flags;
> > +
> > + /* To only be set on reply */
> > + int ret;
> > + struct completion completion;
> > + struct list_head list;
> > };
> >
> > /**
> > @@ -735,6 +767,35 @@ static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
> > return filter->notif->next_id++;
> > }
> >
> > +static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd)
> > +{
> > + int ret;
> > +
> > + /*
> > + * Remove the notification, and reset the list pointers, indicating
> > + * that it has been handled.
> > + */
> > + list_del_init(&addfd->list);
> > +
> > + ret = security_file_receive(addfd->file);
> > + if (ret)
> > + goto out;
> > +
> > + if (addfd->fd >= 0) {
> > + ret = replace_fd(addfd->fd, addfd->file, addfd->flags);
> > + if (ret >= 0)
> > + fput(addfd->file);
> > + } else {
> > + ret = get_unused_fd_flags(addfd->flags);
> > + if (ret >= 0)
> > + fd_install(ret, addfd->file);
> > + }
> > +
> > +out:
> > + addfd->ret = ret;
> > + complete(&addfd->completion);
> > +}
> > +
> > static int seccomp_do_user_notification(int this_syscall,
> > struct seccomp_filter *match,
> > const struct seccomp_data *sd)
> > @@ -743,6 +804,7 @@ static int seccomp_do_user_notification(int this_syscall,
> > u32 flags = 0;
> > long ret = 0;
> > struct seccomp_knotif n = {};
> > + struct seccomp_kaddfd *addfd, *tmp;
> >
> > mutex_lock(&match->notify_lock);
> > err = -ENOSYS;
> > @@ -755,6 +817,7 @@ static int seccomp_do_user_notification(int this_syscall,
> > n.id = seccomp_next_notify_id(match);
> > init_completion(&n.ready);
> > list_add(&n.list, &match->notif->notifications);
> > + INIT_LIST_HEAD(&n.addfd);
> >
> > up(&match->notif->request);
> > wake_up_poll(&match->notif->wqh, EPOLLIN | EPOLLRDNORM);
> > @@ -763,14 +826,31 @@ static int seccomp_do_user_notification(int this_syscall,
> > /*
> > * This is where we wait for a reply from userspace.
> > */
> > +wait:
> > err = wait_for_completion_interruptible(&n.ready);
> > mutex_lock(&match->notify_lock);
> > if (err == 0) {
> > + /* Check if we were woken up by a addfd message */
> > + addfd = list_first_entry_or_null(&n.addfd,
> > + struct seccomp_kaddfd, list);
> > + if (addfd && n.state != SECCOMP_NOTIFY_REPLIED) {
> > + seccomp_handle_addfd(addfd);
> > + mutex_unlock(&match->notify_lock);
> > + goto wait;
> > + }
> > ret = n.val;
> > err = n.error;
> > flags = n.flags;
> > }
> >
> > + /* If there were any pending addfd calls, clear them out */
> > + list_for_each_entry_safe(addfd, tmp, &n.addfd, list) {
> > + /* The process went away before we got a chance to handle it */
> > + addfd->ret = -ENOENT;
>
> Looks like it should be -ESRCH?
>
I'm a little confused on where we use ESRCH vs. ENOENT. It looks like
in the cookie
check (SECCOMP_IOCTL_NOTIF_ID_VALID), we return ENOENT on both error paths
-- whether the notification is missing, or whether the notification
was already replied to.
I originally had this as ESRCH, but switched to ENOENT to be
consistent with that API.
Do we want the API to disclose information about half-done /
incomplete notifications?
On Mon, May 25, 2020 at 11:59:18PM -0700, Sargun Dhillon wrote:
> On Mon, May 25, 2020 at 6:50 AM Christian Brauner
> <[email protected]> wrote:
> >
> > On Sun, May 24, 2020 at 04:39:39PM -0700, Sargun Dhillon wrote:
> > > This adds a seccomp notifier ioctl which allows for the listener to "add"
> > > file descriptors to a process which originated a seccomp user
> > > notification. This allows calls like mount, and mknod to be "implemented",
> > > as the return value, and the arguments are data in memory. On the other
> > > hand, calls like connect can be "implemented" using pidfd_getfd.
> > >
> > > Unfortunately, there are calls which return file descriptors, like
> > > open, which are vulnerable to TOC-TOU attacks, and require that the
> > > more privileged supervisor can inspect the argument, and perform the
> > > syscall on behalf of the process generating the notifiation. This
> > > allows the file descriptor generated from that open call to be
> > > returned to the calling process.
> > >
> > > In addition, there is funcitonality to allow for replacement of
> > > specific file descriptors, following dup2-like semantics.
> > >
> > > Signed-off-by: Sargun Dhillon <[email protected]>
> > > Suggested-by: Matt Denton <[email protected]>
> > > Cc: Kees Cook <[email protected]>,
> > > Cc: Jann Horn <[email protected]>,
> > > Cc: Robert Sesek <[email protected]>,
> > > Cc: Chris Palmer <[email protected]>
> > > Cc: Christian Brauner <[email protected]>
> > > Cc: Tycho Andersen <[email protected]>
> > > ---
> > > include/uapi/linux/seccomp.h | 25 ++++++
> > > kernel/seccomp.c | 169 ++++++++++++++++++++++++++++++++++-
> > > 2 files changed, 193 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> > > index c1735455bc53..7d450a9e4c29 100644
> > > --- a/include/uapi/linux/seccomp.h
> > > +++ b/include/uapi/linux/seccomp.h
> > > @@ -113,6 +113,27 @@ struct seccomp_notif_resp {
> > > __u32 flags;
> > > };
> > >
> > > +/* valid flags for seccomp_notif_addfd */
> > > +#define SECCOMP_ADDFD_FLAG_SETFD (1UL << 0) /* Specify remote fd */
> > > +
> > > +/**
> > > + * struct seccomp_notif_addfd
> > > + * @size: The size of the seccomp_notif_addfd datastructure
> > > + * @fd: The local fd number
> > > + * @id: The ID of the seccomp notification
> > > + * @fd_flags: Flags the remote FD should be allocated under
> > > + * @remote_fd: Optional remote FD number if SETFD option is set, otherwise 0.
> > > + * @flags: SECCOMP_ADDFD_FLAG_*
> > > + */
> > > +struct seccomp_notif_addfd {
> > > + __u32 size;
> > > + __u32 fd;
> > > + __u64 id;
> > > + __u32 fd_flags;
> > > + __u32 remote_fd;
> > > + __u64 flags;
> > > +};
> >
> > This was a little confusing to me at first. So fd is the fd from which
> > we take the struct file and remote_fd is either -1 at which point we
> > just allocate the next free fd number and if it is not we
> > allocate/replace a specific one. Maybe it would be clearer if we did:
> >
> > struct seccomp_notif_addfd {
> > __u32 size;
> > __u64 id;
> > __u64 flags;
> > __u32 srcfd;
> > __u32 newfd;
> > __u32 newfd_flags;
> > };
> >
> > No need to hide in the name that this is remote_dup2().
> >
> > > +
> > > #define SECCOMP_IOC_MAGIC '!'
> > > #define SECCOMP_IO(nr) _IO(SECCOMP_IOC_MAGIC, nr)
> > > #define SECCOMP_IOR(nr, type) _IOR(SECCOMP_IOC_MAGIC, nr, type)
> > > @@ -124,4 +145,8 @@ struct seccomp_notif_resp {
> > > #define SECCOMP_IOCTL_NOTIF_SEND SECCOMP_IOWR(1, \
> > > struct seccomp_notif_resp)
> > > #define SECCOMP_IOCTL_NOTIF_ID_VALID SECCOMP_IOR(2, __u64)
> > > +/* On success, the return value is the remote process's added fd number */
> > > +#define SECCOMP_IOCTL_NOTIF_ADDFD SECCOMP_IOR(3, \
> > > + struct seccomp_notif_addfd)
> > > +
> > > #endif /* _UAPI_LINUX_SECCOMP_H */
> > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > > index f6ce94b7a167..88940eeabaee 100644
> > > --- a/kernel/seccomp.c
> > > +++ b/kernel/seccomp.c
> > > @@ -77,10 +77,42 @@ struct seccomp_knotif {
> > > long val;
> > > u32 flags;
> > >
> > > - /* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
> > > + /*
> > > + * Signals when this has changed states, such as the listener
> > > + * dying, a new seccomp addfd message, or changing to REPLIED
> > > + */
> > > struct completion ready;
> > >
> > > struct list_head list;
> > > +
> > > + /* outstanding addfd requests */
> > > + struct list_head addfd;
> > > +};
> > > +
> > > +/**
> > > + * struct seccomp_kaddfd - contianer for seccomp_addfd ioctl messages
> >
> > ^^^ typo
> >
> > > + *
> > > + * @file: A reference to the file to install in the other task
> > > + * @fd: The fd number to install it at. If the fd number is -1, it means the
> > > + * installing process should allocate the fd as normal.
> > > + * @flags: The flags for the new file descriptor. At the moment, only O_CLOEXEC
> > > + * is allowed.
> > > + * @ret: The return value of the installing process. It is set to the fd num
> > > + * upon success (>= 0).
> > > + * @completion: Indicates that the installing process has completed fd
> > > + * installation, or gone away (either due to successful
> > > + * reply, or signal)
> > > + *
> > > + */
> > > +struct seccomp_kaddfd {
> > > + struct file *file;
> > > + int fd;
> > > + unsigned int flags;
> > > +
> > > + /* To only be set on reply */
> > > + int ret;
> > > + struct completion completion;
> > > + struct list_head list;
> > > };
> > >
> > > /**
> > > @@ -735,6 +767,35 @@ static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
> > > return filter->notif->next_id++;
> > > }
> > >
> > > +static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd)
> > > +{
> > > + int ret;
> > > +
> > > + /*
> > > + * Remove the notification, and reset the list pointers, indicating
> > > + * that it has been handled.
> > > + */
> > > + list_del_init(&addfd->list);
> > > +
> > > + ret = security_file_receive(addfd->file);
> > > + if (ret)
> > > + goto out;
> > > +
> > > + if (addfd->fd >= 0) {
> > > + ret = replace_fd(addfd->fd, addfd->file, addfd->flags);
> > > + if (ret >= 0)
> > > + fput(addfd->file);
> > > + } else {
> > > + ret = get_unused_fd_flags(addfd->flags);
> > > + if (ret >= 0)
> > > + fd_install(ret, addfd->file);
> > > + }
> > > +
> > > +out:
> > > + addfd->ret = ret;
> > > + complete(&addfd->completion);
> > > +}
> > > +
> > > static int seccomp_do_user_notification(int this_syscall,
> > > struct seccomp_filter *match,
> > > const struct seccomp_data *sd)
> > > @@ -743,6 +804,7 @@ static int seccomp_do_user_notification(int this_syscall,
> > > u32 flags = 0;
> > > long ret = 0;
> > > struct seccomp_knotif n = {};
> > > + struct seccomp_kaddfd *addfd, *tmp;
> > >
> > > mutex_lock(&match->notify_lock);
> > > err = -ENOSYS;
> > > @@ -755,6 +817,7 @@ static int seccomp_do_user_notification(int this_syscall,
> > > n.id = seccomp_next_notify_id(match);
> > > init_completion(&n.ready);
> > > list_add(&n.list, &match->notif->notifications);
> > > + INIT_LIST_HEAD(&n.addfd);
> > >
> > > up(&match->notif->request);
> > > wake_up_poll(&match->notif->wqh, EPOLLIN | EPOLLRDNORM);
> > > @@ -763,14 +826,31 @@ static int seccomp_do_user_notification(int this_syscall,
> > > /*
> > > * This is where we wait for a reply from userspace.
> > > */
> > > +wait:
> > > err = wait_for_completion_interruptible(&n.ready);
> > > mutex_lock(&match->notify_lock);
> > > if (err == 0) {
> > > + /* Check if we were woken up by a addfd message */
> > > + addfd = list_first_entry_or_null(&n.addfd,
> > > + struct seccomp_kaddfd, list);
> > > + if (addfd && n.state != SECCOMP_NOTIFY_REPLIED) {
> > > + seccomp_handle_addfd(addfd);
> > > + mutex_unlock(&match->notify_lock);
> > > + goto wait;
> > > + }
> > > ret = n.val;
> > > err = n.error;
> > > flags = n.flags;
> > > }
> > >
> > > + /* If there were any pending addfd calls, clear them out */
> > > + list_for_each_entry_safe(addfd, tmp, &n.addfd, list) {
> > > + /* The process went away before we got a chance to handle it */
> > > + addfd->ret = -ENOENT;
> >
> > Looks like it should be -ESRCH?
> >
> I'm a little confused on where we use ESRCH vs. ENOENT. It looks like
> in the cookie
> check (SECCOMP_IOCTL_NOTIF_ID_VALID), we return ENOENT on both error paths
> -- whether the notification is missing, or whether the notification
> was already replied to.
>
> I originally had this as ESRCH, but switched to ENOENT to be
> consistent with that API.
So, for the geth_nth_filter() it makes sense to me that it's ENOENT, for
id valid, maybe too, for notify_{recv,send} I'd be inclined to say this
should've been ESRCH. But whatever, it's too late for that and keeping
it consistent matters more. But...
> Do we want the API to disclose information about half-done /
> incomplete notifications?
we were doing that before, see notify_send
if (knotif->state != SECCOMP_NOTIFY_SENT) {
ret = -EINPROGRESS;
goto out;
}
we distinguish the !knotif case before that too. Might make sense to do
the same for the fd case.
Christian