2018-05-31 14:52:22

by Tycho Andersen

[permalink] [raw]
Subject: [PATCH v3 0/4] seccomp trap to userspace

Hi all,

Here's a v3 of the seccomp trap to userspace, with all the nits from v2
fixed. Open questions from v2 are still:

1. is it ok not to use netlink?
2. what should the fd passing API look like? (see patch notes on this
one for details of why the current one might (?) be a problem)

As an added bonus, I've also written some stress testing, with lots of
tasks and listeners (1000 of each) sharing the same notification thread,
and not found any issues so far. Code is here:
https://github.com/tych0/kernel-utils/blob/master/seccomp/notify_stress.c
although I haven't included it in the patchset.

v2: https://lkml.org/lkml/2018/5/17/627

Tycho Andersen (4):
seccomp: add a return code to trap to userspace
seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE
seccomp: add a way to get a listener fd from ptrace
seccomp: add support for passing fds via USER_NOTIF

arch/Kconfig | 7 +
include/linux/seccomp.h | 14 +-
include/uapi/linux/ptrace.h | 2 +
include/uapi/linux/seccomp.h | 20 +-
kernel/ptrace.c | 4 +
kernel/seccomp.c | 477 +++++++++++++++++-
tools/testing/selftests/seccomp/seccomp_bpf.c | 373 +++++++++++++-
7 files changed, 889 insertions(+), 8 deletions(-)

--
2.17.0



2018-05-31 14:51:48

by Tycho Andersen

[permalink] [raw]
Subject: [PATCH v3 4/4] seccomp: add support for passing fds via USER_NOTIF

The idea here is that the userspace handler should be able to pass an fd
back to the trapped task, for example so it can be returned from socket().

I've proposed one API here, but I'm open to other options. In particular,
this only lets you return an fd from a syscall, which may not be enough in
all cases. For example, if an fd is written to an output parameter instead
of returned, the current API can't handle this. Another case is that
netlink takes as input fds sometimes (IFLA_NET_NS_FD, e.g.). If netlink
ever decides to install an fd and output it, we wouldn't be able to handle
this either.

Still, the vast majority of interesting cases are covered by this API, so
perhaps it is Enough.

I've left it as a separate commit for two reasons:
* It illustrates the way in which we would grow struct seccomp_notif and
struct seccomp_notif_resp without using netlink
* It shows just how little code is needed to accomplish this :)

v2: new in v2
v3: no changes

Signed-off-by: Tycho Andersen <[email protected]>
CC: Kees Cook <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Oleg Nesterov <[email protected]>
CC: Eric W. Biederman <[email protected]>
CC: "Serge E. Hallyn" <[email protected]>
CC: Christian Brauner <[email protected]>
CC: Tyler Hicks <[email protected]>
CC: Akihiro Suda <[email protected]>
---
include/uapi/linux/seccomp.h | 2 +
kernel/seccomp.c | 49 +++++++-
tools/testing/selftests/seccomp/seccomp_bpf.c | 112 ++++++++++++++++++
3 files changed, 161 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 8160e6cad528..3124427219cb 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -71,6 +71,8 @@ struct seccomp_notif_resp {
__u64 id;
__s32 error;
__s64 val;
+ __u8 return_fd;
+ __u32 fd;
};

#endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 6dc99c65c2f4..2ee958b3efde 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -77,6 +77,8 @@ struct seccomp_knotif {
/* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
int error;
long val;
+ struct file *file;
+ unsigned int flags;

/* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
struct completion ready;
@@ -780,10 +782,32 @@ static void seccomp_do_user_notification(int this_syscall,
goto remove_list;
}

- ret = n.val;
- err = n.error;
+ if (n.file) {
+ int fd;
+
+ fd = get_unused_fd_flags(n.flags);
+ if (fd < 0) {
+ err = fd;
+ ret = -1;
+ goto remove_list;
+ }
+
+ ret = fd;
+ err = 0;
+
+ fd_install(fd, n.file);
+ /* Don't fput, since fd has a reference now */
+ n.file = NULL;
+ } else {
+ ret = n.val;
+ err = n.error;
+ }
+

remove_list:
+ if (n.file)
+ fput(n.file);
+
list_del(&n.list);
out:
mutex_unlock(&match->notify_lock);
@@ -1598,6 +1622,27 @@ static ssize_t seccomp_notify_write(struct file *file, const char __user *buf,
knotif->state = SECCOMP_NOTIFY_REPLIED;
knotif->error = resp.error;
knotif->val = resp.val;
+
+ if (resp.return_fd) {
+ struct fd fd;
+
+ /*
+ * This is a little hokey: we need a real fget() (i.e. not
+ * __fget_light(), which is what fdget does), but we also need
+ * the flags from strcut fd. So, we get it, put it, and get it
+ * again for real.
+ */
+ fd = fdget(resp.fd);
+ knotif->flags = fd.flags;
+ fdput(fd);
+
+ knotif->file = fget(resp.fd);
+ if (!knotif->file) {
+ ret = -EBADF;
+ goto out;
+ }
+ }
+
complete(&knotif->ready);
out:
mutex_unlock(&filter->notify_lock);
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 07984f7bd601..b9a4f676566d 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -167,6 +167,8 @@ struct seccomp_notif_resp {
__u64 id;
__s32 error;
__s64 val;
+ __u8 return_fd;
+ __u32 fd;
};
#endif

@@ -3176,6 +3178,116 @@ TEST(get_user_notification_ptrace)
close(listener);
}

+TEST(user_notification_pass_fd)
+{
+ pid_t pid;
+ int status, listener;
+ int sk_pair[2];
+ char c;
+ struct seccomp_notif req = {};
+ struct seccomp_notif_resp resp = {};
+ long ret;
+
+ ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+
+ pid = fork();
+ ASSERT_GE(pid, 0);
+
+ if (pid == 0) {
+ int fd;
+ char buf[16];
+
+ EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
+
+ /* Signal we're ready and have installed the filter. */
+ EXPECT_EQ(write(sk_pair[1], "J", 1), 1);
+
+ EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
+ EXPECT_EQ(c, 'H');
+ close(sk_pair[1]);
+
+ /* An fd from getpid(). Let the games begin. */
+ fd = syscall(__NR_getpid);
+ EXPECT_GT(fd, 0);
+ EXPECT_EQ(read(fd, buf, sizeof(buf)), 12);
+ close(fd);
+
+ exit(strcmp("hello world", buf));
+ }
+
+ EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
+ EXPECT_EQ(c, 'J');
+
+ EXPECT_EQ(ptrace(PTRACE_ATTACH, pid), 0);
+ EXPECT_EQ(waitpid(pid, NULL, 0), pid);
+ listener = ptrace(PTRACE_SECCOMP_GET_LISTENER, pid, 0);
+ EXPECT_GE(listener, 0);
+ EXPECT_EQ(ptrace(PTRACE_DETACH, pid, NULL, 0), 0);
+
+ /* Now signal we are done installing so it can do a getpid */
+ EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
+ close(sk_pair[0]);
+
+ /* Make a new socket pair so we can send half across */
+ EXPECT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+
+ ret = read_notif(listener, &req);
+ EXPECT_EQ(ret, sizeof(req));
+ EXPECT_EQ(errno, 0);
+
+ resp.id = req.id;
+ resp.return_fd = 1;
+ resp.fd = sk_pair[1];
+ EXPECT_EQ(write(listener, &resp, sizeof(resp)), sizeof(resp));
+ close(sk_pair[1]);
+
+ EXPECT_EQ(write(sk_pair[0], "hello world\0", 12), 12);
+ close(sk_pair[0]);
+
+ EXPECT_EQ(waitpid(pid, &status, 0), pid);
+ EXPECT_EQ(true, WIFEXITED(status));
+ EXPECT_EQ(0, WEXITSTATUS(status));
+ close(listener);
+}
+
+TEST(user_notification_struct_size_mismatch)
+{
+ pid_t pid;
+ long ret;
+ int status, listener, len;
+ struct seccomp_notif req;
+ struct seccomp_notif_resp resp;
+
+ listener = user_trap_syscall(__NR_getpid,
+ SECCOMP_FILTER_FLAG_GET_LISTENER);
+ EXPECT_GE(listener, 0);
+
+ pid = fork();
+ ASSERT_GE(pid, 0);
+
+ if (pid == 0) {
+ ret = syscall(__NR_getpid);
+ exit(ret != USER_NOTIF_MAGIC);
+ }
+
+ EXPECT_EQ(read(listener, &req, sizeof(req)), sizeof(req));
+
+ resp.id = req.id;
+ resp.error = 0;
+ resp.val = USER_NOTIF_MAGIC;
+
+ /*
+ * Only write a partial structure: this is what was available before we
+ * had fd support.
+ */
+ len = offsetof(struct seccomp_notif_resp, val) + sizeof(resp.val);
+ EXPECT_EQ(write(listener, &resp, len), len);
+
+ EXPECT_EQ(waitpid(pid, &status, 0), pid);
+ EXPECT_EQ(true, WIFEXITED(status));
+ EXPECT_EQ(0, WEXITSTATUS(status));
+}
+
/*
* TODO:
* - add microbenchmarks
--
2.17.0


2018-05-31 14:52:02

by Tycho Andersen

[permalink] [raw]
Subject: [PATCH v3 2/4] seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE

In the next commit we'll use this same mnemonic to get a listener for the
nth filter, so we need it available outside of CHECKPOINT_RESTORE. This is
slightly looser than necessary, because it really could be
CHECKPOINT_RESTORE || USER_NOTIFICATION, but it's declared static and this
complicates the code less, so hopefully it's ok.

v2: new in v2
v3: no changes

Signed-off-by: Tycho Andersen <[email protected]>
CC: Kees Cook <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Oleg Nesterov <[email protected]>
CC: Eric W. Biederman <[email protected]>
CC: "Serge E. Hallyn" <[email protected]>
CC: Christian Brauner <[email protected]>
CC: Tyler Hicks <[email protected]>
CC: Akihiro Suda <[email protected]>
---
kernel/seccomp.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index f69327d5f7c7..4a6d6bd80d07 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1168,7 +1168,7 @@ long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
return do_seccomp(op, 0, uargs);
}

-#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE)
+#if defined(CONFIG_SECCOMP_FILTER)
static struct seccomp_filter *get_nth_filter(struct task_struct *task,
unsigned long filter_off)
{
@@ -1215,6 +1215,7 @@ static struct seccomp_filter *get_nth_filter(struct task_struct *task,
return filter;
}

+#if defined(CONFIG_CHECKPOINT_RESTORE)
long seccomp_get_filter(struct task_struct *task, unsigned long filter_off,
void __user *data)
{
@@ -1287,7 +1288,8 @@ long seccomp_get_metadata(struct task_struct *task,
__put_seccomp_filter(filter);
return ret;
}
-#endif
+#endif /* CONFIG_CHECKPOINT_RESTORE */
+#endif /* CONFIG_SECCOMP_FILTER */

#ifdef CONFIG_SYSCTL

--
2.17.0


2018-05-31 14:53:02

by Tycho Andersen

[permalink] [raw]
Subject: [PATCH v3 1/4] seccomp: add a return code to trap to userspace

This patch introduces a means for syscalls matched in seccomp to notify
some other task that a particular filter has been triggered.

The motivation for this is primarily for use with containers. For example,
if a container does an init_module(), we obviously don't want to load this
untrusted code, which may be compiled for the wrong version of the kernel
anyway. Instead, we could parse the module image, figure out which module
the container is trying to load and load it on the host.

As another example, containers cannot mknod(), since this checks
capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
/dev/zero should be ok for containers to mknod, but we'd like to avoid hard
coding some whitelist in the kernel. Another example is mount(), which has
many security restrictions for good reason, but configuration or runtime
knowledge could potentially be used to relax these restrictions.

This patch adds functionality that is already possible via at least two
other means that I know about, both of which involve ptrace(): first, one
could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
Unfortunately this is slow, so a faster version would be to install a
filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
Since ptrace allows only one tracer, if the container runtime is that
tracer, users inside the container (or outside) trying to debug it will not
be able to use ptrace, which is annoying. It also means that older
distributions based on Upstart cannot boot inside containers using ptrace,
since upstart itself uses ptrace to start services.

The actual implementation of this is fairly small, although getting the
synchronization right was/is slightly complex.

Finally, it's worth noting that the classic seccomp TOCTOU of reading
memory data from the task still applies here, but can be avoided with
careful design of the userspace handler: if the userspace handler reads all
of the task memory that is necessary before applying its security policy,
the tracee's subsequent memory edits will not be read by the tracer.

v2: * make id a u64; the idea here being that it will never overflow,
because 64 is huge (one syscall every nanosecond => wrap every 584
years) (Andy)
* prevent nesting of user notifications: if someone is already attached
the tree in one place, nobody else can attach to the tree (Andy)
* notify the listener of signals the tracee receives as well (Andy)
* implement poll
v3: * lockdep fix (Oleg)
* drop unnecessary WARN()s (Christian)
* rearrange error returns to be more rpetty (Christian)
* fix build in !CONFIG_SECCOMP_USER_NOTIFICATION case

Signed-off-by: Tycho Andersen <[email protected]>
CC: Kees Cook <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Oleg Nesterov <[email protected]>
CC: Eric W. Biederman <[email protected]>
CC: "Serge E. Hallyn" <[email protected]>
CC: Christian Brauner <[email protected]>
CC: Tyler Hicks <[email protected]>
CC: Akihiro Suda <[email protected]>
---
arch/Kconfig | 7 +
include/linux/seccomp.h | 3 +-
include/uapi/linux/seccomp.h | 18 +-
kernel/seccomp.c | 398 +++++++++++++++++-
tools/testing/selftests/seccomp/seccomp_bpf.c | 195 ++++++++-
5 files changed, 615 insertions(+), 6 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 75dd23acf133..1c1ae8d8c8b9 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -401,6 +401,13 @@ config SECCOMP_FILTER

See Documentation/prctl/seccomp_filter.txt for details.

+config SECCOMP_USER_NOTIFICATION
+ bool "Enable the SECCOMP_RET_USER_NOTIF seccomp action"
+ depends on SECCOMP_FILTER
+ help
+ Enable SECCOMP_RET_USER_NOTIF, a return code which can be used by seccomp
+ programs to notify a userspace listener that a particular event happened.
+
config HAVE_GCC_PLUGINS
bool
help
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index c723a5c4e3ff..0fd3e0676a1c 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -5,7 +5,8 @@
#include <uapi/linux/seccomp.h>

#define SECCOMP_FILTER_FLAG_MASK (SECCOMP_FILTER_FLAG_TSYNC | \
- SECCOMP_FILTER_FLAG_LOG)
+ SECCOMP_FILTER_FLAG_LOG | \
+ SECCOMP_FILTER_FLAG_GET_LISTENER)

#ifdef CONFIG_SECCOMP

diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 2a0bd9dd104d..8160e6cad528 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -17,8 +17,9 @@
#define SECCOMP_GET_ACTION_AVAIL 2

/* Valid flags for SECCOMP_SET_MODE_FILTER */
-#define SECCOMP_FILTER_FLAG_TSYNC 1
-#define SECCOMP_FILTER_FLAG_LOG 2
+#define SECCOMP_FILTER_FLAG_TSYNC 1
+#define SECCOMP_FILTER_FLAG_LOG 2
+#define SECCOMP_FILTER_FLAG_GET_LISTENER 4

/*
* All BPF programs must return a 32-bit value.
@@ -34,6 +35,7 @@
#define SECCOMP_RET_KILL SECCOMP_RET_KILL_THREAD
#define SECCOMP_RET_TRAP 0x00030000U /* disallow and force a SIGSYS */
#define SECCOMP_RET_ERRNO 0x00050000U /* returns an errno */
+#define SECCOMP_RET_USER_NOTIF 0x7fc00000U /* notifies userspace */
#define SECCOMP_RET_TRACE 0x7ff00000U /* pass to a tracer or disallow */
#define SECCOMP_RET_LOG 0x7ffc0000U /* allow after logging */
#define SECCOMP_RET_ALLOW 0x7fff0000U /* allow */
@@ -59,4 +61,16 @@ struct seccomp_data {
__u64 args[6];
};

+struct seccomp_notif {
+ __u64 id;
+ pid_t pid;
+ struct seccomp_data data;
+};
+
+struct seccomp_notif_resp {
+ __u64 id;
+ __s32 error;
+ __s64 val;
+};
+
#endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index dc77548167ef..f69327d5f7c7 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -31,6 +31,7 @@
#endif

#ifdef CONFIG_SECCOMP_FILTER
+#include <linux/file.h>
#include <linux/filter.h>
#include <linux/pid.h>
#include <linux/ptrace.h>
@@ -38,6 +39,52 @@
#include <linux/tracehook.h>
#include <linux/uaccess.h>

+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+#include <linux/anon_inodes.h>
+
+enum notify_state {
+ SECCOMP_NOTIFY_INIT,
+ SECCOMP_NOTIFY_SENT,
+ SECCOMP_NOTIFY_REPLIED,
+};
+
+struct seccomp_knotif {
+ /* The pid whose filter triggered the notification */
+ pid_t pid;
+
+ /*
+ * The "cookie" for this request; this is unique for this filter.
+ */
+ u32 id;
+
+ /*
+ * The seccomp data. This pointer is valid the entire time this
+ * notification is active, since it comes from __seccomp_filter which
+ * eclipses the entire lifecycle here.
+ */
+ const struct seccomp_data *data;
+
+ /*
+ * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
+ * struct seccomp_knotif is created and starts out in INIT. Once the
+ * handler reads the notification off of an FD, it transitions to READ.
+ * If a signal is received the state transitions back to INIT and
+ * another message is sent. When the userspace handler replies, state
+ * transitions to REPLIED.
+ */
+ enum notify_state state;
+
+ /* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
+ int error;
+ long val;
+
+ /* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
+ struct completion ready;
+
+ struct list_head list;
+};
+#endif
+
/**
* struct seccomp_filter - container for seccomp BPF programs
*
@@ -64,6 +111,27 @@ struct seccomp_filter {
bool log;
struct seccomp_filter *prev;
struct bpf_prog *prog;
+
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+ /*
+ * A semaphore that users of this notification can wait on for
+ * changes. Actual reads and writes are still controlled with
+ * filter->notify_lock.
+ */
+ struct semaphore request;
+
+ /* A lock for all notification-related accesses. */
+ struct mutex notify_lock;
+
+ /* Is there currently an attached listener? */
+ bool has_listener;
+
+ /* The id of the next request. */
+ u64 next_id;
+
+ /* A list of struct seccomp_knotif elements. */
+ struct list_head notifications;
+#endif
};

/* Limit any path through the tree to 256KB worth of instructions. */
@@ -383,6 +451,13 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
if (!sfilter)
return ERR_PTR(-ENOMEM);

+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+ mutex_init(&sfilter->notify_lock);
+ sema_init(&sfilter->request, 0);
+ INIT_LIST_HEAD(&sfilter->notifications);
+ sfilter->next_id = get_random_u64();
+#endif
+
ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
seccomp_check_filter, save_orig);
if (ret < 0) {
@@ -547,13 +622,15 @@ static void seccomp_send_sigsys(int syscall, int reason)
#define SECCOMP_LOG_TRACE (1 << 4)
#define SECCOMP_LOG_LOG (1 << 5)
#define SECCOMP_LOG_ALLOW (1 << 6)
+#define SECCOMP_LOG_USER_NOTIF (1 << 7)

static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS |
SECCOMP_LOG_KILL_THREAD |
SECCOMP_LOG_TRAP |
SECCOMP_LOG_ERRNO |
SECCOMP_LOG_TRACE |
- SECCOMP_LOG_LOG;
+ SECCOMP_LOG_LOG |
+ SECCOMP_LOG_USER_NOTIF;

static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
bool requested)
@@ -572,6 +649,9 @@ static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
case SECCOMP_RET_TRACE:
log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE;
break;
+ case SECCOMP_RET_USER_NOTIF:
+ log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF;
+ break;
case SECCOMP_RET_LOG:
log = seccomp_actions_logged & SECCOMP_LOG_LOG;
break;
@@ -645,6 +725,81 @@ void secure_computing_strict(int this_syscall)
}
#else

+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
+{
+ /* Note: overflow is ok here, the id just needs to be unique */
+ return filter->next_id++;
+}
+
+static void seccomp_do_user_notification(int this_syscall,
+ struct seccomp_filter *match,
+ const struct seccomp_data *sd)
+{
+ int err;
+ long ret = 0;
+ struct seccomp_knotif n = {};
+
+ mutex_lock(&match->notify_lock);
+ err = -ENOSYS;
+ if (!match->has_listener)
+ goto out;
+
+ n.pid = current->pid;
+ n.state = SECCOMP_NOTIFY_INIT;
+ n.data = sd;
+ n.id = seccomp_next_notify_id(match);
+ init_completion(&n.ready);
+
+ list_add(&n.list, &match->notifications);
+
+ mutex_unlock(&match->notify_lock);
+ up(&match->request);
+
+ err = wait_for_completion_interruptible(&n.ready);
+ mutex_lock(&match->notify_lock);
+
+ /*
+ * Here it's possible we got a signal and then had to wait on the mutex
+ * while the reply was sent, so let's be sure there wasn't a response
+ * in the meantime.
+ */
+ if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
+ /*
+ * We got a signal. Let's tell userspace about it (potentially
+ * again, if we had already notified them about the first one).
+ */
+ if (n.state == SECCOMP_NOTIFY_SENT) {
+ n.state = SECCOMP_NOTIFY_INIT;
+ up(&match->request);
+ }
+ mutex_unlock(&match->notify_lock);
+ err = wait_for_completion_killable(&n.ready);
+ mutex_lock(&match->notify_lock);
+ if (err < 0)
+ goto remove_list;
+ }
+
+ ret = n.val;
+ err = n.error;
+
+remove_list:
+ list_del(&n.list);
+out:
+ mutex_unlock(&match->notify_lock);
+ syscall_set_return_value(current, task_pt_regs(current),
+ err, ret);
+}
+#else
+static void seccomp_do_user_notification(int this_syscall,
+ struct seccomp_filter *match,
+ const struct seccomp_data *sd)
+{
+ seccomp_log(this_syscall, SIGSYS, SECCOMP_RET_USER_NOTIF, true);
+ do_exit(SIGSYS);
+}
+#endif
+
#ifdef CONFIG_SECCOMP_FILTER
static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
const bool recheck_after_trace)
@@ -722,6 +877,9 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,

return 0;

+ case SECCOMP_RET_USER_NOTIF:
+ seccomp_do_user_notification(this_syscall, match, sd);
+ goto skip;
case SECCOMP_RET_LOG:
seccomp_log(this_syscall, 0, action, true);
return 0;
@@ -828,6 +986,9 @@ static long seccomp_set_mode_strict(void)
}

#ifdef CONFIG_SECCOMP_FILTER
+static struct file *init_listener(struct task_struct *,
+ struct seccomp_filter *);
+
/**
* seccomp_set_mode_filter: internal function for setting seccomp filter
* @flags: flags to change filter behavior
@@ -847,6 +1008,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
struct seccomp_filter *prepared = NULL;
long ret = -EINVAL;
+ int listener = 0;
+ struct file *listener_f = NULL;

/* Validate flags. */
if (flags & ~SECCOMP_FILTER_FLAG_MASK)
@@ -857,13 +1020,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
if (IS_ERR(prepared))
return PTR_ERR(prepared);

+ if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
+ listener = get_unused_fd_flags(O_RDWR);
+ if (listener < 0) {
+ ret = listener;
+ goto out_free;
+ }
+
+ listener_f = init_listener(current, prepared);
+ if (IS_ERR(listener_f)) {
+ put_unused_fd(listener);
+ ret = PTR_ERR(listener_f);
+ goto out_free;
+ }
+ }
+
/*
* Make sure we cannot change seccomp or nnp state via TSYNC
* while another thread is in the middle of calling exec.
*/
if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
mutex_lock_killable(&current->signal->cred_guard_mutex))
- goto out_free;
+ goto out_put_fd;

spin_lock_irq(&current->sighand->siglock);

@@ -881,6 +1059,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
spin_unlock_irq(&current->sighand->siglock);
if (flags & SECCOMP_FILTER_FLAG_TSYNC)
mutex_unlock(&current->signal->cred_guard_mutex);
+out_put_fd:
+ if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
+ if (ret < 0) {
+ fput(listener_f);
+ put_unused_fd(listener);
+ } else {
+ fd_install(listener, listener_f);
+ ret = listener;
+ }
+ }
out_free:
seccomp_filter_free(prepared);
return ret;
@@ -909,6 +1097,9 @@ static long seccomp_get_action_avail(const char __user *uaction)
case SECCOMP_RET_LOG:
case SECCOMP_RET_ALLOW:
break;
+ case SECCOMP_RET_USER_NOTIF:
+ if (IS_ENABLED(CONFIG_SECCOMP_USER_NOTIFICATION))
+ break;
default:
return -EOPNOTSUPP;
}
@@ -1105,6 +1296,7 @@ long seccomp_get_metadata(struct task_struct *task,
#define SECCOMP_RET_KILL_THREAD_NAME "kill_thread"
#define SECCOMP_RET_TRAP_NAME "trap"
#define SECCOMP_RET_ERRNO_NAME "errno"
+#define SECCOMP_RET_USER_NOTIF_NAME "user_notif"
#define SECCOMP_RET_TRACE_NAME "trace"
#define SECCOMP_RET_LOG_NAME "log"
#define SECCOMP_RET_ALLOW_NAME "allow"
@@ -1114,6 +1306,7 @@ static const char seccomp_actions_avail[] =
SECCOMP_RET_KILL_THREAD_NAME " "
SECCOMP_RET_TRAP_NAME " "
SECCOMP_RET_ERRNO_NAME " "
+ SECCOMP_RET_USER_NOTIF_NAME " "
SECCOMP_RET_TRACE_NAME " "
SECCOMP_RET_LOG_NAME " "
SECCOMP_RET_ALLOW_NAME;
@@ -1131,6 +1324,7 @@ static const struct seccomp_log_name seccomp_log_names[] = {
{ SECCOMP_LOG_TRACE, SECCOMP_RET_TRACE_NAME },
{ SECCOMP_LOG_LOG, SECCOMP_RET_LOG_NAME },
{ SECCOMP_LOG_ALLOW, SECCOMP_RET_ALLOW_NAME },
+ { SECCOMP_LOG_USER_NOTIF, SECCOMP_RET_USER_NOTIF_NAME },
{ }
};

@@ -1279,3 +1473,203 @@ static int __init seccomp_sysctl_init(void)
device_initcall(seccomp_sysctl_init)

#endif /* CONFIG_SYSCTL */
+
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+static int seccomp_notify_release(struct inode *inode, struct file *file)
+{
+ struct seccomp_filter *filter = file->private_data;
+ struct seccomp_knotif *knotif;
+
+ mutex_lock(&filter->notify_lock);
+
+ /*
+ * If this file is being closed because e.g. the task who owned it
+ * died, let's wake everyone up who was waiting on us.
+ */
+ list_for_each_entry(knotif, &filter->notifications, list) {
+ if (knotif->state == SECCOMP_NOTIFY_REPLIED)
+ continue;
+
+ knotif->state = SECCOMP_NOTIFY_REPLIED;
+ knotif->error = -ENOSYS;
+ knotif->val = 0;
+
+ complete(&knotif->ready);
+ }
+
+ filter->has_listener = false;
+ mutex_unlock(&filter->notify_lock);
+ __put_seccomp_filter(filter);
+ return 0;
+}
+
+static ssize_t seccomp_notify_read(struct file *f, char __user *buf,
+ size_t size, loff_t *ppos)
+{
+ struct seccomp_filter *filter = f->private_data;
+ struct seccomp_knotif *knotif = NULL, *cur;
+ struct seccomp_notif unotif;
+ ssize_t ret;
+
+ /* No offset reads. */
+ if (*ppos != 0)
+ return -EINVAL;
+
+ ret = down_interruptible(&filter->request);
+ if (ret < 0)
+ return ret;
+
+ mutex_lock(&filter->notify_lock);
+ list_for_each_entry(cur, &filter->notifications, list) {
+ if (cur->state == SECCOMP_NOTIFY_INIT) {
+ knotif = cur;
+ break;
+ }
+ }
+
+ /*
+ * If we didn't find a notification, it could be that the task was
+ * interrupted between the time we were woken and when we were able to
+ * acquire the rw lock. Should we retry here or just -ENOENT? -ENOENT
+ * for now.
+ */
+ if (!knotif) {
+ ret = -ENOENT;
+ goto out;
+ }
+
+ unotif.id = knotif->id;
+ unotif.pid = knotif->pid;
+ unotif.data = *(knotif->data);
+
+ size = min_t(size_t, size, sizeof(struct seccomp_notif));
+ if (copy_to_user(buf, &unotif, size)) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ ret = sizeof(unotif);
+ knotif->state = SECCOMP_NOTIFY_SENT;
+
+out:
+ mutex_unlock(&filter->notify_lock);
+ return ret;
+}
+
+static ssize_t seccomp_notify_write(struct file *file, const char __user *buf,
+ size_t size, loff_t *ppos)
+{
+ struct seccomp_filter *filter = file->private_data;
+ struct seccomp_notif_resp resp = {};
+ struct seccomp_knotif *knotif = NULL;
+ ssize_t ret = -EINVAL;
+
+ /* No partial writes. */
+ if (*ppos != 0)
+ return -EINVAL;
+
+ size = min_t(size_t, size, sizeof(resp));
+ if (copy_from_user(&resp, buf, size))
+ return -EFAULT;
+
+ ret = mutex_lock_interruptible(&filter->notify_lock);
+ if (ret < 0)
+ return ret;
+
+ list_for_each_entry(knotif, &filter->notifications, list) {
+ if (knotif->id == resp.id)
+ break;
+ }
+
+ if (!knotif || knotif->id != resp.id) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /* Allow exactly one reply. */
+ if (knotif->state != SECCOMP_NOTIFY_SENT) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ ret = size;
+ knotif->state = SECCOMP_NOTIFY_REPLIED;
+ knotif->error = resp.error;
+ knotif->val = resp.val;
+ complete(&knotif->ready);
+out:
+ mutex_unlock(&filter->notify_lock);
+ return ret;
+}
+
+static __poll_t seccomp_notify_poll(struct file *file,
+ struct poll_table_struct *poll_tab)
+{
+ struct seccomp_filter *filter = file->private_data;
+ __poll_t ret = 0;
+ struct seccomp_knotif *cur;
+
+ ret = mutex_lock_interruptible(&filter->notify_lock);
+ if (ret < 0)
+ return ret;
+
+ list_for_each_entry(cur, &filter->notifications, list) {
+ if (cur->state == SECCOMP_NOTIFY_INIT)
+ ret |= EPOLLIN | EPOLLRDNORM;
+ if (cur->state == SECCOMP_NOTIFY_SENT)
+ ret |= EPOLLOUT | EPOLLWRNORM;
+ }
+
+ mutex_unlock(&filter->notify_lock);
+
+ return ret;
+}
+
+static const struct file_operations seccomp_notify_ops = {
+ .read = seccomp_notify_read,
+ .write = seccomp_notify_write,
+ .poll = seccomp_notify_poll,
+ .release = seccomp_notify_release,
+};
+
+static struct file *init_listener(struct task_struct *task,
+ struct seccomp_filter *filter)
+{
+ struct file *ret = ERR_PTR(-EBUSY);
+ struct seccomp_filter *cur;
+ bool have_listener = false;
+ int filter_nesting = 0;
+
+ for (cur = task->seccomp.filter; cur; cur = cur->prev) {
+ mutex_lock_nested(&cur->notify_lock, filter_nesting);
+ filter_nesting++;
+ if (cur->has_listener)
+ have_listener = true;
+ }
+
+ if (have_listener)
+ goto out;
+
+ ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
+ filter, O_RDWR);
+ if (IS_ERR(ret))
+ goto out;
+
+
+ /* The file has a reference to it now */
+ __get_seccomp_filter(filter);
+ filter->has_listener = true;
+
+out:
+ for (cur = task->seccomp.filter; cur; cur = cur->prev)
+ mutex_unlock(&cur->notify_lock);
+
+ return ret;
+}
+#else
+static struct file *init_listener(struct task_struct *task,
+ struct seccomp_filter *filter)
+{
+ return ERR_PTR(-EINVAL);
+}
+#endif
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 168c66d74fc5..f439bd3fb15f 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -40,10 +40,12 @@
#include <sys/fcntl.h>
#include <sys/mman.h>
#include <sys/times.h>
+#include <sys/socket.h>

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
+#include <poll.h>

#include "../kselftest_harness.h"

@@ -150,6 +152,24 @@ struct seccomp_metadata {
};
#endif

+#ifndef SECCOMP_FILTER_FLAG_GET_LISTENER
+#define SECCOMP_FILTER_FLAG_GET_LISTENER 4
+
+#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
+
+struct seccomp_notif {
+ __u64 id;
+ pid_t pid;
+ struct seccomp_data data;
+};
+
+struct seccomp_notif_resp {
+ __u64 id;
+ __s32 error;
+ __s64 val;
+};
+#endif
+
#ifndef seccomp
int seccomp(unsigned int op, unsigned int flags, void *args)
{
@@ -2072,7 +2092,8 @@ TEST(seccomp_syscall_mode_lock)
TEST(detect_seccomp_filter_flags)
{
unsigned int flags[] = { SECCOMP_FILTER_FLAG_TSYNC,
- SECCOMP_FILTER_FLAG_LOG };
+ SECCOMP_FILTER_FLAG_LOG,
+ SECCOMP_FILTER_FLAG_GET_LISTENER };
unsigned int flag, all_flags;
int i;
long ret;
@@ -2917,6 +2938,178 @@ TEST(get_metadata)
ASSERT_EQ(0, kill(pid, SIGKILL));
}

+static int user_trap_syscall(int nr, unsigned int flags)
+{
+ struct sock_filter filter[] = {
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
+ offsetof(struct seccomp_data, nr)),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+ };
+
+ struct sock_fprog prog = {
+ .len = (unsigned short)ARRAY_SIZE(filter),
+ .filter = filter,
+ };
+
+ return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
+}
+
+static int read_notif(int listener, struct seccomp_notif *req)
+{
+ int ret;
+
+ do {
+ errno = 0;
+ ret = read(listener, req, sizeof(*req));
+ } while (ret == -1 && errno == ENOENT);
+ return ret;
+}
+
+static void signal_handler(int signal)
+{
+}
+
+#define USER_NOTIF_MAGIC 116983961184613L
+TEST(get_user_notification_syscall)
+{
+ pid_t pid;
+ long ret;
+ int status, listener;
+ struct seccomp_notif req = {};
+ struct seccomp_notif_resp resp = {};
+ struct pollfd pollfd;
+
+ struct sock_filter filter[] = {
+ BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+ };
+ struct sock_fprog prog = {
+ .len = (unsigned short)ARRAY_SIZE(filter),
+ .filter = filter,
+ };
+
+ pid = fork();
+ ASSERT_GE(pid, 0);
+
+ /* Check that we get -ENOSYS with no listener attached */
+ if (pid == 0) {
+ if (user_trap_syscall(__NR_getpid, 0) < 0)
+ exit(1);
+ ret = syscall(__NR_getpid);
+ exit(ret >= 0 || errno != ENOSYS);
+ }
+
+ EXPECT_EQ(waitpid(pid, &status, 0), pid);
+ EXPECT_EQ(true, WIFEXITED(status));
+ EXPECT_EQ(0, WEXITSTATUS(status));
+
+ /* Add some no-op filters so that we (don't) trigger lockdep. */
+ EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+ EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+ EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+ EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+
+ /* Check that the basic notification machinery works */
+ listener = user_trap_syscall(__NR_getpid,
+ SECCOMP_FILTER_FLAG_GET_LISTENER);
+ EXPECT_GE(listener, 0);
+
+ /* Installing a second listener in the chain should EBUSY */
+ EXPECT_EQ(user_trap_syscall(__NR_getpid,
+ SECCOMP_FILTER_FLAG_GET_LISTENER),
+ -1);
+ EXPECT_EQ(errno, EBUSY);
+
+ pid = fork();
+ ASSERT_GE(pid, 0);
+
+ if (pid == 0) {
+ ret = syscall(__NR_getpid);
+ exit(ret != USER_NOTIF_MAGIC);
+ }
+
+ EXPECT_EQ(read(listener, &req, sizeof(req)), sizeof(req));
+
+ pollfd.fd = listener;
+ pollfd.events = POLLIN | POLLOUT;
+
+ EXPECT_GT(poll(&pollfd, 1, -1), 0);
+ EXPECT_EQ(pollfd.revents, POLLOUT);
+
+ EXPECT_EQ(req.data.nr, __NR_getpid);
+
+ resp.id = req.id;
+ resp.error = 0;
+ resp.val = USER_NOTIF_MAGIC;
+
+ EXPECT_EQ(write(listener, &resp, sizeof(resp)), sizeof(resp));
+
+ EXPECT_EQ(waitpid(pid, &status, 0), pid);
+ EXPECT_EQ(true, WIFEXITED(status));
+ EXPECT_EQ(0, WEXITSTATUS(status));
+
+ /*
+ * Check that nothing bad happens when we kill the task in the middle
+ * of a syscall.
+ */
+ pid = fork();
+ ASSERT_GE(pid, 0);
+
+ if (pid == 0) {
+ ret = syscall(__NR_getpid);
+ exit(ret != USER_NOTIF_MAGIC);
+ }
+
+ ret = read(listener, &req, sizeof(req));
+ EXPECT_EQ(ret, sizeof(req));
+
+ EXPECT_EQ(kill(pid, SIGKILL), 0);
+ EXPECT_EQ(waitpid(pid, NULL, 0), pid);
+
+ resp.id = req.id;
+ ret = write(listener, &resp, sizeof(resp));
+ EXPECT_EQ(ret, -1);
+ EXPECT_EQ(errno, EINVAL);
+
+ /*
+ * Check that we get another notification about a signal in the middle
+ * of a syscall.
+ */
+ pid = fork();
+ ASSERT_GE(pid, 0);
+
+ if (pid == 0) {
+ if (signal(SIGUSR1, signal_handler) == SIG_ERR) {
+ perror("signal");
+ exit(1);
+ }
+ ret = syscall(__NR_getpid);
+ exit(ret != USER_NOTIF_MAGIC);
+ }
+
+ ret = read_notif(listener, &req);
+ EXPECT_EQ(ret, sizeof(req));
+ EXPECT_EQ(errno, 0);
+
+ EXPECT_EQ(kill(pid, SIGUSR1), 0);
+
+ ret = read_notif(listener, &req);
+ EXPECT_EQ(ret, sizeof(req));
+ EXPECT_EQ(errno, 0);
+
+ resp.id = req.id;
+ ret = write(listener, &resp, sizeof(resp));
+ EXPECT_EQ(ret, sizeof(resp));
+ EXPECT_EQ(errno, 0);
+
+ EXPECT_EQ(waitpid(pid, &status, 0), pid);
+ EXPECT_EQ(true, WIFEXITED(status));
+ EXPECT_EQ(0, WEXITSTATUS(status));
+
+ close(listener);
+}
+
/*
* TODO:
* - add microbenchmarks
--
2.17.0


2018-05-31 14:53:44

by Tycho Andersen

[permalink] [raw]
Subject: [PATCH v3 3/4] seccomp: add a way to get a listener fd from ptrace

As an alternative to SECCOMP_FILTER_FLAG_GET_LISTENER, perhaps a ptrace()
version which can acquire filters is useful. There are at least two reasons
this is preferable, even though it uses ptrace:

1. You can control tasks that aren't cooperating with you
2. You can control tasks whose filters block sendmsg() and socket(); if the
task installs a filter which blocks these calls, there's no way with
SECCOMP_FILTER_FLAG_GET_LISTENER to get the fd out to the privileged task.

v2: fix a bug where listener mode was not unset when an unused fd was not
available
v3: fix refcounting bug (Oleg)

Signed-off-by: Tycho Andersen <[email protected]>
CC: Kees Cook <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Oleg Nesterov <[email protected]>
CC: Eric W. Biederman <[email protected]>
CC: "Serge E. Hallyn" <[email protected]>
CC: Christian Brauner <[email protected]>
CC: Tyler Hicks <[email protected]>
CC: Akihiro Suda <[email protected]>
---
include/linux/seccomp.h | 11 ++++
include/uapi/linux/ptrace.h | 2 +
kernel/ptrace.c | 4 ++
kernel/seccomp.c | 28 ++++++++
tools/testing/selftests/seccomp/seccomp_bpf.c | 66 +++++++++++++++++++
5 files changed, 111 insertions(+)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 0fd3e0676a1c..10e684899b7b 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -111,4 +111,15 @@ static inline long seccomp_get_metadata(struct task_struct *task,
return -EINVAL;
}
#endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */
+
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+extern long seccomp_get_listener(struct task_struct *task,
+ unsigned long filter_off);
+#else
+static inline long seccomp_get_listener(struct task_struct *task,
+ unsigned long filter_off)
+{
+ return -EINVAL;
+}
+#endif/* CONFIG_SECCOMP_USER_NOTIFICATION */
#endif /* _LINUX_SECCOMP_H */
diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
index d5a1b8a492b9..dc0abf81de3b 100644
--- a/include/uapi/linux/ptrace.h
+++ b/include/uapi/linux/ptrace.h
@@ -73,6 +73,8 @@ struct seccomp_metadata {
__u64 flags; /* Output: filter's flags */
};

+#define PTRACE_SECCOMP_GET_LISTENER 0x420e
+
/* Read signals from a shared (process wide) queue */
#define PTRACE_PEEKSIGINFO_SHARED (1 << 0)

diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 21fec73d45d4..fcbdb6f4dc07 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -1096,6 +1096,10 @@ int ptrace_request(struct task_struct *child, long request,
ret = seccomp_get_metadata(child, addr, datavp);
break;

+ case PTRACE_SECCOMP_GET_LISTENER:
+ ret = seccomp_get_listener(child, addr);
+ break;
+
default:
break;
}
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 4a6d6bd80d07..6dc99c65c2f4 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1668,6 +1668,34 @@ static struct file *init_listener(struct task_struct *task,

return ret;
}
+
+long seccomp_get_listener(struct task_struct *task,
+ unsigned long filter_off)
+{
+ struct seccomp_filter *filter;
+ struct file *listener;
+ int fd;
+
+ filter = get_nth_filter(task, filter_off);
+ if (IS_ERR(filter))
+ return PTR_ERR(filter);
+
+ fd = get_unused_fd_flags(O_RDWR);
+ if (fd < 0) {
+ __put_seccomp_filter(filter);
+ return fd;
+ }
+
+ listener = init_listener(task, task->seccomp.filter);
+ __put_seccomp_filter(filter);
+ if (IS_ERR(listener)) {
+ put_unused_fd(fd);
+ return PTR_ERR(listener);
+ }
+
+ fd_install(fd, listener);
+ return fd;
+}
#else
static struct file *init_listener(struct task_struct *task,
struct seccomp_filter *filter)
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index f439bd3fb15f..07984f7bd601 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -178,6 +178,10 @@ int seccomp(unsigned int op, unsigned int flags, void *args)
}
#endif

+#ifndef PTRACE_SECCOMP_GET_LISTENER
+#define PTRACE_SECCOMP_GET_LISTENER 0x420e
+#endif
+
#if __BYTE_ORDER == __LITTLE_ENDIAN
#define syscall_arg(_n) (offsetof(struct seccomp_data, args[_n]))
#elif __BYTE_ORDER == __BIG_ENDIAN
@@ -3110,6 +3114,68 @@ TEST(get_user_notification_syscall)
close(listener);
}

+TEST(get_user_notification_ptrace)
+{
+ pid_t pid;
+ int status, listener;
+ int sk_pair[2];
+ char c;
+ struct seccomp_notif req = {};
+ struct seccomp_notif_resp resp = {};
+
+ ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+
+ pid = fork();
+ ASSERT_GE(pid, 0);
+
+ if (pid == 0) {
+ EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
+
+ /* Test that we get ENOSYS while not attached */
+ EXPECT_EQ(syscall(__NR_getpid), -1);
+ EXPECT_EQ(errno, ENOSYS);
+
+ /* Signal we're ready and have installed the filter. */
+ EXPECT_EQ(write(sk_pair[1], "J", 1), 1);
+
+ EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
+ EXPECT_EQ(c, 'H');
+
+ exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC);
+ }
+
+ EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
+ EXPECT_EQ(c, 'J');
+
+ EXPECT_EQ(ptrace(PTRACE_ATTACH, pid), 0);
+ EXPECT_EQ(waitpid(pid, NULL, 0), pid);
+ listener = ptrace(PTRACE_SECCOMP_GET_LISTENER, pid, 0);
+ EXPECT_GE(listener, 0);
+
+ /* EBUSY for second listener */
+ EXPECT_EQ(ptrace(PTRACE_SECCOMP_GET_LISTENER, pid, 0), -1);
+ EXPECT_EQ(errno, EBUSY);
+
+ EXPECT_EQ(ptrace(PTRACE_DETACH, pid, NULL, 0), 0);
+
+ /* Now signal we are done and respond with magic */
+ EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
+
+ EXPECT_EQ(read(listener, &req, sizeof(req)), sizeof(req));
+
+ resp.id = req.id;
+ resp.error = 0;
+ resp.val = USER_NOTIF_MAGIC;
+
+ EXPECT_EQ(write(listener, &resp, sizeof(resp)), sizeof(resp));
+
+ EXPECT_EQ(waitpid(pid, &status, 0), pid);
+ EXPECT_EQ(true, WIFEXITED(status));
+ EXPECT_EQ(0, WEXITSTATUS(status));
+
+ close(listener);
+}
+
/*
* TODO:
* - add microbenchmarks
--
2.17.0


2018-06-02 13:14:40

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] seccomp: add support for passing fds via USER_NOTIF

On Sat, Jun 2, 2018 at 2:58 PM Tycho Andersen <[email protected]> wrote:
> The idea here is that the userspace handler should be able to pass an fd
> back to the trapped task, for example so it can be returned from socket().
>
> I've proposed one API here, but I'm open to other options. In particular,
> this only lets you return an fd from a syscall, which may not be enough in
> all cases. For example, if an fd is written to an output parameter instead
> of returned, the current API can't handle this. Another case is that
> netlink takes as input fds sometimes (IFLA_NET_NS_FD, e.g.). If netlink
> ever decides to install an fd and output it, we wouldn't be able to handle
> this either.
>
> Still, the vast majority of interesting cases are covered by this API, so
> perhaps it is Enough.
>
> I've left it as a separate commit for two reasons:
> * It illustrates the way in which we would grow struct seccomp_notif and
> struct seccomp_notif_resp without using netlink
> * It shows just how little code is needed to accomplish this :)
[...]
> + fd = get_unused_fd_flags(n.flags);

Here, you're using n.flags in a context where it will be tested
against O_CLOEXEC to determine whether the new fd should be
close-on-exec.

[...]
> + /*
> + * This is a little hokey: we need a real fget() (i.e. not
> + * __fget_light(), which is what fdget does), but we also need
> + * the flags from strcut fd. So, we get it, put it, and get it
> + * again for real.
> + */
> + fd = fdget(resp.fd);
> + knotif->flags = fd.flags;
> + fdput(fd);
> +
> + knotif->file = fget(resp.fd);
> + if (!knotif->file) {
> + ret = -EBADF;
> + goto out;
> + }

But here fd.flags contains the low 2 bits of the return value of
__fget_light, which are either 0 or FDPUT_FPUT (encoded as 1). This
flag states whether fdget() took a reference on the file, which is
mostly equivalent to "is the current process multithreaded?". (This is
the reason why fdget returns flags and fget doesn't - the flag from
fdget is to decide whether you'll need an fput(), which is
unconditional for fget().)

Apart from this issue, I think that in general, it's probably not a
good idea to copy the close-on-exec flag from the fd in the
supervising process - the supervising process might want all the fds
it is working with to be O_CLOEXEC independent of whether the
supervised process wants an O_CLOEXEC fd. It might make sense to add a
field for this to struct seccomp_notif_resp instead.

2018-06-02 18:19:10

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] seccomp: add support for passing fds via USER_NOTIF

Hi Jann,

Thanks for taking a look!

On Sat, Jun 02, 2018 at 03:13:39PM +0200, Jann Horn wrote:
> On Sat, Jun 2, 2018 at 2:58 PM Tycho Andersen <[email protected]> wrote:
> > The idea here is that the userspace handler should be able to pass an fd
> > back to the trapped task, for example so it can be returned from socket().
> >
> > I've proposed one API here, but I'm open to other options. In particular,
> > this only lets you return an fd from a syscall, which may not be enough in
> > all cases. For example, if an fd is written to an output parameter instead
> > of returned, the current API can't handle this. Another case is that
> > netlink takes as input fds sometimes (IFLA_NET_NS_FD, e.g.). If netlink
> > ever decides to install an fd and output it, we wouldn't be able to handle
> > this either.
> >
> > Still, the vast majority of interesting cases are covered by this API, so
> > perhaps it is Enough.
> >
> > I've left it as a separate commit for two reasons:
> > * It illustrates the way in which we would grow struct seccomp_notif and
> > struct seccomp_notif_resp without using netlink
> > * It shows just how little code is needed to accomplish this :)
> [...]
> > + fd = get_unused_fd_flags(n.flags);
>
> Here, you're using n.flags in a context where it will be tested
> against O_CLOEXEC to determine whether the new fd should be
> close-on-exec.
>
> [...]
> > + /*
> > + * This is a little hokey: we need a real fget() (i.e. not
> > + * __fget_light(), which is what fdget does), but we also need
> > + * the flags from strcut fd. So, we get it, put it, and get it
> > + * again for real.
> > + */
> > + fd = fdget(resp.fd);
> > + knotif->flags = fd.flags;
> > + fdput(fd);
> > +
> > + knotif->file = fget(resp.fd);
> > + if (!knotif->file) {
> > + ret = -EBADF;
> > + goto out;
> > + }
>
> But here fd.flags contains the low 2 bits of the return value of
> __fget_light, which are either 0 or FDPUT_FPUT (encoded as 1). This
> flag states whether fdget() took a reference on the file, which is
> mostly equivalent to "is the current process multithreaded?". (This is
> the reason why fdget returns flags and fget doesn't - the flag from
> fdget is to decide whether you'll need an fput(), which is
> unconditional for fget().)

Oof, yes.

> Apart from this issue, I think that in general, it's probably not a
> good idea to copy the close-on-exec flag from the fd in the
> supervising process - the supervising process might want all the fds
> it is working with to be O_CLOEXEC independent of whether the
> supervised process wants an O_CLOEXEC fd. It might make sense to add a
> field for this to struct seccomp_notif_resp instead.

Yes, I wondered about this. In particular, maybe it just makes sense
to pass back the exact flags that the FD should be opened with too, so
if in the future there's some other flag we might want to twiddle, we
don't need another patch. I'll make the change for v4.

Thanks!

Tycho

2018-06-02 19:15:03

by Alban Crequy

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] seccomp: add support for passing fds via USER_NOTIF

On Thu, 31 May 2018 at 16:52, Tycho Andersen <[email protected]> wrote:
>
> The idea here is that the userspace handler should be able to pass an fd
> back to the trapped task, for example so it can be returned from socket().
>
> I've proposed one API here, but I'm open to other options. In particular,
> this only lets you return an fd from a syscall, which may not be enough in
> all cases. For example, if an fd is written to an output parameter instead
> of returned, the current API can't handle this. Another case is that
> netlink takes as input fds sometimes (IFLA_NET_NS_FD, e.g.). If netlink
> ever decides to install an fd and output it, we wouldn't be able to handle
> this either.
>
> Still, the vast majority of interesting cases are covered by this API, so
> perhaps it is Enough.
>
> I've left it as a separate commit for two reasons:
> * It illustrates the way in which we would grow struct seccomp_notif and
> struct seccomp_notif_resp without using netlink
> * It shows just how little code is needed to accomplish this :)
>
> v2: new in v2
> v3: no changes
>
> Signed-off-by: Tycho Andersen <[email protected]>
> CC: Kees Cook <[email protected]>
> CC: Andy Lutomirski <[email protected]>
> CC: Oleg Nesterov <[email protected]>
> CC: Eric W. Biederman <[email protected]>
> CC: "Serge E. Hallyn" <[email protected]>
> CC: Christian Brauner <[email protected]>
> CC: Tyler Hicks <[email protected]>
> CC: Akihiro Suda <[email protected]>
> ---
> include/uapi/linux/seccomp.h | 2 +
> kernel/seccomp.c | 49 +++++++-
> tools/testing/selftests/seccomp/seccomp_bpf.c | 112 ++++++++++++++++++
> 3 files changed, 161 insertions(+), 2 deletions(-)
>
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index 8160e6cad528..3124427219cb 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -71,6 +71,8 @@ struct seccomp_notif_resp {
> __u64 id;
> __s32 error;
> __s64 val;
> + __u8 return_fd;
> + __u32 fd;
> };
>
> #endif /* _UAPI_LINUX_SECCOMP_H */
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 6dc99c65c2f4..2ee958b3efde 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -77,6 +77,8 @@ struct seccomp_knotif {
> /* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
> int error;
> long val;
> + struct file *file;
> + unsigned int flags;
>
> /* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
> struct completion ready;
> @@ -780,10 +782,32 @@ static void seccomp_do_user_notification(int this_syscall,
> goto remove_list;
> }
>
> - ret = n.val;
> - err = n.error;
> + if (n.file) {
> + int fd;
> +
> + fd = get_unused_fd_flags(n.flags);
> + if (fd < 0) {
> + err = fd;
> + ret = -1;
> + goto remove_list;
> + }
> +
> + ret = fd;
> + err = 0;
> +
> + fd_install(fd, n.file);
> + /* Don't fput, since fd has a reference now */
> + n.file = NULL;

Do we want the cgroup classid and netprio to be applied here, before
the fd_install? I am looking at similar code in net/core/scm.c
scm_detach_fds():
sock = sock_from_file(fp[i], &err);
if (sock) {
sock_update_netprioidx(&sock->sk->sk_cgrp_data);
sock_update_classid(&sock->sk->sk_cgrp_data);
}

The listener process might live in a different cgroup with a different
classid & netprio, so it might not be applied as the app might expect.

Also, should we update the struct sock_cgroup_data of the socket, in
order to make the BPF helper function bpf_skb_under_cgroup() work wrt
the cgroup of the target process instead of the listener process? I am
looking at cgroup_sk_alloc(). I don't know what's the correct
behaviour we want here.

> + } else {
> + ret = n.val;
> + err = n.error;
> + }
> +
>
> remove_list:
> + if (n.file)
> + fput(n.file);
> +
> list_del(&n.list);
> out:
> mutex_unlock(&match->notify_lock);
> @@ -1598,6 +1622,27 @@ static ssize_t seccomp_notify_write(struct file *file, const char __user *buf,
> knotif->state = SECCOMP_NOTIFY_REPLIED;
> knotif->error = resp.error;
> knotif->val = resp.val;
> +
> + if (resp.return_fd) {
> + struct fd fd;
> +
> + /*
> + * This is a little hokey: we need a real fget() (i.e. not
> + * __fget_light(), which is what fdget does), but we also need
> + * the flags from strcut fd. So, we get it, put it, and get it
> + * again for real.
> + */
> + fd = fdget(resp.fd);
> + knotif->flags = fd.flags;
> + fdput(fd);
> +
> + knotif->file = fget(resp.fd);
> + if (!knotif->file) {
> + ret = -EBADF;
> + goto out;

Should the "knotif->state = SECCOMP_NOTIFY_REPLIED" and other changes
be done after the error case here? In case of bad fd, it seems to
return -EBADF the first time and -EINVAL the next time because the
state would have been changed to SECCOMP_NOTIFY_REPLIED already.

> + }
> + }
> +
> complete(&knotif->ready);
> out:
> mutex_unlock(&filter->notify_lock);
> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index 07984f7bd601..b9a4f676566d 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -167,6 +167,8 @@ struct seccomp_notif_resp {
> __u64 id;
> __s32 error;
> __s64 val;
> + __u8 return_fd;
> + __u32 fd;
> };
> #endif
>
> @@ -3176,6 +3178,116 @@ TEST(get_user_notification_ptrace)
> close(listener);
> }
>
> +TEST(user_notification_pass_fd)
> +{
> + pid_t pid;
> + int status, listener;
> + int sk_pair[2];
> + char c;
> + struct seccomp_notif req = {};
> + struct seccomp_notif_resp resp = {};
> + long ret;
> +
> + ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
> +
> + pid = fork();
> + ASSERT_GE(pid, 0);
> +
> + if (pid == 0) {
> + int fd;
> + char buf[16];
> +
> + EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
> +
> + /* Signal we're ready and have installed the filter. */
> + EXPECT_EQ(write(sk_pair[1], "J", 1), 1);
> +
> + EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
> + EXPECT_EQ(c, 'H');
> + close(sk_pair[1]);
> +
> + /* An fd from getpid(). Let the games begin. */
> + fd = syscall(__NR_getpid);
> + EXPECT_GT(fd, 0);
> + EXPECT_EQ(read(fd, buf, sizeof(buf)), 12);
> + close(fd);
> +
> + exit(strcmp("hello world", buf));
> + }
> +
> + EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
> + EXPECT_EQ(c, 'J');
> +
> + EXPECT_EQ(ptrace(PTRACE_ATTACH, pid), 0);
> + EXPECT_EQ(waitpid(pid, NULL, 0), pid);
> + listener = ptrace(PTRACE_SECCOMP_GET_LISTENER, pid, 0);
> + EXPECT_GE(listener, 0);
> + EXPECT_EQ(ptrace(PTRACE_DETACH, pid, NULL, 0), 0);
> +
> + /* Now signal we are done installing so it can do a getpid */
> + EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
> + close(sk_pair[0]);
> +
> + /* Make a new socket pair so we can send half across */
> + EXPECT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
> +
> + ret = read_notif(listener, &req);
> + EXPECT_EQ(ret, sizeof(req));
> + EXPECT_EQ(errno, 0);
> +
> + resp.id = req.id;
> + resp.return_fd = 1;
> + resp.fd = sk_pair[1];
> + EXPECT_EQ(write(listener, &resp, sizeof(resp)), sizeof(resp));
> + close(sk_pair[1]);
> +
> + EXPECT_EQ(write(sk_pair[0], "hello world\0", 12), 12);
> + close(sk_pair[0]);
> +
> + EXPECT_EQ(waitpid(pid, &status, 0), pid);
> + EXPECT_EQ(true, WIFEXITED(status));
> + EXPECT_EQ(0, WEXITSTATUS(status));
> + close(listener);
> +}
> +
> +TEST(user_notification_struct_size_mismatch)
> +{
> + pid_t pid;
> + long ret;
> + int status, listener, len;
> + struct seccomp_notif req;
> + struct seccomp_notif_resp resp;
> +
> + listener = user_trap_syscall(__NR_getpid,
> + SECCOMP_FILTER_FLAG_GET_LISTENER);
> + EXPECT_GE(listener, 0);
> +
> + pid = fork();
> + ASSERT_GE(pid, 0);
> +
> + if (pid == 0) {
> + ret = syscall(__NR_getpid);
> + exit(ret != USER_NOTIF_MAGIC);
> + }
> +
> + EXPECT_EQ(read(listener, &req, sizeof(req)), sizeof(req));
> +
> + resp.id = req.id;
> + resp.error = 0;
> + resp.val = USER_NOTIF_MAGIC;
> +
> + /*
> + * Only write a partial structure: this is what was available before we
> + * had fd support.
> + */
> + len = offsetof(struct seccomp_notif_resp, val) + sizeof(resp.val);
> + EXPECT_EQ(write(listener, &resp, len), len);
> +
> + EXPECT_EQ(waitpid(pid, &status, 0), pid);
> + EXPECT_EQ(true, WIFEXITED(status));
> + EXPECT_EQ(0, WEXITSTATUS(status));
> +}
> +
> /*
> * TODO:
> * - add microbenchmarks
> --
> 2.17.0
>

2018-06-03 18:42:20

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH v3 1/4] seccomp: add a return code to trap to userspace

On Sun, Jun 3, 2018 at 2:29 PM Tycho Andersen <[email protected]> wrote:
>
> This patch introduces a means for syscalls matched in seccomp to notify
> some other task that a particular filter has been triggered.
>
> The motivation for this is primarily for use with containers. For example,
> if a container does an init_module(), we obviously don't want to load this
> untrusted code, which may be compiled for the wrong version of the kernel
> anyway. Instead, we could parse the module image, figure out which module
> the container is trying to load and load it on the host.
>
> As another example, containers cannot mknod(), since this checks
> capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> coding some whitelist in the kernel. Another example is mount(), which has
> many security restrictions for good reason, but configuration or runtime
> knowledge could potentially be used to relax these restrictions.
>
> This patch adds functionality that is already possible via at least two
> other means that I know about, both of which involve ptrace(): first, one
> could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> Unfortunately this is slow, so a faster version would be to install a
> filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> Since ptrace allows only one tracer, if the container runtime is that
> tracer, users inside the container (or outside) trying to debug it will not
> be able to use ptrace, which is annoying. It also means that older
> distributions based on Upstart cannot boot inside containers using ptrace,
> since upstart itself uses ptrace to start services.
>
> The actual implementation of this is fairly small, although getting the
> synchronization right was/is slightly complex.
>
> Finally, it's worth noting that the classic seccomp TOCTOU of reading
> memory data from the task still applies here, but can be avoided with
> careful design of the userspace handler: if the userspace handler reads all
> of the task memory that is necessary before applying its security policy,
> the tracee's subsequent memory edits will not be read by the tracer.
[...]
> @@ -857,13 +1020,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
> if (IS_ERR(prepared))
> return PTR_ERR(prepared);
>
> + if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> + listener = get_unused_fd_flags(O_RDWR);

I think you want either 0 or O_CLOEXEC here?

> +out_put_fd:
> + if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> + if (ret < 0) {
> + fput(listener_f);
> + put_unused_fd(listener);
> + } else {
> + fd_install(listener, listener_f);
> + ret = listener;
> + }
> + }
> out_free:
> seccomp_filter_free(prepared);
> return ret;
[...]
> +static __poll_t seccomp_notify_poll(struct file *file,
> + struct poll_table_struct *poll_tab)
> +{
> + struct seccomp_filter *filter = file->private_data;
> + __poll_t ret = 0;
> + struct seccomp_knotif *cur;
> +
> + ret = mutex_lock_interruptible(&filter->notify_lock);
> + if (ret < 0)
> + return ret;
> +
> + list_for_each_entry(cur, &filter->notifications, list) {
> + if (cur->state == SECCOMP_NOTIFY_INIT)
> + ret |= EPOLLIN | EPOLLRDNORM;
> + if (cur->state == SECCOMP_NOTIFY_SENT)
> + ret |= EPOLLOUT | EPOLLWRNORM;
> + }
> +
> + mutex_unlock(&filter->notify_lock);
> +
> + return ret;
> +}

I don't think f_op->poll handlers work like this. AFAIK you're
supposed to use something like poll_wait() to connect the caller to
something like a waitqueue head, so that as soon as the file becomes
ready for reading/writing, any waiting task is notified. See
eventfd_poll() in fs/eventfd.c for a simple example. AFAICS in the
current code, seccomp_notify_poll() only works if an event is pending
at the time seccomp_notify_poll() is called.

2018-06-04 00:16:49

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] seccomp: add support for passing fds via USER_NOTIF

Hi Alban,

On Sat, Jun 02, 2018 at 09:14:09PM +0200, Alban Crequy wrote:
> On Thu, 31 May 2018 at 16:52, Tycho Andersen <[email protected]> wrote:
> >
> > The idea here is that the userspace handler should be able to pass an fd
> > back to the trapped task, for example so it can be returned from socket().
> >
> > I've proposed one API here, but I'm open to other options. In particular,
> > this only lets you return an fd from a syscall, which may not be enough in
> > all cases. For example, if an fd is written to an output parameter instead
> > of returned, the current API can't handle this. Another case is that
> > netlink takes as input fds sometimes (IFLA_NET_NS_FD, e.g.). If netlink
> > ever decides to install an fd and output it, we wouldn't be able to handle
> > this either.
> >
> > Still, the vast majority of interesting cases are covered by this API, so
> > perhaps it is Enough.
> >
> > I've left it as a separate commit for two reasons:
> > * It illustrates the way in which we would grow struct seccomp_notif and
> > struct seccomp_notif_resp without using netlink
> > * It shows just how little code is needed to accomplish this :)
> >
> > v2: new in v2
> > v3: no changes
> >
> > Signed-off-by: Tycho Andersen <[email protected]>
> > CC: Kees Cook <[email protected]>
> > CC: Andy Lutomirski <[email protected]>
> > CC: Oleg Nesterov <[email protected]>
> > CC: Eric W. Biederman <[email protected]>
> > CC: "Serge E. Hallyn" <[email protected]>
> > CC: Christian Brauner <[email protected]>
> > CC: Tyler Hicks <[email protected]>
> > CC: Akihiro Suda <[email protected]>
> > ---
> > include/uapi/linux/seccomp.h | 2 +
> > kernel/seccomp.c | 49 +++++++-
> > tools/testing/selftests/seccomp/seccomp_bpf.c | 112 ++++++++++++++++++
> > 3 files changed, 161 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> > index 8160e6cad528..3124427219cb 100644
> > --- a/include/uapi/linux/seccomp.h
> > +++ b/include/uapi/linux/seccomp.h
> > @@ -71,6 +71,8 @@ struct seccomp_notif_resp {
> > __u64 id;
> > __s32 error;
> > __s64 val;
> > + __u8 return_fd;
> > + __u32 fd;
> > };
> >
> > #endif /* _UAPI_LINUX_SECCOMP_H */
> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > index 6dc99c65c2f4..2ee958b3efde 100644
> > --- a/kernel/seccomp.c
> > +++ b/kernel/seccomp.c
> > @@ -77,6 +77,8 @@ struct seccomp_knotif {
> > /* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
> > int error;
> > long val;
> > + struct file *file;
> > + unsigned int flags;
> >
> > /* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
> > struct completion ready;
> > @@ -780,10 +782,32 @@ static void seccomp_do_user_notification(int this_syscall,
> > goto remove_list;
> > }
> >
> > - ret = n.val;
> > - err = n.error;
> > + if (n.file) {
> > + int fd;
> > +
> > + fd = get_unused_fd_flags(n.flags);
> > + if (fd < 0) {
> > + err = fd;
> > + ret = -1;
> > + goto remove_list;
> > + }
> > +
> > + ret = fd;
> > + err = 0;
> > +
> > + fd_install(fd, n.file);
> > + /* Don't fput, since fd has a reference now */
> > + n.file = NULL;
>
> Do we want the cgroup classid and netprio to be applied here, before
> the fd_install? I am looking at similar code in net/core/scm.c
> scm_detach_fds():
> sock = sock_from_file(fp[i], &err);
> if (sock) {
> sock_update_netprioidx(&sock->sk->sk_cgrp_data);
> sock_update_classid(&sock->sk->sk_cgrp_data);
> }
>
> The listener process might live in a different cgroup with a different
> classid & netprio, so it might not be applied as the app might expect.

Thanks, I hadn't really thought about this. I think doing what
SCM_RIGHTS does makes sense -- the operation is essentially the same.

> Also, should we update the struct sock_cgroup_data of the socket, in
> order to make the BPF helper function bpf_skb_under_cgroup() work wrt
> the cgroup of the target process instead of the listener process? I am
> looking at cgroup_sk_alloc(). I don't know what's the correct
> behaviour we want here.

SCM_RIGHTS seems to omit this (I assume you mean the val field of
struct sock_cgroup_data, which seems to be a pointer to struct
cgroup*), any idea why?

> > + } else {
> > + ret = n.val;
> > + err = n.error;
> > + }
> > +
> >
> > remove_list:
> > + if (n.file)
> > + fput(n.file);
> > +
> > list_del(&n.list);
> > out:
> > mutex_unlock(&match->notify_lock);
> > @@ -1598,6 +1622,27 @@ static ssize_t seccomp_notify_write(struct file *file, const char __user *buf,
> > knotif->state = SECCOMP_NOTIFY_REPLIED;
> > knotif->error = resp.error;
> > knotif->val = resp.val;
> > +
> > + if (resp.return_fd) {
> > + struct fd fd;
> > +
> > + /*
> > + * This is a little hokey: we need a real fget() (i.e. not
> > + * __fget_light(), which is what fdget does), but we also need
> > + * the flags from strcut fd. So, we get it, put it, and get it
> > + * again for real.
> > + */
> > + fd = fdget(resp.fd);
> > + knotif->flags = fd.flags;
> > + fdput(fd);
> > +
> > + knotif->file = fget(resp.fd);
> > + if (!knotif->file) {
> > + ret = -EBADF;
> > + goto out;
>
> Should the "knotif->state = SECCOMP_NOTIFY_REPLIED" and other changes
> be done after the error case here? In case of bad fd, it seems to
> return -EBADF the first time and -EINVAL the next time because the
> state would have been changed to SECCOMP_NOTIFY_REPLIED already.

Yes, good catch, thanks!

Tycho

2018-06-04 00:18:54

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH v3 1/4] seccomp: add a return code to trap to userspace

Hi Jann,

On Sun, Jun 03, 2018 at 08:41:01PM +0200, Jann Horn wrote:
> On Sun, Jun 3, 2018 at 2:29 PM Tycho Andersen <[email protected]> wrote:
> >
> > This patch introduces a means for syscalls matched in seccomp to notify
> > some other task that a particular filter has been triggered.
> >
> > The motivation for this is primarily for use with containers. For example,
> > if a container does an init_module(), we obviously don't want to load this
> > untrusted code, which may be compiled for the wrong version of the kernel
> > anyway. Instead, we could parse the module image, figure out which module
> > the container is trying to load and load it on the host.
> >
> > As another example, containers cannot mknod(), since this checks
> > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> > coding some whitelist in the kernel. Another example is mount(), which has
> > many security restrictions for good reason, but configuration or runtime
> > knowledge could potentially be used to relax these restrictions.
> >
> > This patch adds functionality that is already possible via at least two
> > other means that I know about, both of which involve ptrace(): first, one
> > could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> > Unfortunately this is slow, so a faster version would be to install a
> > filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> > Since ptrace allows only one tracer, if the container runtime is that
> > tracer, users inside the container (or outside) trying to debug it will not
> > be able to use ptrace, which is annoying. It also means that older
> > distributions based on Upstart cannot boot inside containers using ptrace,
> > since upstart itself uses ptrace to start services.
> >
> > The actual implementation of this is fairly small, although getting the
> > synchronization right was/is slightly complex.
> >
> > Finally, it's worth noting that the classic seccomp TOCTOU of reading
> > memory data from the task still applies here, but can be avoided with
> > careful design of the userspace handler: if the userspace handler reads all
> > of the task memory that is necessary before applying its security policy,
> > the tracee's subsequent memory edits will not be read by the tracer.
> [...]
> > @@ -857,13 +1020,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
> > if (IS_ERR(prepared))
> > return PTR_ERR(prepared);
> >
> > + if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> > + listener = get_unused_fd_flags(O_RDWR);
>
> I think you want either 0 or O_CLOEXEC here?

Do we? I suppose it makes sense to be able to set CLOEXEC, but I could
imagine a case where a handler wanted to fork+exec to handle
something. I'm happy to make the change, but it's not obvious to me
that it's what we want by default.

> > +out_put_fd:
> > + if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> > + if (ret < 0) {
> > + fput(listener_f);
> > + put_unused_fd(listener);
> > + } else {
> > + fd_install(listener, listener_f);
> > + ret = listener;
> > + }
> > + }
> > out_free:
> > seccomp_filter_free(prepared);
> > return ret;
> [...]
> > +static __poll_t seccomp_notify_poll(struct file *file,
> > + struct poll_table_struct *poll_tab)
> > +{
> > + struct seccomp_filter *filter = file->private_data;
> > + __poll_t ret = 0;
> > + struct seccomp_knotif *cur;
> > +
> > + ret = mutex_lock_interruptible(&filter->notify_lock);
> > + if (ret < 0)
> > + return ret;
> > +
> > + list_for_each_entry(cur, &filter->notifications, list) {
> > + if (cur->state == SECCOMP_NOTIFY_INIT)
> > + ret |= EPOLLIN | EPOLLRDNORM;
> > + if (cur->state == SECCOMP_NOTIFY_SENT)
> > + ret |= EPOLLOUT | EPOLLWRNORM;
> > + }
> > +
> > + mutex_unlock(&filter->notify_lock);
> > +
> > + return ret;
> > +}
>
> I don't think f_op->poll handlers work like this. AFAIK you're
> supposed to use something like poll_wait() to connect the caller to
> something like a waitqueue head, so that as soon as the file becomes
> ready for reading/writing, any waiting task is notified. See
> eventfd_poll() in fs/eventfd.c for a simple example. AFAICS in the
> current code, seccomp_notify_poll() only works if an event is pending
> at the time seccomp_notify_poll() is called.

Arg. I was trying to avoid adding yet another piece of
synchronization, but perhaps it's not possible. Thanks for pointing
this out.

Tycho

2018-06-08 16:31:20

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v3 0/4] seccomp trap to userspace

On Thu, May 31, 2018 at 7:49 AM, Tycho Andersen <[email protected]> wrote:
> Hi all,
>
> Here's a v3 of the seccomp trap to userspace, with all the nits from v2
> fixed. Open questions from v2 are still:
>
> 1. is it ok not to use netlink?

Yeah, I think there isn't a sensible way to reuse that API, which is
too bad. Let's just try to keep this interface future-proofed. :)

> 2. what should the fd passing API look like? (see patch notes on this
> one for details of why the current one might (?) be a problem)

The only thing in my mind is avoiding the problems with other fd
passing API (e.g. when do rlimits get checked, etc).

> As an added bonus, I've also written some stress testing, with lots of
> tasks and listeners (1000 of each) sharing the same notification thread,
> and not found any issues so far. Code is here:
> https://github.com/tych0/kernel-utils/blob/master/seccomp/notify_stress.c
> although I haven't included it in the patchset.

That's excellent, thanks!

> v2: https://lkml.org/lkml/2018/5/17/627
>
> Tycho Andersen (4):
> seccomp: add a return code to trap to userspace
> seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE
> seccomp: add a way to get a listener fd from ptrace
> seccomp: add support for passing fds via USER_NOTIF

I'm under a time crunch with the merge window, but after -rc2 I should
have time to give this some close review. FWIW, I expect this to enter
-next this cycle and get it into the 4.19 merge window: we need the
feature and the alternatives have been well explored and don't look
workable.

Thanks for the series!

-Kees

--
Kees Cook
Pixel Security

2018-06-08 21:05:21

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH v3 0/4] seccomp trap to userspace

Hi Kees,

On Fri, Jun 08, 2018 at 09:29:42AM -0700, Kees Cook wrote:
> On Thu, May 31, 2018 at 7:49 AM, Tycho Andersen <[email protected]> wrote:
> > Hi all,
> >
> > Here's a v3 of the seccomp trap to userspace, with all the nits from v2
> > fixed. Open questions from v2 are still:
> >
> > 1. is it ok not to use netlink?
>
> Yeah, I think there isn't a sensible way to reuse that API, which is
> too bad. Let's just try to keep this interface future-proofed. :)

Yes, I think it is, assuming that we always use a zero value as the
"do the same thing as before" value. Perhaps I should write that
assumption down somewhere...

> > 2. what should the fd passing API look like? (see patch notes on this
> > one for details of why the current one might (?) be a problem)
>
> The only thing in my mind is avoiding the problems with other fd
> passing API (e.g. when do rlimits get checked, etc).

My read of get_unused_fd_flags() is that it does check RLIMIT_NOFILE,
so I think we're ok there.

My biggest concern was just about the case where we want to do
something besides return an fd from a syscall (e.g. install an fd, but
return it via some pointer or something), but I'm not aware of
anywhere we do that today, so maybe I'm worrying about it too much.

> > As an added bonus, I've also written some stress testing, with lots of
> > tasks and listeners (1000 of each) sharing the same notification thread,
> > and not found any issues so far. Code is here:
> > https://github.com/tych0/kernel-utils/blob/master/seccomp/notify_stress.c
> > although I haven't included it in the patchset.
>
> That's excellent, thanks!
>
> > v2: https://lkml.org/lkml/2018/5/17/627
> >
> > Tycho Andersen (4):
> > seccomp: add a return code to trap to userspace
> > seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE
> > seccomp: add a way to get a listener fd from ptrace
> > seccomp: add support for passing fds via USER_NOTIF
>
> I'm under a time crunch with the merge window, but after -rc2 I should
> have time to give this some close review. FWIW, I expect this to enter
> -next this cycle and get it into the 4.19 merge window: we need the
> feature and the alternatives have been well explored and don't look
> workable.

No rush. I am preparing a v4 with the various comments in this thread
fixed, hopefully I'll send it out early next week.

Tycho

2018-06-12 23:17:00

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH v3 1/4] seccomp: add a return code to trap to userspace

Hi Matthew,

On Tue, Jun 12, 2018 at 02:39:03PM -0700, Matthew Helsley wrote:
> On Thu, May 31, 2018 at 7:49 AM, Tycho Andersen <[email protected]> wrote:
>
> <snip>
>
>
> > +struct seccomp_notif {
> > + __u64 id;
> > + pid_t pid;
> > + struct seccomp_data data;
> > +};
> >
>
> Since it's part of the UAPI I think it would be good to add documentation
> to this patch series. Part of that documentation should talk about which
> pid namespaces this pid value is relevant in. This is especially important
> since the feature is designed for use by things like container/sandbox
> managers.

Yes, at least Documentation/userspace-api/seccomp_filter.txt should be
updated. I'll add that for the next series.

> > +static void seccomp_do_user_notification(int this_syscall,
> > + struct seccomp_filter *match,
> > + const struct seccomp_data *sd)
> > +{
> > + int err;
> > + long ret = 0;
> > + struct seccomp_knotif n = {};
> > +
> > + mutex_lock(&match->notify_lock);
> > + err = -ENOSYS;
> > + if (!match->has_listener)
> > + goto out;
> > +
> > + n.pid = current->pid;
> >
>
> How have you tested this code for correctness? I don't see many
> combinations being tested below nor here:
> https://github.com/tych0/kernel-utils/blob/master/seccomp/notify_stress.c
>
> What about processes in different pid namespaces? Make tests that vary key
> parameters like these between the task generating the notifications and the
> task interested in processing the notifications. Make tests that try to
> kill them at interesting times too. etc. Make tests that pass the
> notification fd around and see how they work (or not).
>
> I ask about testing because you're effectively returning a pid value to
> userspace here but not using the proper macros to access the task's struct
> pid for that purpose. That will work so long as both processes are in the
> same pid namespace but breaks otherwise.
>
> Furthermore, a pid value is not the best solution for the queueing of these
> notifications because a single pid value is not meaningful/correct outside
> current's pid namespace and the seccomp notification file descriptor could
> be passed on to processes in another pid namespaces; this pid value will
> almost always not be relevant or correct there hence taking a reference to

Well, it *has* to be relevant in some cases: if you want to access
some of the task's memory to e.g. read the args to the syscall, you
need to ptrace or map_files to access the memory. So we do need to
pass it, but,

> the struct pid is useful. Then, during read(), the code would use the
> proper macro to turn the struct pid reference into a pid value relevant in
> the *reader's* pid namespace *or* return something obviously bogus if the
> reader is in a pid namespace that can't see that pid. This could prevent
> management processes from being tricked into clobbering the wrong process,
> feeding the wrong process sensitive information via syscall results, etc.

Yes, this makes sense. Seems like we need to do some magic about
passing the tracee's task struct to the listener, so it can do
task_pid_vnr(). We could perhaps require the listener to be in the
init_pid_ns case, but I think things like socket() might still be
useful even if you can't read the tracee's memory.

> Alternately, you could choose to specify that the seccomp manager is
> expected to be in the pid namespace of the process it's managing at all
> times. That's not necessarily trivial either because the process(es) it
> manages could potentially create new child pid namespaces. It also means
> that the processes being managed can "see" the manager process at all times.

Right, I think we don't want to require this.

> Regardless, you still need to use the proper macros to access current's pid
> for export to userspace.

Yes, thanks.

Tycho

2018-06-13 15:33:50

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH v3 1/4] seccomp: add a return code to trap to userspace

On Mon, Jun 4, 2018 at 2:18 AM Tycho Andersen <[email protected]> wrote:
>
> Hi Jann,
>
> On Sun, Jun 03, 2018 at 08:41:01PM +0200, Jann Horn wrote:
> > On Sun, Jun 3, 2018 at 2:29 PM Tycho Andersen <[email protected]> wrote:
> > >
> > > This patch introduces a means for syscalls matched in seccomp to notify
> > > some other task that a particular filter has been triggered.
> > >
> > > The motivation for this is primarily for use with containers. For example,
> > > if a container does an init_module(), we obviously don't want to load this
> > > untrusted code, which may be compiled for the wrong version of the kernel
> > > anyway. Instead, we could parse the module image, figure out which module
> > > the container is trying to load and load it on the host.
> > >
> > > As another example, containers cannot mknod(), since this checks
> > > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> > > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> > > coding some whitelist in the kernel. Another example is mount(), which has
> > > many security restrictions for good reason, but configuration or runtime
> > > knowledge could potentially be used to relax these restrictions.
> > >
> > > This patch adds functionality that is already possible via at least two
> > > other means that I know about, both of which involve ptrace(): first, one
> > > could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> > > Unfortunately this is slow, so a faster version would be to install a
> > > filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> > > Since ptrace allows only one tracer, if the container runtime is that
> > > tracer, users inside the container (or outside) trying to debug it will not
> > > be able to use ptrace, which is annoying. It also means that older
> > > distributions based on Upstart cannot boot inside containers using ptrace,
> > > since upstart itself uses ptrace to start services.
> > >
> > > The actual implementation of this is fairly small, although getting the
> > > synchronization right was/is slightly complex.
> > >
> > > Finally, it's worth noting that the classic seccomp TOCTOU of reading
> > > memory data from the task still applies here, but can be avoided with
> > > careful design of the userspace handler: if the userspace handler reads all
> > > of the task memory that is necessary before applying its security policy,
> > > the tracee's subsequent memory edits will not be read by the tracer.
> > [...]
> > > @@ -857,13 +1020,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
> > > if (IS_ERR(prepared))
> > > return PTR_ERR(prepared);
> > >
> > > + if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> > > + listener = get_unused_fd_flags(O_RDWR);
> >
> > I think you want either 0 or O_CLOEXEC here?
>
> Do we? I suppose it makes sense to be able to set CLOEXEC, but I could
> imagine a case where a handler wanted to fork+exec to handle
> something. I'm happy to make the change, but it's not obvious to me
> that it's what we want by default.

I said "either 0 or O_CLOEXEC" - I just meant that O_RDWR doesn't make
much sense to me here, given that that's not a property of the fd and
will be ignored by the function you're calling.

On whether 0 or O_CLOEXEC is better: If you look at
get_unused_fd_flags() calls in e.g. various ioctl handlers, it's a mix
of places that hardcode 0, places that hardcode O_CLOEXEC, and places
that allow the caller to specify the flag. Either should work - but
personally, I believe that if the caller can't pass a flag,
get_unused_fd_flags(O_CLOEXEC) is the better choice because you can
still clear the O_CLOEXEC flag using fcntl() if necessary, while
setting the flag using fcntl() is potentially racy in a multi-threaded
context.

2018-06-13 15:44:16

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH v3 1/4] seccomp: add a return code to trap to userspace

On Wed, Jun 13, 2018 at 5:32 PM Jann Horn <[email protected]> wrote:
>
> On Mon, Jun 4, 2018 at 2:18 AM Tycho Andersen <[email protected]> wrote:
> >
> > Hi Jann,
> >
> > On Sun, Jun 03, 2018 at 08:41:01PM +0200, Jann Horn wrote:
> > > On Sun, Jun 3, 2018 at 2:29 PM Tycho Andersen <[email protected]> wrote:
> > > >
> > > > This patch introduces a means for syscalls matched in seccomp to notify
> > > > some other task that a particular filter has been triggered.
[...]

By the way: You should probably CC linux-api@ on the next version of
this patchset.
Documentation/process/submitting-patches.rst says:

| If changes affect userland-kernel interfaces, please send the MAN-PAGES
| maintainer (as listed in the MAINTAINERS file) a man-pages patch, or at
| least a notification of the change, so that some information makes its way
| into the manual pages. User-space API changes should also be copied to
| [email protected].

2018-06-14 21:05:32

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH v3 1/4] seccomp: add a return code to trap to userspace

On Thu, Jun 14, 2018 at 12:44:21PM -0700, Matthew Helsley wrote:
> On Tue, Jun 12, 2018 at 4:16 PM, Tycho Andersen <[email protected]> wrote:
>
> > Hi Matthew,
> >
> > On Tue, Jun 12, 2018 at 02:39:03PM -0700, Matthew Helsley wrote:
> > > On Thu, May 31, 2018 at 7:49 AM, Tycho Andersen <[email protected]> wrote:
> > >
> > > <snip>
> > >
> > >
> > > > +struct seccomp_notif {
> > > > + __u64 id;
> > > > + pid_t pid;
> > > > + struct seccomp_data data;
> > > > +};
> > > >
> > >
> > > Since it's part of the UAPI I think it would be good to add documentation
> > > to this patch series. Part of that documentation should talk about which
> > > pid namespaces this pid value is relevant in. This is especially
> > important
> > > since the feature is designed for use by things like container/sandbox
> > > managers.
> >
> > Yes, at least Documentation/userspace-api/seccomp_filter.txt should be
> > updated. I'll add that for the next series.
> >
>
> Are there some relevant man pages too that should be updated too?

Yes, but since those are in a separate tree, I usually send the sets
separately.

> > > > +static void seccomp_do_user_notification(int this_syscall,
> > > > + struct seccomp_filter *match,
> > > > + const struct seccomp_data *sd)
> > > > +{
> > > > + int err;
> > > > + long ret = 0;
> > > > + struct seccomp_knotif n = {};
> > > > +
> > > > + mutex_lock(&match->notify_lock);
> > > > + err = -ENOSYS;
> > > > + if (!match->has_listener)
> > > > + goto out;
> > > > +
> > > > + n.pid = current->pid;
> > > >
> > >
> > > How have you tested this code for correctness? I don't see many
> > > combinations being tested below nor here:
> > > https://github.com/tych0/kernel-utils/blob/master/
> > seccomp/notify_stress.c
> > >
> > > What about processes in different pid namespaces? Make tests that vary
> > key
> > > parameters like these between the task generating the notifications and
> > the
> > > task interested in processing the notifications. Make tests that try to
> > > kill them at interesting times too. etc. Make tests that pass the
> > > notification fd around and see how they work (or not).
> > >
> > > I ask about testing because you're effectively returning a pid value to
> > > userspace here but not using the proper macros to access the task's
> > struct
> > > pid for that purpose. That will work so long as both processes are in the
> > > same pid namespace but breaks otherwise.
> > >
> > > Furthermore, a pid value is not the best solution for the queueing of
> > these
> > > notifications because a single pid value is not meaningful/correct
> > outside
> > > current's pid namespace and the seccomp notification file descriptor
> > could
> > > be passed on to processes in another pid namespaces; this pid value will
> > > almost always not be relevant or correct there hence taking a reference
> > to
> >
> > Well, it *has* to be relevant in some cases: if you want to access
> > some of the task's memory to e.g. read the args to the syscall, you
> > need to ptrace or map_files to access the memory. So we do need to
> > pass it, but,
> >
> > > the struct pid is useful. Then, during read(), the code would use the
> > > proper macro to turn the struct pid reference into a pid value relevant
> > in
> > > the *reader's* pid namespace *or* return something obviously bogus if the
> > > reader is in a pid namespace that can't see that pid. This could prevent
> > > management processes from being tricked into clobbering the wrong
> > process,
> > > feeding the wrong process sensitive information via syscall results, etc.
> >
> > Yes, this makes sense. Seems like we need to do some magic about
> > passing the tracee's task struct to the listener, so it can do
> > task_pid_vnr(). We could perhaps require the listener to be in the
> > init_pid_ns case, but I think things like socket() might still be
> > useful even if you can't read the tracee's memory.
>
>
> You could take a reference to the pid rather than the task
> then use pid_vnr(). In that case the changes should result in something
> like:
>
>
> struct seccomp_knotif {
> /* The pid whose filter triggered the notification */
> struct pid *pid;
> ...
>
> static void seccomp_do_user_notification(...)
> {
> ...
> n.pid = get_task_pid(current, PIDTYPE_PID);
> ...
> remove_list:
> list_del(&n.list);
> put_pid(n.pid);
> ...
> }
> ...
> static ssize_t seccomp_notify_read(...)
> {
> ...
> unotif.pid = pid_vnr(knotif->pid);
> ...
> }
>
> I like holding the pid reference because it's what we do elsewhere when pid
> namespaces
> are a concern and it more precisely specifies what the knotif content needs
> to convey.
> Otherwise I don't think it makes a difference.

Great, thanks, I'll do this. I guess we need a put_pid() here too.

Cheers,

Tycho

2018-06-14 21:54:50

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v3 1/4] seccomp: add a return code to trap to userspace

Tycho Andersen <[email protected]> writes:

> On Thu, Jun 14, 2018 at 12:44:21PM -0700, Matthew Helsley wrote:
>> On Tue, Jun 12, 2018 at 4:16 PM, Tycho Andersen <[email protected]> wrote:
>>
>> > Hi Matthew,
>> >
>> > On Tue, Jun 12, 2018 at 02:39:03PM -0700, Matthew Helsley wrote:
>> > > On Thu, May 31, 2018 at 7:49 AM, Tycho Andersen <[email protected]> wrote:
>> > >
>> > > <snip>
>> > >
>> > >
>> > > > +struct seccomp_notif {
>> > > > + __u64 id;
>> > > > + pid_t pid;
>> > > > + struct seccomp_data data;
>> > > > +};
>> > > >
>> > >
>> > > Since it's part of the UAPI I think it would be good to add documentation
>> > > to this patch series. Part of that documentation should talk about which
>> > > pid namespaces this pid value is relevant in. This is especially
>> > important
>> > > since the feature is designed for use by things like container/sandbox
>> > > managers.
>> >
>> > Yes, at least Documentation/userspace-api/seccomp_filter.txt should be
>> > updated. I'll add that for the next series.
>> >
>>
>> Are there some relevant man pages too that should be updated too?
>
> Yes, but since those are in a separate tree, I usually send the sets
> separately.
>
>> > > > +static void seccomp_do_user_notification(int this_syscall,
>> > > > + struct seccomp_filter *match,
>> > > > + const struct seccomp_data *sd)
>> > > > +{
>> > > > + int err;
>> > > > + long ret = 0;
>> > > > + struct seccomp_knotif n = {};
>> > > > +
>> > > > + mutex_lock(&match->notify_lock);
>> > > > + err = -ENOSYS;
>> > > > + if (!match->has_listener)
>> > > > + goto out;
>> > > > +
>> > > > + n.pid = current->pid;
>> > > >
>> > >
>> > > How have you tested this code for correctness? I don't see many
>> > > combinations being tested below nor here:
>> > > https://github.com/tych0/kernel-utils/blob/master/
>> > seccomp/notify_stress.c
>> > >
>> > > What about processes in different pid namespaces? Make tests that vary
>> > key
>> > > parameters like these between the task generating the notifications and
>> > the
>> > > task interested in processing the notifications. Make tests that try to
>> > > kill them at interesting times too. etc. Make tests that pass the
>> > > notification fd around and see how they work (or not).
>> > >
>> > > I ask about testing because you're effectively returning a pid value to
>> > > userspace here but not using the proper macros to access the task's
>> > struct
>> > > pid for that purpose. That will work so long as both processes are in the
>> > > same pid namespace but breaks otherwise.
>> > >
>> > > Furthermore, a pid value is not the best solution for the queueing of
>> > these
>> > > notifications because a single pid value is not meaningful/correct
>> > outside
>> > > current's pid namespace and the seccomp notification file descriptor
>> > could
>> > > be passed on to processes in another pid namespaces; this pid value will
>> > > almost always not be relevant or correct there hence taking a reference
>> > to
>> >
>> > Well, it *has* to be relevant in some cases: if you want to access
>> > some of the task's memory to e.g. read the args to the syscall, you
>> > need to ptrace or map_files to access the memory. So we do need to
>> > pass it, but,
>> >
>> > > the struct pid is useful. Then, during read(), the code would use the
>> > > proper macro to turn the struct pid reference into a pid value relevant
>> > in
>> > > the *reader's* pid namespace *or* return something obviously bogus if the
>> > > reader is in a pid namespace that can't see that pid. This could prevent
>> > > management processes from being tricked into clobbering the wrong
>> > process,
>> > > feeding the wrong process sensitive information via syscall results, etc.
>> >
>> > Yes, this makes sense. Seems like we need to do some magic about
>> > passing the tracee's task struct to the listener, so it can do
>> > task_pid_vnr(). We could perhaps require the listener to be in the
>> > init_pid_ns case, but I think things like socket() might still be
>> > useful even if you can't read the tracee's memory.
>>
>>
>> You could take a reference to the pid rather than the task
>> then use pid_vnr(). In that case the changes should result in something
>> like:
>>
>>
>> struct seccomp_knotif {
>> /* The pid whose filter triggered the notification */
>> struct pid *pid;
>> ...
>>
>> static void seccomp_do_user_notification(...)
>> {
>> ...
>> n.pid = get_task_pid(current, PIDTYPE_PID);
>> ...
>> remove_list:
>> list_del(&n.list);
>> put_pid(n.pid);
>> ...
>> }
>> ...
>> static ssize_t seccomp_notify_read(...)
>> {
>> ...
>> unotif.pid = pid_vnr(knotif->pid);
>> ...
>> }
>>
>> I like holding the pid reference because it's what we do elsewhere when pid
>> namespaces
>> are a concern and it more precisely specifies what the knotif content needs
>> to convey.
>> Otherwise I don't think it makes a difference.
>
> Great, thanks, I'll do this. I guess we need a put_pid() here too.

A) We know that the task is stopped. Unless there is something
like SIGKILL that can make the task move you don't need to
take a reference to anything.

B) pid_vnr is the wrong answer. When you create the struct file
and intialize the filter you need to capture the calling processes
pid namespace. The you can use "pid_nr_ns(knotif->pid, filter->pid_ns);".
That will work consistently even if the file descriptor is passed
between processes.

Eric

2018-06-20 05:06:07

by Tobin C. Harding

[permalink] [raw]
Subject: Re: [PATCH v3 1/4] seccomp: add a return code to trap to userspace

On Thu, May 31, 2018 at 08:49:46AM -0600, Tycho Andersen wrote:
> This patch introduces a means for syscalls matched in seccomp to notify
> some other task that a particular filter has been triggered.
>
> The motivation for this is primarily for use with containers. For example,
> if a container does an init_module(), we obviously don't want to load this
> untrusted code, which may be compiled for the wrong version of the kernel
> anyway. Instead, we could parse the module image, figure out which module
> the container is trying to load and load it on the host.
>
> As another example, containers cannot mknod(), since this checks
> capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> coding some whitelist in the kernel. Another example is mount(), which has
> many security restrictions for good reason, but configuration or runtime
> knowledge could potentially be used to relax these restrictions.
>
> This patch adds functionality that is already possible via at least two
> other means that I know about, both of which involve ptrace(): first, one
> could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> Unfortunately this is slow, so a faster version would be to install a
> filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> Since ptrace allows only one tracer, if the container runtime is that
> tracer, users inside the container (or outside) trying to debug it will not
> be able to use ptrace, which is annoying. It also means that older
> distributions based on Upstart cannot boot inside containers using ptrace,
> since upstart itself uses ptrace to start services.
>
> The actual implementation of this is fairly small, although getting the
> synchronization right was/is slightly complex.
>
> Finally, it's worth noting that the classic seccomp TOCTOU of reading
> memory data from the task still applies here, but can be avoided with
> careful design of the userspace handler: if the userspace handler reads all
> of the task memory that is necessary before applying its security policy,
> the tracee's subsequent memory edits will not be read by the tracer.
>
> v2: * make id a u64; the idea here being that it will never overflow,
> because 64 is huge (one syscall every nanosecond => wrap every 584
> years) (Andy)
> * prevent nesting of user notifications: if someone is already attached
> the tree in one place, nobody else can attach to the tree (Andy)
> * notify the listener of signals the tracee receives as well (Andy)
> * implement poll
> v3: * lockdep fix (Oleg)
> * drop unnecessary WARN()s (Christian)
> * rearrange error returns to be more rpetty (Christian)
> * fix build in !CONFIG_SECCOMP_USER_NOTIFICATION case
>
> Signed-off-by: Tycho Andersen <[email protected]>
> CC: Kees Cook <[email protected]>
> CC: Andy Lutomirski <[email protected]>
> CC: Oleg Nesterov <[email protected]>
> CC: Eric W. Biederman <[email protected]>
> CC: "Serge E. Hallyn" <[email protected]>
> CC: Christian Brauner <[email protected]>
> CC: Tyler Hicks <[email protected]>
> CC: Akihiro Suda <[email protected]>
> ---
> arch/Kconfig | 7 +
> include/linux/seccomp.h | 3 +-
> include/uapi/linux/seccomp.h | 18 +-
> kernel/seccomp.c | 398 +++++++++++++++++-
> tools/testing/selftests/seccomp/seccomp_bpf.c | 195 ++++++++-
> 5 files changed, 615 insertions(+), 6 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 75dd23acf133..1c1ae8d8c8b9 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -401,6 +401,13 @@ config SECCOMP_FILTER
>
> See Documentation/prctl/seccomp_filter.txt for details.
>
> +config SECCOMP_USER_NOTIFICATION
> + bool "Enable the SECCOMP_RET_USER_NOTIF seccomp action"
> + depends on SECCOMP_FILTER
> + help
> + Enable SECCOMP_RET_USER_NOTIF, a return code which can be used by seccomp
> + programs to notify a userspace listener that a particular event happened.
> +
> config HAVE_GCC_PLUGINS
> bool
> help
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index c723a5c4e3ff..0fd3e0676a1c 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -5,7 +5,8 @@
> #include <uapi/linux/seccomp.h>
>
> #define SECCOMP_FILTER_FLAG_MASK (SECCOMP_FILTER_FLAG_TSYNC | \
> - SECCOMP_FILTER_FLAG_LOG)
> + SECCOMP_FILTER_FLAG_LOG | \
> + SECCOMP_FILTER_FLAG_GET_LISTENER)
>
> #ifdef CONFIG_SECCOMP
>
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index 2a0bd9dd104d..8160e6cad528 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -17,8 +17,9 @@
> #define SECCOMP_GET_ACTION_AVAIL 2
>
> /* Valid flags for SECCOMP_SET_MODE_FILTER */
> -#define SECCOMP_FILTER_FLAG_TSYNC 1
> -#define SECCOMP_FILTER_FLAG_LOG 2
> +#define SECCOMP_FILTER_FLAG_TSYNC 1
> +#define SECCOMP_FILTER_FLAG_LOG 2
> +#define SECCOMP_FILTER_FLAG_GET_LISTENER 4
>
> /*
> * All BPF programs must return a 32-bit value.
> @@ -34,6 +35,7 @@
> #define SECCOMP_RET_KILL SECCOMP_RET_KILL_THREAD
> #define SECCOMP_RET_TRAP 0x00030000U /* disallow and force a SIGSYS */
> #define SECCOMP_RET_ERRNO 0x00050000U /* returns an errno */
> +#define SECCOMP_RET_USER_NOTIF 0x7fc00000U /* notifies userspace */
> #define SECCOMP_RET_TRACE 0x7ff00000U /* pass to a tracer or disallow */
> #define SECCOMP_RET_LOG 0x7ffc0000U /* allow after logging */
> #define SECCOMP_RET_ALLOW 0x7fff0000U /* allow */
> @@ -59,4 +61,16 @@ struct seccomp_data {
> __u64 args[6];
> };
>
> +struct seccomp_notif {
> + __u64 id;
> + pid_t pid;
> + struct seccomp_data data;
> +};
> +
> +struct seccomp_notif_resp {
> + __u64 id;
> + __s32 error;
> + __s64 val;
> +};
> +
> #endif /* _UAPI_LINUX_SECCOMP_H */
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index dc77548167ef..f69327d5f7c7 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -31,6 +31,7 @@
> #endif
>
> #ifdef CONFIG_SECCOMP_FILTER
> +#include <linux/file.h>
> #include <linux/filter.h>
> #include <linux/pid.h>
> #include <linux/ptrace.h>
> @@ -38,6 +39,52 @@
> #include <linux/tracehook.h>
> #include <linux/uaccess.h>
>
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +#include <linux/anon_inodes.h>
> +
> +enum notify_state {
> + SECCOMP_NOTIFY_INIT,
> + SECCOMP_NOTIFY_SENT,
> + SECCOMP_NOTIFY_REPLIED,
> +};
> +
> +struct seccomp_knotif {
> + /* The pid whose filter triggered the notification */
> + pid_t pid;
> +
> + /*
> + * The "cookie" for this request; this is unique for this filter.
> + */
> + u32 id;

Perhaps
/* The "cookie" for this request; this is unique for this filter. */


Epic patch review :)

thanks,
Tobin.

2018-06-20 05:55:21

by Tobin C. Harding

[permalink] [raw]
Subject: Re: [PATCH v3 1/4] seccomp: add a return code to trap to userspace

A few other piddly suggestions.

On Thu, May 31, 2018 at 08:49:46AM -0600, Tycho Andersen wrote:
> This patch introduces a means for syscalls matched in seccomp to notify
> some other task that a particular filter has been triggered.
>
> The motivation for this is primarily for use with containers. For example,
> if a container does an init_module(), we obviously don't want to load this
> untrusted code, which may be compiled for the wrong version of the kernel
> anyway. Instead, we could parse the module image, figure out which module
> the container is trying to load and load it on the host.
>
> As another example, containers cannot mknod(), since this checks
> capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> coding some whitelist in the kernel. Another example is mount(), which has
> many security restrictions for good reason, but configuration or runtime
> knowledge could potentially be used to relax these restrictions.
>
> This patch adds functionality that is already possible via at least two
> other means that I know about, both of which involve ptrace(): first, one
> could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> Unfortunately this is slow, so a faster version would be to install a
> filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> Since ptrace allows only one tracer, if the container runtime is that
> tracer, users inside the container (or outside) trying to debug it will not
> be able to use ptrace, which is annoying. It also means that older
> distributions based on Upstart cannot boot inside containers using ptrace,
> since upstart itself uses ptrace to start services.
>
> The actual implementation of this is fairly small, although getting the
> synchronization right was/is slightly complex.
>
> Finally, it's worth noting that the classic seccomp TOCTOU of reading
> memory data from the task still applies here, but can be avoided with
> careful design of the userspace handler: if the userspace handler reads all
> of the task memory that is necessary before applying its security policy,
> the tracee's subsequent memory edits will not be read by the tracer.
>
> v2: * make id a u64; the idea here being that it will never overflow,
> because 64 is huge (one syscall every nanosecond => wrap every 584
> years) (Andy)
> * prevent nesting of user notifications: if someone is already attached
> the tree in one place, nobody else can attach to the tree (Andy)
> * notify the listener of signals the tracee receives as well (Andy)
> * implement poll
> v3: * lockdep fix (Oleg)
> * drop unnecessary WARN()s (Christian)
> * rearrange error returns to be more rpetty (Christian)
> * fix build in !CONFIG_SECCOMP_USER_NOTIFICATION case
>
> Signed-off-by: Tycho Andersen <[email protected]>
> CC: Kees Cook <[email protected]>
> CC: Andy Lutomirski <[email protected]>
> CC: Oleg Nesterov <[email protected]>
> CC: Eric W. Biederman <[email protected]>
> CC: "Serge E. Hallyn" <[email protected]>
> CC: Christian Brauner <[email protected]>
> CC: Tyler Hicks <[email protected]>
> CC: Akihiro Suda <[email protected]>
> ---
> arch/Kconfig | 7 +
> include/linux/seccomp.h | 3 +-
> include/uapi/linux/seccomp.h | 18 +-
> kernel/seccomp.c | 398 +++++++++++++++++-
> tools/testing/selftests/seccomp/seccomp_bpf.c | 195 ++++++++-
> 5 files changed, 615 insertions(+), 6 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 75dd23acf133..1c1ae8d8c8b9 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -401,6 +401,13 @@ config SECCOMP_FILTER
>
> See Documentation/prctl/seccomp_filter.txt for details.
>
> +config SECCOMP_USER_NOTIFICATION
> + bool "Enable the SECCOMP_RET_USER_NOTIF seccomp action"
> + depends on SECCOMP_FILTER
> + help
> + Enable SECCOMP_RET_USER_NOTIF, a return code which can be used by seccomp
> + programs to notify a userspace listener that a particular event happened.
> +
> config HAVE_GCC_PLUGINS
> bool
> help
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index c723a5c4e3ff..0fd3e0676a1c 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -5,7 +5,8 @@
> #include <uapi/linux/seccomp.h>
>
> #define SECCOMP_FILTER_FLAG_MASK (SECCOMP_FILTER_FLAG_TSYNC | \
> - SECCOMP_FILTER_FLAG_LOG)
> + SECCOMP_FILTER_FLAG_LOG | \
> + SECCOMP_FILTER_FLAG_GET_LISTENER)
>
> #ifdef CONFIG_SECCOMP
>
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index 2a0bd9dd104d..8160e6cad528 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -17,8 +17,9 @@
> #define SECCOMP_GET_ACTION_AVAIL 2
>
> /* Valid flags for SECCOMP_SET_MODE_FILTER */
> -#define SECCOMP_FILTER_FLAG_TSYNC 1
> -#define SECCOMP_FILTER_FLAG_LOG 2
> +#define SECCOMP_FILTER_FLAG_TSYNC 1
> +#define SECCOMP_FILTER_FLAG_LOG 2
> +#define SECCOMP_FILTER_FLAG_GET_LISTENER 4
>
> /*
> * All BPF programs must return a 32-bit value.
> @@ -34,6 +35,7 @@
> #define SECCOMP_RET_KILL SECCOMP_RET_KILL_THREAD
> #define SECCOMP_RET_TRAP 0x00030000U /* disallow and force a SIGSYS */
> #define SECCOMP_RET_ERRNO 0x00050000U /* returns an errno */
> +#define SECCOMP_RET_USER_NOTIF 0x7fc00000U /* notifies userspace */
> #define SECCOMP_RET_TRACE 0x7ff00000U /* pass to a tracer or disallow */
> #define SECCOMP_RET_LOG 0x7ffc0000U /* allow after logging */
> #define SECCOMP_RET_ALLOW 0x7fff0000U /* allow */
> @@ -59,4 +61,16 @@ struct seccomp_data {
> __u64 args[6];
> };
>
> +struct seccomp_notif {
> + __u64 id;
> + pid_t pid;
> + struct seccomp_data data;
> +};
> +
> +struct seccomp_notif_resp {
> + __u64 id;
> + __s32 error;
> + __s64 val;
> +};
> +
> #endif /* _UAPI_LINUX_SECCOMP_H */
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index dc77548167ef..f69327d5f7c7 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -31,6 +31,7 @@
> #endif
>
> #ifdef CONFIG_SECCOMP_FILTER
> +#include <linux/file.h>
> #include <linux/filter.h>
> #include <linux/pid.h>
> #include <linux/ptrace.h>
> @@ -38,6 +39,52 @@
> #include <linux/tracehook.h>
> #include <linux/uaccess.h>
>
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +#include <linux/anon_inodes.h>
> +
> +enum notify_state {
> + SECCOMP_NOTIFY_INIT,
> + SECCOMP_NOTIFY_SENT,
> + SECCOMP_NOTIFY_REPLIED,
> +};
> +
> +struct seccomp_knotif {
> + /* The pid whose filter triggered the notification */
> + pid_t pid;
> +
> + /*
> + * The "cookie" for this request; this is unique for this filter.
> + */
> + u32 id;
> +
> + /*
> + * The seccomp data. This pointer is valid the entire time this
> + * notification is active, since it comes from __seccomp_filter which
> + * eclipses the entire lifecycle here.
> + */
> + const struct seccomp_data *data;
> +
> + /*
> + * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
> + * struct seccomp_knotif is created and starts out in INIT. Once the
> + * handler reads the notification off of an FD, it transitions to READ.
> + * If a signal is received the state transitions back to INIT and
> + * another message is sent. When the userspace handler replies, state
> + * transitions to REPLIED.
> + */
> + enum notify_state state;
> +
> + /* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
> + int error;
> + long val;
> +
> + /* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
> + struct completion ready;
> +
> + struct list_head list;
> +};
> +#endif
> +
> /**
> * struct seccomp_filter - container for seccomp BPF programs
> *
> @@ -64,6 +111,27 @@ struct seccomp_filter {
> bool log;
> struct seccomp_filter *prev;
> struct bpf_prog *prog;
> +
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> + /*
> + * A semaphore that users of this notification can wait on for
> + * changes. Actual reads and writes are still controlled with
> + * filter->notify_lock.
> + */
> + struct semaphore request;
> +
> + /* A lock for all notification-related accesses. */
> + struct mutex notify_lock;
> +
> + /* Is there currently an attached listener? */
> + bool has_listener;
> +
> + /* The id of the next request. */
> + u64 next_id;
> +
> + /* A list of struct seccomp_knotif elements. */
> + struct list_head notifications;
> +#endif
> };
>
> /* Limit any path through the tree to 256KB worth of instructions. */
> @@ -383,6 +451,13 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
> if (!sfilter)
> return ERR_PTR(-ENOMEM);
>
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> + mutex_init(&sfilter->notify_lock);
> + sema_init(&sfilter->request, 0);
> + INIT_LIST_HEAD(&sfilter->notifications);
> + sfilter->next_id = get_random_u64();
> +#endif
> +
> ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
> seccomp_check_filter, save_orig);
> if (ret < 0) {
> @@ -547,13 +622,15 @@ static void seccomp_send_sigsys(int syscall, int reason)
> #define SECCOMP_LOG_TRACE (1 << 4)
> #define SECCOMP_LOG_LOG (1 << 5)
> #define SECCOMP_LOG_ALLOW (1 << 6)
> +#define SECCOMP_LOG_USER_NOTIF (1 << 7)
>
> static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS |
> SECCOMP_LOG_KILL_THREAD |
> SECCOMP_LOG_TRAP |
> SECCOMP_LOG_ERRNO |
> SECCOMP_LOG_TRACE |
> - SECCOMP_LOG_LOG;
> + SECCOMP_LOG_LOG |
> + SECCOMP_LOG_USER_NOTIF;
>
> static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
> bool requested)
> @@ -572,6 +649,9 @@ static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
> case SECCOMP_RET_TRACE:
> log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE;
> break;
> + case SECCOMP_RET_USER_NOTIF:
> + log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF;
> + break;
> case SECCOMP_RET_LOG:
> log = seccomp_actions_logged & SECCOMP_LOG_LOG;
> break;
> @@ -645,6 +725,81 @@ void secure_computing_strict(int this_syscall)
> }
> #else
>
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
> +{
> + /* Note: overflow is ok here, the id just needs to be unique */
> + return filter->next_id++;
> +}
> +
> +static void seccomp_do_user_notification(int this_syscall,
> + struct seccomp_filter *match,
> + const struct seccomp_data *sd)
> +{
> + int err;
> + long ret = 0;
> + struct seccomp_knotif n = {};
> +
> + mutex_lock(&match->notify_lock);
> + err = -ENOSYS;
> + if (!match->has_listener)
> + goto out;
> +
> + n.pid = current->pid;
> + n.state = SECCOMP_NOTIFY_INIT;
> + n.data = sd;
> + n.id = seccomp_next_notify_id(match);
> + init_completion(&n.ready);
> +
> + list_add(&n.list, &match->notifications);
> +
> + mutex_unlock(&match->notify_lock);
> + up(&match->request);
> +
> + err = wait_for_completion_interruptible(&n.ready);
> + mutex_lock(&match->notify_lock);
> +
> + /*
> + * Here it's possible we got a signal and then had to wait on the mutex
> + * while the reply was sent, so let's be sure there wasn't a response
> + * in the meantime.
> + */
> + if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
> + /*
> + * We got a signal. Let's tell userspace about it (potentially
> + * again, if we had already notified them about the first one).
> + */
> + if (n.state == SECCOMP_NOTIFY_SENT) {
> + n.state = SECCOMP_NOTIFY_INIT;
> + up(&match->request);
> + }
> + mutex_unlock(&match->notify_lock);
> + err = wait_for_completion_killable(&n.ready);
> + mutex_lock(&match->notify_lock);
> + if (err < 0)
> + goto remove_list;
> + }
> +
> + ret = n.val;
> + err = n.error;
> +
> +remove_list:
> + list_del(&n.list);
> +out:
> + mutex_unlock(&match->notify_lock);
> + syscall_set_return_value(current, task_pt_regs(current),
> + err, ret);
> +}
> +#else
> +static void seccomp_do_user_notification(int this_syscall,
> + struct seccomp_filter *match,
> + const struct seccomp_data *sd)
> +{
> + seccomp_log(this_syscall, SIGSYS, SECCOMP_RET_USER_NOTIF, true);
> + do_exit(SIGSYS);
> +}
> +#endif
> +
> #ifdef CONFIG_SECCOMP_FILTER
> static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
> const bool recheck_after_trace)
> @@ -722,6 +877,9 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>
> return 0;
>
> + case SECCOMP_RET_USER_NOTIF:
> + seccomp_do_user_notification(this_syscall, match, sd);
> + goto skip;
> case SECCOMP_RET_LOG:

Perhaps add newline after 'got skip;' (inline with rest of this function).

> seccomp_log(this_syscall, 0, action, true);
> return 0;
> @@ -828,6 +986,9 @@ static long seccomp_set_mode_strict(void)
> }
>
> #ifdef CONFIG_SECCOMP_FILTER
> +static struct file *init_listener(struct task_struct *,
> + struct seccomp_filter *);
> +
> /**
> * seccomp_set_mode_filter: internal function for setting seccomp filter
> * @flags: flags to change filter behavior
> @@ -847,6 +1008,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
> const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
> struct seccomp_filter *prepared = NULL;
> long ret = -EINVAL;
> + int listener = 0;
> + struct file *listener_f = NULL;
>
> /* Validate flags. */
> if (flags & ~SECCOMP_FILTER_FLAG_MASK)
> @@ -857,13 +1020,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
> if (IS_ERR(prepared))
> return PTR_ERR(prepared);
>
> + if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> + listener = get_unused_fd_flags(O_RDWR);
> + if (listener < 0) {
> + ret = listener;
> + goto out_free;
> + }
> +
> + listener_f = init_listener(current, prepared);
> + if (IS_ERR(listener_f)) {
> + put_unused_fd(listener);
> + ret = PTR_ERR(listener_f);
> + goto out_free;
> + }
> + }
> +
> /*
> * Make sure we cannot change seccomp or nnp state via TSYNC
> * while another thread is in the middle of calling exec.
> */
> if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
> mutex_lock_killable(&current->signal->cred_guard_mutex))
> - goto out_free;
> + goto out_put_fd;
>
> spin_lock_irq(&current->sighand->siglock);
>
> @@ -881,6 +1059,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
> spin_unlock_irq(&current->sighand->siglock);
> if (flags & SECCOMP_FILTER_FLAG_TSYNC)
> mutex_unlock(&current->signal->cred_guard_mutex);
> +out_put_fd:
> + if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> + if (ret < 0) {
> + fput(listener_f);
> + put_unused_fd(listener);
> + } else {
> + fd_install(listener, listener_f);
> + ret = listener;
> + }
> + }
> out_free:
> seccomp_filter_free(prepared);
> return ret;
> @@ -909,6 +1097,9 @@ static long seccomp_get_action_avail(const char __user *uaction)
> case SECCOMP_RET_LOG:
> case SECCOMP_RET_ALLOW:
> break;
> + case SECCOMP_RET_USER_NOTIF:
> + if (IS_ENABLED(CONFIG_SECCOMP_USER_NOTIFICATION))
> + break;
> default:
> return -EOPNOTSUPP;
> }
> @@ -1105,6 +1296,7 @@ long seccomp_get_metadata(struct task_struct *task,
> #define SECCOMP_RET_KILL_THREAD_NAME "kill_thread"
> #define SECCOMP_RET_TRAP_NAME "trap"
> #define SECCOMP_RET_ERRNO_NAME "errno"
> +#define SECCOMP_RET_USER_NOTIF_NAME "user_notif"
> #define SECCOMP_RET_TRACE_NAME "trace"
> #define SECCOMP_RET_LOG_NAME "log"
> #define SECCOMP_RET_ALLOW_NAME "allow"
> @@ -1114,6 +1306,7 @@ static const char seccomp_actions_avail[] =
> SECCOMP_RET_KILL_THREAD_NAME " "
> SECCOMP_RET_TRAP_NAME " "
> SECCOMP_RET_ERRNO_NAME " "
> + SECCOMP_RET_USER_NOTIF_NAME " "
> SECCOMP_RET_TRACE_NAME " "
> SECCOMP_RET_LOG_NAME " "
> SECCOMP_RET_ALLOW_NAME;
> @@ -1131,6 +1324,7 @@ static const struct seccomp_log_name seccomp_log_names[] = {
> { SECCOMP_LOG_TRACE, SECCOMP_RET_TRACE_NAME },
> { SECCOMP_LOG_LOG, SECCOMP_RET_LOG_NAME },
> { SECCOMP_LOG_ALLOW, SECCOMP_RET_ALLOW_NAME },
> + { SECCOMP_LOG_USER_NOTIF, SECCOMP_RET_USER_NOTIF_NAME },
> { }
> };
>
> @@ -1279,3 +1473,203 @@ static int __init seccomp_sysctl_init(void)
> device_initcall(seccomp_sysctl_init)
>
> #endif /* CONFIG_SYSCTL */
> +
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +static int seccomp_notify_release(struct inode *inode, struct file *file)
> +{
> + struct seccomp_filter *filter = file->private_data;
> + struct seccomp_knotif *knotif;
> +
> + mutex_lock(&filter->notify_lock);
> +
> + /*
> + * If this file is being closed because e.g. the task who owned it
> + * died, let's wake everyone up who was waiting on us.
> + */
> + list_for_each_entry(knotif, &filter->notifications, list) {
> + if (knotif->state == SECCOMP_NOTIFY_REPLIED)
> + continue;
> +
> + knotif->state = SECCOMP_NOTIFY_REPLIED;
> + knotif->error = -ENOSYS;
> + knotif->val = 0;
> +
> + complete(&knotif->ready);
> + }
> +
> + filter->has_listener = false;
> + mutex_unlock(&filter->notify_lock);
> + __put_seccomp_filter(filter);
> + return 0;
> +}
> +
> +static ssize_t seccomp_notify_read(struct file *f, char __user *buf,
> + size_t size, loff_t *ppos)
> +{
> + struct seccomp_filter *filter = f->private_data;
> + struct seccomp_knotif *knotif = NULL, *cur;
> + struct seccomp_notif unotif;
> + ssize_t ret;
> +
> + /* No offset reads. */
> + if (*ppos != 0)
> + return -EINVAL;
> +
> + ret = down_interruptible(&filter->request);
> + if (ret < 0)
> + return ret;
> +
> + mutex_lock(&filter->notify_lock);
> + list_for_each_entry(cur, &filter->notifications, list) {
> + if (cur->state == SECCOMP_NOTIFY_INIT) {
> + knotif = cur;
> + break;
> + }
> + }
> +
> + /*
> + * If we didn't find a notification, it could be that the task was
> + * interrupted between the time we were woken and when we were able to
> + * acquire the rw lock. Should we retry here or just -ENOENT? -ENOENT
> + * for now.
> + */
> + if (!knotif) {
> + ret = -ENOENT;
> + goto out;
> + }
> +
> + unotif.id = knotif->id;
> + unotif.pid = knotif->pid;
> + unotif.data = *(knotif->data);
> +
> + size = min_t(size_t, size, sizeof(struct seccomp_notif));
> + if (copy_to_user(buf, &unotif, size)) {
> + ret = -EFAULT;
> + goto out;
> + }
> +
> + ret = sizeof(unotif);
> + knotif->state = SECCOMP_NOTIFY_SENT;
> +
> +out:
> + mutex_unlock(&filter->notify_lock);
> + return ret;
> +}
> +
> +static ssize_t seccomp_notify_write(struct file *file, const char __user *buf,
> + size_t size, loff_t *ppos)
> +{
> + struct seccomp_filter *filter = file->private_data;
> + struct seccomp_notif_resp resp = {};
> + struct seccomp_knotif *knotif = NULL;

Perhaps the other way around (inverse Christmas tree)

struct seccomp_knotif *knotif = NULL;
struct seccomp_notif_resp resp = {};

> + ssize_t ret = -EINVAL;
> +
> + /* No partial writes. */
> + if (*ppos != 0)
> + return -EINVAL;
> +
> + size = min_t(size_t, size, sizeof(resp));
> + if (copy_from_user(&resp, buf, size))
> + return -EFAULT;
> +
> + ret = mutex_lock_interruptible(&filter->notify_lock);
> + if (ret < 0)
> + return ret;
> +
> + list_for_each_entry(knotif, &filter->notifications, list) {
> + if (knotif->id == resp.id)
> + break;
> + }
> +
> + if (!knotif || knotif->id != resp.id) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + /* Allow exactly one reply. */
> + if (knotif->state != SECCOMP_NOTIFY_SENT) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + ret = size;
> + knotif->state = SECCOMP_NOTIFY_REPLIED;
> + knotif->error = resp.error;
> + knotif->val = resp.val;
> + complete(&knotif->ready);
> +out:
> + mutex_unlock(&filter->notify_lock);
> + return ret;
> +}
> +
> +static __poll_t seccomp_notify_poll(struct file *file,
> + struct poll_table_struct *poll_tab)
> +{
> + struct seccomp_filter *filter = file->private_data;
> + __poll_t ret = 0;
> + struct seccomp_knotif *cur;
> +
> + ret = mutex_lock_interruptible(&filter->notify_lock);
> + if (ret < 0)
> + return ret;
> +
> + list_for_each_entry(cur, &filter->notifications, list) {
> + if (cur->state == SECCOMP_NOTIFY_INIT)
> + ret |= EPOLLIN | EPOLLRDNORM;
> + if (cur->state == SECCOMP_NOTIFY_SENT)
> + ret |= EPOLLOUT | EPOLLWRNORM;
> + }
> +
> + mutex_unlock(&filter->notify_lock);
> +
> + return ret;
> +}
> +
> +static const struct file_operations seccomp_notify_ops = {
> + .read = seccomp_notify_read,
> + .write = seccomp_notify_write,
> + .poll = seccomp_notify_poll,
> + .release = seccomp_notify_release,
> +};
> +
> +static struct file *init_listener(struct task_struct *task,
> + struct seccomp_filter *filter)
> +{
> + struct file *ret = ERR_PTR(-EBUSY);
> + struct seccomp_filter *cur;
> + bool have_listener = false;
> + int filter_nesting = 0;
> +
> + for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> + mutex_lock_nested(&cur->notify_lock, filter_nesting);
> + filter_nesting++;
> + if (cur->has_listener)
> + have_listener = true;
> + }
> +
> + if (have_listener)
> + goto out;

Perhaps just goto out directly

if (cur->has_listener)
goto out;


Hope this helps,
Tobin

2018-06-20 14:43:07

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH v3 1/4] seccomp: add a return code to trap to userspace

Hi Eric,

On Thu, Jun 14, 2018 at 04:53:51PM -0500, Eric W. Biederman wrote:
> >> static void seccomp_do_user_notification(...)
> >> {
> >> ...
> >> n.pid = get_task_pid(current, PIDTYPE_PID);
> >> ...
> >> remove_list:
> >> list_del(&n.list);
> >> put_pid(n.pid);
> >> ...
> >> }
> >> ...
> >> static ssize_t seccomp_notify_read(...)
> >> {
> >> ...
> >> unotif.pid = pid_vnr(knotif->pid);
> >> ...
> >> }
> >>
> >> I like holding the pid reference because it's what we do elsewhere when pid
> >> namespaces
> >> are a concern and it more precisely specifies what the knotif content needs
> >> to convey.
> >> Otherwise I don't think it makes a difference.
> >
> > Great, thanks, I'll do this. I guess we need a put_pid() here too.
>
> A) We know that the task is stopped. Unless there is something
> like SIGKILL that can make the task move you don't need to
> take a reference to anything.

Yes, agreed. (I think the task can't die, because even if it gets an
interrupt, we hold the ->notify_lock here, so it'll block waiting for
that to remove itself from the notification queue.)

> B) pid_vnr is the wrong answer. When you create the struct file
> and intialize the filter you need to capture the calling processes
> pid namespace. The you can use "pid_nr_ns(knotif->pid, filter->pid_ns);".
> That will work consistently even if the file descriptor is passed
> between processes.

We want the pid of the tracee in the tracer's namespace, so I'm not so
sure. Doesn't your code above give us the pid in the namespace of the
task that happened to create the struct file (which may be unrelated
to the namespace of the tracer)?

Tycho

2018-06-20 14:58:02

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH v3 1/4] seccomp: add a return code to trap to userspace

On Thu, Jun 14, 2018 at 02:55:03PM -0700, Matthew Helsley wrote:
> > + /*
> > + * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
> > + * struct seccomp_knotif is created and starts out in INIT. Once
> > the
> > + * handler reads the notification off of an FD, it transitions to
> > READ.
> >
>
> Comment above needs to be updated: READ state no longer exists.

Thanks, fixed for v4.

>
> > + * If a signal is received the state transitions back to INIT and
> > + * another message is sent. When the userspace handler replies,
> > state
> > + * transitions to REPLIED.
> > + */
> > + enum notify_state state;
> > +
> > + /* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
> > + int error;
> > + long val;
> > +
> > + /* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
> > + struct completion ready;
> > +
> > + struct list_head list;
> > +};
> > +#endif
> > +
> > /**
> > * struct seccomp_filter - container for seccomp BPF programs
> > *
> > @@ -64,6 +111,27 @@ struct seccomp_filter {
> > bool log;
> > struct seccomp_filter *prev;
> > struct bpf_prog *prog;
> > +
> > +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> > + /*
> > + * A semaphore that users of this notification can wait on for
> > + * changes. Actual reads and writes are still controlled with
> > + * filter->notify_lock.
> > + */
> > + struct semaphore request;
> > +
> > + /* A lock for all notification-related accesses. */
> > + struct mutex notify_lock;
> > +
> > + /* Is there currently an attached listener? */
> > + bool has_listener;
> >
>
> Assumes only one listener.
>
> Is there any chance userspace could try to attach multiple listeners and
> get confused? Perhaps by sharing the fd with multiple processes (via exec,
> SCM_RIGHTS..)?

Yes, only one listener fd is allowed, we return -EBUSY if you try to
attach multiple times. Once the fd is created, listening to it with
multiple tasks is fine, we synchronize that (indeed, that's what the
stress test tests).

> > + /*
> > + * Here it's possible we got a signal and then had to wait on the
> > mutex
> > + * while the reply was sent, so let's be sure there wasn't a
> > response
> > + * in the meantime.
> > + */
> > + if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
> > + /*
> > + * We got a signal. Let's tell userspace about it
> > (potentially
> > + * again, if we had already notified them about the first
> > one).
> > + */
> > + if (n.state == SECCOMP_NOTIFY_SENT) {
> > + n.state = SECCOMP_NOTIFY_INIT;
> > + up(&match->request);
> > + }
> > + mutex_unlock(&match->notify_lock);
> > + err = wait_for_completion_killable(&n.ready);
> > + mutex_lock(&match->notify_lock);
> > + if (err < 0)
> > + goto remove_list;
> > + }
> >
>
> This section looks a little odd.
>
> I notice you don't loop here yet you reset n.state back to INIT and don't
> wait for
> it to return to the SENT (much less REPLIED) state. This effectively looks
> like a subtle
> hard-coded number of "send" retries.

Yes, it is.

> I'm low on time so I couldn't figure out answers to some important
> questions:
>
> Is one "retry" always enough? Is it possible the notification might get
> lost due to the lack
> of a loop here? Is it possible the syscall might return without a "reply"?
> That's not
> consistent with your comments about the states. Is there any possibility of
> missing a
> completion because of the interaction between this and other parts that
> signal completion
> conditionally based on this state?

This is in response to this subthread on the first version:
https://lkml.org/lkml/2018/3/15/1122

Basically, we want exactly one notification if the thing gets a
signal, and then we wait_killable instead of wait_interruptible
instead. It should not be possible for a syscall to return without a
reply (indeed, for a syscall to return, it needs a wake on the
completion, so it would just hang forever if there's some bug here).

> +
> > + ret = n.val;
> > + err = n.error;
> > +
> > +remove_list:
> > + list_del(&n.list);
> > +out:
> > + mutex_unlock(&match->notify_lock);
> > + syscall_set_return_value(current, task_pt_regs(current),
> > + err, ret);
> > +}
> > +#else
> > +static void seccomp_do_user_notification(int this_syscall,
> > + struct seccomp_filter *match,
> > + const struct seccomp_data *sd)
> > +{
> > + seccomp_log(this_syscall, SIGSYS, SECCOMP_RET_USER_NOTIF, true);
> > + do_exit(SIGSYS);
> > +}
> > +#endif
> > +
> > #ifdef CONFIG_SECCOMP_FILTER
> > static int __seccomp_filter(int this_syscall, const struct seccomp_data
> > *sd,
> > const bool recheck_after_trace)
> > @@ -722,6 +877,9 @@ static int __seccomp_filter(int this_syscall, const
> > struct seccomp_data *sd,
> >
> > return 0;
> >
> > + case SECCOMP_RET_USER_NOTIF:
> > + seccomp_do_user_notification(this_syscall, match, sd);
> > + goto skip;
> > case SECCOMP_RET_LOG:
> > seccomp_log(this_syscall, 0, action, true);
> > return 0;
> > @@ -828,6 +986,9 @@ static long seccomp_set_mode_strict(void)
> > }
> >
> > #ifdef CONFIG_SECCOMP_FILTER
> > +static struct file *init_listener(struct task_struct *,
> > + struct seccomp_filter *);
> > +
> > /**
> > * seccomp_set_mode_filter: internal function for setting seccomp filter
> > * @flags: flags to change filter behavior
> > @@ -847,6 +1008,8 @@ static long seccomp_set_mode_filter(unsigned int
> > flags,
> > const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
> > struct seccomp_filter *prepared = NULL;
> > long ret = -EINVAL;
> > + int listener = 0;
> > + struct file *listener_f = NULL;
> >
> > /* Validate flags. */
> > if (flags & ~SECCOMP_FILTER_FLAG_MASK)
> > @@ -857,13 +1020,28 @@ static long seccomp_set_mode_filter(unsigned int
> > flags,
> > if (IS_ERR(prepared))
> > return PTR_ERR(prepared);
> >
> > + if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
>
> + listener = get_unused_fd_flags(O_RDWR);
> >
>
> So I checked and a man page does need to be update. Among other updates
> this needs a mention in the RETURN section of
> man 2 seccomp along the lines of:
>
> "If SECCOMP_FILTER_FLAG_GET_LISTENER is specified then the return value is
> a file descriptor upon success or -1 otherwise."
>
> "If seccomp() is called multiple times
> with SECCOMP_FILTER_FLAG_GET_LISTENER then, after the first successful
> call, it will not
> create any new file descriptors and may return with either the existing
> file descriptor or -1 and errno set to EBUSY" (I smell a testcase!)

Yes, I'll update the man pages when these get merged. And there
already is a test for this.

> Now, that said, this interface is somewhat awkward. The principle of least
> surprise suggests that subsequent calls with GET_LISTENER
> should return any file descriptor created previously rather than -1 and
> errno==EBUSY. Perhaps rather than GET_LISTENER you could rename it
> NEW_LISTENER_EXCL. Or you could add the ability to return any pre-existing
> fd.

How about NEW_LISTENER? We can add a GET_LISTENER command later if
people find it useful, but frankly I think the ptrace interface is
what most people will use, and we could possibly drop this all
together.

> > +static ssize_t seccomp_notify_read(struct file *f, char __user *buf,
> > + size_t size, loff_t *ppos)
> > +{
> > + struct seccomp_filter *filter = f->private_data;
> > + struct seccomp_knotif *knotif = NULL, *cur;
> > + struct seccomp_notif unotif;
> > + ssize_t ret;
> > +
> > + /* No offset reads. */
> > + if (*ppos != 0)
> > + return -EINVAL;
> > +
> >
>
> I think you should use memset to clear unotif -- that prevents any
> accidental information disclosure
> via the kernel stack if this structure ever has padding.

Good catch, thanks.

> Will the size of unotif ever *possibly* need to change (i.e. grow) in the
> future? You might consider
> how that could be enabled while maintaining backwards compatibility.
>
> I'm thinking you should check size here:
>
> if (size < sizeof(unotif))
> return -EFAULT;

Well, if it ever needs to grow, we can't do that :). And yes, the
intent is for it to grow -- it does later in this series, for example.

> Yes, copy_to_user() will stop you from walking past a page boundary but it
> won't
> stop you from quietly clobbering legitimate userspace memory. Seeing:
>
> size != sizeof(unotif)
>
> is also an indicator that there may be an ABI mismatch -- so again,
> consider the
> current/possible-future behavior carefully.
>
> + ret = down_interruptible(&filter->request);
> > + if (ret < 0)
> > + return ret;
> > +
> > + mutex_lock(&filter->notify_lock);
> > + list_for_each_entry(cur, &filter->notifications, list) {
> > + if (cur->state == SECCOMP_NOTIFY_INIT) {
> > + knotif = cur;
> > + break;
> > + }
> > + }
> > +
> > + /*
> > + * If we didn't find a notification, it could be that the task was
> > + * interrupted between the time we were woken and when we were
> > able to
> > + * acquire the rw lock. Should we retry here or just -ENOENT?
> > -ENOENT
> > + * for now.
> > + */
> > + if (!knotif) {
> > + ret = -ENOENT;
> > + goto out;
> > + }
> > +
> > + unotif.id = knotif->id;
> > + unotif.pid = knotif->pid;
> > + unotif.data = *(knotif->data);
> > +
> > + size = min_t(size_t, size, sizeof(struct seccomp_notif));
>
>
> nit: sizeof(unotif) here since that's shorter, ultimately what we're
> copying to userspace,
> changes if the type ever changes, and you're also using that to set ret
> later.

Fixed, thanks.

>
> >
>
> + if (copy_to_user(buf, &unotif, size)) {
> > + ret = -EFAULT;
> > + goto out;
> > + }
> > +
> > + ret = sizeof(unotif);
> > + knotif->state = SECCOMP_NOTIFY_SENT;
> > +
> > +out:
> > + mutex_unlock(&filter->notify_lock);
> > + return ret;
> > +}
> > +
> > +static ssize_t seccomp_notify_write(struct file *file, const char __user
> > *buf,
> > + size_t size, loff_t *ppos)
> > +{
> > + struct seccomp_filter *filter = file->private_data;
> > + struct seccomp_notif_resp resp = {};
> > + struct seccomp_knotif *knotif = NULL;
> > + ssize_t ret = -EINVAL;
> > +
> > + /* No partial writes. */
> > + if (*ppos != 0)
> > + return -EINVAL;
> >
>
> What happens if size < sizeof(resp) ? Is there a chance we could give some
> kernel bits to
> the process we're filtering with seccomp()?
>
> Again: Using memset() to clear resp seems like a good step since it
> contains field(s) that
> could appear in the filtered program. That or simply:
>
> if (size < sizeof(resp))
> return -EINVAL;

We're initializing with = {}, so I think it's not necessary here.

>
> > +
> > + size = min_t(size_t, size, sizeof(resp));
> > + if (copy_from_user(&resp, buf, size))
> > + return -EFAULT;
> > +
> > + ret = mutex_lock_interruptible(&filter->notify_lock);
> > + if (ret < 0)
> > + return ret;
> > +
> > + list_for_each_entry(knotif, &filter->notifications, list) {
> > + if (knotif->id == resp.id)
> > + break;
> > + }
> > +
> > + if (!knotif || knotif->id != resp.id) {
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > +
> > + /* Allow exactly one reply. */
> > + if (knotif->state != SECCOMP_NOTIFY_SENT) {
> > + ret = -EINVAL;
> >
>
> nit: is there a better errno than EINVAL for this which would distinguish
> this error from simple invalid parameters?
>
> EALREADY (connection already in progress)? <-- not mentioned in man 2
> write ("unused" below)
> EINPROGRESS (operation already in progress)? <-- unused (see man 2 connect)

Sure, I'll switch to one of these.

> ENOTSUP ? <--
> unused
> ENOTUNIQ (name not unique on network)? <-- unused
> ENOSPC (device containing file has no room for the data)? <-- definitely
> used
> EIO (low level IO error) ?
> <-- definitely used
>
>
> > + goto out;
> > + }
> > +
> > + ret = size;
> > + knotif->state = SECCOMP_NOTIFY_REPLIED;
> > + knotif->error = resp.error;
> > + knotif->val = resp.val;
> > + complete(&knotif->ready);
> > +out:
> > + mutex_unlock(&filter->notify_lock);
> > + return ret;
> > +}
> > +
> > +static __poll_t seccomp_notify_poll(struct file *file,
> > + struct poll_table_struct *poll_tab)
> > +{
> > + struct seccomp_filter *filter = file->private_data;
> > + __poll_t ret = 0;
> > + struct seccomp_knotif *cur;
> > +
> > + ret = mutex_lock_interruptible(&filter->notify_lock);
> > + if (ret < 0)
> > + return ret;
>
> +
> > + list_for_each_entry(cur, &filter->notifications, list) {
> > + if (cur->state == SECCOMP_NOTIFY_INIT)
> > + ret |= EPOLLIN | EPOLLRDNORM;
> > + if (cur->state == SECCOMP_NOTIFY_SENT)
> > + ret |= EPOLLOUT | EPOLLWRNORM;
> >
>
> Hmm, it's been a while since I read poll file ops but if you do wind up
> walking this list then you may not have to walk the entire list here:
>
> if (ret == EPOLLIN | EPOLLRDNORM | EPOLLOUT | EPOLLWRNORM)
> break;
>
> Then poll() is not always O(N) (where N is the number of queued
> notifications...)

Yep, poll is totally reworked for v4 based on other comments.

Tycho