Received: by 10.223.185.116 with SMTP id b49csp1118953wrg; Wed, 14 Feb 2018 11:57:41 -0800 (PST) X-Google-Smtp-Source: AH8x227RkL68G8npXjnjCe7HoY3cZlbshAeWU3/KZ3IbfboymPcK3YbIIofMrcJNYVfw9+n90rBR X-Received: by 10.99.99.132 with SMTP id x126mr144629pgb.86.1518638261841; Wed, 14 Feb 2018 11:57:41 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1518638261; cv=none; d=google.com; s=arc-20160816; b=QOtrc3iRJmO4eHZGiTQNbCGZPDd5rQtwLoiBk8u1AyLW8q8oz1m84Ht6/ROtpQUU/B odADKVrsvkCr4bzcG4G7LWYGJCcFHgUCQ1yqlZK7FhqjmxjLde7/fxYURxTVTUL6fM16 rHfToXPGNlO4fML2G/KmftW3+ZeLSnD16sPiXUXM4VwAkjefePXQ0G5BmQrJVgkWqIY8 wFkJQcuQ0RAGiyvqsY0XJJRS80XZzHyJMwMfKefYARSa2WgvBG8vdJwcS13M6Mf9ZyTu hnu3dbIt5UciTpg9r9LSPS5/DLmanE5CdpQMFEPEwN9e192UB5SoI74RptrdzR50rYu3 vK0Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:dkim-signature:arc-authentication-results; bh=8y05HLNHhFSv2aG37B/XExSchBKN72ONDxvGASB0bQ4=; b=daXYurnVfCsSiB+0vh5p7fSUm34es5Cb7Sy0qP46zZKHIABLFrNM5Mo9MunDlqrffN zpOoDSZ3TYX6sHe7MyoujhBPyrbiw3P53jcK18/2psJa2AGUXxBJgaBz/cqS+NxAj/23 Q29EnTBRkCQwHcrGdWWfJR/stDj36EHI16jX3eLL7lBjKoaXRokbj/sJQ+n2eaMq8Qo5 rOAzkneMy/gXenJRJRslBYnrR1vK6+ejWnd2DlWofAs45FcOeGIeHXL5oeglW0NF2aHY sYayugg7vs+O6JMTt4R4rnx5QMsPVL2cnrs/CNZRXcujp80Zo0kW7z/3pIPfHKHMCpLL uYjQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@tycho-ws.20150623.gappssmtp.com header.s=20150623 header.b=Z0icQIL9; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id l8si1244364pgq.35.2018.02.14.11.57.27; Wed, 14 Feb 2018 11:57:41 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@tycho-ws.20150623.gappssmtp.com header.s=20150623 header.b=Z0icQIL9; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1031317AbeBNPaF (ORCPT + 99 others); Wed, 14 Feb 2018 10:30:05 -0500 Received: from mail-qt0-f182.google.com ([209.85.216.182]:37251 "EHLO mail-qt0-f182.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1031179AbeBNPaD (ORCPT ); Wed, 14 Feb 2018 10:30:03 -0500 Received: by mail-qt0-f182.google.com with SMTP id d26so6555120qtj.4 for ; Wed, 14 Feb 2018 07:30:03 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=tycho-ws.20150623.gappssmtp.com; s=20150623; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to:user-agent; bh=8y05HLNHhFSv2aG37B/XExSchBKN72ONDxvGASB0bQ4=; b=Z0icQIL9pYISJlflQaIWtiU3dzfLOdoOSo4vx/TgCUrJxh3/DiCiqGGMLpbDS3gXDp b/PgOJJNYzR3N/hCHnJ6ENrRnBglTgr+jsN4u8O2ZD/bOveWXCAqQZgdN7SsKjq4xk97 upLAZMPxnmSf1MI+093qreiYwgoBgVoX2dhUQQoO/7keeakUFUG34gJRVP/o78BsHOCP /PGkqXH/+j+zbNcOHrBp6ad2tg7aaqy4RK2TCI/x0uHInbECzkOUcP6C0mxlMLQHOHp0 S7Z9BFkHrkQy26YHPmEM9UAngAoWyipOIWp7g7o6ofmQOiO4+REYLJ46dXTMahJYr0U5 lVDw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to:user-agent; bh=8y05HLNHhFSv2aG37B/XExSchBKN72ONDxvGASB0bQ4=; b=VPMqHNne5O/DhJ++2E6PdTqGOEa+ZtL/lH9vxDyv5mkrlTebbJTPV2QxhYgj0dM0CP ZhY+yP4fc6Q6rdOEAmgH1s0U7W15Zy/H7vxYXT2YxFRmV7cJK9jZlFYDd0gUgbqwnDKo mvcu613ZRUTtIDZgSHXOVMzMAemkyuNvHtYX97qh5xqvTOu5zK8ehKA4GT7If/V216Ut 0Q3TXJEN8pEG6ACckKBLANSo3y4htKXBkyg1S1wICx+Pfpr/3zCN7PikHduuq0YNNKQK 5tjAMJLAXqV+95QuqVXiFgWg6PkBbAi3I335WLBAuCRg0HE6hpNGvqo4YbkAukda7w+L 1d8Q== X-Gm-Message-State: APf1xPCvjghZM6MrIq7l9YCqponeP+oWVY4eDPamVhToLwZvc2q0mltV iftcYlZtp0Ai+N96m0dcQMXDbA== X-Received: by 10.200.42.244 with SMTP id c49mr8377463qta.310.1518622202431; Wed, 14 Feb 2018 07:30:02 -0800 (PST) Received: from cisco ([8.24.24.129]) by smtp.gmail.com with ESMTPSA id t10sm11373207qkg.39.2018.02.14.07.30.00 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Wed, 14 Feb 2018 07:30:01 -0800 (PST) Date: Wed, 14 Feb 2018 08:29:58 -0700 From: Tycho Andersen To: Kees Cook Cc: LKML , Linux Containers , Andy Lutomirski , Oleg Nesterov , "Eric W . Biederman" , "Serge E . Hallyn" , Christian Brauner , Tyler Hicks , Akihiro Suda , Tom Hromatka , Sargun Dhillon , Paul Moore Subject: Re: [RFC 1/3] seccomp: add a return code to trap to userspace Message-ID: <20180214152958.cjgwh2k52zji2jxk@cisco> References: <20180204104946.25559-1-tycho@tycho.ws> <20180204104946.25559-2-tycho@tycho.ws> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: NeoMutt/20170609 (1.8.3) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hey Kees, Thanks for taking a look! On Tue, Feb 13, 2018 at 01:09:20PM -0800, Kees Cook wrote: > On Sun, Feb 4, 2018 at 2:49 AM, Tycho Andersen wrote: > > This patch introduces a means for syscalls matched in seccomp to notify > > some other task that a particular filter has been triggered. > > > > The motivation for this is primarily for use with containers. For example, > > if a container does an init_module(), we obviously don't want to load this > > untrusted code, which may be compiled for the wrong version of the kernel > > anyway. Instead, we could parse the module image, figure out which module > > the container is trying to load and load it on the host. > > > > As another example, containers cannot mknod(), since this checks > > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or > > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard > > coding some whitelist in the kernel. Another example is mount(), which has > > many security restrictions for good reason, but configuration or runtime > > knowledge could potentially be used to relax these restrictions. > > Related to the eBPF seccomp thread, can the logic for these things be > handled entirely by eBPF? My assumption is that you still need to stop > the process to do something (i.e. do a mknod, or a mount) before > letting it continue. Is there some "wait for notification" system in > eBPF? I replied in the other thread (https://patchwork.ozlabs.org/cover/872938/#1856642 for those following along at home), but no, at least not that I know of. > > The actual implementation of this is fairly small, although getting the > > synchronization right was/is slightly complex. Also worth noting that there > > is one race still present: > > > > 1. a task does a SECCOMP_RET_USER_NOTIF > > 2. the userspace handler reads this notification > > 3. the task dies > > 4. a new task with the same pid starts > > 5. this new task does a SECCOMP_RET_USER_NOTIF, gets the same cookie id > > that the previous one did > > 6. the userspace handler writes a response > > > > There's no way to distinguish this case right now. Maybe we care, maybe we > > don't, but it's worth noting. > > So, I'd like to avoid the cookie if possible (surprise). Why isn't it > possible to close the kernel-side of the fd to indicate that it lost > the pid it was attached to? Because the fd is for a filter, not a task. > Is this just that the reader has no idea > who is sending messages? So the risk is a fork/die loop within the > same process tree (i.e. attached to the same filter)? Hrmpf. I can't > think of a better way to handle the > one(fd)-to-many(task-with-that-filter-attached) situation... Yes, exactly. The cookie just adds uniqueness, and as Andy pointed out if we switch to u64, the race above basically ("u64 should be enough for anybody") goes away. > > Right now the interface is a simple structure copy across a file > > descriptor. We could potentially invent something fancier. > > I wonder if this communication should be netlink, which gives a more > well-structured way to describe what's on the wire? The reason I ask > is because if we ever change the seccomp_data structure, we'll now > have two places where we need to deal with it (the first being within > the BPF itself). My initial idea was to prefix the communication with > a size field, then send the structure, and then I had nightmares, and > realized this was basically netlink reinvented. I suggested netlink in LA, and everyone (especially Andy) groaned very loudly :). I'm happy to switch it to netlink if you like, although i think memcpy() of structs should be safe here, since the return value from read or write can indicate the size of things. > > Finally, it's worth noting that the classic seccomp TOCTOU of reading > > memory data from the task still applies here, but can be avoided with > > careful design of the userspace handler: if the userspace handler reads all > > of the task memory that is necessary before applying its security policy, > > the tracee's subsequent memory edits will not be read by the tracer. > > Is this really true? Couldn't a multi-threaded process muck with > memory out from under both the manager and the stopped process? Sure, but as long as the manager copies the relevant arguments out of the tracee's memory *before* evaluating whether it's safe to do the thing the tracee wants to do, it's ok. The assumption here is that the tracee can't corrupt the manager's memory (because if it could, lots of other things would already be broken). > > /* > > * All BPF programs must return a 32-bit value. > > @@ -34,6 +35,7 @@ > > #define SECCOMP_RET_KILL SECCOMP_RET_KILL_THREAD > > #define SECCOMP_RET_TRAP 0x00030000U /* disallow and force a SIGSYS */ > > #define SECCOMP_RET_ERRNO 0x00050000U /* returns an errno */ > > +#define SECCOMP_RET_USER_NOTIF 0x7fc00000U /* notifies userspace */ > > #define SECCOMP_RET_TRACE 0x7ff00000U /* pass to a tracer or disallow */ > > /me tries to come up with an ordering rationale here and fails. > > An ERRNO filter would block a USER_NOTIF because it's unconditional. > TRACE could be either, USER_NOTIF could be either. > > This means TRACE rules would be bumped by a USER_NOTIF... hmm. Yes, I didn't exactly know what to do here. ERRNO, TRAP, and KILL all seemed more important than USER_NOTIF, but TRACE didn't. I don't have a strong opinion about what to do here, because users can adjust their filters accordingly. Let me know what you prefer. > > +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION > > I wonder if it's time to split up seccomp.c ... probably not, but I've > always been unhappy with the #ifdefs even for just regular _FILTER. ;) A reasonable question. I'm happy to do that as a separate series before this one goes in if you want. > > +static ssize_t seccomp_notify_write(struct file *file, const char __user *buf, > > + size_t size, loff_t *ppos) > > +{ > > + struct seccomp_filter *filter = file->private_data; > > + struct seccomp_notif_resp resp = {}; > > + struct seccomp_knotif *knotif = NULL; > > + struct list_head *cur; > > + ssize_t ret = -EINVAL; > > + > > + /* No partial writes. */ > > + if (*ppos != 0) > > + return -EINVAL; > > + > > + size = min_t(size_t, size, sizeof(resp)); > > In this case, we can't use min_t, size _must_ be == sizeof(resp), > otherwise we're operating on what's in the stack (which is zeroed, but > still). I'm not sure I follow. If the user passes in an old (smaller) struct seccomp_notif_resp, we don't want to copy more than they specified. If they pass in a bigger one, this will be sizeof(resp). > > + if (copy_from_user(&resp, buf, size)) > > + return -EFAULT; > > + > > + ret = mutex_lock_interruptible(&filter->notify_lock); > > + if (ret < 0) > > + return ret; > > + > > + list_for_each(cur, &filter->notifications) { > > + knotif = list_entry(cur, struct seccomp_knotif, list); > > + > > + if (knotif->id == resp.id) > > + break; > > So we're finding the matching id here. Now, I'm trying to think about > how this will look in real-world use: the pid will be _blocked_ while > this happening. And all the other pids that trip this filter will > _also_ be blocked, since they're all waiting for the reader to read > and respond. The risk is pid death while waiting, and having another > appear with the same pid, trigger the same filter, get blocked, and > then the reader replies for the old pid, and the new pid gets the > results? Yep, exactly. > Since this notification queue is already linear, can't we use ordering > to enforce this? i.e. only the pid at the head of the filter > notification queue is going to have anything happening to it. Or is > the idea to have multiple readers/writers of the fd? I'm not really sure how we prevent multiple readers/writers of the fd. But even with a single writer, the case you described "could" happen (although again, with u64 cookies it shouldn't be a problem). I'm not sure how ordering helps us though; the problem is really that one entry for a pid was deleted, and a whole new one was created. So ordering will look ok, but the response will go to the wrong pid. > > + } > > + > > + if (!knotif || knotif->id != resp.id) { > > + ret = -EINVAL; > > + goto out; > > + } > > + > > + ret = size; > > + knotif->state = SECCOMP_NOTIFY_WRITE; > > + knotif->error = resp.error; > > + knotif->val = resp.val; > > + complete(&knotif->ready); > > +out: > > + mutex_unlock(&filter->notify_lock); > > + return ret; > > +} > > + > > +static const struct file_operations seccomp_notify_ops = { > > + .read = seccomp_notify_read, > > + .write = seccomp_notify_write, > > + /* TODO: poll */ > > What's needed for poll? I think you've got all the pieces you need > already, i.e. wait queue, notifications, etc. Nothing, I just didn't implement it. I will do so for v2. > > + .release = seccomp_notify_release, > > +}; > > + > > +static struct file *init_listener(struct seccomp_filter *filter) > > +{ > > + struct file *ret; > > + > > + mutex_lock(&filter->notify_lock); > > + if (filter->has_listener) { > > + mutex_unlock(&filter->notify_lock); > > + return ERR_PTR(-EBUSY); > > + } > > + > > + ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops, > > + filter, O_RDWR); > > + if (IS_ERR(ret)) { > > + __put_seccomp_filter(filter); > > + } else { > > + /* > > + * Intentionally don't put_seccomp_filter(). The file > > + * has a reference to it now. > > + */ > > + filter->has_listener = true; > > + } > > I spent some time staring at this, and I don't see it: where is the > get_() for this? The caller of init_listener() already does a put() on > the failure path. It seems like there is a get() missing near the > start of init_listener(), or I've entirely missed something. Ugh, yes. For the SECCOMP_FILTER_FLAG_GET_LISTENER case, you're right. Originally I only had the ptrace-based one, and that has a get() in get_nth_filter(), so the comment makes sense in that case. I'll straighten this out for v2 and > (Regardless, I think the usage counting need a comment somewhere, > maybe near the top of seccomp.c with the field?) ...add a comment. > > +#define USER_NOTIF_MAGIC 116983961184613L > > Is this just you mashing the numpad? :) Is there some better way to generate magic numbers? :) Cheers, Tycho