Date: Mon, 1 Mar 2021 14:21:56 +0100
From: Christian Brauner
To: Sargun Dhillon
Cc: Kees Cook, LKML, Giuseppe Scrivano, Tycho Andersen,
    Hariharan Ananthakrishnan, Keerti Lakshminarayan, Kyle Anderson,
    Linux Containers List, stgraber@ubuntu.com, Andy Lutomirski
Subject: Re: seccomp: Delay filter activation
Message-ID: <20210301132156.in3z53t5xxy3ity5@wittgenstein>
References: <20210301110907.2qoxmiy55gpkgwnq@wittgenstein>
In-Reply-To: <20210301110907.2qoxmiy55gpkgwnq@wittgenstein>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Mar 01, 2021 at 12:09:09PM +0100, Christian Brauner wrote:
> On Sat, Feb 20, 2021 at 01:31:57AM -0800, Sargun Dhillon wrote:
> > We've run into a problem where attaching a filter can be quite messy
> > business because the filter itself intercepts sendmsg,
> > and other syscalls related to exfiltrating the listener FD. I believe
> > that this problem set has been brought up before, and although there
> > are "simpler" methods of exfiltrating the listener, like clone3 or
> > pidfd_getfd, these are still less than ideal.
>
> (You really like sending patches and discussion points in the middle of
> the merge window. :D I think everyone's panicked about getting their PRs
> in shape so it's not unlikely that this sometimes gets lost on the list. :))
>
> It hasn't been a huge problem for us, especially since we added
> pidfd_getfd(). This seemed like a straightforward problem to solve by
> selecting a fixed fd number that is to be used for the listener. But I can
> see why it is annoying.
>
> >
> > One of the ideas that's been talked about (I want to say back at LSS
> > NA) is the idea of "delayed activation". I was thinking that it might
> > be nice to have a mechanism to do delayed attach, either activated on
> > execve / fork, or an ioctl on the listenerfd to activate the filter
> > and have a flag like SECCOMP_FILTER_FLAG_NEW_LISTENER_INACTIVE, which
> > indicates that the listener should be set up, but not enforcing, and
> > another ioctl to activate it.
> >
> > The latter approach is preferred due to simplicity, but I can see a
> > situation where you could accidentally get into a state where the
> > filter is not being enforced. Additionally, this may have unforeseen
> > implications with CRIU.
>
> (If you were to expose an ioctl() that allows userspace to query the
> notifier state then CRIU shouldn't have a problem restoring the notifier
> in the correct state. Right now it doesn't do anything fancy about the
> notifier, it just restores the task with the filter. It just has to
> learn about the new feature and that's fine imho.)
>
> >
> > I'm curious whether this is a problem others share, and whether any of
> > the aforementioned approaches seem reasonable.
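
For reference, the pidfd_getfd() route mentioned above could look roughly
like the following on the manager side. This is a minimal sketch, not code
from this thread: fetch_remote_fd() is a hypothetical helper name, the
syscall numbers are only fallbacks for older headers, and it assumes a
kernel with pidfd_open() (5.3) and pidfd_getfd() (5.6) plus ptrace-attach
permission over the target.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallback syscall numbers for headers that predate these syscalls. */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_pidfd_getfd
#define SYS_pidfd_getfd 438
#endif

/*
 * Hypothetical helper: duplicate file descriptor `remote_fd` of process
 * `pid` into the calling process. Returns the new local fd, or -1 with
 * errno set by the failing syscall.
 */
static int fetch_remote_fd(pid_t pid, int remote_fd)
{
        int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
        if (pidfd < 0)
                return -1;

        int local_fd = (int)syscall(SYS_pidfd_getfd, pidfd, remote_fd, 0);

        /* Preserve errno from pidfd_getfd() across close(). */
        int saved_errno = errno;
        close(pidfd);
        errno = saved_errno;

        return local_fd;
}
```

With a fixed, agreed-upon listener fd number in the target, the manager
would call fetch_remote_fd(target_pid, that_number) once the target has
installed its filter.
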
>
> So when I originally suggested the delayed activation I had another
> related idea that I think I might have mentioned too: if we're already
> considering delaying filter activation I'd like to discuss the possibility
> of attaching a seccomp filter to a task.
>
> Right now, if one task wants to attach to another task they need to
> recreate the whole seccomp filter and load it. That's not just pretty
> expensive but also only works if you have access to the rules that the
> filter was generated with. For containers that's usually some sort of
> pseudo seccomp filter configuration language dumped into a config file
> from which it can be read.
>
> So right now the status quo is:
>
> struct sock_filter filter[] = {
>         BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)),
>         BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, nr, 0, 1),
>         BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_USER_NOTIF), /* Get me a listener fd */
>         BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
> };
> struct sock_fprog prog = {
>         .len = (unsigned short)ARRAY_SIZE(filter),
>         .filter = filter,
> };
> int fd = seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
>
> and then the caller must send the fd to the manager or the manager uses
> pidfd_getfd().
>
> But, why not get a bit crazy^wcreative; especially since seccomp() is
> already a multiplexer.
> We introduce a new seccomp flag:
>
> #define SECCOMP_FILTER_DETACHED
>
> and a new seccomp command:
>
> #define SECCOMP_ATTACH_FILTER
>
> And now we could do something like:
>
> pid_t pid = fork();
> if (pid < 0)
>         return;
>
> if (pid == 0) {
>         // do stuff
>         BARRIER_WAKE_SETUP_DONE;
>
>         // do more unrelated stuff
>
>         BARRIER_WAIT_SECCOMP_FILTER;
>         execve(exec-something);
> } else {
>
>         int fd_filter;
>
>         struct sock_filter filter[] = {
>                 BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)),
>                 BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, nr, 0, 1),
>                 BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
>         };
>
>         struct sock_fprog prog = {
>                 .len = (unsigned short)ARRAY_SIZE(filter),
>                 .filter = filter,
>         };
>
>         fd_filter = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_DETACHED, &prog);
>
>         BARRIER_WAIT_SETUP_DONE;
>
>         int ret = seccomp(SECCOMP_ATTACH_FILTER, 0, INT_TO_PTR(fd_listener));

This obviously should've been something like:

struct seccomp_filter_attach {
        union {
                __s32 pidfd;
                __s32 pid;
        };
        __u32 fd_filter;
};

and then

int ret = seccomp(SECCOMP_ATTACH_FILTER, 0, seccomp_filter_attach);

>
>         BARRIER_WAKE_SECCOMP_FILTER;
> }
>
> And now you have attached a filter to another task. This would be super
> elegant for a container manager. The container manager could also stash
> the filter fd and when attaching to a container the manager can send the
> attaching task the fd and the attaching task can do:
>
> int ret = seccomp(SECCOMP_ATTACH_FILTER, 0, INT_TO_PTR(fd_filter));
>
> too and would be attached to the same filter as the target task.
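
To make the proposed argument layout concrete, here is a small userspace
sketch of struct seccomp_filter_attach. Everything in it is hypothetical:
SECCOMP_ATTACH_FILTER does not exist, make_attach_by_pidfd() is an
illustrative helper name, and int32_t/uint32_t stand in for the kernel's
__s32/__u32 so the snippet builds without UAPI headers. It only shows how
the union and the filter fd would be filled in and how large the struct is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical argument for the proposed SECCOMP_ATTACH_FILTER command.
 * The target is named either by pidfd or by pid (same storage), and
 * fd_filter is the detached filter fd to attach to it.
 */
struct seccomp_filter_attach {
        union {
                int32_t pidfd;
                int32_t pid;
        };
        uint32_t fd_filter;
};

/* Two 32-bit members, no padding expected. */
_Static_assert(sizeof(struct seccomp_filter_attach) == 8,
               "unexpected padding in seccomp_filter_attach");

/* Illustrative helper: build the argument for a target named by pidfd. */
static struct seccomp_filter_attach make_attach_by_pidfd(int pidfd, int fd_filter)
{
        struct seccomp_filter_attach attach;

        /* Zero everything first so no stale stack bytes reach the kernel. */
        memset(&attach, 0, sizeof(attach));
        attach.pidfd = pidfd;
        attach.fd_filter = (uint32_t)fd_filter;
        return attach;
}
```

Because pidfd and pid share storage, the struct alone does not say which
one the caller meant. The proposal above leaves that open; the flags
argument of seccomp() would be one possible place to encode it.
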
>
> And for the listener fd case a container manager could simply set
> SECCOMP_RET_USER_NOTIF as before
>
> struct sock_filter filter[] = {
>         BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)),
>         BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, nr, 0, 1),
>         BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_USER_NOTIF),
>         BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
> };
>
> and now fd_filter simply functions as the notifier fd after
> seccomp(SECCOMP_ATTACH_FILTER). That's basically the fancy version of my
> delayed notifier activation idea.
>
> I'm sure there's nastiness to figure out but I would love to see
> something like this.
>
> Christian
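
As a reading aid for the four-instruction filter quoted above, the sketch
below steps through it in userspace. It is not a general classic-BPF
interpreter and not how the kernel runs filters: eval_notif_filter() and
classify_syscall() are hypothetical names, only the three opcodes used in
the mail are handled, and anything else is treated as
SECCOMP_RET_KILL_PROCESS. It assumes Linux UAPI headers recent enough to
define SECCOMP_RET_USER_NOTIF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Evaluate a filter that uses only LD|W|ABS (of nr), JEQ|K and RET|K. */
static uint32_t eval_notif_filter(const struct sock_filter *prog, size_t len,
                                  const struct seccomp_data *data)
{
        uint32_t acc = 0;

        for (size_t pc = 0; pc < len; pc++) {
                const struct sock_filter *insn = &prog[pc];

                switch (insn->code) {
                case BPF_LD | BPF_W | BPF_ABS:
                        /* This sketch only knows how to load the syscall number. */
                        if (insn->k != offsetof(struct seccomp_data, nr))
                                return SECCOMP_RET_KILL_PROCESS;
                        acc = (uint32_t)data->nr;
                        break;
                case BPF_JMP | BPF_JEQ | BPF_K:
                        /* Offsets are relative to the next instruction. */
                        pc += (acc == insn->k) ? insn->jt : insn->jf;
                        break;
                case BPF_RET | BPF_K:
                        return insn->k;
                default:
                        return SECCOMP_RET_KILL_PROCESS;
                }
        }
        /* Ran off the end without a return instruction. */
        return SECCOMP_RET_KILL_PROCESS;
}

/* Build the filter from the mail for `watched_nr` and evaluate it for `nr`. */
static uint32_t classify_syscall(int watched_nr, int nr)
{
        struct sock_filter filter[] = {
                BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
                BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (uint32_t)watched_nr, 0, 1),
                BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
                BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        };
        struct seccomp_data data = { .nr = nr };

        return eval_notif_filter(filter, sizeof(filter) / sizeof(filter[0]), &data);
}
```

On a match the jump offset is 0, so execution falls through to the
SECCOMP_RET_USER_NOTIF return; otherwise it skips one instruction and
returns SECCOMP_RET_ALLOW. A production filter would normally also check
seccomp_data.arch before looking at nr.
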