From: Jann Horn
Date: Thu, 24 Sep 2020 02:25:03 +0200
Subject: Re: [PATCH 3/6] seccomp: Implement constant action bitmaps
In-Reply-To: <20200923232923.3142503-4-keescook@chromium.org>
To: Kees Cook
Cc: YiFei Zhu, Christian Brauner, Tycho Andersen, Andy Lutomirski,
	Will Drewry, Andrea Arcangeli, Giuseppe Scrivano,
	Tobin Feldman-Fitzthum, Dimitrios Skarlatos, Valentin Rothberg,
	Hubertus Franke, Jack Chen, Josep Torrellas, Tianyin Xu, bpf,
	Linux Containers, Linux API, kernel list

On Thu, Sep 24, 2020 at 1:29 AM Kees Cook wrote:
> One of the most common pain points with seccomp filters has been dealing
> with the overhead of processing the filters, especially for "always allow"
> or "always reject" cases.

The "always reject" cases don't need to be fast, in particular not the
kill_thread/kill_process ones. Nobody's going to have "process kills
itself by executing a forbidden syscall" on a critical hot codepath.

> While BPF is extremely fast[1], it will always
> have overhead associated with it. Additionally, due to seccomp's design,
> filters are layered, which means processing time goes up as the number
> of filters attached goes up.
[...]
> In order to build this mapping at filter attach time, each filter is
> executed for every syscall (under each possible architecture), and
> checked for any accesses of struct seccomp_data that are not the "arch"
> nor "nr" (syscall) members. If only "arch" and "nr" are examined, then
> there is a constant mapping for that syscall, and bitmaps can be updated
> accordingly. If any accesses happen outside of those struct members,
> seccomp must not bypass filter execution for that syscall, since program
> state will be used to determine filter action result. (This logic comes
> in the next patch.)
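To make the attach-time walk above concrete, here is a rough sketch of the
idea (the helper seccomp_emulate_filter() and its access-tracking flag are
made-up names for illustration, not code from this patch):

static void sketch_update_allow_bitmap(struct seccomp_filter *filter,
				       u32 arch, unsigned long *allow)
{
	unsigned int nr;

	for (nr = 0; nr < NR_syscalls; nr++) {
		/* Only "nr" and "arch" are populated... */
		struct seccomp_data sd = { .nr = nr, .arch = arch };
		bool touched_other_fields = false;
		u32 action;

		/* ...so the hypothetical emulator can flag any other access. */
		action = seccomp_emulate_filter(filter, &sd,
						&touched_other_fields);

		/*
		 * The "allow" bit may only stay set when the result is
		 * constant for this nr/arch and that constant is ALLOW;
		 * everything else keeps going through the filters.
		 */
		if (touched_other_fields || action != SECCOMP_RET_ALLOW)
			clear_bit(nr, allow);
	}
}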
>
> [1] https://lore.kernel.org/bpf/20200531171915.wsxvdjeetmhpsdv2@ast-mbp.dhcp.thefacebook.com/
> [2] https://lore.kernel.org/bpf/20200601101137.GA121847@gardel-login/
> [3] https://lore.kernel.org/bpf/717a06e7f35740ccb4c70470ec70fb2f@huawei.com/
>
> Signed-off-by: Kees Cook
> ---
>  include/linux/seccomp.h |  18 ++++
>  kernel/seccomp.c        | 207 +++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 221 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index 0be20bc81ea9..96df2f899e3d 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -25,6 +25,17 @@
>  #define SECCOMP_ARCH_IS_MULTIPLEX	3
>  #define SECCOMP_ARCH_IS_UNKNOWN		0xff
>
> +/* When no bits are set for a syscall, filters are run. */
> +struct seccomp_bitmaps {
> +#ifdef SECCOMP_ARCH
> +	/* "allow" are initialized to set and only ever get cleared. */
> +	DECLARE_BITMAP(allow, NR_syscalls);

This bitmap makes sense. The "NR_syscalls" part assumes that the compat
syscall tables will not be bigger than the native syscall table, right?
I guess that's usually mostly true nowadays, thanks to the syscall table
unification... (might be worth a comment though)

> +	/* These are initialized to clear and only ever get set. */
> +	DECLARE_BITMAP(kill_thread, NR_syscalls);
> +	DECLARE_BITMAP(kill_process, NR_syscalls);

I don't think these bitmaps make sense, this is not part of any fastpath.
(However, a "which syscalls have a fixed result" bitmap might make sense
if we want to export the list of permitted syscalls as a text file in
procfs, as I mentioned over at .)

> +#endif
> +};
> +
>  struct seccomp_filter;
>  /**
>   * struct seccomp - the state of a seccomp'ed process
> @@ -45,6 +56,13 @@ struct seccomp {
>  #endif
>  	atomic_t filter_count;
>  	struct seccomp_filter *filter;
> +	struct seccomp_bitmaps native;
> +#ifdef CONFIG_COMPAT
> +	struct seccomp_bitmaps compat;
> +#endif
> +#ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
> +	struct seccomp_bitmaps multiplex;
> +#endif

Why do we have one bitmap per thread (in struct seccomp) instead of
putting the bitmap for a given filter and all its ancestors into the
seccomp_filter?

>  };
>
>  #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 0a3ff8eb8aea..111a238bc532 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -318,7 +318,7 @@ static inline u8 seccomp_get_arch(u32 syscall_arch, u32 syscall_nr)
>
>  #ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
>  	if (syscall_arch == SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH) {
> -		seccomp_arch |= (sd->nr & SECCOMP_MULTIPLEXED_SYSCALL_TABLE_MASK) >>
> +		seccomp_arch |= (syscall_nr & SECCOMP_MULTIPLEXED_SYSCALL_TABLE_MASK) >>
>  				SECCOMP_MULTIPLEXED_SYSCALL_TABLE_SHIFT;

This belongs over into patch 1.

>  	}
>  #endif
> @@ -559,6 +559,21 @@ static inline void seccomp_sync_threads(unsigned long flags)
>  		atomic_set(&thread->seccomp.filter_count,
>  			   atomic_read(&thread->seccomp.filter_count));
>
> +		/* Copy syscall filter bitmaps. */
> +		memcpy(&thread->seccomp.native,
> +		       &caller->seccomp.native,
> +		       sizeof(caller->seccomp.native));
> +#ifdef CONFIG_COMPAT
> +		memcpy(&thread->seccomp.compat,
> +		       &caller->seccomp.compat,
> +		       sizeof(caller->seccomp.compat));
> +#endif
> +#ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
> +		memcpy(&thread->seccomp.multiplex,
> +		       &caller->seccomp.multiplex,
> +		       sizeof(caller->seccomp.multiplex));
> +#endif

This part wouldn't be necessary if the bitmasks were part of the
seccomp_filter...
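To make that concrete, a minimal sketch of the per-filter layout being
suggested (fields abridged and purely illustrative, not the actual patch):
the cumulative cache lives in the refcounted filter object, so TSYNC only
has to point sibling threads at the same filter and the memcpy()s above
disappear.

struct seccomp_filter {
	refcount_t refs;
	bool log;
	struct seccomp_filter *prev;
	struct bpf_prog *prog;
	/*
	 * Syscalls that this filter *and all of its ancestors* always
	 * allow, computed once at attach time. Threads sharing the
	 * filter share the cache through the reference, no copying.
	 */
	DECLARE_BITMAP(allow_native, NR_syscalls);
#ifdef CONFIG_COMPAT
	DECLARE_BITMAP(allow_compat, NR_syscalls);
#endif
	/* ... notification fields etc. omitted ... */
};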
>  		/*
>  		 * Don't let an unprivileged task work around
>  		 * the no_new_privs restriction by creating
> @@ -661,6 +676,114 @@ seccomp_prepare_user_filter(const char __user *user_filter)
>  	return filter;
>  }
>
> +static inline bool sd_touched(pte_t *ptep)
> +{
> +	return !!pte_young(*(READ_ONCE(ptep)));
> +}

I think this is left over from the previous version and should've been
removed?

[...]

> +/*
> + * Walk everyone syscall combination for this arch/mask combo and update

nit: "Walk every possible", or something like that

> + * the bitmaps with any results.
> + */
> +static void seccomp_update_bitmap(struct seccomp_filter *filter,
> +				  void *pagepair, u32 arch, u32 mask,
> +				  struct seccomp_bitmaps *bitmaps)
[...]
> @@ -970,6 +1097,65 @@ static int seccomp_do_user_notification(int this_syscall,
>  	return -1;
>  }
>
> +#ifdef SECCOMP_ARCH
> +static inline bool __bypass_filter(struct seccomp_bitmaps *bitmaps,
> +				   u32 nr, u32 *filter_ret)
> +{
> +	if (nr < NR_syscalls) {
> +		if (test_bit(nr, bitmaps->allow)) {
> +			*filter_ret = SECCOMP_RET_ALLOW;
> +			return true;
> +		}
> +		if (test_bit(nr, bitmaps->kill_process)) {
> +			*filter_ret = SECCOMP_RET_KILL_PROCESS;
> +			return true;
> +		}
> +		if (test_bit(nr, bitmaps->kill_thread)) {
> +			*filter_ret = SECCOMP_RET_KILL_THREAD;
> +			return true;
> +		}

The checks against ->kill_process and ->kill_thread won't make anything
faster, but since they will run in the fastpath, they'll probably
actually contribute to making things *slower*.

> +	}
> +	return false;
> +}
> +
> +static inline u32 check_syscall(const struct seccomp_data *sd,
> +				struct seccomp_filter **match)
> +{
> +	u32 filter_ret = SECCOMP_RET_KILL_PROCESS;
> +	u8 arch = seccomp_get_arch(sd->arch, sd->nr);
> +
> +	switch (arch) {
> +	case SECCOMP_ARCH_IS_NATIVE:
> +		if (__bypass_filter(&current->seccomp.native, sd->nr, &filter_ret))
> +			return filter_ret;
> +		break;
> +#ifdef CONFIG_COMPAT
> +	case SECCOMP_ARCH_IS_COMPAT:
> +		if (__bypass_filter(&current->seccomp.compat, sd->nr, &filter_ret))
> +			return filter_ret;
> +		break;
> +#endif
> +#ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
> +	case SECCOMP_ARCH_IS_MULTIPLEX:
> +		if (__bypass_filter(&current->seccomp.multiplex, sd->nr, &filter_ret))
> +			return filter_ret;
> +		break;
> +#endif
> +	default:
> +		WARN_ON_ONCE(1);
> +		return filter_ret;
> +	};
> +
> +	return seccomp_run_filters(sd, match);
> +}

You could write this in a less repetitive way, and especially if we get
rid of the kill_* masks, also more compact:

static inline u32 check_syscall(const struct seccomp_data *sd,
				struct seccomp_filter **match)
{
	u8 arch = seccomp_get_arch(sd->arch, sd->nr);
	struct seccomp_bitmaps *bitmaps;

	switch (arch) {
	case SECCOMP_ARCH_IS_NATIVE:
		bitmaps = &current->seccomp.native;
		break;
#ifdef CONFIG_COMPAT
	case SECCOMP_ARCH_IS_COMPAT:
		bitmaps = &current->seccomp.compat;
		break;
#endif
#ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH
	case SECCOMP_ARCH_IS_MULTIPLEX:
		bitmaps = &current->seccomp.multiplex;
		break;
#endif
	default:
		WARN_ON_ONCE(1);
		return SECCOMP_RET_KILL_PROCESS;
	}

	if ((unsigned)sd->nr < NR_syscalls && test_bit(sd->nr, bitmaps->allow))
		return SECCOMP_RET_ALLOW;

	return seccomp_run_filters(sd, match);
}

[...]

> @@ -1625,12 +1812,24 @@ static long seccomp_set_mode_filter(unsigned int flags,
>  	    mutex_lock_killable(&current->signal->cred_guard_mutex))
>  		goto out_put_fd;
>
> +	/*
> +	 * This memory will be needed for bitmap testing, but we'll
> +	 * be holding a spinlock at that point. Do the allocation
> +	 * (and free) outside of the lock.
> +	 *
> +	 * Alternative: we could do the bitmap update before attach
> +	 * to avoid spending too much time under lock.
> +	 */
> +	pagepair = vzalloc(PAGE_SIZE * 2);
> +	if (!pagepair)
> +		goto out_put_fd;
> +
[...]
> -	ret = seccomp_attach_filter(flags, prepared);
> +	ret = seccomp_attach_filter(flags, prepared, pagepair);

You probably intended to rip this stuff back out? AFAIU the vzalloc()
stuff is a remnant from the old version that relied on MMU trickery.