Received: by 2002:a6b:500f:0:0:0:0:0 with SMTP id e15csp2380448iob; Sat, 30 Apr 2022 07:06:41 -0700 (PDT) X-Google-Smtp-Source: ABdhPJzGNd2OtQwaf9OdjrytZXh2v58khzjQvVTpol3bFzydwaXnFGKoZ0ylsW6kLk9Xd7XDGc8d X-Received: by 2002:a05:6512:2312:b0:472:5de2:ddf0 with SMTP id o18-20020a056512231200b004725de2ddf0mr564549lfu.134.1651327601685; Sat, 30 Apr 2022 07:06:41 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1651327601; cv=none; d=google.com; s=arc-20160816; b=YYe1plRNNpij3G/VHACrCOJ85AODEMwpnY0Aq965kQSXEVFaRNko8yqv5t9YmbQCEY RfXHlS6Kk5elhWuH/aix7iV2WPFmezlrbOizb7nnJC5ehTRTZjqRB22RNmcQk4h34n4U CW3vQcyXJv26ntlK8vXPI8dcp0aocIzcOa5dwtEydaSrnp8bHaNNmvxDxLH1vqrbjgYn 1kss16dnJlTbZvPbolXJOW1/ueFLSDTV2re3+9QSmTYABKcREHU18/VpGvbugTXZMaF8 3Nd58HaLT9dmlrkhyLcnXPyspyP+viMeJ+0fx6JUsY3VqZAivhRH+wDWf0rARmhCpshb 4Zog== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-disposition:mime-version :references:message-id:subject:cc:to:from:date:dkim-signature; bh=lqDxjXKGozL4JLJA55lbL4O+gqsTyufg1RUYSCEYXqk=; b=SVtaxm5tTiEXklMYpZtvnxU3pP5ArO16ZYRRfbT3XopovgvqSfkiNp2VFZseq0NQeA ofqc/hHmLd1zvGpLJVrulz/O1ESsWp9HgldrUGGt38MKulnxSYhQoJMNad+E7HTw8tyY XQo+JaTNyr8zSx+GotA+JuuxoMkzwaXRUbFeE/3NRiwZrimsfyALX7RNWKS/bzyH+ksq q2atXHgtnxdYVy4lVgiuQYHvv4FlEK7v4JocyoEoqKSgAC4vRHmHdMPGmzJGrhZkWAsP QKl+vyTEB7NMuq6G9yMwPdZQO5nObUyhaJlijmUt7VN2EUWABQ79BHCO1BdMg7zTU3rk v+qA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@chromium.org header.s=google header.b=BGOiyxCZ; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=chromium.org Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id y18-20020a2eb012000000b0024db9f258b4si8655618ljk.48.2022.04.30.07.06.15; Sat, 30 Apr 2022 07:06:41 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@chromium.org header.s=google header.b=BGOiyxCZ; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=chromium.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1379951AbiD2SZq (ORCPT + 99 others); Fri, 29 Apr 2022 14:25:46 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:42792 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1379993AbiD2SZp (ORCPT ); Fri, 29 Apr 2022 14:25:45 -0400 Received: from mail-pg1-x52c.google.com (mail-pg1-x52c.google.com [IPv6:2607:f8b0:4864:20::52c]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A8533D3716 for ; Fri, 29 Apr 2022 11:22:25 -0700 (PDT) Received: by mail-pg1-x52c.google.com with SMTP id q12so7084314pgj.13 for ; Fri, 29 Apr 2022 11:22:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=chromium.org; s=google; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to; bh=lqDxjXKGozL4JLJA55lbL4O+gqsTyufg1RUYSCEYXqk=; b=BGOiyxCZSMxAEyazPfI9DdXqgYukw98A9ZM4PRq0UYaSqweX5mgd5hWhstCt4s8iSp mSjaEhTH0AHvtkZq2mTLlyWRSO60Ge0eielBZg83IgVwATBsQ2t+gEAPLfL7lHAo+3qK pL7UDfR7dxORQ9EUt3Xje22xG5K8rzrqmd5Ag= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=lqDxjXKGozL4JLJA55lbL4O+gqsTyufg1RUYSCEYXqk=; b=AWvTbx1L+xoQWZPmzwv0+Kuvmul54ghsNoRY5VVSz8VyJU9HS1wwAMZXkU3e4GVB5M 4bwQBviNOw8yiZA7ig29fWMtDG+GuyU0RVNB+xEHkei3g7dKazMhyERulENyrqKRyGoS Lb6bucxfUSDvj03/BVbXN/VZiixWkaN7Oq7f34Oy9/e7NpE0BrRwV4qNp5HvMtvJnfVZ OgG2y47j+Wn4amjuhXj6NNXfC8FVOwZfwN4Rzz8B7FV2VKtYsGXAmYfMs3nU/3eYxGV9 ifPSqAMFfwCkhtueuxFHx1vECj9+Lf8XfAMXD3IKDZ2kX78yNoWcBTql8p8hn2ZH/ok/ 63pg== X-Gm-Message-State: AOAM530BFg3KZSVO2IqKREIcPpY9MlnAT0o6VC4rrfIYbvXMpi+cR/vV AzNpLpH62tfZeOxmFTymjhwKxg== X-Received: by 2002:a63:4384:0:b0:3ab:9d95:a30a with SMTP id q126-20020a634384000000b003ab9d95a30amr535803pga.68.1651256545095; Fri, 29 Apr 2022 11:22:25 -0700 (PDT) Received: from www.outflux.net (smtp.outflux.net. [198.145.64.163]) by smtp.gmail.com with ESMTPSA id e9-20020a635449000000b003c14af505edsm4040598pgm.5.2022.04.29.11.22.24 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 29 Apr 2022 11:22:24 -0700 (PDT) Date: Fri, 29 Apr 2022 11:22:23 -0700 From: Kees Cook To: Sargun Dhillon Cc: LKML , Linux Containers , Rodrigo Campos , Christian Brauner , Giuseppe Scrivano , Will Drewry , Andy Lutomirski , Alban Crequy Subject: Re: [PATCH v3 1/2] seccomp: Add wait_killable semantic to seccomp user notifier Message-ID: <202204291120.79624BD@keescook> References: <20220429023113.74993-1-sargun@sargun.me> <20220429023113.74993-2-sargun@sargun.me> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20220429023113.74993-2-sargun@sargun.me> X-Spam-Status: No, score=-2.7 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Apr 28, 2022 at 07:31:12PM -0700, Sargun Dhillon wrote: > This introduces a per-filter flag (SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV) > that makes it so that when notifications are received by the supervisor > the notifying process will transition to wait killable semantics. Although > wait killable isn't a set of semantics formally exposed to userspace, > the concept is searchable. If the notifying process is signaled prior > to the notification being received by the userspace agent, it will > be handled as normal. > > One quirk about how this is handled is that the notifying process > only switches to TASK_KILLABLE if it receives a wakeup from either > an addfd or a signal. This is to avoid an unnecessary wakeup of > the notifying task. > > Signed-off-by: Sargun Dhillon > --- > .../userspace-api/seccomp_filter.rst | 8 ++++ > include/linux/seccomp.h | 3 +- > include/uapi/linux/seccomp.h | 2 + > kernel/seccomp.c | 42 ++++++++++++++++++- > 4 files changed, 52 insertions(+), 3 deletions(-) > > diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst > index 539e9d4a4860..204cf5ba511a 100644 > --- a/Documentation/userspace-api/seccomp_filter.rst > +++ b/Documentation/userspace-api/seccomp_filter.rst > @@ -271,6 +271,14 @@ notifying process it will be replaced. The supervisor can also add an FD, and > respond atomically by using the ``SECCOMP_ADDFD_FLAG_SEND`` flag and the return > value will be the injected file descriptor number. > > +The notifying process can be preempted, resulting in the notification being > +aborted. This can be problematic when trying to take actions on behalf of the > +notifying process that are long-running and typically retryable (mounting a > +filesytem). Alternatively, the at filter installation time, the typo: "the at" -> "at" > +``SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV`` flag can be set. This flag makes it > +such that when a user notification is received by the supervisor, the notifying > +process will ignore non-fatal signals until the response is sent. > + > It is worth noting that ``struct seccomp_data`` contains the values of register > arguments to the syscall, but does not contain pointers to memory. The task's > memory is accessible to suitably privileged traces via ``ptrace()`` or > diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h > index 0c564e5d40ff..d31d76be4982 100644 > --- a/include/linux/seccomp.h > +++ b/include/linux/seccomp.h > @@ -8,7 +8,8 @@ > SECCOMP_FILTER_FLAG_LOG | \ > SECCOMP_FILTER_FLAG_SPEC_ALLOW | \ > SECCOMP_FILTER_FLAG_NEW_LISTENER | \ > - SECCOMP_FILTER_FLAG_TSYNC_ESRCH) > + SECCOMP_FILTER_FLAG_TSYNC_ESRCH | \ > + SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV) > > /* sizeof() the first published struct seccomp_notif_addfd */ > #define SECCOMP_NOTIFY_ADDFD_SIZE_VER0 24 > diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h > index 78074254ab98..0fdc6ef02b94 100644 > --- a/include/uapi/linux/seccomp.h > +++ b/include/uapi/linux/seccomp.h > @@ -23,6 +23,8 @@ > #define SECCOMP_FILTER_FLAG_SPEC_ALLOW (1UL << 2) > #define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3) > #define SECCOMP_FILTER_FLAG_TSYNC_ESRCH (1UL << 4) > +/* Received notifications wait in killable state (only respond to fatal signals) */ > +#define SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (1UL << 5) > > /* > * All BPF programs must return a 32-bit value. > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > index db10e73d06e0..9291b0843cb2 100644 > --- a/kernel/seccomp.c > +++ b/kernel/seccomp.c > @@ -201,6 +201,8 @@ static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter) > * the filter can be freed. > * @cache: cache of arch/syscall mappings to actions > * @log: true if all actions except for SECCOMP_RET_ALLOW should be logged > + * @wait_killable_recv: Put notifying process in killable state once the > + * notification is received by the userspace listener. > * @prev: points to a previously installed, or inherited, filter > * @prog: the BPF program to evaluate > * @notif: the struct that holds all notification related information > @@ -221,6 +223,7 @@ struct seccomp_filter { > refcount_t refs; > refcount_t users; > bool log; > + bool wait_killable_recv; > struct action_cache cache; > struct seccomp_filter *prev; > struct bpf_prog *prog; > @@ -894,6 +897,10 @@ static long seccomp_attach_filter(unsigned int flags, > if (flags & SECCOMP_FILTER_FLAG_LOG) > filter->log = true; > > + /* Set wait killable flag, if present. */ > + if (flags & SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV) > + filter->wait_killable_recv = true; > + > /* > * If there is an existing filter, make it the prev and don't drop its > * task reference. > @@ -1081,6 +1088,12 @@ static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd, struct seccomp_kn > complete(&addfd->completion); > } > > +static bool should_sleep_killable(struct seccomp_filter *match, > + struct seccomp_knotif *n) > +{ > + return match->wait_killable_recv && n->state == SECCOMP_NOTIFY_SENT; > +} > + > static int seccomp_do_user_notification(int this_syscall, > struct seccomp_filter *match, > const struct seccomp_data *sd) > @@ -1111,11 +1124,25 @@ static int seccomp_do_user_notification(int this_syscall, > * This is where we wait for a reply from userspace. > */ > do { > + bool wait_killable = should_sleep_killable(match, &n); > + > mutex_unlock(&match->notify_lock); > - err = wait_for_completion_interruptible(&n.ready); > + if (wait_killable) > + err = wait_for_completion_killable(&n.ready); > + else > + err = wait_for_completion_interruptible(&n.ready); > mutex_lock(&match->notify_lock); > - if (err != 0) > + > + if (err != 0) { > + /* > + * Check to see if the notifcation got picked up and > + * whether we should switch to wait killable. > + */ > + if (!wait_killable && should_sleep_killable(match, &n)) > + continue; > + > goto interrupted; > + } > > addfd = list_first_entry_or_null(&n.addfd, > struct seccomp_kaddfd, list); > @@ -1485,6 +1512,9 @@ static long seccomp_notify_recv(struct seccomp_filter *filter, > mutex_lock(&filter->notify_lock); > knotif = find_notification(filter, unotif.id); > if (knotif) { > + /* Reset the process to make sure it's not stuck */ > + if (should_sleep_killable(filter, knotif)) > + complete(&knotif->ready); > knotif->state = SECCOMP_NOTIFY_INIT; > up(&filter->notif->request); > } > @@ -1830,6 +1860,14 @@ static long seccomp_set_mode_filter(unsigned int flags, > ((flags & SECCOMP_FILTER_FLAG_TSYNC_ESRCH) == 0)) > return -EINVAL; > > + /* > + * The SECCOMP_FILTER_FLAG_WAIT_KILLABLE_SENT flag doesn't make sense > + * without the SECCOMP_FILTER_FLAG_NEW_LISTENER flag. > + */ > + if ((flags & SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV) && > + ((flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) == 0)) > + return -EINVAL; > + > /* Prepare the new filter before holding any locks. */ > prepared = seccomp_prepare_user_filter(filter); > if (IS_ERR(prepared)) Otherwise, looks good. Thanks for bringing this back up! -- Kees Cook