Date: Fri, 29 Apr 2022 17:14:37 +0000
From: Sargun Dhillon
To: Rodrigo Campos
Cc: Kees Cook, LKML, Linux Containers, Christian Brauner, Giuseppe Scrivano, Will Drewry, Andy Lutomirski, Alban Crequy
Subject: Re: [PATCH v3 1/2] seccomp: Add wait_killable semantic to seccomp user notifier
Message-ID: <20220429171437.GA1267404@ircssh-3.c.rugged-nimbus-611.internal>
References: <20220429023113.74993-1-sargun@sargun.me> <20220429023113.74993-2-sargun@sargun.me>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Apr 29, 2022 at 11:42:15AM +0200, Rodrigo Campos wrote:
> On Fri, Apr 29, 2022 at 4:32 AM Sargun Dhillon wrote:
> > the concept is searchable. If the notifying process is signaled prior
> > to the notification being received by the userspace agent, it will
> > be handled as normal.
>
> Why is that? Why not always handle in the same way (if wait killable
> is set, wait like that)
>
The goal is to avoid two things:
1. Unnecessary work - We often see workloads that implement techniques
   like hedging (also known as request racing[1]). In fact, RFC3484
   (destination address selection) gets implemented such that the DNS
   library connects to many backend addresses and whichever one comes
   back first "wins".
2. Side effects - We don't want a situation where a syscall that is
   non-trivial to roll back (mount) is in progress, and from user
   space's perspective this syscall never completed.

Blocking before the syscall even starts is excessive. When we looked at
this, we found that runtimes like Golang can get into a bad situation
if they have many (1000s of) threads in the middle of a syscall,
because all of them need to yield prior to GC. In this case the runtime
prioritizes the liveness of GC over the syscalls.

That being said, there may be some syscalls in a filter that need the
suggested behaviour. I can imagine introducing a new flag
(say SECCOMP_FILTER_FLAG_WAIT_KILLABLE) that applies to all states.
Alternatively, in one implementation, I put the behaviour in the data
field of the return from the BPF filter.
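
To make the "data field" alternative a bit more concrete, here is a rough
sketch of how a per-syscall request could ride along in the low 16 bits of
the filter's return value. This is only an illustration, not what the
patch does: NOTIF_DATA_WAIT_KILLABLE and both helper functions are
hypothetical names made up for this example. Only the three SECCOMP_RET_*
masks are real, and their values are copied from
include/uapi/linux/seccomp.h so the snippet builds on its own.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values copied from include/uapi/linux/seccomp.h. */
#define SECCOMP_RET_USER_NOTIF  0x7fc00000U
#define SECCOMP_RET_ACTION_FULL 0xffff0000U
#define SECCOMP_RET_DATA        0x0000ffffU

/* HYPOTHETICAL: bit 0 of the data field asks for a killable wait. */
#define NOTIF_DATA_WAIT_KILLABLE 0x0001U

/* What a filter would return for a syscall it hands to the supervisor. */
static uint32_t make_user_notif_ret(bool wait_killable)
{
	uint32_t data = wait_killable ? NOTIF_DATA_WAIT_KILLABLE : 0;

	return SECCOMP_RET_USER_NOTIF | (data & SECCOMP_RET_DATA);
}

/* How the receiving side could decode that request again. */
static bool ret_wants_killable_wait(uint32_t filter_ret)
{
	/* Only user-notification returns carry this meaning. */
	if ((filter_ret & SECCOMP_RET_ACTION_FULL) != SECCOMP_RET_USER_NOTIF)
		return false;

	return ((filter_ret & SECCOMP_RET_DATA) & NOTIF_DATA_WAIT_KILLABLE) != 0;
}
```

The upside of this shape is that the choice can differ per syscall within
one filter, whereas a filter flag applies to every notification the
filter generates.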

> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > index db10e73d06e0..9291b0843cb2 100644
> > --- a/kernel/seccomp.c
> > +++ b/kernel/seccomp.c
> > @@ -1081,6 +1088,12 @@ static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd, struct seccomp_kn
> >       complete(&addfd->completion);
> >  }
> >
> > +static bool should_sleep_killable(struct seccomp_filter *match,
> > +                                  struct seccomp_knotif *n)
> > +{
> > +     return match->wait_killable_recv && n->state == SECCOMP_NOTIFY_SENT;
>
> Here for some reason we check the notification state to be SENT.
>
Because we don't want to block unless the notification has been received
by userspace.

> > +}
> > +
> >  static int seccomp_do_user_notification(int this_syscall,
> >                                          struct seccomp_filter *match,
> >                                          const struct seccomp_data *sd)
> > @@ -1111,11 +1124,25 @@ static int seccomp_do_user_notification(int this_syscall,
> >        * This is where we wait for a reply from userspace.
> >        */
> >       do {
> > +             bool wait_killable = should_sleep_killable(match, &n);
> > +
>
> So here, the first time this runs this will be false even if the
> wait_killable flag was used in the filter (because that function
> checks the notification state to be sent, that is not true the first
> time)
>
> Why not just do wait_for_completion_killable if match->wait_killable
> and wait_for_completion_interruptible otherwise? Am I missing
> something?

Again, this is to allow the notifying task to be interrupted before the
supervisor has received the notification.

>
>
>
> Best,
> Rodrigo

[1]: https://research.google/pubs/pub40801/
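
P.S. For anyone skimming the thread, the wait-mode choice argued for above
can be modelled in plain userspace C roughly as follows. This is a
simplified sketch and not the kernel code: the enum and struct names are
stand-ins invented for the example, and only the condition inside
pick_wait_mode() mirrors should_sleep_killable() from the quoted diff.

```c
#include <stdbool.h>

/* Stand-ins for the kernel's notification states. */
enum notify_state { NOTIFY_INIT, NOTIFY_SENT, NOTIFY_REPLIED };

/* Which kind of sleep the notifying task should use. */
enum wait_mode { WAIT_INTERRUPTIBLE, WAIT_KILLABLE };

/* Minimal models of the filter and the in-flight notification. */
struct filter_model { bool wait_killable_recv; };
struct notif_model { enum notify_state state; };

/*
 * Same condition as should_sleep_killable() in the quoted patch:
 * only ignore non-fatal signals once the supervisor has actually
 * received the notification (state == SENT).
 */
static enum wait_mode pick_wait_mode(const struct filter_model *f,
				     const struct notif_model *n)
{
	if (f->wait_killable_recv && n->state == NOTIFY_SENT)
		return WAIT_KILLABLE;

	return WAIT_INTERRUPTIBLE;
}
```

Before the supervisor picks the notification up, the state is still INIT,
so a hedged or raced request can be cancelled by an ordinary signal. Once
the state is SENT, the supervisor may already be in the middle of
something like mount, and only a fatal signal interrupts the wait.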