Received: by 2002:a05:6a10:a841:0:0:0:0 with SMTP id d1csp4888865pxy; Tue, 27 Apr 2021 15:13:00 -0700 (PDT) X-Google-Smtp-Source: ABdhPJy6yg7OPDkiMnhOk/EvqhTp4Tib3B5ps/Dq/7q3u6aW5YO9MUbRvP+3A+pVzFKCdVXdbPKC X-Received: by 2002:a63:d552:: with SMTP id v18mr24015857pgi.313.1619561579813; Tue, 27 Apr 2021 15:12:59 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1619561579; cv=none; d=google.com; s=arc-20160816; b=BT3hkpHGAhNQyNqLBASgDRfGfZAcgPotSBU4JTOrmc4gVMq5lptoEBFTb5RTroM8ak MiFoeleDFtxIGJfN06NXTQfADNyRwIiIphah5/at52ZlLyjmEDRru2a8jsp9do5cIqPi TvscCC+qNITPrXTPX/llnrP007dFIhCJzztUpFwFQAhrl2bVM72kl8fcAqauYOU0+QT1 U0+uKQDv/EhK1utRbCGaf9/CdfRnGOTr+KGQuMt4650SKiOqjUzBIMiGfZYnnGyJVdmb jpQS30otljOuKs4zVDUmi0bdOmjzd5KOlfOx2ByNQh5Xy6jzo1JL+sXdr84IwTgQojRl BFYA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:user-agent:in-reply-to:content-disposition :mime-version:references:message-id:subject:cc:to:from:date :dkim-signature; bh=J3OZQuMMYXhVRdbAGQKUg7Wra51AHbtP59/ycR6cv1A=; b=RAFdQA4AZjaaNmErXNB02DOah0mjoWs0xrww1CJVc0FCCONLIcwPKE02fzwQtV6v0L hhTcLyOrloz5NDEE34fs8NH7lGg7/CDO+ORV8DkAaIKS73owFWMnVleFlhnGzYgDbvOr ET0lY0BBi35bWM8DQpEIBVVwmMoJrjNkyobB9niL2L5oxjXGR1W5Dkqb8XHOHDjgvxuv EmqTg+t5DPSY1Sl7QvjJp2ZXtn9GOD3G+xB3/IngPn6VWVocKFPFDZJYodvdsUNvS6/n 5EM7ZaKm/od2C7rE0hCtsU3YlP3jZJgeenDXdSvWAo8rNLoqNi8No1eZsqtEklvi+lUL 9Q6g== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@sargun.me header.s=google header.b=oLMo6M7e; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id a24si5080567pfn.279.2021.04.27.15.12.36; Tue, 27 Apr 2021 15:12:59 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@sargun.me header.s=google header.b=oLMo6M7e; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235764AbhD0WLT (ORCPT + 99 others); Tue, 27 Apr 2021 18:11:19 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44458 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235382AbhD0WLR (ORCPT ); Tue, 27 Apr 2021 18:11:17 -0400 Received: from mail-il1-x131.google.com (mail-il1-x131.google.com [IPv6:2607:f8b0:4864:20::131]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8E24AC061574 for ; Tue, 27 Apr 2021 15:10:32 -0700 (PDT) Received: by mail-il1-x131.google.com with SMTP id p15so15018777iln.3 for ; Tue, 27 Apr 2021 15:10:32 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sargun.me; s=google; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to:user-agent; bh=J3OZQuMMYXhVRdbAGQKUg7Wra51AHbtP59/ycR6cv1A=; b=oLMo6M7eRBG1HrCjd7sWkyEU1zNn7c1dNI2nSNyG7pjWl3w068pOFMPhpW/AcnTaPv tCdMb6rULM+RIohelzjfDLq+l6ZbzrKLyBWjTm5BbpCT+ODZhmB/LKNRZDXI0vUJQYE3 sV/EQ7yolgixvjTxAhI1tFHPkXqX3J3uqn6vQ= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to:user-agent; bh=J3OZQuMMYXhVRdbAGQKUg7Wra51AHbtP59/ycR6cv1A=; b=LdC4OcPRg5EyrDU0dJYISN2cDwU6anGM4pGMJySk4IG/pNASqz4g/tGg3RaXFEysHU kVCCsp0zwmM4d9M9oDtSdeXp77IrHYUa3z9cYqbV4AnPpi/rIORuKtLgHcxgU4pQPtvy rxgt1fVpQ//UyBuCTAQW/XAC9T1j3xS9x4q+Szpw5k8eur59/YuT+GkFc4wjqiGc/eCn Bmu27YVy1yOCS52r6QijpOkxEP+ZPb+ytxjm57FnqQ4mzUGioVSs4wkC1Z1LBVzOmgLw j29nnis7V0lNsmOwhclchEu8VmSikpDEaf4NRSMiMEx04isjF3uyNfT5ykfBLzNofXLa 0pVw== X-Gm-Message-State: AOAM533PUBpI3V75RDuERpu1mNwNYwhfKdIW7mgQRRtTRlDfD8VyoUqo Q5VgFKLTPHp0DdCl7GDtxKJhXg== X-Received: by 2002:a92:c0c2:: with SMTP id t2mr20188209ilf.210.1619561431634; Tue, 27 Apr 2021 15:10:31 -0700 (PDT) Received: from ircssh-2.c.rugged-nimbus-611.internal (80.60.198.104.bc.googleusercontent.com. [104.198.60.80]) by smtp.gmail.com with ESMTPSA id n22sm528725ioc.5.2021.04.27.15.10.30 (version=TLS1_2 cipher=ECDHE-ECDSA-CHACHA20-POLY1305 bits=256/256); Tue, 27 Apr 2021 15:10:30 -0700 (PDT) Date: Tue, 27 Apr 2021 22:10:29 +0000 From: Sargun Dhillon To: Tycho Andersen Cc: Andy Lutomirski , Kees Cook , LKML , Linux Containers , Rodrigo Campos , Christian Brauner , Mauricio =?iso-8859-1?Q?V=E1squez?= Bernal , Giuseppe Scrivano , Will Drewry , Alban Crequy Subject: Re: [PATCH RESEND 2/5] seccomp: Add wait_killable semantic to seccomp user notifier Message-ID: <20210427221028.GA16602@ircssh-2.c.rugged-nimbus-611.internal> References: <20210426180610.2363-1-sargun@sargun.me> <20210426180610.2363-3-sargun@sargun.me> <20210426190229.GB1605795@cisco> <20210426221527.GA30835@ircssh-2.c.rugged-nimbus-611.internal> <20210427134853.GA1746081@cisco> <20210427170753.GA1786245@cisco> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20210427170753.GA1786245@cisco> User-Agent: Mutt/1.9.4 (2018-02-28) Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Apr 27, 2021 at 11:07:53AM -0600, Tycho Andersen wrote: > On Tue, Apr 27, 2021 at 09:23:42AM -0700, Andy Lutomirski wrote: > > On Tue, Apr 27, 2021 at 6:48 AM Tycho Andersen wrote: > > > > > > On Mon, Apr 26, 2021 at 10:15:28PM +0000, Sargun Dhillon wrote: > > > > On Mon, Apr 26, 2021 at 01:02:29PM -0600, Tycho Andersen wrote: > > > > > On Mon, Apr 26, 2021 at 11:06:07AM -0700, Sargun Dhillon wrote: > > > > > > @@ -1103,11 +1111,31 @@ static int seccomp_do_user_notification(int this_syscall, > > > > > > * This is where we wait for a reply from userspace. > > > > > > */ > > > > > > do { > > > > > > + interruptible = notification_interruptible(&n); > > > > > > + > > > > > > mutex_unlock(&match->notify_lock); > > > > > > - err = wait_for_completion_interruptible(&n.ready); > > > > > > + if (interruptible) > > > > > > + err = wait_for_completion_interruptible(&n.ready); > > > > > > + else > > > > > > + err = wait_for_completion_killable(&n.ready); > > > > > > mutex_lock(&match->notify_lock); > > > > > > - if (err != 0) > > > > > > + > > > > > > + if (err != 0) { > > > > > > + /* > > > > > > + * There is a race condition here where if the > > > > > > + * notification was received with the > > > > > > + * SECCOMP_USER_NOTIF_FLAG_WAIT_KILLABLE flag, but a > > > > > > + * non-fatal signal was received before we could > > > > > > + * transition we could erroneously end our wait early. > > > > > > + * > > > > > > + * The next wait for completion will ensure the signal > > > > > > + * was not fatal. > > > > > > + */ > > > > > > + if (interruptible && !notification_interruptible(&n)) > > > > > > + continue; > > > > > > > > > > I'm trying to understand how one would hit this race, > > > > > > > > > > > > > I'm thinking: > > > > P: Process that "generates" notification > > > > S: Supervisor > > > > U: User > > > > > > > > P: Generated notification > > > > S: ioctl(RECV...) // With wait_killable flag. > > > > ...complete is called in the supervisor, but the P may not be woken up... > > > > U: kill -SIGTERM $P > > > > ...signal gets delivered to p and causes wakeup and > > > > wait_for_completion_interruptible returns 1... > > > > > > > > Then you need to check the race > > > > > > I see, thanks. This seems like a consequence of having the flag be > > > per-RECV-call vs. per-filter. Seems like it might be simpler to have > > > it be per-filter? > > > I agree that it is hard / impossible to guarantee correctness *after* the fact. > > > > Backing up a minute, how is the current behavior not a serious > > correctness issue? I can think of two scenarios that seem entirely > > broken right now: > > > > 1. Process makes a syscall that is not permitted to return -EINTR. It > > gets a signal and returns -EINTR when user notifiers are in use. > > Yes, there's a whole host of problems here. Things like fsmount should not be interruptible. > > 2. Process makes a syscall that is permitted to return -EINTR. But > > -EINTR for IO means "I got interrupted and *did not do the IO*". > > Nevertheless, the syscall returns -EINTR and the IO is done. In general, I think that the idea is to do as little side-effect I/O as possible. The use cases we've looked at all have nice ways to unwind them (perf_event_open, BPF, accept), but others are less good for unwinding (mount). There are some middle ground calls like connect, but they're less bad. > > > > ISTM the current behavior is severely broken, and the new behavior > > isn't *that* much better since it simply ignores signals and can't > > emulate -EINTR (or all the various restart modes, sigh). Surely the > > right behavior is to have the seccomped process notice that it got a > > signal and inform the monitor of that fact so that the monitor can > > take appropriate action. > > This doesn't help your case (2) though, since the IO could be done > before the supervisor gets the notification. > I think for something like mount, if it fails (gets interrupted) via a fatal signal, that's grounds for terminating the container. > > IOW, I don't think that the current behavior *or* the patched opt-in > > behavior is great. I think we would do better to have the filter > > indicate that it is signal-aware and to document that non-signal-aware > > filters cannot behave correctly with respect to signals. > > I think it would be hard to make a signal-aware filter, it really does > feel like the only thing to do is a killable wait. > > Tycho There are plenty of scenarios where the syscall can be handled in an interruptible fashion. I like to use accept as an example. I think Jann Horn had put together a patchset on how the supervisor could be notified (as opposed to background polling). If the call is interrupted, you can just "finish" the accept on restart of the sycall by handing the FD over. I see a handful of paths forward: * We add a new action USER_NOTIF_KILLABLE which requires a fatal signal in order to be interrupted * We add a chunk of data to the USER_NOTIF return code (say, WAIT_KILLABLE) from the BPF filter that indicates what kind of wait should happen * (what is happening now) An ioctl flag to say pickup the notification and put it into a wait_killable state * An ioctl "command" that puts an existing notifcation in progress into the wait killable state.