Received: by 2002:a25:ad19:0:0:0:0:0 with SMTP id y25csp1353411ybi; Wed, 17 Jul 2019 13:48:40 -0700 (PDT) X-Google-Smtp-Source: APXvYqye37AqdbXt67JPSke6Ybfm0Go86HrIrviH+bpVKnK5Mx9qSkaWsdKeSDU7h60z6I+N91H0 X-Received: by 2002:a63:e610:: with SMTP id g16mr42270858pgh.392.1563396520570; Wed, 17 Jul 2019 13:48:40 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1563396520; cv=none; d=google.com; s=arc-20160816; b=N2SZkWVKCAQiMDlnLvicZvPINwymlhrjfjyR1XV8RxINkPZNGF9juWnYxb9hkNw7bX IbACoaZi/FiwuxqkucHibrXbBe0UDNQNyH1Q62HojJCKawAankiIVeSySe0S0cCfFShh /i305Y7fcuoCAVxH8eNpWeJKByqEweA1b8hCMHnkddYJ+neS34rCy1wLI2ylPvzTkHwZ Q1QtDHB61+7ZDryzjFaT5XB+wCETegybOGb5Y3F9gh/CeYOEWnSkrvGZrNxql/33Ob6k gmRwKetJJbLsNuSDNtny3CTL3tdTCVMYt5CcXv8QK4mgUX3nX4AmkgPtw7to1LKCS+g9 bl7Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:dkim-signature; bh=YIvu+m6OtJySXdmyyJx4x4Oj6+ax+DbpP8RQfikqUdA=; b=I9h9F2HdbyaYxM4njFsI39wDLkGrHbRQGGOn7nxmNjZUpntI8/gWY0uPvIVNhuamrU vKZdAzaofxE8CW4hmwWeR5upxo0RNKw3W49CRSRWZsJfvBeOI9FtydszlLYK5yxevj4o Fg5xD52JZhPMPrT1o6uPwSCtkH8wXAWQtZ6l3MmhDJCoI1NzUns6Ea4Ndyf2uqMUgT1y HqE8TxKF8IOSXbJmlUn6TcSfKllUuco1rRDw57Rehdwh5UF2wvZhw8v7zo7qelHxdbJG Ft3JyiNuWe1Yat9sZLJQHWQ/MCwr/HZdkrIEXQcbQg/iFGovdwMHvuNMbhxgBgBq9U4r HTJA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@joelfernandes.org header.s=google header.b=G2c6z1BI; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id t25si24908979pgk.442.2019.07.17.13.48.21; Wed, 17 Jul 2019 13:48:40 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@joelfernandes.org header.s=google header.b=G2c6z1BI; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727382AbfGQUsC (ORCPT + 99 others); Wed, 17 Jul 2019 16:48:02 -0400 Received: from mail-pf1-f196.google.com ([209.85.210.196]:42553 "EHLO mail-pf1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726063AbfGQUsB (ORCPT ); Wed, 17 Jul 2019 16:48:01 -0400 Received: by mail-pf1-f196.google.com with SMTP id q10so11416076pff.9 for ; Wed, 17 Jul 2019 13:48:01 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=joelfernandes.org; s=google; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to:user-agent; bh=YIvu+m6OtJySXdmyyJx4x4Oj6+ax+DbpP8RQfikqUdA=; b=G2c6z1BI21qYjpUqgk7+Rn6/i8O16nVnLSNOzP3LPAWurG2ntXX0Fg1085fMf+NtlY vfB5F+t0VuBQE6o6YGA1tz1Hpb18KAGfFehiUiw+5licyggM6884JwJDMjWWmtk8jqgk e4syfpSHQfxXdtJ6YCnA+Wkh5oHWSXVpMK2Yk= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to:user-agent; bh=YIvu+m6OtJySXdmyyJx4x4Oj6+ax+DbpP8RQfikqUdA=; b=DF4vh1ho6ZPLXzToUvUxt/GeQl9nYvUJpZc42sTLV97B0mdYm+GGswuvMCGGMLuIFJ J1Fv/bplqkBO31/EAKw77Mgwp6NJXTrAxnG8IOTcxgh0O8CUd8K2odnDGzDVqv6/jkU1 hcLVdpOMrkn1JlYKTPasAwLl1lh+MFOHpz7h0d5qLQZCpWM8byOvQ4pQKvG1B5Xb74mh /vmUXGy23LPeFtNhoRdG/3GEhOe/C2QQ0LSFjDNuJmVA+HE05QpG6dNJsCn0DIBHtjNN 2hz8J5DursMXkBICPpiPERaUbFEO6HSNTHvXQOM7zj0szRTwHgDrRSoHK9i8fz7oz4Ie wcGQ== X-Gm-Message-State: APjAAAXtP190Iskzs/aXynGD3a+K8PYvOuHrl1ABh5x/ghRTPUVI3yPH QGy1sF+pmziCjXu4rr2saRo= X-Received: by 2002:a63:4104:: with SMTP id o4mr44838239pga.345.1563396480804; Wed, 17 Jul 2019 13:48:00 -0700 (PDT) Received: from localhost ([2620:15c:6:12:9c46:e0da:efbf:69cc]) by smtp.gmail.com with ESMTPSA id a21sm29505988pfi.27.2019.07.17.13.47.59 (version=TLS1_3 cipher=AEAD-AES256-GCM-SHA384 bits=256/256); Wed, 17 Jul 2019 13:47:59 -0700 (PDT) Date: Wed, 17 Jul 2019 16:47:58 -0400 From: Joel Fernandes To: Suren Baghdasaryan Cc: Christian Brauner , LKML , kernel-team , Andrea Arcangeli , Andrew Morton , "Eric W. Biederman" , Oleg Nesterov , Tejun Heo , Jann Horn Subject: Re: [PATCH RFC v1] pidfd: fix a race in setting exit_state for pidfd polling Message-ID: <20190717204758.GB72146@google.com> References: <20190717172100.261204-1-joel@joelfernandes.org> <20190717175556.axe2pne7lcrkmtzr@brauner.io> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.10.1 (2018-07-13) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jul 17, 2019 at 11:09:59AM -0700, Suren Baghdasaryan wrote: > On Wed, Jul 17, 2019 at 10:56 AM Christian Brauner wrote: > > > > On Wed, Jul 17, 2019 at 01:21:00PM -0400, Joel Fernandes wrote: > > > From: Suren Baghdasaryan > > > > > > There is a race between reading task->exit_state in pidfd_poll and writing > > > it after do_notify_parent calls do_notify_pidfd. Expected sequence of > > > events is: > > > > > > CPU 0 CPU 1 > > > ------------------------------------------------ > > > exit_notify > > > do_notify_parent > > > do_notify_pidfd > > > tsk->exit_state = EXIT_DEAD > > > pidfd_poll > > > if (tsk->exit_state) > > > > > > However nothing prevents the following sequence: > > > > > > CPU 0 CPU 1 > > > ------------------------------------------------ > > > exit_notify > > > do_notify_parent > > > do_notify_pidfd > > > pidfd_poll > > > if (tsk->exit_state) > > > tsk->exit_state = EXIT_DEAD > > > > > > This causes a polling task to wait forever, since poll blocks because > > > exit_state is 0 and the waiting task is not notified again. A stress > > > test continuously doing pidfd poll and process exits uncovered this bug, > > > and the below patch fixes it. > > > > > > To fix this, we set tsk->exit_state before calling do_notify_pidfd. > > > > > > Cc: kernel-team@android.com > > > Signed-off-by: Suren Baghdasaryan > > > Signed-off-by: Joel Fernandes (Google) > > > > That means in such a situation other users will see EXIT_ZOMBIE where > > they didn't see that before until after the parent failed to get > > notified. > > I'm a little nervous about that myself even though in my stress > testing this worked fine. I think the safest change would be to move > do_notify_pidfd() out of do_notify_parent() and call it after > tsk->exit_state is finalized. The downside of that is that there are 4 My initial approach to pidfd polling did it this way, and I remember there was a break in semantics where this does not work well. We want the notification to happen in do_notify_parent() so that it is in sync with the wait APIs.. I don't see a risk with this patch though. But let us see what Oleg's eyes find. > places we call do_notify_parent(), so instead of calling > do_notify_pidfd() one time from do_notify_parent() we will be calling > it 4 times now. > > Also my original patch had memory barriers to ensure correct ordering > of tsk->exit_state writes before reads. In this final version Joel > removed them, so I suppose he found out they are not needed. Joel, > please clarify. The barriers were initially add by me to your patch, but then I felt it may not be needed so I removed them before sending the patch. My initial concern was something like the following: CPU 0 CPU 1 ------------------------------------------------ store tsk->exit_state = 1 /* smp_wmb() ? */ do_notify_parent wake up poll_wait() /* smp_rmb(); ? */ read tsk->exit_state = 0 block... I was initially concerned if tsk->exit_state write would be missed by the polling thread and we would block forever (similar to this bug). I don't think this is possible anymore since wakeup implies release-barrier and waiting implies acquire barrier AFAIU. I am still not fully sure though, so yeah if you guys think it is an issue, let us add the memory barriers. As such I know memory barrier additions to the kernel requires justification, otherwise Linus calls it "Voodoo programming". So let us convince ourself first if memory barriers are needed before adding them anyway. thanks, - Joel > Thanks! > > > That's a rather subtle internal change. I was worried about > > __ptrace_detach() since it explicitly checks for EXIT_ZOMBIE but it > > seems to me that this is fine since we hold write_lock_irq(&tasklist_lock); > > at the point when we do set p->exit_signal. > > > > Acked-by: Christian Brauner > > > > Once Oleg confirms that I'm right not to worty I'll pick this up. > > > > Thanks! > > Christian > > > > > > > > --- > > > kernel/exit.c | 8 +++++--- > > > 1 file changed, 5 insertions(+), 3 deletions(-) > > > > > > diff --git a/kernel/exit.c b/kernel/exit.c > > > index a75b6a7f458a..740ceacb4b76 100644 > > > --- a/kernel/exit.c > > > +++ b/kernel/exit.c > > > @@ -720,6 +720,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead) > > > if (group_dead) > > > kill_orphaned_pgrp(tsk->group_leader, NULL); > > > > > > + tsk->exit_state = EXIT_ZOMBIE; > > > if (unlikely(tsk->ptrace)) { > > > int sig = thread_group_leader(tsk) && > > > thread_group_empty(tsk) && > > > @@ -1156,10 +1157,11 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p) > > > ptrace_unlink(p); > > > > > > /* If parent wants a zombie, don't release it now */ > > > - state = EXIT_ZOMBIE; > > > + p->exit_state = EXIT_ZOMBIE; > > > if (do_notify_parent(p, p->exit_signal)) > > > - state = EXIT_DEAD; > > > - p->exit_state = state; > > > + p->exit_state = EXIT_DEAD; > > > + > > > + state = p->exit_state; > > > write_unlock_irq(&tasklist_lock); > > > } > > > if (state == EXIT_DEAD) > > > -- > > > 2.22.0.657.g960e92d24f-goog > > >