Date: Wed, 6 Nov 2019 14:38:38 +0100 (CET)
From: Thomas Gleixner
To: Oleg Nesterov
cc: Florian Weimer, Shawn Landden, libc-alpha@sourceware.org,
    linux-api@vger.kernel.org, LKML, Arnd Bergmann, Deepa Dinamani,
    Andrew Morton, Catalin Marinas, Keith Packard, Peter Zijlstra
Subject: Re: handle_exit_race && PF_EXITING
In-Reply-To: <20191106121111.GC12575@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, 6 Nov 2019, Oleg Nesterov wrote:
> But why we can not rely on handle_exit_race() without PF_EXITING check?
>
> Yes, exit_robust_list() is called before exit_pi_state_list() which (with
> this patch) sets PF_EXITPIDONE. But this is fine? handle_futex_death()
> doesn't wakeup pi futexes, exit_pi_state_list() does this.

I know. You still create inconsistent state because of this:

>       raw_spin_lock_irq(&p->pi_lock);
> -     if (unlikely(p->flags & PF_EXITING)) {
> +     if (unlikely(p->flags & PF_EXITPIDONE)) {
>               /*
>                * The task is on the way out. When PF_EXITPIDONE is
>                * set, we know that the task has finished the
>                * cleanup:
>                */
> -             int ret = handle_exit_race(uaddr, uval, p);
> +             int ret = handle_exit_race(uaddr, uval);
>
>               raw_spin_unlock_irq(&p->pi_lock);
>               put_task_struct(p);

Same explanation as before, just not prose this time:

    exit()                              lock_pi(futex2)
      exit_pi_state_list()
        lock(tsk->pi_lock)
        tsk->flags |= PF_EXITPIDONE;      attach_to_pi_owner()
                                            ...
        // Loop unrolled for clarity
        while(!list_empty())                lock(tsk->pi_lock);
          cleanup(futex1)
            unlock(tsk->pi_lock)
          ...                               if (tsk->flags & PF_EXITPIDONE)
                                              ret = handle_exit_race()
                                                if (uval != orig_uval)
                                                  return -EAGAIN;
                                                return -ESRCH;

          cleanup(futex2)                   return to userspace err = -ESRCH
            update futex2->uval
            with new owner TID and set
            OWNER_DIED

                                        userspace handles -ESRCH

                                        but futex2->uval has a valid TID
                                        and OWNER_DIED set.

That's inconsistent state, the futex became a SNAFUtex and user space
cannot recover from that. At least not existing user space and we cannot
break that, right?

If the kernel returns -ESRCH then the futex value must be preserved (except
for the waiters bit being set, but that's not part of the problem at hand).

You cannot just look at the kernel state with futexes. We have to guarantee
consistent state between kernel and user space no matter what. And of
course we have to be careful about user space creating inconsistent state
for stupid or malicious reasons.
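To make the -ESRCH rule concrete, here is a minimal user-space sketch in C of the interleaving above. It is a model under stated assumptions, not kernel code: handle_exit_race_model(), exit_cleanup_model() and esrch_state_consistent() are hypothetical names, and the cleanup step only mirrors the "new owner TID plus OWNER_DIED" update from the timeline. The bit masks are the futex word layout from the Linux uapi header.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Futex word layout as defined in the Linux uapi <linux/futex.h>. */
#define FUTEX_WAITERS    0x80000000u
#define FUTEX_OWNER_DIED 0x40000000u
#define FUTEX_TID_MASK   0x3fffffffu

/*
 * Model of the check shown in the timeline: compare the current user
 * space value with the one sampled earlier.
 * (Hypothetical helper, simplified from the flow in the mail.)
 */
static int handle_exit_race_model(uint32_t uval_now, uint32_t uval_seen)
{
    if (uval_now != uval_seen)
        return -EAGAIN;  /* value changed underneath us: retry */
    return -ESRCH;       /* owner is gone, value unchanged */
}

/*
 * Model of the exiting task's cleanup of one PI futex: hand the futex
 * to a new owner TID and set OWNER_DIED. Keeping the waiters bit is an
 * assumption of this sketch.
 */
static uint32_t exit_cleanup_model(uint32_t uval, uint32_t new_owner_tid)
{
    return (uval & FUTEX_WAITERS) | FUTEX_OWNER_DIED |
           (new_owner_tid & FUTEX_TID_MASK);
}

/*
 * The rule stated in the mail: once -ESRCH was returned, the futex
 * value must be preserved, except that the waiters bit may differ.
 */
static bool esrch_state_consistent(uint32_t uval_at_return,
                                   uint32_t uval_later)
{
    return ((uval_at_return ^ uval_later) & ~FUTEX_WAITERS) == 0;
}
```

Replaying the timeline with these helpers: the check runs before cleanup(futex2) and yields -ESRCH, then the cleanup rewrites the value, and esrch_state_consistent() reports the mismatch. Had the check run after the cleanup, the model would return -EAGAIN instead.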
See the whole state table above attach_to_pi_state().

The only way to fix this live lock issue is to gracefully wait for the
cleanup to complete instead of busy looping.

Yes, it sucks, but futexes suck by definition.

Thanks,

	tglx