From: Greg Kroah-Hartman
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman, stable@vger.kernel.org, Thomas Gleixner, Ingo Molnar, "Peter Zijlstra (Intel)"
Subject: [PATCH 4.14 188/209] futex: Replace PF_EXITPIDONE with a state
Date: Wed, 4 Dec 2019 18:56:40 +0100
Message-Id: <20191204175336.430670266@linuxfoundation.org>
X-Mailer: git-send-email 2.24.0
In-Reply-To: <20191204175321.609072813@linuxfoundation.org>
References:
 <20191204175321.609072813@linuxfoundation.org>

From: Thomas Gleixner

commit 3d4775df0a89240f671861c6ab6e8d59af8e9e41 upstream.

The futex exit handling relies on PF_ flags. That's suboptimal as it
requires a smp_mb() and an ugly lock/unlock of the exiting tasks pi_lock in
the middle of do_exit() to enforce the observability of PF_EXITING in the
futex code.

Add a futex_state member to task_struct and convert the PF_EXITPIDONE logic
over to the new state. The PF_EXITING dependency will be cleaned up in a
later step.

This prepares for handling various futex exit issues later.

Signed-off-by: Thomas Gleixner
Reviewed-by: Ingo Molnar
Acked-by: Peter Zijlstra (Intel)
Link: https://lkml.kernel.org/r/20191106224556.149449274@linutronix.de
Signed-off-by: Greg Kroah-Hartman

---
 include/linux/futex.h |   33 +++++++++++++++++++++++++++++++++
 include/linux/sched.h |    2 +-
 kernel/exit.c         |   18 ++----------------
 kernel/futex.c        |   25 +++++++++++++------------
 4 files changed, 49 insertions(+), 29 deletions(-)

--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -53,6 +53,10 @@ union futex_key {
 #define FUTEX_KEY_INIT (union futex_key) { .both = { .ptr = NULL } }
 
 #ifdef CONFIG_FUTEX
+enum {
+	FUTEX_STATE_OK,
+	FUTEX_STATE_DEAD,
+};
 
 static inline void futex_init_task(struct task_struct *tsk)
 {
@@ -62,6 +66,34 @@ static inline void futex_init_task(struc
 #endif
 	INIT_LIST_HEAD(&tsk->pi_state_list);
 	tsk->pi_state_cache = NULL;
+	tsk->futex_state = FUTEX_STATE_OK;
+}
+
+/**
+ * futex_exit_done - Sets the tasks futex state to FUTEX_STATE_DEAD
+ * @tsk:	task to set the state on
+ *
+ * Set the futex exit state of the task lockless. The futex waiter code
+ * observes that state when a task is exiting and loops until the task has
+ * actually finished the futex cleanup. The worst case for this is that the
+ * waiter runs through the wait loop until the state becomes visible.
+ *
+ * This has two callers:
+ *
+ * - futex_mm_release() after the futex exit cleanup has been done
+ *
+ * - do_exit() from the recursive fault handling path.
+ *
+ * In case of a recursive fault this is best effort. Either the futex exit
+ * code has run already or not. If the OWNER_DIED bit has been set on the
+ * futex then the waiter can take it over. If not, the problem is pushed
+ * back to user space. If the futex exit code did not run yet, then an
+ * already queued waiter might block forever, but there is nothing which
+ * can be done about that.
+ */
+static inline void futex_exit_done(struct task_struct *tsk)
+{
+	tsk->futex_state = FUTEX_STATE_DEAD;
 }
 
 void futex_mm_release(struct task_struct *tsk);
@@ -71,6 +103,7 @@ long do_futex(u32 __user *uaddr, int op,
 #else
 static inline void futex_init_task(struct task_struct *tsk) { }
 static inline void futex_mm_release(struct task_struct *tsk) { }
+static inline void futex_exit_done(struct task_struct *tsk) { }
 #endif
 
 #endif
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -959,6 +959,7 @@ struct task_struct {
 #endif
 	struct list_head		pi_state_list;
 	struct futex_pi_state		*pi_state_cache;
+	unsigned int			futex_state;
 #endif
 #ifdef CONFIG_PERF_EVENTS
 	struct perf_event_context	*perf_event_ctxp[perf_nr_task_contexts];
@@ -1334,7 +1335,6 @@ extern struct pid *cad_pid;
  */
 #define PF_IDLE			0x00000002	/* I am an IDLE thread */
 #define PF_EXITING		0x00000004	/* Getting shut down */
-#define PF_EXITPIDONE		0x00000008	/* PI exit done on shut down */
 #define PF_VCPU			0x00000010	/* I'm a virtual CPU */
 #define PF_WQ_WORKER		0x00000020	/* I'm a workqueue worker */
 #define PF_FORKNOEXEC		0x00000040	/* Forked but didn't exec */
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -803,16 +803,7 @@ void __noreturn do_exit(long code)
 	 */
 	if (unlikely(tsk->flags & PF_EXITING)) {
 		pr_alert("Fixing recursive fault but reboot is needed!\n");
-		/*
-		 * We can do this unlocked here. The futex code uses
-		 * this flag just to verify whether the pi state
-		 * cleanup has been done or not. In the worst case it
-		 * loops once more. We pretend that the cleanup was
-		 * done as there is no way to return. Either the
-		 * OWNER_DIED bit is set by now or we push the blocked
-		 * task into the wait for ever nirwana as well.
-		 */
-		tsk->flags |= PF_EXITPIDONE;
+		futex_exit_done(tsk);
 		set_current_state(TASK_UNINTERRUPTIBLE);
 		schedule();
 	}
@@ -902,12 +893,7 @@ void __noreturn do_exit(long code)
 	 * Make sure we are holding no locks:
 	 */
 	debug_check_no_locks_held();
-	/*
-	 * We can do this unlocked here. The futex code uses this flag
-	 * just to verify whether the pi state cleanup has been done
-	 * or not. In the worst case it loops once more.
-	 */
-	tsk->flags |= PF_EXITPIDONE;
+	futex_exit_done(tsk);
 
 	if (tsk->io_context)
 		exit_io_context(tsk);
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1182,9 +1182,10 @@ static int handle_exit_race(u32 __user *
 	u32 uval2;
 
 	/*
-	 * If PF_EXITPIDONE is not yet set, then try again.
+	 * If the futex exit state is not yet FUTEX_STATE_DEAD, wait
+	 * for it to finish.
 	 */
-	if (tsk && !(tsk->flags & PF_EXITPIDONE))
+	if (tsk && tsk->futex_state != FUTEX_STATE_DEAD)
 		return -EAGAIN;
 
 	/*
@@ -1203,8 +1204,9 @@ static int handle_exit_race(u32 __user *
 	 *   *uaddr = 0xC0000000;		tsk = get_task(PID);
 	 * }					if (!tsk->flags & PF_EXITING) {
 	 * ...					  attach();
-	 * tsk->flags |= PF_EXITPIDONE;		} else {
-	 *					  if (!(tsk->flags & PF_EXITPIDONE))
+	 * tsk->futex_state =			} else {
+	 *	FUTEX_STATE_DEAD;		  if (tsk->futex_state !=
+	 *					      FUTEX_STATE_DEAD)
 	 *					    return -EAGAIN;
 	 *					  return -ESRCH; <--- FAIL
 	 * }
@@ -1260,17 +1262,16 @@ static int attach_to_pi_owner(u32 __user
 	}
 
 	/*
-	 * We need to look at the task state flags to figure out,
-	 * whether the task is exiting. To protect against the do_exit
-	 * change of the task flags, we do this protected by
-	 * p->pi_lock:
+	 * We need to look at the task state to figure out, whether the
+	 * task is exiting. To protect against the change of the task state
+	 * in futex_exit_release(), we do this protected by p->pi_lock:
 	 */
 	raw_spin_lock_irq(&p->pi_lock);
-	if (unlikely(p->flags & PF_EXITING)) {
+	if (unlikely(p->futex_state != FUTEX_STATE_OK)) {
 		/*
-		 * The task is on the way out. When PF_EXITPIDONE is
-		 * set, we know that the task has finished the
-		 * cleanup:
+		 * The task is on the way out. When the futex state is
+		 * FUTEX_STATE_DEAD, we know that the task has finished
+		 * the cleanup:
 		 */
 		int ret = handle_exit_race(uaddr, uval, p);
 