Received: by 2002:a25:7ec1:0:0:0:0:0 with SMTP id z184csp5998600ybc; Wed, 27 Nov 2019 13:09:08 -0800 (PST) X-Google-Smtp-Source: APXvYqz3mRvS74CWHMNm/9y+tO+Tuf9bJcrgPZpWTEbETIs+zNYJ1TQB8HTdIfkUzXGejUllQbxH X-Received: by 2002:a05:6402:2051:: with SMTP id bc17mr1938622edb.28.1574888948348; Wed, 27 Nov 2019 13:09:08 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1574888948; cv=none; d=google.com; s=arc-20160816; b=RcWyRY6w/wHEj+Cc40bDD0/qVb24gemwitvTymuvscodCsCJcD9Slojbrj4aiCaxC3 PHZpvss8Xa4ylwNGKEB64UFF7lNwx8ucFzfFERx5mQFjal7rVUnLOnQGneeyJVz13Wyg DdnlPFa1sPBGLF9wYKgypmZwjIYpsAwYfJR4pBndms58lAxsbuSq15Z+UibLhtMSpCil El4XZN97WnVybU2ZycK8HAOde4d+YZ+n9uBqK8ePIlNzDFzMeufM5220qrV/76YvWT2F TPWlMyiXmuSb+gVVcUrTYpeBt82wcVjH9J7gaWpRkMk7PY4dBNRknJ3yJVAOAy+ecCLl S9UQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:mime-version :user-agent:references:in-reply-to:message-id:date:subject:cc:to :from:dkim-signature; bh=3UYkeZEe+buBYNRJcCBaw6KEhC+ZXGemJhBP2MFOEWc=; b=rvm7f4TImDMhypIXkuZbwnRixX+avw8Via50Ow6RGYT/6QXwP83XkEU/oc1SB9XSz1 bWmjuu+s0HTDS0epl0L37vVTVkT18T0dBbdF945FxLdF+SX2GLQ8h491qBTJrtGRHQ7w X24G/a9/oCDDyakeNY6MKHKPHeJ5x4N+ODQNU1IauDjApe700lG4Fv4ZANdjc/9AJW9m bP9QzhhcJO1Gqb6OjWPVwNhBBaHa0JAXqNUzIwZdFqwr4cuV5kmAnnj+XWvthKMY8CCj v9NKyZya9fPWWNHkDotIB4t16vL3jC4UIHxlwaJasoGRsM4qX8Jl8EU8X1HUUWOFJfp7 +tOg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=jb3h9N92; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id x18si10184705ejc.97.2019.11.27.13.08.44; Wed, 27 Nov 2019 13:09:08 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=jb3h9N92; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1731966AbfK0VHN (ORCPT + 99 others); Wed, 27 Nov 2019 16:07:13 -0500 Received: from mail.kernel.org ([198.145.29.99]:33254 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1731784AbfK0VHK (ORCPT ); Wed, 27 Nov 2019 16:07:10 -0500 Received: from localhost (83-86-89-107.cable.dynamic.v4.ziggo.nl [83.86.89.107]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 88A392176D; Wed, 27 Nov 2019 21:07:09 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1574888830; bh=244NLxpKoyq9Y+LRAv523jmgSymzurZP1XuPfKKuBb0=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=jb3h9N92eIupJAGBV8j5SBr3UBpzp8K+lzvbfZq/cFL54YNiHePsHe5o+3kaUpsAs leKnvLjpx6MKxYRz23JCXc5PXkWSJNwQMNMHSEtqZQkGmqAs34/6WQ5uCMMqBH9OvK 0LqLvYo7y8YCy8dLgiLDvG5agFss6GQHyplq91Ag= From: Greg Kroah-Hartman To: linux-kernel@vger.kernel.org Cc: Greg Kroah-Hartman , stable@vger.kernel.org, Yang Tao , Yi Wang , Thomas Gleixner , Ingo Molnar , "Peter Zijlstra (Intel)" Subject: [PATCH 4.19 279/306] futex: Prevent robust futex exit race Date: Wed, 27 Nov 2019 21:32:09 +0100 Message-Id: <20191127203135.176374671@linuxfoundation.org> X-Mailer: git-send-email 2.24.0 In-Reply-To: <20191127203114.766709977@linuxfoundation.org> References: <20191127203114.766709977@linuxfoundation.org> User-Agent: quilt/0.66 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Yang Tao commit ca16d5bee59807bf04deaab0a8eccecd5061528c upstream. Robust futexes utilize the robust_list mechanism to allow the kernel to release futexes which are held when a task exits. The exit can be voluntary or caused by a signal or fault. This prevents that waiters block forever. The futex operations in user space store a pointer to the futex they are either locking or unlocking in the op_pending member of the per task robust list. After a lock operation has succeeded the futex is queued in the robust list linked list and the op_pending pointer is cleared. After an unlock operation has succeeded the futex is removed from the robust list linked list and the op_pending pointer is cleared. The robust list exit code checks for the pending operation and any futex which is queued in the linked list. It carefully checks whether the futex value is the TID of the exiting task. If so, it sets the OWNER_DIED bit and tries to wake up a potential waiter. This is race free for the lock operation but unlock has two race scenarios where waiters might not be woken up. These issues can be observed with regular robust pthread mutexes. PI aware pthread mutexes are not affected. (1) Unlocking task is killed after unlocking the futex value in user space before being able to wake a waiter. pthread_mutex_unlock() | V atomic_exchange_rel (&mutex->__data.__lock, 0) <------------------------killed lll_futex_wake () | | |(__lock = 0) |(enter kernel) | V do_exit() exit_mm() mm_release() exit_robust_list() handle_futex_death() | |(__lock = 0) |(uval = 0) | V if ((uval & FUTEX_TID_MASK) != task_pid_vnr(curr)) return 0; The sanity check which ensures that the user space futex is owned by the exiting task prevents the wakeup of waiters which in consequence block infinitely. (2) Waiting task is killed after a wakeup and before it can acquire the futex in user space. OWNER WAITER futex_wait() pthread_mutex_unlock() | | | |(__lock = 0) | | | V | futex_wake() ------------> wakeup() | |(return to userspace) |(__lock = 0) | V oldval = mutex->__data.__lock <-----------------killed atomic_compare_and_exchange_val_acq (&mutex->__data.__lock, | id | assume_other_futex_waiters, 0) | | | (enter kernel)| | V do_exit() | | V handle_futex_death() | |(__lock = 0) |(uval = 0) | V if ((uval & FUTEX_TID_MASK) != task_pid_vnr(curr)) return 0; The sanity check which ensures that the user space futex is owned by the exiting task prevents the wakeup of waiters, which seems to be correct as the exiting task does not own the futex value, but the consequence is that other waiters wont be woken up and block infinitely. In both scenarios the following conditions are true: - task->robust_list->list_op_pending != NULL - user space futex value == 0 - Regular futex (not PI) If these conditions are met then it is reasonably safe to wake up a potential waiter in order to prevent the above problems. As this might be a false positive it can cause spurious wakeups, but the waiter side has to handle other types of unrelated wakeups, e.g. signals gracefully anyway. So such a spurious wakeup will not affect the correctness of these operations. This workaround must not touch the user space futex value and cannot set the OWNER_DIED bit because the lock value is 0, i.e. uncontended. Setting OWNER_DIED in this case would result in inconsistent state and subsequently in malfunction of the owner died handling in user space. The rest of the user space state is still consistent as no other task can observe the list_op_pending entry in the exiting tasks robust list. The eventually woken up waiter will observe the uncontended lock value and take it over. [ tglx: Massaged changelog and comment. Made the return explicit and not depend on the subsequent check and added constants to hand into handle_futex_death() instead of plain numbers. Fixed a few coding style issues. ] Fixes: 0771dfefc9e5 ("[PATCH] lightweight robust futexes: core") Signed-off-by: Yang Tao Signed-off-by: Yi Wang Signed-off-by: Thomas Gleixner Reviewed-by: Ingo Molnar Acked-by: Peter Zijlstra (Intel) Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1573010582-35297-1-git-send-email-wang.yi59@zte.com.cn Link: https://lkml.kernel.org/r/20191106224555.943191378@linutronix.de Signed-off-by: Greg Kroah-Hartman --- kernel/futex.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 51 insertions(+), 7 deletions(-) --- a/kernel/futex.c +++ b/kernel/futex.c @@ -3457,11 +3457,16 @@ err_unlock: return ret; } +/* Constants for the pending_op argument of handle_futex_death */ +#define HANDLE_DEATH_PENDING true +#define HANDLE_DEATH_LIST false + /* * Process a futex-list entry, check whether it's owned by the * dying task, and do notification if so: */ -static int handle_futex_death(u32 __user *uaddr, struct task_struct *curr, int pi) +static int handle_futex_death(u32 __user *uaddr, struct task_struct *curr, + bool pi, bool pending_op) { u32 uval, uninitialized_var(nval), mval; int err; @@ -3474,6 +3479,42 @@ retry: if (get_user(uval, uaddr)) return -1; + /* + * Special case for regular (non PI) futexes. The unlock path in + * user space has two race scenarios: + * + * 1. The unlock path releases the user space futex value and + * before it can execute the futex() syscall to wake up + * waiters it is killed. + * + * 2. A woken up waiter is killed before it can acquire the + * futex in user space. + * + * In both cases the TID validation below prevents a wakeup of + * potential waiters which can cause these waiters to block + * forever. + * + * In both cases the following conditions are met: + * + * 1) task->robust_list->list_op_pending != NULL + * @pending_op == true + * 2) User space futex value == 0 + * 3) Regular futex: @pi == false + * + * If these conditions are met, it is safe to attempt waking up a + * potential waiter without touching the user space futex value and + * trying to set the OWNER_DIED bit. The user space futex value is + * uncontended and the rest of the user space mutex state is + * consistent, so a woken waiter will just take over the + * uncontended futex. Setting the OWNER_DIED bit would create + * inconsistent state and malfunction of the user space owner died + * handling. + */ + if (pending_op && !pi && !uval) { + futex_wake(uaddr, 1, 1, FUTEX_BITSET_MATCH_ANY); + return 0; + } + if ((uval & FUTEX_TID_MASK) != task_pid_vnr(curr)) return 0; @@ -3593,10 +3634,11 @@ void exit_robust_list(struct task_struct * A pending lock might already be on the list, so * don't process it twice: */ - if (entry != pending) + if (entry != pending) { if (handle_futex_death((void __user *)entry + futex_offset, - curr, pi)) + curr, pi, HANDLE_DEATH_LIST)) return; + } if (rc) return; entry = next_entry; @@ -3610,9 +3652,10 @@ void exit_robust_list(struct task_struct cond_resched(); } - if (pending) + if (pending) { handle_futex_death((void __user *)pending + futex_offset, - curr, pip); + curr, pip, HANDLE_DEATH_PENDING); + } } long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout, @@ -3789,7 +3832,8 @@ void compat_exit_robust_list(struct task if (entry != pending) { void __user *uaddr = futex_uaddr(entry, futex_offset); - if (handle_futex_death(uaddr, curr, pi)) + if (handle_futex_death(uaddr, curr, pi, + HANDLE_DEATH_LIST)) return; } if (rc) @@ -3808,7 +3852,7 @@ void compat_exit_robust_list(struct task if (pending) { void __user *uaddr = futex_uaddr(pending, futex_offset); - handle_futex_death(uaddr, curr, pip); + handle_futex_death(uaddr, curr, pip, HANDLE_DEATH_PENDING); } }