Date: Fri, 15 Nov 2019 18:19:20 -0000
From: "tip-bot2 for Yang Tao"
Reply-to: linux-kernel@vger.kernel.org
To: linux-tip-commits@vger.kernel.org
Subject: [tip: locking/core] futex: Prevent robust futex exit race
Cc: Yang Tao, Yi Wang, Thomas Gleixner, Ingo Molnar,
    "Peter Zijlstra (Intel)", stable@vger.kernel.org, Borislav Petkov,
    linux-kernel@vger.kernel.org
In-Reply-To: <1573010582-35297-1-git-send-email-wang.yi59@zte.com.cn>
References: <1573010582-35297-1-git-send-email-wang.yi59@zte.com.cn>
MIME-Version: 1.0
Message-ID:
 <157384196089.12247.9598930605067890286.tip-bot2@tip-bot2>

The following commit has been merged into the locking/core branch of tip:

Commit-ID:     ca16d5bee59807bf04deaab0a8eccecd5061528c
Gitweb:        https://git.kernel.org/tip/ca16d5bee59807bf04deaab0a8eccecd5061528c
Author:        Yang Tao
AuthorDate:    Wed, 06 Nov 2019 22:55:35 +01:00
Committer:     Thomas Gleixner
CommitterDate: Fri, 15 Nov 2019 19:10:49 +01:00

futex: Prevent robust futex exit race

Robust futexes utilize the robust_list mechanism to allow the kernel to
release futexes which are held when a task exits. The exit can be
voluntary or caused by a signal or fault. This prevents waiters from
blocking forever.

The futex operations in user space store a pointer to the futex they are
either locking or unlocking in the op_pending member of the per task
robust list.

After a lock operation has succeeded the futex is queued in the robust
list linked list and the op_pending pointer is cleared.

After an unlock operation has succeeded the futex is removed from the
robust list linked list and the op_pending pointer is cleared.

The robust list exit code checks for the pending operation and any futex
which is queued in the linked list. It carefully checks whether the futex
value is the TID of the exiting task. If so, it sets the OWNER_DIED bit
and tries to wake up a potential waiter.

This is race free for the lock operation but unlock has two race
scenarios where waiters might not be woken up. These issues can be
observed with regular robust pthread mutexes. PI aware pthread mutexes
are not affected.

(1) Unlocking task is killed after unlocking the futex value in user
    space before being able to wake a waiter.

        pthread_mutex_unlock()
                |
                V
        atomic_exchange_rel (&mutex->__data.__lock, 0)
                            <------------------------ killed
        lll_futex_wake ()                   |
                                            |
                                            |(__lock = 0)
                                            |(enter kernel)
                                            |
                                            V
                                    do_exit()
                                    exit_mm()
                                    mm_release()
                                    exit_robust_list()
                                    handle_futex_death()
                                            |
                                            |(__lock = 0)
                                            |(uval = 0)
                                            |
                                            V
                        if ((uval & FUTEX_TID_MASK) != task_pid_vnr(curr))
                                return 0;

    The sanity check which ensures that the user space futex is owned by
    the exiting task prevents the wakeup of waiters which in consequence
    block infinitely.

(2) Waiting task is killed after a wakeup and before it can acquire the
    futex in user space.

        OWNER                                   WAITER

                                                futex_wait()
        pthread_mutex_unlock()                      |
                |                                   |
                |(__lock = 0)                       |
                |                                   |
                V                                   |
        futex_wake() ------------>              wakeup()
                                                    |
                                                    |(return to userspace)
                                                    |(__lock = 0)
                                                    |
                                                    V
                        oldval = mutex->__data.__lock
                                          <----------------- killed
        atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,  |
                        id | assume_other_futex_waiters, 0)          |
                                                                     |
                                                      (enter kernel) |
                                                                     |
                                                                     V
                                                do_exit()
                                                    |
                                                    V
                                                handle_futex_death()
                                                    |
                                                    |(__lock = 0)
                                                    |(uval = 0)
                                                    |
                                                    V
                        if ((uval & FUTEX_TID_MASK) != task_pid_vnr(curr))
                                return 0;

    The sanity check which ensures that the user space futex is owned by
    the exiting task prevents the wakeup of waiters, which seems to be
    correct as the exiting task does not own the futex value, but the
    consequence is that other waiters won't be woken up and block
    infinitely.
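For reference, the window in scenario (1) is easiest to see in code. The
following user space sketch is illustrative only; it is not glibc's
implementation, and the function name robust_mutex_unlock() and the bare
futex word are simplifications:

        #include <linux/futex.h>        /* FUTEX_WAKE */
        #include <sys/syscall.h>
        #include <unistd.h>

        /*
         * Simplified sketch of the non-PI unlock fast path in user space.
         * glibc additionally keeps robust_list_head->list_op_pending
         * pointing at the mutex for the duration of the operation, which
         * is what the kernel exit code relies on.
         */
        static void robust_mutex_unlock(unsigned int *futex_word)
        {
                /* Step 1: release the lock word; __lock = 0, i.e. uncontended. */
                __atomic_exchange_n(futex_word, 0, __ATOMIC_RELEASE);

                /*
                 * Scenario (1): a fatal signal delivered here kills the task
                 * with __lock == 0 and no FUTEX_WAKE issued, so existing
                 * waiters can only be rescued by the kernel's robust list
                 * exit processing.
                 */

                /* Step 2: wake one waiter via the futex() syscall. */
                syscall(SYS_futex, futex_word, FUTEX_WAKE, 1, NULL, NULL, 0);
        }

Scenario (2) is the mirror image on the waiter side: the woken task dies
between re-reading the lock word and the compare-and-exchange that would
acquire it.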
In both scenarios the following conditions are true:

    - task->robust_list->list_op_pending != NULL
    - user space futex value == 0
    - Regular futex (not PI)

If these conditions are met then it is reasonably safe to wake up a
potential waiter in order to prevent the above problems. As this might be
a false positive it can cause spurious wakeups, but the waiter side has
to handle other types of unrelated wakeups, e.g. signals, gracefully
anyway. So such a spurious wakeup will not affect the correctness of
these operations.

This workaround must not touch the user space futex value and cannot set
the OWNER_DIED bit because the lock value is 0, i.e. uncontended. Setting
OWNER_DIED in this case would result in inconsistent state and
subsequently in malfunction of the owner died handling in user space.

The rest of the user space state is still consistent as no other task can
observe the list_op_pending entry in the exiting task's robust list. The
eventually woken up waiter will observe the uncontended lock value and
take it over.

[ tglx: Massaged changelog and comment. Made the return explicit and not
        depend on the subsequent check and added constants to hand into
        handle_futex_death() instead of plain numbers. Fixed a few coding
        style issues. ]

Fixes: 0771dfefc9e5 ("[PATCH] lightweight robust futexes: core")
Signed-off-by: Yang Tao
Signed-off-by: Yi Wang
Signed-off-by: Thomas Gleixner
Reviewed-by: Ingo Molnar
Acked-by: Peter Zijlstra (Intel)
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1573010582-35297-1-git-send-email-wang.yi59@zte.com.cn
Link: https://lkml.kernel.org/r/20191106224555.943191378@linutronix.de
---
 kernel/futex.c | 58 +++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 51 insertions(+), 7 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 43229f8..49eaf5b 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3452,11 +3452,16 @@ err_unlock:
         return ret;
 }
 
+/* Constants for the pending_op argument of handle_futex_death */
+#define HANDLE_DEATH_PENDING    true
+#define HANDLE_DEATH_LIST       false
+
 /*
  * Process a futex-list entry, check whether it's owned by the
  * dying task, and do notification if so:
  */
-static int handle_futex_death(u32 __user *uaddr, struct task_struct *curr, int pi)
+static int handle_futex_death(u32 __user *uaddr, struct task_struct *curr,
+                              bool pi, bool pending_op)
 {
         u32 uval, uninitialized_var(nval), mval;
         int err;
@@ -3469,6 +3474,42 @@ retry:
         if (get_user(uval, uaddr))
                 return -1;
 
+        /*
+         * Special case for regular (non PI) futexes. The unlock path in
+         * user space has two race scenarios:
+         *
+         * 1. The unlock path releases the user space futex value and
+         *    before it can execute the futex() syscall to wake up
+         *    waiters it is killed.
+         *
+         * 2. A woken up waiter is killed before it can acquire the
+         *    futex in user space.
+         *
+         * In both cases the TID validation below prevents a wakeup of
+         * potential waiters which can cause these waiters to block
+         * forever.
+         *
+         * In both cases the following conditions are met:
+         *
+         *      1) task->robust_list->list_op_pending != NULL
+         *         @pending_op == true
+         *      2) User space futex value == 0
+         *      3) Regular futex: @pi == false
+         *
+         * If these conditions are met, it is safe to attempt waking up a
+         * potential waiter without touching the user space futex value and
+         * trying to set the OWNER_DIED bit. The user space futex value is
+         * uncontended and the rest of the user space mutex state is
+         * consistent, so a woken waiter will just take over the
+         * uncontended futex. Setting the OWNER_DIED bit would create
+         * inconsistent state and malfunction of the user space owner died
+         * handling.
+         */
+        if (pending_op && !pi && !uval) {
+                futex_wake(uaddr, 1, 1, FUTEX_BITSET_MATCH_ANY);
+                return 0;
+        }
+
         if ((uval & FUTEX_TID_MASK) != task_pid_vnr(curr))
                 return 0;
 
@@ -3588,10 +3629,11 @@ void exit_robust_list(struct task_struct *curr)
                  * A pending lock might already be on the list, so
                  * don't process it twice:
                  */
-                if (entry != pending)
+                if (entry != pending) {
                         if (handle_futex_death((void __user *)entry + futex_offset,
-                                               curr, pi))
+                                               curr, pi, HANDLE_DEATH_LIST))
                                 return;
+                }
                 if (rc)
                         return;
                 entry = next_entry;
@@ -3605,9 +3647,10 @@ void exit_robust_list(struct task_struct *curr)
                 cond_resched();
         }
 
-        if (pending)
+        if (pending) {
                 handle_futex_death((void __user *)pending + futex_offset,
-                                   curr, pip);
+                                   curr, pip, HANDLE_DEATH_PENDING);
+        }
 }
 
 long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
@@ -3784,7 +3827,8 @@ void compat_exit_robust_list(struct task_struct *curr)
                 if (entry != pending) {
                         void __user *uaddr = futex_uaddr(entry, futex_offset);
 
-                        if (handle_futex_death(uaddr, curr, pi))
+                        if (handle_futex_death(uaddr, curr, pi,
+                                               HANDLE_DEATH_LIST))
                                 return;
                 }
                 if (rc)
@@ -3803,7 +3847,7 @@ void compat_exit_robust_list(struct task_struct *curr)
         if (pending) {
                 void __user *uaddr = futex_uaddr(pending, futex_offset);
 
-                handle_futex_death(uaddr, curr, pip);
+                handle_futex_death(uaddr, curr, pip, HANDLE_DEATH_PENDING);
         }
 }
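For completeness: the list_op_pending pointer tested by the new special
case lives in the user space robust list head that each thread registers
with the kernel. A minimal registration sketch, using only the UAPI
structures from <linux/futex.h>, could look like the following. It is
illustrative only; glibc normally performs this per thread, so
applications using robust pthread mutexes never call it themselves:

        #include <linux/futex.h>        /* struct robust_list_head */
        #include <sys/syscall.h>
        #include <unistd.h>

        static __thread struct robust_list_head robust_head;

        static long register_robust_list(void)
        {
                /* Empty circular list: the head points back to itself. */
                robust_head.list.next = &robust_head.list;
                /* Offset from a list entry to its futex word (0 in this sketch). */
                robust_head.futex_offset = 0;
                robust_head.list_op_pending = NULL;

                /*
                 * Hand the pointer to the kernel; it is stored in
                 * task_struct->robust_list and walked by exit_robust_list()
                 * -> handle_futex_death() when the task exits.
                 */
                return syscall(SYS_set_robust_list, &robust_head,
                               sizeof(robust_head));
        }

A waiter that receives the new, possibly spurious FUTEX_WAKE simply
re-reads the futex word, finds it uncontended and takes the lock over,
which is why the false positive is harmless.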