Stefan reported that the glibc tst-robustpi4 test case fails
occasionally. That case creates the following race between
sys_exit() and sys_futex(LOCK_PI):
   CPU0                            CPU1

   sys_exit()                      sys_futex()
    do_exit()                       futex_lock_pi()
     exit_signals(tsk)               No waiters:
      tsk->flags |= PF_EXITING;      *uaddr == 0x00000PID
     mm_release(tsk)                 Set waiter bit
      exit_robust_list(tsk) {        *uaddr = 0x80000PID;
         Set owner died              attach_to_pi_owner() {
       *uaddr = 0xC0000000;           tsk = get_task(PID);
      }                               if (!(tsk->flags & PF_EXITING)) {
     ...                                attach();
     tsk->flags |= PF_EXITPIDONE;     } else {
                                        if (!(tsk->flags & PF_EXITPIDONE))
                                          return -EAGAIN;
                                        return -ESRCH; <--- FAIL
                                      }

ESRCH is returned all the way to user space, which triggers the glibc test
case assert. Returning ESRCH unconditionally is wrong here because the user
space value has been changed by the exiting task to 0xC0000000, i.e. the
FUTEX_OWNER_DIED bit is set and the futex PID value has been cleared. This
is a valid state and the kernel has to handle it, i.e. taking the futex.
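
The futex word values in the diagram can be decoded with a small user space sketch. The three masks are the ones from the UAPI header <linux/futex.h>; the futex_word_*() helper names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Bit layout of a PI futex word, values as in the UAPI <linux/futex.h>. */
#define FUTEX_WAITERS    0x80000000u
#define FUTEX_OWNER_DIED 0x40000000u
#define FUTEX_TID_MASK   0x3fffffffu

/* Hypothetical helper: compose a futex word from its three fields. */
uint32_t futex_word_make(uint32_t tid, int waiters, int owner_died)
{
	/* A TID must fit into the low 30 bits. */
	assert((tid & ~FUTEX_TID_MASK) == 0);
	return tid | (waiters ? FUTEX_WAITERS : 0u) |
	       (owner_died ? FUTEX_OWNER_DIED : 0u);
}

/* Hypothetical helpers: extract the individual fields again. */
uint32_t futex_word_tid(uint32_t uval)
{
	return uval & FUTEX_TID_MASK;
}

int futex_word_has_waiters(uint32_t uval)
{
	return (uval & FUTEX_WAITERS) != 0;
}

int futex_word_owner_died(uint32_t uval)
{
	return (uval & FUTEX_OWNER_DIED) != 0;
}
```

With these helpers 0xC0000000 decodes to TID 0 with both the waiters bit and the owner-died bit set, which is the state the exiting task leaves behind in the diagram.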
Cure it by rereading the user space value when PF_EXITING and PF_EXITPIDONE
are set in the task which owns the futex. If the value has changed, let
the kernel retry the operation, which includes all regular sanity checks
and correctly handles the FUTEX_OWNER_DIED case.
If it hasn't changed, then return ESRCH as there is no way to distinguish
this case from malfunctioning user space. This happens when the exiting
task did not have a robust list, the robust list was corrupted or the user
space value in the futex was simply bogus.
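
The decision described above can be modelled in user space roughly as follows. This is only a sketch: struct model_task, MODEL_PF_EXITPIDONE and model_handle_exit_race() are hypothetical names, the flag value is arbitrary, and the futex word is read through a plain pointer, so the -EFAULT path of the real user space access is not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in flag for this model only; the value is an assumption. */
#define MODEL_PF_EXITPIDONE 0x1u

/* Minimal stand-in for the owner task. */
struct model_task {
	unsigned int flags;
};

/*
 * uaddr: current futex word, uval: the value the caller acted on.
 * Returns -EAGAIN if the caller should retry, -ESRCH if nothing changed.
 */
int model_handle_exit_race(const uint32_t *uaddr, uint32_t uval,
			   const struct model_task *tsk)
{
	uint32_t uval2;

	assert(uaddr != NULL && tsk != NULL);

	/* Exit cleanup has not finished yet: let the caller retry. */
	if (!(tsk->flags & MODEL_PF_EXITPIDONE))
		return -EAGAIN;

	/* Reread the futex word. */
	uval2 = *uaddr;

	/* The exiting task changed the word: retry with the new value. */
	if (uval2 != uval)
		return -EAGAIN;

	/* Unchanged: indistinguishable from bogus user space state. */
	return -ESRCH;
}
```

In this model a word that went from 0x80000PID to 0xC0000000 yields -EAGAIN, so the retry sees the owner-died state; only an unchanged word yields -ESRCH.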
Reported-by: Stefan Liebler <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: [email protected]
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200467
---
kernel/futex.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 53 insertions(+), 4 deletions(-)
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1148,11 +1148,60 @@ static int attach_to_pi_state(u32 __user
return ret;
}
+static int handle_exit_race(u32 __user *uaddr, u32 uval, struct task_struct *tsk)
+{
+ u32 uval2;
+
+ /*
+ * If PF_EXITPIDONE is not yet set try again.
+ */
+ if (!(tsk->flags & PF_EXITPIDONE))
+ return -EAGAIN;
+
+ /*
+ * Reread the user space value to handle the following situation:
+ *
+ * CPU0 CPU1
+ *
+ * sys_exit() sys_futex()
+ * do_exit() futex_lock_pi()
+ * exit_signals(tsk) No waiters:
+ * tsk->flags |= PF_EXITING; *uaddr == 0x00000PID
+ * mm_release(tsk) Set waiter bit
+ * exit_robust_list(tsk) { *uaddr = 0x80000PID;
+ * Set owner died attach_to_pi_owner() {
+ * *uaddr = 0xC0000000; tsk = get_task(PID);
+ * } if (!tsk->flags & PF_EXITING) {
+ * ... attach();
+ * tsk->flags |= PF_EXITPIDONE; } else {
+ * if (!(tsk->flags & PF_EXITPIDONE))
+ * return -EAGAIN;
+ * return -ESRCH; <--- FAIL
+ * }
+ *
+ * Returning ESRCH unconditionally is wrong here because the
+ * user space value has been changed by the exiting task.
+ */
+ if (get_futex_value_locked(&uval2, uaddr))
+ return -EFAULT;
+
+ /* If the user space value has changed, try again. */
+ if (uval2 != uval)
+ return -EAGAIN;
+
+ /*
+ * The exiting task did not have a robust list, the robust list was
+ * corrupted or the user space value in *uaddr is simply bogus.
+ * Give up and tell user space.
+ */
+ return -ESRCH;
+}
+
/*
* Lookup the task for the TID provided from user space and attach to
* it after doing proper sanity checks.
*/
-static int attach_to_pi_owner(u32 uval, union futex_key *key,
+static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key,
struct futex_pi_state **ps)
{
pid_t pid = uval & FUTEX_TID_MASK;
@@ -1187,7 +1236,7 @@ static int attach_to_pi_owner(u32 uval,
* set, we know that the task has finished the
* cleanup:
*/
- int ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
+ int ret = handle_exit_race(uaddr, uval, p);
raw_spin_unlock_irq(&p->pi_lock);
put_task_struct(p);
@@ -1244,7 +1293,7 @@ static int lookup_pi_state(u32 __user *u
* We are the first waiter - try to look up the owner based on
* @uval and attach to it.
*/
- return attach_to_pi_owner(uval, key, ps);
+ return attach_to_pi_owner(uaddr, uval, key, ps);
}
static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval)
@@ -1352,7 +1401,7 @@ static int futex_lock_pi_atomic(u32 __us
* attach to the owner. If that fails, no harm done, we only
* set the FUTEX_WAITERS bit in the user space variable.
*/
- return attach_to_pi_owner(uval, key, ps);
+ return attach_to_pi_owner(uaddr, uval, key, ps);
}
/**
On Mon, Dec 10, 2018 at 04:23:06PM +0100, Thomas Gleixner wrote:
> kernel/futex.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 53 insertions(+), 4 deletions(-)
>
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -1148,11 +1148,60 @@ static int attach_to_pi_state(u32 __user
> return ret;
> }
>
> +static int handle_exit_race(u32 __user *uaddr, u32 uval, struct task_struct *tsk)
> +{
> + u32 uval2;
> +
> + /*
> + * If PF_EXITPIDONE is not yet set try again.
> + */
> + if (!(tsk->flags & PF_EXITPIDONE))
> + return -EAGAIN;
> +
> + /*
> + * Reread the user space value to handle the following situation:
> + *
> + * CPU0 CPU1
> + *
> + * sys_exit() sys_futex()
> + * do_exit() futex_lock_pi()
> + * exit_signals(tsk) No waiters:
> + * tsk->flags |= PF_EXITING; *uaddr == 0x00000PID
> + * mm_release(tsk) Set waiter bit
> + * exit_robust_list(tsk) { *uaddr = 0x80000PID;
Just to clarify; this is: sys_futex() <- futex_lock_pi() <-
futex_lock_pi_atomic(), where we do:
lock_pi_update_atomic(); // changes the futex word
attach_to_pi_owner(); // possibly returns ESRCH after changing the word
> + * Set owner died attach_to_pi_owner() {
> + * *uaddr = 0xC0000000; tsk = get_task(PID);
> + * } if (!tsk->flags & PF_EXITING) {
> + * ... attach();
> + * tsk->flags |= PF_EXITPIDONE; } else {
> + * if (!(tsk->flags & PF_EXITPIDONE))
> + * return -EAGAIN;
> + * return -ESRCH; <--- FAIL
> + * }
> + *
> + * Returning ESRCH unconditionally is wrong here because the
> + * user space value has been changed by the exiting task.
> + */
> + if (get_futex_value_locked(&uval2, uaddr))
> + return -EFAULT;
> +
> + /* If the user space value has changed, try again. */
> + if (uval2 != uval)
> + return -EAGAIN;
And this then goes back to futex_lock_pi(), which does a retry loop.
> + /*
> + * The exiting task did not have a robust list, the robust list was
> + * corrupted or the user space value in *uaddr is simply bogus.
> + * Give up and tell user space.
> + */
> + return -ESRCH;
If it is unchanged, -ESRCH is a valid return value.
> +}
There is another caller of futex_lock_pi_atomic(),
futex_proxy_trylock_atomic(), which is part of futex_requeue(); that too
does a retry loop on -EAGAIN.
And there is another caller of attach_to_pi_owner(): lookup_pi_state(),
and that too is in futex_requeue() and handles the retry case properly.
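
The retry-on-EAGAIN pattern that these callers rely on can be sketched in user space like this. All names are hypothetical, and the max_tries bound is an assumption added so the sketch always terminates; the kernel loops discussed here are not bounded this way.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback type: one attempt at the atomic lock operation. */
typedef int (*model_attempt_fn)(void *ctx);

/*
 * Call attempt() until it returns something other than -EAGAIN or the
 * bound is hit. The number of attempts is stored in *tries_out if set.
 */
int model_retry_on_eagain(model_attempt_fn attempt, void *ctx,
			  int max_tries, int *tries_out)
{
	int ret = -EAGAIN;
	int tries = 0;

	assert(attempt != NULL);

	while (tries < max_tries) {
		ret = attempt(ctx);
		tries++;
		if (ret != -EAGAIN)
			break;
	}

	if (tries_out)
		*tries_out = tries;
	return ret;
}

/* Example attempt: returns -EAGAIN until the counter in ctx reaches 0. */
int model_attempt_countdown(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return -EAGAIN;
	}
	return 0;
}
```

Any other return value, including -ESRCH, ends the loop and is passed on to the caller unchanged.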
Yes, this all looks good.
Acked-by: Peter Zijlstra (Intel) <[email protected]>
On Mon, 10 Dec 2018, Peter Zijlstra wrote:
> On Mon, Dec 10, 2018 at 04:23:06PM +0100, Thomas Gleixner wrote:
> There is another caller of futex_lock_pi_atomic(),
> futex_proxy_trylock_atomic(), which is part of futex_requeue(), that too
> does a retry loop on -EAGAIN.
>
> And there is another caller of attach_to_pi_owner(): lookup_pi_state(),
> and that too is in futex_requeue() and handles the retry case properly.
>
> Yes, this all looks good.
>
> Acked-by: Peter Zijlstra (Intel) <[email protected]>
Bah. The little devil in the unconscious part of my brain insisted on
thinking further about that EAGAIN loop despite my attempt to page
those futex horrors out again immediately after sending that patch.
There is another related issue which is even worse than just mildly
confusing user space:
  task1(SCHED_OTHER)
   sys_exit()
    do_exit()
     exit_mm()
      task1->flags |= PF_EXITING;

  ---> preemption

  task2(SCHED_FIFO)
   sys_futex(LOCK_PI)
     ....
     attach_to_pi_owner() {
       ...
       if (!(task1->flags & PF_EXITING)) {
         attach();
       } else {
         if (!(task1->flags & PF_EXITPIDONE))
           return -EAGAIN;

Now assume UP or both tasks pinned on the same CPU. That results in a
livelock because task2 is going to loop forever.
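
A toy user space model of that scenario, under heavy simplification: one CPU, one tick per loop iteration, and a waiter that either monopolizes the CPU (the SCHED_FIFO case) or shares it with the exiting task. The struct and function names are hypothetical, and max_ticks only exists to make the model terminate.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the exiting owner task. */
struct model_exiting_task {
	int steps_left;	/* ticks of CPU time needed before exit is "done" */
};

/*
 * Returns the tick at which the spinning waiter sees the exit as done,
 * or -1 if that does not happen within max_ticks.
 */
int model_spin_until_exit_done(struct model_exiting_task *owner,
			       bool waiter_monopolizes_cpu, int max_ticks)
{
	assert(owner && max_ticks >= 0);

	for (int tick = 0; tick < max_ticks; tick++) {
		/* Waiter runs: in the real code it would get -EAGAIN and retry. */
		if (owner->steps_left == 0)
			return tick;

		/* The owner only makes progress if it ever gets the CPU. */
		if (!waiter_monopolizes_cpu)
			owner->steps_left--;
	}
	return -1;
}
```

When the waiter monopolizes the CPU the function returns -1 for any bound, because steps_left never decreases; that is the model's version of the livelock.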
No immediate idea how to cure that one w/o creating a mess.
Thanks,
tglx
On Mon, 10 Dec 2018, Sasha Levin wrote:
> This commit has been processed because it contains a -stable tag.
> The stable tag indicates that it's relevant for the following trees: all
>
> The bot has tested the following trees: v4.19.8, v4.14.87, v4.9.144, v4.4.166, v3.18.128,
>
> v4.19.8: Build OK!
> v4.14.87: Build OK!
> v4.9.144: Build failed! Errors:
> kernel/futex.c:1186:28: error: 'uaddr' undeclared (first use in this function)
>
> v4.4.166: Build failed! Errors:
> kernel/futex.c:1181:28: error: 'uaddr' undeclared (first use in this function)
>
> v3.18.128: Build failed! Errors:
> kernel/futex.c:1103:28: error: 'uaddr' undeclared (first use in this function)
>
> How should we proceed with this patch?
I'll look into that once this is sorted... I so love these rotten kernels.
Thanks,
tglx
On Mon, Dec 10, 2018 at 10:16:03PM +0100, Thomas Gleixner wrote:
>On Mon, 10 Dec 2018, Sasha Levin wrote:
>> This commit has been processed because it contains a -stable tag.
>> The stable tag indicates that it's relevant for the following trees: all
>>
>> The bot has tested the following trees: v4.19.8, v4.14.87, v4.9.144, v4.4.166, v3.18.128,
>>
>> v4.19.8: Build OK!
>> v4.14.87: Build OK!
>> v4.9.144: Build failed! Errors:
>> kernel/futex.c:1186:28: error: 'uaddr' undeclared (first use in this function)
>>
>> v4.4.166: Build failed! Errors:
>> kernel/futex.c:1181:28: error: 'uaddr' undeclared (first use in this function)
>>
>> v3.18.128: Build failed! Errors:
>> kernel/futex.c:1103:28: error: 'uaddr' undeclared (first use in this function)
>>
>> How should we proceed with this patch?
>
>I'll look into that once this is sorted... I so love these rotten kernels.
It seems we need:
734009e96d19 ("futex: Change locking rules")
Which isn't trivial to backport.
--
Thanks,
Sasha
Hi Thomas,
does this also handle the ESRCH returned by
attach_to_pi_owner(...)
{...
if (!pid)
return -ESRCH;
p = find_get_task_by_vpid(pid);
if (!p)
return -ESRCH;
...
I think pid should never be zero when attach_to_pi_owner is called.
But it can happen that p is null? At least I traced the "return -ESRCH"
with the 4.17 kernel. Unfortunately both returns were done by the same
instruction address.
Bye
Stefan
On Mon, 10 Dec 2018, Sasha Levin wrote:
> On Mon, Dec 10, 2018 at 10:16:03PM +0100, Thomas Gleixner wrote:
> > On Mon, 10 Dec 2018, Sasha Levin wrote:
> > > How should we proceed with this patch?
> >
> > I'll look into that once this is sorted... I so love these rotten kernels.
>
> It seems we need:
>
> 734009e96d19 ("futex: Change locking rules")
>
> Which isn't trivial to backport.
It's simpler to backport the fix. I'll look at that once we have agreed on
the final solution.
Thanks,
tglx
Stefan,
On Tue, 11 Dec 2018, Stefan Liebler wrote:
> does this also handle the ESRCH returned by
> attach_to_pi_owner(...)
> {...
> if (!pid)
> return -ESRCH;
> p = find_get_task_by_vpid(pid);
> if (!p)
> return -ESRCH;
> ...
>
> I think pid should never be zero when attach_to_pi_owner is called.
Yeah, I just checked again. It's a paranoid check.
> But it can happen that p is null? At least I traced the "return -ESRCH" with
> the 4.17 kernel. Unfortunately both returns were done by the same instruction
> address.
Yes, you are right. We need the same sanity check for that part. Updated
patch below.
Now I "just" have to come up with a cure for that livelock thing ....
Thanks,
tglx
8<--------------
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1148,11 +1148,65 @@ static int attach_to_pi_state(u32 __user
return ret;
}
+static int handle_exit_race(u32 __user *uaddr, u32 uval,
+ struct task_struct *tsk)
+{
+ u32 uval2;
+
+ /*
+ * If PF_EXITPIDONE is not yet set, then try again.
+ */
+ if (tsk && !(tsk->flags & PF_EXITPIDONE))
+ return -EAGAIN;
+
+ /*
+ * Reread the user space value to handle the following situation:
+ *
+ * CPU0 CPU1
+ *
+ * sys_exit() sys_futex()
+ * do_exit() futex_lock_pi()
+ * futex_lock_pi_atomic()
+ * exit_signals(tsk) No waiters:
+ * tsk->flags |= PF_EXITING; *uaddr == 0x00000PID
+ * mm_release(tsk) Set waiter bit
+ * exit_robust_list(tsk) { *uaddr = 0x80000PID;
+ * Set owner died attach_to_pi_owner() {
+ * *uaddr = 0xC0000000; tsk = get_task(PID);
+ * } if (!tsk->flags & PF_EXITING) {
+ * ... attach();
+ * tsk->flags |= PF_EXITPIDONE; } else {
+ * if (!(tsk->flags & PF_EXITPIDONE))
+ * return -EAGAIN;
+ * return -ESRCH; <--- FAIL
+ * }
+ *
+ * Returning ESRCH unconditionally is wrong here because the
+ * user space value has been changed by the exiting task.
+ *
+ * The same logic applies to the case where the exiting task is
+ * already gone.
+ */
+ if (get_futex_value_locked(&uval2, uaddr))
+ return -EFAULT;
+
+ /* If the user space value has changed, try again. */
+ if (uval2 != uval)
+ return -EAGAIN;
+
+ /*
+ * The exiting task did not have a robust list, the robust list was
+ * corrupted or the user space value in *uaddr is simply bogus.
+ * Give up and tell user space.
+ */
+ return -ESRCH;
+}
+
/*
* Lookup the task for the TID provided from user space and attach to
* it after doing proper sanity checks.
*/
-static int attach_to_pi_owner(u32 uval, union futex_key *key,
+static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key,
struct futex_pi_state **ps)
{
pid_t pid = uval & FUTEX_TID_MASK;
@@ -1162,12 +1216,15 @@ static int attach_to_pi_owner(u32 uval,
/*
* We are the first waiter - try to look up the real owner and attach
* the new pi_state to it, but bail out when TID = 0 [1]
+ *
+ * The !pid check is paranoid. None of the call sites should end up
+ * with pid == 0, but better safe than sorry. Let the caller retry
*/
if (!pid)
- return -ESRCH;
+ return -EAGAIN;
p = find_get_task_by_vpid(pid);
if (!p)
- return -ESRCH;
+ return handle_exit_race(uaddr, uval, NULL);
if (unlikely(p->flags & PF_KTHREAD)) {
put_task_struct(p);
@@ -1187,7 +1244,7 @@ static int attach_to_pi_owner(u32 uval,
* set, we know that the task has finished the
* cleanup:
*/
- int ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
+ int ret = handle_exit_race(uaddr, uval, p);
raw_spin_unlock_irq(&p->pi_lock);
put_task_struct(p);
@@ -1244,7 +1301,7 @@ static int lookup_pi_state(u32 __user *u
* We are the first waiter - try to look up the owner based on
* @uval and attach to it.
*/
- return attach_to_pi_owner(uval, key, ps);
+ return attach_to_pi_owner(uaddr, uval, key, ps);
}
static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval)
@@ -1352,7 +1409,7 @@ static int futex_lock_pi_atomic(u32 __us
* attach to the owner. If that fails, no harm done, we only
* set the FUTEX_WAITERS bit in the user space variable.
*/
- return attach_to_pi_owner(uval, key, ps);
+ return attach_to_pi_owner(uaddr, newval, key, ps);
}
/**
On Mon, Dec 10, 2018 at 06:43:51PM +0100, Thomas Gleixner wrote:
> On Mon, 10 Dec 2018, Peter Zijlstra wrote:
> > On Mon, Dec 10, 2018 at 04:23:06PM +0100, Thomas Gleixner wrote:
> > There is another caller of futex_lock_pi_atomic(),
> > futex_proxy_trylock_atomic(), which is part of futex_requeue(), that too
> > does a retry loop on -EAGAIN.
> >
> > And there is another caller of attach_to_pi_owner(): lookup_pi_state(),
> > and that too is in futex_requeue() and handles the retry case properly.
> >
> > Yes, this all looks good.
> >
> > Acked-by: Peter Zijlstra (Intel) <[email protected]>
>
> Bah. The little devil in the unconscious part of my brain insisted on
> thinking further about that EAGAIN loop despite my attempt to page
> those futex horrors out again immediately after sending that patch.
>
> There is another related issue which is even worse than just mildly
> confusing user space:
>
> task1(SCHED_OTHER)
> sys_exit()
> do_exit()
> exit_mm()
> task1->flags |= PF_EXITING;
>
> ---> preemption
>
> task2(SCHED_FIFO)
> sys_futex(LOCK_PI)
> ....
> attach_to_pi_owner() {
> ...
> if (!task1->flags & PF_EXITING) {
> attach();
> } else {
> if (!(tsk->flags & PF_EXITPIDONE))
> return -EAGAIN;
>
> Now assume UP or both tasks pinned on the same CPU. That results in a
> livelock because task2 is going to loop forever.
>
> No immediate idea how to cure that one w/o creating a mess.
One possible, but fairly gruesome, hack would be something like the
below.
Now, this obviously introduces a priority inversion, but that's
arguably better than a livelock. Also, I'm not sure there's really
anything 'sane' you can do in the case where your lock holder is dying
instead of doing a proper unlock anyway.
But no, I'm not liking this much either...
diff --git a/kernel/exit.c b/kernel/exit.c
index 0e21e6d21f35..bc6a01112d9d 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -806,6 +806,8 @@ void __noreturn do_exit(long code)
* task into the wait for ever nirwana as well.
*/
tsk->flags |= PF_EXITPIDONE;
+ smp_mb();
+ wake_up_bit(&tsk->flags, 3 /* PF_EXITPIDONE */);
set_current_state(TASK_UNINTERRUPTIBLE);
schedule();
}
diff --git a/kernel/futex.c b/kernel/futex.c
index f423f9b6577e..a743d657e783 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1148,8 +1148,8 @@ static int attach_to_pi_state(u32 __user *uaddr, u32 uval,
* Lookup the task for the TID provided from user space and attach to
* it after doing proper sanity checks.
*/
-static int attach_to_pi_owner(u32 uval, union futex_key *key,
- struct futex_pi_state **ps)
+static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key,
+ struct futex_pi_state **ps, struct task_struct **pe)
{
pid_t pid = uval & FUTEX_TID_MASK;
struct futex_pi_state *pi_state;
@@ -1187,10 +1236,15 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
* set, we know that the task has finished the
* cleanup:
*/
int ret = handle_exit_race(uaddr, uval, p);
raw_spin_unlock_irq(&p->pi_lock);
- put_task_struct(p);
+
+ if (ret == -EAGAIN)
+ *pe = p;
+ else
+ put_task_struct(p);
+
return ret;
}
@@ -1244,7 +1298,7 @@ static int lookup_pi_state(u32 __user *uaddr, u32 uval,
* We are the first waiter - try to look up the owner based on
* @uval and attach to it.
*/
- return attach_to_pi_owner(uval, key, ps);
+ return attach_to_pi_owner(uaddr, uval, key, ps);
}
static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval)
@@ -1282,7 +1336,8 @@ static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval)
static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
union futex_key *key,
struct futex_pi_state **ps,
- struct task_struct *task, int set_waiters)
+ struct task_struct *task, int set_waiters,
+ struct task_struct **exiting)
{
u32 uval, newval, vpid = task_pid_vnr(task);
struct futex_q *top_waiter;
@@ -1352,7 +1407,7 @@ static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
* attach to the owner. If that fails, no harm done, we only
* set the FUTEX_WAITERS bit in the user space variable.
*/
- return attach_to_pi_owner(uval, key, ps);
+ return attach_to_pi_owner(uaddr, uval, key, ps, exiting);
}
/**
@@ -2716,6 +2771,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
struct rt_mutex_waiter rt_waiter;
struct futex_hash_bucket *hb;
struct futex_q q = futex_q_init;
+ struct task_struct *exiting;
int res, ret;
if (!IS_ENABLED(CONFIG_FUTEX_PI))
@@ -2733,6 +2789,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
}
retry:
+ exiting = NULL;
ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key, VERIFY_WRITE);
if (unlikely(ret != 0))
goto out;
@@ -2740,7 +2797,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
retry_private:
hb = queue_lock(&q);
- ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current, 0);
+ ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current, 0, &exiting);
if (unlikely(ret)) {
/*
* Atomic work succeeded and we got the lock,
@@ -2762,6 +2819,12 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
*/
queue_unlock(hb);
put_futex_key(&q.key);
+
+ if (exiting) {
+ wait_bit(&exiting->flags, 3 /* PF_EXITPIDONE */, TASK_UNINTERRUPTIBLE);
+ put_task_struct(exiting);
+ }
+
cond_resched();
goto retry;
default:
On Wed, 12 Dec 2018, Peter Zijlstra wrote:
> On Mon, Dec 10, 2018 at 06:43:51PM +0100, Thomas Gleixner wrote:
> @@ -806,6 +806,8 @@ void __noreturn do_exit(long code)
> * task into the wait for ever nirwana as well.
> */
> tsk->flags |= PF_EXITPIDONE;
> + smp_mb();
> + wake_up_bit(&tsk->flags, 3 /* PF_EXITPIDONE */);
Using ilog2(PF_EXITPIDONE) spares that horrible inline comment and more
importantly selects the right bit. 0x04 is bit 2 ....
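
For illustration, the mask-to-bit-number conversion that ilog2() performs on a single-bit flag can be sketched in user space as below. model_ilog2_u32() is a hypothetical stand-in, not the kernel macro, and the sketch makes no claim about the actual value of PF_EXITPIDONE.

```c
#include <assert.h>

/* Hypothetical stand-in for ilog2() on a mask with exactly one bit set. */
int model_ilog2_u32(unsigned int mask)
{
	int bit = 0;

	/* Only single-bit masks have a well-defined bit number here. */
	assert(mask != 0 && (mask & (mask - 1)) == 0);

	/* Count how often the mask can be shifted right until it is empty. */
	while ((mask >>= 1) != 0)
		bit++;

	return bit;
}
```

Deriving the bit number from the flag constant keeps the two in sync, which a hardcoded literal cannot guarantee.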
> @@ -1187,10 +1236,15 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
> * set, we know that the task has finished the
> * cleanup:
> */
> int ret = handle_exit_race(uaddr, uval, p);
>
> raw_spin_unlock_irq(&p->pi_lock);
> - put_task_struct(p);
> +
> + if (ret == -EAGAIN)
> + *pe = p;
Hmm, no. We really want to split the return value for that. EAGAIN is also
returned for other reasons.
Plus requeue_pi() needs the same treatment. I'm staring into it, but all I
came up with so far is horribly ugly.
Thanks,
tglx
Commit-ID: da791a667536bf8322042e38ca85d55a78d3c273
Gitweb: https://git.kernel.org/tip/da791a667536bf8322042e38ca85d55a78d3c273
Author: Thomas Gleixner <[email protected]>
AuthorDate: Mon, 10 Dec 2018 14:35:14 +0100
Committer: Thomas Gleixner <[email protected]>
CommitDate: Tue, 18 Dec 2018 23:13:15 +0100
futex: Cure exit race
Stefan reported that the glibc tst-robustpi4 test case fails
occasionally. That case creates the following race between
sys_exit() and sys_futex_lock_pi():
   CPU0                            CPU1

   sys_exit()                      sys_futex()
    do_exit()                       futex_lock_pi()
     exit_signals(tsk)               No waiters:
      tsk->flags |= PF_EXITING;      *uaddr == 0x00000PID
     mm_release(tsk)                 Set waiter bit
      exit_robust_list(tsk) {        *uaddr = 0x80000PID;
         Set owner died              attach_to_pi_owner() {
       *uaddr = 0xC0000000;           tsk = get_task(PID);
      }                               if (!(tsk->flags & PF_EXITING)) {
     ...                                attach();
     tsk->flags |= PF_EXITPIDONE;     } else {
                                        if (!(tsk->flags & PF_EXITPIDONE))
                                          return -EAGAIN;
                                        return -ESRCH; <--- FAIL
                                      }

ESRCH is returned all the way to user space, which triggers the glibc test
case assert. Returning ESRCH unconditionally is wrong here because the user
space value has been changed by the exiting task to 0xC0000000, i.e. the
FUTEX_OWNER_DIED bit is set and the futex PID value has been cleared. This
is a valid state and the kernel has to handle it, i.e. taking the futex.
Cure it by rereading the user space value when PF_EXITING and PF_EXITPIDONE
are set in the task which 'owns' the futex. If the value has changed, let
the kernel retry the operation, which includes all regular sanity checks
and correctly handles the FUTEX_OWNER_DIED case.
If it hasn't changed, then return ESRCH as there is no way to distinguish
this case from malfunctioning user space. This happens when the exiting
task did not have a robust list, the robust list was corrupted or the user
space value in the futex was simply bogus.
Reported-by: Stefan Liebler <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: [email protected]
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200467
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/futex.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 63 insertions(+), 6 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index f423f9b6577e..5cc8083a4c89 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1148,11 +1148,65 @@ out_error:
return ret;
}
+static int handle_exit_race(u32 __user *uaddr, u32 uval,
+ struct task_struct *tsk)
+{
+ u32 uval2;
+
+ /*
+ * If PF_EXITPIDONE is not yet set, then try again.
+ */
+ if (tsk && !(tsk->flags & PF_EXITPIDONE))
+ return -EAGAIN;
+
+ /*
+ * Reread the user space value to handle the following situation:
+ *
+ * CPU0 CPU1
+ *
+ * sys_exit() sys_futex()
+ * do_exit() futex_lock_pi()
+ * futex_lock_pi_atomic()
+ * exit_signals(tsk) No waiters:
+ * tsk->flags |= PF_EXITING; *uaddr == 0x00000PID
+ * mm_release(tsk) Set waiter bit
+ * exit_robust_list(tsk) { *uaddr = 0x80000PID;
+ * Set owner died attach_to_pi_owner() {
+ * *uaddr = 0xC0000000; tsk = get_task(PID);
+ * } if (!tsk->flags & PF_EXITING) {
+ * ... attach();
+ * tsk->flags |= PF_EXITPIDONE; } else {
+ * if (!(tsk->flags & PF_EXITPIDONE))
+ * return -EAGAIN;
+ * return -ESRCH; <--- FAIL
+ * }
+ *
+ * Returning ESRCH unconditionally is wrong here because the
+ * user space value has been changed by the exiting task.
+ *
+ * The same logic applies to the case where the exiting task is
+ * already gone.
+ */
+ if (get_futex_value_locked(&uval2, uaddr))
+ return -EFAULT;
+
+ /* If the user space value has changed, try again. */
+ if (uval2 != uval)
+ return -EAGAIN;
+
+ /*
+ * The exiting task did not have a robust list, the robust list was
+ * corrupted or the user space value in *uaddr is simply bogus.
+ * Give up and tell user space.
+ */
+ return -ESRCH;
+}
+
/*
* Lookup the task for the TID provided from user space and attach to
* it after doing proper sanity checks.
*/
-static int attach_to_pi_owner(u32 uval, union futex_key *key,
+static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key,
struct futex_pi_state **ps)
{
pid_t pid = uval & FUTEX_TID_MASK;
@@ -1162,12 +1216,15 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
/*
* We are the first waiter - try to look up the real owner and attach
* the new pi_state to it, but bail out when TID = 0 [1]
+ *
+ * The !pid check is paranoid. None of the call sites should end up
+ * with pid == 0, but better safe than sorry. Let the caller retry
*/
if (!pid)
- return -ESRCH;
+ return -EAGAIN;
p = find_get_task_by_vpid(pid);
if (!p)
- return -ESRCH;
+ return handle_exit_race(uaddr, uval, NULL);
if (unlikely(p->flags & PF_KTHREAD)) {
put_task_struct(p);
@@ -1187,7 +1244,7 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
* set, we know that the task has finished the
* cleanup:
*/
- int ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
+ int ret = handle_exit_race(uaddr, uval, p);
raw_spin_unlock_irq(&p->pi_lock);
put_task_struct(p);
@@ -1244,7 +1301,7 @@ static int lookup_pi_state(u32 __user *uaddr, u32 uval,
* We are the first waiter - try to look up the owner based on
* @uval and attach to it.
*/
- return attach_to_pi_owner(uval, key, ps);
+ return attach_to_pi_owner(uaddr, uval, key, ps);
}
static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval)
@@ -1352,7 +1409,7 @@ static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
* attach to the owner. If that fails, no harm done, we only
* set the FUTEX_WAITERS bit in the user space variable.
*/
- return attach_to_pi_owner(uval, key, ps);
+ return attach_to_pi_owner(uaddr, newval, key, ps);
}
/**
On 2018-12-18 10:31, Thomas Gleixner wrote:
> On Wed, 12 Dec 2018, Peter Zijlstra wrote:
>> On Mon, Dec 10, 2018 at 06:43:51PM +0100, Thomas Gleixner wrote:
>> @@ -806,6 +806,8 @@ void __noreturn do_exit(long code)
>> * task into the wait for ever nirwana as well.
>> */
>> tsk->flags |= PF_EXITPIDONE;
>> + smp_mb();
>> + wake_up_bit(&tsk->flags, 3 /* PF_EXITPIDONE */);
>
> Using ilog2(PF_EXITPIDONE) spares that horrible inline comment and more
> importantly selects the right bit. 0x04 is bit 2 ....
Plus wake_up_bit() and wait_on_bit() want an unsigned long, but tsk->flags is
unsigned int....
Moar staring....
On Wed, 19 Dec 2018, Thomas Gleixner wrote:
> On 2018-12-18 10:31, Thomas Gleixner wrote:
> > On Wed, 12 Dec 2018, Peter Zijlstra wrote:
> > > On Mon, Dec 10, 2018 at 06:43:51PM +0100, Thomas Gleixner wrote:
> > > @@ -806,6 +806,8 @@ void __noreturn do_exit(long code)
> > > * task into the wait for ever nirwana as well.
> > > */
> > > tsk->flags |= PF_EXITPIDONE;
> > > + smp_mb();
> > > + wake_up_bit(&tsk->flags, 3 /* PF_EXITPIDONE */);
> >
> > Using ilog2(PF_EXITPIDONE) spares that horrible inline comment and more
> > importantly selects the right bit. 0x04 is bit 2 ....
>
> Plus wake_up_bit() and wait_on_bit() want an unsigned long, but tsk->flags is
> unsigned int....
>
> Moar staring....
Aside from that, calling wake_up_bit() unconditionally can be slow if the
waitqueue in the hash bucket is not empty.
So while cooking up an alternative solution I found yet another exit race:
CPU0                                CPU1

sys_futex()                         sys_exit()
 futex_lock_pi()                     do_exit()
  No waiters:
   *uaddr == 0x00000PID;
  Set waiter bit
   *uaddr = 0x80000PID;
  attach_to_pi_owner()
   tsk = get_task(PID);               exit_signals(tsk)
   if (!(tsk->flags & PF_EXITING))
   ...                                 tsk->flags |= PF_EXITING;
                                      mm_release(tsk)
                                       exit_robust_list(tsk)
                                        Set owner died and clear PID
                                         *uaddr = 0xC0000000;
                                      if (unlikely(!list_empty(&tsk->pi_state_list)))
   list_add(&pi_state->list,
            &tsk->pi_state_list);

I put that all on hold until Jan 7.
If somebody is really bored, here is the WIP patch series which addresses
the live lock mess: https://tglx.de/~tglx/patches.tar
Thanks,
tglx