Hello,
with the glibc self-tests I was able to trigger the "this should not
happen" warning ;) below on s390 (with panic_on_warn=1 set). It looks
like it is hardly reproducible.
This one happened with commit d146194f31c9 for compiling the kernel.
Config can be re-created with "make ARCH=s390 performance_defconfig".
[ 649.596938] WARNING: CPU: 0 PID: 58886 at kernel/futex.c:1418 do_futex+0xa9a/0xc50
[ 649.596946] Kernel panic - not syncing: panic_on_warn set ...
[ 649.596951] CPU: 0 PID: 58886 Comm: ld64.so.1 Not tainted 4.20.0-20181125.rc3.git0.d146194f31c9.300.fc29.s390x+git #1
[ 649.596953] Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
[ 649.596956] Call Trace:
[ 649.596963] ([<0000000000113050>] show_stack+0x58/0x70)
[ 649.596970] [<0000000000a62a92>] dump_stack+0x7a/0xa8
[ 649.596975] [<0000000000144012>] panic+0x11a/0x258
[ 649.596978] [<0000000000143e70>] __warn+0xf8/0x118
[ 649.596981] [<0000000000a61c20>] report_bug+0xd8/0x150
[ 649.596985] [<00000000001014ac>] do_report_trap+0xc4/0xe0
[ 649.596988] [<0000000000101680>] illegal_op+0x138/0x150
[ 649.596994] [<0000000000a82bec>] pgm_check_handler+0x1cc/0x220
[ 649.596998] [<00000000001e89b2>] do_futex+0xa9a/0xc50
[ 649.597002] ([<00000000001e8b16>] do_futex+0xbfe/0xc50)
[ 649.597006] [<00000000001e8c4c>] sys_futex+0xe4/0x170
[ 649.597010] [<0000000000a82800>] system_call+0xdc/0x2c8
Thanks,
Heiko
Heiko,
On Tue, 27 Nov 2018, Heiko Carstens wrote:
> with the glibc self-tests I was able to trigger the "this should not
> happen" warning ;) below on s390 (with panic_on_warn=1 set). It looks
> like it is hardly reproducible.
Any idea which self-test triggered that?
> This one happened with commit d146194f31c9 for compiling the kernel.
> Config can be re-created with "make ARCH=s390 performance_defconfig".
Which is not really helpful for people who do not own an s390. And no, I
don't want one unless IBM pays the power bill as well :)
> [ 649.596938] WARNING: CPU: 0 PID: 58886 at kernel/futex.c:1418 do_futex+0xa9a/0xc50
> [ 649.596946] Kernel panic - not syncing: panic_on_warn set ...
> [ 649.596951] CPU: 0 PID: 58886 Comm: ld64.so.1 Not tainted 4.20.0-20181125.rc3.git0.d146194f31c9.300.fc29.s390x+git #1
That's ld64.so.1. Weird, but what do I know about glibc self tests.
I still fail to see how that can happen, but I usually page out the futex
horrors immediately. I'll keep staring at the code...
Thanks,
tglx
On Wed, Nov 28, 2018 at 03:32:45PM +0100, Thomas Gleixner wrote:
> Heiko,
>
> On Tue, 27 Nov 2018, Heiko Carstens wrote:
>
> > with the glibc self-tests I was able to trigger the "this should not
> > happen" warning ;) below on s390 (with panic_on_warn=1 set). It looks
> > like it is hardly reproducible.
>
> Any idea which self-test triggered that?
>
> > This one happened with commit d146194f31c9 for compiling the kernel.
> > Config can be re-created with "make ARCH=s390 performance_defconfig".
>
> Which is not really helpful for people who do not own an s390. And no, I
> don't want one unless IBM pays the power bill as well :)
>
> > [ 649.596938] WARNING: CPU: 0 PID: 58886 at kernel/futex.c:1418 do_futex+0xa9a/0xc50
> > [ 649.596946] Kernel panic - not syncing: panic_on_warn set ...
> > [ 649.596951] CPU: 0 PID: 58886 Comm: ld64.so.1 Not tainted 4.20.0-20181125.rc3.git0.d146194f31c9.300.fc29.s390x+git #1
>
> That's ld64.so.1. Weird, but what do I know about glibc self tests.
>
> I still fail to see how that can happen, but I usually page out the futex
> horrors immediately. I'll keep staring at the code...
I looked into the system dumps, and if I didn't screw up, then the
command line for both occurrences was
/root/glibc-build/nptl/tst-robustpi8
And indeed, if I run only this test case in an endless loop and do
some parallel work (like kernel compile) it currently seems to be
possible to reproduce the warning:
while true; do time ./testrun.sh nptl/tst-robustpi8 --direct ; done
within the build directory of glibc (2.28).
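For reference, a sketch of the full setup; the ~/glibc-build and ~/linux
paths below are illustrative assumptions, not from the report:

# Reproducer loop in the glibc 2.28 build tree:
cd ~/glibc-build
while true; do time ./testrun.sh nptl/tst-robustpi8 --direct; done &

# Parallel load to perturb timing, e.g. a kernel compile:
cd ~/linux
make -j"$(nproc)"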
See
https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/tst-robustpi8.c;h=cbea3d6d77abb00be05ec7b466d8339c26dd2efb;hb=3c03baca37fdcb52c3881e653ca392bba7a99c2b
which includes this one:
https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/tst-robust8.c;h=9c636250d4cb0bcd6d802910e8f9ea31568bb73f;hb=3c03baca37fdcb52c3881e653ca392bba7a99c2b
Hi Thomas,
[full quote below]
Did you have any time to look into this yet? :)
The warning is still reproducible.
On Thu, Nov 29, 2018 at 12:23:21PM +0100, Heiko Carstens wrote:
> On Wed, Nov 28, 2018 at 03:32:45PM +0100, Thomas Gleixner wrote:
> > Heiko,
> >
> > On Tue, 27 Nov 2018, Heiko Carstens wrote:
> >
> > > with the glibc self-tests I was able to trigger the "this should not
> > > happen" warning ;) below on s390 (with panic_on_warn=1 set). It looks
> > > like it is hardly reproducible.
> >
> > Any idea which self-test triggered that?
> >
> > > This one happened with commit d146194f31c9 for compiling the kernel.
> > > Config can be re-created with "make ARCH=s390 performance_defconfig".
> >
> > Which is not really helpful for people who do not own an s390. And no, I
> > don't want one unless IBM pays the power bill as well :)
> >
> > > [ 649.596938] WARNING: CPU: 0 PID: 58886 at kernel/futex.c:1418 do_futex+0xa9a/0xc50
> > > [ 649.596946] Kernel panic - not syncing: panic_on_warn set ...
> > > [ 649.596951] CPU: 0 PID: 58886 Comm: ld64.so.1 Not tainted 4.20.0-20181125.rc3.git0.d146194f31c9.300.fc29.s390x+git #1
> >
> > That's ld64.so.1. Weird, but what do I know about glibc self tests.
> >
> > I still fail to see how that can happen, but I usually page out the futex
> > horrors immediately. I'll keep staring at the code...
>
> I looked into the system dumps, and if I didn't screw up, then the
> command line for both occurrences was
>
> /root/glibc-build/nptl/tst-robustpi8
>
> And indeed, if I run only this test case in an endless loop and do
> some parallel work (like kernel compile) it currently seems to be
> possible to reproduce the warning:
>
> while true; do time ./testrun.sh nptl/tst-robustpi8 --direct ; done
>
> within the build directory of glibc (2.28).
>
> See
> https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/tst-robustpi8.c;h=cbea3d6d77abb00be05ec7b466d8339c26dd2efb;hb=3c03baca37fdcb52c3881e653ca392bba7a99c2b
>
> which includes this one:
>
> https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/tst-robust8.c;h=9c636250d4cb0bcd6d802910e8f9ea31568bb73f;hb=3c03baca37fdcb52c3881e653ca392bba7a99c2b
>
On Mon, 21 Jan 2019, Heiko Carstens wrote:
> Hi Thomas,
>
> [full quote below]
>
> Did you have any time to look into this yet? :)
>
> The warning is still reproducible.
Yeah, it's on my list of stuff which I need to take care of urgently. In
the next couple of days I hope...
Thanks,
tglx
On Mon, 21 Jan 2019, Thomas Gleixner wrote:
> On Mon, 21 Jan 2019, Heiko Carstens wrote:
>
> > Hi Thomas,
> >
> > [full quote below]
> >
> > Did you have any time to look into this yet? :)
> >
> > The warning is still reproducible.
>
> Yeah, it's on my list of stuff which I need to take care of urgently. In
> the next couple of days I hope...
Hmm. Doesn't
da791a667536 ("futex: Cure exit race")
address that issue?
Thanks,
tglx
On Tue, Jan 22, 2019 at 10:14:00PM +0100, Thomas Gleixner wrote:
> On Mon, 21 Jan 2019, Thomas Gleixner wrote:
> > On Mon, 21 Jan 2019, Heiko Carstens wrote:
> >
> > > Hi Thomas,
> > >
> > > [full quote below]
> > >
> > > Did you have any time to look into this yet? :)
> > >
> > > The warning is still reproducible.
> >
> > Yeah, it's on my list of stuff which I need to take care of urgently. In
> > the next couple of days I hope...
>
> Hmm. Doesn't
>
> da791a667536 ("futex: Cure exit race")
>
> address that issue?
It doesn't look like it does. One occurrence was the one below when
using commit 7939f8beecf1 (which is post 5.0-rc2) for building the
kernel:
WARNING: CPU: 14 PID: 23505 at kernel/futex.c:1483 do_futex+0xa9a/0xc50
Kernel panic - not syncing: panic_on_warn set ...
CPU: 14 PID: 23505 Comm: ld.so.1 Not tainted 5.0.0-20190116.rc2.git0.7939f8beecf1.300.fc29.s390x+git #1
Hardware name: IBM 3906 M04 704 (LPAR)
Call Trace:
([<0000000000112e60>] show_stack+0x58/0x70)
[<0000000000a671fa>] dump_stack+0x7a/0xa8
[<0000000000143f52>] panic+0x11a/0x2d0
[<0000000000143db0>] __warn+0xf8/0x118
[<0000000000a662f8>] report_bug+0xd8/0x150
[<00000000001014ac>] do_report_trap+0xc4/0xe0
[<0000000000101680>] illegal_op+0x138/0x150
[<0000000000a87270>] pgm_check_handler+0x1c8/0x220
[<00000000001e9aea>] do_futex+0xa9a/0xc50
([<00000000001e9c4e>] do_futex+0xbfe/0xc50)
[<00000000001ea13c>] compat_sys_futex+0xe4/0x170
[<0000000000a86e84>] system_call+0xd8/0x2c8
Heiko,
On Wed, 23 Jan 2019, Heiko Carstens wrote:
> It doesn't look like it does. One occurrence was the one below when
> using commit 7939f8beecf1 (which is post 5.0-rc2) for building the
> kernel:
>
> WARNING: CPU: 14 PID: 23505 at kernel/futex.c:1483 do_futex+0xa9a/0xc50
> Kernel panic - not syncing: panic_on_warn set ...
> CPU: 14 PID: 23505 Comm: ld.so.1 Not tainted 5.0.0-20190116.rc2.git0.7939f8beecf1.300.fc29.s390x+git #1
> Hardware name: IBM 3906 M04 704 (LPAR)
> Call Trace:
> ([<0000000000112e60>] show_stack+0x58/0x70)
> [<0000000000a671fa>] dump_stack+0x7a/0xa8
> [<0000000000143f52>] panic+0x11a/0x2d0
> [<0000000000143db0>] __warn+0xf8/0x118
> [<0000000000a662f8>] report_bug+0xd8/0x150
> [<00000000001014ac>] do_report_trap+0xc4/0xe0
> [<0000000000101680>] illegal_op+0x138/0x150
> [<0000000000a87270>] pgm_check_handler+0x1c8/0x220
> [<00000000001e9aea>] do_futex+0xa9a/0xc50
> ([<00000000001e9c4e>] do_futex+0xbfe/0xc50)
> [<00000000001ea13c>] compat_sys_futex+0xe4/0x170
> [<0000000000a86e84>] system_call+0xd8/0x2c8
Ok. That's another one which I can't spot yet.
Is that only on S390?
What's the test which explodes? Kernel build?
Thanks,
tglx
On Wed, Jan 23, 2019 at 01:33:15PM +0100, Thomas Gleixner wrote:
> Heiko,
>
> On Wed, 23 Jan 2019, Heiko Carstens wrote:
> > It doesn't look like it does. One occurrence was the one below when
> > using commit 7939f8beecf1 (which is post 5.0-rc2) for building the
> > kernel:
> >
> > WARNING: CPU: 14 PID: 23505 at kernel/futex.c:1483 do_futex+0xa9a/0xc50
> > Kernel panic - not syncing: panic_on_warn set ...
> > CPU: 14 PID: 23505 Comm: ld.so.1 Not tainted 5.0.0-20190116.rc2.git0.7939f8beecf1.300.fc29.s390x+git #1
> > Hardware name: IBM 3906 M04 704 (LPAR)
> > Call Trace:
> > ([<0000000000112e60>] show_stack+0x58/0x70)
> > [<0000000000a671fa>] dump_stack+0x7a/0xa8
> > [<0000000000143f52>] panic+0x11a/0x2d0
> > [<0000000000143db0>] __warn+0xf8/0x118
> > [<0000000000a662f8>] report_bug+0xd8/0x150
> > [<00000000001014ac>] do_report_trap+0xc4/0xe0
> > [<0000000000101680>] illegal_op+0x138/0x150
> > [<0000000000a87270>] pgm_check_handler+0x1c8/0x220
> > [<00000000001e9aea>] do_futex+0xa9a/0xc50
> > ([<00000000001e9c4e>] do_futex+0xbfe/0xc50)
> > [<00000000001ea13c>] compat_sys_futex+0xe4/0x170
> > [<0000000000a86e84>] system_call+0xd8/0x2c8
>
> Ok. That's another one which I can't spot yet.
>
> Is that only on S390?
> What's the test which explodes? Kernel build?
No, it's the tst-robustpi8 glibc 2.28 selftest. Let me quote myself
from earlier in this thread:
"
... tst-robustpi8
And indeed, if I run only this test case in an endless loop and do
some parallel work (like kernel compile) it currently seems to be
possible to reproduce the warning:
while true; do time ./testrun.sh nptl/tst-robustpi8 --direct ; done
within the build directory of glibc (2.28).
See
https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/tst-robustpi8.c;h=cbea3d6d77abb00be05ec7b466d8339c26dd2efb;hb=3c03baca37fdcb52c3881e653ca392bba7a99c2b
which includes this one:
https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/tst-robust8.c;h=9c636250d4cb0bcd6d802910e8f9ea31568bb73f;hb=3c03baca37fdcb52c3881e653ca392bba7a99c2b
"
On Thu, Nov 29, 2018 at 12:23:21PM +0100, Heiko Carstens wrote:
> And indeed, if I run only this test case in an endless loop and do
> some parallel work (like kernel compile) it currently seems to be
> possible to reproduce the warning:
>
> while true; do time ./testrun.sh nptl/tst-robustpi8 --direct ; done
>
> within the build directory of glibc (2.28).
Right; so that reproduces for me.
After staring at all that for a while; trying to remember how it all
worked (or was supposed to work, rather), I became suspicious of commit:
56222b212e8e ("futex: Drop hb->lock before enqueueing on the rtmutex")
And indeed, when I revert that; the above reproducer no longer works (as
in, it no longer triggers in minutes and has -- so far -- held up for an
hour+ or so).
That patch in particular allows futex_unlock_pi() to 'start' early:
futex_lock_pi() futex_unlock_pi()
lock hb
queue
lock wait_lock
unlock hb
lock hb
futex_top_waiter
get_pi_state
lock wait_lock
rt_mutex_proxy_start // fail
unlock wait_lock
// acquired wait_lock
wake_futex_pi()
rt_mutex_next_owner() // whoops, no waiter
WARN
lock hb
unqueue_me_pi
So reverting that patch should cure things, because then there is no hb
lock break between queue/unqueue and futex_unlock_pi() cannot observe
this half-arsed state.
Now obviously reverting that makes RT unhappy; let me see what the
options are.
(concurrently tglx generated a trace that corroborates)
On Mon, Jan 28, 2019 at 02:44:10PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 29, 2018 at 12:23:21PM +0100, Heiko Carstens wrote:
>
> > And indeed, if I run only this test case in an endless loop and do
> > some parallel work (like kernel compile) it currently seems to be
> > possible to reproduce the warning:
> >
> > while true; do time ./testrun.sh nptl/tst-robustpi8 --direct ; done
> >
> > within the build directory of glibc (2.28).
>
> Right; so that reproduces for me.
>
> After staring at all that for a while; trying to remember how it all
> worked (or was supposed to work, rather), I became suspicious of commit:
>
> 56222b212e8e ("futex: Drop hb->lock before enqueueing on the rtmutex")
>
> And indeed, when I revert that; the above reproducer no longer works (as
> in, it no longer triggers in minutes and has -- so far -- held up for an
> hour+ or so).
>
> That patch in particular allows futex_unlock_pi() to 'start' early:
>
>
> futex_lock_pi() futex_unlock_pi()
> lock hb
> queue
> lock wait_lock
> unlock hb
> lock hb
> futex_top_waiter
> get_pi_state
> lock wait_lock
> rt_mutex_proxy_start // fail
> unlock wait_lock
> // acquired wait_lock
unlock hb
> wake_futex_pi()
> rt_mutex_next_owner() // whoops, no waiter
> WARN
and simply removing that WARN would allow futex_unlock_pi() to spin on
retry until the futex_lock_pi() CPU comes around to doing the lock hb
below:
> lock hb
> unqueue_me_pi
Which seems undesirable from a determinism POV.
On Mon, 28 Jan 2019, Peter Zijlstra wrote:
> On Mon, Jan 28, 2019 at 02:44:10PM +0100, Peter Zijlstra wrote:
> > On Thu, Nov 29, 2018 at 12:23:21PM +0100, Heiko Carstens wrote:
> >
> > > And indeed, if I run only this test case in an endless loop and do
> > > some parallel work (like kernel compile) it currently seems to be
> > > possible to reproduce the warning:
> > >
> > > while true; do time ./testrun.sh nptl/tst-robustpi8 --direct ; done
> > >
> > > within the build directory of glibc (2.28).
> >
> > Right; so that reproduces for me.
> >
> > After staring at all that for a while; trying to remember how it all
> > worked (or was supposed to work, rather), I became suspicious of commit:
> >
> > 56222b212e8e ("futex: Drop hb->lock before enqueueing on the rtmutex")
> >
> > And indeed, when I revert that; the above reproducer no longer works (as
> > in, it no longer triggers in minutes and has -- so far -- held up for an
> > hour+ or so).
Right, after staring long enough at it: the commit simply forgot to give
__rt_mutex_start_proxy_lock() the same treatment as it gave to
rt_mutex_wait_proxy_lock().
Patch below cures that.
Thanks,
tglx
8<----------------
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2845,7 +2845,7 @@ static int futex_lock_pi(u32 __user *uad
ret = rt_mutex_futex_trylock(&q.pi_state->pi_mutex);
/* Fixup the trylock return value: */
ret = ret ? 0 : -EWOULDBLOCK;
- goto no_block;
+ goto cleanup;
}
rt_mutex_init_waiter(&rt_waiter);
@@ -2870,17 +2870,15 @@ static int futex_lock_pi(u32 __user *uad
if (ret) {
if (ret == 1)
ret = 0;
-
- spin_lock(q.lock_ptr);
goto no_block;
}
-
if (unlikely(to))
hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
+no_block:
spin_lock(q.lock_ptr);
/*
* If we failed to acquire the lock (signal/timeout), we must
@@ -2894,7 +2892,7 @@ static int futex_lock_pi(u32 __user *uad
if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
ret = 0;
-no_block:
+cleanup:
/*
* Fixup the pi_state owner and possibly acquire the lock if we
* haven't already.
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1749,9 +1749,6 @@ int __rt_mutex_start_proxy_lock(struct r
ret = 0;
}
- if (unlikely(ret))
- remove_waiter(lock, waiter);
-
debug_rt_mutex_print_deadlock(waiter);
return ret;
@@ -1778,6 +1775,8 @@ int rt_mutex_start_proxy_lock(struct rt_
raw_spin_lock_irq(&lock->wait_lock);
ret = __rt_mutex_start_proxy_lock(lock, waiter, task);
+ if (unlikely(ret))
+ remove_waiter(lock, waiter);
raw_spin_unlock_irq(&lock->wait_lock);
return ret;
On Mon, Jan 28, 2019 at 04:53:19PM +0100, Thomas Gleixner wrote:
> Right, after staring long enough at it: the commit simply forgot to give
> __rt_mutex_start_proxy_lock() the same treatment as it gave to
> rt_mutex_wait_proxy_lock().
>
> Patch below cures that.
Yes, that is a very nice solution. Find below an updated patch that
includes a few comments. I found it harder than it should be to
reconstruct this code.
(also, I flipped the label names)
---
kernel/futex.c | 28 ++++++++++++++++++----------
kernel/locking/rtmutex.c | 37 ++++++++++++++++++++++++++++++++-----
2 files changed, 50 insertions(+), 15 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index fdd312da0992..9d8411d3142d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2861,35 +2861,39 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
* and BUG when futex_unlock_pi() interleaves with this.
*
* Therefore acquire wait_lock while holding hb->lock, but drop the
- * latter before calling rt_mutex_start_proxy_lock(). This still fully
- * serializes against futex_unlock_pi() as that does the exact same
- * lock handoff sequence.
+ * latter before calling __rt_mutex_start_proxy_lock(). This
+ * interleaves with futex_unlock_pi() -- which does a similar lock
+ * handoff -- such that the latter can observe the futex_q::pi_state
+ * before __rt_mutex_start_proxy_lock() is done.
*/
raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
spin_unlock(q.lock_ptr);
+ /*
+ * __rt_mutex_start_proxy_lock() unconditionally enqueues the @rt_waiter
+ * such that futex_unlock_pi() is guaranteed to observe the waiter when
+ * it sees the futex_q::pi_state.
+ */
ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current);
raw_spin_unlock_irq(&q.pi_state->pi_mutex.wait_lock);
if (ret) {
if (ret == 1)
ret = 0;
-
- spin_lock(q.lock_ptr);
- goto no_block;
+ goto cleanup;
}
-
if (unlikely(to))
hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
+cleanup:
spin_lock(q.lock_ptr);
/*
- * If we failed to acquire the lock (signal/timeout), we must
+ * If we failed to acquire the lock (deadlock/signal/timeout), we must
* first acquire the hb->lock before removing the lock from the
- * rt_mutex waitqueue, such that we can keep the hb and rt_mutex
- * wait lists consistent.
+ * rt_mutex waitqueue, such that we can keep the hb and rt_mutex wait
+ * lists consistent.
*
* In particular; it is important that futex_unlock_pi() can not
* observe this inconsistency.
@@ -3013,6 +3017,10 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
* there is no point where we hold neither; and therefore
* wake_futex_pi() must observe a state consistent with what we
* observed.
+ *
+ * In particular; this forces __rt_mutex_start_proxy_lock() to
+ * complete such that we're guaranteed to observe the
+ * rt_waiter. Also see the WARN in wake_futex_pi().
*/
raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
spin_unlock(&hb->lock);
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 581edcc63c26..afaf37d0ac15 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1726,12 +1726,33 @@ void rt_mutex_proxy_unlock(struct rt_mutex *lock,
rt_mutex_set_owner(lock, NULL);
}
+/**
+ * __rt_mutex_start_proxy_lock() - Start lock acquisition for another task
+ * @lock: the rt_mutex to take
+ * @waiter: the pre-initialized rt_mutex_waiter
+ * @task: the task to prepare
+ *
+ * Starts the rt_mutex acquire; it enqueues the @waiter and does deadlock
+ * detection. It does not wait, see rt_mutex_wait_proxy_lock() for that.
+ *
+ * NOTE: does _NOT_ remove the @waiter on failure; must either call
+ * rt_mutex_wait_proxy_lock() or rt_mutex_cleanup_proxy_lock() after this.
+ *
+ * Returns:
+ * 0 - task blocked on lock
+ * 1 - acquired the lock for task, caller should wake it up
+ * <0 - error
+ *
+ * Special API call for PI-futex support.
+ */
int __rt_mutex_start_proxy_lock(struct rt_mutex *lock,
struct rt_mutex_waiter *waiter,
struct task_struct *task)
{
int ret;
+ lockdep_asssert_held(&lock->wait_lock);
+
if (try_to_take_rt_mutex(lock, task, NULL))
return 1;
@@ -1749,9 +1770,6 @@ int __rt_mutex_start_proxy_lock(struct rt_mutex *lock,
ret = 0;
}
- if (unlikely(ret))
- remove_waiter(lock, waiter);
-
debug_rt_mutex_print_deadlock(waiter);
return ret;
@@ -1763,12 +1781,18 @@ int __rt_mutex_start_proxy_lock(struct rt_mutex *lock,
* @waiter: the pre-initialized rt_mutex_waiter
* @task: the task to prepare
*
+ * Starts the rt_mutex acquire; it enqueues the @waiter and does deadlock
+ * detection. It does not wait, see rt_mutex_wait_proxy_lock() for that.
+ *
+ * NOTE: unlike __rt_mutex_start_proxy_lock this _DOES_ remove the @waiter
+ * on failure.
+ *
* Returns:
* 0 - task blocked on lock
* 1 - acquired the lock for task, caller should wake it up
* <0 - error
*
- * Special API call for FUTEX_REQUEUE_PI support.
+ * Special API call for PI-futex support.
*/
int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
struct rt_mutex_waiter *waiter,
@@ -1778,6 +1802,8 @@ int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
raw_spin_lock_irq(&lock->wait_lock);
ret = __rt_mutex_start_proxy_lock(lock, waiter, task);
+ if (unlikely(ret))
+ remove_waiter(lock, waiter);
raw_spin_unlock_irq(&lock->wait_lock);
return ret;
@@ -1845,7 +1871,8 @@ int rt_mutex_wait_proxy_lock(struct rt_mutex *lock,
* @lock: the rt_mutex we were woken on
* @waiter: the pre-initialized rt_mutex_waiter
*
- * Attempt to clean up after a failed rt_mutex_wait_proxy_lock().
+ * Attempt to clean up after a failed __rt_mutex_start_proxy_lock() or
+ * rt_mutex_wait_proxy_lock().
*
* Unless we acquired the lock; we're still enqueued on the wait-list and can
* in fact still be granted ownership until we're removed. Therefore we can
On Mon, Jan 28, 2019 at 04:53:19PM +0100, Thomas Gleixner wrote:
> On Mon, 28 Jan 2019, Peter Zijlstra wrote:
> > On Mon, Jan 28, 2019 at 02:44:10PM +0100, Peter Zijlstra wrote:
> > > On Thu, Nov 29, 2018 at 12:23:21PM +0100, Heiko Carstens wrote:
> > >
> > > > And indeed, if I run only this test case in an endless loop and do
> > > > some parallel work (like kernel compile) it currently seems to be
> > > > possible to reproduce the warning:
> > > >
> > > > while true; do time ./testrun.sh nptl/tst-robustpi8 --direct ; done
> > > >
> > > > within the build directory of glibc (2.28).
> > >
> > > Right; so that reproduces for me.
> > >
> > > After staring at all that for a while; trying to remember how it all
> > > worked (or was supposed to work, rather), I became suspicious of commit:
> > >
> > > 56222b212e8e ("futex: Drop hb->lock before enqueueing on the rtmutex")
> > >
> > > And indeed, when I revert that; the above reproducer no longer works (as
> > > in, it no longer triggers in minutes and has -- so far -- held up for an
> > > hour+ or so).
>
> Right, after staring long enough at it: the commit simply forgot to give
> __rt_mutex_start_proxy_lock() the same treatment as it gave to
> rt_mutex_wait_proxy_lock().
>
> Patch below cures that.
With your patch the kernel warning doesn't occur anymore. So if this
is supposed to be the fix feel free to add:
Tested-by: Heiko Carstens <[email protected]>
However now I see every now and then the following failure from the
same test case:
tst-robustpi8: ../nptl/pthread_mutex_lock.c:425: __pthread_mutex_lock_full: Assertion `INTERNAL_SYSCALL_ERRNO (e, __err) != ESRCH || !robust' failed.
/* ESRCH can happen only for non-robust PI mutexes where
the owner of the lock died. */
assert (INTERNAL_SYSCALL_ERRNO (e, __err) != ESRCH || !robust);
I just verified that this happened also without your patch, I just
didn't see it since I started my tests with panic_on_warn=1 and the
warning triggered always earlier.
So, this seems to be something different.
On Tue, Jan 29, 2019 at 10:01:08AM +0100, Heiko Carstens wrote:
> However now I see every now and then the following failure from the
> same test case:
>
> tst-robustpi8: ../nptl/pthread_mutex_lock.c:425: __pthread_mutex_lock_full: Assertion `INTERNAL_SYSCALL_ERRNO (e, __err) != ESRCH || !robust' failed.
>
> /* ESRCH can happen only for non-robust PI mutexes where
> the owner of the lock died. */
> assert (INTERNAL_SYSCALL_ERRNO (e, __err) != ESRCH || !robust);
>
> I just verified that this happened also without your patch, I just
> didn't see it since I started my tests with panic_on_warn=1 and the
> warning triggered always earlier.
> So, this seems to be something different.
*groan*... so while the other thing reproduced, I've not seen this
happen and I ran the test for hours.
I'll start it up again, see what happens..
On Tue, 29 Jan 2019, Heiko Carstens wrote:
> On Mon, Jan 28, 2019 at 04:53:19PM +0100, Thomas Gleixner wrote:
> > Patch below cures that.
>
> With your patch the kernel warning doesn't occur anymore. So if this
> is supposed to be the fix feel free to add:
Yes, it's supposed to be the fix.
>
> However now I see every now and then the following failure from the
> same test case:
>
> tst-robustpi8: ../nptl/pthread_mutex_lock.c:425: __pthread_mutex_lock_full: Assertion `INTERNAL_SYSCALL_ERRNO (e, __err) != ESRCH || !robust' failed.
>
> /* ESRCH can happen only for non-robust PI mutexes where
> the owner of the lock died. */
> assert (INTERNAL_SYSCALL_ERRNO (e, __err) != ESRCH || !robust);
>
> I just verified that this happened also without your patch, I just
> didn't see it since I started my tests with panic_on_warn=1 and the
> warning triggered always earlier.
> So, this seems to be something different.
Moo. I ran the test loop all night (simply because I forgot to stop it) and
of course this does not trigger here. Could you try to gather a bit more
information with lightweight tracing?
Something like the below should give us at least a clue. Stop the test loop
when the assert triggers and then extract the trace.
Thanks,
tglx
8<--------------
#!/bin/sh
# Substitute with your ld comm string if it starts differently
C=ld-linux
echo "comm ~ \"$C*\"" >/sys/kernel/debug/tracing/events/syscalls/sys_enter_futex/filter
echo "comm ~ \"$C*\"" >/sys/kernel/debug/tracing/events/syscalls/sys_exit_futex/filter
echo "comm ~ \"$C*\"" >/sys/kernel/debug/tracing/events/sched/sched_process_exit/filter
echo "prev_comm ~ \"$C*\" || next_comm ~ \"$C*\"" >/sys/kernel/debug/tracing/events/sched/sched_switch/filter
echo 1 > /sys/kernel/debug/tracing/events/syscalls/sys_enter_futex/enable
echo 1 > /sys/kernel/debug/tracing/events/syscalls/sys_exit_futex/enable
echo 1 > /sys/kernel/debug/tracing/events/sched/sched_process_exit/enable
echo 1 > /sys/kernel/debug/tracing/events/sched/sched_switch/enable
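Usage is something like this (a sketch; the futex-trace.sh name is made up,
and tracefs is assumed at its usual debugfs location):

# Apply the filters/enables above, run the reproducer, and once the
# glibc assert fires, stop the loop and save the trace buffer:
sh futex-trace.sh
while true; do time ./testrun.sh nptl/tst-robustpi8 --direct; done
cat /sys/kernel/debug/tracing/trace > futex-trace.log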
On Tue, Jan 29, 2019 at 10:45:44AM +0100, Thomas Gleixner wrote:
> On Tue, 29 Jan 2019, Heiko Carstens wrote:
> > On Mon, Jan 28, 2019 at 04:53:19PM +0100, Thomas Gleixner wrote:
> > > Patch below cures that.
> >
> > With your patch the kernel warning doesn't occur anymore. So if this
> > is supposed to be the fix feel free to add:
>
> Yes, it's supposed to be the fix.
> >
> > However now I see every now and then the following failure from the
> > same test case:
> >
> > tst-robustpi8: ../nptl/pthread_mutex_lock.c:425: __pthread_mutex_lock_full: Assertion `INTERNAL_SYSCALL_ERRNO (e, __err) != ESRCH || !robust' failed.
> >
> > /* ESRCH can happen only for non-robust PI mutexes where
> > the owner of the lock died. */
> > assert (INTERNAL_SYSCALL_ERRNO (e, __err) != ESRCH || !robust);
> >
> > I just verified that this happened also without your patch, I just
> > didn't see it since I started my tests with panic_on_warn=1 and the
> > warning triggered always earlier.
> > So, this seems to be something different.
>
> Moo. I ran the test loop all night (simply because I forgot to stop it) and
> of course this does not trigger here. Could you try to gather a bit more
> information with lightweight tracing?
Yes, sure. However ;) I reproduced the above with v5.0-rc4 + your
patch. And now I am trying to reproduce with linux-next 20190129 +
your patch and it doesn't trigger. Did I miss a patch which is only in
linux-next which could fix this?
On Tue, Jan 29, 2019 at 11:24:09AM +0100, Heiko Carstens wrote:
> Yes, sure. However ;) I reproduced the above with v5.0-rc4 + your
> patch. And now I am trying to reproduce with linux-next 20190129 +
> your patch and it doesn't trigger. Did I miss a patch which is only in
> linux-next which could fix this?
>
I'm forever confused on what patch is where; but -ESRCH makes me think
maybe you lost this one:
---
commit da791a667536bf8322042e38ca85d55a78d3c273
Author: Thomas Gleixner <[email protected]>
Date: Mon Dec 10 14:35:14 2018 +0100
futex: Cure exit race
Stefan reported, that the glibc tst-robustpi4 test case fails
occasionally. That case creates the following race between
sys_exit() and sys_futex_lock_pi():
CPU0 CPU1
sys_exit() sys_futex()
do_exit() futex_lock_pi()
exit_signals(tsk) No waiters:
tsk->flags |= PF_EXITING; *uaddr == 0x00000PID
mm_release(tsk) Set waiter bit
exit_robust_list(tsk) { *uaddr = 0x80000PID;
Set owner died attach_to_pi_owner() {
*uaddr = 0xC0000000; tsk = get_task(PID);
} if (!tsk->flags & PF_EXITING) {
... attach();
tsk->flags |= PF_EXITPIDONE; } else {
if (!(tsk->flags & PF_EXITPIDONE))
return -EAGAIN;
return -ESRCH; <--- FAIL
}
ESRCH is returned all the way to user space, which triggers the glibc test
case assert. Returning ESRCH unconditionally is wrong here because the user
space value has been changed by the exiting task to 0xC0000000, i.e. the
FUTEX_OWNER_DIED bit is set and the futex PID value has been cleared. This
is a valid state and the kernel has to handle it, i.e. taking the futex.
Cure it by rereading the user space value when PF_EXITING and PF_EXITPIDONE
is set in the task which 'owns' the futex. If the value has changed, let
the kernel retry the operation, which includes all regular sanity checks
and correctly handles the FUTEX_OWNER_DIED case.
If it hasn't changed, then return ESRCH as there is no way to distinguish
this case from malfunctioning user space. This happens when the exiting
task did not have a robust list, the robust list was corrupted or the user
space value in the futex was simply bogus.
Reported-by: Stefan Liebler <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: [email protected]
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200467
Link: https://lkml.kernel.org/r/[email protected]
diff --git a/kernel/futex.c b/kernel/futex.c
index f423f9b6577e..5cc8083a4c89 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1148,11 +1148,65 @@ static int attach_to_pi_state(u32 __user *uaddr, u32 uval,
return ret;
}
+static int handle_exit_race(u32 __user *uaddr, u32 uval,
+ struct task_struct *tsk)
+{
+ u32 uval2;
+
+ /*
+ * If PF_EXITPIDONE is not yet set, then try again.
+ */
+ if (tsk && !(tsk->flags & PF_EXITPIDONE))
+ return -EAGAIN;
+
+ /*
+ * Reread the user space value to handle the following situation:
+ *
+ * CPU0 CPU1
+ *
+ * sys_exit() sys_futex()
+ * do_exit() futex_lock_pi()
+ * futex_lock_pi_atomic()
+ * exit_signals(tsk) No waiters:
+ * tsk->flags |= PF_EXITING; *uaddr == 0x00000PID
+ * mm_release(tsk) Set waiter bit
+ * exit_robust_list(tsk) { *uaddr = 0x80000PID;
+ * Set owner died attach_to_pi_owner() {
+ * *uaddr = 0xC0000000; tsk = get_task(PID);
+ * } if (!tsk->flags & PF_EXITING) {
+ * ... attach();
+ * tsk->flags |= PF_EXITPIDONE; } else {
+ * if (!(tsk->flags & PF_EXITPIDONE))
+ * return -EAGAIN;
+ * return -ESRCH; <--- FAIL
+ * }
+ *
+ * Returning ESRCH unconditionally is wrong here because the
+ * user space value has been changed by the exiting task.
+ *
+ * The same logic applies to the case where the exiting task is
+ * already gone.
+ */
+ if (get_futex_value_locked(&uval2, uaddr))
+ return -EFAULT;
+
+ /* If the user space value has changed, try again. */
+ if (uval2 != uval)
+ return -EAGAIN;
+
+ /*
+ * The exiting task did not have a robust list, the robust list was
+ * corrupted or the user space value in *uaddr is simply bogus.
+ * Give up and tell user space.
+ */
+ return -ESRCH;
+}
+
/*
* Lookup the task for the TID provided from user space and attach to
* it after doing proper sanity checks.
*/
-static int attach_to_pi_owner(u32 uval, union futex_key *key,
+static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key,
struct futex_pi_state **ps)
{
pid_t pid = uval & FUTEX_TID_MASK;
@@ -1162,12 +1216,15 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
/*
* We are the first waiter - try to look up the real owner and attach
* the new pi_state to it, but bail out when TID = 0 [1]
+ *
+ * The !pid check is paranoid. None of the call sites should end up
+ * with pid == 0, but better safe than sorry. Let the caller retry
*/
if (!pid)
- return -ESRCH;
+ return -EAGAIN;
p = find_get_task_by_vpid(pid);
if (!p)
- return -ESRCH;
+ return handle_exit_race(uaddr, uval, NULL);
if (unlikely(p->flags & PF_KTHREAD)) {
put_task_struct(p);
@@ -1187,7 +1244,7 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
* set, we know that the task has finished the
* cleanup:
*/
- int ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
+ int ret = handle_exit_race(uaddr, uval, p);
raw_spin_unlock_irq(&p->pi_lock);
put_task_struct(p);
@@ -1244,7 +1301,7 @@ static int lookup_pi_state(u32 __user *uaddr, u32 uval,
* We are the first waiter - try to look up the owner based on
* @uval and attach to it.
*/
- return attach_to_pi_owner(uval, key, ps);
+ return attach_to_pi_owner(uaddr, uval, key, ps);
}
static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval)
@@ -1352,7 +1409,7 @@ static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
* attach to the owner. If that fails, no harm done, we only
* set the FUTEX_WAITERS bit in the user space variable.
*/
- return attach_to_pi_owner(uval, key, ps);
+ return attach_to_pi_owner(uaddr, newval, key, ps);
}
/**
On Tue, 29 Jan 2019, Peter Zijlstra wrote:
> On Tue, Jan 29, 2019 at 11:24:09AM +0100, Heiko Carstens wrote:
>
> > Yes, sure. However ;) I reproduced the above with v5.0-rc4 + your
> > patch. And now I am trying to reproduce with linux-next 20190129 +
> > your patch and it doesn't trigger. Did I miss a patch which is only in
> > linux-next which could fix this?
> >
>
> I'm forever confused on what patch is where; but -ESRCH makes me think
> maybe you lost this one:
>
> ---
>
> commit da791a667536bf8322042e38ca85d55a78d3c273
> Author: Thomas Gleixner <[email protected]>
> Date: Mon Dec 10 14:35:14 2018 +0100
That's in Linus' tree already ...
On Tue, Jan 29, 2019 at 02:03:01PM +0100, Thomas Gleixner wrote:
> On Tue, 29 Jan 2019, Peter Zijlstra wrote:
>
> > On Tue, Jan 29, 2019 at 11:24:09AM +0100, Heiko Carstens wrote:
> >
> > > Yes, sure. However ;) I reproduced the above with v5.0-rc4 + your
> > > patch. And now I am trying to reproduce with linux-next 20190129 +
> > > your patch and it doesn't trigger. Did I miss a patch which is only in
> > > linux-next which could fix this?
> > >
> >
> > I'm forever confused on what patch is where; but -ESRCH makes me think
> > maybe you lost this one:
> >
> > ---
> >
> > commit da791a667536bf8322042e38ca85d55a78d3c273
> > Author: Thomas Gleixner <[email protected]>
> > Date: Mon Dec 10 14:35:14 2018 +0100
>
> That's in Linus' tree already ...
Yes, for some reason it's hardly reproducible now. It did hit once
after an hour, but (of course) without tracing enabled... still
trying.
On 2019-01-29 16:10:58 [+0100], Heiko Carstens wrote:
> Finally... the trace output is quite large with 26 MB... Therefore an
> xz compressed attachment. Hope that's ok.
>
> The kernel used was linux-next 20190129 + your patch.
| ld64.so.1-10237 [006] .... 14232.031726: sys_futex(uaddr: 3ff88e80618, op: 7, val: 3ff00000007, utime: 3ff88e7f910, uaddr2: 3ff88e7f910, val3: 3ffc167e8d7)
FUTEX_UNLOCK_PI | SHARED
| ld64.so.1-10237 [006] .... 14232.031726: sys_futex -> 0x0
…
| ld64.so.1-10237 [006] .... 14232.051751: sched_process_exit: comm=ld64.so.1 pid=10237 prio=120
…
| ld64.so.1-10148 [006] .... 14232.061826: sys_futex(uaddr: 3ff88e80618, op: 6, val: 1, utime: 0, uaddr2: 2, val3: 0)
FUTEX_LOCK_PI | SHARED
| ld64.so.1-10148 [006] .... 14232.061826: sys_futex -> 0xfffffffffffffffd
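To decode the raw values above: op 6 is FUTEX_LOCK_PI and op 7 is
FUTEX_UNLOCK_PI (FUTEX_PRIVATE_FLAG is clear, hence the "SHARED"
annotations), and 0xfffffffffffffffd is -3, i.e. -ESRCH. A small sketch
using the uapi constants, not part of any posted patch:

/* decode.c: print the constants behind the trace annotations above. */
#include <stdio.h>
#include <string.h>
#include <linux/futex.h>

int main(void)
{
	long ret = -3;	/* == 0xfffffffffffffffd in the 64-bit trace output */

	printf("FUTEX_LOCK_PI=%d FUTEX_UNLOCK_PI=%d FUTEX_PRIVATE_FLAG=%d\n",
	       FUTEX_LOCK_PI, FUTEX_UNLOCK_PI, FUTEX_PRIVATE_FLAG);
	printf("ret=%ld -> %s\n", ret, strerror(-ret));	/* "No such process" */
	return 0;
}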
So there has got to be another task that acquired the lock in userland and
left since the last in-kernel user unlocked it. This might bring more
light to it:
diff --git a/kernel/futex.c b/kernel/futex.c
index 599da35c2768..aaa782a8a115 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1209,6 +1209,9 @@ static int handle_exit_race(u32 __user *uaddr, u32 uval,
* corrupted or the user space value in *uaddr is simply bogus.
* Give up and tell user space.
*/
+ trace_printk("uval2 vs uval %08x vs %08x (%d)\n", uval2, uval,
+ tsk ? tsk->pid : -1);
+ __WARN();
return -ESRCH;
}
@@ -1233,8 +1236,10 @@ static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key,
if (!pid)
return -EAGAIN;
p = find_get_task_by_vpid(pid);
- if (!p)
+ if (!p) {
+ trace_printk("Missing pid %d\n", pid);
return handle_exit_race(uaddr, uval, NULL);
+ }
if (unlikely(p->flags & PF_KTHREAD)) {
put_task_struct(p);
---
I am not sure, but isn't this the "known" issue where the kernel returns
ESRCH in a valid case and glibc upstream does not recognize it because
it is not a valid POSIX-defined error code? (I *think* the same is true for
-ENOMEM.) If it is, the following C snippet is a small test case:
---->8------
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdlib.h>
#include <pthread.h>
#include <stdio.h>
static char nothing[4096];
int main(void)
{
int fd;
ssize_t wn;
void *lockm;
pid_t child;
pthread_mutex_t *the_lock;
pthread_mutexattr_t mutexattr;
int ret;
fd = open("/dev/shm/futex-test-lock", O_RDWR | O_CREAT | O_TRUNC, 0644);
if (fd < 0) {
printf("Failed to create lock file: %m\n");
return 1;
}
wn = write(fd, nothing, sizeof(nothing));
if (wn != sizeof(nothing)) {
printf("Failed to write to file: %m\n");
goto out_unlink;
}
lockm = mmap(NULL, sizeof(nothing), PROT_READ | PROT_WRITE, MAP_SHARED,
fd, 0);
if (lockm == MAP_FAILED) {
printf("mmap() failed: %m\n");
goto out_unlink;
}
close(fd);
unlink("/dev/shm/futex-test-lock");
the_lock = lockm;
ret = pthread_mutexattr_init(&mutexattr);
ret |= pthread_mutexattr_setpshared(&mutexattr, PTHREAD_PROCESS_SHARED);
ret |= pthread_mutexattr_setprotocol(&mutexattr, PTHREAD_PRIO_INHERIT);
if (ret) {
printf("Something went wrong during init\n");
return 1;
}
ret = pthread_mutex_init(the_lock, &mutexattr);
if (ret) {
printf("Failed to init the lock\n");
return 1;
}
child = fork();
if (child < 0) {
printf("fork(): %m\n");
return 1;
}
if (!child) {
pthread_mutex_lock(the_lock);
exit(2);
}
sleep(2);
ret = pthread_mutex_lock(the_lock);
printf("-> %x\n", ret);
return 0;
out_unlink:
unlink("/dev/shm/futex-test-lock");
return 1;
}
---------------8<-----------------------
strace gives this:
|openat(AT_FDCWD, "/dev/shm/futex-test-lock", O_RDWR|O_CREAT|O_TRUNC, 0644) = 3
|write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
|mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0) = 0x7f5e23e37000
|close(3) = 0
|unlink("/dev/shm/futex-test-lock") = 0
…
|clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f5e23c1da10) = 25777
|strace: Process 25777 attached
|[pid 25776] nanosleep({tv_sec=2, tv_nsec=0}, <unfinished ...>
|[pid 25777] set_robust_list(0x7f5e23c1da20, 24) = 0
|[pid 25777] exit_group(2) = ?
|[pid 25777] +++ exited with 2 +++
|<... nanosleep resumed> {tv_sec=1, tv_nsec=999821679}) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
|--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=25777, si_uid=1000, si_status=2, si_utime=0, si_stime=0} ---
|restart_syscall(<... resuming interrupted nanosleep ...>) = 0
|futex(0x7f5e23e37000, FUTEX_LOCK_PI, NULL) = -1 ESRCH (No such process)
|pause(^Cstrace: Process 25776 detached
and if I remember correctly, if asserts are not enabled we end up with a
pause loop instead.
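To try it locally (a sketch, assuming the snippet above is saved as
futex-esrch.c):

gcc -o futex-esrch futex-esrch.c -pthread
strace -f ./futex-esrch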
Sebastian
On Tue, 29 Jan 2019, Sebastian Sewior wrote:
> On 2019-01-29 16:10:58 [+0100], Heiko Carstens wrote:
> > Finally... the trace output is quite large with 26 MB... Therefore an
> > xz compressed attachment. Hope that's ok.
> >
> > The kernel used was linux-next 20190129 + your patch.
> | ld64.so.1-10237 [006] .... 14232.031726: sys_futex(uaddr: 3ff88e80618, op: 7, val: 3ff00000007, utime: 3ff88e7f910, uaddr2: 3ff88e7f910, val3: 3ffc167e8d7)
> FUTEX_UNLOCK_PI | SHARED
>
> | ld64.so.1-10237 [006] .... 14232.031726: sys_futex -> 0x0
> …
> | ld64.so.1-10237 [006] .... 14232.051751: sched_process_exit: comm=ld64.so.1 pid=10237 prio=120
> …
> | ld64.so.1-10148 [006] .... 14232.061826: sys_futex(uaddr: 3ff88e80618, op: 6, val: 1, utime: 0, uaddr2: 2, val3: 0)
> FUTEX_LOCK_PI | SHARED
>
> | ld64.so.1-10148 [006] .... 14232.061826: sys_futex -> 0xfffffffffffffffd
>
> So there has got to be another task that acquired the lock in userland and
> left since the last in-kernel user unlocked it. This might bring more
Well, that would mean that this very task did not have a valid robust list,
which is very unlikely according to the test case.
We might actually stick a trace point into the robust list code as well.
> light to it:
>
> diff --git a/kernel/futex.c b/kernel/futex.c
> index 599da35c2768..aaa782a8a115 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -1209,6 +1209,9 @@ static int handle_exit_race(u32 __user *uaddr, u32 uval,
> * corrupted or the user space value in *uaddr is simply bogus.
> * Give up and tell user space.
> */
> + trace_printk("uval2 vs uval %08x vs %08x (%d)\n", uval2, uval,
> + tsk ? tsk->pid : -1);
> + __WARN();
> return -ESRCH;
> }
>
> @@ -1233,8 +1236,10 @@ static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key,
> if (!pid)
> return -EAGAIN;
> p = find_get_task_by_vpid(pid);
> - if (!p)
> + if (!p) {
> + trace_printk("Missing pid %d\n", pid);
> return handle_exit_race(uaddr, uval, NULL);
> + }
>
> if (unlikely(p->flags & PF_KTHREAD)) {
> put_task_struct(p);
Yep, that should give us some more clues.
> I am not sure, but isn't this the "known" issue where the kernel returns
> ESRCH in a valid case and glibc upstream does not recognize it because
> it is not a valid POSIX-defined error code? (I *think* the same is true for
> -ENOMEM.) If it is, the following C snippet is a small test case:
That testcase is not using robust futexes, but yes, it demonstrates that
glibc does not handle all documented error codes. But I don't think it has
anything to do with the problem at hand. Famous last words....
Thanks,
tglx
commit 56222b212e8e ("futex: Drop hb->lock before enqueueing on the
rtmutex") changed the locking rules in the futex code so that the hash
bucket lock is no longer held while the waiter is enqueued into the
rtmutex wait list. This made the lock and the unlock path symmetric, but
unfortunately the possible early exit from __rt_mutex_start_proxy_lock()
due to a detected deadlock was not updated accordingly. That allows a
concurrent unlocker to observe inconsistent state which triggers the
warning in the unlock path.
futex_lock_pi() futex_unlock_pi()
lock(hb->lock)
queue(hb_waiter) lock(hb->lock)
lock(rtmutex->wait_lock)
unlock(hb->lock)
// acquired hb->lock
hb_waiter = futex_top_waiter()
lock(rtmutex->wait_lock)
__rt_mutex_start_proxy_lock()
---> fail
remove(rtmutex_waiter);
---> returns -EDEADLOCK
unlock(rtmutex->wait_lock)
// acquired wait_lock
wake_futex_pi()
rt_mutex_next_owner()
--> returns NULL
--> WARN
lock(hb->lock)
unqueue(hb_waiter)
The problem is caused by the remove(rtmutex_waiter) in the failure case of
__rt_mutex_start_proxy_lock() as this lets the unlocker observe a waiter in the
hash bucket but no waiter on the rtmutex, i.e. inconsistent state.
The original commit handles this correctly for the other early return cases
(timeout, signal) by delaying the removal of the rtmutex waiter until the
returning task reacquired the hash bucket lock.
Treat the failure case of __rt_mutex_start_proxy_lock() in the same way and let
the existing cleanup code handle the eventual handover of the rtmutex
gracefully. The regular rt_mutex_start_proxy_lock() gains the rtmutex waiter
removal for the failure case, so that the other callsites are still
operating correctly.
Add proper comments to the code so all these details are fully documented.
Thanks to Peter for helping with the analysis and writing the really
valuable code comments.
Fixes: 56222b212e8e ("futex: Drop hb->lock before enqueueing on the rtmutex")
Reported-by: Heiko Carstens <[email protected]>
Co-developed-by: Peter Zijlstra <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Heiko Carstens <[email protected]>
Cc: [email protected]
---
kernel/futex.c | 28 ++++++++++++++++++----------
kernel/locking/rtmutex.c | 37 ++++++++++++++++++++++++++++++++-----
2 files changed, 50 insertions(+), 15 deletions(-)
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2861,35 +2861,39 @@ static int futex_lock_pi(u32 __user *uad
* and BUG when futex_unlock_pi() interleaves with this.
*
* Therefore acquire wait_lock while holding hb->lock, but drop the
- * latter before calling rt_mutex_start_proxy_lock(). This still fully
- * serializes against futex_unlock_pi() as that does the exact same
- * lock handoff sequence.
+ * latter before calling __rt_mutex_start_proxy_lock(). This
+ * interleaves with futex_unlock_pi() -- which does a similar lock
+ * handoff -- such that the latter can observe the futex_q::pi_state
+ * before __rt_mutex_start_proxy_lock() is done.
*/
raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
spin_unlock(q.lock_ptr);
+ /*
+ * __rt_mutex_start_proxy_lock() unconditionally enqueues the @rt_waiter
+ * such that futex_unlock_pi() is guaranteed to observe the waiter when
+ * it sees the futex_q::pi_state.
+ */
ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current);
raw_spin_unlock_irq(&q.pi_state->pi_mutex.wait_lock);
if (ret) {
if (ret == 1)
ret = 0;
-
- spin_lock(q.lock_ptr);
- goto no_block;
+ goto cleanup;
}
-
if (unlikely(to))
hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
+cleanup:
spin_lock(q.lock_ptr);
/*
- * If we failed to acquire the lock (signal/timeout), we must
+ * If we failed to acquire the lock (deadlock/signal/timeout), we must
* first acquire the hb->lock before removing the lock from the
- * rt_mutex waitqueue, such that we can keep the hb and rt_mutex
- * wait lists consistent.
+ * rt_mutex waitqueue, such that we can keep the hb and rt_mutex wait
+ * lists consistent.
*
* In particular; it is important that futex_unlock_pi() can not
* observe this inconsistency.
@@ -3013,6 +3017,10 @@ static int futex_unlock_pi(u32 __user *u
* there is no point where we hold neither; and therefore
* wake_futex_pi() must observe a state consistent with what we
* observed.
+ *
+ * In particular; this forces __rt_mutex_start_proxy_lock() to
+ * complete such that we're guaranteed to observe the
+ * rt_waiter. Also see the WARN in wake_futex_pi().
*/
raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
spin_unlock(&hb->lock);
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1726,12 +1726,33 @@ void rt_mutex_proxy_unlock(struct rt_mut
rt_mutex_set_owner(lock, NULL);
}
+/**
+ * __rt_mutex_start_proxy_lock() - Start lock acquisition for another task
+ * @lock: the rt_mutex to take
+ * @waiter: the pre-initialized rt_mutex_waiter
+ * @task: the task to prepare
+ *
+ * Starts the rt_mutex acquire; it enqueues the @waiter and does deadlock
+ * detection. It does not wait, see rt_mutex_wait_proxy_lock() for that.
+ *
+ * NOTE: does _NOT_ remove the @waiter on failure; must either call
+ * rt_mutex_wait_proxy_lock() or rt_mutex_cleanup_proxy_lock() after this.
+ *
+ * Returns:
+ * 0 - task blocked on lock
+ * 1 - acquired the lock for task, caller should wake it up
+ * <0 - error
+ *
+ * Special API call for PI-futex support.
+ */
int __rt_mutex_start_proxy_lock(struct rt_mutex *lock,
struct rt_mutex_waiter *waiter,
struct task_struct *task)
{
int ret;
+ lockdep_asssert_held(&lock->wait_lock);
+
if (try_to_take_rt_mutex(lock, task, NULL))
return 1;
@@ -1749,9 +1770,6 @@ int __rt_mutex_start_proxy_lock(struct r
ret = 0;
}
- if (unlikely(ret))
- remove_waiter(lock, waiter);
-
debug_rt_mutex_print_deadlock(waiter);
return ret;
@@ -1763,12 +1781,18 @@ int __rt_mutex_start_proxy_lock(struct r
* @waiter: the pre-initialized rt_mutex_waiter
* @task: the task to prepare
*
+ * Starts the rt_mutex acquire; it enqueues the @waiter and does deadlock
+ * detection. It does not wait, see rt_mutex_wait_proxy_lock() for that.
+ *
+ * NOTE: unlike __rt_mutex_start_proxy_lock this _DOES_ remove the @waiter
+ * on failure.
+ *
* Returns:
* 0 - task blocked on lock
* 1 - acquired the lock for task, caller should wake it up
* <0 - error
*
- * Special API call for FUTEX_REQUEUE_PI support.
+ * Special API call for PI-futex support.
*/
int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
struct rt_mutex_waiter *waiter,
@@ -1778,6 +1802,8 @@ int rt_mutex_start_proxy_lock(struct rt_
raw_spin_lock_irq(&lock->wait_lock);
ret = __rt_mutex_start_proxy_lock(lock, waiter, task);
+ if (unlikely(ret))
+ remove_waiter(lock, waiter);
raw_spin_unlock_irq(&lock->wait_lock);
return ret;
@@ -1845,7 +1871,8 @@ int rt_mutex_wait_proxy_lock(struct rt_m
* @lock: the rt_mutex we were woken on
* @waiter: the pre-initialized rt_mutex_waiter
*
- * Attempt to clean up after a failed rt_mutex_wait_proxy_lock().
+ * Attempt to clean up after a failed __rt_mutex_start_proxy_lock() or
+ * rt_mutex_wait_proxy_lock().
*
* Unless we acquired the lock; we're still enqueued on the wait-list and can
* in fact still be granted ownership until we're removed. Therefore we can
On Tue, 29 Jan 2019, Thomas Gleixner wrote:
> int __rt_mutex_start_proxy_lock(struct rt_mutex *lock,
> struct rt_mutex_waiter *waiter,
> struct task_struct *task)
> {
> int ret;
>
> + lockdep_asssert_held(&lock->wait_lock);
I'm a moron. I fixed that on the test box and wrote the changelog on the
laptop. Syncing the two together was too much for my futex-wrecked brain
...
On Wed, 30 Jan 2019, Heiko Carstens wrote:
> On Tue, Jan 29, 2019 at 06:16:53PM +0100, Sebastian Sewior wrote:
> > if (unlikely(p->flags & PF_KTHREAD)) {
> > put_task_struct(p);
>
> Last lines of the trace with your additional patch (full log attached):
>
> <...>-50539 [003] .... 2376.398223: sys_futex -> 0x0
> <...>-50539 [003] .... 2376.398223: sys_futex(uaddr: 3ffb7700208, op: 6, val: 1, utime: 0, uaddr2: 3, val3: 0)
> <...>-50539 [003] .... 2376.398225: attach_to_pi_owner: Missing pid 50734
> <...>-50539 [003] .... 2376.398226: handle_exit_race: uval2 vs uval 8000c62e vs 8000c62e (-1)
So the user space value is 8000c62e. The FUTEX_WAITERS bit is set and the owner
of the futex is PID 50734, which exited a long time ago:
<...>-50734 [000] .... 2376.394936: sched_process_exit: comm=ld64.so.1 pid=50734 prio=120
But at least from the kernel's view 50734 released it last:
<...>-50734 [000] .... 2376.394930: sys_futex(uaddr: 3ffb7700208, op: 7, val: 3ff00000007, utime: 3ffb3ef8910, uaddr2: 3ffb3ef8910, val3: 3ffc0afe987)
<...>-50539 [003] .... 2376.398223: sys_futex(uaddr: 3ffb7700208, op: 6, val: 1, utime: 0, uaddr2: 3, val3: 0)
Now, if it had acquired it in userspace again before exiting, then
the robust list exit code should have set the OWNER_DIED bit as well, but
that's not set....
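As an aside, decoding such futex words is mechanical; a small sketch using
the uapi bits from <linux/futex.h>, not part of any posted patch:

/* futex-word.c: split a PI futex word into TID and state bits. */
#include <stdio.h>
#include <stdint.h>
#include <linux/futex.h>

static void decode(uint32_t uval)
{
	printf("%08x: tid=%u waiters=%d owner_died=%d\n", uval,
	       uval & FUTEX_TID_MASK,		/* 0x3fffffff */
	       !!(uval & FUTEX_WAITERS),	/* 0x80000000 */
	       !!(uval & FUTEX_OWNER_DIED));	/* 0x40000000 */
}

int main(void)
{
	decode(0x8000c62e);	/* -> 8000c62e: tid=50734 waiters=1 owner_died=0 */
	return 0;
}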
debug patch for the robust list exit handling below.
Thanks,
tglx
8<-----------------
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3440,10 +3440,13 @@ static int handle_futex_death(u32 __user
{
u32 uval, uninitialized_var(nval), mval;
+ trace_printk("uaddr: %lx pi: %d\n", (unsigned long) uaddr, pi);
retry:
if (get_user(uval, uaddr))
return -1;
+ trace_printk("uaddr: %lx uval: %08x\n", (unsigned long) uaddr, uval);
+
if ((uval & FUTEX_TID_MASK) == task_pid_vnr(curr)) {
/*
* Ok, this dying thread is truly holding a futex
@@ -3480,6 +3483,7 @@ static int handle_futex_death(u32 __user
if (!pi && (uval & FUTEX_WAITERS))
futex_wake(uaddr, 1, 1, FUTEX_BITSET_MATCH_ANY);
}
+ trace_printk("uaddr: %lx success\n", (unsigned long) uaddr);
return 0;
}
On 2019-01-30 13:59:55 [+0100], Heiko Carstens wrote:
> Last lines of trace below (full log attached):
<...>-56956 [005] .... 658.931364: handle_futex_death: uaddr: 3ff9e880c58 pi: 1
…
<...>-56956 [005] .... 658.931369: handle_futex_death: uaddr: 3ff9e8808c0 success
<...>-56956 [005] .... 658.931370: sched_process_exit: comm=ld64.so.1 pid=56956 prio=120
not including 3ff9e880140
<...>-56956 [005] .... 658.931370: sched_process_exit: comm=ld64.so.1 pid=56956 prio=120
<...>-56496 [001] .... 658.932404: sys_futex(uaddr: 3ff9e880140, op: 6, val: 1, utime: 0, uaddr2: 5, val3: 0)
<...>-56496 [001] .... 658.932405: attach_to_pi_owner: Missing pid 56956
<...>-56496 [001] .... 658.932406: handle_exit_race: uval2 vs uval 8000de7c vs 8000de7c (-1)
<...>-56496 [001] .... 658.932501: sys_futex -> 0xfffffffffffffffd
…
<...>-56496 [001] .... 659.020750: handle_futex_death: uaddr: 3ff9e880140 pi: 1
<...>-56496 [001] .... 659.020750: handle_futex_death: uaddr: 3ff9e880140 uval: 8000de7c
<...>-56496 [001] .... 659.020750: handle_futex_death: uaddr: 3ff9e880140 success
and here we have it.
Sebastian
On Wed, 30 Jan 2019, Heiko Carstens wrote:
> On Wed, Jan 30, 2019 at 01:15:18PM +0100, Thomas Gleixner wrote:
> > On Wed, 30 Jan 2019, Heiko Carstens wrote:
> > > On Tue, Jan 29, 2019 at 06:16:53PM +0100, Sebastian Sewior wrote:
> > > > if (unlikely(p->flags & PF_KTHREAD)) {
> > > > put_task_struct(p);
> > >
> > > Last lines of the trace with your additional patch (full log attached):
> > >
> > > <...>-50539 [003] .... 2376.398223: sys_futex -> 0x0
> > > <...>-50539 [003] .... 2376.398223: sys_futex(uaddr: 3ffb7700208, op: 6, val: 1, utime: 0, uaddr2: 3, val3: 0)
> > > <...>-50539 [003] .... 2376.398225: attach_to_pi_owner: Missing pid 50734
> > > <...>-50539 [003] .... 2376.398226: handle_exit_race: uval2 vs uval 8000c62e vs 8000c62e (-1)
> >
> > So the user space value is 8000c62e. The FUTEX_WAITERS bit is set and the owner
> > of the futex is PID 50734, which exited a long time ago:
> >
> > <...>-50734 [000] .... 2376.394936: sched_process_exit: comm=ld64.so.1 pid=50734 prio=120
> >
> > But at least from the kernel's view 50734 released it last:
> >
> > <...>-50734 [000] .... 2376.394930: sys_futex(uaddr: 3ffb7700208, op: 7, val: 3ff00000007, utime: 3ffb3ef8910, uaddr2: 3ffb3ef8910, val3: 3ffc0afe987)
> > <...>-50539 [003] .... 2376.398223: sys_futex(uaddr: 3ffb7700208, op: 6, val: 1, utime: 0, uaddr2: 3, val3: 0)
> >
> > Now, if it had acquired it in userspace again before exiting, then
> > the robust list exit code should have set the OWNER_DIED bit as well, but
> > that's not set....
> >
> > debug patch for the robust list exit handling below.
>
> Last lines of trace below (full log attached):
SNIP...
It's the same picture as last time and the only occurrence of the futex in
question in the context of the dead task is:
<...>-56956 [007] .... 658.804018: sys_futex(uaddr: 3ff9e880050, op: 7, val: 3ff00000007, utime: 3ff9b078910, uaddr2: 3ff9b078910, val3: 3ffea67e3f7)
The robust list exit of that task never touches the user space address 3ff9e880050.
Confused and of course the problem does not reproduce on x86. Sigh.
I'll think about it some more.
Thanks,
tglx
On Wed, 30 Jan 2019, Sebastian Sewior wrote:
> On 2019-01-30 13:59:55 [+0100], Heiko Carstens wrote:
> > Last lines of trace below (full log attached):
>
> <...>-56956 [005] .... 658.931364: handle_futex_death: uaddr: 3ff9e880c58 pi: 1
> …
> <...>-56956 [005] .... 658.931369: handle_futex_death: uaddr: 3ff9e8808c0 success
> <...>-56956 [005] .... 658.931370: sched_process_exit: comm=ld64.so.1 pid=56956 prio=120
>
> not including 3ff9e880140
>
> <...>-56956 [005] .... 658.931370: sched_process_exit: comm=ld64.so.1 pid=56956 prio=120
>
> <...>-56496 [001] .... 658.932404: sys_futex(uaddr: 3ff9e880140, op: 6, val: 1, utime: 0, uaddr2: 5, val3: 0)
> <...>-56496 [001] .... 658.932405: attach_to_pi_owner: Missing pid 56956
> <...>-56496 [001] .... 658.932406: handle_exit_race: uval2 vs uval 8000de7c vs 8000de7c (-1)
> <...>-56496 [001] .... 658.932501: sys_futex -> 0xfffffffffffffffd
> …
> <...>-56496 [001] .... 659.020750: handle_futex_death: uaddr: 3ff9e880140 pi: 1
> <...>-56496 [001] .... 659.020750: handle_futex_death: uaddr: 3ff9e880140 uval: 8000de7c
> <...>-56496 [001] .... 659.020750: handle_futex_death: uaddr: 3ff9e880140 success
>
> and here we have it.
Bah. I was looking at the wrong trace. /me goes to give a talk and stares
at that mess later.
On Wed, 30 Jan 2019, Thomas Gleixner wrote:
> On Wed, 30 Jan 2019, Sebastian Sewior wrote:
>
> > On 2019-01-30 13:59:55 [+0100], Heiko Carstens wrote:
> > > Last lines of trace below (full log attached):
> >
> > <...>-56956 [005] .... 658.931364: handle_futex_death: uaddr: 3ff9e880c58 pi: 1
> > …
> > <...>-56956 [005] .... 658.931369: handle_futex_death: uaddr: 3ff9e8808c0 success
> > <...>-56956 [005] .... 658.931370: sched_process_exit: comm=ld64.so.1 pid=56956 prio=120
> >
> > not including 3ff9e880140
> >
> > <...>-56956 [005] .... 658.931370: sched_process_exit: comm=ld64.so.1 pid=56956 prio=120
> >
> > <...>-56496 [001] .... 658.932404: sys_futex(uaddr: 3ff9e880140, op: 6, val: 1, utime: 0, uaddr2: 5, val3: 0)
> > <...>-56496 [001] .... 658.932405: attach_to_pi_owner: Missing pid 56956
> > <...>-56496 [001] .... 658.932406: handle_exit_race: uval2 vs uval 8000de7c vs 8000de7c (-1)
> > <...>-56496 [001] .... 658.932501: sys_futex -> 0xfffffffffffffffd
> > …
> > <...>-56496 [001] .... 659.020750: handle_futex_death: uaddr: 3ff9e880140 pi: 1
> > <...>-56496 [001] .... 659.020750: handle_futex_death: uaddr: 3ff9e880140 uval: 8000de7c
> > <...>-56496 [001] .... 659.020750: handle_futex_death: uaddr: 3ff9e880140 success
> >
> > and here we have it.
Well no. That futex is in the robust list of the dying task because it
tried to lock it and then died, probably before removing itself from the
robust list.
handle_futex_death() of that task does not touch it because the userspace
TID value is de7c, i.e. 56956, which is not the exiting task's TID, so the
owner check skips it (the 'success' trace only means the function returned
0, not that it changed anything).
The last entries with that uaddr are:
<...>-56956 [005] .... 658.923608: sys_futex(uaddr: 3ff9e880140, op: 7, val: 3ff00000007, utime: 3ff9b078910, uaddr2: 3ff9b078910, val3: 3ffea67e3f7)
UNLOCK
<...>-56945 [006] .... 658.923612: sys_futex(uaddr: 3ff9e880140, op: 6, val: 1, utime: 1003ff0, uaddr2: 3ff9e87f910, val3: 3ff0000de71)
LOCK
<...>-56956 [005] .... 658.923612: sys_futex(uaddr: 3ff9e880140, op: 7, val: 3ff00000007, utime: 3ff9b078910, uaddr2: 3ff9b078910, val3: 3ffea67e3f7)
UNLOCK
<...>-56945 [006] .... 658.923830: sys_futex(uaddr: 3ff9e880140, op: 7, val: 3ff00000007, utime: 3ff9e87f910, uaddr2: 3ff9e87f910, val3: 3ffea67e3f7)
UNLOCK
<...>-56496 [001] .... 658.932404: sys_futex(uaddr: 3ff9e880140, op: 6, val: 1, utime: 0, uaddr2: 5, val3: 0)
LOCK which fails.
This does not make any sense. The last kernel-visible operation of 56956 on
that uaddr is the UNLOCK above.
I need to think some more about what might happen.
Thanks,
tglx
On Wed, 30 Jan 2019, Thomas Gleixner wrote:
> On Wed, 30 Jan 2019, Thomas Gleixner wrote:
> The last entries with that uaddr are:
>
> <...>-56956 [005] .... 658.923608: sys_futex(uaddr: 3ff9e880140, op: 7, val: 3ff00000007, utime: 3ff9b078910, uaddr2: 3ff9b078910, val3: 3ffea67e3f7)
>
> UNLOCK
>
> <...>-56945 [006] .... 658.923612: sys_futex(uaddr: 3ff9e880140, op: 6, val: 1, utime: 1003ff0, uaddr2: 3ff9e87f910, val3: 3ff0000de71)
>
> LOCK
>
> <...>-56956 [005] .... 658.923612: sys_futex(uaddr: 3ff9e880140, op: 7, val: 3ff00000007, utime: 3ff9b078910, uaddr2: 3ff9b078910, val3: 3ffea67e3f7)
>
> UNLOCK
>
> <...>-56945 [006] .... 658.923830: sys_futex(uaddr: 3ff9e880140, op: 7, val: 3ff00000007, utime: 3ff9e87f910, uaddr2: 3ff9e87f910, val3: 3ffea67e3f7)
>
> UNLOCK
>
> <...>-56496 [001] .... 658.932404: sys_futex(uaddr: 3ff9e880140, op: 6, val: 1, utime: 0, uaddr2: 5, val3: 0)
>
> LOCK which fails.
>
> This does not make any sense. The last kernel visible operation of 56956 on
> that uaddr is the UNLOCK above.
>
> I need to think some more about what might happen.
TBH, no clue. Below are some more traceprintks which hopefully shed some
light on that mystery. See kernel/futex.c line 30 ...
Thanks,
tglx
8<--------------
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1502,6 +1502,8 @@ static int wake_futex_pi(u32 __user *uad
* died bit, because we are the owner.
*/
newval = FUTEX_WAITERS | task_pid_vnr(new_owner);
+ trace_printk("uaddr: %lx cur: %x new: %x\n",
+ (unsigned long) uaddr, uval, newval);
if (unlikely(should_fail_futex(true)))
ret = -EFAULT;
@@ -2431,6 +2433,8 @@ static int fixup_pi_state_owner(u32 __us
for (;;) {
newval = (uval & FUTEX_OWNER_DIED) | newtid;
+ trace_printk("uaddr: %lx cur: %x new: %x\n",
+ (unsigned long) uaddr, uval, newval);
if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval))
goto handle_fault;
if (curval == uval)
@@ -2438,6 +2442,8 @@ static int fixup_pi_state_owner(u32 __us
uval = curval;
}
+ trace_printk("uaddr: %lx cur: %x new: %x\n",
+ (unsigned long) uaddr, uval, newval);
/*
* We fixed up user space. Now we need to fix the pi_state
* itself.
@@ -3028,6 +3034,9 @@ static int futex_unlock_pi(u32 __user *u
/* drops pi_state->pi_mutex.wait_lock */
ret = wake_futex_pi(uaddr, uval, pi_state);
+ trace_printk("uaddr: %lx wake: %d\n",
+ (unsigned long) uaddr, ret);
+
put_pi_state(pi_state);
/*
@@ -3056,6 +3065,8 @@ static int futex_unlock_pi(u32 __user *u
goto out_putkey;
}
+ trace_printk("uaddr: %lx cur: %x new: %x\n",
+ (unsigned long) uaddr, uval, 0);
/*
* We have no kernel internal state, i.e. no waiters in the
* kernel. Waiters which are about to queue themselves are stuck
On 2019-01-30 18:56:54 [+0100], Thomas Gleixner wrote:
> TBH, no clue. Below are some more traceprintks which hopefully shed some
> light on that mystery. See kernel/futex.c line 30 ...
The robust list is somehow buggy. In the last trace we had the
handle_futex_death() of uaddr 3ff9e880140 as the last action. That means
it was the entry in 56496's ->list_op_pending. This makes sense
because that task tried to acquire the lock, failed, and got killed.
According to the uaddr value, pid 56956 is the owner. So 56956 invoked one
of pthread_mutex_lock() / pthread_mutex_timedlock() /
pthread_mutex_trylock() and should have obtained the lock in userland.
Depending on where it got killed, that mutex should be recorded either in
->list_op_pending or in the robust_list (or in both if it didn't clear
->list_op_pending yet). But it is not.
Similar for pthread_mutex_unlock().
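[Editorial note: for reference, this is the shape of the per-thread robust
list ABI the kernel walks at exit. The layout follows <linux/futex.h>
(userspace view, __user annotations dropped); the comments paraphrase this
discussion.]
	struct robust_list {
		struct robust_list *next;	/* next held robust mutex */
	};
	struct robust_list_head {
		struct robust_list list;	/* locks the thread holds */
		long futex_offset;		/* entry -> futex word offset */
		struct robust_list *list_op_pending;	/* lock being acquired
							 * or released when the
							 * thread dies mid-op */
	};
[At exit the kernel walks 'list' and handles 'list_op_pending' last; a lock
that appears in neither place is invisible to it, which is exactly the
situation described above.]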
We don't have a trace_point if we abort processing the list.
On the other hand, it didn't trigger on x86 for hours. Could the atomic
ops be the culprit?
> Thanks,
>
> tglx
Sebastian
On Wed, 30 Jan 2019, Sebastian Sewior wrote:
> On 2019-01-30 18:56:54 [+0100], Thomas Gleixner wrote:
> > TBH, no clue. Below are some more traceprintks which hopefully shed some
> > light on that mystery. See kernel/futex.c line 30 ...
>
> The robust list is somehow buggy. In the last trace we had the
> handle_futex_death() of uaddr 3ff9e880140 as the last action. That means
> it was the entry in 56496's ->list_op_pending. This makes sense
> because that task tried to acquire the lock, failed, and got killed.
The robust list of the failing task seems to be correct.
> According to the uaddr value, pid 56956 is the owner. So 56956 invoked one of
> pthread_mutex_lock() / pthread_mutex_timedlock() /
> pthread_mutex_trylock() and should have obtained the lock in userland.
> Depending on where it got killed, that mutex should be either recorded in
> ->list_op_pending or the robust_list (or both if it didn't clear
> ->list_op_pending yet). But it is not.
> Similar for pthread_mutex_unlock().
> We don't have a trace_point if we abort processing the list.
The only reason why it would abort is due to a page fault, because page
faults cannot be handled in the exit code anymore.
> On the other hand, it didn't trigger on x86 for hours. Could the atomic
s/hours/days/ ....
> ops be the culprit?
The glibc code does:
    THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
                   (void *) (((uintptr_t) &mutex->__data.__list.__next)
                             | 1));
    ....
    lock in user space
      or
    lock in kernel space
    ENQUEUE_MUTEX_PI (mutex);
    THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
ENQUEUE_MUTEX_PI() resolves to a THREAD_GETMEM() which reads the
list head from TLS, some list manipulation operations, and a final
THREAD_SETMEM() which stores the new list head.
Now on x86 THREAD_GETMEM() and THREAD_SETMEM() resolve to
    asm volatile ("movX .....")
while on s390 they are plain
    descr->member
based operations.
Now the important part of the robust list is the store sequence, i.e. the
list manipulation and the final update of the TLS-visible list head need to
come _before_ list_op_pending is cleared.
I might be missing something, but there is no compiler barrier in that code
which would prevent the compiler from reordering the stores. It can
rightfully do so because there is no compiler-visible dependency between
these two operations.
On x86_64 the asm volatile might prevent it by chance, but it does not have
a 'memory' clobber specified, which would guarantee a compiler barrier.
On s390 there is certainly nothing.
So assuming that clearing list_op_pending comes before the list head update,
the robust exit code in the kernel will fail to see either of
them. FAIL.
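[Editorial note: to make the suspected reordering concrete, here is a
minimal C sketch of the required store order. The types and names (tcb,
mutex_node, lock_and_enqueue) are illustrative stand-ins, not glibc's
actual internals; the barrier placement mirrors what the theory calls for.]
	struct mutex_node { struct mutex_node *next; };
	struct tcb {
		struct mutex_node *robust_list;		/* TLS-visible head */
		struct mutex_node *list_op_pending;	/* announced pending op */
	};
	#define compiler_barrier()	__asm__ __volatile__ ("" ::: "memory")
	static void lock_and_enqueue(struct tcb *self, struct mutex_node *m)
	{
		/* Announce the op so the kernel can clean up if we die here. */
		self->list_op_pending = m;
		/* ... acquire the lock in user space or in the kernel ... */
		compiler_barrier();	/* enqueue must not move before acquire */
		m->next = self->robust_list;	/* link the mutex ... */
		self->robust_list = m;		/* ... and publish the new head */
		compiler_barrier();	/* head store must precede the clear */
		self->list_op_pending = NULL;
		/* Without the barriers the compiler may emit the NULL store
		 * first; dying in that window leaves the lock on neither
		 * list, so the kernel's exit code never sees it. */
	}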
I might be wrong as usual, but this would definitely explain the fail very
well.
Thanks,
tglx
On Thu, Jan 31, 2019 at 12:13:51AM +0100, Thomas Gleixner wrote:
> On Wed, 30 Jan 2019, Sebastian Sewior wrote:
>
> > On 2019-01-30 18:56:54 [+0100], Thomas Gleixner wrote:
> > > TBH, no clue. Below are some more traceprintks which hopefully shed some
> > > light on that mystery. See kernel/futex.c line 30 ...
> >
> > The robust list is somehow buggy. In the last trace we had the
> > handle_futex_death() of uaddr 3ff9e880140 as the last action. That means
> > it was the entry in 56496's ->list_op_pending. This makes sense
> > because that task tried to acquire the lock, failed, and got killed.
>
> The robust list of the failing task seems to be correct.
>
> > According to the uaddr value, pid 56956 is the owner. So 56956 invoked one of
> > pthread_mutex_lock() / pthread_mutex_timedlock() /
> > pthread_mutex_trylock() and should have obtained the lock in userland.
> > Depending on where it got killed, that mutex should be either recorded in
> > ->list_op_pending or the robust_list (or both if it didn't clear
> > ->list_op_pending yet). But it is not.
> > Similar for pthread_mutex_unlock().
>
> > We don't have a trace_point if we abort processing the list.
>
> The only reason why it would abort is due to a page fault, because page
> faults cannot be handled in the exit code anymore.
>
> > On the other hand, it didn't trigger on x86 for hours. Could the atomic
>
> s/hours/days/ ....
>
> > ops be the culprit?
>
> The glibc code does:
>
> THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
> (void *) (((uintptr_t) &mutex->__data.__list.__next)
> | 1));
>
> ....
> lock in user space
>
> or
>
> lock in kernel space
>
> ENQUEUE_MUTEX_PI (mutex);
> THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
>
> ENQUEUE_MUTEX_PI() resolves to a THREAD_GETMEM() which reads the
> list head from TLS, some list manipulation operations and the final
> THREAD_SETMEM() which stores the new list head
>
> Now on x86 THREAD_GETMEM() and THREAD_SETMEM() are resolving to
>
> asm volatile ("movX .....")
>
> on s390 they are
>
> descr->member
>
> based operations.
>
> Now the important part of the robust list is the store sequence, i.e. the
> list head and final update to the TLS visible part need to come _before_
> list_op_pending is cleared.
>
> I might be missing something, but there is no compiler barrier in that code
> which would prevent the compiler from reordering the stores. It can
> rightfully do so because there is no compiler visible dependency of these
> two operations.
>
> On x86_64 the asm volatile might prevent it by chance, but it does not have
> a 'memory' clobber specified, which would guarantee a compiler barrier.
>
> On s390 there is certainly nothing.
>
> So assuming that clearing list_op_pending comes before the list head update,
> the robust exit code in the kernel will fail to see either of
> them. FAIL.
>
> I might be wrong as usual, but this would definitely explain the fail very
> well.
On recent versions of GCC, the fix would be to put this between the two
stores that need ordering:
__atomic_thread_fence(__ATOMIC_RELEASE);
I must defer to Heiko on whether s390 GCC might tear the stores. My
guess is "probably not". ;-)
Thanx, Paul
On Wed, 30 Jan 2019, Paul E. McKenney wrote:
> On Thu, Jan 31, 2019 at 12:13:51AM +0100, Thomas Gleixner wrote:
> > I might be wrong as usual, but this would definitely explain the fail very
> > well.
>
> On recent versions of GCC, the fix would be to put this between the two
> stores that need ordering:
>
> __atomic_thread_fence(__ATOMIC_RELEASE);
>
> I must defer to Heiko on whether s390 GCC might tear the stores. My
> guess is "probably not". ;-)
So I just checked the latest glibc code. It has:
    /* We must not enqueue the mutex before we have acquired it.
       Also see comments at ENQUEUE_MUTEX.  */
    __asm ("" ::: "memory");
    ENQUEUE_MUTEX_PI (mutex);
    /* We need to clear op_pending after we enqueue the mutex.  */
    __asm ("" ::: "memory");
    THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
8f9450a0b7a9 ("Add compiler barriers around modifications of the robust mutex list.")
in the glibc repository. There since Dec 24 2016 ...
So the question is whether this is sufficient. That ordering only
matters vs. the thread itself and not for others.
Thanks,
tglx
On Thu, 31 Jan 2019, Thomas Gleixner wrote:
> On Wed, 30 Jan 2019, Paul E. McKenney wrote:
> > On Thu, Jan 31, 2019 at 12:13:51AM +0100, Thomas Gleixner wrote:
> > > I might be wrong as usual, but this would definitely explain the fail very
> > > well.
> >
> > On recent versions of GCC, the fix would be to put this between the two
> > stores that need ordering:
> >
> > __atomic_thread_fence(__ATOMIC_RELEASE);
> >
> > I must defer to Heiko on whether s390 GCC might tear the stores. My
> > guess is "probably not". ;-)
>
> So I just checked the latest glibc code. It has:
>
> /* We must not enqueue the mutex before we have acquired it.
> Also see comments at ENQUEUE_MUTEX. */
> __asm ("" ::: "memory");
> ENQUEUE_MUTEX_PI (mutex);
> /* We need to clear op_pending after we enqueue the mutex. */
> __asm ("" ::: "memory");
> THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
>
> 8f9450a0b7a9 ("Add compiler barriers around modifications of the robust mutex list.")
>
> in the glibc repository. There since Dec 24 2016 ...
And of course, I'm using the latest greatest glibc for testing that, so I'm
not at all surprised that it just does not reproduce on my tests.
I just hacked the ordering and restarted the test. If the theory holds,
then this should die sooner than later.
Thanks,
tglx
On Thu, Jan 31, 2019 at 12:55:18AM +0100, Thomas Gleixner wrote:
> On Wed, 30 Jan 2019, Paul E. McKenney wrote:
> > On Thu, Jan 31, 2019 at 12:13:51AM +0100, Thomas Gleixner wrote:
> > > I might be wrong as usual, but this would definitely explain the fail very
> > > well.
> >
> > On recent versions of GCC, the fix would be to put this between the two
> > stores that need ordering:
> >
> > __atomic_thread_fence(__ATOMIC_RELEASE);
> >
> > I must defer to Heiko on whether s390 GCC might tear the stores. My
> > guess is "probably not". ;-)
>
> So I just checked the latest glibc code. It has:
>
> /* We must not enqueue the mutex before we have acquired it.
> Also see comments at ENQUEUE_MUTEX. */
> __asm ("" ::: "memory");
> ENQUEUE_MUTEX_PI (mutex);
> /* We need to clear op_pending after we enqueue the mutex. */
> __asm ("" ::: "memory");
> THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
>
> 8f9450a0b7a9 ("Add compiler barriers around modifications of the robust mutex list.")
>
> in the glibc repository. There since Dec 24 2016 ...
>
> So the question is whether this is sufficient. That ordering only
> matters vs. the thread itself and not for others.
Ah, in that case you can use __atomic_signal_fence(__ATOMIC_RELEASE)
instead of the __atomic_thread_fence(__ATOMIC_RELEASE).
The __atomic_thread_fence(__ATOMIC_RELEASE) provides ordering between
threads, but __atomic_signal_fence(__ATOMIC_RELEASE) only does so
within a thread.
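[Editorial note: a short side-by-side of the two GCC/C11 built-ins
mentioned above; the wrapper function and comments are illustrative only.]
	static void fence_comparison(void)
	{
		/* Orders the surrounding stores against other threads/CPUs;
		 * may emit a real fence instruction on weakly ordered
		 * architectures. */
		__atomic_thread_fence(__ATOMIC_RELEASE);
		/* Orders the stores only against code that later runs in this
		 * same thread's context -- a signal handler, or the kernel
		 * walking this thread's robust list at exit. It is a pure
		 * compiler barrier and emits no instruction, which is all the
		 * robust list case needs. */
		__atomic_signal_fence(__ATOMIC_RELEASE);
	}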
Thanx, Paul
On Thu, Jan 31, 2019 at 01:27:25AM +0100, Thomas Gleixner wrote:
> On Thu, 31 Jan 2019, Thomas Gleixner wrote:
>
> > On Wed, 30 Jan 2019, Paul E. McKenney wrote:
> > > On Thu, Jan 31, 2019 at 12:13:51AM +0100, Thomas Gleixner wrote:
> > > > I might be wrong as usual, but this would definitely explain the fail very
> > > > well.
> > >
> > > On recent versions of GCC, the fix would be to put this between the two
> > > stores that need ordering:
> > >
> > > __atomic_thread_fence(__ATOMIC_RELEASE);
> > >
> > > I must defer to Heiko on whether s390 GCC might tear the stores. My
> > > guess is "probably not". ;-)
> >
> > So I just checked the latest glibc code. It has:
> >
> > /* We must not enqueue the mutex before we have acquired it.
> > Also see comments at ENQUEUE_MUTEX. */
> > __asm ("" ::: "memory");
> > ENQUEUE_MUTEX_PI (mutex);
> > /* We need to clear op_pending after we enqueue the mutex. */
> > __asm ("" ::: "memory");
> > THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
> >
> > 8f9450a0b7a9 ("Add compiler barriers around modifications of the robust mutex list.")
> >
> > in the glibc repository. There since Dec 24 2016 ...
>
> And of course, I'm using the latest greatest glibc for testing that, so I'm
> not at all surprised that it just does not reproduce on my tests.
Sounds about right. :-/
> I just hacked the ordering and restarted the test. If the theory holds,
> then this should die sooner than later.
;-)
Thanx, Paul
On Thu, Jan 31, 2019 at 01:27:25AM +0100, Thomas Gleixner wrote:
> On Thu, 31 Jan 2019, Thomas Gleixner wrote:
>
> > On Wed, 30 Jan 2019, Paul E. McKenney wrote:
> > > On Thu, Jan 31, 2019 at 12:13:51AM +0100, Thomas Gleixner wrote:
> > > > I might be wrong as usual, but this would definitely explain the fail very
> > > > well.
> > >
> > > On recent versions of GCC, the fix would be to put this between the two
> > > stores that need ordering:
> > >
> > > __atomic_thread_fence(__ATOMIC_RELEASE);
> > >
> > > I must defer to Heiko on whether s390 GCC might tear the stores. My
> > > guess is "probably not". ;-)
> >
> > So I just checked the latest glibc code. It has:
> >
> > /* We must not enqueue the mutex before we have acquired it.
> > Also see comments at ENQUEUE_MUTEX. */
> > __asm ("" ::: "memory");
> > ENQUEUE_MUTEX_PI (mutex);
> > /* We need to clear op_pending after we enqueue the mutex. */
> > __asm ("" ::: "memory");
> > THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
> >
> > 8f9450a0b7a9 ("Add compiler barriers around modifications of the robust mutex list.")
> >
> > in the glibc repository. There since Dec 24 2016 ...
>
> And of course, I'm using the latest greatest glibc for testing that, so I'm
> not at all surprised that it just does not reproduce on my tests.
As discussed on IRC: I used plain vanilla glibc version 2.28 for my
tests. This version already contains the commit you mentioned above.
> I just hacked the ordering and restarted the test. If the theory holds,
> then this should die sooner than later.
...nevertheless Stefan and I looked through the lovely disassembly of
_pthread_mutex_lock_full() to verify whether the compiler barriers are
actually doing what they are supposed to do. The generated code,
however, does look correct.
So, it must be something different.
On 2019-01-31 17:52:28 [+0100], Heiko Carstens wrote:
> ...nevertheless Stefan and I looked through the lovely disassembly of
> _pthread_mutex_lock_full() to verify if the compiler barriers are
> actually doing what they are supposed to do. The generated code
> however does look correct.
> So, it must be something different.
would it make sense to use one locking function instead of all three (lock,
try-lock, timed) in the test case, to figure out whether this is related to
one of the locking functions?
Sebastian
On Thu, Jan 31, 2019 at 06:06:53PM +0100, Sebastian Sewior wrote:
> On 2019-01-31 17:52:28 [+0100], Heiko Carstens wrote:
> > ...nevertheless Stefan and I looked through the lovely disassembly of
> > _pthread_mutex_lock_full() to verify if the compiler barriers are
> > actually doing what they are supposed to do. The generated code
> > however does look correct.
> > So, it must be something different.
>
> would it make sense to use one locking function instead of all three (lock,
> try-lock, timed) in the test case, to figure out whether this is related to
> one of the locking functions?
Sure, I will give that a try tomorrow.
On Thu, Jan 31, 2019 at 06:06:53PM +0100, Sebastian Sewior wrote:
> On 2019-01-31 17:52:28 [+0100], Heiko Carstens wrote:
> > ...nevertheless Stefan and I looked through the lovely disassembly of
> > _pthread_mutex_lock_full() to verify if the compiler barriers are
> > actually doing what they are supposed to do. The generated code
> > however does look correct.
> > So, it must be something different.
>
> would it make sense to use one locking function instead of all three (lock,
> try-lock, timed) in the test case, to figure out whether this is related to
> one of the locking functions?
I tried all three variants, but then it seems to be close to impossible to
re-create. I had a single failure when using only the trylock
variant, but I wouldn't say that means anything.
Only when all three variants run in parallel does it seem to be quite
reliably reproducible, even though sometimes it still takes an hour.
On Fri, 1 Feb 2019, Heiko Carstens wrote:
> On Thu, Jan 31, 2019 at 06:06:53PM +0100, Sebastian Sewior wrote:
> > On 2019-01-31 17:52:28 [+0100], Heiko Carstens wrote:
> > > ...nevertheless Stefan and I looked through the lovely disassembly of
> > > _pthread_mutex_lock_full() to verify if the compiler barriers are
> > > actually doing what they are supposed to do. The generated code
> > > however does look correct.
> > > So, it must be something different.
> >
> > would it make sense to use one locking function instead of all three (lock,
> > try-lock, timed) in the test case, to figure out whether this is related to
> > one of the locking functions?
>
> I tried all three variants, but then it seems to be close to impossible to
> re-create. I had a single failure when using only the trylock
> variant, but I wouldn't say that means anything.
>
> Only when all three variants run in parallel does it seem to be quite
> reliably reproducible, even though sometimes it still takes an hour.
Were you able to capture a trace with the last set of additional trace
printks?
Thanks,
tglx
Commit-ID: 1a1fb985f2e2b85ec0d3dc2e519ee48389ec2434
Gitweb: https://git.kernel.org/tip/1a1fb985f2e2b85ec0d3dc2e519ee48389ec2434
Author: Thomas Gleixner <[email protected]>
AuthorDate: Tue, 29 Jan 2019 23:15:12 +0100
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 8 Feb 2019 13:00:36 +0100
futex: Handle early deadlock return correctly
commit 56222b212e8e ("futex: Drop hb->lock before enqueueing on the
rtmutex") changed the locking rules in the futex code so that the hash
bucket lock is no longer held while the waiter is enqueued into the
rtmutex wait list. This made the lock and the unlock path symmetric, but
unfortunately the possible early exit from __rt_mutex_proxy_start() due to
a detected deadlock was not updated accordingly. That allows a concurrent
unlocker to observe inconsistent state which triggers the warning in the
unlock path.
futex_lock_pi()                  futex_unlock_pi()
  lock(hb->lock)
  queue(hb_waiter)               lock(hb->lock)
  lock(rtmutex->wait_lock)
  unlock(hb->lock)
                              // acquired hb->lock
                                 hb_waiter = futex_top_waiter()
                                 lock(rtmutex->wait_lock)
  __rt_mutex_proxy_start()
     ---> fail
          remove(rtmutex_waiter);
     ---> returns -EDEADLOCK
  unlock(rtmutex->wait_lock)
                              // acquired wait_lock
                                 wake_futex_pi()
                                 rt_mutex_next_owner()
                                   --> returns NULL
                                   --> WARN
  lock(hb->lock)
  unqueue(hb_waiter)
The problem is caused by the remove(rtmutex_waiter) in the failure case of
__rt_mutex_proxy_start() as this lets the unlocker observe a waiter in the
hash bucket but no waiter on the rtmutex, i.e. inconsistent state.
The original commit handles this correctly for the other early return cases
(timeout, signal) by delaying the removal of the rtmutex waiter until the
returning task reacquired the hash bucket lock.
Treat the failure case of __rt_mutex_proxy_start() in the same way and let
the existing cleanup code handle the eventual handover of the rtmutex
gracefully. The regular rt_mutex_proxy_start() gains the rtmutex waiter
removal for the failure case, so that the other callsites are still
operating correctly.
Add proper comments to the code so all these details are fully documented.
Thanks to Peter for helping with the analysis and writing the really
valuable code comments.
Fixes: 56222b212e8e ("futex: Drop hb->lock before enqueueing on the rtmutex")
Reported-by: Heiko Carstens <[email protected]>
Co-developed-by: Peter Zijlstra <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Heiko Carstens <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: [email protected]
Cc: Stefan Liebler <[email protected]>
Cc: Sebastian Sewior <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/futex.c | 28 ++++++++++++++++++----------
kernel/locking/rtmutex.c | 37 ++++++++++++++++++++++++++++++++-----
2 files changed, 50 insertions(+), 15 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index 5ec2473a3497..a0514e01c3eb 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2861,35 +2861,39 @@ retry_private:
* and BUG when futex_unlock_pi() interleaves with this.
*
* Therefore acquire wait_lock while holding hb->lock, but drop the
- * latter before calling rt_mutex_start_proxy_lock(). This still fully
- * serializes against futex_unlock_pi() as that does the exact same
- * lock handoff sequence.
+ * latter before calling __rt_mutex_start_proxy_lock(). This
+ * interleaves with futex_unlock_pi() -- which does a similar lock
+ * handoff -- such that the latter can observe the futex_q::pi_state
+ * before __rt_mutex_start_proxy_lock() is done.
*/
raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
spin_unlock(q.lock_ptr);
+ /*
+ * __rt_mutex_start_proxy_lock() unconditionally enqueues the @rt_waiter
+ * such that futex_unlock_pi() is guaranteed to observe the waiter when
+ * it sees the futex_q::pi_state.
+ */
ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current);
raw_spin_unlock_irq(&q.pi_state->pi_mutex.wait_lock);
if (ret) {
if (ret == 1)
ret = 0;
-
- spin_lock(q.lock_ptr);
- goto no_block;
+ goto cleanup;
}
-
if (unlikely(to))
hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
+cleanup:
spin_lock(q.lock_ptr);
/*
- * If we failed to acquire the lock (signal/timeout), we must
+ * If we failed to acquire the lock (deadlock/signal/timeout), we must
* first acquire the hb->lock before removing the lock from the
- * rt_mutex waitqueue, such that we can keep the hb and rt_mutex
- * wait lists consistent.
+ * rt_mutex waitqueue, such that we can keep the hb and rt_mutex wait
+ * lists consistent.
*
* In particular; it is important that futex_unlock_pi() can not
* observe this inconsistency.
@@ -3013,6 +3017,10 @@ retry:
* there is no point where we hold neither; and therefore
* wake_futex_pi() must observe a state consistent with what we
* observed.
+ *
+ * In particular; this forces __rt_mutex_start_proxy() to
+ * complete such that we're guaranteed to observe the
+ * rt_waiter. Also see the WARN in wake_futex_pi().
*/
raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
spin_unlock(&hb->lock);
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 581edcc63c26..978d63a8261c 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1726,12 +1726,33 @@ void rt_mutex_proxy_unlock(struct rt_mutex *lock,
rt_mutex_set_owner(lock, NULL);
}
+/**
+ * __rt_mutex_start_proxy_lock() - Start lock acquisition for another task
+ * @lock: the rt_mutex to take
+ * @waiter: the pre-initialized rt_mutex_waiter
+ * @task: the task to prepare
+ *
+ * Starts the rt_mutex acquire; it enqueues the @waiter and does deadlock
+ * detection. It does not wait, see rt_mutex_wait_proxy_lock() for that.
+ *
+ * NOTE: does _NOT_ remove the @waiter on failure; must either call
+ * rt_mutex_wait_proxy_lock() or rt_mutex_cleanup_proxy_lock() after this.
+ *
+ * Returns:
+ * 0 - task blocked on lock
+ * 1 - acquired the lock for task, caller should wake it up
+ * <0 - error
+ *
+ * Special API call for PI-futex support.
+ */
int __rt_mutex_start_proxy_lock(struct rt_mutex *lock,
struct rt_mutex_waiter *waiter,
struct task_struct *task)
{
int ret;
+ lockdep_assert_held(&lock->wait_lock);
+
if (try_to_take_rt_mutex(lock, task, NULL))
return 1;
@@ -1749,9 +1770,6 @@ int __rt_mutex_start_proxy_lock(struct rt_mutex *lock,
ret = 0;
}
- if (unlikely(ret))
- remove_waiter(lock, waiter);
-
debug_rt_mutex_print_deadlock(waiter);
return ret;
@@ -1763,12 +1781,18 @@ int __rt_mutex_start_proxy_lock(struct rt_mutex *lock,
* @waiter: the pre-initialized rt_mutex_waiter
* @task: the task to prepare
*
+ * Starts the rt_mutex acquire; it enqueues the @waiter and does deadlock
+ * detection. It does not wait, see rt_mutex_wait_proxy_lock() for that.
+ *
+ * NOTE: unlike __rt_mutex_start_proxy_lock this _DOES_ remove the @waiter
+ * on failure.
+ *
* Returns:
* 0 - task blocked on lock
* 1 - acquired the lock for task, caller should wake it up
* <0 - error
*
- * Special API call for FUTEX_REQUEUE_PI support.
+ * Special API call for PI-futex support.
*/
int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
struct rt_mutex_waiter *waiter,
@@ -1778,6 +1802,8 @@ int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
raw_spin_lock_irq(&lock->wait_lock);
ret = __rt_mutex_start_proxy_lock(lock, waiter, task);
+ if (unlikely(ret))
+ remove_waiter(lock, waiter);
raw_spin_unlock_irq(&lock->wait_lock);
return ret;
@@ -1845,7 +1871,8 @@ int rt_mutex_wait_proxy_lock(struct rt_mutex *lock,
* @lock: the rt_mutex we were woken on
* @waiter: the pre-initialized rt_mutex_waiter
*
- * Attempt to clean up after a failed rt_mutex_wait_proxy_lock().
+ * Attempt to clean up after a failed __rt_mutex_start_proxy_lock() or
+ * rt_mutex_wait_proxy_lock().
*
* Unless we acquired the lock; we're still enqueued on the wait-list and can
* in fact still be granted ownership until we're removed. Therefore we can