2014-04-10 23:25:36

by Valdis Klētnieks

[permalink] [raw]
Subject: Possible issue with commit 6f008e72cd - locking/mutex: Fix debug checks.

So recent linux-next lock up on my laptop due to a third-party module.
Bisection finds this as the source of the issue:

6f008e72cd111a119b5d8de8c5438d892aae99eb is the first bad commit
commit 6f008e72cd111a119b5d8de8c5438d892aae99eb
Author: Peter Zijlstra <[email protected]>
Date: Wed Mar 12 13:24:42 2014 +0100

locking/mutex: Fix debug checks

OK, so commit:

1d8fe7dc8078 ("locking/mutexes: Unlock the mutex without the wait_lock")

generates this boot warning when CONFIG_DEBUG_MUTEXES=y:

WARNING: CPU: 0 PID: 139 at /usr/src/linux-2.6/kernel/locking/mutex-debug.

And that makes sense, because as soon as we release the lock a
new owner can come in...

One would think that !__mutex_slowpath_needs_to_unlock()
implementations suffer the same, but for DEBUG we fall back to
mutex-null.h which has an unconditional 1 for that.

The mutex debug code requires the mutex to be unlocked after
doing the debug checks, otherwise it can find inconsistent
state.

I was finally able to get netconsole to cough up what happened:

Apr 10 18:52:23 turing-police [ 768.277732] INFO: trying to register non-static key.
Apr 10 18:52:23 turing-police [ 768.277781] the code is fine but needs lockdep annotation.
Apr 10 18:52:23 turing-police [ 768.277794] turning off the locking correctness validator.
Apr 10 18:52:23 turing-police [ 768.277819] CPU: 0 PID: 1237 Comm: Xorg Tainted: G O 3.14.0-next-20140409-dirty #202
Apr 10 18:52:23 turing-police [ 768.277925] Hardware name: Dell Inc. Latitude E6530/07Y85M, BIOS A11 03/12/2013
Apr 10 18:52:23 turing-police [ 768.277938] 0000000000000000
Apr 10 18:52:23 turing-police ffff8800c65c7c60
Apr 10 18:52:23 turing-police ffffffff81668e5f
Apr 10 18:52:23 turing-police 0000000000000000
Apr 10 18:52:23 turing-police
Apr 10 18:52:23 turing-police [ 768.277989] ffff8800c65c7cd0
Apr 10 18:52:23 turing-police ffffffff8107b9c0
Apr 10 18:52:23 turing-police ffffffff822098d0
Apr 10 18:52:23 turing-police ffff8800c74cd8f8
Apr 10 18:52:23 turing-police
Apr 10 18:52:23 turing-police [ 768.278091] ffff8800c74cd0d0
Apr 10 18:52:23 turing-police 0000000000000019
Apr 10 18:52:23 turing-police 0000000000000000
Apr 10 18:52:23 turing-police ffffffff81c2d980
Apr 10 18:52:23 turing-police
Apr 10 18:52:23 turing-police [ 768.278119] Call Trace:
Apr 10 18:52:23 turing-police [ 768.278137] [<ffffffff81668e5f>] dump_stack+0x4f/0xa2
Apr 10 18:52:23 turing-police [ 768.278153] [<ffffffff8107b9c0>] register_lock_class+0x10d/0x2bf
Apr 10 18:52:23 turing-police [ 768.278166] [<ffffffff8107d26b>] __lock_acquire+0x127/0xdd7
Apr 10 18:52:23 turing-police [ 768.278182] [<ffffffff816732aa>] ? _raw_spin_unlock+0x5b/0x67
Apr 10 18:52:23 turing-police [ 768.278196] [<ffffffff8107e25e>] lock_acquire+0xc9/0x156
Apr 10 18:52:23 turing-police [ 768.278208] [<ffffffff81079e4e>] ? down+0x11/0x55
Apr 10 18:52:23 turing-police [ 768.278305] [<ffffffff8167319e>] _raw_spin_lock_irqsave+0x44/0x55
Apr 10 18:52:23 turing-police [ 768.278317] [<ffffffff81079e4e>] ? down+0x11/0x55
Apr 10 18:52:23 turing-police [ 768.278328] [<ffffffff81079e4e>] down+0x11/0x55

A down() of an un-annotated lock (which I'll have to annotate, I guess).

But then the kernel goes off into the weeds with complaints like:

Apr 10 18:53:28 turing-police [ 833.465272] INFO: rcu_preempt detected stalls on CPUs/tasks:
Apr 10 18:53:28 turing-police {
Apr 10 18:53:28 turing-police }
Apr 10 18:53:28 turing-police (detected by 3, t=65004 jiffies, g=4552, c=4551, q=287)
Apr 10 18:53:28 turing-police [ 833.465354] INFO: Stall ended before state dump start
Apr 10 18:55:34 turing-police [ 958.896736] INFO: task systemd:1 blocked for more than 120 seconds.

from all *over* the place - systemd in a poll() syscall, ext4, snd_hda, and
a few other places. All the symptoms of busted locking leaving something
with an unintended held lock. After a bit, the box doesn't even answer
ping anymore....

I can't be sure that it's the down() call - but nothing else actually gets
*logged*. But the commit obviously changes the actual semantics of
*something*. Was this intentional?




Attachments:
(No filename) (848.00 B)

2014-04-11 00:27:40

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: Possible issue with commit 6f008e72cd - locking/mutex: Fix debug checks.

On Thu, 2014-04-10 at 19:23 -0400, Valdis Kletnieks wrote:
> So recent linux-next lock up on my laptop due to a third-party module.
> Bisection finds this as the source of the issue:
>
> 6f008e72cd111a119b5d8de8c5438d892aae99eb is the first bad commit

This was already reported, please refer to this thread:
https://lkml.org/lkml/2014/4/6/1

2014-04-11 06:54:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Possible issue with commit 6f008e72cd - locking/mutex: Fix debug checks.

On Thu, Apr 10, 2014 at 07:23:28PM -0400, Valdis Kletnieks wrote:
> So recent linux-next lock up on my laptop due to a third-party module.
> Bisection finds this as the source of the issue:
>
> 6f008e72cd111a119b5d8de8c5438d892aae99eb is the first bad commit
> commit 6f008e72cd111a119b5d8de8c5438d892aae99eb
> Author: Peter Zijlstra <[email protected]>
> Date: Wed Mar 12 13:24:42 2014 +0100

see
http://lkml.kernel.org/r/[email protected]