2012-06-26 12:45:41

by Glauber Costa

Subject: "Regression" with cd3d09527537

Hi,

I've recently started seeing a lockdep warning at the end of *every*
"init 0" issued on my machine. Reboots, on the other hand, are fine,
which is probably why I never saw this earlier. The log is quite
extensive, but it shows the following dependency chain:

[ 83.982111] -> #4 (cpu_hotplug.lock){+.+.+.}:
[...]
[ 83.982111] -> #3 (jump_label_mutex){+.+...}:
[...]
[ 83.982111] -> #2 (sk_lock-AF_INET){+.+.+.}:
[...]
[ 83.982111] -> #1 (&sig->cred_guard_mutex){+.+.+.}:
[...]
[ 83.982111] -> #0 (cgroup_mutex){+.+.+.}:
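
For anyone who doesn't read these dumps every day: lockdep is saying
that the locks above can be nested in an order that forms a cycle, so
two tasks taking them in opposite orders could deadlock. A minimal
userspace analogue of the two-lock case (illustration only, obviously
not the kernel code involved; build with "gcc -pthread"):

#include <pthread.h>
#include <stdio.h>

/*
 * One path takes A then B, the other takes B then A.  With unlucky
 * timing the two threads deadlock; a lockdep-style checker flags the
 * cycle even on runs where the deadlock never actually happens.
 */
static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

static void *path_one(void *unused)
{
	pthread_mutex_lock(&lock_a);	/* A, then B */
	pthread_mutex_lock(&lock_b);
	pthread_mutex_unlock(&lock_b);
	pthread_mutex_unlock(&lock_a);
	return NULL;
}

static void *path_two(void *unused)
{
	pthread_mutex_lock(&lock_b);	/* B, then A: the inverted order */
	pthread_mutex_lock(&lock_a);
	pthread_mutex_unlock(&lock_a);
	pthread_mutex_unlock(&lock_b);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, path_one, NULL);
	pthread_create(&t2, NULL, path_two, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	puts("finished (this run got lucky)");
	return 0;
}

Here the cycle runs through five locks instead of two, but the failure
mode is the same.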

I recently fixed bugs in the lock ordering that cpusets impose on
cpu_hotplug.lock through jump_label_mutex, and initially thought this
was the same kind of issue. But that was not the case.

I've omitted the full backtrace for readability. I ran this with all
cgroups disabled except cpuset, so it can't be sock memcg (despite my
initial reaction of "oh, fuck, not again"). That jump_label has been
there for years; it comes from the code that toggles socket timestamps
(net_enable_timestamp).
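
To make that edge concrete: setting SO_TIMESTAMP on a socket flips the
netstamp static key while the socket lock is held, so sk_lock and
jump_label_mutex end up nested. Roughly, as I read the code (a sketch;
the intermediate helpers may differ between versions):

sock_setsockopt(sk, SOL_SOCKET, SO_TIMESTAMP, ...)
    lock_sock(sk);                        /* sk_lock-AF_INET */
    sock_enable_timestamp(sk, ...)
        net_enable_timestamp()
            static_key_slow_inc(&netstamp_needed)
                jump_label_lock();        /* takes jump_label_mutex */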

After a couple of days of extensive debugging, with git bisect failing
to pinpoint a culprit, I arrived at the patch
"cgroup: always lock threadgroup during migration" (cd3d09527537) as
the one that triggers the warning.

The problem is, what this patch does is call threadgroup_lock every
time instead of conditionally. In that sense, it of course did not
create the bug; it only made it (fortunately) always visible.
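
For context, the change amounts to this (paraphrased from memory, not
the actual diff): the attach path used to take the lock only for
whole-threadgroup moves,

	if (threadgroup) {
		threadgroup_lock(tsk);
		ret = cgroup_attach_proc(cgrp, tsk);
		threadgroup_unlock(tsk);
	} else {
		ret = cgroup_attach_task(cgrp, tsk);
	}

and now takes it unconditionally around both cases:

	threadgroup_lock(tsk);
	if (threadgroup)
		ret = cgroup_attach_proc(cgrp, tsk);
	else
		ret = cgroup_attach_task(cgrp, tsk);
	threadgroup_unlock(tsk);

All of this runs with cgroup_mutex already held, which is why the
latent dependency now fires on single-task moves too.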

Thing is, I honestly don't know what a fix for this bug would look
like. We could take the threadgroup_lock before the cgroup_lock (see
the sketch below), but then we would hold it for way too long.
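
That is, something like this (a sketch of the idea, not a tested
patch):

	threadgroup_lock(tsk);
	cgroup_lock();
	/* ... the whole migration ... */
	cgroup_unlock();
	threadgroup_unlock(tsk);

i.e., inverting the nesting at the cost of a much wider
threadgroup_lock critical section.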

This is just another incarnation of cgroup_lock creating nasty
dependencies with virtually everything else, because we hold it for
everything we do. I fear we'll fix this one, and another will just
show up some time later.

What do you think, Tejun?


2012-06-27 23:08:30

by Tejun Heo

Subject: Re: "Regression" with cd3d09527537

On Tue, Jun 26, 2012 at 04:43:03PM +0400, Glauber Costa wrote:
> Hi,
>
> I've recently started seeing a lockdep warning at the end of *every*
> "init 0" issued on my machine. Reboots, on the other hand, are fine,
> which is probably why I never saw this earlier. The log is quite
> extensive, but it shows the following dependency chain:
>
> [ 83.982111] -> #4 (cpu_hotplug.lock){+.+.+.}:
> [...]
> [ 83.982111] -> #3 (jump_label_mutex){+.+...}:
> [...]
> [ 83.982111] -> #2 (sk_lock-AF_INET){+.+.+.}:
> [...]
> [ 83.982111] -> #1 (&sig->cred_guard_mutex){+.+.+.}:
> [...]
> [ 83.982111] -> #0 (cgroup_mutex){+.+.+.}:
>
> I recently fixed bugs in the lock ordering that cpusets impose on
> cpu_hotplug.lock through jump_label_mutex, and initially thought this
> was the same kind of issue. But that was not the case.
>
> I've omitted the full backtrace for readability. I ran this with all
> cgroups disabled except cpuset, so it can't be sock memcg (despite my
> initial reaction of "oh, fuck, not again"). That jump_label has been
> there for years; it comes from the code that toggles socket
> timestamps (net_enable_timestamp).

Yeah, there are multiple really large locks at play here - jump label,
threadgroup and cgroup_mutex. It isn't pretty. Can you please post
the full lockdep dump? The above only shows a single locking chain;
I'd like to see the other one.

Thanks.

--
tejun

2012-06-27 23:10:19

by Glauber Costa

Subject: Re: "Regression" with cd3d09527537

On 06/28/2012 03:08 AM, Tejun Heo wrote:
> On Tue, Jun 26, 2012 at 04:43:03PM +0400, Glauber Costa wrote:
>> Hi,
>>
>> I've recently started seeing a lockdep warning at the end of *every*
>> "init 0" issued on my machine. Reboots, on the other hand, are fine,
>> which is probably why I never saw this earlier. The log is quite
>> extensive, but it shows the following dependency chain:
>>
>> [ 83.982111] -> #4 (cpu_hotplug.lock){+.+.+.}:
>> [...]
>> [ 83.982111] -> #3 (jump_label_mutex){+.+...}:
>> [...]
>> [ 83.982111] -> #2 (sk_lock-AF_INET){+.+.+.}:
>> [...]
>> [ 83.982111] -> #1 (&sig->cred_guard_mutex){+.+.+.}:
>> [...]
>> [ 83.982111] -> #0 (cgroup_mutex){+.+.+.}:
>>
>> I recently fixed bugs in the lock ordering that cpusets impose on
>> cpu_hotplug.lock through jump_label_mutex, and initially thought
>> this was the same kind of issue. But that was not the case.
>>
>> I've omitted the full backtrace for readability. I ran this with all
>> cgroups disabled except cpuset, so it can't be sock memcg (despite
>> my initial reaction of "oh, fuck, not again"). That jump_label has
>> been there for years; it comes from the code that toggles socket
>> timestamps (net_enable_timestamp).
>
> Yeah, there are multiple really large locks at play here - jump label,
> threadgroup and cgroup_mutex. It isn't pretty. Can you please post
> the full lockdep dump? The above only shows a single locking chain;
> I'd like to see the other one.
>
> Thanks.
>



Attachments:
REBOOT-BUG (8.32 kB)