2019-06-14 13:40:05

by Steven Rostedt

Subject: [BUG] lockdep splat with kernfs lockdep annotations and slab mutex from drm patch??

I'm trying to debug some code where I need KASAN and lockdep enabled
but I can't get past this splat unrelated to my code. I bisected it
down to this commit:

32eb6bcfdda9da ("drm/i915: Make request allocation caches global")

To make sure it wasn't a bad bisect, I removed the patch and the splat
went away; I added the patch back and the splat returned. I did this
several times, so I'm fairly confident that this commit is the cause of
the splat, or at least that it is what triggers whatever is going on.

Perhaps all the cache updates caused slab_mutex to be taken in the
inverse order of the kernfs locking?

Attached is my config.

Also, it might be helpful: the splat happens via this path:

kernfs_fop_write()
  ops->write == sysfs_kf_write
sysfs_kf_write()
  ops->store == slab_attr_store
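
For reference, the dispatch looks roughly like this (I'm paraphrasing
fs/kernfs/file.c and fs/sysfs/file.c from memory, so treat it as a sketch
and not the exact source). The point is that the ->store() call runs with
the kernfs "active" reference held, which is the kn->count lock in the
splat below:

static ssize_t kernfs_fop_write(struct file *file, const char __user *ubuf,
				size_t count, loff_t *ppos)
{
	struct kernfs_open_file *of = kernfs_of(file);
	const struct kernfs_ops *ops;
	ssize_t len = count;
	char *buf;

	/* ... of->mutex taken, ubuf copied into buf, error handling elided ... */

	/* pins the node: this is the "kn->count" lock in the splat */
	if (!kernfs_get_active(of->kn))
		return -ENODEV;

	ops = kernfs_ops(of->kn);
	len = ops->write(of, buf, len, *ppos);		/* == sysfs_kf_write() */

	kernfs_put_active(of->kn);	/* __kernfs_remove() waits for these to drain */
	return len;
}

static ssize_t sysfs_kf_write(struct kernfs_open_file *of, char *buf,
			      size_t count, loff_t *pos)
{
	const struct sysfs_ops *ops = sysfs_file_ops(of->kn);
	struct kobject *kobj = of->kn->parent->priv;

	return ops->store(kobj, of->kn->priv, buf, count);	/* == slab_attr_store() */
}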

-- Steve

Starting Network Manager...
[ 42.035383]
[ 42.040982] ======================================================
[ 42.054681] WARNING: possible circular locking dependency detected
[ 42.068242] 5.2.0-rc4-test+ #306 Not tainted
[ 42.079827] ------------------------------------------------------
[ 42.093323] slub_cpu_partia/927 is trying to acquire lock:
[ 42.106136] 0000000050c8525d (slab_mutex){+.+.}, at: slab_attr_store+0x6d/0xe0
[ 42.120667]
[ 42.120667] but task is already holding lock:
[ 42.140962] 00000000fbcdac6b (kn->count#157){++++}, at: kernfs_fop_write+0x125/0x250
[ 42.155978]
[ 42.155978] which lock already depends on the new lock.
[ 42.155978]
[ 42.185778]
[ 42.185778] the existing dependency chain (in reverse order) is:
[ 42.207505]
[ 42.207505] -> #1 (kn->count#157){++++}:
[ 42.227197] __kernfs_remove+0x413/0x4a0
[ 42.238848] kernfs_remove_by_name_ns+0x40/0x80
[ 42.251141] sysfs_slab_add+0x1b5/0x2f0
[ 42.262669] __kmem_cache_create+0x511/0x560
[ 42.274623] create_cache+0xcd/0x1f0
[ 42.285844] kmem_cache_create_usercopy+0x18a/0x240
[ 42.298347] kmem_cache_create+0x12/0x20
[ 42.309854] is_active_nid+0xdb/0x230 [snd_hda_codec_generic]
[ 42.323129] snd_hda_get_path_idx+0x55/0x80 [snd_hda_codec_generic]
[ 42.336835] get_nid_path+0xc/0x170 [snd_hda_codec_generic]
[ 42.349750] do_one_initcall+0xa2/0x394
[ 42.360803] do_init_module+0xfd/0x370
[ 42.371613] load_module+0x38c6/0x3bd0
[ 42.382319] __do_sys_finit_module+0x11a/0x1b0
[ 42.393587] do_syscall_64+0x68/0x250
[ 42.403934] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 42.415554]
[ 42.415554] -> #0 (slab_mutex){+.+.}:
[ 42.432367] lock_acquire+0xbd/0x1d0
[ 42.442133] __mutex_lock+0xfc/0xb70
[ 42.451749] slab_attr_store+0x6d/0xe0
[ 42.461489] kernfs_fop_write+0x199/0x250
[ 42.471364] vfs_write+0xe1/0x240
[ 42.480433] ksys_write+0xba/0x150
[ 42.489475] do_syscall_64+0x68/0x250
[ 42.498799] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 42.509558]
[ 42.509558] other info that might help us debug this:
[ 42.509558]
[ 42.532347] Possible unsafe locking scenario:
[ 42.532347]
[ 42.547589]        CPU0                    CPU1
[ 42.556652]        ----                    ----
[ 42.565586]   lock(kn->count#157);
[ 42.573255]                                lock(slab_mutex);
[ 42.583158]                                lock(kn->count#157);
[ 42.593287]   lock(slab_mutex);
[ 42.600508]
[ 42.600508] *** DEADLOCK ***
[ 42.600508]
[ 42.617703] 3 locks held by slub_cpu_partia/927:
[ 42.625835] #0: 00000000fca4a536 (sb_writers#4){.+.+}, at: vfs_write+0x1ce/0x240
[ 42.636775] #1: 000000008c8467e8 (&of->mutex){+.+.}, at: kernfs_fop_write+0x115/0x250
[ 42.648075] #2: 00000000fbcdac6b (kn->count#157){++++}, at: kernfs_fop_write+0x125/0x250
[ 42.659529]
[ 42.659529] stack backtrace:
[ 42.670198] CPU: 2 PID: 927 Comm: slub_cpu_partia Not tainted 5.2.0-rc4-test+ #306
[ 42.680898] Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
[ 42.693051] Call Trace:
[ 42.698669] dump_stack+0x7c/0xc0
[ 42.705131] print_circular_bug.cold.58+0x172/0x1d1
[ 42.713151] __lock_acquire+0x1e75/0x2170
[ 42.720281] ? lockdep_hardirqs_on+0x250/0x250
[ 42.727804] ? match_held_lock+0x1b/0x230
[ 42.734842] ? find_held_lock+0xac/0xd0
[ 42.741709] ? match_held_lock+0x1b/0x230
[ 42.748707] ? find_held_lock+0xac/0xd0
[ 42.755477] ? unfreeze_partials.isra.64+0x238/0x330
[ 42.763376] lock_acquire+0xbd/0x1d0
[ 42.769883] ? slab_attr_store+0x6d/0xe0
[ 42.776771] __mutex_lock+0xfc/0xb70
[ 42.783289] ? slab_attr_store+0x6d/0xe0
[ 42.790134] ? slab_attr_store+0x6d/0xe0
[ 42.796948] ? mutex_lock_io_nested+0xab0/0xab0
[ 42.804392] ? get_partial_node.isra.65.part.66+0x290/0x290
[ 42.812864] ? on_each_cpu_cond_mask+0xec/0x140
[ 42.820309] ? ksm_migrate_page+0x170/0x170
[ 42.827398] ? preempt_count_sub+0xb0/0x100
[ 42.834487] ? on_each_cpu_cond_mask+0xec/0x140
[ 42.841924] ? get_partial_node.isra.65.part.66+0x290/0x290
[ 42.850430] ? on_each_cpu_mask+0x90/0x90
[ 42.857367] ? slab_attr_store+0x6d/0xe0
[ 42.864183] slab_attr_store+0x6d/0xe0
[ 42.870791] ? sysfs_file_ops+0xa0/0xa0
[ 42.877469] kernfs_fop_write+0x199/0x250
[ 42.884326] vfs_write+0xe1/0x240
[ 42.890484] ksys_write+0xba/0x150
[ 42.896710] ? __ia32_sys_read+0x50/0x50
[ 42.903459] ? trace_hardirqs_off_caller+0x43/0x120
[ 42.911157] ? mark_held_locks+0x29/0xa0
[ 42.917891] ? do_syscall_64+0x14/0x250
[ 42.924549] do_syscall_64+0x68/0x250
[ 42.931026] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 42.938916] RIP: 0033:0x7fd547c94ff8
[ 42.945341] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 25 77 0d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[ 42.970255] RSP: 002b:00007ffefe76d608 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 42.981019] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fd547c94ff8
[ 42.991380] RDX: 0000000000000002 RSI: 0000560703698180 RDI: 0000000000000001
[ 43.001750] RBP: 0000560703698180 R08: 000000000000000a R09: 00007fd547d67cc0
[ 43.012134] R10: 000000000000000a R11: 0000000000000246 R12: 00007fd547d68780
[ 43.022507] R13: 0000000000000002 R14: 00007fd547d63740 R15: 0000000000000002


Attachments:
config (128.41 kB)

2019-06-14 15:10:35

by Chris Wilson

Subject: Re: [BUG] lockdep splat with kernfs lockdep annotations and slab mutex from drm patch??

Quoting Steven Rostedt (2019-06-14 14:39:14)
> I'm trying to debug some code where I need KASAN and lockdep enabled
> but I can't get past this splat unrelated to my code. I bisected it
> down to this commit:
>
> 32eb6bcfdda9da ("drm/i915: Make request allocation caches global")
>
> To make sure it wasn't a bad bisect, I removed the patch and the splat
> went away; I added the patch back and the splat returned. I did this
> several times, so I'm fairly confident that this commit is the cause of
> the splat, or at least that it is what triggers whatever is going on.
>
> Perhaps all the cache updates caused slab_mutex to be taken in the
> inverse order of the kernfs locking?
>
> Attached is my config.
>
> Also, it might be helpful: the splat happens via this path:
>
> kernfs_fop_write()
>   ops->write == sysfs_kf_write
> sysfs_kf_write()
>   ops->store == slab_attr_store

More interestingly,

static ssize_t slab_attr_store(struct kobject *kobj,
				struct attribute *attr,
				const char *buf, size_t len)
{
	struct slab_attribute *attribute;
	struct kmem_cache *s;
	int err;

	attribute = to_slab_attr(attr);
	s = to_slab(kobj);

	if (!attribute->store)
		return -EIO;

	err = attribute->store(s, buf, len);
#ifdef CONFIG_MEMCG
	if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
		struct kmem_cache *c;

		mutex_lock(&slab_mutex);

so it happens to hit the error + FULL case with the additional slabcaches?

Anyway, according to lockdep, it is dangerous to use the slab_mutex inside
slab_attr_store().
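
For completeness, what runs under that mutex_lock() is (again paraphrasing
mm/slub.c from memory, so the details may be off) a best-effort replay of
the store on every memcg child cache of the root cache:

		if (s->max_attr_size < len)
			s->max_attr_size = len;

		/*
		 * Best-effort propagation: replay the store on every memcg
		 * child cache of this root cache, all under slab_mutex.
		 */
		for_each_memcg_cache(c, s)
			attribute->store(c, buf, len);
		mutex_unlock(&slab_mutex);
	}
#endif
	return err;
}
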
-Chris

2019-06-14 15:39:26

by Tejun Heo

Subject: Re: [BUG] lockdep splat with kernfs lockdep annotations and slab mutex from drm patch??

Hello,

On Fri, Jun 14, 2019 at 04:08:33PM +0100, Chris Wilson wrote:
> #ifdef CONFIG_MEMCG
> 	if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
> 		struct kmem_cache *c;
>
> 		mutex_lock(&slab_mutex);
>
> so it happens to hit the error + FULL case with the additional slabcaches?
>
> Anyway, according to lockdep, it is dangerous to use the slab_mutex inside
> slab_attr_store().

Didn't really look into the code but it looks like slab_mutex is held
while trying to remove sysfs files. sysfs file removal flushes
on-going accesses, so if a file operation then tries to grab a mutex
which is held during removal, it leads to a deadlock.
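
In other words, something like this userspace toy model (just to
illustrate the ordering; obviously not the kernel code). One thread plays
the removal side: it holds "slab_mutex" and then waits for the active
count to drain. The other plays the sysfs ->store side: it is counted as
active and then wants the same mutex. Run it and both threads block
forever:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t slab_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  count_cv   = PTHREAD_COND_INITIALIZER;
static int active;			/* models the kn->count active refcount */

static void *removal(void *arg)		/* ~ create_cache() -> __kernfs_remove() */
{
	(void)arg;
	pthread_mutex_lock(&slab_mutex);	/* held across cache creation */
	pthread_mutex_lock(&count_lock);
	while (active > 0)			/* "flush on-going accesses" */
		pthread_cond_wait(&count_cv, &count_lock);
	pthread_mutex_unlock(&count_lock);
	pthread_mutex_unlock(&slab_mutex);
	return NULL;
}

static void *store(void *arg)		/* ~ kernfs_fop_write() -> slab_attr_store() */
{
	(void)arg;
	pthread_mutex_lock(&count_lock);
	active++;				/* ~ kernfs_get_active() */
	pthread_mutex_unlock(&count_lock);

	sleep(2);				/* the write handler is in flight... */

	pthread_mutex_lock(&slab_mutex);	/* ...and now wants slab_mutex: stuck */
	pthread_mutex_unlock(&slab_mutex);

	pthread_mutex_lock(&count_lock);
	if (--active == 0)			/* ~ kernfs_put_active() */
		pthread_cond_signal(&count_cv);
	pthread_mutex_unlock(&count_lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&b, NULL, store, NULL);
	sleep(1);		/* let the writer take its active reference first */
	pthread_create(&a, NULL, removal, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	puts("never reached");
	return 0;
}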

Thanks.

--
tejun

2019-07-11 03:00:57

by Steven Rostedt

Subject: Re: [BUG] lockdep splat with kernfs lockdep annotations and slab mutex from drm patch??

On Fri, 14 Jun 2019 08:38:37 -0700
Tejun Heo <[email protected]> wrote:

> Hello,
>
> On Fri, Jun 14, 2019 at 04:08:33PM +0100, Chris Wilson wrote:
> > #ifdef CONFIG_MEMCG
> > 	if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
> > 		struct kmem_cache *c;
> >
> > 		mutex_lock(&slab_mutex);
> >
> > so it happens to hit the error + FULL case with the additional slabcaches?
> >
> > Anyway, according to lockdep, it is dangerous to use the slab_mutex inside
> > slab_attr_store().
>
> Didn't really look into the code but it looks like slab_mutex is held
> while trying to remove sysfs files. sysfs file removal flushes
> on-going accesses, so if a file operation then tries to grab a mutex
> which is held during removal, it leads to a deadlock.
>

Looks like this never got fixed and now this bug is in 5.2.

Just got this:

======================================================
WARNING: possible circular locking dependency detected
5.2.0-test #15 Not tainted
------------------------------------------------------
slub_cpu_partia/899 is trying to acquire lock:
000000000f6f2dd7 (slab_mutex){+.+.}, at: slab_attr_store+0x6d/0xe0

but task is already holding lock:
00000000b23ffe3d (kn->count#160){++++}, at: kernfs_fop_write+0x125/0x230

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (kn->count#160){++++}:
__kernfs_remove+0x413/0x4a0
kernfs_remove_by_name_ns+0x40/0x80
sysfs_slab_add+0x1b5/0x2f0
__kmem_cache_create+0x511/0x560
create_cache+0xcd/0x1f0
kmem_cache_create_usercopy+0x18a/0x240
kmem_cache_create+0x12/0x20
is_active_nid+0xdb/0x230 [snd_hda_codec_generic]
snd_hda_get_path_idx+0x55/0x80 [snd_hda_codec_generic]
get_nid_path+0xc/0x170 [snd_hda_codec_generic]
do_one_initcall+0xa2/0x394
do_init_module+0xfd/0x370
load_module+0x38c6/0x3bd0
__do_sys_finit_module+0x11a/0x1b0
do_syscall_64+0x68/0x250
entry_SYSCALL_64_after_hwframe+0x49/0xbe

-> #0 (slab_mutex){+.+.}:
lock_acquire+0xbd/0x1d0
__mutex_lock+0xfc/0xb70
slab_attr_store+0x6d/0xe0
kernfs_fop_write+0x170/0x230
vfs_write+0xe1/0x240
ksys_write+0xba/0x150
do_syscall_64+0x68/0x250
entry_SYSCALL_64_after_hwframe+0x49/0xbe

other info that might help us debug this:

Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(kn->count#160);
                               lock(slab_mutex);
                               lock(kn->count#160);
  lock(slab_mutex);

*** DEADLOCK ***



Attached is a config and the full dmesg.

-- Steve


Attachments:
dmesg (92.13 kB)
config (128.40 kB)

2019-07-11 06:24:36

by Chris Wilson

Subject: Re: [BUG] lockdep splat with kernfs lockdep annotations and slab mutex from drm patch??

Quoting Steven Rostedt (2019-07-11 03:57:20)
> On Fri, 14 Jun 2019 08:38:37 -0700
> Tejun Heo <[email protected]> wrote:
>
> > Hello,
> >
> > On Fri, Jun 14, 2019 at 04:08:33PM +0100, Chris Wilson wrote:
> > > #ifdef CONFIG_MEMCG
> > > 	if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
> > > 		struct kmem_cache *c;
> > >
> > > 		mutex_lock(&slab_mutex);
> > >
> > > so it happens to hit the error + FULL case with the additional slabcaches?
> > >
> > > Anyway, according to lockdep, it is dangerous to use the slab_mutex inside
> > > slab_attr_store().
> >
> > Didn't really look into the code but it looks like slab_mutex is held
> > while trying to remove sysfs files. sysfs file removal flushes
> > on-going accesses, so if a file operation then tries to grab a mutex
> > which is held during removal, it leads to a deadlock.
> >
>
> Looks like this never got fixed and now this bug is in 5.2.

git blame gives

commit 107dab5c92d5f9c3afe962036e47c207363255c7
Author: Glauber Costa <[email protected]>
Date: Tue Dec 18 14:23:05 2012 -0800

    slub: slub-specific propagation changes

for adding the mutex underneath sysfs read, and I think

commit d50d82faa0c964e31f7a946ba8aba7c715ca7ab0
Author: Mikulas Patocka <[email protected]>
Date: Wed Jun 27 23:26:09 2018 -0700

    slub: fix failure when we delete and create a slab cache

added the sysfs removal underneath the slab_mutex.
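
So, spelling out the two edges lockdep is complaining about:

  kn->count ==> slab_mutex  (107dab5c92d5):
      kernfs_fop_write() holds the kn->count active reference
        -> slab_attr_store() takes slab_mutex to propagate to child caches

  slab_mutex ==> kn->count  (d50d82faa0c9):
      kmem_cache_create() holds slab_mutex
        -> sysfs_slab_add() -> kernfs_remove_by_name_ns(), which waits for
           kn->count to drain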

> Just got this:
>
> ======================================================
> WARNING: possible circular locking dependency detected
> 5.2.0-test #15 Not tainted
> ------------------------------------------------------
> slub_cpu_partia/899 is trying to acquire lock:
> 000000000f6f2dd7 (slab_mutex){+.+.}, at: slab_attr_store+0x6d/0xe0
>
> but task is already holding lock:
> 00000000b23ffe3d (kn->count#160){++++}, at: kernfs_fop_write+0x125/0x230
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #1 (kn->count#160){++++}:
> __kernfs_remove+0x413/0x4a0
> kernfs_remove_by_name_ns+0x40/0x80
> sysfs_slab_add+0x1b5/0x2f0
> __kmem_cache_create+0x511/0x560
> create_cache+0xcd/0x1f0
> kmem_cache_create_usercopy+0x18a/0x240
> kmem_cache_create+0x12/0x20
> is_active_nid+0xdb/0x230 [snd_hda_codec_generic]
> snd_hda_get_path_idx+0x55/0x80 [snd_hda_codec_generic]
> get_nid_path+0xc/0x170 [snd_hda_codec_generic]
> do_one_initcall+0xa2/0x394
> do_init_module+0xfd/0x370
> load_module+0x38c6/0x3bd0
> __do_sys_finit_module+0x11a/0x1b0
> do_syscall_64+0x68/0x250
> entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> -> #0 (slab_mutex){+.+.}:
> lock_acquire+0xbd/0x1d0
> __mutex_lock+0xfc/0xb70
> slab_attr_store+0x6d/0xe0
> kernfs_fop_write+0x170/0x230
> vfs_write+0xe1/0x240
> ksys_write+0xba/0x150
> do_syscall_64+0x68/0x250
> entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> other info that might help us debug this:
>
> Possible unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(kn->count#160);
>                                lock(slab_mutex);
>                                lock(kn->count#160);
>   lock(slab_mutex);
>
> *** DEADLOCK ***
>
>
>
> Attached is a config and the full dmesg.
>
> -- Steve
>

2019-07-11 21:34:00

by David Rientjes

Subject: Re: [BUG] lockdep splat with kernfs lockdep annotations and slab mutex from drm patch??

On Thu, 11 Jul 2019, Chris Wilson wrote:

> Quoting Steven Rostedt (2019-07-11 03:57:20)
> > On Fri, 14 Jun 2019 08:38:37 -0700
> > Tejun Heo <[email protected]> wrote:
> >
> > > Hello,
> > >
> > > On Fri, Jun 14, 2019 at 04:08:33PM +0100, Chris Wilson wrote:
> > > > #ifdef CONFIG_MEMCG
> > > > 	if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
> > > > 		struct kmem_cache *c;
> > > >
> > > > 		mutex_lock(&slab_mutex);
> > > >
> > > > so it happens to hit the error + FULL case with the additional slabcaches?
> > > >
> > > > Anyway, according to lockdep, it is dangerous to use the slab_mutex inside
> > > > slab_attr_store().
> > >
> > > Didn't really look into the code but it looks like slab_mutex is held
> > > while trying to remove sysfs files. sysfs file removal flushes
> > > on-going accesses, so if a file operation then tries to grab a mutex
> > > which is held during removal, it leads to a deadlock.
> > >
> >
> > Looks like this never got fixed and now this bug is in 5.2.
>
> git blame gives
>
> commit 107dab5c92d5f9c3afe962036e47c207363255c7
> Author: Glauber Costa <[email protected]>
> Date: Tue Dec 18 14:23:05 2012 -0800
>
> slub: slub-specific propagation changes
>
> for adding the mutex underneath sysfs read, and I think
>
> commit d50d82faa0c964e31f7a946ba8aba7c715ca7ab0
> Author: Mikulas Patocka <[email protected]>
> Date: Wed Jun 27 23:26:09 2018 -0700
>
> slub: fix failure when we delete and create a slab cache
>
> added the sysfs removal underneath the slab_mutex.
>
> > Just got this:
> >
> > ======================================================
> > WARNING: possible circular locking dependency detected
> > 5.2.0-test #15 Not tainted
> > ------------------------------------------------------
> > slub_cpu_partia/899 is trying to acquire lock:
> > 000000000f6f2dd7 (slab_mutex){+.+.}, at: slab_attr_store+0x6d/0xe0
> >
> > but task is already holding lock:
> > 00000000b23ffe3d (kn->count#160){++++}, at: kernfs_fop_write+0x125/0x230
> >
> > which lock already depends on the new lock.
> >
> >
> > the existing dependency chain (in reverse order) is:
> >
> > -> #1 (kn->count#160){++++}:
> > __kernfs_remove+0x413/0x4a0
> > kernfs_remove_by_name_ns+0x40/0x80
> > sysfs_slab_add+0x1b5/0x2f0
> > __kmem_cache_create+0x511/0x560
> > create_cache+0xcd/0x1f0
> > kmem_cache_create_usercopy+0x18a/0x240
> > kmem_cache_create+0x12/0x20
> > is_active_nid+0xdb/0x230 [snd_hda_codec_generic]
> > snd_hda_get_path_idx+0x55/0x80 [snd_hda_codec_generic]
> > get_nid_path+0xc/0x170 [snd_hda_codec_generic]
> > do_one_initcall+0xa2/0x394
> > do_init_module+0xfd/0x370
> > load_module+0x38c6/0x3bd0
> > __do_sys_finit_module+0x11a/0x1b0
> > do_syscall_64+0x68/0x250
> > entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >

Which slab cache is getting created here? I assume that sysfs_slab_add()
is only trying to do kernfs_remove_by_name_ns() because it's unmergeable
with other slab caches.
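
For reference, the unmergeable path in sysfs_slab_add() looks roughly like
this (paraphrasing mm/slub.c from memory, so the details may be off):

	if (unmergeable) {
		/*
		 * The cache can never be merged, so it uses its proper name;
		 * any stale symlink of that name is removed first.  This is
		 * the sysfs_remove_link() -> kernfs_remove_by_name_ns() call
		 * in the trace above, and it runs with slab_mutex held.
		 */
		sysfs_remove_link(&slab_kset->kobj, s->name);
		name = s->name;
	} else {
		/* mergeable caches get a unique id as the symlink target */
		name = create_unique_id(s);
	}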

> > -> #0 (slab_mutex){+.+.}:
> > lock_acquire+0xbd/0x1d0
> > __mutex_lock+0xfc/0xb70
> > slab_attr_store+0x6d/0xe0
> > kernfs_fop_write+0x170/0x230
> > vfs_write+0xe1/0x240
> > ksys_write+0xba/0x150
> > do_syscall_64+0x68/0x250
> > entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >
> > other info that might help us debug this:
> >
> > Possible unsafe locking scenario:
> >
> >        CPU0                    CPU1
> >        ----                    ----
> >   lock(kn->count#160);
> >                                lock(slab_mutex);
> >                                lock(kn->count#160);
> >   lock(slab_mutex);
> >
> > *** DEADLOCK ***
> >