2022-01-14 21:35:01

by Xin Long

[permalink] [raw]
Subject: [PATCH] mm: slub: fix a deadlock warning in kmem_cache_destroy

cpus_read_lock() is introduced into kmem_cache_destroy() by
commit 5a836bf6b09f ("mm: slub: move flush_cpu_slab() invocations
__free_slab() invocations out of IRQ context"), and it could cause
a deadlock.

As Antoine pointed out, when one thread calls kmem_cache_destroy(), it is
blocking until kn->active becomes 0 in kernfs_drain() after holding
cpu_hotplug_lock. While in another thread, when calling kernfs_fop_write(),
it may try to hold cpu_hotplug_lock after incrementing kn->active by
calling kernfs_get_active():

CPU0 CPU1
---- ----
cpus_read_lock()
kn->active++
cpus_read_lock() [a]
wait until kn->active == 0

Although cpu_hotplug_lock is a RWSEM, [a] will not block in there. But as
lockdep annotations are added for cpu_hotplug_lock, a deadlock warning
would be detected:

======================================================
WARNING: possible circular locking dependency detected
------------------------------------------------------
dmsetup/1832 is trying to acquire lock:
ffff986f5a0f9f20 (kn->count#144){++++}-{0:0}, at: kernfs_remove+0x1d/0x30

but task is already holding lock:
ffffffffa43817c0 (slab_mutex){+.+.}-{3:3}, at: kmem_cache_destroy+0x2a/0x120

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (slab_mutex){+.+.}-{3:3}:
lock_acquire+0xe8/0x470
mutex_lock_nested+0x47/0x80
kmem_cache_destroy+0x2a/0x120
bioset_exit+0xb5/0x100
cleanup_mapped_device+0x26/0xf0 [dm_mod]
free_dev+0x43/0xb0 [dm_mod]
__dm_destroy+0x153/0x1b0 [dm_mod]
dev_remove+0xe4/0x1a0 [dm_mod]
ctl_ioctl+0x1af/0x3f0 [dm_mod]
dm_ctl_ioctl+0xa/0x10 [dm_mod]
do_vfs_ioctl+0xa5/0x760
ksys_ioctl+0x60/0x90
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x8c/0x240
entry_SYSCALL_64_after_hwframe+0x6a/0xdf

-> #1 (cpu_hotplug_lock){++++}-{0:0}:
lock_acquire+0xe8/0x470
cpus_read_lock+0x39/0x100
cpu_partial_store+0x44/0x80
slab_attr_store+0x20/0x30
kernfs_fop_write+0x101/0x1b0
vfs_write+0xd4/0x1e0
ksys_write+0x52/0xc0
do_syscall_64+0x8c/0x240
entry_SYSCALL_64_after_hwframe+0x6a/0xdf

-> #0 (kn->count#144){++++}-{0:0}:
check_prevs_add+0x185/0xb80
__lock_acquire+0xd8f/0xe90
lock_acquire+0xe8/0x470
__kernfs_remove+0x25e/0x320
kernfs_remove+0x1d/0x30
kobject_del+0x28/0x60
kmem_cache_destroy+0xf1/0x120
bioset_exit+0xb5/0x100
cleanup_mapped_device+0x26/0xf0 [dm_mod]
free_dev+0x43/0xb0 [dm_mod]
__dm_destroy+0x153/0x1b0 [dm_mod]
dev_remove+0xe4/0x1a0 [dm_mod]
ctl_ioctl+0x1af/0x3f0 [dm_mod]
dm_ctl_ioctl+0xa/0x10 [dm_mod]
do_vfs_ioctl+0xa5/0x760
ksys_ioctl+0x60/0x90
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x8c/0x240
entry_SYSCALL_64_after_hwframe+0x6a/0xdf

other info that might help us debug this:

Chain exists of:
kn->count#144 --> cpu_hotplug_lock --> slab_mutex

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(slab_mutex);
lock(cpu_hotplug_lock);
lock(slab_mutex);
lock(kn->count#144);

*** DEADLOCK ***

3 locks held by dmsetup/1832:
#0: ffffffffa43fe5c0 (bio_slab_lock){+.+.}-{3:3}, at: bioset_exit+0x62/0x100
#1: ffffffffa3e87c20 (cpu_hotplug_lock){++++}-{0:0}, at: kmem_cache_destroy+0x1c/0x120
#2: ffffffffa43817c0 (slab_mutex){+.+.}-{3:3}, at: kmem_cache_destroy+0x2a/0x120

stack backtrace:
Call Trace:
dump_stack+0x5c/0x80
check_noncircular+0xff/0x120
check_prevs_add+0x185/0xb80
__lock_acquire+0xd8f/0xe90
lock_acquire+0xe8/0x470
__kernfs_remove+0x25e/0x320
kernfs_remove+0x1d/0x30
kobject_del+0x28/0x60
kmem_cache_destroy+0xf1/0x120
bioset_exit+0xb5/0x100
cleanup_mapped_device+0x26/0xf0 [dm_mod]
free_dev+0x43/0xb0 [dm_mod]
__dm_destroy+0x153/0x1b0 [dm_mod]
dev_remove+0xe4/0x1a0 [dm_mod]
ctl_ioctl+0x1af/0x3f0 [dm_mod]
dm_ctl_ioctl+0xa/0x10 [dm_mod]
do_vfs_ioctl+0xa5/0x760
ksys_ioctl+0x60/0x90
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x8c/0x240
entry_SYSCALL_64_after_hwframe+0x6a/0xdf

Since cpus_read_lock() is supposed to protect the cpu related data, it
makes sense to fix this issue by moving cpus_read_lock() from
kmem_cache_destroy() to __kmem_cache_shutdown(). While at it,
add the missing cpus_read_lock() in slab_mem_going_offline_callback().

Fixes: 5a836bf6b09f ("mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context")
Signed-off-by: Xin Long <[email protected]>
---
mm/slab_common.c | 2 --
mm/slub.c | 4 ++--
2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index e5d080a93009..06ec3fa585e6 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -494,7 +494,6 @@ void kmem_cache_destroy(struct kmem_cache *s)
if (unlikely(!s))
return;

- cpus_read_lock();
mutex_lock(&slab_mutex);

s->refcount--;
@@ -509,7 +508,6 @@ void kmem_cache_destroy(struct kmem_cache *s)
}
out_unlock:
mutex_unlock(&slab_mutex);
- cpus_read_unlock();
}
EXPORT_SYMBOL(kmem_cache_destroy);

diff --git a/mm/slub.c b/mm/slub.c
index abe7db581d68..754f020235ee 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4311,7 +4311,7 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
int node;
struct kmem_cache_node *n;

- flush_all_cpus_locked(s);
+ flush_all(s);
/* Attempt to free all objects */
for_each_kmem_cache_node(s, node, n) {
free_partial(s, n);
@@ -4646,7 +4646,7 @@ static int slab_mem_going_offline_callback(void *arg)

mutex_lock(&slab_mutex);
list_for_each_entry(s, &slab_caches, list) {
- flush_all_cpus_locked(s);
+ flush_all(s);
__kmem_cache_do_shrink(s);
}
mutex_unlock(&slab_mutex);
--
2.27.0


2022-01-17 05:08:32

by Hyeonggon Yoo

[permalink] [raw]
Subject: Re: [PATCH] mm: slub: fix a deadlock warning in kmem_cache_destroy

On Fri, Jan 14, 2022 at 09:23:16AM -0500, Xin Long wrote:
> cpus_read_lock() is introduced into kmem_cache_destroy() by
> commit 5a836bf6b09f ("mm: slub: move flush_cpu_slab() invocations
> __free_slab() invocations out of IRQ context"), and it could cause
> a deadlock.
>
> As Antoine pointed out, when one thread calls kmem_cache_destroy(), it is
> blocking until kn->active becomes 0 in kernfs_drain() after holding
> cpu_hotplug_lock. While in another thread, when calling kernfs_fop_write(),
> it may try to hold cpu_hotplug_lock after incrementing kn->active by
> calling kernfs_get_active():
>

Hello. can you give me a link of related the thread?

> CPU0 CPU1
> ---- ----
> cpus_read_lock()
> kn->active++
> cpus_read_lock() [a]
> wait until kn->active == 0
>
> Although cpu_hotplug_lock is a RWSEM, [a] will not block in there. But as
> lockdep annotations are added for cpu_hotplug_lock, a deadlock warning
> would be detected:
>
> ======================================================
> WARNING: possible circular locking dependency detected
> ------------------------------------------------------
> dmsetup/1832 is trying to acquire lock:
> ffff986f5a0f9f20 (kn->count#144){++++}-{0:0}, at: kernfs_remove+0x1d/0x30
>
> but task is already holding lock:
> ffffffffa43817c0 (slab_mutex){+.+.}-{3:3}, at: kmem_cache_destroy+0x2a/0x120
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
>
> -> #2 (slab_mutex){+.+.}-{3:3}:
> lock_acquire+0xe8/0x470
> mutex_lock_nested+0x47/0x80
> kmem_cache_destroy+0x2a/0x120
> bioset_exit+0xb5/0x100
> cleanup_mapped_device+0x26/0xf0 [dm_mod]
> free_dev+0x43/0xb0 [dm_mod]
> __dm_destroy+0x153/0x1b0 [dm_mod]
> dev_remove+0xe4/0x1a0 [dm_mod]
> ctl_ioctl+0x1af/0x3f0 [dm_mod]
> dm_ctl_ioctl+0xa/0x10 [dm_mod]
> do_vfs_ioctl+0xa5/0x760
> ksys_ioctl+0x60/0x90
> __x64_sys_ioctl+0x16/0x20
> do_syscall_64+0x8c/0x240
> entry_SYSCALL_64_after_hwframe+0x6a/0xdf
>
> -> #1 (cpu_hotplug_lock){++++}-{0:0}:
> lock_acquire+0xe8/0x470
> cpus_read_lock+0x39/0x100
> cpu_partial_store+0x44/0x80
> slab_attr_store+0x20/0x30
> kernfs_fop_write+0x101/0x1b0
> vfs_write+0xd4/0x1e0
> ksys_write+0x52/0xc0
> do_syscall_64+0x8c/0x240
> entry_SYSCALL_64_after_hwframe+0x6a/0xdf
>
> -> #0 (kn->count#144){++++}-{0:0}:
> check_prevs_add+0x185/0xb80
> __lock_acquire+0xd8f/0xe90
> lock_acquire+0xe8/0x470
> __kernfs_remove+0x25e/0x320
> kernfs_remove+0x1d/0x30
> kobject_del+0x28/0x60
> kmem_cache_destroy+0xf1/0x120
> bioset_exit+0xb5/0x100
> cleanup_mapped_device+0x26/0xf0 [dm_mod]
> free_dev+0x43/0xb0 [dm_mod]
> __dm_destroy+0x153/0x1b0 [dm_mod]
> dev_remove+0xe4/0x1a0 [dm_mod]
> ctl_ioctl+0x1af/0x3f0 [dm_mod]
> dm_ctl_ioctl+0xa/0x10 [dm_mod]
> do_vfs_ioctl+0xa5/0x760
> ksys_ioctl+0x60/0x90
> __x64_sys_ioctl+0x16/0x20
> do_syscall_64+0x8c/0x240
> entry_SYSCALL_64_after_hwframe+0x6a/0xdf
>
> other info that might help us debug this:
>
> Chain exists of:
> kn->count#144 --> cpu_hotplug_lock --> slab_mutex
>
> Possible unsafe locking scenario:
>
> CPU0 CPU1
> ---- ----
> lock(slab_mutex);
> lock(cpu_hotplug_lock);
> lock(slab_mutex);
> lock(kn->count#144);
>
> *** DEADLOCK ***
>
> 3 locks held by dmsetup/1832:
> #0: ffffffffa43fe5c0 (bio_slab_lock){+.+.}-{3:3}, at: bioset_exit+0x62/0x100
> #1: ffffffffa3e87c20 (cpu_hotplug_lock){++++}-{0:0}, at: kmem_cache_destroy+0x1c/0x120
> #2: ffffffffa43817c0 (slab_mutex){+.+.}-{3:3}, at: kmem_cache_destroy+0x2a/0x120
>
> stack backtrace:
> Call Trace:
> dump_stack+0x5c/0x80
> check_noncircular+0xff/0x120
> check_prevs_add+0x185/0xb80
> __lock_acquire+0xd8f/0xe90
> lock_acquire+0xe8/0x470
> __kernfs_remove+0x25e/0x320
> kernfs_remove+0x1d/0x30
> kobject_del+0x28/0x60
> kmem_cache_destroy+0xf1/0x120
> bioset_exit+0xb5/0x100
> cleanup_mapped_device+0x26/0xf0 [dm_mod]
> free_dev+0x43/0xb0 [dm_mod]
> __dm_destroy+0x153/0x1b0 [dm_mod]
> dev_remove+0xe4/0x1a0 [dm_mod]
> ctl_ioctl+0x1af/0x3f0 [dm_mod]
> dm_ctl_ioctl+0xa/0x10 [dm_mod]
> do_vfs_ioctl+0xa5/0x760
> ksys_ioctl+0x60/0x90
> __x64_sys_ioctl+0x16/0x20
> do_syscall_64+0x8c/0x240
> entry_SYSCALL_64_after_hwframe+0x6a/0xdf
>

To summary the possible scenario is:

- when cache is destroyed:
cpu_hotplug_lock
-> slab_mutex
-> wait until kn->count == 0 (because it removes sysfs objects.)

- when someone writes to cpu_partial attribute:
increase kn->count (incrased in kernfs_fop_write_iter(),
using kernfs_get_active() )
-> cpu_hotplug_lock
-> slab_mutex

... So there is a circular dependency when using kernfs because
clearing sysfs stuff in kernfs makes unexpected dependency. Right?

I think it's quite unlikely but yeah, seems possible.

> Since cpus_read_lock() is supposed to protect the cpu related data, it
> makes sense to fix this issue by moving cpus_read_lock() from
> kmem_cache_destroy() to __kmem_cache_shutdown(). While at it,
> add the missing cpus_read_lock() in slab_mem_going_offline_callback().
>
> Fixes: 5a836bf6b09f ("mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context")
> Signed-off-by: Xin Long <[email protected]>
> ---
> mm/slab_common.c | 2 --
> mm/slub.c | 4 ++--
> 2 files changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index e5d080a93009..06ec3fa585e6 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -494,7 +494,6 @@ void kmem_cache_destroy(struct kmem_cache *s)
> if (unlikely(!s))
> return;
>
> - cpus_read_lock();
> mutex_lock(&slab_mutex);
>
> s->refcount--;
> @@ -509,7 +508,6 @@ void kmem_cache_destroy(struct kmem_cache *s)
> }
> out_unlock:
> mutex_unlock(&slab_mutex);
> - cpus_read_unlock();
> }

This code is changing lock order
from cpu_hotplug_lock -> slab_muitex
to slab_mutex -> cpu_hotplug_lock.

> EXPORT_SYMBOL(kmem_cache_destroy);
>
> diff --git a/mm/slub.c b/mm/slub.c
> index abe7db581d68..754f020235ee 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4311,7 +4311,7 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
> int node;
> struct kmem_cache_node *n;
>
> - flush_all_cpus_locked(s);
> + flush_all(s);
> /* Attempt to free all objects */
> for_each_kmem_cache_node(s, node, n) {
> free_partial(s, n);
> @@ -4646,7 +4646,7 @@ static int slab_mem_going_offline_callback(void *arg)
>
> mutex_lock(&slab_mutex);
> list_for_each_entry(s, &slab_caches, list) {
> - flush_all_cpus_locked(s);
> + flush_all(s);

In My Opinion, this code is wrong. Because it's called when memory
offlining with cpu_hotplug_lock held. See function offline_pages()
in mm/memory_hotplug.c for details.

it first holds cpu_hoplug_lock by calling mem_hotplug_begin(),
and notifies memory_chain. (so slab_mem_going_offline_callback is
called.)

I think this patch will make another possible deadlock scenario.

in memory hotplugging: cpu_hotplug_lock -> slab_mutex -> cpu_hotplug_lock
in slab cache destroying: slab_mutex -> cpu_hotplug_lock

Thanks!,
Hyeonggon.

> __kmem_cache_do_shrink(s);
> }
> mutex_unlock(&slab_mutex);
> --
> 2.27.0
>
>

2022-01-17 17:01:12

by Xin Long

[permalink] [raw]
Subject: Re: [PATCH] mm: slub: fix a deadlock warning in kmem_cache_destroy

On Sun, Jan 16, 2022 at 2:35 PM Hyeonggon Yoo <[email protected]> wrote:
>
> On Fri, Jan 14, 2022 at 09:23:16AM -0500, Xin Long wrote:
> > cpus_read_lock() is introduced into kmem_cache_destroy() by
> > commit 5a836bf6b09f ("mm: slub: move flush_cpu_slab() invocations
> > __free_slab() invocations out of IRQ context"), and it could cause
> > a deadlock.
> >
> > As Antoine pointed out, when one thread calls kmem_cache_destroy(), it is
> > blocking until kn->active becomes 0 in kernfs_drain() after holding
> > cpu_hotplug_lock. While in another thread, when calling kernfs_fop_write(),
> > it may try to hold cpu_hotplug_lock after incrementing kn->active by
> > calling kernfs_get_active():
> >
>
> Hello. can you give me a link of related the thread?
Sorry, I don't have a thread on the internet, but I think the changelog
has provided all the information we have.

Just note that It was reproduced in the RHEL-8 RT Kernel after we fixed
another issue. From the code analysis, this issue does exist on the
upstream kernel, though I couldn't build an upstream RT kernel for the
testing.

>
> > CPU0 CPU1
> > ---- ----
> > cpus_read_lock()
> > kn->active++
> > cpus_read_lock() [a]
> > wait until kn->active == 0
> >
> > Although cpu_hotplug_lock is a RWSEM, [a] will not block in there. But as
> > lockdep annotations are added for cpu_hotplug_lock, a deadlock warning
> > would be detected:
> >
> > ======================================================
> > WARNING: possible circular locking dependency detected
> > ------------------------------------------------------
> > dmsetup/1832 is trying to acquire lock:
> > ffff986f5a0f9f20 (kn->count#144){++++}-{0:0}, at: kernfs_remove+0x1d/0x30
> >
> > but task is already holding lock:
> > ffffffffa43817c0 (slab_mutex){+.+.}-{3:3}, at: kmem_cache_destroy+0x2a/0x120
> >
> > which lock already depends on the new lock.
> >
> > the existing dependency chain (in reverse order) is:
> >
> > -> #2 (slab_mutex){+.+.}-{3:3}:
> > lock_acquire+0xe8/0x470
> > mutex_lock_nested+0x47/0x80
> > kmem_cache_destroy+0x2a/0x120
> > bioset_exit+0xb5/0x100
> > cleanup_mapped_device+0x26/0xf0 [dm_mod]
> > free_dev+0x43/0xb0 [dm_mod]
> > __dm_destroy+0x153/0x1b0 [dm_mod]
> > dev_remove+0xe4/0x1a0 [dm_mod]
> > ctl_ioctl+0x1af/0x3f0 [dm_mod]
> > dm_ctl_ioctl+0xa/0x10 [dm_mod]
> > do_vfs_ioctl+0xa5/0x760
> > ksys_ioctl+0x60/0x90
> > __x64_sys_ioctl+0x16/0x20
> > do_syscall_64+0x8c/0x240
> > entry_SYSCALL_64_after_hwframe+0x6a/0xdf
> >
> > -> #1 (cpu_hotplug_lock){++++}-{0:0}:
> > lock_acquire+0xe8/0x470
> > cpus_read_lock+0x39/0x100
> > cpu_partial_store+0x44/0x80
> > slab_attr_store+0x20/0x30
> > kernfs_fop_write+0x101/0x1b0
> > vfs_write+0xd4/0x1e0
> > ksys_write+0x52/0xc0
> > do_syscall_64+0x8c/0x240
> > entry_SYSCALL_64_after_hwframe+0x6a/0xdf
> >
> > -> #0 (kn->count#144){++++}-{0:0}:
> > check_prevs_add+0x185/0xb80
> > __lock_acquire+0xd8f/0xe90
> > lock_acquire+0xe8/0x470
> > __kernfs_remove+0x25e/0x320
> > kernfs_remove+0x1d/0x30
> > kobject_del+0x28/0x60
> > kmem_cache_destroy+0xf1/0x120
> > bioset_exit+0xb5/0x100
> > cleanup_mapped_device+0x26/0xf0 [dm_mod]
> > free_dev+0x43/0xb0 [dm_mod]
> > __dm_destroy+0x153/0x1b0 [dm_mod]
> > dev_remove+0xe4/0x1a0 [dm_mod]
> > ctl_ioctl+0x1af/0x3f0 [dm_mod]
> > dm_ctl_ioctl+0xa/0x10 [dm_mod]
> > do_vfs_ioctl+0xa5/0x760
> > ksys_ioctl+0x60/0x90
> > __x64_sys_ioctl+0x16/0x20
> > do_syscall_64+0x8c/0x240
> > entry_SYSCALL_64_after_hwframe+0x6a/0xdf
> >
> > other info that might help us debug this:
> >
> > Chain exists of:
> > kn->count#144 --> cpu_hotplug_lock --> slab_mutex
> >
> > Possible unsafe locking scenario:
> >
> > CPU0 CPU1
> > ---- ----
> > lock(slab_mutex);
> > lock(cpu_hotplug_lock);
> > lock(slab_mutex);
> > lock(kn->count#144);
> >
> > *** DEADLOCK ***
> >
> > 3 locks held by dmsetup/1832:
> > #0: ffffffffa43fe5c0 (bio_slab_lock){+.+.}-{3:3}, at: bioset_exit+0x62/0x100
> > #1: ffffffffa3e87c20 (cpu_hotplug_lock){++++}-{0:0}, at: kmem_cache_destroy+0x1c/0x120
> > #2: ffffffffa43817c0 (slab_mutex){+.+.}-{3:3}, at: kmem_cache_destroy+0x2a/0x120
> >
> > stack backtrace:
> > Call Trace:
> > dump_stack+0x5c/0x80
> > check_noncircular+0xff/0x120
> > check_prevs_add+0x185/0xb80
> > __lock_acquire+0xd8f/0xe90
> > lock_acquire+0xe8/0x470
> > __kernfs_remove+0x25e/0x320
> > kernfs_remove+0x1d/0x30
> > kobject_del+0x28/0x60
> > kmem_cache_destroy+0xf1/0x120
> > bioset_exit+0xb5/0x100
> > cleanup_mapped_device+0x26/0xf0 [dm_mod]
> > free_dev+0x43/0xb0 [dm_mod]
> > __dm_destroy+0x153/0x1b0 [dm_mod]
> > dev_remove+0xe4/0x1a0 [dm_mod]
> > ctl_ioctl+0x1af/0x3f0 [dm_mod]
> > dm_ctl_ioctl+0xa/0x10 [dm_mod]
> > do_vfs_ioctl+0xa5/0x760
> > ksys_ioctl+0x60/0x90
> > __x64_sys_ioctl+0x16/0x20
> > do_syscall_64+0x8c/0x240
> > entry_SYSCALL_64_after_hwframe+0x6a/0xdf
> >
>
> To summary the possible scenario is:
>
> - when cache is destroyed:
> cpu_hotplug_lock
> -> slab_mutex
> -> wait until kn->count == 0 (because it removes sysfs objects.)
not really wait for kn->count == 0, but wait for kn->active in kernfs_drain():

/* but everyone should wait for draining */
wait_event(root->deactivate_waitq,
atomic_read(&kn->active) == KN_DEACTIVATED_BIAS);

>
> - when someone writes to cpu_partial attribute:
> increase kn->count (incrased in kernfs_fop_write_iter(),
> using kernfs_get_active() )
and yes, called kernfs_get_active() to hold/increment kn->active.

> -> cpu_hotplug_lock
> -> slab_mutex
>
> ... So there is a circular dependency when using kernfs because
> clearing sysfs stuff in kernfs makes unexpected dependency. Right?
So it is:

CPU0 CPU1
---- ----
cpus_read_lock() in kmem_cache_destroy()
kn->active++ in kernfs_get_active()
cpus_read_lock() in
cpu_partial_store()->flush_all()
wait until kn->active == 0 in kernfs_drain()


>
> I think it's quite unlikely but yeah, seems possible.
Interesting that it could be easily reproduced on the RT kernel.

>
> > Since cpus_read_lock() is supposed to protect the cpu related data, it
> > makes sense to fix this issue by moving cpus_read_lock() from
> > kmem_cache_destroy() to __kmem_cache_shutdown(). While at it,
> > add the missing cpus_read_lock() in slab_mem_going_offline_callback().
> >
> > Fixes: 5a836bf6b09f ("mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context")
> > Signed-off-by: Xin Long <[email protected]>
> > ---
> > mm/slab_common.c | 2 --
> > mm/slub.c | 4 ++--
> > 2 files changed, 2 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/slab_common.c b/mm/slab_common.c
> > index e5d080a93009..06ec3fa585e6 100644
> > --- a/mm/slab_common.c
> > +++ b/mm/slab_common.c
> > @@ -494,7 +494,6 @@ void kmem_cache_destroy(struct kmem_cache *s)
> > if (unlikely(!s))
> > return;
> >
> > - cpus_read_lock();
> > mutex_lock(&slab_mutex);
> >
> > s->refcount--;
> > @@ -509,7 +508,6 @@ void kmem_cache_destroy(struct kmem_cache *s)
> > }
> > out_unlock:
> > mutex_unlock(&slab_mutex);
> > - cpus_read_unlock();
> > }
>
> This code is changing lock order
> from cpu_hotplug_lock -> slab_muitex
> to slab_mutex -> cpu_hotplug_lock.
Right.

>
> > EXPORT_SYMBOL(kmem_cache_destroy);
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index abe7db581d68..754f020235ee 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -4311,7 +4311,7 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
> > int node;
> > struct kmem_cache_node *n;
> >
> > - flush_all_cpus_locked(s);
> > + flush_all(s);
> > /* Attempt to free all objects */
> > for_each_kmem_cache_node(s, node, n) {
> > free_partial(s, n);
> > @@ -4646,7 +4646,7 @@ static int slab_mem_going_offline_callback(void *arg)
> >
> > mutex_lock(&slab_mutex);
> > list_for_each_entry(s, &slab_caches, list) {
> > - flush_all_cpus_locked(s);
> > + flush_all(s);
>
> In My Opinion, this code is wrong. Because it's called when memory
> offlining with cpu_hotplug_lock held. See function offline_pages()
> in mm/memory_hotplug.c for details.
>
> it first holds cpu_hoplug_lock by calling mem_hotplug_begin(),
> and notifies memory_chain. (so slab_mem_going_offline_callback is
> called.)
>
> I think this patch will make another possible deadlock scenario.
>
> in memory hotplugging: cpu_hotplug_lock -> slab_mutex -> cpu_hotplug_lock
> in slab cache destroying: slab_mutex -> cpu_hotplug_lock
Now I understand why cpus_read_lock() was called in kmem_cache_destroy().
I have to think about fixing it in a better way.

Thanks for the review.

>
> Thanks!,
> Hyeonggon.
>
> > __kmem_cache_do_shrink(s);
> > }
> > mutex_unlock(&slab_mutex);
> > --
> > 2.27.0
> >
> >

Subject: Re: [PATCH] mm: slub: fix a deadlock warning in kmem_cache_destroy

On 2022-01-17 16:32:46 [+0800], Xin Long wrote:
> another issue. From the code analysis, this issue does exist on the
> upstream kernel, though I couldn't build an upstream RT kernel for the
> testing.

This should also reproduce in v5.16 since the commit in question is
there.

> > > CPU0 CPU1
> > > ---- ----
> > > cpus_read_lock()
> > > kn->active++
> > > cpus_read_lock() [a]
> > > wait until kn->active == 0
> > >
> > > Although cpu_hotplug_lock is a RWSEM, [a] will not block in there. But as
> > > lockdep annotations are added for cpu_hotplug_lock, a deadlock warning
> > > would be detected:

The cpu_hotplug_lock is a per-CPU RWSEM. The lock in [a] will block if
there is a writer pending.

> > > ======================================================
> > > WARNING: possible circular locking dependency detected
> > > ------------------------------------------------------
> > > dmsetup/1832 is trying to acquire lock:
> > > ffff986f5a0f9f20 (kn->count#144){++++}-{0:0}, at: kernfs_remove+0x1d/0x30
> > >
> > > but task is already holding lock:
> > > ffffffffa43817c0 (slab_mutex){+.+.}-{3:3}, at: kmem_cache_destroy+0x2a/0x120
> > >

I tried to create & destroy a cryptarget which creates/destroy a cache
via bio_put_slab(). Either the callchain is different or something else
is but I didn't see a lockdep warning.

Sebastian

2022-01-18 02:24:43

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH] mm: slub: fix a deadlock warning in kmem_cache_destroy

+CC Clark

On 1/17/22 10:33, Sebastian Andrzej Siewior wrote:
> On 2022-01-17 16:32:46 [+0800], Xin Long wrote:
>> another issue. From the code analysis, this issue does exist on the
>> upstream kernel, though I couldn't build an upstream RT kernel for the
>> testing.
>
> This should also reproduce in v5.16 since the commit in question is
> there.

Yeah. I remember we had some issues with the commit during development, but
I'd hope those were resolved and the commit that's ultimately merged got the
fixes, see this subthread:

https://lore.kernel.org/all/[email protected]/

>> > > CPU0 CPU1
>> > > ---- ----
>> > > cpus_read_lock()
>> > > kn->active++
>> > > cpus_read_lock() [a]
>> > > wait until kn->active == 0
>> > >
>> > > Although cpu_hotplug_lock is a RWSEM, [a] will not block in there. But as
>> > > lockdep annotations are added for cpu_hotplug_lock, a deadlock warning
>> > > would be detected:
>
> The cpu_hotplug_lock is a per-CPU RWSEM. The lock in [a] will block if
> there is a writer pending.
>
>> > > ======================================================
>> > > WARNING: possible circular locking dependency detected
>> > > ------------------------------------------------------
>> > > dmsetup/1832 is trying to acquire lock:
>> > > ffff986f5a0f9f20 (kn->count#144){++++}-{0:0}, at: kernfs_remove+0x1d/0x30
>> > >
>> > > but task is already holding lock:
>> > > ffffffffa43817c0 (slab_mutex){+.+.}-{3:3}, at: kmem_cache_destroy+0x2a/0x120
>> > >
>
> I tried to create & destroy a cryptarget which creates/destroy a cache
> via bio_put_slab(). Either the callchain is different or something else
> is but I didn't see a lockdep warning.

RHEL-8 kernel seems to be 4.18, unless RT uses a newer one. Could be some
silently relevant backport is missing? How about e.g. 59450bbc12be ("mm,
slab, slub: stop taking cpu hotplug lock") ?

> Sebastian

2022-01-18 02:26:50

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH] mm: slub: fix a deadlock warning in kmem_cache_destroy

Hi,

On 17/01/22 13:40, Vlastimil Babka wrote:
> +CC Clark
>
> On 1/17/22 10:33, Sebastian Andrzej Siewior wrote:
> > On 2022-01-17 16:32:46 [+0800], Xin Long wrote:
> >> another issue. From the code analysis, this issue does exist on the
> >> upstream kernel, though I couldn't build an upstream RT kernel for the
> >> testing.
> >
> > This should also reproduce in v5.16 since the commit in question is
> > there.
>
> Yeah. I remember we had some issues with the commit during development, but
> I'd hope those were resolved and the commit that's ultimately merged got the
> fixes, see this subthread:
>
> https://lore.kernel.org/all/[email protected]/
>
> >> > > CPU0 CPU1
> >> > > ---- ----
> >> > > cpus_read_lock()
> >> > > kn->active++
> >> > > cpus_read_lock() [a]
> >> > > wait until kn->active == 0
> >> > >
> >> > > Although cpu_hotplug_lock is a RWSEM, [a] will not block in there. But as
> >> > > lockdep annotations are added for cpu_hotplug_lock, a deadlock warning
> >> > > would be detected:
> >
> > The cpu_hotplug_lock is a per-CPU RWSEM. The lock in [a] will block if
> > there is a writer pending.
> >
> >> > > ======================================================
> >> > > WARNING: possible circular locking dependency detected
> >> > > ------------------------------------------------------
> >> > > dmsetup/1832 is trying to acquire lock:
> >> > > ffff986f5a0f9f20 (kn->count#144){++++}-{0:0}, at: kernfs_remove+0x1d/0x30
> >> > >
> >> > > but task is already holding lock:
> >> > > ffffffffa43817c0 (slab_mutex){+.+.}-{3:3}, at: kmem_cache_destroy+0x2a/0x120
> >> > >
> >
> > I tried to create & destroy a cryptarget which creates/destroy a cache
> > via bio_put_slab(). Either the callchain is different or something else
> > is but I didn't see a lockdep warning.
>
> RHEL-8 kernel seems to be 4.18, unless RT uses a newer one. Could be some
> silently relevant backport is missing? How about e.g. 59450bbc12be ("mm,
> slab, slub: stop taking cpu hotplug lock") ?

Hummm, looks like we have backported commit 59450bbc12be in RHEL-8.

Xin Long, would you be able to check if you still see the lockdep splat
with latest upstream RT?

git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git linux-5.16.y-rt

Thanks!
Juri

2022-01-19 22:05:35

by Xin Long

[permalink] [raw]
Subject: Re: [PATCH] mm: slub: fix a deadlock warning in kmem_cache_destroy

On Mon, Jan 17, 2022 at 9:13 PM Juri Lelli <[email protected]> wrote:
>
> Hi,
>
> On 17/01/22 13:40, Vlastimil Babka wrote:
> > +CC Clark
> >
> > On 1/17/22 10:33, Sebastian Andrzej Siewior wrote:
> > > On 2022-01-17 16:32:46 [+0800], Xin Long wrote:
> > >> another issue. From the code analysis, this issue does exist on the
> > >> upstream kernel, though I couldn't build an upstream RT kernel for the
> > >> testing.
> > >
> > > This should also reproduce in v5.16 since the commit in question is
> > > there.
> >
> > Yeah. I remember we had some issues with the commit during development, but
> > I'd hope those were resolved and the commit that's ultimately merged got the
> > fixes, see this subthread:
> >
> > https://lore.kernel.org/all/[email protected]/
> >
> > >> > > CPU0 CPU1
> > >> > > ---- ----
> > >> > > cpus_read_lock()
> > >> > > kn->active++
> > >> > > cpus_read_lock() [a]
> > >> > > wait until kn->active == 0
> > >> > >
> > >> > > Although cpu_hotplug_lock is a RWSEM, [a] will not block in there. But as
> > >> > > lockdep annotations are added for cpu_hotplug_lock, a deadlock warning
> > >> > > would be detected:
> > >
> > > The cpu_hotplug_lock is a per-CPU RWSEM. The lock in [a] will block if
> > > there is a writer pending.
> > >
> > >> > > ======================================================
> > >> > > WARNING: possible circular locking dependency detected
> > >> > > ------------------------------------------------------
> > >> > > dmsetup/1832 is trying to acquire lock:
> > >> > > ffff986f5a0f9f20 (kn->count#144){++++}-{0:0}, at: kernfs_remove+0x1d/0x30
> > >> > >
> > >> > > but task is already holding lock:
> > >> > > ffffffffa43817c0 (slab_mutex){+.+.}-{3:3}, at: kmem_cache_destroy+0x2a/0x120
> > >> > >
> > >
> > > I tried to create & destroy a cryptarget which creates/destroy a cache
> > > via bio_put_slab(). Either the callchain is different or something else
> > > is but I didn't see a lockdep warning.
> >
> > RHEL-8 kernel seems to be 4.18, unless RT uses a newer one. Could be some
> > silently relevant backport is missing? How about e.g. 59450bbc12be ("mm,
> > slab, slub: stop taking cpu hotplug lock") ?
>
> Hummm, looks like we have backported commit 59450bbc12be in RHEL-8.
>
> Xin Long, would you be able to check if you still see the lockdep splat
> with latest upstream RT?
>
> git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git linux-5.16.y-rt
Hi, Juri,

Thanks for sharing the RT kernel repo.

I just tried with this kernel, and I couldn't reproduce it on my env.
But I don't see how the upstream RT kernel can avoid the call trace.

As this warning was triggered when the system was shutting down, it might
not be reproduced on it due to some timing change.

>
> Thanks!
> Juri
>

2022-01-20 07:41:15

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH] mm: slub: fix a deadlock warning in kmem_cache_destroy


On 1/18/22 09:00, Xin Long wrote:
> On Mon, Jan 17, 2022 at 9:13 PM Juri Lelli <[email protected]> wrote:
>> >
>> > RHEL-8 kernel seems to be 4.18, unless RT uses a newer one. Could be some
>> > silently relevant backport is missing? How about e.g. 59450bbc12be ("mm,
>> > slab, slub: stop taking cpu hotplug lock") ?
>>
>> Hummm, looks like we have backported commit 59450bbc12be in RHEL-8.
>>
>> Xin Long, would you be able to check if you still see the lockdep splat
>> with latest upstream RT?
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git linux-5.16.y-rt
> Hi, Juri,
>
> Thanks for sharing the RT kernel repo.
>
> I just tried with this kernel, and I couldn't reproduce it on my env.
> But I don't see how the upstream RT kernel can avoid the call trace.
>
> As this warning was triggered when the system was shutting down, it might
> not be reproduced on it due to some timing change.

As it was caught by lockdep and not as a real deadlock, I think it should be
indepenedent of a timing change. Lockdep will correlate potentially deadlock
scenarios even if they don't really occur in the same time, AFAIK.

But let's go back to:

> Although cpu_hotplug_lock is a RWSEM, [a] will not block in there. But as
> lockdep annotations are added for cpu_hotplug_lock, a deadlock warning
> would be detected:

Is it possible that upstream lockdep handles this RWSEM scenario properly
and doesn't report it, but the RHEL kernel is missing some relevant lockdep fix?

>>
>> Thanks!
>> Juri
>>

2022-01-21 18:59:54

by Xin Long

[permalink] [raw]
Subject: Re: [PATCH] mm: slub: fix a deadlock warning in kmem_cache_destroy

On Tue, Jan 18, 2022 at 7:07 PM Vlastimil Babka <[email protected]> wrote:
>
>
> On 1/18/22 09:00, Xin Long wrote:
> > On Mon, Jan 17, 2022 at 9:13 PM Juri Lelli <[email protected]> wrote:
> >> >
> >> > RHEL-8 kernel seems to be 4.18, unless RT uses a newer one. Could be some
> >> > silently relevant backport is missing? How about e.g. 59450bbc12be ("mm,
> >> > slab, slub: stop taking cpu hotplug lock") ?
> >>
> >> Hummm, looks like we have backported commit 59450bbc12be in RHEL-8.
> >>
> >> Xin Long, would you be able to check if you still see the lockdep splat
> >> with latest upstream RT?
> >>
> >> git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git linux-5.16.y-rt
> > Hi, Juri,
> >
> > Thanks for sharing the RT kernel repo.
> >
> > I just tried with this kernel, and I couldn't reproduce it on my env.
> > But I don't see how the upstream RT kernel can avoid the call trace.
> >
> > As this warning was triggered when the system was shutting down, it might
> > not be reproduced on it due to some timing change.
>
> As it was caught by lockdep and not as a real deadlock, I think it should be
> indepenedent of a timing change. Lockdep will correlate potentially deadlock
> scenarios even if they don't really occur in the same time, AFAIK.
>
> But let's go back to:
>
> > Although cpu_hotplug_lock is a RWSEM, [a] will not block in there. But as
> > lockdep annotations are added for cpu_hotplug_lock, a deadlock warning
> > would be detected:
>
> Is it possible that upstream lockdep handles this RWSEM scenario properly
> and doesn't report it, but the RHEL kernel is missing some relevant lockdep fix?
That's a good point.

I actually think:
cpus_read_lock()
cpus_read_lock()
shouldn't be considered as a deadlock.

I will check the lockdep changes, and it may take some time.

Thanks.

>
> >>
> >> Thanks!
> >> Juri
> >>
>

2022-06-01 20:50:30

by Maurizio Lombardi

[permalink] [raw]
Subject: Re: [PATCH] mm: slub: fix a deadlock warning in kmem_cache_destroy

pá 14. 1. 2022 v 15:23 odesílatel Xin Long <[email protected]> napsal:
>
> cpus_read_lock() is introduced into kmem_cache_destroy() by
> commit 5a836bf6b09f ("mm: slub: move flush_cpu_slab() invocations
> __free_slab() invocations out of IRQ context"), and it could cause
> a deadlock.

FYI,

I received a bug report from one of our customers, he complains that
his system (with nvmefc boot from SAN) hangs when rebooting.
He runs a RHEL-9 kernel based on version 5.14.0.

What is interesting is that, according to him, after reverting commit
5a836bf6b09f
("mm: slub: move flush_cpu_slab() invocations __free_slab()
invocations out of IRQ context")
the reboot operation doesn't hang anymore.

The call trace seems to point to a possible problem due to the fact that
nvme_delete_ctrl_work is allocated with the WQ_MEM_RECLAIM bit set.

[ 453.012078] ------------[ cut here ]------------
[ 453.016744] workqueue: WQ_MEM_RECLAIM
nvme-delete-wq:nvme_delete_ctrl_work [nvme_core] is flushing
!WQ_MEM_RECLAIM events:flush_cpu_slab
[ 453.016789] WARNING: CPU: 37 PID: 410 at kernel/workqueue.c:2637
check_flush_dependency+0x10a/0x120
[...]
[ 453.262125] Call Trace:
[ 453.264582] __flush_work.isra.0+0xbf/0x220
[ 453.268775] ? __queue_work+0x1dc/0x420
[ 453.272623] flush_all_cpus_locked+0xfb/0x120
[ 453.276992] __kmem_cache_shutdown+0x2b/0x320
[ 453.281361] kmem_cache_destroy+0x49/0x100
[ 453.285465] bioset_exit+0x143/0x190
[ 453.289052] blk_release_queue+0xb9/0x100
[ 453.293075] kobject_cleanup+0x37/0x130
[ 453.296922] nvme_fc_ctrl_free+0xc6/0x150 [nvme_fc]
[ 453.302397] nvme_free_ctrl+0x1ac/0x2b0 [nvme_core]
[ 453.307818] device_release+0x31/0x90
[ 453.312005] kobject_cleanup+0x37/0x130
[ 453.316369] process_one_work+0x1e5/0x3c0
[ 453.320895] worker_thread+0x50/0x3b0
[ 453.325074] ? rescuer_thread+0x370/0x370
[ 453.329592] kthread+0x146/0x170
[ 453.333322] ? set_kthread_struct+0x40/0x40
[ 453.338027] ret_from_fork+0x1f/0x30
[ 453.342082] ---[ end trace 8c9cdd85adbbfc4f ]---

Maurizio Lombardi