2024-04-07 08:57:39

by Miaohe Lin

Subject: [PATCH] mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled

When I ran a hard offline test with hugetlb pages, the deadlock below occurred:

======================================================
WARNING: possible circular locking dependency detected
6.8.0-11409-gf6cef5f8c37f #1 Not tainted
------------------------------------------------------
bash/46904 is trying to acquire lock:
ffffffffabe68910 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_slow_dec+0x16/0x60

but task is already holding lock:
ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (pcp_batch_high_lock){+.+.}-{3:3}:
__mutex_lock+0x6c/0x770
page_alloc_cpu_online+0x3c/0x70
cpuhp_invoke_callback+0x397/0x5f0
__cpuhp_invoke_callback_range+0x71/0xe0
_cpu_up+0xeb/0x210
cpu_up+0x91/0xe0
cpuhp_bringup_mask+0x49/0xb0
bringup_nonboot_cpus+0xb7/0xe0
smp_init+0x25/0xa0
kernel_init_freeable+0x15f/0x3e0
kernel_init+0x15/0x1b0
ret_from_fork+0x2f/0x50
ret_from_fork_asm+0x1a/0x30

-> #0 (cpu_hotplug_lock){++++}-{0:0}:
__lock_acquire+0x1298/0x1cd0
lock_acquire+0xc0/0x2b0
cpus_read_lock+0x2a/0xc0
static_key_slow_dec+0x16/0x60
__hugetlb_vmemmap_restore_folio+0x1b9/0x200
dissolve_free_huge_page+0x211/0x260
__page_handle_poison+0x45/0xc0
memory_failure+0x65e/0xc70
hard_offline_page_store+0x55/0xa0
kernfs_fop_write_iter+0x12c/0x1d0
vfs_write+0x387/0x550
ksys_write+0x64/0xe0
do_syscall_64+0xca/0x1e0
entry_SYSCALL_64_after_hwframe+0x6d/0x75

other info that might help us debug this:

Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(pcp_batch_high_lock);
                               lock(cpu_hotplug_lock);
                               lock(pcp_batch_high_lock);
  rlock(cpu_hotplug_lock);

*** DEADLOCK ***

5 locks held by bash/46904:
#0: ffff98f6c3bb23f0 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x64/0xe0
#1: ffff98f6c328e488 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xf8/0x1d0
#2: ffff98ef83b31890 (kn->active#113){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x100/0x1d0
#3: ffffffffabf9db48 (mf_mutex){+.+.}-{3:3}, at: memory_failure+0x44/0xc70
#4: ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40

stack backtrace:
CPU: 10 PID: 46904 Comm: bash Kdump: loaded Not tainted 6.8.0-11409-gf6cef5f8c37f #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x68/0xa0
check_noncircular+0x129/0x140
__lock_acquire+0x1298/0x1cd0
lock_acquire+0xc0/0x2b0
cpus_read_lock+0x2a/0xc0
static_key_slow_dec+0x16/0x60
__hugetlb_vmemmap_restore_folio+0x1b9/0x200
dissolve_free_huge_page+0x211/0x260
__page_handle_poison+0x45/0xc0
memory_failure+0x65e/0xc70
hard_offline_page_store+0x55/0xa0
kernfs_fop_write_iter+0x12c/0x1d0
vfs_write+0x387/0x550
ksys_write+0x64/0xe0
do_syscall_64+0xca/0x1e0
entry_SYSCALL_64_after_hwframe+0x6d/0x75
RIP: 0033:0x7fc862314887
Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
RSP: 002b:00007fff19311268 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007fc862314887
RDX: 000000000000000c RSI: 000056405645fe10 RDI: 0000000000000001
RBP: 000056405645fe10 R08: 00007fc8623d1460 R09: 000000007fffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
R13: 00007fc86241b780 R14: 00007fc862417600 R15: 00007fc862416a00

In short, the call chain below takes pcp_batch_high_lock before
cpu_hotplug_lock, the inverse of the dependency established elsewhere:

 memory_failure
  __page_handle_poison
   zone_pcp_disable -- lock(pcp_batch_high_lock)
   dissolve_free_huge_page
    __hugetlb_vmemmap_restore_folio
     static_key_slow_dec
      cpus_read_lock -- rlock(cpu_hotplug_lock)

Fix this by calling drain_all_pages() instead.
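
For reference, the cpu_hotplug_lock in the chain above is taken inside the
jump label code; a simplified sketch of static_key_slow_dec(), modeled
loosely on kernel/jump_label.c (intermediate helpers collapsed, details
elided):

	void static_key_slow_dec(struct static_key *key)
	{
		STATIC_KEY_CHECK_USE(key);

		cpus_read_lock();	/* rlock(cpu_hotplug_lock) */
		static_key_slow_dec_cpuslocked(key);
		cpus_read_unlock();
	}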

Signed-off-by: Miaohe Lin <[email protected]>
---
mm/memory-failure.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index edd6e114462f..88359a185c5f 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -153,11 +153,17 @@ static int __page_handle_poison(struct page *page)
 {
 	int ret;
 
-	zone_pcp_disable(page_zone(page));
+	/*
+	 * zone_pcp_disable() can't be used here. It would hold pcp_batch_high_lock,
+	 * while dissolve_free_huge_page() might take cpu_hotplug_lock via
+	 * static_key_slow_dec() when hugetlb vmemmap optimization is enabled.
+	 * That inverts the established lock ordering and leads to deadlock.
+	 */
 	ret = dissolve_free_huge_page(page);
-	if (!ret)
+	if (!ret) {
+		drain_all_pages(page_zone(page));
 		ret = take_page_off_buddy(page);
-	zone_pcp_enable(page_zone(page));
+	}
 
 	return ret;
 }
--
2.33.0



2024-04-08 19:30:08

by Andrew Morton

Subject: Re: [PATCH] mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled

On Sun, 7 Apr 2024 16:54:56 +0800 Miaohe Lin <[email protected]> wrote:

> When I ran a hard offline test with hugetlb pages, the deadlock below occurred:
>
> ...
>
> Fix this by calling drain_all_pages() instead.

Thanks. I propose

Fixes: 510d25c92ec4a ("mm/hwpoison: disable pcp for page_handle_poison()")
Cc: <[email protected]>


2024-04-09 01:55:57

by Miaohe Lin

Subject: Re: [PATCH] mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled

On 2024/4/9 3:29, Andrew Morton wrote:
> On Sun, 7 Apr 2024 16:54:56 +0800 Miaohe Lin <[email protected]> wrote:
>
>> When I ran a hard offline test with hugetlb pages, the deadlock below occurred:
>>
>> ...
>>
>> Fix this by calling drain_all_pages() instead.
>
> Thanks. I propose
>
> Fixes: 510d25c92ec4a ("mm/hwpoison: disable pcp for page_handle_poison()")

IMHO this issue cannot occur before commit a6b40850c442 ("mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key"),
as that commit introduced rlock(cpu_hotplug_lock) in the dissolve_free_huge_page() code path while lock(pcp_batch_high_lock)
is already held in __page_handle_poison(). So it might be more appropriate to use:

Fixes: a6b40850c442 ("mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key")

> Cc: <[email protected]>

Thanks for doing this.



2024-04-09 14:10:46

by Oscar Salvador

Subject: Re: [PATCH] mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled

On Sun, Apr 07, 2024 at 04:54:56PM +0800, Miaohe Lin wrote:
> In short, the call chain below takes pcp_batch_high_lock before
> cpu_hotplug_lock, the inverse of the dependency established elsewhere:
>
>  memory_failure
>   __page_handle_poison
>    zone_pcp_disable -- lock(pcp_batch_high_lock)
>    dissolve_free_huge_page
>     __hugetlb_vmemmap_restore_folio
>      static_key_slow_dec
>       cpus_read_lock -- rlock(cpu_hotplug_lock)
>
> Fix this by calling drain_all_pages() instead.
>
> Signed-off-by: Miaohe Lin <[email protected]>

Acked-by: Oscar Salvador <[email protected]>

Thanks!


--
Oscar Salvador
SUSE Labs

2024-04-09 16:10:32

by Oscar Salvador

Subject: Re: [PATCH] mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled

On Tue, Apr 09, 2024 at 04:10:22PM +0200, Oscar Salvador wrote:
> On Sun, Apr 07, 2024 at 04:54:56PM +0800, Miaohe Lin wrote:
> > In short, the call chain below takes pcp_batch_high_lock before
> > cpu_hotplug_lock, the inverse of the dependency established elsewhere:
> >
> >  memory_failure
> >   __page_handle_poison
> >    zone_pcp_disable -- lock(pcp_batch_high_lock)
> >    dissolve_free_huge_page
> >     __hugetlb_vmemmap_restore_folio
> >      static_key_slow_dec
> >       cpus_read_lock -- rlock(cpu_hotplug_lock)
> >
> > Fix this by calling drain_all_pages() instead.
> >
> > Signed-off-by: Miaohe Lin <[email protected]>
>
> Acked-by: Oscar Salvador <[email protected]>

On second thought,

disabling pcp via zone_pcp_disable() was a deterministic approach.
Now, with drain_all_pages() we drain the PCP queues to buddy, but nothing
guarantees that those pages do not end up in a PCP queue again, via a
refill, before the call to take_page_off_buddy(), right?

I guess we can live with that because we will let the system know that we
failed to isolate that page.
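
A rough sketch of the window being described (hypothetical interleaving;
rmqueue_bulk() is the PCP refill path in mm/page_alloc.c):

	CPU A (hard offline)                CPU B (any allocation)
	--------------------                ----------------------
	dissolve_free_huge_page(page)
	drain_all_pages(zone)
	  /* poisoned page now in buddy */
	                                    PCP list empty; rmqueue_bulk()
	                                    refills it and may pick the
	                                    poisoned page back up
	take_page_off_buddy(page)
	  /* fails: page sits in a PCP list again */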


--
Oscar Salvador
SUSE Labs

2024-04-10 07:52:29

by Miaohe Lin

Subject: Re: [PATCH] mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled

On 2024/4/10 0:10, Oscar Salvador wrote:
> On Tue, Apr 09, 2024 at 04:10:22PM +0200, Oscar Salvador wrote:
>> On Sun, Apr 07, 2024 at 04:54:56PM +0800, Miaohe Lin wrote:
>>> In short, the call chain below takes pcp_batch_high_lock before
>>> cpu_hotplug_lock, the inverse of the dependency established elsewhere:
>>>
>>>  memory_failure
>>>   __page_handle_poison
>>>    zone_pcp_disable -- lock(pcp_batch_high_lock)
>>>    dissolve_free_huge_page
>>>     __hugetlb_vmemmap_restore_folio
>>>      static_key_slow_dec
>>>       cpus_read_lock -- rlock(cpu_hotplug_lock)
>>>
>>> Fix this by calling drain_all_pages() instead.
>>>
>>> Signed-off-by: Miaohe Lin <[email protected]>
>>
>> Acked-by: Oscar Salvador <[email protected]>

Thanks.

>
> On second thought,
>
> disabling pcp via zone_pcp_disable() was a deterministic approach.
> Now, with drain_all_pages() we drain the PCP queues to buddy, but nothing
> guarantees that those pages do not end up in a PCP queue again, via a
> refill, before the call to take_page_off_buddy(), right?

AFAICS, if the check_pages_enabled static key is enabled and we are in hard
offline mode, check_new_pages() will prevent those pages from ending up in a
PCP queue again when the PCP list is refilled, because PageHWPoison pages
are treated as 'bad' pages and skipped during the refill.
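
A minimal sketch of that check, modeled on mm/page_alloc.c (helper bodies
elided; treat the details as an approximation):

	static inline bool check_new_pages(struct page *page, unsigned int order)
	{
		if (static_branch_unlikely(&check_pages_enabled)) {
			for (int i = 0; i < (1 << order); i++) {
				struct page *p = page + i;

				/* PageHWPoison counts as a 'bad' page here */
				if (check_new_page(p))
					return true;	/* refill skips it */
			}
		}
		return false;
	}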

>
> I guess we can live with that because we will let the system know that we
> failed to isolate that page.

We're trying our best to isolate that page anyway. :)

Thanks for your thought.


2024-04-10 08:53:06

by Oscar Salvador

Subject: Re: [PATCH] mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled

On Wed, Apr 10, 2024 at 03:52:14PM +0800, Miaohe Lin wrote:
> AFAICS, if the check_pages_enabled static key is enabled and we are in hard
> offline mode, check_new_pages() will prevent those pages from ending up in a
> PCP queue again when the PCP list is refilled, because PageHWPoison pages
> are treated as 'bad' pages and skipped during the refill.

Yes, but the check_pages_enabled static key is only enabled when
either CONFIG_DEBUG_PAGEALLOC or CONFIG_DEBUG_VM is set, which means
that on most systems that protection will not take place.
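
For reference, a rough sketch of that gating as I recall it from
mm/page_alloc.c (the exact form is an assumption):

	/* Compiled in, and default-on, only with CONFIG_DEBUG_VM ... */
	DEFINE_STATIC_KEY_MAYBE(CONFIG_DEBUG_VM, check_pages_enabled);

	/* ... and also flipped on at boot when debug_pagealloc is active: */
	if (debug_pagealloc_enabled())
		static_branch_enable(&check_pages_enabled);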

Which takes me back to a problem we had in the past, where we were
happily handing over hwpoisoned pages from PCP lists.
For soft-offline mode, we worked hard to stop doing that because
soft-offline is a non-disruptive operation and no one should get
killed. hard-offline is another story, but I still think that
extending the comment to include the following would be a good idea:

"Disabling pcp before dissolving the page was a deterministic approach
because we made sure that those pages cannot end up in any PCP list.
Draining PCP lists expels those pages to the buddy system, but nothing
guarantees that those pages do not get back to a PCP queue if we need
to refill those."

Just to remind ourselves of the dangers of a non-deterministic
approach.


Thanks


--
Oscar Salvador
SUSE Labs

2024-04-11 02:26:57

by Miaohe Lin

Subject: Re: [PATCH] mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled

On 2024/4/10 16:52, Oscar Salvador wrote:
> On Wed, Apr 10, 2024 at 03:52:14PM +0800, Miaohe Lin wrote:
>> AFAICS, if the check_pages_enabled static key is enabled and we are in hard
>> offline mode, check_new_pages() will prevent those pages from ending up in a
>> PCP queue again when the PCP list is refilled, because PageHWPoison pages
>> are treated as 'bad' pages and skipped during the refill.
>
> Yes, but the check_pages_enabled static key is only enabled when
> either CONFIG_DEBUG_PAGEALLOC or CONFIG_DEBUG_VM is set, which means
> that on most systems that protection will not take place.
>
> Which takes me back to a problem we had in the past, where we were
> happily handing over hwpoisoned pages from PCP lists.
> For soft-offline mode, we worked hard to stop doing that because
> soft-offline is a non-disruptive operation and no one should get
> killed. hard-offline is another story, but I still think that
> extending the comment to include the following would be a good idea:
>
> "Disabling pcp before dissolving the page was a deterministic approach
> because we made sure that those pages cannot end up in any PCP list.
> Draining PCP lists expels those pages to the buddy system, but nothing
> guarantees that those pages do not get back to a PCP queue if we need
> to refill those."

This really helps. Will add it in v2.
Thanks Oscar.

>
> Just to remind ourselves of the dangers of a non-deterministic
> approach.