2024-04-18 02:22:42

by Miaohe Lin

[permalink] [raw]
Subject: [PATCH 1/2] mm/hugetlb: fix DEBUG_LOCKS_WARN_ON(1) when dissolve_free_hugetlb_folio()

When I did memory failure tests recently, below warning occurs:

DEBUG_LOCKS_WARN_ON(1)
WARNING: CPU: 8 PID: 1011 at kernel/locking/lockdep.c:232 __lock_acquire+0xccb/0x1ca0
Modules linked in: mce_inject hwpoison_inject
CPU: 8 PID: 1011 Comm: bash Kdump: loaded Not tainted 6.9.0-rc3-next-20240410-00012-gdb69f219f4be #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
RIP: 0010:__lock_acquire+0xccb/0x1ca0
RSP: 0018:ffffa7a1c7fe3bd0 EFLAGS: 00000082
RAX: 0000000000000000 RBX: eb851eb853975fcf RCX: ffffa1ce5fc1c9c8
RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffffa1ce5fc1c9c0
RBP: ffffa1c6865d3280 R08: ffffffffb0f570a8 R09: 0000000000009ffb
R10: 0000000000000286 R11: ffffffffb0f2ad50 R12: ffffa1c6865d3d10
R13: ffffa1c6865d3c70 R14: 0000000000000000 R15: 0000000000000004
FS: 00007ff9f32aa740(0000) GS:ffffa1ce5fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ff9f3134ba0 CR3: 00000008484e4000 CR4: 00000000000006f0
Call Trace:
<TASK>
lock_acquire+0xbe/0x2d0
_raw_spin_lock_irqsave+0x3a/0x60
hugepage_subpool_put_pages.part.0+0xe/0xc0
free_huge_folio+0x253/0x3f0
dissolve_free_huge_page+0x147/0x210
__page_handle_poison+0x9/0x70
memory_failure+0x4e6/0x8c0
hard_offline_page_store+0x55/0xa0
kernfs_fop_write_iter+0x12c/0x1d0
vfs_write+0x380/0x540
ksys_write+0x64/0xe0
do_syscall_64+0xbc/0x1d0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7ff9f3114887
RSP: 002b:00007ffecbacb458 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007ff9f3114887
RDX: 000000000000000c RSI: 0000564494164e10 RDI: 0000000000000001
RBP: 0000564494164e10 R08: 00007ff9f31d1460 R09: 000000007fffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
R13: 00007ff9f321b780 R14: 00007ff9f3217600 R15: 00007ff9f3216a00
</TASK>
Kernel panic - not syncing: kernel: panic_on_warn set ...
CPU: 8 PID: 1011 Comm: bash Kdump: loaded Not tainted 6.9.0-rc3-next-20240410-00012-gdb69f219f4be #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
panic+0x326/0x350
check_panic_on_warn+0x4f/0x50
__warn+0x98/0x190
report_bug+0x18e/0x1a0
handle_bug+0x3d/0x70
exc_invalid_op+0x18/0x70
asm_exc_invalid_op+0x1a/0x20
RIP: 0010:__lock_acquire+0xccb/0x1ca0
RSP: 0018:ffffa7a1c7fe3bd0 EFLAGS: 00000082
RAX: 0000000000000000 RBX: eb851eb853975fcf RCX: ffffa1ce5fc1c9c8
RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffffa1ce5fc1c9c0
RBP: ffffa1c6865d3280 R08: ffffffffb0f570a8 R09: 0000000000009ffb
R10: 0000000000000286 R11: ffffffffb0f2ad50 R12: ffffa1c6865d3d10
R13: ffffa1c6865d3c70 R14: 0000000000000000 R15: 0000000000000004
lock_acquire+0xbe/0x2d0
_raw_spin_lock_irqsave+0x3a/0x60
hugepage_subpool_put_pages.part.0+0xe/0xc0
free_huge_folio+0x253/0x3f0
dissolve_free_huge_page+0x147/0x210
__page_handle_poison+0x9/0x70
memory_failure+0x4e6/0x8c0
hard_offline_page_store+0x55/0xa0
kernfs_fop_write_iter+0x12c/0x1d0
vfs_write+0x380/0x540
ksys_write+0x64/0xe0
do_syscall_64+0xbc/0x1d0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7ff9f3114887
RSP: 002b:00007ffecbacb458 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007ff9f3114887
RDX: 000000000000000c RSI: 0000564494164e10 RDI: 0000000000000001
RBP: 0000564494164e10 R08: 00007ff9f31d1460 R09: 000000007fffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
R13: 00007ff9f321b780 R14: 00007ff9f3217600 R15: 00007ff9f3216a00
</TASK>

After git bisecting and digging into the code, I believe the root cause
is that _deferred_list field of folio is unioned with _hugetlb_subpool
field. In __update_and_free_hugetlb_folio(), folio->_deferred_list is
always initialized leading to corrupted folio->_hugetlb_subpool when
folio is hugetlb. Later free_huge_folio() will use _hugetlb_subpool
and above warning happens. Fix this by initialise folio->_deferred_list
iff folio is not hugetlb.

Fixes: b6952b6272dd ("mm: always initialise folio->_deferred_list")
CC: [email protected]
Signed-off-by: Miaohe Lin <[email protected]>
---
mm/hugetlb.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 26ab9dfc7d63..1da9a14a5513 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1788,7 +1788,8 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
destroy_compound_gigantic_folio(folio, huge_page_order(h));
free_gigantic_folio(folio, huge_page_order(h));
} else {
- INIT_LIST_HEAD(&folio->_deferred_list);
+ if (!folio_test_hugetlb(folio))
+ INIT_LIST_HEAD(&folio->_deferred_list);
folio_put(folio);
}
}
--
2.33.0



2024-04-18 04:06:29

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm/hugetlb: fix DEBUG_LOCKS_WARN_ON(1) when dissolve_free_hugetlb_folio()

On Thu, Apr 18, 2024 at 10:19:59AM +0800, Miaohe Lin wrote:
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 26ab9dfc7d63..1da9a14a5513 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1788,7 +1788,8 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
> destroy_compound_gigantic_folio(folio, huge_page_order(h));
> free_gigantic_folio(folio, huge_page_order(h));
> } else {
> - INIT_LIST_HEAD(&folio->_deferred_list);
> + if (!folio_test_hugetlb(folio))
> + INIT_LIST_HEAD(&folio->_deferred_list);

Ok, it took me a bit to figure this out.

So we basically init __deferred_list when we know that
folio_put will not end up calling free_huge_folio
because a previous call to remove_hugetlb_folio has already cleared the
bit.

Maybe Matthew thought that any folio ending here would not end up in
free_huge_folio (which is the one fiddling subpool).

I mean, fix looks good because if hugetlb flag is cleared,
destroy_large_folio will go straight to free_the_page, but the
whole thing is a bit subtle.

And if we decide to go with this, I think we are going to need a comment
in there explaining what is going on like "only init _deferred_list if
free_huge_folio cannot be call".


--
Oscar Salvador
SUSE Labs

2024-04-18 08:01:20

by Miaohe Lin

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm/hugetlb: fix DEBUG_LOCKS_WARN_ON(1) when dissolve_free_hugetlb_folio()

On 2024/4/18 12:05, Oscar Salvador wrote:
> On Thu, Apr 18, 2024 at 10:19:59AM +0800, Miaohe Lin wrote:
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 26ab9dfc7d63..1da9a14a5513 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -1788,7 +1788,8 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
>> destroy_compound_gigantic_folio(folio, huge_page_order(h));
>> free_gigantic_folio(folio, huge_page_order(h));
>> } else {
>> - INIT_LIST_HEAD(&folio->_deferred_list);
>> + if (!folio_test_hugetlb(folio))
>> + INIT_LIST_HEAD(&folio->_deferred_list);
>
> Ok, it took me a bit to figure this out.
>
> So we basically init __deferred_list when we know that
> folio_put will not end up calling free_huge_folio
> because a previous call to remove_hugetlb_folio has already cleared the
> bit.
>
> Maybe Matthew thought that any folio ending here would not end up in
> free_huge_folio (which is the one fiddling subpool).
>
> I mean, fix looks good because if hugetlb flag is cleared,
> destroy_large_folio will go straight to free_the_page, but the
> whole thing is a bit subtle.

AFAICS, this is the most straightforward way to fix the issue. Do you have any suggestions
on how to fix this in a more graceful way?

>
> And if we decide to go with this, I think we are going to need a comment
> in there explaining what is going on like "only init _deferred_list if
> free_huge_folio cannot be call".

Yes, this comment will help.
Thanks.
.

>
>


2024-04-18 12:41:37

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm/hugetlb: fix DEBUG_LOCKS_WARN_ON(1) when dissolve_free_hugetlb_folio()

On Thu, Apr 18, 2024 at 04:00:42PM +0800, Miaohe Lin wrote:
> On 2024/4/18 12:05, Oscar Salvador wrote:
> > On Thu, Apr 18, 2024 at 10:19:59AM +0800, Miaohe Lin wrote:
> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> index 26ab9dfc7d63..1da9a14a5513 100644
> >> --- a/mm/hugetlb.c
> >> +++ b/mm/hugetlb.c
> >> @@ -1788,7 +1788,8 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
> >> destroy_compound_gigantic_folio(folio, huge_page_order(h));
> >> free_gigantic_folio(folio, huge_page_order(h));
> >> } else {
> >> - INIT_LIST_HEAD(&folio->_deferred_list);
> >> + if (!folio_test_hugetlb(folio))
> >> + INIT_LIST_HEAD(&folio->_deferred_list);
> >
> > Ok, it took me a bit to figure this out.
> >
> > So we basically init __deferred_list when we know that
> > folio_put will not end up calling free_huge_folio
> > because a previous call to remove_hugetlb_folio has already cleared the
> > bit.
> >
> > Maybe Matthew thought that any folio ending here would not end up in
> > free_huge_folio (which is the one fiddling subpool).
> >
> > I mean, fix looks good because if hugetlb flag is cleared,
> > destroy_large_folio will go straight to free_the_page, but the
> > whole thing is a bit subtle.
>
> AFAICS, this is the most straightforward way to fix the issue. Do you have any suggestions
> on how to fix this in a more graceful way?

Not from the top of my head.
Anyway, I have been thinking for a while that this code needs some love,
so I will check how this can be untangled.


--
Oscar Salvador
SUSE Labs

2024-04-19 02:19:09

by Miaohe Lin

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm/hugetlb: fix DEBUG_LOCKS_WARN_ON(1) when dissolve_free_hugetlb_folio()

On 2024/4/18 20:41, Oscar Salvador wrote:
> On Thu, Apr 18, 2024 at 04:00:42PM +0800, Miaohe Lin wrote:
>> On 2024/4/18 12:05, Oscar Salvador wrote:
>>> On Thu, Apr 18, 2024 at 10:19:59AM +0800, Miaohe Lin wrote:
>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>> index 26ab9dfc7d63..1da9a14a5513 100644
>>>> --- a/mm/hugetlb.c
>>>> +++ b/mm/hugetlb.c
>>>> @@ -1788,7 +1788,8 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
>>>> destroy_compound_gigantic_folio(folio, huge_page_order(h));
>>>> free_gigantic_folio(folio, huge_page_order(h));
>>>> } else {
>>>> - INIT_LIST_HEAD(&folio->_deferred_list);
>>>> + if (!folio_test_hugetlb(folio))
>>>> + INIT_LIST_HEAD(&folio->_deferred_list);
>>>
>>> Ok, it took me a bit to figure this out.
>>>
>>> So we basically init __deferred_list when we know that
>>> folio_put will not end up calling free_huge_folio
>>> because a previous call to remove_hugetlb_folio has already cleared the
>>> bit.
>>>
>>> Maybe Matthew thought that any folio ending here would not end up in
>>> free_huge_folio (which is the one fiddling subpool).
>>>
>>> I mean, fix looks good because if hugetlb flag is cleared,
>>> destroy_large_folio will go straight to free_the_page, but the
>>> whole thing is a bit subtle.
>>
>> AFAICS, this is the most straightforward way to fix the issue. Do you have any suggestions
>> on how to fix this in a more graceful way?
>
> Not from the top of my head.
> Anyway, I have been thinking for a while that this code needs some love,
> so I will check how this can be untangled.

That would be really nice. Thanks Oscar.
.

>
>