2024-01-24 08:41:05

by Miaohe Lin

Subject: [PATCH v2] mm/memory-failure: fix crash in split_huge_page_to_list from soft_offline_page

While running a soft offline stress test, a machine was observed to crash
with the following message:

kernel BUG at include/linux/memcontrol.h:554!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 5 PID: 3837 Comm: hwpoison.sh Not tainted 6.7.0-next-20240112-00001-g8ecf3e7fb7c8-dirty #97
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
RIP: 0010:folio_memcg+0xaf/0xd0
Code: 10 5b 5d c3 cc cc cc cc 48 c7 c6 08 b1 f2 b2 48 89 ef e8 b4 c5 f8 ff 90 0f 0b 48 c7 c6 d0 b0 f2 b2 48 89 ef e8 a2 c5 f8 ff 90 <0f> 0b 48 c7 c6 08 b1 f2 b2 48 89 ef e8 90 c5 f8 ff 90 0f 0b 66 66
RSP: 0018:ffffb6c043657c98 EFLAGS: 00000296
RAX: 000000000000004b RBX: ffff932bc1d1e401 RCX: ffff933abfb5c908
RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff933abfb5c900
RBP: ffffea6f04019080 R08: ffffffffb3338ce8 R09: 0000000000009ffb
R10: 00000000000004dd R11: ffffffffb3308d00 R12: ffffea6f04019080
R13: ffffea6f04019080 R14: 0000000000000001 R15: ffffb6c043657da0
FS: 00007f6c60f6b740(0000) GS:ffff933abfb40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000559c3bc8b980 CR3: 0000000107f1c000 CR4: 00000000000006f0
Call Trace:
<TASK>
? die+0x32/0x90
? do_trap+0xde/0x110
? folio_memcg+0xaf/0xd0
? do_error_trap+0x60/0x80
? folio_memcg+0xaf/0xd0
? exc_invalid_op+0x53/0x70
? folio_memcg+0xaf/0xd0
? asm_exc_invalid_op+0x1a/0x20
? folio_memcg+0xaf/0xd0
? folio_memcg+0xae/0xd0
split_huge_page_to_list+0x4d/0x1380
? sysvec_apic_timer_interrupt+0xf/0x80
try_to_split_thp_page+0x3a/0xf0
soft_offline_page+0x1ea/0x8a0
soft_offline_page_store+0x52/0x90
kernfs_fop_write_iter+0x118/0x1b0
vfs_write+0x30b/0x430
ksys_write+0x5e/0xe0
do_syscall_64+0xb0/0x1b0
entry_SYSCALL_64_after_hwframe+0x6d/0x75
RIP: 0033:0x7f6c60d14697
Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
RSP: 002b:00007ffe9b72b8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f6c60d14697
RDX: 000000000000000c RSI: 0000559c3bc8b980 RDI: 0000000000000001
RBP: 0000559c3bc8b980 R08: 00007f6c60dd1460 R09: 000000007fffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
R13: 00007f6c60e1a780 R14: 00007f6c60e16600 R15: 00007f6c60e15a00

The problem is that page->mapping now overlaps the slab->slab_list or
slab->slabs field, so a slab page can be mistaken for a non-LRU movable
page when the slabs field happens to contain PAGE_MAPPING_MOVABLE or when
slab_list->prev is set to LIST_POISON2. Such slab pages are later treated
as THP, which leads to the crash in split_huge_page_to_list().

Signed-off-by: Miaohe Lin <[email protected]>
Fixes: 130d4df57390 ("mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head")
---
v2:
Check PageSlab() first and leave the rest of the code alone, per Matthew.
---
mm/memory-failure.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 636280d04008..9349948f1abf 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1377,6 +1377,9 @@ void ClearPageHWPoisonTakenOff(struct page *page)
  */
 static inline bool HWPoisonHandlable(struct page *page, unsigned long flags)
 {
+	if (PageSlab(page))
+		return false;
+
 	/* Soft offline could migrate non-LRU movable pages */
 	if ((flags & MF_SOFT_OFFLINE) && __PageMovable(page))
 		return true;
--
2.33.0
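
For readers following the thread, the misclassification described in the
changelog can be sketched in user space. This is only an illustration, not
kernel code: struct toy_page, the toy_* helpers and TOY_MF_SOFT_OFFLINE are
hypothetical names, the low-bit constants mirror the kernel's
PAGE_MAPPING_* definitions, and LIST_POISON2_LOW is the 0x122 part of
LIST_POISON2 without the architecture-specific poison delta. The remaining
checks in the real HWPoisonHandlable() are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Low bits of page->mapping that encode the mapping type. */
#define PAGE_MAPPING_ANON    0x1UL
#define PAGE_MAPPING_MOVABLE 0x2UL
#define PAGE_MAPPING_FLAGS   (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)

/* Low part of LIST_POISON2; the arch-specific delta is omitted here. */
#define LIST_POISON2_LOW     0x122UL

/* Hypothetical flag value, used only by this sketch. */
#define TOY_MF_SOFT_OFFLINE  (1UL << 3)

/* Toy stand-in for struct page: one word shared by mapping and slab data. */
struct toy_page {
	bool is_slab;       /* stands in for PageSlab() */
	uintptr_t mapping;  /* overlaps slab_list.prev or slabs on slab pages */
};

/* Same low-bit test that __PageMovable() applies to page->mapping. */
bool toy_page_movable(const struct toy_page *page)
{
	return (page->mapping & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_MOVABLE;
}

/* Before the fix: the shared word is trusted even for slab pages. */
bool toy_handlable_before(const struct toy_page *page, unsigned long flags)
{
	if ((flags & TOY_MF_SOFT_OFFLINE) && toy_page_movable(page))
		return true;
	return false; /* LRU and free-page checks omitted in this sketch */
}

/* After the fix: slab pages are rejected before the word is inspected. */
bool toy_handlable_after(const struct toy_page *page, unsigned long flags)
{
	if (page->is_slab)
		return false;
	return toy_handlable_before(page, flags);
}
```

With mapping set to LIST_POISON2_LOW on a slab page, the low two bits are
0x2, so toy_handlable_before() accepts the page while toy_handlable_after()
rejects it, which is the behaviour change the patch is after.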



2024-01-24 17:32:16

by Matthew Wilcox

Subject: Re: [PATCH v2] mm/memory-failure: fix crash in split_huge_page_to_list from soft_offline_page

On Wed, Jan 24, 2024 at 04:40:14PM +0800, Miaohe Lin wrote:
> When I did soft offline stress test, a machine was observed to crash with
> the following message:
>
> kernel BUG at include/linux/memcontrol.h:554!
> invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> CPU: 5 PID: 3837 Comm: hwpoison.sh Not tainted 6.7.0-next-20240112-00001-g8ecf3e7fb7c8-dirty #97
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> RIP: 0010:folio_memcg+0xaf/0xd0
> Code: 10 5b 5d c3 cc cc cc cc 48 c7 c6 08 b1 f2 b2 48 89 ef e8 b4 c5 f8 ff 90 0f 0b 48 c7 c6 d0 b0 f2 b2 48 89 ef e8 a2 c5 f8 ff 90 <0f> 0b 48 c7 c6 08 b1 f2 b2 48 89 ef e8 90 c5 f8 ff 90 0f 0b 66 66
> RSP: 0018:ffffb6c043657c98 EFLAGS: 00000296
> RAX: 000000000000004b RBX: ffff932bc1d1e401 RCX: ffff933abfb5c908
> RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff933abfb5c900
> RBP: ffffea6f04019080 R08: ffffffffb3338ce8 R09: 0000000000009ffb
> R10: 00000000000004dd R11: ffffffffb3308d00 R12: ffffea6f04019080
> R13: ffffea6f04019080 R14: 0000000000000001 R15: ffffb6c043657da0
> FS: 00007f6c60f6b740(0000) GS:ffff933abfb40000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000559c3bc8b980 CR3: 0000000107f1c000 CR4: 00000000000006f0
> Call Trace:
> <TASK>
> ? die+0x32/0x90
> ? do_trap+0xde/0x110
> ? folio_memcg+0xaf/0xd0
> ? do_error_trap+0x60/0x80
> ? folio_memcg+0xaf/0xd0
> ? exc_invalid_op+0x53/0x70
> ? folio_memcg+0xaf/0xd0
> ? asm_exc_invalid_op+0x1a/0x20
> ? folio_memcg+0xaf/0xd0
> ? folio_memcg+0xae/0xd0

I might trim these ? lines out of the backtrace ...

> split_huge_page_to_list+0x4d/0x1380
> ? sysvec_apic_timer_interrupt+0xf/0x80
> try_to_split_thp_page+0x3a/0xf0
> soft_offline_page+0x1ea/0x8a0
> soft_offline_page_store+0x52/0x90
> kernfs_fop_write_iter+0x118/0x1b0
> vfs_write+0x30b/0x430
> ksys_write+0x5e/0xe0
> do_syscall_64+0xb0/0x1b0
> entry_SYSCALL_64_after_hwframe+0x6d/0x75
> RIP: 0033:0x7f6c60d14697
> Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
> RSP: 002b:00007ffe9b72b8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f6c60d14697
> RDX: 000000000000000c RSI: 0000559c3bc8b980 RDI: 0000000000000001
> RBP: 0000559c3bc8b980 R08: 00007f6c60dd1460 R09: 000000007fffffff
> R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
> R13: 00007f6c60e1a780 R14: 00007f6c60e16600 R15: 00007f6c60e15a00
>
> The problem is that page->mapping is overloaded with slab->slab_list or
> slabs fields now, so slab pages could be taken as non-LRU movable pages
> if field slabs contains PAGE_MAPPING_MOVABLE or slab_list->prev is set
> to LIST_POISON2. These slab pages will be treated as thp later leading
> to crash in split_huge_page_to_list().
>
> Signed-off-by: Miaohe Lin <[email protected]>
> Fixes: 130d4df57390 ("mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head")

Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>

2024-01-25 12:22:31

by Miaohe Lin

Subject: Re: [PATCH v2] mm/memory-failure: fix crash in split_huge_page_to_list from soft_offline_page

On 2024/1/24 21:15, Matthew Wilcox wrote:
> On Wed, Jan 24, 2024 at 04:40:14PM +0800, Miaohe Lin wrote:
>> When I did soft offline stress test, a machine was observed to crash with
>> the following message:
>>
>> kernel BUG at include/linux/memcontrol.h:554!
>> invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>> CPU: 5 PID: 3837 Comm: hwpoison.sh Not tainted 6.7.0-next-20240112-00001-g8ecf3e7fb7c8-dirty #97
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
>> RIP: 0010:folio_memcg+0xaf/0xd0
>> Code: 10 5b 5d c3 cc cc cc cc 48 c7 c6 08 b1 f2 b2 48 89 ef e8 b4 c5 f8 ff 90 0f 0b 48 c7 c6 d0 b0 f2 b2 48 89 ef e8 a2 c5 f8 ff 90 <0f> 0b 48 c7 c6 08 b1 f2 b2 48 89 ef e8 90 c5 f8 ff 90 0f 0b 66 66
>> RSP: 0018:ffffb6c043657c98 EFLAGS: 00000296
>> RAX: 000000000000004b RBX: ffff932bc1d1e401 RCX: ffff933abfb5c908
>> RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff933abfb5c900
>> RBP: ffffea6f04019080 R08: ffffffffb3338ce8 R09: 0000000000009ffb
>> R10: 00000000000004dd R11: ffffffffb3308d00 R12: ffffea6f04019080
>> R13: ffffea6f04019080 R14: 0000000000000001 R15: ffffb6c043657da0
>> FS: 00007f6c60f6b740(0000) GS:ffff933abfb40000(0000) knlGS:0000000000000000
>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 0000559c3bc8b980 CR3: 0000000107f1c000 CR4: 00000000000006f0
>> Call Trace:
>> <TASK>
>> ? die+0x32/0x90
>> ? do_trap+0xde/0x110
>> ? folio_memcg+0xaf/0xd0
>> ? do_error_trap+0x60/0x80
>> ? folio_memcg+0xaf/0xd0
>> ? exc_invalid_op+0x53/0x70
>> ? folio_memcg+0xaf/0xd0
>> ? asm_exc_invalid_op+0x1a/0x20
>> ? folio_memcg+0xaf/0xd0
>> ? folio_memcg+0xae/0xd0
>
> I might trim these ? lines out of the backtrace ...

Do you mean making the backtrace look something like the one below?

Call Trace:
<TASK>
split_huge_page_to_list+0x4d/0x1380
? sysvec_apic_timer_interrupt+0xf/0x80
try_to_split_thp_page+0x3a/0xf0
soft_offline_page+0x1ea/0x8a0
soft_offline_page_store+0x52/0x90
kernfs_fop_write_iter+0x118/0x1b0
vfs_write+0x30b/0x430
ksys_write+0x5e/0xe0
do_syscall_64+0xb0/0x1b0
entry_SYSCALL_64_after_hwframe+0x6d/0x75
RIP: 0033:0x7f6c60d14697

>
>> split_huge_page_to_list+0x4d/0x1380
>> ? sysvec_apic_timer_interrupt+0xf/0x80
>> try_to_split_thp_page+0x3a/0xf0
>> soft_offline_page+0x1ea/0x8a0
>> soft_offline_page_store+0x52/0x90
>> kernfs_fop_write_iter+0x118/0x1b0
>> vfs_write+0x30b/0x430
>> ksys_write+0x5e/0xe0
>> do_syscall_64+0xb0/0x1b0
>> entry_SYSCALL_64_after_hwframe+0x6d/0x75
>> RIP: 0033:0x7f6c60d14697
>> Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
>> RSP: 002b:00007ffe9b72b8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>> RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f6c60d14697
>> RDX: 000000000000000c RSI: 0000559c3bc8b980 RDI: 0000000000000001
>> RBP: 0000559c3bc8b980 R08: 00007f6c60dd1460 R09: 000000007fffffff
>> R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
>> R13: 00007f6c60e1a780 R14: 00007f6c60e16600 R15: 00007f6c60e15a00
>>
>> The problem is that page->mapping is overloaded with slab->slab_list or
>> slabs fields now, so slab pages could be taken as non-LRU movable pages
>> if field slabs contains PAGE_MAPPING_MOVABLE or slab_list->prev is set
>> to LIST_POISON2. These slab pages will be treated as thp later leading
>> to crash in split_huge_page_to_list().
>>
>> Signed-off-by: Miaohe Lin <[email protected]>
>> Fixes: 130d4df57390 ("mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head")
>
> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>

Many thanks for your review.


2024-01-25 14:22:53

by Matthew Wilcox

Subject: Re: [PATCH v2] mm/memory-failure: fix crash in split_huge_page_to_list from soft_offline_page

On Thu, Jan 25, 2024 at 07:53:25PM +0800, Miaohe Lin wrote:
> On 2024/1/24 21:15, Matthew Wilcox wrote:
> >> Call Trace:
> >> <TASK>
> >> ? die+0x32/0x90
> >> ? do_trap+0xde/0x110
> >> ? folio_memcg+0xaf/0xd0
> >> ? do_error_trap+0x60/0x80
> >> ? folio_memcg+0xaf/0xd0
> >> ? exc_invalid_op+0x53/0x70
> >> ? folio_memcg+0xaf/0xd0
> >> ? asm_exc_invalid_op+0x1a/0x20
> >> ? folio_memcg+0xaf/0xd0
> >> ? folio_memcg+0xae/0xd0
> >
> > I might trim these ? lines out of the backtrace ...
>
> Do you mean make backtrace looks like something below?
>
> Call Trace:
> <TASK>
> split_huge_page_to_list+0x4d/0x1380
> ? sysvec_apic_timer_interrupt+0xf/0x80
> try_to_split_thp_page+0x3a/0xf0
> soft_offline_page+0x1ea/0x8a0
> soft_offline_page_store+0x52/0x90
> kernfs_fop_write_iter+0x118/0x1b0
> vfs_write+0x30b/0x430
> ksys_write+0x5e/0xe0
> do_syscall_64+0xb0/0x1b0
> entry_SYSCALL_64_after_hwframe+0x6d/0x75
> RIP: 0033:0x7f6c60d14697

Yes. I'd trim the sysvec_apic_timer_interrupt+0xf/0x80 line too.
These lines aren't actually part of the call trace. They're addresses
that the unwinder found on the stack but don't actually fit the call
trace. The unwinder prints them in case they're helpful, but marks them with a ?
to indicate that they're probably not part of the call trace.
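
The trimming described above could be done by hand or with a small helper.
The following is a minimal sketch, assuming each frame is on its own line
and that unreliable frames start with "? " after optional leading
whitespace; is_unreliable_frame() and trim_trace() are hypothetical names
for this illustration, not existing kernel or tooling functions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True when a trace line is one the unwinder marked with a leading '?'. */
bool is_unreliable_frame(const char *line)
{
	while (*line == ' ' || *line == '\t')
		line++;
	return line[0] == '?' && line[1] == ' ';
}

/*
 * Copy 'trace' into 'out', dropping lines marked with '?'.
 * Returns the number of lines kept. Copying stops early rather than
 * truncating a line when 'out' is too small.
 */
size_t trim_trace(const char *trace, char *out, size_t out_size)
{
	size_t used = 0, kept = 0;

	if (out_size == 0)
		return 0;

	while (*trace != '\0') {
		const char *eol = strchr(trace, '\n');
		size_t len = eol ? (size_t)(eol - trace) + 1 : strlen(trace);

		if (!is_unreliable_frame(trace)) {
			if (used + len >= out_size)
				break;
			memcpy(out + used, trace, len);
			used += len;
			kept++;
		}
		trace += len;
	}
	out[used] = '\0';
	return kept;
}
```

Lines that do not start with "? " are copied through unchanged, so register
dumps and the RIP line survive the filter.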

2024-01-26 01:14:05

by Miaohe Lin

Subject: Re: [PATCH v2] mm/memory-failure: fix crash in split_huge_page_to_list from soft_offline_page

On 2024/1/25 22:22, Matthew Wilcox wrote:
> On Thu, Jan 25, 2024 at 07:53:25PM +0800, Miaohe Lin wrote:
>> On 2024/1/24 21:15, Matthew Wilcox wrote:
>>>> Call Trace:
>>>> <TASK>
>>>> ? die+0x32/0x90
>>>> ? do_trap+0xde/0x110
>>>> ? folio_memcg+0xaf/0xd0
>>>> ? do_error_trap+0x60/0x80
>>>> ? folio_memcg+0xaf/0xd0
>>>> ? exc_invalid_op+0x53/0x70
>>>> ? folio_memcg+0xaf/0xd0
>>>> ? asm_exc_invalid_op+0x1a/0x20
>>>> ? folio_memcg+0xaf/0xd0
>>>> ? folio_memcg+0xae/0xd0
>>>
>>> I might trim these ? lines out of the backtrace ...
>>
>> Do you mean make backtrace looks like something below?
>>
>> Call Trace:
>> <TASK>
>> split_huge_page_to_list+0x4d/0x1380
>> ? sysvec_apic_timer_interrupt+0xf/0x80
>> try_to_split_thp_page+0x3a/0xf0
>> soft_offline_page+0x1ea/0x8a0
>> soft_offline_page_store+0x52/0x90
>> kernfs_fop_write_iter+0x118/0x1b0
>> vfs_write+0x30b/0x430
>> ksys_write+0x5e/0xe0
>> do_syscall_64+0xb0/0x1b0
>> entry_SYSCALL_64_after_hwframe+0x6d/0x75
>> RIP: 0033:0x7f6c60d14697
>
> Yes. I'd trim the sysvec_apic_timer_interrupt+0xf/0x80 line too.
> These lines aren't actually part of the call trace. They're addresses
> that the unwinder found on the stack but don't actually fit the call
> trace. It puts them in in case they're helpful, but marks them with a ?
> to indicate that they're probably not part of the call trace.

I see. Many thanks for your explanation. I will update the backtrace in the next version.

Thanks.