2023-10-09 14:56:28

by Usama Arif

Subject: [PATCH] mm: hugetlb: Only prep and add allocated folios for non-gigantic pages

Calling prep_and_add_allocated_folios when allocating gigantic pages
at boot time causes the kernel to crash, as folio_list is empty
and iterating it causes a NULL pointer dereference. Call this only
for non-gigantic pages, when folio_list has entries.

Fixes: bfb41d6b2fe148 ("hugetlb: restructure pool allocations")
Signed-off-by: Usama Arif <[email protected]>
---
mm/hugetlb.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f3749fc125d4..b12f5fd295bb 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3397,7 +3397,8 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 	}
 
 	/* list will be empty if hstate_is_gigantic */
-	prep_and_add_allocated_folios(h, &folio_list);
+	if (!hstate_is_gigantic(h))
+		prep_and_add_allocated_folios(h, &folio_list);
 
 	if (i < h->max_huge_pages) {
 		char buf[32];
--
2.25.1


2023-10-10 01:24:57

by Mike Kravetz

Subject: Re: [PATCH] mm: hugetlb: Only prep and add allocated folios for non-gigantic pages

On 10/09/23 15:56, Usama Arif wrote:
> Calling prep_and_add_allocated_folios when allocating gigantic pages
> at boot time causes the kernel to crash as folio_list is empty
> and iterating it causes a NULL pointer dereference. Call this only
> for non-gigantic pages when folio_list has entries.

Thanks!

However, are you sure the issue is the result of iterating through a
NULL list? For reference, the routine prep_and_add_allocated_folios is:

static void prep_and_add_allocated_folios(struct hstate *h,
					struct list_head *folio_list)
{
	struct folio *folio, *tmp_f;

	/* Add all new pool pages to free lists in one lock cycle */
	spin_lock_irq(&hugetlb_lock);
	list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
		__prep_account_new_huge_page(h, folio_nid(folio));
		enqueue_hugetlb_folio(h, folio);
	}
	spin_unlock_irq(&hugetlb_lock);
}

If folio_list is empty, then the only code that should be executed is
acquiring the lock, notice the list is empty, release the lock.

In the case of gigantic pages addressed below, I do see the warning:

[ 0.055140] DEBUG_LOCKS_WARN_ON(early_boot_irqs_disabled)
[ 0.055149] WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:4345 lockdep_hardirqs_on_prepare+0x1a8/0x1b0
[ 0.055153] Modules linked in:
[ 0.055155] CPU: 0 PID: 0 Comm: swapper Not tainted 6.6.0-rc4+ #40
[ 0.055157] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
[ 0.055158] RIP: 0010:lockdep_hardirqs_on_prepare+0x1a8/0x1b0
[ 0.055160] Code: 00 85 c0 0f 84 5e ff ff ff 8b 0d a7 20 74 01 85 c9 0f 85 50 ff ff ff 48 c7 c6 48 25 42 82 48 c7 c7 70 7f 40 82 e8 18 10 f7 ff <0f> 0b 5b e9 e0 d8 af 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90
[ 0.055162] RSP: 0000:ffffffff82603d40 EFLAGS: 00010086 ORIG_RAX: 0000000000000000
[ 0.055164] RAX: 0000000000000000 RBX: ffffffff827911e0 RCX: 0000000000000000
[ 0.055165] RDX: 0000000000000004 RSI: ffffffff8246b3e1 RDI: 00000000ffffffff
[ 0.055166] RBP: 0000000000000002 R08: 0000000000000001 R09: 0000000000000000
[ 0.055166] R10: ffffffffffffffff R11: 284e4f5f4e524157 R12: 0000000000000001
[ 0.055167] R13: ffffffff82eb6316 R14: ffffffff82603d70 R15: ffffffff82ee5f70
[ 0.055169] FS: 0000000000000000(0000) GS:ffff888277c00000(0000) knlGS:0000000000000000
[ 0.055170] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.055171] CR2: ffff88847ffff000 CR3: 000000000263a000 CR4: 00000000000200b0
[ 0.055174] Call Trace:
[ 0.055174] <TASK>
[ 0.055175] ? lockdep_hardirqs_on_prepare+0x1a8/0x1b0
[ 0.055177] ? __warn+0x81/0x170
[ 0.055181] ? lockdep_hardirqs_on_prepare+0x1a8/0x1b0
[ 0.055182] ? report_bug+0x18d/0x1c0
[ 0.055186] ? early_fixup_exception+0x92/0xb0
[ 0.055189] ? early_idt_handler_common+0x2f/0x40
[ 0.055194] ? lockdep_hardirqs_on_prepare+0x1a8/0x1b0
[ 0.055196] trace_hardirqs_on+0x10/0xa0
[ 0.055198] _raw_spin_unlock_irq+0x24/0x50
[ 0.055201] hugetlb_hstate_alloc_pages+0x311/0x3e0
[ 0.055206] hugepages_setup+0x220/0x2c0
[ 0.055210] unknown_bootoption+0x98/0x1d0
[ 0.055213] parse_args+0x152/0x440
[ 0.055216] ? __pfx_unknown_bootoption+0x10/0x10
[ 0.055220] start_kernel+0x1af/0x6c0
[ 0.055222] ? __pfx_unknown_bootoption+0x10/0x10
[ 0.055225] x86_64_start_reservations+0x14/0x30
[ 0.055227] x86_64_start_kernel+0x74/0x80
[ 0.055229] secondary_startup_64_no_verify+0x166/0x16b
[ 0.055234] </TASK>
[ 0.055235] irq event stamp: 0
[ 0.055236] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[ 0.055238] hardirqs last disabled at (0): [<0000000000000000>] 0x0
[ 0.055239] softirqs last enabled at (0): [<0000000000000000>] 0x0
[ 0.055240] softirqs last disabled at (0): [<0000000000000000>] 0x0
[ 0.055240] ---[ end trace 0000000000000000 ]---

This is because interrupts are not enabled this early in boot, and the
spin_unlock_irq() would incorrectly enable interrupts too early. I wonder
if this 'warning' could translate to a panic or NULL deref under certain
configurations?

Konrad, I am interested to see if this addresses your booting problem. But,
your stack trace is a bit different. My 'guess' is that this will not address
your issue. If it does not, can you try the following patch? This
applies to next-20231009.
--
Mike Kravetz

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f3749fc125d4..8346c98e5616 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2178,18 +2178,19 @@ static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
 static void prep_and_add_allocated_folios(struct hstate *h,
 					struct list_head *folio_list)
 {
+	unsigned long flags;
 	struct folio *folio, *tmp_f;
 
 	/* Send list for bulk vmemmap optimization processing */
 	hugetlb_vmemmap_optimize_folios(h, folio_list);
 
 	/* Add all new pool pages to free lists in one lock cycle */
-	spin_lock_irq(&hugetlb_lock);
+	spin_lock_irqsave(&hugetlb_lock, flags);
 	list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
 		__prep_account_new_huge_page(h, folio_nid(folio));
 		enqueue_hugetlb_folio(h, folio);
 	}
-	spin_unlock_irq(&hugetlb_lock);
+	spin_unlock_irqrestore(&hugetlb_lock, flags);
 }

/*
@@ -3224,13 +3225,14 @@ static void __init hugetlb_folio_init_vmemmap(struct folio *folio,
 static void __init prep_and_add_bootmem_folios(struct hstate *h,
 					struct list_head *folio_list)
 {
+	unsigned long flags;
 	struct folio *folio, *tmp_f;
 
 	/* Send list for bulk vmemmap optimization processing */
 	hugetlb_vmemmap_optimize_folios(h, folio_list);
 
 	/* Add all new pool pages to free lists in one lock cycle */
-	spin_lock_irq(&hugetlb_lock);
+	spin_lock_irqsave(&hugetlb_lock, flags);
 	list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
 		if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
 			/*
@@ -3246,7 +3248,7 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
 		__prep_account_new_huge_page(h, folio_nid(folio));
 		enqueue_hugetlb_folio(h, folio);
 	}
-	spin_unlock_irq(&hugetlb_lock);
+	spin_unlock_irqrestore(&hugetlb_lock, flags);
 }

/*

2023-10-10 17:02:21

by Usama Arif

Subject: Re: [External] Re: [PATCH] mm: hugetlb: Only prep and add allocated folios for non-gigantic pages



On 10/10/2023 02:23, Mike Kravetz wrote:
> On 10/09/23 15:56, Usama Arif wrote:
>> Calling prep_and_add_allocated_folios when allocating gigantic pages
>> at boot time causes the kernel to crash as folio_list is empty
>> and iterating it causes a NULL pointer dereference. Call this only
>> for non-gigantic pages when folio_list has entries.
>
> Thanks!
>
> However, are you sure the issue is the result of iterating through a
> NULL list? For reference, the routine prep_and_add_allocated_folios is:
>

Yes, you are right, it wasn't an issue with the list, but with the lock.
If I do the below diff, it boots.

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 73803d62066a..f428af13e98a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2178,18 +2178,19 @@ static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
 static void prep_and_add_allocated_folios(struct hstate *h,
 					struct list_head *folio_list)
 {
+	unsigned long flags;
 	struct folio *folio, *tmp_f;
 
 	/* Send list for bulk vmemmap optimization processing */
 	hugetlb_vmemmap_optimize_folios(h, folio_list);
 
 	/* Add all new pool pages to free lists in one lock cycle */
-	spin_lock_irq(&hugetlb_lock);
+	spin_lock_irqsave(&hugetlb_lock, flags);
 	list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
 		__prep_account_new_huge_page(h, folio_nid(folio));
 		enqueue_hugetlb_folio(h, folio);
 	}
-	spin_unlock_irq(&hugetlb_lock);
+	spin_unlock_irqrestore(&hugetlb_lock, flags);
 }

/*


FYI, this was an x86 VM with KVM enabled.

Thanks,
Usama

> [snip]

2023-10-10 21:31:03

by Mike Kravetz

Subject: Re: [External] Re: [PATCH] mm: hugetlb: Only prep and add allocated folios for non-gigantic pages

On 10/10/23 18:01, Usama Arif wrote:
>
>
> On 10/10/2023 02:23, Mike Kravetz wrote:
> > On 10/09/23 15:56, Usama Arif wrote:
> > > Calling prep_and_add_allocated_folios when allocating gigantic pages
> > > at boot time causes the kernel to crash as folio_list is empty
> > > and iterating it causes a NULL pointer dereference. Call this only
> > > for non-gigantic pages when folio_list has entries.
> >
> > Thanks!
> >
> > However, are you sure the issue is the result of iterating through a
> > NULL list? For reference, the routine prep_and_add_allocated_folios is:
> >
>
> Yes, you are right, it wasnt an issue with the list, but the lock. If I do
> the below diff it boots.

Thanks!

I believe that may be the root cause of the boot issues with this
series. It is unfortunate that the failures were not consistent and did
not point directly at the root cause.

Hopefully, these changes will resolve the boot issues for Konrad as well.

I will create a new version of the "Batch hugetlb vmemmap modification
operations" series with these locking changes.
--
Mike Kravetz

2023-10-10 21:32:21

by Konrad Dybcio

Subject: Re: [External] Re: [PATCH] mm: hugetlb: Only prep and add allocated folios for non-gigantic pages



On 10/10/23 23:30, Mike Kravetz wrote:
> On 10/10/23 18:01, Usama Arif wrote:
>>
>>
>> On 10/10/2023 02:23, Mike Kravetz wrote:
>>> On 10/09/23 15:56, Usama Arif wrote:
>>>> Calling prep_and_add_allocated_folios when allocating gigantic pages
>>>> at boot time causes the kernel to crash as folio_list is empty
>>>> and iterating it causes a NULL pointer dereference. Call this only
>>>> for non-gigantic pages when folio_list has entries.
>>>
>>> Thanks!
>>>
>>> However, are you sure the issue is the result of iterating through a
>>> NULL list? For reference, the routine prep_and_add_allocated_folios is:
>>>
>>
>> Yes, you are right, it wasnt an issue with the list, but the lock. If I do
>> the below diff it boots.
>
> Thanks!
>
> I believe that may be the root cause of the boot issues with
> this series. It is unfortunate that the failures were not consistent
> and did not directly point at the root cause.
>
> Hopefully, these changes will resolve the boot issues for Konrad as well.
>
> I will create a new version of the "Batch hugetlb vmemmap modification
> operations" series with these locking changes.
We sent a reply at the same time :P [1]

Konrad

[1]
https://lore.kernel.org/all/[email protected]/

2023-10-12 00:04:11

by Nathan Chancellor

Subject: Re: [PATCH] mm: hugetlb: Only prep and add allocated folios for non-gigantic pages

Hi Mike,

On Mon, Oct 09, 2023 at 06:23:45PM -0700, Mike Kravetz wrote:
> On 10/09/23 15:56, Usama Arif wrote:
> > [snip]

I suspect the crash that our continuous integration spotted [1] is the
same issue that Konrad is seeing, as I have bisected that failure to
bfb41d6b2fe1 in next-20231009. However, neither the first half of your
diff (since the second half does not apply at bfb41d6b2fe1) nor the
original patch in this thread resolves the issue, so maybe it is
entirely different from Konrad's?

For what it's worth, this issue is only visible for me when building for
arm64 using LLVM with CONFIG_INIT_STACK_NONE=y, instead of the default
CONFIG_INIT_STACK_ALL_ZERO=y (which appears to hide the problem?),
making it seem like it could be something with uninitialized memory... I
have not been able to reproduce it with GCC, which could also mean
something.

Using LLVM 17.0.2 from kernel.org [2]:

$ make -skj"$(nproc)" ARCH=arm64 LLVM=1 mrproper defconfig

$ scripts/config -d INIT_STACK_ALL_ZERO -e INIT_STACK_NONE

$ make -skj"$(nproc)" ARCH=arm64 LLVM=1 Image.gz

$ qemu-system-aarch64 \
-display none \
-nodefaults \
-cpu max,pauth-impdef=true \
-machine virt,gic-version=max,virtualization=true \
-append 'console=ttyAMA0 earlycon' \
-kernel arch/arm64/boot/Image.gz \
-initrd arm64-rootfs.cpio \
-m 512m \
-serial mon:stdio
...
[ 0.000000] Linux version 6.6.0-rc4-00317-gbfb41d6b2fe1 ([email protected]) (ClangBuiltLinux clang version 17.0.2 (https://github.com/llvm/llvm-project b2417f51dbbd7435eb3aaf203de24de6754da50e), ClangBuiltLinux LLD 17.0.2) #1 SMP PREEMPT Wed Oct 11 16:44:41 MST 2023
...
[ 0.304543] Unable to handle kernel paging request at virtual address ffffff602827f9f4
[ 0.304899] Mem abort info:
[ 0.305022] ESR = 0x0000000096000004
[ 0.305438] EC = 0x25: DABT (current EL), IL = 32 bits
[ 0.305668] SET = 0, FnV = 0
[ 0.305804] EA = 0, S1PTW = 0
[ 0.305949] FSC = 0x04: level 0 translation fault
[ 0.306156] Data abort info:
[ 0.306287] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[ 0.306500] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[ 0.306711] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 0.306976] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000041cc3000
[ 0.307251] [ffffff602827f9f4] pgd=0000000000000000, p4d=0000000000000000
[ 0.308086] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[ 0.308428] Modules linked in:
[ 0.308722] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.6.0-rc4-00317-gbfb41d6b2fe1 #1
[ 0.309159] Hardware name: linux,dummy-virt (DT)
[ 0.309496] pstate: 61400009 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[ 0.309987] pc : gather_bootmem_prealloc+0x80/0x1a8
[ 0.310673] lr : hugetlb_init+0x1c8/0x2ec
[ 0.310871] sp : ffff80008000ba10
[ 0.311038] x29: ffff80008000ba30 x28: 0000000000000000 x27: ffffd80a09fe7db8
[ 0.311417] x26: 0000000000000001 x25: ffffd80a09fe7db8 x24: 0000000000000100
[ 0.311702] x23: fffffc0000000000 x22: 0001000000000000 x21: ffff80008000ba18
[ 0.311987] x20: ffffff602827f9c0 x19: ffffd80a0a555b60 x18: 00000000fbf7386f
[ 0.312272] x17: 00000000bee83943 x16: 000000002ae32058 x15: 0000000000000000
[ 0.312557] x14: 0000000000000009 x13: ffffd80a0a556d28 x12: ffffffffffffee38
[ 0.312831] x11: ffffd80a0a556d28 x10: 0000000000000004 x9 : ffffd80a09fe7000
[ 0.313141] x8 : 0000000d80a09fe7 x7 : 0000000001e1f7fb x6 : 0000000000000008
[ 0.313425] x5 : ffffd80a09ef1454 x4 : ffff00001fed5630 x3 : 0000000000019e00
[ 0.313703] x2 : ffff000002407b80 x1 : 0000000000019d00 x0 : 0000000000000000
[ 0.314054] Call trace:
[ 0.314259] gather_bootmem_prealloc+0x80/0x1a8
[ 0.314536] hugetlb_init+0x1c8/0x2ec
[ 0.314743] do_one_initcall+0xac/0x220
[ 0.314928] do_initcall_level+0x8c/0xac
[ 0.315114] do_initcalls+0x54/0x94
[ 0.315276] do_basic_setup+0x1c/0x28
[ 0.315450] kernel_init_freeable+0x104/0x170
[ 0.315648] kernel_init+0x20/0x1a0
[ 0.315822] ret_from_fork+0x10/0x20
[ 0.316235] Code: 979e8c0d 8b160328 d34cfd08 8b081af4 (b9403688)
[ 0.316745] ---[ end trace 0000000000000000 ]---
[ 0.317463] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[ 0.318093] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---

The rootfs is available at [3] in case it is relevant. I am more than
happy to provide any additional information or test any patches as
necessary.

[1]: https://github.com/ClangBuiltLinux/continuous-integration2/actions/runs/6469151768/job/17570882198
[2]: https://mirrors.edge.kernel.org/pub/tools/llvm/
[3]: https://github.com/ClangBuiltLinux/boot-utils/releases

Cheers,
Nathan

2023-10-12 14:54:28

by Mike Kravetz

Subject: Re: [PATCH] mm: hugetlb: Only prep and add allocated folios for non-gigantic pages

On 10/11/23 17:03, Nathan Chancellor wrote:
> On Mon, Oct 09, 2023 at 06:23:45PM -0700, Mike Kravetz wrote:
> > On 10/09/23 15:56, Usama Arif wrote:
>
> I suspect the crash that our continuous integration spotted [1] is the
> same issue that Konrad is seeing, as I have bisected that failure to
> bfb41d6b2fe1 in next-20231009. However, neither the first half of your
> diff (since the second half does not apply at bfb41d6b2fe1) nor the
> original patch in this thread resolves the issue though, so maybe it is
> entirely different from Konrad's?
>
> For what it's worth, this issue is only visible for me when building for
> arm64 using LLVM with CONFIG_INIT_STACK_NONE=y, instead of the default
> CONFIG_INIT_STACK_ALL_ZERO=y (which appears to hide the problem?),
> making it seem like it could be something with uninitialized memory... I
> have not been able to reproduce it with GCC, which could also mean
> something.

Thank you Nathan! That is very helpful.

I will use this information to try and recreate. If I can recreate, I
should be able to get to root cause.
--
Mike Kravetz


2023-10-13 00:13:59

by Mike Kravetz

Subject: Re: [PATCH] mm: hugetlb: Only prep and add allocated folios for non-gigantic pages

On 10/12/23 07:53, Mike Kravetz wrote:
> On 10/11/23 17:03, Nathan Chancellor wrote:
> > On Mon, Oct 09, 2023 at 06:23:45PM -0700, Mike Kravetz wrote:
> > > On 10/09/23 15:56, Usama Arif wrote:
> >
> > I suspect the crash that our continuous integration spotted [1] is the
> > same issue that Konrad is seeing, as I have bisected that failure to
> > bfb41d6b2fe1 in next-20231009. However, neither the first half of your
> > diff (since the second half does not apply at bfb41d6b2fe1) nor the
> > original patch in this thread resolves the issue though, so maybe it is
> > entirely different from Konrad's?
> >
> > For what it's worth, this issue is only visible for me when building for
> > arm64 using LLVM with CONFIG_INIT_STACK_NONE=y, instead of the default
> > CONFIG_INIT_STACK_ALL_ZERO=y (which appears to hide the problem?),
> > making it seem like it could be something with uninitialized memory... I
> > have not been able to reproduce it with GCC, which could also mean
> > something.
>
> Thank you Nathan! That is very helpful.
>
> I will use this information to try and recreate. If I can recreate, I
> should be able to get to root cause.

I could easily recreate the issue using the provided instructions. First
thing I did was add a few printk's to check/verify state. The beginning
of gather_bootmem_prealloc looked like this:

static void __init gather_bootmem_prealloc(void)
{
	LIST_HEAD(folio_list);
	struct huge_bootmem_page *m;
	struct hstate *h, *prev_h = NULL;

	if (list_empty(&huge_boot_pages))
		printk("gather_bootmem_prealloc: huge_boot_pages list empty\n");

	list_for_each_entry(m, &huge_boot_pages, list) {
		struct page *page = virt_to_page(m);
		struct folio *folio = (void *)page;

		printk("gather_bootmem_prealloc: loop entry m %lx\n",
			(unsigned long)m);

The STRANGE thing is that the printk after testing for list_empty would
print, then we would enter the 'list_for_each_entry()' loop as if the list
was not empty. This is the cause of the addressing exception. m pointed
to the list head as opposed to an entry on the list.

I have attached disassembly of gather_bootmem_prealloc with INIT_STACK_NONE
and INIT_STACK_ALL_ZERO. disassembly listings are for code without
printks.

This is the first time I have looked at arm assembly, so I may be missing
something. However, in the INIT_STACK_NONE case it looks like we get the
address of huge_boot_pages into a register but do not use it to determine
if we should execute the loop. Code generated with INIT_STACK_ALL_ZERO seems
to show code checking the list before entering the loop.

Can someone with more arm assembly experience take a quick look? Since
huge_boot_pages is a global variable rather than on the stack, I can't
see how INIT_STACK_ALL_ZERO/INIT_STACK_NONE could make a difference.
--
Mike Kravetz


Attachments:
(No filename) (2.82 kB)
disass_INIT_STACK_NONE (9.87 kB)
disass_INIT_STACK_ALL_ZERO (10.13 kB)

2023-10-14 00:05:53

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH] mm: hugetlb: Only prep and add allocated folios for non-gigantic pages

On 10/12/23 17:12, Mike Kravetz wrote:
> On 10/12/23 07:53, Mike Kravetz wrote:
> > On 10/11/23 17:03, Nathan Chancellor wrote:
> > > On Mon, Oct 09, 2023 at 06:23:45PM -0700, Mike Kravetz wrote:
> > > > On 10/09/23 15:56, Usama Arif wrote:
> >
> > Thank you Nathan! That is very helpful.
> >
> > I will use this information to try and recreate. If I can recreate, I
> > should be able to get to root cause.
>
> I could easily recreate the issue using the provided instructions. First
> thing I did was add a few printk's to check/verify state. The beginning
> of gather_bootmem_prealloc looked like this:

Hi Nathan,

This is looking more and more like a Clang issue to me. I did a little
more problem isolation today. Here is what I did:

- Check out commit "hugetlb: restructure pool allocations" in linux-next
- Fix the known issue with early disable/enable IRQs via locking by
applying:

commit 266789498210dff6cf9a14b64fa3a5cb2fcc5858
Author: Mike Kravetz <[email protected]>
Date: Fri Oct 13 13:14:15 2023 -0700

fix prep_and_add_allocated_folios locking

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c843506654f8..d8ab2d9b391b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2246,15 +2246,16 @@ static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
 static void prep_and_add_allocated_folios(struct hstate *h,
 					struct list_head *folio_list)
 {
+	unsigned long flags;
 	struct folio *folio, *tmp_f;
 
 	/* Add all new pool pages to free lists in one lock cycle */
-	spin_lock_irq(&hugetlb_lock);
+	spin_lock_irqsave(&hugetlb_lock, flags);
 	list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
 		__prep_account_new_huge_page(h, folio_nid(folio));
 		enqueue_hugetlb_folio(h, folio);
 	}
-	spin_unlock_irq(&hugetlb_lock);
+	spin_unlock_irqrestore(&hugetlb_lock, flags);
 }
 
 /*

- Add the following code, which would only trigger a BUG if we were to
traverse an empty list, something that should NEVER happen.

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d8ab2d9b391b..be234831b33f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3294,11 +3294,21 @@ static void __init gather_bootmem_prealloc(void)
 	LIST_HEAD(folio_list);
 	struct huge_bootmem_page *m;
 	struct hstate *h, *prev_h = NULL;
+	bool empty;
+
+	empty = list_empty(&huge_boot_pages);
+	if (empty)
+		printk("gather_bootmem_prealloc: huge_boot_pages list empty\n");
 
 	list_for_each_entry(m, &huge_boot_pages, list) {
 		struct page *page = virt_to_page(m);
 		struct folio *folio = (void *)page;
 
+		if (empty) {
+			printk("  Traversing an empty list as if not empty!!!\n");
+			BUG();
+		}
+
 		h = m->hstate;
 		/*
 		 * It is possible to have multiple huge page sizes (hstates)

- As you have experienced, this will BUG if built with LLVM 17.0.2 and
CONFIG_INIT_STACK_NONE

- It will NOT BUG if built with LLVM 13.0.1 but will BUG if built with
LLVM llvm-14.0.6-x86_64 and later.

As mentioned in the previous email, the generated code for loop entry
looks wrong to my untrained eyes. Can you or someone on the llvm team
take a look?
--
Mike Kravetz

2023-10-18 20:56:08

by Nick Desaulniers

[permalink] [raw]
Subject: Re: [PATCH] mm: hugetlb: Only prep and add allocated folios for non-gigantic pages

On Fri, Oct 13, 2023 at 5:05 PM Mike Kravetz <[email protected]> wrote:
>
> On 10/12/23 17:12, Mike Kravetz wrote:
> > On 10/12/23 07:53, Mike Kravetz wrote:
> > > On 10/11/23 17:03, Nathan Chancellor wrote:
> > > > On Mon, Oct 09, 2023 at 06:23:45PM -0700, Mike Kravetz wrote:
> > > > > On 10/09/23 15:56, Usama Arif wrote:
> > >
> > > Thank you Nathan! That is very helpful.
> > >
> > > I will use this information to try and recreate. If I can recreate, I
> > > should be able to get to root cause.
> >
> > I could easily recreate the issue using the provided instructions. First
> > thing I did was add a few printk's to check/verify state. The beginning
> > of gather_bootmem_prealloc looked like this:
>
> Hi Nathan,
>
> This is looking more and more like a Clang issue to me. I did a little
> more problem isolation today. Here is what I did:
>
> - Check out commit "hugetlb: restructure pool allocations" in linux-next
> - Fix the known issue with early disable/enable IRQs via locking by
> applying:
>
> commit 266789498210dff6cf9a14b64fa3a5cb2fcc5858
> Author: Mike Kravetz <[email protected]>
> Date: Fri Oct 13 13:14:15 2023 -0700
>
> fix prep_and_add_allocated_folios locking
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index c843506654f8..d8ab2d9b391b 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2246,15 +2246,16 @@ static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
>  static void prep_and_add_allocated_folios(struct hstate *h,
>  					struct list_head *folio_list)
>  {
> +	unsigned long flags;
>  	struct folio *folio, *tmp_f;
> 
>  	/* Add all new pool pages to free lists in one lock cycle */
> -	spin_lock_irq(&hugetlb_lock);
> +	spin_lock_irqsave(&hugetlb_lock, flags);
>  	list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
>  		__prep_account_new_huge_page(h, folio_nid(folio));
>  		enqueue_hugetlb_folio(h, folio);
>  	}
> -	spin_unlock_irq(&hugetlb_lock);
> +	spin_unlock_irqrestore(&hugetlb_lock, flags);
>  }
> 
>  /*
>
> - Add the following code, which would only trigger a BUG if we were to
> traverse an empty list, something that should NEVER happen.
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index d8ab2d9b391b..be234831b33f 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3294,11 +3294,21 @@ static void __init gather_bootmem_prealloc(void)
>  	LIST_HEAD(folio_list);
>  	struct huge_bootmem_page *m;
>  	struct hstate *h, *prev_h = NULL;
> +	bool empty;
> +
> +	empty = list_empty(&huge_boot_pages);
> +	if (empty)
> +		printk("gather_bootmem_prealloc: huge_boot_pages list empty\n");
> 
>  	list_for_each_entry(m, &huge_boot_pages, list) {
>  		struct page *page = virt_to_page(m);
>  		struct folio *folio = (void *)page;
> 
> +		if (empty) {
> +			printk("  Traversing an empty list as if not empty!!!\n");
> +			BUG();
> +		}
> +
>  		h = m->hstate;
>  		/*
>  		 * It is possible to have multiple huge page sizes (hstates)
>
> - As you have experienced, this will BUG if built with LLVM 17.0.2 and
> CONFIG_INIT_STACK_NONE
>
> - It will NOT BUG if built with LLVM 13.0.1 but will BUG if built with
> LLVM llvm-14.0.6-x86_64 and later.
>
> As mentioned in the previous email, the generated code for loop entry
> looks wrong to my untrained eyes. Can you or someone on the llvm team
> take a look?

I think you need to initialize h, otherwise what value is passed to
prep_and_add_bootmem_folios if the loop is not run because the list is
empty. The compiler sees `h` is only given a value in the loop, so
the loop must be run. That's obviously hazardous, but the compiler
assumes there's no UB. At least that's my limited understanding
looking at the IR diff Nathan got me in
https://github.com/ClangBuiltLinux/linux/issues/1946.
--
Thanks,
~Nick Desaulniers

2023-10-18 22:21:23

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH] mm: hugetlb: Only prep and add allocated folios for non-gigantic pages

On 10/18/23 13:54, Nick Desaulniers wrote:
> On Fri, Oct 13, 2023 at 5:05 PM Mike Kravetz <[email protected]> wrote:
> >
> > On 10/12/23 17:12, Mike Kravetz wrote:
> > > On 10/12/23 07:53, Mike Kravetz wrote:
> > > > On 10/11/23 17:03, Nathan Chancellor wrote:
> > > > > On Mon, Oct 09, 2023 at 06:23:45PM -0700, Mike Kravetz wrote:
> > > > > > On 10/09/23 15:56, Usama Arif wrote:
> > > >
> > > > Thank you Nathan! That is very helpful.
> > > >
> > > > I will use this information to try and recreate. If I can recreate, I
> > > > should be able to get to root cause.
> > >
> > > I could easily recreate the issue using the provided instructions. First
> > > thing I did was add a few printk's to check/verify state. The beginning
> > > of gather_bootmem_prealloc looked like this:
> >
> > Hi Nathan,
> >
> > This is looking more and more like a Clang issue to me. I did a little
> > more problem isolation today. Here is what I did:
> >
> > - Check out commit "hugetlb: restructure pool allocations" in linux-next
> > - Fix the known issue with early disable/enable IRQs via locking by
> > applying:
> >
> > commit 266789498210dff6cf9a14b64fa3a5cb2fcc5858
> > Author: Mike Kravetz <[email protected]>
> > Date: Fri Oct 13 13:14:15 2023 -0700
> >
> > fix prep_and_add_allocated_folios locking
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index c843506654f8..d8ab2d9b391b 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2246,15 +2246,16 @@ static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
> >  static void prep_and_add_allocated_folios(struct hstate *h,
> >  					struct list_head *folio_list)
> >  {
> > +	unsigned long flags;
> >  	struct folio *folio, *tmp_f;
> > 
> >  	/* Add all new pool pages to free lists in one lock cycle */
> > -	spin_lock_irq(&hugetlb_lock);
> > +	spin_lock_irqsave(&hugetlb_lock, flags);
> >  	list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
> >  		__prep_account_new_huge_page(h, folio_nid(folio));
> >  		enqueue_hugetlb_folio(h, folio);
> >  	}
> > -	spin_unlock_irq(&hugetlb_lock);
> > +	spin_unlock_irqrestore(&hugetlb_lock, flags);
> >  }
> > 
> >  /*
> >
> > - Add the following code, which would only trigger a BUG if we were to
> > traverse an empty list, something that should NEVER happen.
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index d8ab2d9b391b..be234831b33f 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -3294,11 +3294,21 @@ static void __init gather_bootmem_prealloc(void)
> >  	LIST_HEAD(folio_list);
> >  	struct huge_bootmem_page *m;
> >  	struct hstate *h, *prev_h = NULL;
> > +	bool empty;
> > +
> > +	empty = list_empty(&huge_boot_pages);
> > +	if (empty)
> > +		printk("gather_bootmem_prealloc: huge_boot_pages list empty\n");
> > 
> >  	list_for_each_entry(m, &huge_boot_pages, list) {
> >  		struct page *page = virt_to_page(m);
> >  		struct folio *folio = (void *)page;
> > 
> > +		if (empty) {
> > +			printk("  Traversing an empty list as if not empty!!!\n");
> > +			BUG();
> > +		}
> > +
> >  		h = m->hstate;
> >  		/*
> >  		 * It is possible to have multiple huge page sizes (hstates)
> >
> > - As you have experienced, this will BUG if built with LLVM 17.0.2 and
> > CONFIG_INIT_STACK_NONE
> >
> > - It will NOT BUG if built with LLVM 13.0.1 but will BUG if built with
> > LLVM llvm-14.0.6-x86_64 and later.
> >
> > As mentioned in the previous email, the generated code for loop entry
> > looks wrong to my untrained eyes. Can you or someone on the llvm team
> > take a look?
>
> I think you need to initialize h, otherwise what value is passed to
> prep_and_add_bootmem_folios if the loop is not run because the list is
> empty. The compiler sees `h` is only given a value in the loop, so
> the loop must be run. That's obviously hazardous, but the compiler
> assumes there's no UB. At least that's my limited understanding
> looking at the IR diff Nathan got me in
> https://github.com/ClangBuiltLinux/linux/issues/1946.

Thanks for looking closer at this Nick and Nathan!

I think you are saying the compiler is running the loop because it wants
to initialize h before passing the value to another function. It does
this even if the explicit loop entry condition is false. Is that correct?

For me, that is unexpected.

Internally, someone brought up the possibility that this could have been
caused by h not being initialized. However, I dismissed this. Why?
If h is not initialized, then this means we did not enter the loop and
process any entries. Hence, the list (folio_list) also passed to
prep_and_add_bootmem_folios is empty. In this case, prep_and_add_bootmem_folios
does not use the passed value h. h only applies to values in the list.

Sure, the coding is a little sloppy. But, I really did not expect this
to result in making a run through the loop when the entry condition was
false.

I will verify that initializing h will address the issue and if so, send
another version of this series.
--
Mike Kravetz

2023-10-19 02:39:33

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH] mm: hugetlb: Only prep and add allocated folios for non-gigantic pages

On 10/09/23 15:56, Usama Arif wrote:
> Calling prep_and_add_allocated_folios when allocating gigantic pages
> at boot time causes the kernel to crash as folio_list is empty
> and iterating it causes a NULL pointer dereference. Call this only
> for non-gigantic pages when folio_list has entries.
>
> Fixes: bfb41d6b2fe148 ("hugetlb: restructure pool allocations")
> Signed-off-by: Usama Arif <[email protected]>
> ---
> mm/hugetlb.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)

Hi Andrew,

Can you remove this from mm-unstable. The root cause of Usama's crash
was improper irq enablement via locking calls. The changes Usama
verified (later in this thread) are in v8 of the "Batch hugetlb vmemmap
modification operations" series I just sent.
--
Mike Kravetz

2023-10-19 04:33:30

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCH] mm: hugetlb: Only prep and add allocated folios for non-gigantic pages

On (23/10/18 15:20), Mike Kravetz wrote:
> > I think you need to initialize h, otherwise what value is passed to
> > prep_and_add_bootmem_folios if the loop is not run because the list is
> > empty. The compiler sees `h` is only given a value in the loop, so
> > the loop must be run. That's obviously hazardous, but the compiler
> > assumes there's no UB. At least that's my limited understanding
> > looking at the IR diff Nathan got me in
> > https://github.com/ClangBuiltLinux/linux/issues/1946.
>
> Thanks for looking closer at this Nick and Nathan!
>
> I think you are saying the compiler is running the loop because it wants
> to initialize h before passing the value to another function. It does
> this even if the explicit loop entry condition is false. Is that correct?

The loop is getting promoted to an "infinite" loop; there is no
&pos->member != (head) condition check in the generated code
at all (at least on my machine).

I wish we could at least get the "possibly uninitialized variable"
warning from the compiler in this case, which we'd translate to
"hold my beer, I'm going to try one thing".

2023-10-19 14:21:49

by Nathan Chancellor

[permalink] [raw]
Subject: Re: [PATCH] mm: hugetlb: Only prep and add allocated folios for non-gigantic pages

On Thu, Oct 19, 2023 at 01:33:05PM +0900, Sergey Senozhatsky wrote:
> On (23/10/18 15:20), Mike Kravetz wrote:
> > > I think you need to initialize h, otherwise what value is passed to
> > > prep_and_add_bootmem_folios if the loop is not run because the list is
> > > empty. The compiler sees `h` is only given a value in the loop, so
> > > the loop must be run. That's obviously hazardous, but the compiler
> > > assumes there's no UB. At least that's my limited understanding
> > > looking at the IR diff Nathan got me in
> > > https://github.com/ClangBuiltLinux/linux/issues/1946.
> >
> > Thanks for looking closer at this Nick and Nathan!
> >
> > I think you are saying the compiler is running the loop because it wants
> > to initialize h before passing the value to another function. It does
> > this even if the explicit loop entry condition is false. Is that correct?
>
> The loop is getting promoted to "infinite" loop, there is no
> &pos->member != (head) condition check in the generated code
> at all (at least on my machine).
>
> I wish we could at least get the "possibly uninitialized variable"
> warning from the compiler in this case, which we'd translate to
> "hold my beer, I'm going to try one thing".

GCC would warn about this under -Wmaybe-uninitialized, but it has been
disabled in a normal build for the past three years; see commit
78a5255ffb6a ("Stop the ad-hoc games with -Wno-maybe-initialized").

In function 'gather_bootmem_prealloc',
inlined from 'hugetlb_init' at mm/hugetlb.c:4299:2:
mm/hugetlb.c:3203:9: warning: 'h' may be used uninitialized [-Wmaybe-uninitialized]
3203 | prep_and_add_allocated_folios(h, &folio_list);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/hugetlb.c: In function 'hugetlb_init':
mm/hugetlb.c:3166:24: note: 'h' was declared here
3166 | struct hstate *h, *prev_h = NULL;
| ^

Clang's -Wconditional-uninitialized would have flagged it too, but it
suffers from the same problems as -Wmaybe-uninitialized.

mm/hugetlb.c:3203:32: warning: variable 'h' may be uninitialized when used here [-Wconditional-uninitialized]
3203 | prep_and_add_allocated_folios(h, &folio_list);
| ^
mm/hugetlb.c:3166:18: note: initialize the variable 'h' to silence this warning
3166 | struct hstate *h, *prev_h = NULL;
| ^
| = NULL

I know clang has some handling for loops in -Wsometimes-uninitialized;
I wonder why that does not trigger here...

Cheers,
Nathan