On 30 Mar 2022, at 17:43, Vlastimil Babka wrote:
> On 3/30/22 22:05, Linus Torvalds wrote:
>> On Wed, Mar 30, 2022 at 12:42 PM Steven Rostedt <[email protected]> wrote:
>>>
>>> I started testing new patches and it crashed when doing the x86-32 test on
>>> boot up.
>>>
>>> Initializing HighMem for node 0 (000375fe:0021ee00)
>>> BUG: kernel NULL pointer dereference, address: 00000878
>>> #PF: supervisor read access in kernel mode
>>> #PF: error_code(0x0000) - not-present page
>>> *pdpt = 0000000000000000 *pde = f0000000f000eef3
>>> Oops: 0000 [#1] PREEMPT SMP PTI
>>> CPU: 0 PID: 0 Comm: swapper Not tainted 5.17.0-test+ #469
>>> Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
>>> EIP: get_pfnblock_flags_mask+0x2c/0x36
>>> Code: 6d ea ff 55 89 e5 56 89 ce 53 8b 18 89 d8 c1 eb 1e e8 f7 fb ff ff 69 db c0 02 00 00 89 c1 89 c2 c1 ea 05 8b 83 7c d7 79 c1 5b <8b> 04 90 d3 e8 21 f0 5e 5d c3 55 89 e5 57 56 89 d6 53 89 c3 64 a1
>>
>> The whole function is in that Code: thing, and it decodes to:
>>
>> 0: 55 push %ebp
>> 1: 89 e5 mov %esp,%ebp
>> 3: 56 push %esi
>> 4: 89 ce mov %ecx,%esi
>> 6: 53 push %ebx
>> 7: 8b 18 mov (%eax),%ebx
>> 9: 89 d8 mov %ebx,%eax
>> b: c1 eb 1e shr $0x1e,%ebx
>> e: e8 f7 fb ff ff call 0xfffffc0a
>> 13: 69 db c0 02 00 00 imul $0x2c0,%ebx,%ebx
>> 19: 89 c1 mov %eax,%ecx
>> 1b: 89 c2 mov %eax,%edx
>> 1d: c1 ea 05 shr $0x5,%edx
>> 20: 8b 83 7c d7 79 c1 mov -0x3e862884(%ebx),%eax
>> 26: 5b pop %ebx
>> 27:* 8b 04 90 mov (%eax,%edx,4),%eax <-- trapping instruction
>> 2a: d3 e8 shr %cl,%eax
>> 2c: 21 f0 and %esi,%eax
>> 2e: 5e pop %esi
>> 2f: 5d pop %ebp
>> 30: c3 ret
>>
>> with '%eax' being NULL, and %edx being 0x21e.
>>
>> (The call seems to be to 'pfn_to_bitidx().isra.0' if my compiler does
>> similar code generation, so it's out-of-lined part of pfn_to_bitidx()
>> despite being marked inline)
>>
>> So that oops is that
>>
>> word = bitmap[word_bitidx];
>>
>> line, with 'bitmap' being NULL (and %edx contains 'word_bitidx').
>>
>> Looking around, your 'config-bad' doesn't even have
>> CONFIG_MEMORY_ISOLATION enabled, and so I suspect the culprit is this
>> part of the change:
>>
>> - if (unlikely(has_isolate_pageblock(zone))) {
>>
>> which used to always be false for that config, and now the code is
>> suddenly enabled.
>
> If CONFIG_MEMORY_ISOLATION was enabled then the zone layout would be the
> same, so I think it's not simply that. I think it's the timing -
> has_isolate_pageblock(zone) would only be possible to become true later
> in runtime when some isolation is ongoing, but here it seems we are
> still in the early boot. Probably at a boundary of highmem with another
> zone that doesn't have the pageblock bitmap yet initialized? While later
> it would have, and all would be fine.
It seems so based on the boot log from Steven.
>
> As Zi Yan said, the usual merging code will, through page_is_buddy()
> find safely enough the buddy is not applicable, so I agree with his
> patch direction. Seems this also shows the code tried to become too
> smart and for the next merge window we should try just move the
> migratetype checks into the main while loop (under something like "if
> (order >= max_order)") and simplify the function a lot, hopefully with
> negligible perf impact.
Something like this?
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6c6af8658775..568ecaf5700d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1053,7 +1053,6 @@ static inline void __free_one_page(struct page *page,
int migratetype, fpi_t fpi_flags)
{
struct capture_control *capc = task_capc(zone);
- unsigned int max_order = pageblock_order;
unsigned long buddy_pfn;
unsigned long combined_pfn;
struct page *buddy;
@@ -1069,8 +1068,7 @@ static inline void __free_one_page(struct page *page,
VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
VM_BUG_ON_PAGE(bad_range(zone, page), page);
-continue_merging:
- while (order < max_order) {
+ while (order < MAX_ORDER - 1) {
if (compaction_capture(capc, page, order, migratetype)) {
__mod_zone_freepage_state(zone, -(1 << order),
migratetype);
@@ -1081,6 +1079,27 @@ static inline void __free_one_page(struct page *page,
if (!page_is_buddy(page, buddy, order))
goto done_merging;
+
+ if (order >= pageblock_order) {
+ /* If we are here, it means order is >= pageblock_order.
+ * We want to prevent merge between freepages on pageblock
+ * without fallbacks and normal pageblock. Without this,
+ * pageblock isolation could cause incorrect freepage or CMA
+ * accounting or HIGHATOMIC accounting.
+ *
+ * We don't want to hit this code for the more frequent
+ * low-order merging.
+ */
+ int buddy_mt;
+
+ buddy_mt = get_pageblock_migratetype(buddy);
+
+ if (migratetype != buddy_mt
+ && (!migratetype_is_mergeable(migratetype) ||
+ !migratetype_is_mergeable(buddy_mt)))
+ goto done_merging;
+ }
+
/*
* Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
* merge with it and move up one order.
@@ -1094,32 +1113,6 @@ static inline void __free_one_page(struct page *page,
pfn = combined_pfn;
order++;
}
- if (order < MAX_ORDER - 1) {
- /* If we are here, it means order is >= pageblock_order.
- * We want to prevent merge between freepages on pageblock
- * without fallbacks and normal pageblock. Without this,
- * pageblock isolation could cause incorrect freepage or CMA
- * accounting or HIGHATOMIC accounting.
- *
- * We don't want to hit this code for the more frequent
- * low-order merging.
- */
- int buddy_mt;
-
- buddy_pfn = __find_buddy_pfn(pfn, order);
- buddy = page + (buddy_pfn - pfn);
-
- if (!page_is_buddy(page, buddy, order))
- goto done_merging;
- buddy_mt = get_pageblock_migratetype(buddy);
-
- if (migratetype != buddy_mt
- && (!migratetype_is_mergeable(migratetype) ||
- !migratetype_is_mergeable(buddy_mt)))
- goto done_merging;
- max_order = order + 1;
- goto continue_merging;
- }
done_merging:
set_buddy_order(page, order);
--
Best Regards,
Yan, Zi
On 3/31/22 00:03, Zi Yan wrote:
> On 30 Mar 2022, at 17:43, Vlastimil Babka wrote:
>
>> On 3/30/22 22:05, Linus Torvalds wrote:
>>> On Wed, Mar 30, 2022 at 12:42 PM Steven Rostedt <[email protected]> wrote:
>>>>
>>>> I started testing new patches and it crashed when doing the x86-32 test on
>>>> boot up.
>>>>
>>>> Initializing HighMem for node 0 (000375fe:0021ee00)
>>>> BUG: kernel NULL pointer dereference, address: 00000878
>>>> #PF: supervisor read access in kernel mode
>>>> #PF: error_code(0x0000) - not-present page
>>>> *pdpt = 0000000000000000 *pde = f0000000f000eef3
>>>> Oops: 0000 [#1] PREEMPT SMP PTI
>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 5.17.0-test+ #469
>>>> Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
>>>> EIP: get_pfnblock_flags_mask+0x2c/0x36
>>>> Code: 6d ea ff 55 89 e5 56 89 ce 53 8b 18 89 d8 c1 eb 1e e8 f7 fb ff ff 69 db c0 02 00 00 89 c1 89 c2 c1 ea 05 8b 83 7c d7 79 c1 5b <8b> 04 90 d3 e8 21 f0 5e 5d c3 55 89 e5 57 56 89 d6 53 89 c3 64 a1
>>>
>>> The whole function is in that Code: thing, and it decodes to:
>>>
>>> 0: 55 push %ebp
>>> 1: 89 e5 mov %esp,%ebp
>>> 3: 56 push %esi
>>> 4: 89 ce mov %ecx,%esi
>>> 6: 53 push %ebx
>>> 7: 8b 18 mov (%eax),%ebx
>>> 9: 89 d8 mov %ebx,%eax
>>> b: c1 eb 1e shr $0x1e,%ebx
>>> e: e8 f7 fb ff ff call 0xfffffc0a
>>> 13: 69 db c0 02 00 00 imul $0x2c0,%ebx,%ebx
>>> 19: 89 c1 mov %eax,%ecx
>>> 1b: 89 c2 mov %eax,%edx
>>> 1d: c1 ea 05 shr $0x5,%edx
>>> 20: 8b 83 7c d7 79 c1 mov -0x3e862884(%ebx),%eax
>>> 26: 5b pop %ebx
>>> 27:* 8b 04 90 mov (%eax,%edx,4),%eax <-- trapping instruction
>>> 2a: d3 e8 shr %cl,%eax
>>> 2c: 21 f0 and %esi,%eax
>>> 2e: 5e pop %esi
>>> 2f: 5d pop %ebp
>>> 30: c3 ret
>>>
>>> with '%eax' being NULL, and %edx being 0x21e.
>>>
>>> (The call seems to be to 'pfn_to_bitidx().isra.0' if my compiler does
>>> similar code generation, so it's out-of-lined part of pfn_to_bitidx()
>>> despite being marked inline)
>>>
>>> So that oops is that
>>>
>>> word = bitmap[word_bitidx];
>>>
>>> line, with 'bitmap' being NULL (and %edx contains 'word_bitidx').
>>>
>>> Looking around, your 'config-bad' doesn't even have
>>> CONFIG_MEMORY_ISOLATION enabled, and so I suspect the culprit is this
>>> part of the change:
>>>
>>> - if (unlikely(has_isolate_pageblock(zone))) {
>>>
>>> which used to always be false for that config, and now the code is
>>> suddenly enabled.
>>
>> If CONFIG_MEMORY_ISOLATION was enabled then the zone layout would be the
>> same, so I think it's not simply that. I think it's the timing -
>> has_isolate_pageblock(zone) would only be possible to become true later
>> in runtime when some isolation is ongoing, but here it seems we are
>> still in the early boot. Probably at a boundary of highmem with another
>> zone that doesn't have the pageblock bitmap yet initialized? While later
>> it would have, and all would be fine.
>
> It seems so based on the boot log from Steven.
>
>>
>> As Zi Yan said, the usual merging code will, through page_is_buddy()
>> find safely enough the buddy is not applicable, so I agree with his
>> patch direction. Seems this also shows the code tried to become too
>> smart and for the next merge window we should try just move the
>> migratetype checks into the main while loop (under something like "if
>> (order >= max_order)") and simplify the function a lot, hopefully with
>> negligible perf impact.
>
> Something like this?
Exactly (with some nits).
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6c6af8658775..568ecaf5700d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1053,7 +1053,6 @@ static inline void __free_one_page(struct page *page,
> int migratetype, fpi_t fpi_flags)
> {
> struct capture_control *capc = task_capc(zone);
> - unsigned int max_order = pageblock_order;
> unsigned long buddy_pfn;
> unsigned long combined_pfn;
> struct page *buddy;
> @@ -1069,8 +1068,7 @@ static inline void __free_one_page(struct page *page,
> VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
> VM_BUG_ON_PAGE(bad_range(zone, page), page);
>
> -continue_merging:
> - while (order < max_order) {
> + while (order < MAX_ORDER - 1) {
> if (compaction_capture(capc, page, order, migratetype)) {
> __mod_zone_freepage_state(zone, -(1 << order),
> migratetype);
> @@ -1081,6 +1079,27 @@ static inline void __free_one_page(struct page *page,
>
> if (!page_is_buddy(page, buddy, order))
> goto done_merging;
> +
> + if (order >= pageblock_order) {
Would use unlikely() here.
> + /* If we are here, it means order is >= pageblock_order.
Can drop this like as it became quite obvious now :)
> + * We want to prevent merge between freepages on pageblock
> + * without fallbacks and normal pageblock. Without this,
> + * pageblock isolation could cause incorrect freepage or CMA
> + * accounting or HIGHATOMIC accounting.
> + *
> + * We don't want to hit this code for the more frequent
> + * low-order merging.
The last two lines should be also dropped now.
> + */
> + int buddy_mt;
> +
> + buddy_mt = get_pageblock_migratetype(buddy);
Can be now single line.
> +
> + if (migratetype != buddy_mt
> + && (!migratetype_is_mergeable(migratetype) ||
> + !migratetype_is_mergeable(buddy_mt)))
> + goto done_merging;
> + }
> +
> /*
> * Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
> * merge with it and move up one order.
> @@ -1094,32 +1113,6 @@ static inline void __free_one_page(struct page *page,
> pfn = combined_pfn;
> order++;
> }
> - if (order < MAX_ORDER - 1) {
> - /* If we are here, it means order is >= pageblock_order.
> - * We want to prevent merge between freepages on pageblock
> - * without fallbacks and normal pageblock. Without this,
> - * pageblock isolation could cause incorrect freepage or CMA
> - * accounting or HIGHATOMIC accounting.
> - *
> - * We don't want to hit this code for the more frequent
> - * low-order merging.
> - */
> - int buddy_mt;
> -
> - buddy_pfn = __find_buddy_pfn(pfn, order);
> - buddy = page + (buddy_pfn - pfn);
> -
> - if (!page_is_buddy(page, buddy, order))
> - goto done_merging;
> - buddy_mt = get_pageblock_migratetype(buddy);
> -
> - if (migratetype != buddy_mt
> - && (!migratetype_is_mergeable(migratetype) ||
> - !migratetype_is_mergeable(buddy_mt)))
> - goto done_merging;
> - max_order = order + 1;
> - goto continue_merging;
> - }
>
> done_merging:
> set_buddy_order(page, order);
>
> --
> Best Regards,
> Yan, Zi
On 30 Mar 2022, at 18:15, Vlastimil Babka wrote:
> On 3/31/22 00:03, Zi Yan wrote:
>> On 30 Mar 2022, at 17:43, Vlastimil Babka wrote:
>>
>>> On 3/30/22 22:05, Linus Torvalds wrote:
>>>> On Wed, Mar 30, 2022 at 12:42 PM Steven Rostedt <[email protected]> wrote:
>>>>>
>>>>> I started testing new patches and it crashed when doing the x86-32 test on
>>>>> boot up.
>>>>>
>>>>> Initializing HighMem for node 0 (000375fe:0021ee00)
>>>>> BUG: kernel NULL pointer dereference, address: 00000878
>>>>> #PF: supervisor read access in kernel mode
>>>>> #PF: error_code(0x0000) - not-present page
>>>>> *pdpt = 0000000000000000 *pde = f0000000f000eef3
>>>>> Oops: 0000 [#1] PREEMPT SMP PTI
>>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 5.17.0-test+ #469
>>>>> Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
>>>>> EIP: get_pfnblock_flags_mask+0x2c/0x36
>>>>> Code: 6d ea ff 55 89 e5 56 89 ce 53 8b 18 89 d8 c1 eb 1e e8 f7 fb ff ff 69 db c0 02 00 00 89 c1 89 c2 c1 ea 05 8b 83 7c d7 79 c1 5b <8b> 04 90 d3 e8 21 f0 5e 5d c3 55 89 e5 57 56 89 d6 53 89 c3 64 a1
>>>>
>>>> The whole function is in that Code: thing, and it decodes to:
>>>>
>>>> 0: 55 push %ebp
>>>> 1: 89 e5 mov %esp,%ebp
>>>> 3: 56 push %esi
>>>> 4: 89 ce mov %ecx,%esi
>>>> 6: 53 push %ebx
>>>> 7: 8b 18 mov (%eax),%ebx
>>>> 9: 89 d8 mov %ebx,%eax
>>>> b: c1 eb 1e shr $0x1e,%ebx
>>>> e: e8 f7 fb ff ff call 0xfffffc0a
>>>> 13: 69 db c0 02 00 00 imul $0x2c0,%ebx,%ebx
>>>> 19: 89 c1 mov %eax,%ecx
>>>> 1b: 89 c2 mov %eax,%edx
>>>> 1d: c1 ea 05 shr $0x5,%edx
>>>> 20: 8b 83 7c d7 79 c1 mov -0x3e862884(%ebx),%eax
>>>> 26: 5b pop %ebx
>>>> 27:* 8b 04 90 mov (%eax,%edx,4),%eax <-- trapping instruction
>>>> 2a: d3 e8 shr %cl,%eax
>>>> 2c: 21 f0 and %esi,%eax
>>>> 2e: 5e pop %esi
>>>> 2f: 5d pop %ebp
>>>> 30: c3 ret
>>>>
>>>> with '%eax' being NULL, and %edx being 0x21e.
>>>>
>>>> (The call seems to be to 'pfn_to_bitidx().isra.0' if my compiler does
>>>> similar code generation, so it's out-of-lined part of pfn_to_bitidx()
>>>> despite being marked inline)
>>>>
>>>> So that oops is that
>>>>
>>>> word = bitmap[word_bitidx];
>>>>
>>>> line, with 'bitmap' being NULL (and %edx contains 'word_bitidx').
>>>>
>>>> Looking around, your 'config-bad' doesn't even have
>>>> CONFIG_MEMORY_ISOLATION enabled, and so I suspect the culprit is this
>>>> part of the change:
>>>>
>>>> - if (unlikely(has_isolate_pageblock(zone))) {
>>>>
>>>> which used to always be false for that config, and now the code is
>>>> suddenly enabled.
>>>
>>> If CONFIG_MEMORY_ISOLATION was enabled then the zone layout would be the
>>> same, so I think it's not simply that. I think it's the timing -
>>> has_isolate_pageblock(zone) would only be possible to become true later
>>> in runtime when some isolation is ongoing, but here it seems we are
>>> still in the early boot. Probably at a boundary of highmem with another
>>> zone that doesn't have the pageblock bitmap yet initialized? While later
>>> it would have, and all would be fine.
>>
>> It seems so based on the boot log from Steven.
>>
>>>
>>> As Zi Yan said, the usual merging code will, through page_is_buddy()
>>> find safely enough the buddy is not applicable, so I agree with his
>>> patch direction. Seems this also shows the code tried to become too
>>> smart and for the next merge window we should try just move the
>>> migratetype checks into the main while loop (under something like "if
>>> (order >= max_order)") and simplify the function a lot, hopefully with
>>> negligible perf impact.
>>
>> Something like this?
>
> Exactly (with some nits).
Thanks for the comment. Should I send the patch below with nits fix to
supersede the patch I just sent?
>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 6c6af8658775..568ecaf5700d 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1053,7 +1053,6 @@ static inline void __free_one_page(struct page *page,
>> int migratetype, fpi_t fpi_flags)
>> {
>> struct capture_control *capc = task_capc(zone);
>> - unsigned int max_order = pageblock_order;
>> unsigned long buddy_pfn;
>> unsigned long combined_pfn;
>> struct page *buddy;
>> @@ -1069,8 +1068,7 @@ static inline void __free_one_page(struct page *page,
>> VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
>> VM_BUG_ON_PAGE(bad_range(zone, page), page);
>>
>> -continue_merging:
>> - while (order < max_order) {
>> + while (order < MAX_ORDER - 1) {
>> if (compaction_capture(capc, page, order, migratetype)) {
>> __mod_zone_freepage_state(zone, -(1 << order),
>> migratetype);
>> @@ -1081,6 +1079,27 @@ static inline void __free_one_page(struct page *page,
>>
>> if (!page_is_buddy(page, buddy, order))
>> goto done_merging;
>> +
>> + if (order >= pageblock_order) {
>
> Would use unlikely() here.
>
>> + /* If we are here, it means order is >= pageblock_order.
>
> Can drop this like as it became quite obvious now :)
>
>> + * We want to prevent merge between freepages on pageblock
>> + * without fallbacks and normal pageblock. Without this,
>> + * pageblock isolation could cause incorrect freepage or CMA
>> + * accounting or HIGHATOMIC accounting.
>> + *
>> + * We don't want to hit this code for the more frequent
>> + * low-order merging.
>
> The last two lines should be also dropped now.
>
>> + */
>> + int buddy_mt;
>> +
>> + buddy_mt = get_pageblock_migratetype(buddy);
>
> Can be now single line.
>
>> +
>> + if (migratetype != buddy_mt
>> + && (!migratetype_is_mergeable(migratetype) ||
>> + !migratetype_is_mergeable(buddy_mt)))
>> + goto done_merging;
>> + }
>> +
>> /*
>> * Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
>> * merge with it and move up one order.
>> @@ -1094,32 +1113,6 @@ static inline void __free_one_page(struct page *page,
>> pfn = combined_pfn;
>> order++;
>> }
>> - if (order < MAX_ORDER - 1) {
>> - /* If we are here, it means order is >= pageblock_order.
>> - * We want to prevent merge between freepages on pageblock
>> - * without fallbacks and normal pageblock. Without this,
>> - * pageblock isolation could cause incorrect freepage or CMA
>> - * accounting or HIGHATOMIC accounting.
>> - *
>> - * We don't want to hit this code for the more frequent
>> - * low-order merging.
>> - */
>> - int buddy_mt;
>> -
>> - buddy_pfn = __find_buddy_pfn(pfn, order);
>> - buddy = page + (buddy_pfn - pfn);
>> -
>> - if (!page_is_buddy(page, buddy, order))
>> - goto done_merging;
>> - buddy_mt = get_pageblock_migratetype(buddy);
>> -
>> - if (migratetype != buddy_mt
>> - && (!migratetype_is_mergeable(migratetype) ||
>> - !migratetype_is_mergeable(buddy_mt)))
>> - goto done_merging;
>> - max_order = order + 1;
>> - goto continue_merging;
>> - }
>>
>> done_merging:
>> set_buddy_order(page, order);
>>
>> --
>> Best Regards,
>> Yan, Zi
--
Best Regards,
Yan, Zi