2023-09-20 02:39:46

by Zi Yan

Subject: Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene

On 19 Sep 2023, at 20:32, Mike Kravetz wrote:

> On 09/19/23 16:57, Zi Yan wrote:
>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
>>
>>> On 09/19/23 02:49, Johannes Weiner wrote:
>>>> On Mon, Sep 18, 2023 at 10:40:37AM -0700, Mike Kravetz wrote:
>>>>> On 09/18/23 10:52, Johannes Weiner wrote:
>>>>>> On Mon, Sep 18, 2023 at 09:16:58AM +0200, Vlastimil Babka wrote:
>>>>>>> On 9/16/23 21:57, Mike Kravetz wrote:
>>>>>>>> On 09/15/23 10:16, Johannes Weiner wrote:
>>>>>>>>> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
>>>
>>> Sorry for causing the confusion!
>>>
>>> When I originally saw the warnings pop up, I was running the above script
>>> as well as another that only allocated order 9 hugetlb pages:
>>>
>>> while true; do
>>>     echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>>     echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>> done
>>>
>>> The warnings were actually triggered by allocations in this second script.
>>>
>>> However, when reporting the warnings I wanted to include the simplest
>>> way to recreate. And, I noticed that that second script running in
>>> parallel was not required. Again, sorry for the confusion! Here is a
>>> warning triggered via the alloc_contig_range path only running the one
>>> script.
>>>
>>> [ 107.275821] ------------[ cut here ]------------
>>> [ 107.277001] page type is 0, passed migratetype is 1 (nr=512)
>>> [ 107.278379] WARNING: CPU: 1 PID: 886 at mm/page_alloc.c:699 del_page_from_free_list+0x137/0x170
>>> [ 107.280514] Modules linked in: rfkill ip6table_filter ip6_tables sunrpc snd_hda_codec_generic joydev 9p snd_hda_intel netfs snd_intel_dspcfg snd_hda_codec snd_hwdep 9pnet_virtio snd_hda_core snd_seq snd_seq_device 9pnet virtio_balloon snd_pcm snd_timer snd soundcore virtio_net net_failover failover virtio_console virtio_blk crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw virtio_pci virtio virtio_pci_legacy_dev virtio_pci_modern_dev virtio_ring fuse
>>> [ 107.291033] CPU: 1 PID: 886 Comm: bash Not tainted 6.6.0-rc2-next-20230919-dirty #35
>>> [ 107.293000] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
>>> [ 107.295187] RIP: 0010:del_page_from_free_list+0x137/0x170
>>> [ 107.296618] Code: c6 05 20 9b 35 01 01 e8 b7 fb ff ff 44 89 f1 44 89 e2 48 c7 c7 d8 ab 22 82 48 89 c6 b8 01 00 00 00 d3 e0 89 c1 e8 e9 99 df ff <0f> 0b e9 03 ff ff ff 48 c7 c6 10 ac 22 82 48 89 df e8 f3 e0 fc ff
>>> [ 107.301236] RSP: 0018:ffffc90003ba7a70 EFLAGS: 00010086
>>> [ 107.302535] RAX: 0000000000000000 RBX: ffffea0007ff8000 RCX: 0000000000000000
>>> [ 107.304467] RDX: 0000000000000004 RSI: ffffffff8224e9de RDI: 00000000ffffffff
>>> [ 107.306289] RBP: 00000000001ffe00 R08: 0000000000009ffb R09: 00000000ffffdfff
>>> [ 107.308135] R10: 00000000ffffdfff R11: ffffffff824660e0 R12: 0000000000000001
>>> [ 107.309956] R13: ffff88827fffcd80 R14: 0000000000000009 R15: 00000000001ffc00
>>> [ 107.311839] FS: 00007fabb8cba740(0000) GS:ffff888277d00000(0000) knlGS:0000000000000000
>>> [ 107.314695] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 107.316159] CR2: 00007f41ba01acf0 CR3: 0000000282ed4006 CR4: 0000000000370ee0
>>> [ 107.317971] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [ 107.319783] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [ 107.321575] Call Trace:
>>> [ 107.322314] <TASK>
>>> [ 107.323002] ? del_page_from_free_list+0x137/0x170
>>> [ 107.324380] ? __warn+0x7d/0x130
>>> [ 107.325341] ? del_page_from_free_list+0x137/0x170
>>> [ 107.326627] ? report_bug+0x18d/0x1c0
>>> [ 107.327632] ? prb_read_valid+0x17/0x20
>>> [ 107.328711] ? handle_bug+0x41/0x70
>>> [ 107.329685] ? exc_invalid_op+0x13/0x60
>>> [ 107.330787] ? asm_exc_invalid_op+0x16/0x20
>>> [ 107.331937] ? del_page_from_free_list+0x137/0x170
>>> [ 107.333189] __free_one_page+0x2ab/0x6f0
>>> [ 107.334375] free_pcppages_bulk+0x169/0x210
>>> [ 107.335575] drain_pages_zone+0x3f/0x50
>>> [ 107.336691] __drain_all_pages+0xe2/0x1e0
>>> [ 107.337843] alloc_contig_range+0x143/0x280
>>> [ 107.339026] alloc_contig_pages+0x210/0x270
>>> [ 107.340200] alloc_fresh_hugetlb_folio+0xa6/0x270
>>> [ 107.341529] alloc_pool_huge_page+0x7d/0x100
>>> [ 107.342745] set_max_huge_pages+0x162/0x340
>>> [ 107.345059] nr_hugepages_store_common+0x91/0xf0
>>> [ 107.346329] kernfs_fop_write_iter+0x108/0x1f0
>>> [ 107.347547] vfs_write+0x207/0x400
>>> [ 107.348543] ksys_write+0x63/0xe0
>>> [ 107.349511] do_syscall_64+0x37/0x90
>>> [ 107.350543] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
>>> [ 107.351940] RIP: 0033:0x7fabb8daee87
>>> [ 107.352819] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
>>> [ 107.356373] RSP: 002b:00007ffc02737478 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>>> [ 107.358103] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fabb8daee87
>>> [ 107.359695] RDX: 0000000000000002 RSI: 000055fe584a1620 RDI: 0000000000000001
>>> [ 107.361258] RBP: 000055fe584a1620 R08: 000000000000000a R09: 00007fabb8e460c0
>>> [ 107.362842] R10: 00007fabb8e45fc0 R11: 0000000000000246 R12: 0000000000000002
>>> [ 107.364385] R13: 00007fabb8e82520 R14: 0000000000000002 R15: 00007fabb8e82720
>>> [ 107.365968] </TASK>
>>> [ 107.366534] ---[ end trace 0000000000000000 ]---
>>> [ 121.542474] ------------[ cut here ]------------
>>>
>>> Perhaps that is another useful data point: the warning can be
>>> triggered via both allocation paths.
>>>
>>> To be perfectly clear, here is what I did today:
>>> - built next-20230919. It does not contain your series.
>>>   I could not recreate the issue.
>>> - Added your series and the patch to remove
>>>   VM_BUG_ON_PAGE(is_migrate_isolate(mt), page) from free_pcppages_bulk.
>>>   I could recreate the issue while running only the one script.
>>>   The warning above is from that run.
>>> - Added this suggested patch from Zi:
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 1400e674ab86..77a4aea31a7f 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
>>>          end = pageblock_end_pfn(pfn) - 1;
>>>
>>>          /* Do not cross zone boundaries */
>>> +#if 0
>>>          if (!zone_spans_pfn(zone, start))
>>>                  start = zone->zone_start_pfn;
>>> +#else
>>> +        if (!zone_spans_pfn(zone, start))
>>> +                start = pfn;
>>> +#endif
>>>          if (!zone_spans_pfn(zone, end))
>>>                  return false;
>>>
>>> I can still trigger warnings.
>>
>> OK. One thing to note is that the page type in the warning changed from
>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
>>
>
> Just to be really clear,
> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
> path WITHOUT your change.
>
> I am guessing the difference here has more to do with the allocation path?
>
> I went back and reran focusing on the specific migrate type.
> Without your patch, and coming from the alloc_contig_range call path,
> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
> With your patch I got one 'page type is 0, passed migratetype is 1'
> warning and one 'page type is 1, passed migratetype is 0' warning.
>
> I could be wrong, but I do not think your patch changes things.

Got it. Thanks for the clarification.
>
>>>
>>> One idea about recreating the issue is that it may have to do with the
>>> size of my VM (16G) and the requested allocation size (4G). However, I tried
>>> to really stress the allocations by increasing the number of hugetlb
>>> pages requested and that did not help. I also noticed that I only seem
>>> to get two warnings and then they stop, even if I continue to run the
>>> script.
>>>
>>> Zi asked about my config, so it is attached.
>>
>> With your config, I still have no luck reproducing the issue. I will keep
>> trying. Thanks.
>>
>
> Perhaps try running both scripts in parallel?

Yes. It seems to do the trick.

> Adjust the number of hugetlb pages allocated to equal 25% of memory?

I am able to reproduce it with the script below:

while true; do
    echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages &
    echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages &
    wait
    echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
    echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
done

I will look into the issue.

--
Best Regards,
Yan, Zi



2023-09-20 08:47:23

by Vlastimil Babka

Subject: Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene

On 9/20/23 03:38, Zi Yan wrote:
> On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
>
>> On 09/19/23 16:57, Zi Yan wrote:
>>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
>>>
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
>>>>          end = pageblock_end_pfn(pfn) - 1;
>>>>
>>>>          /* Do not cross zone boundaries */
>>>> +#if 0
>>>>          if (!zone_spans_pfn(zone, start))
>>>>                  start = zone->zone_start_pfn;
>>>> +#else
>>>> +        if (!zone_spans_pfn(zone, start))
>>>> +                start = pfn;
>>>> +#endif
>>>>          if (!zone_spans_pfn(zone, end))
>>>>                  return false;
>>>>
>>>> I can still trigger warnings.
>>>
>>> OK. One thing to note is that the page type in the warning changed from
>>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
>>>
>>
>> Just to be really clear,
>> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
>> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
>> path WITHOUT your change.
>>
>> I am guessing the difference here has more to do with the allocation path?
>>
>> I went back and reran focusing on the specific migrate type.
>> Without your patch, and coming from the alloc_contig_range call path,
>> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
>> With your patch I got one 'page type is 0, passed migratetype is 1'
>> warning and one 'page type is 1, passed migratetype is 0' warning.
>>
>> I could be wrong, but I do not think your patch changes things.
>
> Got it. Thanks for the clarification.
>>
>>>>
>>>> One idea about recreating the issue is that it may have to do with size
>>>> of my VM (16G) and the requested allocation sizes 4G. However, I tried
>>>> to really stress the allocations by increasing the number of hugetlb
>>>> pages requested and that did not help. I also noticed that I only seem
>>>> to get two warnings and then they stop, even if I continue to run the
>>>> script.
>>>>
>>>> Zi asked about my config, so it is attached.
>>>
>>> With your config, I still have no luck reproducing the issue. I will keep
>>> trying. Thanks.
>>>
>>
>> Perhaps try running both scripts in parallel?
>
> Yes. It seems to do the trick.
>
>> Adjust the number of hugetlb pages allocated to equal 25% of memory?
>
> I am able to reproduce it with the script below:
>
> while true; do
> echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
> echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
> wait
> echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> done
>
> I will look into the issue.

With migratetypes 0 and 1, and a scenario that is harder to reproduce (= less
deterministic, more racy), it's possible we are now seeing what I suspected
could happen here:
https://lore.kernel.org/all/[email protected]/
namely that there are places reading the migratetype outside of the zone lock.

2023-09-20 17:54:56

by Johannes Weiner

Subject: Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene

On Wed, Sep 20, 2023 at 08:07:53AM +0200, Vlastimil Babka wrote:
> On 9/20/23 03:38, Zi Yan wrote:
> > On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
> >
> >> On 09/19/23 16:57, Zi Yan wrote:
> >>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
> >>>
> >>>> --- a/mm/page_alloc.c
> >>>> +++ b/mm/page_alloc.c
> >>>> @@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
> >>>>          end = pageblock_end_pfn(pfn) - 1;
> >>>>
> >>>>          /* Do not cross zone boundaries */
> >>>> +#if 0
> >>>>          if (!zone_spans_pfn(zone, start))
> >>>>                  start = zone->zone_start_pfn;
> >>>> +#else
> >>>> +        if (!zone_spans_pfn(zone, start))
> >>>> +                start = pfn;
> >>>> +#endif
> >>>>          if (!zone_spans_pfn(zone, end))
> >>>>                  return false;
> >>>>
> >>>> I can still trigger warnings.
> >>>
> >>> OK. One thing to note is that the page type in the warning changed from
> >>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
> >>>
> >>
> >> Just to be really clear,
> >> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
> >> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
> >> path WITHOUT your change.
> >>
> >> I am guessing the difference here has more to do with the allocation path?
> >>
> >> I went back and reran focusing on the specific migrate type.
> >> Without your patch, and coming from the alloc_contig_range call path,
> >> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
> >> With your patch I got one 'page type is 0, passed migratetype is 1'
> >> warning and one 'page type is 1, passed migratetype is 0' warning.
> >>
> >> I could be wrong, but I do not think your patch changes things.
> >
> > Got it. Thanks for the clarification.
> >>
> >>>>
> >>>> One idea about recreating the issue is that it may have to do with size
> >>>> of my VM (16G) and the requested allocation sizes 4G. However, I tried
> >>>> to really stress the allocations by increasing the number of hugetlb
> >>>> pages requested and that did not help. I also noticed that I only seem
> >>>> to get two warnings and then they stop, even if I continue to run the
> >>>> script.
> >>>>
> >>>> Zi asked about my config, so it is attached.
> >>>
> >>> With your config, I still have no luck reproducing the issue. I will keep
> >>> trying. Thanks.
> >>>
> >>
> >> Perhaps try running both scripts in parallel?
> >
> > Yes. It seems to do the trick.
> >
> >> Adjust the number of hugetlb pages allocated to equal 25% of memory?
> >
> > I am able to reproduce it with the script below:
> >
> > while true; do
> > echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
> > echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
> > wait
> > echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> > echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> > done
> >
> > I will look into the issue.

Nice!

I managed to reproduce it ONCE, triggering it not even a second after
starting the script. But I can't seem to do it twice, even after
several reboots and letting it run for minutes.

> With migratetypes 0 and 1 and somewhat harder to reproduce scenario (= less
> deterministic, more racy) it's possible we now see what I suspected can
> happen here:
> https://lore.kernel.org/all/[email protected]/
> In that there are places reading the migratetype outside of zone lock.

Good point!

I had already written up a fix for this issue. Still trying to get the
reproducer to work, but attaching the fix below in case somebody with
a working environment beats me to it.

---

From 94f67bfa29a602a66014d079431b224cacbf79e9 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <[email protected]>
Date: Fri, 15 Sep 2023 16:23:38 -0400
Subject: [PATCH] mm: page_alloc: close migratetype race between freeing and
stealing

There are several freeing paths that read the page's migratetype
optimistically before grabbing the zone lock. When this races with
block stealing, those pages go on the wrong freelist.

The paths in question are:
- when freeing >costly orders that aren't THP
- when freeing pages to the buddy upon pcp lock contention
- when freeing pages that are isolated
- when freeing pages initially during boot
- when freeing the remainder in alloc_pages_exact()
- when "accepting" unaccepted VM host memory before first use
- when freeing pages during unpoisoning

None of these are so hot that they would need this optimization at the
cost of hampering defrag efforts. Especially when contrasted with the
fact that the most common buddy freeing path - free_pcppages_bulk - is
checking the migratetype under the zone->lock just fine.

In addition, isolated pages need to look up the migratetype under the
lock anyway, which adds branches to the locked section, and results in
a double lookup when the pages are in fact isolated.

Move the lookups into the lock.

Reported-by: Vlastimil Babka <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/page_alloc.c | 47 +++++++++++++++++------------------------------
1 file changed, 17 insertions(+), 30 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0ca999d24a00..d902a8aaa3fd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1222,18 +1222,15 @@ static void free_pcppages_bulk(struct zone *zone, int count,
         spin_unlock_irqrestore(&zone->lock, flags);
 }
 
-static void free_one_page(struct zone *zone,
-                          struct page *page, unsigned long pfn,
-                          unsigned int order,
-                          int migratetype, fpi_t fpi_flags)
+static void free_one_page(struct zone *zone, struct page *page,
+                          unsigned long pfn, unsigned int order,
+                          fpi_t fpi_flags)
 {
         unsigned long flags;
+        int migratetype;
 
         spin_lock_irqsave(&zone->lock, flags);
-        if (unlikely(has_isolate_pageblock(zone) ||
-                     is_migrate_isolate(migratetype))) {
-                migratetype = get_pfnblock_migratetype(page, pfn);
-        }
+        migratetype = get_pfnblock_migratetype(page, pfn);
         __free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
         spin_unlock_irqrestore(&zone->lock, flags);
 }
@@ -1249,18 +1246,8 @@ static void __free_pages_ok(struct page *page, unsigned int order,
         if (!free_pages_prepare(page, order, fpi_flags))
                 return;
 
-        /*
-         * Calling get_pfnblock_migratetype() without spin_lock_irqsave() here
-         * is used to avoid calling get_pfnblock_migratetype() under the lock.
-         * This will reduce the lock holding time.
-         */
-        migratetype = get_pfnblock_migratetype(page, pfn);
-
         spin_lock_irqsave(&zone->lock, flags);
-        if (unlikely(has_isolate_pageblock(zone) ||
-                     is_migrate_isolate(migratetype))) {
-                migratetype = get_pfnblock_migratetype(page, pfn);
-        }
+        migratetype = get_pfnblock_migratetype(page, pfn);
         __free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
         spin_unlock_irqrestore(&zone->lock, flags);
 
@@ -2404,7 +2391,7 @@ void free_unref_page(struct page *page, unsigned int order)
         struct per_cpu_pages *pcp;
         struct zone *zone;
         unsigned long pfn = page_to_pfn(page);
-        int migratetype, pcpmigratetype;
+        int migratetype;
 
         if (!free_pages_prepare(page, order, FPI_NONE))
                 return;
@@ -2416,23 +2403,23 @@ void free_unref_page(struct page *page, unsigned int order)
          * get those areas back if necessary. Otherwise, we may have to free
          * excessively into the page allocator
          */
-        migratetype = pcpmigratetype = get_pfnblock_migratetype(page, pfn);
+        migratetype = get_pfnblock_migratetype(page, pfn);
         if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
                 if (unlikely(is_migrate_isolate(migratetype))) {
-                        free_one_page(page_zone(page), page, pfn, order, migratetype, FPI_NONE);
+                        free_one_page(page_zone(page), page, pfn, order, FPI_NONE);
                         return;
                 }
-                pcpmigratetype = MIGRATE_MOVABLE;
+                migratetype = MIGRATE_MOVABLE;
         }
 
         zone = page_zone(page);
         pcp_trylock_prepare(UP_flags);
         pcp = pcp_spin_trylock(zone->per_cpu_pageset);
         if (pcp) {
-                free_unref_page_commit(zone, pcp, page, pcpmigratetype, order);
+                free_unref_page_commit(zone, pcp, page, migratetype, order);
                 pcp_spin_unlock(pcp);
         } else {
-                free_one_page(zone, page, pfn, order, migratetype, FPI_NONE);
+                free_one_page(zone, page, pfn, order, FPI_NONE);
         }
         pcp_trylock_finish(UP_flags);
 }
@@ -2465,7 +2452,7 @@ void free_unref_page_list(struct list_head *list)
                 migratetype = get_pfnblock_migratetype(page, pfn);
                 if (unlikely(is_migrate_isolate(migratetype))) {
                         list_del(&page->lru);
-                        free_one_page(page_zone(page), page, pfn, 0, migratetype, FPI_NONE);
+                        free_one_page(page_zone(page), page, pfn, 0, FPI_NONE);
                         continue;
                 }
         }
@@ -2498,8 +2485,7 @@ void free_unref_page_list(struct list_head *list)
                         pcp = pcp_spin_trylock(zone->per_cpu_pageset);
                         if (unlikely(!pcp)) {
                                 pcp_trylock_finish(UP_flags);
-                                free_one_page(zone, page, pfn,
-                                              0, migratetype, FPI_NONE);
+                                free_one_page(zone, page, pfn, 0, FPI_NONE);
                                 locked_zone = NULL;
                                 continue;
                         }
@@ -6537,13 +6523,14 @@ bool take_page_off_buddy(struct page *page)
 bool put_page_back_buddy(struct page *page)
 {
         struct zone *zone = page_zone(page);
-        unsigned long pfn = page_to_pfn(page);
         unsigned long flags;
-        int migratetype = get_pfnblock_migratetype(page, pfn);
         bool ret = false;
 
         spin_lock_irqsave(&zone->lock, flags);
         if (put_page_testzero(page)) {
+                unsigned long pfn = page_to_pfn(page);
+                int migratetype = get_pfnblock_migratetype(page, pfn);
+
                 ClearPageHWPoisonTakenOff(page);
                 __free_one_page(page, pfn, zone, 0, migratetype, FPI_NONE);
                 if (TestClearPageHWPoison(page)) {
--
2.42.0

2023-09-21 06:39:53

by Johannes Weiner

Subject: Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene

On Wed, Sep 20, 2023 at 09:48:12AM -0400, Johannes Weiner wrote:
> On Wed, Sep 20, 2023 at 08:07:53AM +0200, Vlastimil Babka wrote:
> > On 9/20/23 03:38, Zi Yan wrote:
> > > On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
> > >
> > >> On 09/19/23 16:57, Zi Yan wrote:
> > >>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
> > >>>
> > >>>> --- a/mm/page_alloc.c
> > >>>> +++ b/mm/page_alloc.c
> > >>>> @@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
> > >>>>          end = pageblock_end_pfn(pfn) - 1;
> > >>>>
> > >>>>          /* Do not cross zone boundaries */
> > >>>> +#if 0
> > >>>>          if (!zone_spans_pfn(zone, start))
> > >>>>                  start = zone->zone_start_pfn;
> > >>>> +#else
> > >>>> +        if (!zone_spans_pfn(zone, start))
> > >>>> +                start = pfn;
> > >>>> +#endif
> > >>>>          if (!zone_spans_pfn(zone, end))
> > >>>>                  return false;
> > >>>>
> > >>>> I can still trigger warnings.
> > >>>
> > >>> OK. One thing to note is that the page type in the warning changed from
> > >>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
> > >>>
> > >>
> > >> Just to be really clear,
> > >> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
> > >> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
> > >> path WITHOUT your change.
> > >>
> > >> I am guessing the difference here has more to do with the allocation path?
> > >>
> > >> I went back and reran focusing on the specific migrate type.
> > >> Without your patch, and coming from the alloc_contig_range call path,
> > >> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
> > >> With your patch I got one 'page type is 0, passed migratetype is 1'
> > >> warning and one 'page type is 1, passed migratetype is 0' warning.
> > >>
> > >> I could be wrong, but I do not think your patch changes things.
> > >
> > > Got it. Thanks for the clarification.
> > >>
> > >>>>
> > >>>> One idea about recreating the issue is that it may have to do with size
> > >>>> of my VM (16G) and the requested allocation sizes 4G. However, I tried
> > >>>> to really stress the allocations by increasing the number of hugetlb
> > >>>> pages requested and that did not help. I also noticed that I only seem
> > >>>> to get two warnings and then they stop, even if I continue to run the
> > >>>> script.
> > >>>>
> > >>>> Zi asked about my config, so it is attached.
> > >>>
> > >>> With your config, I still have no luck reproducing the issue. I will keep
> > >>> trying. Thanks.
> > >>>
> > >>
> > >> Perhaps try running both scripts in parallel?
> > >
> > > Yes. It seems to do the trick.
> > >
> > >> Adjust the number of hugetlb pages allocated to equal 25% of memory?
> > >
> > > I am able to reproduce it with the script below:
> > >
> > > while true; do
> > > echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
> > > echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
> > > wait
> > > echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> > > echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> > > done
> > >
> > > I will look into the issue.
>
> Nice!
>
> I managed to reproduce it ONCE, triggering it not even a second after
> starting the script. But I can't seem to do it twice, even after
> several reboots and letting it run for minutes.

I managed to reproduce it reliably by cutting the nr_hugepages
parameters respectively in half.

The one that triggers for me is always MIGRATE_ISOLATE. With some
printk-tracing, the scenario seems to be this:

#0                                        #1
start_isolate_page_range()
  isolate_single_pageblock()
    set_migratetype_isolate(tail)
      lock zone->lock
      move_freepages_block(tail) // nop
      set_pageblock_migratetype(tail)
      unlock zone->lock
                                          del_page_from_freelist(head)
                                          expand(head, head_mt)
                                          WARN(head_mt != tail_mt)
    start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES)
    for (pfn = start_pfn, pfn < end_pfn)
      if (PageBuddy())
        split_free_page(head)

IOW, we update a pageblock that isn't MAX_ORDER aligned, then drop the
lock. The move_freepages_block() does nothing because the PageBuddy()
is set on the pageblock to the left. Once we drop the lock, the buddy
gets allocated and the expand() puts things on the wrong list. The
splitting code that handles MAX_ORDER blocks runs *after* the tail
type is set and the lock has been dropped, so it's too late.

I think this would work fine if we always set MIGRATE_ISOLATE in a
linear fashion, with start and end aligned to MAX_ORDER. Then we also
wouldn't have to split things.

There are two reasons this doesn't happen today:

1. The isolation range is rounded to pageblocks, not MAX_ORDER. In
this test case they always seem aligned, but it's not
guaranteed. However,

2. start_isolate_page_range() explicitly breaks ordering by doing the
last block in the range before the center. It's that last block
that triggers the race with __rmqueue_smallest -> expand() for me.

With the below patch I can no longer reproduce the issue:

---

diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index b5c7a9d21257..b7c8730bf0e2 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -538,8 +538,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
         unsigned long pfn;
         struct page *page;
         /* isolation is done at page block granularity */
-        unsigned long isolate_start = pageblock_start_pfn(start_pfn);
-        unsigned long isolate_end = pageblock_align(end_pfn);
+        unsigned long isolate_start = ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES);
+        unsigned long isolate_end = ALIGN(end_pfn, MAX_ORDER_NR_PAGES);
         int ret;
         bool skip_isolation = false;
 
@@ -549,17 +549,6 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
         if (ret)
                 return ret;
 
-        if (isolate_start == isolate_end - pageblock_nr_pages)
-                skip_isolation = true;
-
-        /* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
-        ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true,
-                                       skip_isolation, migratetype);
-        if (ret) {
-                unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
-                return ret;
-        }
-
         /* skip isolated pageblocks at the beginning and end */
         for (pfn = isolate_start + pageblock_nr_pages;
              pfn < isolate_end - pageblock_nr_pages;
@@ -568,12 +557,21 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
                 if (page && set_migratetype_isolate(page, migratetype, flags,
                                         start_pfn, end_pfn)) {
                         undo_isolate_page_range(isolate_start, pfn, migratetype);
-                        unset_migratetype_isolate(
-                                pfn_to_page(isolate_end - pageblock_nr_pages),
-                                migratetype);
                         return -EBUSY;
                 }
         }
+
+        if (isolate_start == isolate_end - pageblock_nr_pages)
+                skip_isolation = true;
+
+        /* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
+        ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true,
+                                       skip_isolation, migratetype);
+        if (ret) {
+                undo_isolate_page_range(isolate_start, pfn, migratetype);
+                return ret;
+        }
+
         return 0;
 }
 
@@ -591,8 +589,8 @@ void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 {
         unsigned long pfn;
         struct page *page;
-        unsigned long isolate_start = pageblock_start_pfn(start_pfn);
-        unsigned long isolate_end = pageblock_align(end_pfn);
+        unsigned long isolate_start = ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES);
+        unsigned long isolate_end = ALIGN(end_pfn, MAX_ORDER_NR_PAGES);
 
         for (pfn = isolate_start;
              pfn < isolate_end;