2023-09-15 03:10:42

by Mike Kravetz

Subject: Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene

In next-20230913, I started hitting the following BUG. Seems related
to this series. And, if series is reverted I do not see the BUG.

I can easily reproduce on a small 16G VM. kernel command line contains
"hugetlb_free_vmemmap=on hugetlb_cma=4G". Then run the script,
while true; do
echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
done

For the BUG below I believe it was the first (or second) 1G page creation from
CMA that triggered: cma_alloc of 1G.

Sorry, have not looked deeper into the issue.

[ 28.643019] page:ffffea0004fb4280 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x13ed0a
[ 28.645455] flags: 0x200000000000000(node=0|zone=2)
[ 28.646835] page_type: 0xffffffff()
[ 28.647886] raw: 0200000000000000 dead000000000100 dead000000000122 0000000000000000
[ 28.651170] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[ 28.653124] page dumped because: VM_BUG_ON_PAGE(is_migrate_isolate(mt))
[ 28.654769] ------------[ cut here ]------------
[ 28.655972] kernel BUG at mm/page_alloc.c:1231!
[ 28.657139] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[ 28.658354] CPU: 2 PID: 885 Comm: bash Not tainted 6.6.0-rc1-next-20230913+ #3
[ 28.660090] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
[ 28.662054] RIP: 0010:free_pcppages_bulk+0x192/0x240
[ 28.663284] Code: 22 48 89 45 08 8b 44 24 0c 41 29 44 24 04 41 29 c6 41 83 f8 05 0f 85 4c ff ff ff 48 c7 c6 20 a5 22 82 48 89 df e8 4e cf fc ff <0f> 0b 65 8b 05 41 8b d3 7e 89 c0 48 0f a3 05 fb 35 39 01 0f 83 40
[ 28.667422] RSP: 0018:ffffc90003b9faf0 EFLAGS: 00010046
[ 28.668643] RAX: 000000000000003b RBX: ffffea0004fb4280 RCX: 0000000000000000
[ 28.670245] RDX: 0000000000000000 RSI: ffffffff8224dace RDI: 00000000ffffffff
[ 28.671920] RBP: ffffea0004fb4288 R08: 0000000000009ffb R09: 00000000ffffdfff
[ 28.673614] R10: 00000000ffffdfff R11: ffffffff824660c0 R12: ffff888477c30540
[ 28.675213] R13: ffff888477c30550 R14: 00000000000012f5 R15: 000000000013ed0a
[ 28.676832] FS: 00007f60039b9740(0000) GS:ffff888477c00000(0000) knlGS:0000000000000000
[ 28.678709] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 28.680046] CR2: 00005615f9bf3048 CR3: 00000003128b6005 CR4: 0000000000370ee0
[ 28.682897] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 28.684501] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 28.686098] Call Trace:
[ 28.686792] <TASK>
[ 28.687414] ? die+0x32/0x80
[ 28.688197] ? do_trap+0xd6/0x100
[ 28.689069] ? free_pcppages_bulk+0x192/0x240
[ 28.690135] ? do_error_trap+0x6a/0x90
[ 28.691082] ? free_pcppages_bulk+0x192/0x240
[ 28.692187] ? exc_invalid_op+0x49/0x60
[ 28.693154] ? free_pcppages_bulk+0x192/0x240
[ 28.694225] ? asm_exc_invalid_op+0x16/0x20
[ 28.695291] ? free_pcppages_bulk+0x192/0x240
[ 28.696405] drain_pages_zone+0x3f/0x50
[ 28.697404] __drain_all_pages+0xe2/0x1e0
[ 28.698472] alloc_contig_range+0x143/0x280
[ 28.699581] ? bitmap_find_next_zero_area_off+0x3d/0x90
[ 28.700902] cma_alloc+0x156/0x470
[ 28.701852] ? kernfs_fop_write_iter+0x160/0x1f0
[ 28.703053] alloc_fresh_hugetlb_folio+0x7e/0x270
[ 28.704272] alloc_pool_huge_page+0x7d/0x100
[ 28.705448] set_max_huge_pages+0x162/0x390
[ 28.706530] nr_hugepages_store_common+0x91/0xf0
[ 28.707689] kernfs_fop_write_iter+0x108/0x1f0
[ 28.708819] vfs_write+0x207/0x400
[ 28.709743] ksys_write+0x63/0xe0
[ 28.710640] do_syscall_64+0x37/0x90
[ 28.712649] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 28.713919] RIP: 0033:0x7f6003aade87
[ 28.714879] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 28.719096] RSP: 002b:00007ffdfd9d2e98 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 28.720945] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f6003aade87
[ 28.722626] RDX: 0000000000000002 RSI: 00005615f9bac620 RDI: 0000000000000001
[ 28.724288] RBP: 00005615f9bac620 R08: 000000000000000a R09: 00007f6003b450c0
[ 28.725939] R10: 00007f6003b44fc0 R11: 0000000000000246 R12: 0000000000000002
[ 28.727611] R13: 00007f6003b81520 R14: 0000000000000002 R15: 00007f6003b81720
[ 28.729285] </TASK>
[ 28.729944] Modules linked in: rfkill ip6table_filter ip6_tables sunrpc snd_hda_codec_generic snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_seq 9p snd_seq_device netfs joydev snd_pcm snd_timer 9pnet_virtio snd soundcore virtio_balloon 9pnet virtio_console virtio_net virtio_blk net_failover failover crct10dif_pclmul crc32_pclmul crc32c_intel virtio_pci ghash_clmulni_intel serio_raw virtio virtio_pci_legacy_dev virtio_pci_modern_dev virtio_ring fuse
[ 28.739325] ---[ end trace 0000000000000000 ]---

--
Mike Kravetz


2023-09-15 17:49:40

by Johannes Weiner

Subject: Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene

On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
> In next-20230913, I started hitting the following BUG. Seems related
> to this series. And, if series is reverted I do not see the BUG.
>
> I can easily reproduce on a small 16G VM. kernel command line contains
> "hugetlb_free_vmemmap=on hugetlb_cma=4G". Then run the script,
> while true; do
> echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
> echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> done
>
> For the BUG below I believe it was the first (or second) 1G page creation from
> CMA that triggered: cma_alloc of 1G.
>
> Sorry, have not looked deeper into the issue.

Thanks for the report, and sorry about the breakage!

I was scratching my head at this:

	/* MIGRATE_ISOLATE page should not go to pcplists */
	VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);

because there is nothing in page isolation that prevents setting
MIGRATE_ISOLATE on something that's on the pcplist already. So why
didn't this trigger before already?

Then it clicked: it used to only check the *pcpmigratetype* determined
by free_unref_page(), which of course mustn't be MIGRATE_ISOLATE.

Pages that get isolated while *already* on the pcplist are fine, and
are handled properly:

	mt = get_pcppage_migratetype(page);

	/* MIGRATE_ISOLATE page should not go to pcplists */
	VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);

	/* Pageblock could have been isolated meanwhile */
	if (unlikely(isolated_pageblocks))
		mt = get_pageblock_migratetype(page);

So this was purely a sanity check against the pcpmigratetype cache
operations. With that gone, we can remove it.
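
For the record, the reason the *cached* type could never be
MIGRATE_ISOLATE is that free_unref_page() filtered isolated pages out
before anything was cached or handed to the pcplists. Roughly, as a
paraphrased sketch from memory (variable names mine, not verbatim from
the tree):

	mt = get_pcppage_migratetype(page);	/* cached at free time */
	if (unlikely(mt >= MIGRATE_PCPTYPES)) {
		if (is_migrate_isolate(mt)) {
			/* isolated pages bypass the pcplists entirely */
			free_one_page(page_zone(page), page, pfn, order,
				      mt, FPI_NONE);
			return;
		}
		/* CMA/HIGHATOMIC are handled as movable on the pcplists */
		mt = MIGRATE_MOVABLE;
	}
	free_unref_page_commit(zone, pcp, page, mt, order);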

---

From b0cb92ed10b40fab0921002effa8b726df245790 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <[email protected]>
Date: Fri, 15 Sep 2023 09:59:52 -0400
Subject: [PATCH] mm: page_alloc: remove pcppage migratetype caching fix

Mike reports the following crash in -next:

[ 28.643019] page:ffffea0004fb4280 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x13ed0a
[ 28.645455] flags: 0x200000000000000(node=0|zone=2)
[ 28.646835] page_type: 0xffffffff()
[ 28.647886] raw: 0200000000000000 dead000000000100 dead000000000122 0000000000000000
[ 28.651170] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[ 28.653124] page dumped because: VM_BUG_ON_PAGE(is_migrate_isolate(mt))
[ 28.654769] ------------[ cut here ]------------
[ 28.655972] kernel BUG at mm/page_alloc.c:1231!

This VM_BUG_ON() used to check that the cached pcppage_migratetype set
by free_unref_page() wasn't MIGRATE_ISOLATE.

When I removed the caching, I erroneously changed the assert to check
that no isolated pages are on the pcplist. This is quite different,
because pages can be isolated *after* they had been put on the
freelist already (which is handled just fine).

IOW, this was purely a sanity check on the migratetype caching. With
that gone, the check should have been removed as well. Do that now.

Reported-by: Mike Kravetz <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/page_alloc.c | 3 ---
1 file changed, 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e3f1c777feed..9469e4660b53 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1207,9 +1207,6 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			count -= nr_pages;
 			pcp->count -= nr_pages;
 
-			/* MIGRATE_ISOLATE page should not go to pcplists */
-			VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
-
 			__free_one_page(page, pfn, zone, order, mt, FPI_NONE);
 			trace_mm_page_pcpu_drain(page, order, mt);
 		} while (count > 0 && !list_empty(list));
--
2.42.0

2023-09-16 03:12:52

by Mike Kravetz

Subject: Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene

On 09/15/23 10:16, Johannes Weiner wrote:
> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
> > In next-20230913, I started hitting the following BUG. Seems related
> > to this series. And, if series is reverted I do not see the BUG.
> >
> > I can easily reproduce on a small 16G VM. kernel command line contains
> > "hugetlb_free_vmemmap=on hugetlb_cma=4G". Then run the script,
> > while true; do
> > echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> > echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
> > echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> > done
> >
> > For the BUG below I believe it was the first (or second) 1G page creation from
> > CMA that triggered: cma_alloc of 1G.
> >
> > Sorry, have not looked deeper into the issue.
>
> Thanks for the report, and sorry about the breakage!
>
> I was scratching my head at this:
>
> /* MIGRATE_ISOLATE page should not go to pcplists */
> VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
>
> because there is nothing in page isolation that prevents setting
> MIGRATE_ISOLATE on something that's on the pcplist already. So why
> didn't this trigger before already?
>
> Then it clicked: it used to only check the *pcpmigratetype* determined
> by free_unref_page(), which of course mustn't be MIGRATE_ISOLATE.
>
> Pages that get isolated while *already* on the pcplist are fine, and
> are handled properly:
>
> mt = get_pcppage_migratetype(page);
>
> /* MIGRATE_ISOLATE page should not go to pcplists */
> VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
>
> /* Pageblock could have been isolated meanwhile */
> if (unlikely(isolated_pageblocks))
> mt = get_pageblock_migratetype(page);
>
> So this was purely a sanity check against the pcpmigratetype cache
> operations. With that gone, we can remove it.

Thanks! That makes sense.

Glad my testing (for something else) triggered it.
--
Mike Kravetz

2023-09-16 22:52:31

by Mike Kravetz

Subject: Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene

On 09/15/23 10:16, Johannes Weiner wrote:
> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
> > In next-20230913, I started hitting the following BUG. Seems related
> > to this series. And, if series is reverted I do not see the BUG.
> >
> > I can easily reproduce on a small 16G VM. kernel command line contains
> > "hugetlb_free_vmemmap=on hugetlb_cma=4G". Then run the script,
> > while true; do
> > echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> > echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
> > echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> > done
> >
> > For the BUG below I believe it was the first (or second) 1G page creation from
> > CMA that triggered: cma_alloc of 1G.
> >
> > Sorry, have not looked deeper into the issue.
>
> Thanks for the report, and sorry about the breakage!
>
> I was scratching my head at this:
>
> /* MIGRATE_ISOLATE page should not go to pcplists */
> VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
>
> because there is nothing in page isolation that prevents setting
> MIGRATE_ISOLATE on something that's on the pcplist already. So why
> didn't this trigger before already?
>
> Then it clicked: it used to only check the *pcpmigratetype* determined
> by free_unref_page(), which of course mustn't be MIGRATE_ISOLATE.
>
> Pages that get isolated while *already* on the pcplist are fine, and
> are handled properly:
>
> mt = get_pcppage_migratetype(page);
>
> /* MIGRATE_ISOLATE page should not go to pcplists */
> VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
>
> /* Pageblock could have been isolated meanwhile */
> if (unlikely(isolated_pageblocks))
> mt = get_pageblock_migratetype(page);
>
> So this was purely a sanity check against the pcpmigratetype cache
> operations. With that gone, we can remove it.

With the patch below applied, a slightly different workload triggers the
following warnings. It seems related, and appears to go away when
reverting the series.

[ 331.595382] ------------[ cut here ]------------
[ 331.596665] page type is 5, passed migratetype is 1 (nr=512)
[ 331.598121] WARNING: CPU: 2 PID: 935 at mm/page_alloc.c:662 expand+0x1c9/0x200
[ 331.600549] Modules linked in: rfkill ip6table_filter ip6_tables sunrpc snd_hda_codec_generic snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_seq 9p snd_seq_device netfs 9pnet_virtio snd_pcm joydev snd_timer virtio_balloon snd soundcore 9pnet virtio_blk virtio_console virtio_net net_failover failover crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw virtio_pci virtio virtio_pci_legacy_dev virtio_pci_modern_dev virtio_ring fuse
[ 331.609530] CPU: 2 PID: 935 Comm: bash Tainted: G W 6.6.0-rc1-next-20230913+ #26
[ 331.611603] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
[ 331.613527] RIP: 0010:expand+0x1c9/0x200
[ 331.614492] Code: 89 ef be 07 00 00 00 c6 05 c9 b1 35 01 01 e8 de f7 ff ff 8b 4c 24 30 8b 54 24 0c 48 c7 c7 68 9f 22 82 48 89 c6 e8 97 b3 df ff <0f> 0b e9 db fe ff ff 48 c7 c6 f8 9f 22 82 48 89 df e8 41 e3 fc ff
[ 331.618540] RSP: 0018:ffffc90003c97a88 EFLAGS: 00010086
[ 331.619801] RAX: 0000000000000000 RBX: ffffea0007ff8000 RCX: 0000000000000000
[ 331.621331] RDX: 0000000000000005 RSI: ffffffff8224dce6 RDI: 00000000ffffffff
[ 331.622914] RBP: 00000000001ffe00 R08: 0000000000009ffb R09: 00000000ffffdfff
[ 331.624712] R10: 00000000ffffdfff R11: ffffffff824660c0 R12: ffff88827fffcd80
[ 331.626317] R13: 0000000000000009 R14: 0000000000000200 R15: 000000000000000a
[ 331.627810] FS: 00007f24b3932740(0000) GS:ffff888477c00000(0000) knlGS:0000000000000000
[ 331.630593] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 331.631865] CR2: 0000560a53875018 CR3: 000000017eee8003 CR4: 0000000000370ee0
[ 331.633382] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 331.634873] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 331.636324] Call Trace:
[ 331.636934] <TASK>
[ 331.637521] ? expand+0x1c9/0x200
[ 331.638320] ? __warn+0x7d/0x130
[ 331.639116] ? expand+0x1c9/0x200
[ 331.639957] ? report_bug+0x18d/0x1c0
[ 331.640832] ? handle_bug+0x41/0x70
[ 331.641635] ? exc_invalid_op+0x13/0x60
[ 331.642522] ? asm_exc_invalid_op+0x16/0x20
[ 331.643494] ? expand+0x1c9/0x200
[ 331.644264] ? expand+0x1c9/0x200
[ 331.645007] rmqueue_bulk+0xf4/0x530
[ 331.645847] get_page_from_freelist+0x3ed/0x1040
[ 331.646837] ? prepare_alloc_pages.constprop.0+0x197/0x1b0
[ 331.647977] __alloc_pages+0xec/0x240
[ 331.648783] alloc_buddy_hugetlb_folio.isra.0+0x6a/0x150
[ 331.649912] __alloc_fresh_hugetlb_folio+0x157/0x230
[ 331.650938] alloc_pool_huge_folio+0xad/0x110
[ 331.651909] set_max_huge_pages+0x17d/0x390
[ 331.652760] nr_hugepages_store_common+0x91/0xf0
[ 331.653825] kernfs_fop_write_iter+0x108/0x1f0
[ 331.654986] vfs_write+0x207/0x400
[ 331.655925] ksys_write+0x63/0xe0
[ 331.656832] do_syscall_64+0x37/0x90
[ 331.657793] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 331.660398] RIP: 0033:0x7f24b3a26e87
[ 331.661342] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 331.665673] RSP: 002b:00007ffccd603de8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 331.667541] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f24b3a26e87
[ 331.669197] RDX: 0000000000000005 RSI: 0000560a5381bb50 RDI: 0000000000000001
[ 331.670883] RBP: 0000560a5381bb50 R08: 000000000000000a R09: 00007f24b3abe0c0
[ 331.672536] R10: 00007f24b3abdfc0 R11: 0000000000000246 R12: 0000000000000005
[ 331.674175] R13: 00007f24b3afa520 R14: 0000000000000005 R15: 00007f24b3afa720
[ 331.675841] </TASK>
[ 331.676450] ---[ end trace 0000000000000000 ]---
[ 331.677659] ------------[ cut here ]------------


[ 331.677659] ------------[ cut here ]------------
[ 331.679109] page type is 5, passed migratetype is 1 (nr=512)
[ 331.680376] WARNING: CPU: 2 PID: 935 at mm/page_alloc.c:699 del_page_from_free_list+0x137/0x170
[ 331.682314] Modules linked in: rfkill ip6table_filter ip6_tables sunrpc snd_hda_codec_generic snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_seq 9p snd_seq_device netfs 9pnet_virtio snd_pcm joydev snd_timer virtio_balloon snd soundcore 9pnet virtio_blk virtio_console virtio_net net_failover failover crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw virtio_pci virtio virtio_pci_legacy_dev virtio_pci_modern_dev virtio_ring fuse
[ 331.691852] CPU: 2 PID: 935 Comm: bash Tainted: G W 6.6.0-rc1-next-20230913+ #26
[ 331.694026] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
[ 331.696162] RIP: 0010:del_page_from_free_list+0x137/0x170
[ 331.697589] Code: c6 05 a0 b5 35 01 01 e8 b7 fb ff ff 44 89 f1 44 89 e2 48 c7 c7 68 9f 22 82 48 89 c6 b8 01 00 00 00 d3 e0 89 c1 e8 69 b7 df ff <0f> 0b e9 03 ff ff ff 48 c7 c6 a0 9f 22 82 48 89 df e8 13 e7 fc ff
[ 331.702060] RSP: 0018:ffffc90003c97ac8 EFLAGS: 00010086
[ 331.703430] RAX: 0000000000000000 RBX: ffffea0007ff8000 RCX: 0000000000000000
[ 331.705284] RDX: 0000000000000005 RSI: ffffffff8224dce6 RDI: 00000000ffffffff
[ 331.707101] RBP: 00000000001ffe00 R08: 0000000000009ffb R09: 00000000ffffdfff
[ 331.708933] R10: 00000000ffffdfff R11: ffffffff824660c0 R12: 0000000000000001
[ 331.710754] R13: ffff88827fffcd80 R14: 0000000000000009 R15: 0000000000000009
[ 331.712637] FS: 00007f24b3932740(0000) GS:ffff888477c00000(0000) knlGS:0000000000000000
[ 331.714861] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 331.716466] CR2: 0000560a53875018 CR3: 000000017eee8003 CR4: 0000000000370ee0
[ 331.718441] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 331.720372] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 331.723583] Call Trace:
[ 331.724351] <TASK>
[ 331.725045] ? del_page_from_free_list+0x137/0x170
[ 331.726370] ? __warn+0x7d/0x130
[ 331.727326] ? del_page_from_free_list+0x137/0x170
[ 331.728637] ? report_bug+0x18d/0x1c0
[ 331.729688] ? handle_bug+0x41/0x70
[ 331.730707] ? exc_invalid_op+0x13/0x60
[ 331.731798] ? asm_exc_invalid_op+0x16/0x20
[ 331.733007] ? del_page_from_free_list+0x137/0x170
[ 331.734317] ? del_page_from_free_list+0x137/0x170
[ 331.735649] rmqueue_bulk+0xdf/0x530
[ 331.736741] get_page_from_freelist+0x3ed/0x1040
[ 331.738069] ? prepare_alloc_pages.constprop.0+0x197/0x1b0
[ 331.739578] __alloc_pages+0xec/0x240
[ 331.740666] alloc_buddy_hugetlb_folio.isra.0+0x6a/0x150
[ 331.742135] __alloc_fresh_hugetlb_folio+0x157/0x230
[ 331.743521] alloc_pool_huge_folio+0xad/0x110
[ 331.744768] set_max_huge_pages+0x17d/0x390
[ 331.745988] nr_hugepages_store_common+0x91/0xf0
[ 331.747306] kernfs_fop_write_iter+0x108/0x1f0
[ 331.748651] vfs_write+0x207/0x400
[ 331.749735] ksys_write+0x63/0xe0
[ 331.750808] do_syscall_64+0x37/0x90
[ 331.753203] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 331.754857] RIP: 0033:0x7f24b3a26e87
[ 331.756184] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 331.760239] RSP: 002b:00007ffccd603de8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 331.761935] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f24b3a26e87
[ 331.763524] RDX: 0000000000000005 RSI: 0000560a5381bb50 RDI: 0000000000000001
[ 331.765102] RBP: 0000560a5381bb50 R08: 000000000000000a R09: 00007f24b3abe0c0
[ 331.766740] R10: 00007f24b3abdfc0 R11: 0000000000000246 R12: 0000000000000005
[ 331.768344] R13: 00007f24b3afa520 R14: 0000000000000005 R15: 00007f24b3afa720
[ 331.769949] </TASK>
[ 331.770559] ---[ end trace 0000000000000000 ]---

--
Mike Kravetz

> ---
>
> From b0cb92ed10b40fab0921002effa8b726df245790 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <[email protected]>
> Date: Fri, 15 Sep 2023 09:59:52 -0400
> Subject: [PATCH] mm: page_alloc: remove pcppage migratetype caching fix
>
> Mike reports the following crash in -next:
>
> [ 28.643019] page:ffffea0004fb4280 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x13ed0a
> [ 28.645455] flags: 0x200000000000000(node=0|zone=2)
> [ 28.646835] page_type: 0xffffffff()
> [ 28.647886] raw: 0200000000000000 dead000000000100 dead000000000122 0000000000000000
> [ 28.651170] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
> [ 28.653124] page dumped because: VM_BUG_ON_PAGE(is_migrate_isolate(mt))
> [ 28.654769] ------------[ cut here ]------------
> [ 28.655972] kernel BUG at mm/page_alloc.c:1231!
>
> This VM_BUG_ON() used to check that the cached pcppage_migratetype set
> by free_unref_page() wasn't MIGRATE_ISOLATE.
>
> When I removed the caching, I erroneously changed the assert to check
> that no isolated pages are on the pcplist. This is quite different,
> because pages can be isolated *after* they had been put on the
> freelist already (which is handled just fine).
>
> IOW, this was purely a sanity check on the migratetype caching. With
> that gone, the check should have been removed as well. Do that now.
>
> Reported-by: Mike Kravetz <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> mm/page_alloc.c | 3 ---
> 1 file changed, 3 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e3f1c777feed..9469e4660b53 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1207,9 +1207,6 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> count -= nr_pages;
> pcp->count -= nr_pages;
>
> - /* MIGRATE_ISOLATE page should not go to pcplists */
> - VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> -
> __free_one_page(page, pfn, zone, order, mt, FPI_NONE);
> trace_mm_page_pcpu_drain(page, order, mt);
> } while (count > 0 && !list_empty(list));
> --
> 2.42.0
>

2023-09-18 11:52:35

by Vlastimil Babka

Subject: Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene

On 9/15/23 16:16, Johannes Weiner wrote:
> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
>> In next-20230913, I started hitting the following BUG. Seems related
>> to this series. And, if series is reverted I do not see the BUG.
>>
>> I can easily reproduce on a small 16G VM. kernel command line contains
>> "hugetlb_free_vmemmap=on hugetlb_cma=4G". Then run the script,
>> while true; do
>> echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>> echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
>> echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>> done
>>
>> For the BUG below I believe it was the first (or second) 1G page creation from
>> CMA that triggered: cma_alloc of 1G.
>>
>> Sorry, have not looked deeper into the issue.
>
> Thanks for the report, and sorry about the breakage!
>
> I was scratching my head at this:
>
> /* MIGRATE_ISOLATE page should not go to pcplists */
> VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
>
> because there is nothing in page isolation that prevents setting
> MIGRATE_ISOLATE on something that's on the pcplist already. So why
> didn't this trigger before already?
>
> Then it clicked: it used to only check the *pcpmigratetype* determined
> by free_unref_page(), which of course mustn't be MIGRATE_ISOLATE.
>
> Pages that get isolated while *already* on the pcplist are fine, and
> are handled properly:
>
> mt = get_pcppage_migratetype(page);
>
> /* MIGRATE_ISOLATE page should not go to pcplists */
> VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
>
> /* Pageblock could have been isolated meanwhile */
> if (unlikely(isolated_pageblocks))
> mt = get_pageblock_migratetype(page);
>
> So this was purely a sanity check against the pcpmigratetype cache
> operations. With that gone, we can remove it.

Agreed, I assume you'll fold it in 1/6 in v3.

2023-09-18 13:38:49

by Vlastimil Babka

Subject: Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene

On 9/16/23 21:57, Mike Kravetz wrote:
> On 09/15/23 10:16, Johannes Weiner wrote:
>> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
>> > In next-20230913, I started hitting the following BUG. Seems related
>> > to this series. And, if series is reverted I do not see the BUG.
>> >
>> > I can easily reproduce on a small 16G VM. kernel command line contains
>> > "hugetlb_free_vmemmap=on hugetlb_cma=4G". Then run the script,
>> > while true; do
>> > echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>> > echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
>> > echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>> > done
>> >
>> > For the BUG below I believe it was the first (or second) 1G page creation from
>> > CMA that triggered: cma_alloc of 1G.
>> >
>> > Sorry, have not looked deeper into the issue.
>>
>> Thanks for the report, and sorry about the breakage!
>>
>> I was scratching my head at this:
>>
>> /* MIGRATE_ISOLATE page should not go to pcplists */
>> VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
>>
>> because there is nothing in page isolation that prevents setting
>> MIGRATE_ISOLATE on something that's on the pcplist already. So why
>> didn't this trigger before already?
>>
>> Then it clicked: it used to only check the *pcpmigratetype* determined
>> by free_unref_page(), which of course mustn't be MIGRATE_ISOLATE.
>>
>> Pages that get isolated while *already* on the pcplist are fine, and
>> are handled properly:
>>
>> mt = get_pcppage_migratetype(page);
>>
>> /* MIGRATE_ISOLATE page should not go to pcplists */
>> VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
>>
>> /* Pageblock could have been isolated meanwhile */
>> if (unlikely(isolated_pageblocks))
>> mt = get_pageblock_migratetype(page);
>>
>> So this was purely a sanity check against the pcpmigratetype cache
>> operations. With that gone, we can remove it.
>
> With the patch below applied, a slightly different workload triggers the
> following warnings. It seems related, and appears to go away when
> reverting the series.
>
> [ 331.595382] ------------[ cut here ]------------
> [ 331.596665] page type is 5, passed migratetype is 1 (nr=512)
> [ 331.598121] WARNING: CPU: 2 PID: 935 at mm/page_alloc.c:662 expand+0x1c9/0x200

Initially I thought this demonstrates the possible race I was suggesting in
reply to 6/6. But, assuming you have CONFIG_CMA, page type 5 is cma and we
are trying to get a MOVABLE page from a CMA page block, which is something
that's normally done and the pageblock stays CMA. So yeah if the warnings
are to stay, they need to handle this case. Maybe the same can happen with
HIGHATOMIC blocks?

> [ 331.600549] Modules linked in: rfkill ip6table_filter ip6_tables sunrpc snd_hda_codec_generic snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_seq 9p snd_seq_device netfs 9pnet_virtio snd_pcm joydev snd_timer virtio_balloon snd soundcore 9pnet virtio_blk virtio_console virtio_net net_failover failover crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw virtio_pci virtio virtio_pci_legacy_dev virtio_pci_modern_dev virtio_ring fuse
> [ 331.609530] CPU: 2 PID: 935 Comm: bash Tainted: G W 6.6.0-rc1-next-20230913+ #26
> [ 331.611603] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
> [ 331.613527] RIP: 0010:expand+0x1c9/0x200
> [ 331.614492] Code: 89 ef be 07 00 00 00 c6 05 c9 b1 35 01 01 e8 de f7 ff ff 8b 4c 24 30 8b 54 24 0c 48 c7 c7 68 9f 22 82 48 89 c6 e8 97 b3 df ff <0f> 0b e9 db fe ff ff 48 c7 c6 f8 9f 22 82 48 89 df e8 41 e3 fc ff
> [ 331.618540] RSP: 0018:ffffc90003c97a88 EFLAGS: 00010086
> [ 331.619801] RAX: 0000000000000000 RBX: ffffea0007ff8000 RCX: 0000000000000000
> [ 331.621331] RDX: 0000000000000005 RSI: ffffffff8224dce6 RDI: 00000000ffffffff
> [ 331.622914] RBP: 00000000001ffe00 R08: 0000000000009ffb R09: 00000000ffffdfff
> [ 331.624712] R10: 00000000ffffdfff R11: ffffffff824660c0 R12: ffff88827fffcd80
> [ 331.626317] R13: 0000000000000009 R14: 0000000000000200 R15: 000000000000000a
> [ 331.627810] FS: 00007f24b3932740(0000) GS:ffff888477c00000(0000) knlGS:0000000000000000
> [ 331.630593] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 331.631865] CR2: 0000560a53875018 CR3: 000000017eee8003 CR4: 0000000000370ee0
> [ 331.633382] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 331.634873] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 331.636324] Call Trace:
> [ 331.636934] <TASK>
> [ 331.637521] ? expand+0x1c9/0x200
> [ 331.638320] ? __warn+0x7d/0x130
> [ 331.639116] ? expand+0x1c9/0x200
> [ 331.639957] ? report_bug+0x18d/0x1c0
> [ 331.640832] ? handle_bug+0x41/0x70
> [ 331.641635] ? exc_invalid_op+0x13/0x60
> [ 331.642522] ? asm_exc_invalid_op+0x16/0x20
> [ 331.643494] ? expand+0x1c9/0x200
> [ 331.644264] ? expand+0x1c9/0x200
> [ 331.645007] rmqueue_bulk+0xf4/0x530
> [ 331.645847] get_page_from_freelist+0x3ed/0x1040
> [ 331.646837] ? prepare_alloc_pages.constprop.0+0x197/0x1b0
> [ 331.647977] __alloc_pages+0xec/0x240
> [ 331.648783] alloc_buddy_hugetlb_folio.isra.0+0x6a/0x150
> [ 331.649912] __alloc_fresh_hugetlb_folio+0x157/0x230
> [ 331.650938] alloc_pool_huge_folio+0xad/0x110
> [ 331.651909] set_max_huge_pages+0x17d/0x390
> [ 331.652760] nr_hugepages_store_common+0x91/0xf0
> [ 331.653825] kernfs_fop_write_iter+0x108/0x1f0
> [ 331.654986] vfs_write+0x207/0x400
> [ 331.655925] ksys_write+0x63/0xe0
> [ 331.656832] do_syscall_64+0x37/0x90
> [ 331.657793] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> [ 331.660398] RIP: 0033:0x7f24b3a26e87
> [ 331.661342] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
> [ 331.665673] RSP: 002b:00007ffccd603de8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [ 331.667541] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f24b3a26e87
> [ 331.669197] RDX: 0000000000000005 RSI: 0000560a5381bb50 RDI: 0000000000000001
> [ 331.670883] RBP: 0000560a5381bb50 R08: 000000000000000a R09: 00007f24b3abe0c0
> [ 331.672536] R10: 00007f24b3abdfc0 R11: 0000000000000246 R12: 0000000000000005
> [ 331.674175] R13: 00007f24b3afa520 R14: 0000000000000005 R15: 00007f24b3afa720
> [ 331.675841] </TASK>
> [ 331.676450] ---[ end trace 0000000000000000 ]---
> [ 331.677659] ------------[ cut here ]------------
>
>

2023-09-18 17:18:08

by Johannes Weiner

Subject: Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene

On Mon, Sep 18, 2023 at 09:07:53AM +0200, Vlastimil Babka wrote:
> On 9/15/23 16:16, Johannes Weiner wrote:
> > On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
> >> In next-20230913, I started hitting the following BUG. Seems related
> >> to this series. And, if series is reverted I do not see the BUG.
> >>
> >> I can easily reproduce on a small 16G VM. kernel command line contains
> >> "hugetlb_free_vmemmap=on hugetlb_cma=4G". Then run the script,
> >> while true; do
> >> echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> >> echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
> >> echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> >> done
> >>
> >> For the BUG below I believe it was the first (or second) 1G page creation from
> >> CMA that triggered: cma_alloc of 1G.
> >>
> >> Sorry, have not looked deeper into the issue.
> >
> > Thanks for the report, and sorry about the breakage!
> >
> > I was scratching my head at this:
> >
> > /* MIGRATE_ISOLATE page should not go to pcplists */
> > VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> >
> > because there is nothing in page isolation that prevents setting
> > MIGRATE_ISOLATE on something that's on the pcplist already. So why
> > didn't this trigger before already?
> >
> > Then it clicked: it used to only check the *pcpmigratetype* determined
> > by free_unref_page(), which of course mustn't be MIGRATE_ISOLATE.
> >
> > Pages that get isolated while *already* on the pcplist are fine, and
> > are handled properly:
> >
> > mt = get_pcppage_migratetype(page);
> >
> > /* MIGRATE_ISOLATE page should not go to pcplists */
> > VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> >
> > /* Pageblock could have been isolated meanwhile */
> > if (unlikely(isolated_pageblocks))
> > mt = get_pageblock_migratetype(page);
> >
> > So this was purely a sanity check against the pcpmigratetype cache
> > operations. With that gone, we can remove it.
>
> Agreed, I assume you'll fold it in 1/6 in v3.

Yes, will do.

2023-09-18 23:37:16

by Johannes Weiner

Subject: Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene

On Mon, Sep 18, 2023 at 09:16:58AM +0200, Vlastimil Babka wrote:
> On 9/16/23 21:57, Mike Kravetz wrote:
> > On 09/15/23 10:16, Johannes Weiner wrote:
> >> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
> >> > In next-20230913, I started hitting the following BUG. Seems related
> >> > to this series. And, if series is reverted I do not see the BUG.
> >> >
> >> > I can easily reproduce on a small 16G VM. kernel command line contains
> >> > "hugetlb_free_vmemmap=on hugetlb_cma=4G". Then run the script,
> >> > while true; do
> >> > echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> >> > echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
> >> > echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> >> > done
> >> >
> >> > For the BUG below I believe it was the first (or second) 1G page creation from
> >> > CMA that triggered: cma_alloc of 1G.
> >> >
> >> > Sorry, have not looked deeper into the issue.
> >>
> >> Thanks for the report, and sorry about the breakage!
> >>
> >> I was scratching my head at this:
> >>
> >> /* MIGRATE_ISOLATE page should not go to pcplists */
> >> VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> >>
> >> because there is nothing in page isolation that prevents setting
> >> MIGRATE_ISOLATE on something that's on the pcplist already. So why
> >> didn't this trigger before already?
> >>
> >> Then it clicked: it used to only check the *pcpmigratetype* determined
> >> by free_unref_page(), which of course mustn't be MIGRATE_ISOLATE.
> >>
> >> Pages that get isolated while *already* on the pcplist are fine, and
> >> are handled properly:
> >>
> >> mt = get_pcppage_migratetype(page);
> >>
> >> /* MIGRATE_ISOLATE page should not go to pcplists */
> >> VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> >>
> >> /* Pageblock could have been isolated meanwhile */
> >> if (unlikely(isolated_pageblocks))
> >> mt = get_pageblock_migratetype(page);
> >>
> >> So this was purely a sanity check against the pcpmigratetype cache
> >> operations. With that gone, we can remove it.
> >
> > With the patch below applied, a slightly different workload triggers the
> > following warnings. It seems related, and appears to go away when
> > reverting the series.
> >
> > [ 331.595382] ------------[ cut here ]------------
> > [ 331.596665] page type is 5, passed migratetype is 1 (nr=512)
> > [ 331.598121] WARNING: CPU: 2 PID: 935 at mm/page_alloc.c:662 expand+0x1c9/0x200
>
> Initially I thought this demonstrates the possible race I was suggesting in
> reply to 6/6. But, assuming you have CONFIG_CMA, page type 5 is cma and we
> are trying to get a MOVABLE page from a CMA page block, which is something
> that's normally done and the pageblock stays CMA. So yeah if the warnings
> are to stay, they need to handle this case. Maybe the same can happen with
> HIGHATOMIC blocks?

Hm I don't think that's quite it.

CMA and HIGHATOMIC have their own freelists. When MOVABLE requests dip
into CMA and HIGHATOMIC, we explicitly pass that migratetype to
__rmqueue_smallest(). This takes a chunk of e.g. CMA, expands the
remainder to the CMA freelist, then returns the page. While you get a
different mt than requested, the freelist typing should be consistent.
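
(For reference, the CMA side of that dip looks roughly like this; a
paraphrased sketch of __rmqueue(), not a verbatim quote of the tree --
the HIGHATOMIC dip is analogous but happens in the rmqueue caller:)

	if (IS_ENABLED(CONFIG_CMA) && (alloc_flags & ALLOC_CMA) &&
	    zone_page_state(zone, NR_FREE_CMA_PAGES) >
	    zone_page_state(zone, NR_FREE_PAGES) / 2) {
		/* take the page from the CMA freelist... */
		page = __rmqueue_cma_fallback(zone, order);
		if (page)
			return page;
	}
	/* ...otherwise from the freelist of the requested type */
	page = __rmqueue_smallest(zone, order, migratetype);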

In this splat, the migratetype passed to __rmqueue_smallest() is
MOVABLE. There is no preceding warning from del_page_from_freelist()
(Mike, correct me if I'm wrong), so we got a confirmed MOVABLE
order-10 block from the MOVABLE list. So far so good. However, when we
expand() the order-9 tail of this block to the MOVABLE list, it warns
that its pageblock type is CMA.

This means we have an order-10 page where one half is MOVABLE and the
other is CMA.

I don't see how the merging code in __free_one_page() could have done
that. The CMA buddy would have failed the migrate_is_mergeable() test
and we should have left it at order-9s.

I also don't see how the CMA setup could have done this because
MIGRATE_CMA is set on the range before the pages are fed to the buddy.

Mike, could you describe the workload that is triggering this?

Does this reproduce instantly and reliably?

Is there high load on the system, or is it requesting the huge page
with not much else going on?

Do you see compact_* history in /proc/vmstat after this triggers?

Could you please also provide /proc/zoneinfo, /proc/pagetypeinfo and
the hugetlb_cma= parameter you're using?

Thanks!

2023-09-19 00:17:19

by Mike Kravetz

Subject: Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene

On 09/18/23 10:52, Johannes Weiner wrote:
> On Mon, Sep 18, 2023 at 09:16:58AM +0200, Vlastimil Babka wrote:
> > On 9/16/23 21:57, Mike Kravetz wrote:
> > > On 09/15/23 10:16, Johannes Weiner wrote:
> > >> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
> > >
> > > With the patch below applied, a slightly different workload triggers the
> > > following warnings. It seems related, and appears to go away when
> > > reverting the series.
> > >
> > > [ 331.595382] ------------[ cut here ]------------
> > > [ 331.596665] page type is 5, passed migratetype is 1 (nr=512)
> > > [ 331.598121] WARNING: CPU: 2 PID: 935 at mm/page_alloc.c:662 expand+0x1c9/0x200
> >
> > Initially I thought this demonstrates the possible race I was suggesting in
> > reply to 6/6. But, assuming you have CONFIG_CMA, page type 5 is cma and we
> > are trying to get a MOVABLE page from a CMA page block, which is something
> > that's normally done and the pageblock stays CMA. So yeah if the warnings
> > are to stay, they need to handle this case. Maybe the same can happen with
> > HIGHATOMIC blocks?
>
> Hm I don't think that's quite it.
>
> CMA and HIGHATOMIC have their own freelists. When MOVABLE requests dip
> into CMA and HIGHATOMIC, we explicitly pass that migratetype to
> __rmqueue_smallest(). This takes a chunk of e.g. CMA, expands the
> remainder to the CMA freelist, then returns the page. While you get a
> different mt than requested, the freelist typing should be consistent.
>
> In this splat, the migratetype passed to __rmqueue_smallest() is
> MOVABLE. There is no preceding warning from del_page_from_freelist()
> (Mike, correct me if I'm wrong), so we got a confirmed MOVABLE
> order-10 block from the MOVABLE list. So far so good. However, when we
> expand() the order-9 tail of this block to the MOVABLE list, it warns
> that its pageblock type is CMA.
>
> This means we have an order-10 page where one half is MOVABLE and the
> other is CMA.
>
> I don't see how the merging code in __free_one_page() could have done
> that. The CMA buddy would have failed the migrate_is_mergeable() test
> and we should have left it at order-9s.
>
> I also don't see how the CMA setup could have done this because
> MIGRATE_CMA is set on the range before the pages are fed to the buddy.
>
> Mike, could you describe the workload that is triggering this?

This 'slightly different workload' is actually a slightly different
environment. Sorry for mis-speaking! The slight difference is that this
environment does not use the 'alloc hugetlb gigantic pages from CMA'
(hugetlb_cma) feature that triggered the previous issue.

This is still on a 16G VM. Kernel command line here is:
"BOOT_IMAGE=(hd0,msdos1)/vmlinuz-6.6.0-rc1-next-20230913+
root=UUID=49c13301-2555-44dc-847b-caabe1d62bdf ro console=tty0
console=ttyS0,115200 audit=0 selinux=0 transparent_hugepage=always
hugetlb_free_vmemmap=on"

The workload is just running this script:
while true; do
echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
done

>
> Does this reproduce instantly and reliably?
>

It is not 'instant' but will reproduce fairly reliably within a minute
or so.

Note that the 'echo 4 > .../hugepages-1048576kB/nr_hugepages' is going
to end up calling alloc_contig_pages -> alloc_contig_range. Those pages
will eventually be freed via __free_pages(folio, 9).

> Is there high load on the system, or is it requesting the huge page
> with not much else going on?

Only the script was running.

> Do you see compact_* history in /proc/vmstat after this triggers?

As one might expect, compact_isolated continually increases during this
run.

> Could you please also provide /proc/zoneinfo, /proc/pagetypeinfo and
> the hugetlb_cma= parameter you're using?

As mentioned above, hugetlb_cma is not used in this environment. Strangely
enough, this does not reproduce (easily at least) if I use hugetlb_cma as
in the previous report.

The following are during a run after WARNING is triggered.

# cat /proc/zoneinfo
Node 0, zone DMA
per-node stats
nr_inactive_anon 11800
nr_active_anon 109
nr_inactive_file 38161
nr_active_file 10007
nr_unevictable 12
nr_slab_reclaimable 2766
nr_slab_unreclaimable 6881
nr_isolated_anon 0
nr_isolated_file 0
workingset_nodes 0
workingset_refault_anon 0
workingset_refault_file 0
workingset_activate_anon 0
workingset_activate_file 0
workingset_restore_anon 0
workingset_restore_file 0
workingset_nodereclaim 0
nr_anon_pages 11750
nr_mapped 18402
nr_file_pages 48339
nr_dirty 0
nr_writeback 0
nr_writeback_temp 0
nr_shmem 166
nr_shmem_hugepages 0
nr_shmem_pmdmapped 0
nr_file_hugepages 0
nr_file_pmdmapped 0
nr_anon_transparent_hugepages 6
nr_vmscan_write 0
nr_vmscan_immediate_reclaim 0
nr_dirtied 14766
nr_written 7701
nr_throttled_written 0
nr_kernel_misc_reclaimable 0
nr_foll_pin_acquired 96
nr_foll_pin_released 96
nr_kernel_stack 1816
nr_page_table_pages 1100
nr_sec_page_table_pages 0
nr_swapcached 0
pages free 3840
boost 0
min 21
low 26
high 31
spanned 4095
present 3998
managed 3840
cma 0
protection: (0, 1908, 7923, 7923)
nr_free_pages 3840
nr_zone_inactive_anon 0
nr_zone_active_anon 0
nr_zone_inactive_file 0
nr_zone_active_file 0
nr_zone_unevictable 0
nr_zone_write_pending 0
nr_mlock 0
nr_bounce 0
nr_zspages 0
nr_free_cma 0
numa_hit 0
numa_miss 0
numa_foreign 0
numa_interleave 0
numa_local 0
numa_other 0
pagesets
cpu: 0
count: 0
high: 13
batch: 1
vm stats threshold: 6
cpu: 1
count: 0
high: 13
batch: 1
vm stats threshold: 6
cpu: 2
count: 0
high: 13
batch: 1
vm stats threshold: 6
cpu: 3
count: 0
high: 13
batch: 1
vm stats threshold: 6
node_unreclaimable: 0
start_pfn: 1
Node 0, zone DMA32
pages free 495317
boost 0
min 2687
low 3358
high 4029
spanned 1044480
present 520156
managed 496486
cma 0
protection: (0, 0, 6015, 6015)
nr_free_pages 495317
nr_zone_inactive_anon 0
nr_zone_active_anon 0
nr_zone_inactive_file 0
nr_zone_active_file 0
nr_zone_unevictable 0
nr_zone_write_pending 0
nr_mlock 0
nr_bounce 0
nr_zspages 0
nr_free_cma 0
numa_hit 0
numa_miss 0
numa_foreign 0
numa_interleave 0
numa_local 0
numa_other 0
pagesets
cpu: 0
count: 913
high: 1679
batch: 63
vm stats threshold: 30
cpu: 1
count: 0
high: 1679
batch: 63
vm stats threshold: 30
cpu: 2
count: 0
high: 1679
batch: 63
vm stats threshold: 30
cpu: 3
count: 256
high: 1679
batch: 63
vm stats threshold: 30
node_unreclaimable: 0
start_pfn: 4096
Node 0, zone Normal
pages free 1360836
boost 0
min 8473
low 10591
high 12709
spanned 1572864
present 1572864
managed 1552266
cma 0
protection: (0, 0, 0, 0)
nr_free_pages 1360836
nr_zone_inactive_anon 11800
nr_zone_active_anon 109
nr_zone_inactive_file 38161
nr_zone_active_file 10007
nr_zone_unevictable 12
nr_zone_write_pending 0
nr_mlock 12
nr_bounce 0
nr_zspages 3
nr_free_cma 0
numa_hit 10623572
numa_miss 0
numa_foreign 0
numa_interleave 1357
numa_local 6902986
numa_other 3720586
pagesets
cpu: 0
count: 156
high: 5295
batch: 63
vm stats threshold: 42
cpu: 1
count: 210
high: 5295
batch: 63
vm stats threshold: 42
cpu: 2
count: 4956
high: 5295
batch: 63
vm stats threshold: 42
cpu: 3
count: 1
high: 5295
batch: 63
vm stats threshold: 42
node_unreclaimable: 0
start_pfn: 1048576
Node 0, zone Movable
pages free 0
boost 0
min 32
low 32
high 32
spanned 0
present 0
managed 0
cma 0
protection: (0, 0, 0, 0)
Node 1, zone DMA
pages free 0
boost 0
min 0
low 0
high 0
spanned 0
present 0
managed 0
cma 0
protection: (0, 0, 0, 0)
Node 1, zone DMA32
pages free 0
boost 0
min 0
low 0
high 0
spanned 0
present 0
managed 0
cma 0
protection: (0, 0, 0, 0)
Node 1, zone Normal
per-node stats
nr_inactive_anon 15381
nr_active_anon 81
nr_inactive_file 66550
nr_active_file 25965
nr_unevictable 421
nr_slab_reclaimable 4069
nr_slab_unreclaimable 7836
nr_isolated_anon 0
nr_isolated_file 0
workingset_nodes 0
workingset_refault_anon 0
workingset_refault_file 0
workingset_activate_anon 0
workingset_activate_file 0
workingset_restore_anon 0
workingset_restore_file 0
workingset_nodereclaim 0
nr_anon_pages 15420
nr_mapped 24331
nr_file_pages 92978
nr_dirty 0
nr_writeback 0
nr_writeback_temp 0
nr_shmem 100
nr_shmem_hugepages 0
nr_shmem_pmdmapped 0
nr_file_hugepages 0
nr_file_pmdmapped 0
nr_anon_transparent_hugepages 11
nr_vmscan_write 0
nr_vmscan_immediate_reclaim 0
nr_dirtied 6217
nr_written 2902
nr_throttled_written 0
nr_kernel_misc_reclaimable 0
nr_foll_pin_acquired 0
nr_foll_pin_released 0
nr_kernel_stack 1656
nr_page_table_pages 756
nr_sec_page_table_pages 0
nr_swapcached 0
pages free 1829073
boost 0
min 11345
low 14181
high 17017
spanned 2097152
present 2097152
managed 2086594
cma 0
protection: (0, 0, 0, 0)
nr_free_pages 1829073
nr_zone_inactive_anon 15381
nr_zone_active_anon 81
nr_zone_inactive_file 66550
nr_zone_active_file 25965
nr_zone_unevictable 421
nr_zone_write_pending 0
nr_mlock 421
nr_bounce 0
nr_zspages 0
nr_free_cma 0
numa_hit 10522401
numa_miss 0
numa_foreign 0
numa_interleave 961
numa_local 4057399
numa_other 6465002
pagesets
cpu: 0
count: 0
high: 7090
batch: 63
vm stats threshold: 42
cpu: 1
count: 17
high: 7090
batch: 63
vm stats threshold: 42
cpu: 2
count: 6997
high: 7090
batch: 63
vm stats threshold: 42
cpu: 3
count: 0
high: 7090
batch: 63
vm stats threshold: 42
node_unreclaimable: 0
start_pfn: 2621440
Node 1, zone Movable
pages free 0
boost 0
min 32
low 32
high 32
spanned 0
present 0
managed 0
cma 0
protection: (0, 0, 0, 0)

# cat /proc/pagetypeinfo
Page block order: 9
Pages per block: 512

Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
Node 0, zone DMA, type Unmovable 0 0 0 0 0 0 0 0 1 0 0
Node 0, zone DMA, type Movable 0 0 0 0 0 0 0 0 0 1 3
Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA, type CMA 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32, type Unmovable 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32, type Movable 1 0 1 2 2 3 3 3 4 4 480
Node 0, zone DMA32, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32, type CMA 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone Normal, type Unmovable 566 14 22 7 8 8 9 4 7 0 1
Node 0, zone Normal, type Movable 214 299 120 53 15 10 6 6 1 4 1159
Node 0, zone Normal, type Reclaimable 0 9 18 11 6 1 0 0 0 0 0
Node 0, zone Normal, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone Normal, type CMA 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone Normal, type Isolate 0 0 0 0 0 0 0 0 0 0 0

Number of blocks type Unmovable Movable Reclaimable HighAtomic CMA Isolate
Node 0, zone DMA 1 7 0 0 0 0
Node 0, zone DMA32 0 1016 0 0 0 0
Node 0, zone Normal 71 2995 6 0 0 0
Page block order: 9
Pages per block: 512

Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
Node 1, zone Normal, type Unmovable 459 12 5 6 6 5 5 5 6 2 1
Node 1, zone Normal, type Movable 1287 502 171 85 34 14 13 8 2 5 1861
Node 1, zone Normal, type Reclaimable 1 5 12 6 9 3 1 1 0 1 0
Node 1, zone Normal, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
Node 1, zone Normal, type CMA 0 0 0 0 0 0 0 0 0 0 0
Node 1, zone Normal, type Isolate 0 0 0 0 0 0 0 0 0 0 3

Number of blocks type Unmovable Movable Reclaimable HighAtomic CMA Isolate
Node 1, zone Normal 101 3977 10 0 0 8

--
Mike Kravetz

2023-09-19 16:19:51

by Johannes Weiner

Subject: Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene

On Mon, Sep 18, 2023 at 10:40:37AM -0700, Mike Kravetz wrote:
> On 09/18/23 10:52, Johannes Weiner wrote:
> > On Mon, Sep 18, 2023 at 09:16:58AM +0200, Vlastimil Babka wrote:
> > > On 9/16/23 21:57, Mike Kravetz wrote:
> > > > On 09/15/23 10:16, Johannes Weiner wrote:
> > > >> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
> > > >
> > > > With the patch below applied, a slightly different workload triggers the
> > > > following warnings. It seems related, and appears to go away when
> > > > reverting the series.
> > > >
> > > > [ 331.595382] ------------[ cut here ]------------
> > > > [ 331.596665] page type is 5, passed migratetype is 1 (nr=512)
> > > > [ 331.598121] WARNING: CPU: 2 PID: 935 at mm/page_alloc.c:662 expand+0x1c9/0x200
> > >
> > > Initially I thought this demonstrates the possible race I was suggesting in
> > > reply to 6/6. But, assuming you have CONFIG_CMA, page type 5 is cma and we
> > > are trying to get a MOVABLE page from a CMA page block, which is something
> > > that's normally done and the pageblock stays CMA. So yeah if the warnings
> > > are to stay, they need to handle this case. Maybe the same can happen with
> > > HIGHATOMIC blocks?

Ok, the CMA thing gave me pause because Mike's pagetypeinfo didn't
show any CMA pages.

5 is actually MIGRATE_ISOLATE - see the double use of 3 for PCPTYPES
and HIGHATOMIC.
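
That is, with CONFIG_CMA and CONFIG_MEMORY_ISOLATION enabled the
numbering works out to (a sketch of enum migratetype from
include/linux/mmzone.h, ifdefs and comments trimmed):

	enum migratetype {
		MIGRATE_UNMOVABLE,			/* 0 */
		MIGRATE_MOVABLE,			/* 1 */
		MIGRATE_RECLAIMABLE,			/* 2 */
		MIGRATE_PCPTYPES,			/* 3: number of pcplist types */
		MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,	/* also 3 */
		MIGRATE_CMA,				/* 4, CONFIG_CMA */
		MIGRATE_ISOLATE,			/* 5, CONFIG_MEMORY_ISOLATION */
		MIGRATE_TYPES
	};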

> > This means we have an order-10 page where one half is MOVABLE and the
> > other is CMA.

This means the scenario is different:

We get a MAX_ORDER page off the MOVABLE freelist. The removal checks
that the first pageblock is indeed MOVABLE. During the expand, the
second pageblock turns out to be of type MIGRATE_ISOLATE.

The page allocator wouldn't have merged those types. It triggers a bit
too fast to be a race condition.

It appears that MIGRATE_ISOLATE is simply set on the tail pageblock
while the head is on the list, and then stranded there.

Could this be an issue in the page_isolation code? Maybe a range
rounding error?

Zi Yan, does this ring a bell for you?

I don't quite see how my patches could have caused this. But AFAICS we
also didn't have warnings for this scenario so it could be an old bug.

> > Mike, could you describe the workload that is triggering this?
>
> This 'slightly different workload' is actually a slightly different
> environment. Sorry for mis-speaking! The slight difference is that this
> environment does not use the 'alloc hugetlb gigantic pages from CMA'
> (hugetlb_cma) feature that triggered the previous issue.
>
> This is still on a 16G VM. Kernel command line here is:
> "BOOT_IMAGE=(hd0,msdos1)/vmlinuz-6.6.0-rc1-next-20230913+
> root=UUID=49c13301-2555-44dc-847b-caabe1d62bdf ro console=tty0
> console=ttyS0,115200 audit=0 selinux=0 transparent_hugepage=always
> hugetlb_free_vmemmap=on"
>
> The workload is just running this script:
> while true; do
> echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
> echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> done
>
> >
> > Does this reproduce instantly and reliably?
> >
>
> It is not 'instant' but will reproduce fairly reliably within a minute
> or so.
>
> Note that the 'echo 4 > .../hugepages-1048576kB/nr_hugepages' is going
> to end up calling alloc_contig_pages -> alloc_contig_range. Those pages
> will eventually be freed via __free_pages(folio, 9).

No luck reproducing this yet, but I have a question. In that crash
stack trace, the expand() is called via this:

[ 331.645847] get_page_from_freelist+0x3ed/0x1040
[ 331.646837] ? prepare_alloc_pages.constprop.0+0x197/0x1b0
[ 331.647977] __alloc_pages+0xec/0x240
[ 331.648783] alloc_buddy_hugetlb_folio.isra.0+0x6a/0x150
[ 331.649912] __alloc_fresh_hugetlb_folio+0x157/0x230
[ 331.650938] alloc_pool_huge_folio+0xad/0x110
[ 331.651909] set_max_huge_pages+0x17d/0x390

I don't see an __alloc_fresh_hugetlb_folio() in my tree. Only
alloc_fresh_hugetlb_folio(), which has this:

	if (hstate_is_gigantic(h))
		folio = alloc_gigantic_folio(h, gfp_mask, nid, nmask);
	else
		folio = alloc_buddy_hugetlb_folio(h, gfp_mask,
				nid, nmask, node_alloc_noretry);

where gigantic is defined as the order exceeding MAX_ORDER, which
should be the case for 1G pages on x86.

So the crashing stack must be from a 2M allocation, no? I'm confused
how that could happen with the above test case.

2023-09-19 18:13:09

by Zi Yan

Subject: Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene

On 19 Sep 2023, at 2:49, Johannes Weiner wrote:

> On Mon, Sep 18, 2023 at 10:40:37AM -0700, Mike Kravetz wrote:
>> On 09/18/23 10:52, Johannes Weiner wrote:
>>> On Mon, Sep 18, 2023 at 09:16:58AM +0200, Vlastimil Babka wrote:
>>>> On 9/16/23 21:57, Mike Kravetz wrote:
>>>>> On 09/15/23 10:16, Johannes Weiner wrote:
>>>>>> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
>>>>>
>>>>> With the patch below applied, a slightly different workload triggers the
>>>>> following warnings. It seems related, and appears to go away when
>>>>> reverting the series.
>>>>>
>>>>> [ 331.595382] ------------[ cut here ]------------
>>>>> [ 331.596665] page type is 5, passed migratetype is 1 (nr=512)
>>>>> [ 331.598121] WARNING: CPU: 2 PID: 935 at mm/page_alloc.c:662 expand+0x1c9/0x200
>>>>
>>>> Initially I thought this demonstrates the possible race I was suggesting in
>>>> reply to 6/6. But, assuming you have CONFIG_CMA, page type 5 is cma and we
>>>> are trying to get a MOVABLE page from a CMA page block, which is something
>>>> that's normally done and the pageblock stays CMA. So yeah if the warnings
>>>> are to stay, they need to handle this case. Maybe the same can happen with
>>>> HIGHATOMIC blocks?
>
> Ok, the CMA thing gave me pause because Mike's pagetypeinfo didn't
> show any CMA pages.
>
> 5 is actually MIGRATE_ISOLATE - see the double use of 3 for PCPTYPES
> and HIGHATOMIC.
>
>>> This means we have an order-10 page where one half is MOVABLE and the
>>> other is CMA.
>
> This means the scenario is different:
>
> We get a MAX_ORDER page off the MOVABLE freelist. The removal checks
> that the first pageblock is indeed MOVABLE. During the expand, the
> second pageblock turns out to be of type MIGRATE_ISOLATE.
>
> The page allocator wouldn't have merged those types. It triggers a bit
> too fast to be a race condition.
>
> It appears that MIGRATE_ISOLATE is simply set on the tail pageblock
> while the head is on the list, and then stranded there.
>
> Could this be an issue in the page_isolation code? Maybe a range
> rounding error?
>
> Zi Yan, does this ring a bell for you?

Since isolation code works on pageblocks, a scenario I can think of
is that alloc_contig_range() is given a range starting from that tail
pageblock.

Hmm, I also notice that move_freepages_block(), called by
set_migratetype_isolate(), might cover a different isolation range
after your change. I wonder if reverting that behavior would fix the
issue. Basically,
do

	if (!zone_spans_pfn(zone, start))
		start = pfn;

in prep_move_freepages_block(). Just a wild guess. Mike, do you mind
giving it a try?
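
(By "that behavior" I mean the pre-series zone-boundary handling in
move_freepages_block(), which clamped the start instead of bailing out;
roughly, from memory, not a verbatim quote:)

	start_pfn = pageblock_start_pfn(pfn);
	end_pfn = pageblock_end_pfn(pfn) - 1;

	/* Do not cross zone boundaries */
	if (!zone_spans_pfn(zone, start_pfn))
		start_pfn = pfn;		/* clamp to the page's own pfn */
	if (!zone_spans_pfn(zone, end_pfn))
		return 0;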

Meanwhile, let me try to reproduce it and look into it more deeply.

>
> I don't quite see how my patches could have caused this. But AFAICS we
> also didn't have warnings for this scenario so it could be an old bug.
>
>>> Mike, could you describe the workload that is triggering this?
>>
>> This 'slightly different workload' is actually a slightly different
>> environment. Sorry for mis-speaking! The slight difference is that this
>> environment does not use the 'alloc hugetlb gigantic pages from CMA'
>> (hugetlb_cma) feature that triggered the previous issue.
>>
>> This is still on a 16G VM. Kernel command line here is:
>> "BOOT_IMAGE=(hd0,msdos1)/vmlinuz-6.6.0-rc1-next-20230913+
>> root=UUID=49c13301-2555-44dc-847b-caabe1d62bdf ro console=tty0
>> console=ttyS0,115200 audit=0 selinux=0 transparent_hugepage=always
>> hugetlb_free_vmemmap=on"
>>
>> The workload is just running this script:
>> while true; do
>> echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>> echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
>> echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>> done
>>
>>>
>>> Does this reproduce instantly and reliably?
>>>
>>
>> It is not 'instant' but will reproduce fairly reliably within a minute
>> or so.
>>
>> Note that the 'echo 4 > .../hugepages-1048576kB/nr_hugepages' is going
>> to end up calling alloc_contig_pages -> alloc_contig_range. Those pages
>> will eventually be freed via __free_pages(folio, 9).
>
> No luck reproducing this yet, but I have a question. In that crash
> stack trace, the expand() is called via this:
>
> [ 331.645847] get_page_from_freelist+0x3ed/0x1040
> [ 331.646837] ? prepare_alloc_pages.constprop.0+0x197/0x1b0
> [ 331.647977] __alloc_pages+0xec/0x240
> [ 331.648783] alloc_buddy_hugetlb_folio.isra.0+0x6a/0x150
> [ 331.649912] __alloc_fresh_hugetlb_folio+0x157/0x230
> [ 331.650938] alloc_pool_huge_folio+0xad/0x110
> [ 331.651909] set_max_huge_pages+0x17d/0x390
>
> I don't see an __alloc_fresh_hugetlb_folio() in my tree. Only
> alloc_fresh_hugetlb_folio(), which has this:
>
> 	if (hstate_is_gigantic(h))
> 		folio = alloc_gigantic_folio(h, gfp_mask, nid, nmask);
> 	else
> 		folio = alloc_buddy_hugetlb_folio(h, gfp_mask,
> 				nid, nmask, node_alloc_noretry);
>
> where gigantic is defined as the order exceeding MAX_ORDER, which
> should be the case for 1G pages on x86.
>
> So the crashing stack must be from a 2M allocation, no? I'm confused
> how that could happen with the above test case.

That matches my thinking too. Why would the crash happen during a 1GB
page allocation? The range should be 1GB-aligned and thus cannot be in
the middle of a MAX_ORDER free page block.
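
Just to spell out the arithmetic (assuming x86-64 defaults: 4K base
pages, MAX_ORDER = 10, pageblock order = 9):

	1G huge page    = 2^18 base pages (order 18)
	MAX_ORDER block = 2^10 base pages = 4MB
	pageblock       = 2^9  base pages = 2MB

A 1GB-aligned, 1GB range is also 4MB-aligned, so it covers whole
MAX_ORDER blocks and should never start or end in the middle of one.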


--
Best Regards,
Yan, Zi



2023-09-19 22:54:38

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH V2 0/6] mm: page_alloc: freelist migratetype hygiene

On 09/19/23 02:49, Johannes Weiner wrote:
> On Mon, Sep 18, 2023 at 10:40:37AM -0700, Mike Kravetz wrote:
> > On 09/18/23 10:52, Johannes Weiner wrote:
> > > On Mon, Sep 18, 2023 at 09:16:58AM +0200, Vlastimil Babka wrote:
> > > > On 9/16/23 21:57, Mike Kravetz wrote:
> > > > > On 09/15/23 10:16, Johannes Weiner wrote:
> > > > >> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
> > > > >
> > > > > With the patch below applied, a slightly different workload triggers the
> > > > > following warnings. It seems related, and appears to go away when
> > > > > reverting the series.
> > > > >
> > > > > [ 331.595382] ------------[ cut here ]------------
> > > > > [ 331.596665] page type is 5, passed migratetype is 1 (nr=512)
> > > > > [ 331.598121] WARNING: CPU: 2 PID: 935 at mm/page_alloc.c:662 expand+0x1c9/0x200
> > > >
> > > > Initially I thought this demonstrates the possible race I was suggesting in
> > > > reply to 6/6. But, assuming you have CONFIG_CMA, page type 5 is cma and we
> > > > are trying to get a MOVABLE page from a CMA page block, which is something
> > > > that's normally done and the pageblock stays CMA. So yeah if the warnings
> > > > are to stay, they need to handle this case. Maybe the same can happen with
> > > > HIGHATOMIC blocks?
>
> Ok, the CMA thing gave me pause because Mike's pagetypeinfo didn't
> show any CMA pages.
>
> 5 is actually MIGRATE_ISOLATE - see the double use of 3 for PCPTYPES
> and HIGHATOMIC.
>
> > > This means we have an order-10 page where one half is MOVABLE and the
> > > other is CMA.
>
> This means the scenario is different:
>
> We get a MAX_ORDER page off the MOVABLE freelist. The removal checks
> that the first pageblock is indeed MOVABLE. During the expand, the
> second pageblock turns out to be of type MIGRATE_ISOLATE.
>
> The page allocator wouldn't have merged those types. It triggers a bit
> too fast to be a race condition.
>
> It appears that MIGRATE_ISOLATE is simply set on the tail pageblock
> while the head is on the list, and then stranded there.
>
> Could this be an issue in the page_isolation code? Maybe a range
> rounding error?
>
> Zi Yan, does this ring a bell for you?
>
> I don't quite see how my patches could have caused this. But AFAICS we
> also didn't have warnings for this scenario so it could be an old bug.
>
> > > Mike, could you describe the workload that is triggering this?
> >
> > This 'slightly different workload' is actually a slightly different
> > environment. Sorry for mis-speaking! The slight difference is that this
> > environment does not use the 'alloc hugetlb gigantic pages from CMA'
> > (hugetlb_cma) feature that triggered the previous issue.
> >
> > This is still on a 16G VM. Kernel command line here is:
> > "BOOT_IMAGE=(hd0,msdos1)/vmlinuz-6.6.0-rc1-next-20230913+
> > root=UUID=49c13301-2555-44dc-847b-caabe1d62bdf ro console=tty0
> > console=ttyS0,115200 audit=0 selinux=0 transparent_hugepage=always
> > hugetlb_free_vmemmap=on"
> >
> > The workload is just running this script:
> > while true; do
> > echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> > echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
> > echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> > done
> >
> > >
> > > Does this reproduce instantly and reliably?
> > >
> >
> > It is not 'instant' but will reproduce fairly reliably within a minute
> > or so.
> >
> > Note that the 'echo 4 > .../hugepages-1048576kB/nr_hugepages' is going
> > to end up calling alloc_contig_pages -> alloc_contig_range. Those pages
> > will eventually be freed via __free_pages(folio, 9).
>
> No luck reproducing this yet, but I have a question. In that crash
> stack trace, the expand() is called via this:
>
> [ 331.645847] get_page_from_freelist+0x3ed/0x1040
> [ 331.646837] ? prepare_alloc_pages.constprop.0+0x197/0x1b0
> [ 331.647977] __alloc_pages+0xec/0x240
> [ 331.648783] alloc_buddy_hugetlb_folio.isra.0+0x6a/0x150
> [ 331.649912] __alloc_fresh_hugetlb_folio+0x157/0x230
> [ 331.650938] alloc_pool_huge_folio+0xad/0x110
> [ 331.651909] set_max_huge_pages+0x17d/0x390
>
> I don't see an __alloc_fresh_hugetlb_folio() in my tree. Only
> alloc_fresh_hugetlb_folio(), which has this:
>
> 	if (hstate_is_gigantic(h))
> 		folio = alloc_gigantic_folio(h, gfp_mask, nid, nmask);
> 	else
> 		folio = alloc_buddy_hugetlb_folio(h, gfp_mask,
> 				nid, nmask, node_alloc_noretry);
>
> where gigantic is defined as the order exceeding MAX_ORDER, which
> should be the case for 1G pages on x86.
>
> So the crashing stack must be from a 2M allocation, no? I'm confused
> how that could happen with the above test case.

Sorry for causing the confusion!

When I originally saw the warnings pop up, I was running the above script
as well as another that only allocated order 9 hugetlb pages:

while true; do
echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
done

The warnings were actually triggered by allocations in this second script.

However, when reporting the warnings I wanted to include the simplest
way to recreate the issue, and I noticed that the second script running
in parallel was not required. Again, sorry for the confusion! Here is a
warning triggered via the alloc_contig_range path while running only
the one script.

[ 107.275821] ------------[ cut here ]------------
[ 107.277001] page type is 0, passed migratetype is 1 (nr=512)
[ 107.278379] WARNING: CPU: 1 PID: 886 at mm/page_alloc.c:699 del_page_from_free_list+0x137/0x170
[ 107.280514] Modules linked in: rfkill ip6table_filter ip6_tables sunrpc snd_hda_codec_generic joydev 9p snd_hda_intel netfs snd_intel_dspcfg snd_hda_codec snd_hwdep 9pnet_virtio snd_hda_core snd_seq snd_seq_device 9pnet virtio_balloon snd_pcm snd_timer snd soundcore virtio_net net_failover failover virtio_console virtio_blk crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw virtio_pci virtio virtio_pci_legacy_dev virtio_pci_modern_dev virtio_ring fuse
[ 107.291033] CPU: 1 PID: 886 Comm: bash Not tainted 6.6.0-rc2-next-20230919-dirty #35
[ 107.293000] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
[ 107.295187] RIP: 0010:del_page_from_free_list+0x137/0x170
[ 107.296618] Code: c6 05 20 9b 35 01 01 e8 b7 fb ff ff 44 89 f1 44 89 e2 48 c7 c7 d8 ab 22 82 48 89 c6 b8 01 00 00 00 d3 e0 89 c1 e8 e9 99 df ff <0f> 0b e9 03 ff ff ff 48 c7 c6 10 ac 22 82 48 89 df e8 f3 e0 fc ff
[ 107.301236] RSP: 0018:ffffc90003ba7a70 EFLAGS: 00010086
[ 107.302535] RAX: 0000000000000000 RBX: ffffea0007ff8000 RCX: 0000000000000000
[ 107.304467] RDX: 0000000000000004 RSI: ffffffff8224e9de RDI: 00000000ffffffff
[ 107.306289] RBP: 00000000001ffe00 R08: 0000000000009ffb R09: 00000000ffffdfff
[ 107.308135] R10: 00000000ffffdfff R11: ffffffff824660e0 R12: 0000000000000001
[ 107.309956] R13: ffff88827fffcd80 R14: 0000000000000009 R15: 00000000001ffc00
[ 107.311839] FS: 00007fabb8cba740(0000) GS:ffff888277d00000(0000) knlGS:0000000000000000
[ 107.314695] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 107.316159] CR2: 00007f41ba01acf0 CR3: 0000000282ed4006 CR4: 0000000000370ee0
[ 107.317971] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 107.319783] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 107.321575] Call Trace:
[ 107.322314] <TASK>
[ 107.323002] ? del_page_from_free_list+0x137/0x170
[ 107.324380] ? __warn+0x7d/0x130
[ 107.325341] ? del_page_from_free_list+0x137/0x170
[ 107.326627] ? report_bug+0x18d/0x1c0
[ 107.327632] ? prb_read_valid+0x17/0x20
[ 107.328711] ? handle_bug+0x41/0x70
[ 107.329685] ? exc_invalid_op+0x13/0x60
[ 107.330787] ? asm_exc_invalid_op+0x16/0x20
[ 107.331937] ? del_page_from_free_list+0x137/0x170
[ 107.333189] __free_one_page+0x2ab/0x6f0
[ 107.334375] free_pcppages_bulk+0x169/0x210
[ 107.335575] drain_pages_zone+0x3f/0x50
[ 107.336691] __drain_all_pages+0xe2/0x1e0
[ 107.337843] alloc_contig_range+0x143/0x280
[ 107.339026] alloc_contig_pages+0x210/0x270
[ 107.340200] alloc_fresh_hugetlb_folio+0xa6/0x270
[ 107.341529] alloc_pool_huge_page+0x7d/0x100
[ 107.342745] set_max_huge_pages+0x162/0x340
[ 107.345059] nr_hugepages_store_common+0x91/0xf0
[ 107.346329] kernfs_fop_write_iter+0x108/0x1f0
[ 107.347547] vfs_write+0x207/0x400
[ 107.348543] ksys_write+0x63/0xe0
[ 107.349511] do_syscall_64+0x37/0x90
[ 107.350543] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 107.351940] RIP: 0033:0x7fabb8daee87
[ 107.352819] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 107.356373] RSP: 002b:00007ffc02737478 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 107.358103] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fabb8daee87
[ 107.359695] RDX: 0000000000000002 RSI: 000055fe584a1620 RDI: 0000000000000001
[ 107.361258] RBP: 000055fe584a1620 R08: 000000000000000a R09: 00007fabb8e460c0
[ 107.362842] R10: 00007fabb8e45fc0 R11: 0000000000000246 R12: 0000000000000002
[ 107.364385] R13: 00007fabb8e82520 R14: 0000000000000002 R15: 00007fabb8e82720
[ 107.365968] </TASK>
[ 107.366534] ---[ end trace 0000000000000000 ]---
[ 121.542474] ------------[ cut here ]------------

Perhaps that is another useful data point: the warning can be triggered
via both allocation paths.

To be perfectly clear, here is what I did today:
- Built next-20230919. It does not contain your series.
  I could not recreate the issue.
- Added your series and the patch to remove
  VM_BUG_ON_PAGE(is_migrate_isolate(mt), page) from free_pcppages_bulk.
  I could recreate the issue while running only the one script.
  The warning above is from that run.
- Added this suggested patch from Zi:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1400e674ab86..77a4aea31a7f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
 	end = pageblock_end_pfn(pfn) - 1;
 
 	/* Do not cross zone boundaries */
+#if 0
 	if (!zone_spans_pfn(zone, start))
 		start = zone->zone_start_pfn;
+#else
+	if (!zone_spans_pfn(zone, start))
+		start = pfn;
+#endif
 	if (!zone_spans_pfn(zone, end))
 		return false;

  I can still trigger warnings.

One idea about recreating the issue is that it may have to do with the
size of my VM (16G) and the requested allocation size (4G). However, I tried
to really stress the allocations by increasing the number of hugetlb
pages requested and that did not help. I also noticed that I only seem
to get two warnings and then they stop, even if I continue to run the
script.

Zi asked about my config, so it is attached.
--
Mike Kravetz


Attachments:
mike.config (160.64 kB)