2021-08-30 23:15:13

by Sultan Alsawaf

Subject: Stuck looping on list_empty(list) in free_pcppages_bulk()

Hi,

I was recently given the following CPU stall splat and asked to look into it:
----------------8<----------------
rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
rcu: 44-...0: (1 GPs behind) idle=77e/1/0x4000000000000000 softirq=28480656/28480657 fqs=279939
(detected by 62, t=1140032 jiffies, g=224165577, q=8883218)
Sending NMI from CPU 62 to CPUs 44:
NMI backtrace for cpu 44
CPU: 44 PID: 83957 Comm: perl Tainted: G L 5.8.18-100.fc31.x86_64 #1
RIP: 0010:free_pcppages_bulk+0x63/0x2a0
RSP: 0018:ffffb3078698fb60 EFLAGS: 00000086
RAX: ffff8b647db30390 RBX: ffffee5fcab67f48 RCX: ffffee5f30c79980
RDX: 0000000000c31e66 RSI: 0000000000000007 RDI: 0000000000000007
RBP: 0000000000000000 R08: ffffb3078698fb80 R09: ffffb3078698fb80
R10: 00000000002ffa93 R11: 0000000000000000 R12: ffff8b647db30390
R13: 0000000000000000 R14: ffff8b647db30380 R15: 000000006a340084
FS: 0000000000000000(0000) GS:ffff8b647db00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ff2b5200fa8 CR3: 00000034ada0a000 CR4: 0000000000340ee0
Call Trace:
free_unref_page_list+0x113/0x1a0
release_pages+0x3ad/0x450
tlb_flush_mmu+0x36/0x160
unmap_page_range+0xab6/0xee0
unmap_vmas+0x6a/0xd0
exit_mmap+0x97/0x170
mmput+0x61/0x140
do_exit+0x306/0xb80
? syscall_trace_enter+0x160/0x290
do_group_exit+0x3a/0xa0
__x64_sys_exit_group+0x14/0x20
do_syscall_64+0x4d/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
---------------->8----------------

I apologize in advance for reporting a bug on an EOL kernel. I don't see any
changes as of 5.14 that could address something like this, so I'm emailing in
case whatever happened here may be a bug affecting newer kernels.

With gdb, it appears that the CPU got stuck in the list_empty(list) loop inside
free_pcppages_bulk():
----------------8<----------------
do {
        batch_free++;
        if (++migratetype == MIGRATE_PCPTYPES)
                migratetype = 0;
        list = &pcp->lists[migratetype];
} while (list_empty(list));
---------------->8----------------

Although this code snippet is slightly different in 5.14, the logic is still
ultimately the same. Side note: I noticed that because `migratetype` is
pre-incremented, `&pcp->lists[1]` gets examined first rather than
`&pcp->lists[0]`. This quirk is still present in 5.14, though the variable in
question is now called `pindex`.

With some more gdb digging, I found that the `count` variable was stored in %ESI
at the time of the stall. According to the register dump in the splat, %ESI was 7.

It looks like, for some reason, the pcp count was 7 higher than the number of
pages actually present in the pcp lists.

I tried to find some way that this could happen, but the only thing I could
think of was that maybe an allocation had both __GFP_RECLAIMABLE and
__GFP_MOVABLE set in its gfp mask, in which case the rmqueue() call in
get_page_from_freelist() would pass in a migratetype equal to MIGRATE_PCPTYPES
and then pages could be added to an out-of-bounds pcp list while still
incrementing the overall pcp count. This seems pretty unlikely though. As
another side note, it looks like there's nothing stopping this from occurring;
there's only a VM_WARN_ON() in gfp_migratetype() that checks if both bits are
set.

Any ideas on what may have happened here, or a link to a commit that may have
fixed this issue in newer kernels, would be much appreciated.

Thanks,
Sultan


2021-08-31 12:48:54

by Mel Gorman

Subject: Re: Stuck looping on list_empty(list) in free_pcppages_bulk()

On Mon, Aug 30, 2021 at 04:12:51PM -0700, Sultan Alsawaf wrote:
> I apologize in advance for reporting a bug on an EOL kernel. I don't see any
> changes as of 5.14 that could address something like this, so I'm emailing in
> case whatever happened here may be a bug affecting newer kernels.
>
> With gdb, it appears that the CPU got stuck in the list_empty(list) loop inside
> free_pcppages_bulk():
> ----------------8<----------------
> do {
>         batch_free++;
>         if (++migratetype == MIGRATE_PCPTYPES)
>                 migratetype = 0;
>         list = &pcp->lists[migratetype];
> } while (list_empty(list));
> ---------------->8----------------
>
> Although this code snippet is slightly different in 5.14, the logic is still
> ultimately the same. Side note: I noticed that because `migratetype` is
> pre-incremented, `&pcp->lists[1]` gets examined first rather than
> `&pcp->lists[0]`. This quirk is still present in 5.14, though the variable in
> question is now called `pindex`.
>
> With some more gdb digging, I found that the `count` variable was stored in %ESI
> at the time of the stall. According to the register dump in the splat, %ESI was 7.
>
> It looks like, for some reason, the pcp count was 7 higher than the number of
> pages actually present in the pcp lists.
>

That's your answer -- the PCP count has been corrupted or misaccounted.
Given this is a Fedora kernel, check for any patches to mm/page_alloc.c
that could be accounting-related, or that could affect IRQ disabling or
zone-lock acquisition. Another possibility is memory corruption -- either
in the kernel or in the hardware itself.

> I tried to find some way that this could happen, but the only thing I could
> think of was that maybe an allocation had both __GFP_RECLAIMABLE and
> __GFP_MOVABLE set in its gfp mask, in which case the rmqueue() call in
> get_page_from_freelist() would pass in a migratetype equal to MIGRATE_PCPTYPES
> and then pages could be added to an out-of-bounds pcp list while still
> incrementing the overall pcp count. This seems pretty unlikely though.

It's unlikely because it would be an outright bug to specify both flags.

> As
> another side note, it looks like there's nothing stopping this from occurring;
> there's only a VM_WARN_ON() in gfp_migratetype() that checks if both bits are
> set.
>

There is no explicit check for it because the two flags should never both
be set. I don't think this happens anywhere in the kernel, but if an
out-of-tree module did it, it might corrupt adjacent PCPs.

--
Mel Gorman
SUSE Labs

2021-08-31 16:52:58

by Vlastimil Babka

Subject: Re: Stuck looping on list_empty(list) in free_pcppages_bulk()

On 8/31/21 01:12, Sultan Alsawaf wrote:
> With some more gdb digging, I found that the `count` variable was stored in %ESI
> at the time of the stall. According to the register dump in the splat, %ESI was 7.
>
> It looks like, for some reason, the pcp count was 7 higher than the number of
> pages actually present in the pcp lists.
>
> I tried to find some way that this could happen, but the only thing I could
> think of was that maybe an allocation had both __GFP_RECLAIMABLE and
> __GFP_MOVABLE set in its gfp mask, in which case the rmqueue() call in
> get_page_from_freelist() would pass in a migratetype equal to MIGRATE_PCPTYPES
> and then pages could be added to an out-of-bounds pcp list while still
> incrementing the overall pcp count. This seems pretty unlikely though. As
> another side note, it looks like there's nothing stopping this from occurring;
> there's only a VM_WARN_ON() in gfp_migratetype() that checks if both bits are
> set.
>
> Any ideas on what may have happened here, or a link to a commit that may have
> fixed this issue in newer kernels, would be much appreciated.

Does the kernel have commit 88e8ac11d2ea3 backported? If not, and memory
hotplug operations were happening, this infinite loop can occur. If that
commit was not backported but 5c3ad2eb7104 was, there may be more scenarios
outside hotplug that can cause trouble.

> Thanks,
> Sultan
>

2021-08-31 18:05:12

by Sultan Alsawaf

Subject: Re: Stuck looping on list_empty(list) in free_pcppages_bulk()

On Tue, Aug 31, 2021 at 01:44:49PM +0100, Mel Gorman wrote:
> That's your answer -- the PCP count has been corrupted or misaccounted.
> Given this is a Fedora kernel, check for any patches to mm/page_alloc.c
> that could be accounting-related, or that could affect IRQ disabling or
> zone-lock acquisition. Another possibility is memory corruption -- either
> in the kernel or in the hardware itself.

Hmm, I don't see any changes to mm/page_alloc.c from Fedora for this kernel.

What about a memory allocation originating from inside an NMI? I think this
could occur quite easily with an eBPF program registered to a tracepoint, as
some of the eBPF helpers do allocate memory on the fly for certain map types.

> > I tried to find some way that this could happen, but the only thing I could
> > think of was that maybe an allocation had both __GFP_RECLAIMABLE and
> > __GFP_MOVABLE set in its gfp mask, in which case the rmqueue() call in
> > get_page_from_freelist() would pass in a migratetype equal to MIGRATE_PCPTYPES
> > and then pages could be added to an out-of-bounds pcp list while still
> > incrementing the overall pcp count. This seems pretty unlikely though.
>
> It's unlikely because it would be an outright bug to specify both flags.

Perhaps that VM_WARN_ON should be changed to a VM_BUG_ON?

Thanks,
Sultan

2021-08-31 18:36:16

by Sultan Alsawaf

Subject: Re: Stuck looping on list_empty(list) in free_pcppages_bulk()

On Tue, Aug 31, 2021 at 06:51:23PM +0200, Vlastimil Babka wrote:
> Does the kernel have commit 88e8ac11d2ea3 backported? If not, and memory
> hotplug operations were happening, this infinite loop can occur. If that
> commit was not backported but 5c3ad2eb7104 was, there may be more scenarios
> outside hotplug that can cause trouble.

Yes, it appears that 5.8.18 contains the 88e8ac11d2ea3 backport (commit ID
5b21a116b1365 in linux-5.8.y).

Thanks,
Sultan