Hello everyone,
while rebuilding a few packages in Arch Linux we have recently come
across a regression in the linux kernel which was made visible by a test
failure in libguestfs[0], where the booted kernel showed a Call Trace
like the following one:
[ 218.738568] CPU: 0 PID: 167 Comm: guestfsd Not tainted 6.7.0-rc4-1-mainline-00158-gb5ba474f3f51 #1 bf39861cf50acae7a79c534e25532f28afe4e593^M
[ 218.739007] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.3-1-1 04/01/2014^M
[ 218.739787] RIP: 0010:memcg_page_state+0x9/0x30^M
[ 218.740299] Code: 0d b8 ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 <48> 8b 87 00 06 00 00 48 63 f6 31 d2 48 8b 04 f0 48 85 c0 48 0f 48^M
[ 218.740727] RSP: 0018:ffffb5fa808dfc10 EFLAGS: 00000202^M
[ 218.740862] RAX: 0000000000000000 RBX: ffffb5fa808dfce0 RCX: 0000000000000002^M
[ 218.741016] RDX: 0000000000000001 RSI: 0000000000000033 RDI: 0000000000000000^M
[ 218.741168] RBP: 0000000000000000 R08: ffff976681ff8000 R09: 0000000000000000^M
[ 218.741322] R10: 0000000000000001 R11: ffff9766833f9d00 R12: ffff9766ffffe780^M
[ 218.742167] R13: 0000000000000000 R14: ffff976680cc1800 R15: ffff976682204d80^M
[ 218.742376] FS: 00007f1479d9f540(0000) GS:ffff9766fbc00000(0000) knlGS:0000000000000000^M
[ 218.742569] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
[ 218.743256] CR2: 0000000000000600 CR3: 0000000103606000 CR4: 0000000000750ef0^M
[ 218.743494] PKRU: 55555554^M
[ 218.743593] Call Trace:^M
[ 218.743733] <TASK>^M
[ 218.743847] ? __die+0x23/0x70^M
[ 218.743957] ? page_fault_oops+0x171/0x4e0^M
[ 218.744056] ? free_unref_page+0xf6/0x180^M
[ 218.744458] ? exc_page_fault+0x7f/0x180^M
[ 218.744551] ? asm_exc_page_fault+0x26/0x30^M
[ 218.744684] ? memcg_page_state+0x9/0x30^M
[ 218.744779] zswap_shrinker_count+0x9d/0x110^M
[ 218.744896] do_shrink_slab+0x3a/0x360^M
[ 218.744990] shrink_slab+0xc7/0x3c0^M
[ 218.745609] drop_slab+0x85/0x140^M
[ 218.745691] drop_caches_sysctl_handler+0x7e/0xd0^M
[ 218.745799] proc_sys_call_handler+0x1c0/0x2e0^M
[ 218.745912] vfs_write+0x23d/0x400^M
[ 218.745998] ksys_write+0x6f/0xf0^M
[ 218.746080] do_syscall_64+0x64/0xe0^M
[ 218.746169] ? exit_to_user_mode_prepare+0x132/0x1f0^M
[ 218.746873] entry_SYSCALL_64_after_hwframe+0x6e/0x76^M
The regression is present in the mainline kernel and also was
independently reported to the redhat bugtracker[1].
I have bisected (see log[2]) the regression between v6.9-rc4 and v6.6
and have landed on the following results (removed unrelated test commit)
as remainders since some of the commits were not buildable for me:
- 7108cc3f765c ("mm: memcg: add per-memcg zswap writeback stat")
- a65b0e7607cc ("zswap: make shrinking memcg-aware")
- b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")
I have decided on good/bad commits with the relevant libguestfs tests,
but I think the reproducer in the redhat bugzilla is simpler (although I
only became aware of it during the bisection and therefore didn't test
it myself):
LIBGUESTFS_MEMSIZE=4096 LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1 make -C /build/libguestfs/src/libguestfs-1.52.0/tests -k check TESTS=c-api/tests
I hope I have included everything needed to debug this further, if there
is more to add I'm happy to provide more details!
Cheers,
Christian
[0]: https://github.com/libguestfs/libguestfs/issues/139
[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2275252
[2]: https://gist.github.com/christian-heusel/d5095c36b72ae90871e27dfed32ddc46
On Tue, 16 Apr 2024 14:19:24 +0200 Christian Heusel <[email protected]> wrote:
> Hello everyone,
>
> while rebuilding a few packages in Arch Linux we have recently come
> across a regression in the linux kernel which was made visible by a test
> failure in libguestfs[0], where the booted kernel showed a Call Trace
> like the following one:
>
> [ 218.738568] CPU: 0 PID: 167 Comm: guestfsd Not tainted 6.7.0-rc4-1-mainline-00158-gb5ba474f3f51 #1 bf39861cf50acae7a79c534e25532f28afe4e593^M
(cc more zswap people)
Fairly old kernel. We might have fixed it since then, but it would be
good to know what fixed this, for backporting reasons.
> [ 218.739007] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.3-1-1 04/01/2014^M
> [ 218.739787] RIP: 0010:memcg_page_state+0x9/0x30^M
> [ 218.740299] Code: 0d b8 ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 <48> 8b 87 00 06 00 00 48 63 f6 31 d2 48 8b 04 f0 48 85 c0 48 0f 48^M
> [ 218.740727] RSP: 0018:ffffb5fa808dfc10 EFLAGS: 00000202^M
> [ 218.740862] RAX: 0000000000000000 RBX: ffffb5fa808dfce0 RCX: 0000000000000002^M
> [ 218.741016] RDX: 0000000000000001 RSI: 0000000000000033 RDI: 0000000000000000^M
> [ 218.741168] RBP: 0000000000000000 R08: ffff976681ff8000 R09: 0000000000000000^M
> [ 218.741322] R10: 0000000000000001 R11: ffff9766833f9d00 R12: ffff9766ffffe780^M
> [ 218.742167] R13: 0000000000000000 R14: ffff976680cc1800 R15: ffff976682204d80^M
> [ 218.742376] FS: 00007f1479d9f540(0000) GS:ffff9766fbc00000(0000) knlGS:0000000000000000^M
> [ 218.742569] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
> [ 218.743256] CR2: 0000000000000600 CR3: 0000000103606000 CR4: 0000000000750ef0^M
> [ 218.743494] PKRU: 55555554^M
> [ 218.743593] Call Trace:^M
> [ 218.743733] <TASK>^M
> [ 218.743847] ? __die+0x23/0x70^M
> [ 218.743957] ? page_fault_oops+0x171/0x4e0^M
> [ 218.744056] ? free_unref_page+0xf6/0x180^M
> [ 218.744458] ? exc_page_fault+0x7f/0x180^M
> [ 218.744551] ? asm_exc_page_fault+0x26/0x30^M
> [ 218.744684] ? memcg_page_state+0x9/0x30^M
> [ 218.744779] zswap_shrinker_count+0x9d/0x110^M
> [ 218.744896] do_shrink_slab+0x3a/0x360^M
> [ 218.744990] shrink_slab+0xc7/0x3c0^M
> [ 218.745609] drop_slab+0x85/0x140^M
> [ 218.745691] drop_caches_sysctl_handler+0x7e/0xd0^M
> [ 218.745799] proc_sys_call_handler+0x1c0/0x2e0^M
> [ 218.745912] vfs_write+0x23d/0x400^M
> [ 218.745998] ksys_write+0x6f/0xf0^M
> [ 218.746080] do_syscall_64+0x64/0xe0^M
> [ 218.746169] ? exit_to_user_mode_prepare+0x132/0x1f0^M
> [ 218.746873] entry_SYSCALL_64_after_hwframe+0x6e/0x76^M
>
> The regression is present in the mainline kernel and also was
> independently reported to the redhat bugtracker[1].
>
> I have bisected (see log[2]) the regression between v6.9-rc4 and v6.6
> and have landed on the following results (removed unrelated test commit)
> as remainders since some of the commits were not buildable for me:
> - 7108cc3f765c ("mm: memcg: add per-memcg zswap writeback stat")
> - a65b0e7607cc ("zswap: make shrinking memcg-aware")
> - b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")
>
> I have decided on good/bad commits with the relevant libguestfs tests,
> but I think the reproducer in the redhat bugzilla is simpler (although I
> only became aware of it during the bisection and therefore didn't test
> it myself):
>
> LIBGUESTFS_MEMSIZE=4096 LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1 make -C /build/libguestfs/src/libguestfs-1.52.0/tests -k check TESTS=c-api/tests
>
> I hope I have included everything needed to debug this further, if there
> is more to add I'm happy to provide more details!
>
> Cheers,
> Christian
>
> [0]: https://github.com/libguestfs/libguestfs/issues/139
> [1]: https://bugzilla.redhat.com/show_bug.cgi?id=2275252
> [2]: https://gist.github.com/christian-heusel/d5095c36b72ae90871e27dfed32ddc46
On 24/04/16 12:18PM, Andrew Morton wrote:
> On Tue, 16 Apr 2024 14:19:24 +0200 Christian Heusel <[email protected]> wrote:
> > [ 218.738568] CPU: 0 PID: 167 Comm: guestfsd Not tainted 6.7.0-rc4-1-mainline-00158-gb5ba474f3f51 #1 bf39861cf50acae7a79c534e25532f28afe4e593^M
>
> (cc more zswap people)
>
> Fairly old kernel. We might have fixed it since then, but it would be
> good to know what fixed this, for backporting reasons.
Hey Andrew,
I guess you must have missed the end of my previous mail, I have
verified that the bug is still present with with the latest mainline
release and bisected the issue down to a few commits. The trace I was
posting above was just a random one from during the bisection.
Sorry if my formatting was a bit confusing .. :)
Cheers,
Christian
On Tue, Apr 16, 2024 at 5:19 AM Christian Heusel <[email protected]> wrote:
>
> Hello everyone,
Thanks for the report, Christian! Looking at it now.
>
> while rebuilding a few packages in Arch Linux we have recently come
> across a regression in the linux kernel which was made visible by a test
> failure in libguestfs[0], where the booted kernel showed a Call Trace
> like the following one:
>
> [ 218.738568] CPU: 0 PID: 167 Comm: guestfsd Not tainted 6.7.0-rc4-1-mainline-00158-gb5ba474f3f51 #1 bf39861cf50acae7a79c534e25532f28afe4e593^M
Is this one of the kernel versions that was broken? That looks a bit
odd, as zswap shrinker landed on 6.8...
> [ 218.739007] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.3-1-1 04/01/2014^M
> [ 218.739787] RIP: 0010:memcg_page_state+0x9/0x30^M
> [ 218.740299] Code: 0d b8 ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 <48> 8b 87 00 06 00 00 48 63 f6 31 d2 48 8b 04 f0 48 85 c0 48 0f 48^M
> [ 218.740727] RSP: 0018:ffffb5fa808dfc10 EFLAGS: 00000202^M
> [ 218.740862] RAX: 0000000000000000 RBX: ffffb5fa808dfce0 RCX: 0000000000000002^M
> [ 218.741016] RDX: 0000000000000001 RSI: 0000000000000033 RDI: 0000000000000000^M
> [ 218.741168] RBP: 0000000000000000 R08: ffff976681ff8000 R09: 0000000000000000^M
> [ 218.741322] R10: 0000000000000001 R11: ffff9766833f9d00 R12: ffff9766ffffe780^M
> [ 218.742167] R13: 0000000000000000 R14: ffff976680cc1800 R15: ffff976682204d80^M
> [ 218.742376] FS: 00007f1479d9f540(0000) GS:ffff9766fbc00000(0000) knlGS:0000000000000000^M
> [ 218.742569] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
> [ 218.743256] CR2: 0000000000000600 CR3: 0000000103606000 CR4: 0000000000750ef0^M
> [ 218.743494] PKRU: 55555554^M
> [ 218.743593] Call Trace:^M
> [ 218.743733] <TASK>^M
> [ 218.743847] ? __die+0x23/0x70^M
> [ 218.743957] ? page_fault_oops+0x171/0x4e0^M
> [ 218.744056] ? free_unref_page+0xf6/0x180^M
> [ 218.744458] ? exc_page_fault+0x7f/0x180^M
> [ 218.744551] ? asm_exc_page_fault+0x26/0x30^M
> [ 218.744684] ? memcg_page_state+0x9/0x30^M
> [ 218.744779] zswap_shrinker_count+0x9d/0x110^M
> [ 218.744896] do_shrink_slab+0x3a/0x360^M
> [ 218.744990] shrink_slab+0xc7/0x3c0^M
> [ 218.745609] drop_slab+0x85/0x140^M
> [ 218.745691] drop_caches_sysctl_handler+0x7e/0xd0^M
> [ 218.745799] proc_sys_call_handler+0x1c0/0x2e0^M
> [ 218.745912] vfs_write+0x23d/0x400^M
> [ 218.745998] ksys_write+0x6f/0xf0^M
> [ 218.746080] do_syscall_64+0x64/0xe0^M
> [ 218.746169] ? exit_to_user_mode_prepare+0x132/0x1f0^M
> [ 218.746873] entry_SYSCALL_64_after_hwframe+0x6e/0x76^M
>
> The regression is present in the mainline kernel and also was
> independently reported to the redhat bugtracker[1].
>
> I have bisected (see log[2]) the regression between v6.9-rc4 and v6.6
> and have landed on the following results (removed unrelated test commit)
> as remainders since some of the commits were not buildable for me:
> - 7108cc3f765c ("mm: memcg: add per-memcg zswap writeback stat")
> - a65b0e7607cc ("zswap: make shrinking memcg-aware")
> - b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")
On Tue, Apr 16, 2024 at 3:14 PM Nhat Pham <[email protected]> wrote:
>
> On Tue, Apr 16, 2024 at 5:19 AM Christian Heusel <[email protected]> wrote:
> >
> > Hello everyone,
>
> Thanks for the report, Christian! Looking at it now.
>
> >
> > while rebuilding a few packages in Arch Linux we have recently come
> > across a regression in the linux kernel which was made visible by a test
> > failure in libguestfs[0], where the booted kernel showed a Call Trace
> > like the following one:
> >
> > [ 218.738568] CPU: 0 PID: 167 Comm: guestfsd Not tainted 6.7.0-rc4-1-mainline-00158-gb5ba474f3f51 #1 bf39861cf50acae7a79c534e25532f28afe4e593^M
>
> Is this one of the kernel versions that was broken? That looks a bit
> odd, as zswap shrinker landed on 6.8...
Ah ignore this - I understand the versioning now...
>
> > [ 218.739007] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.3-1-1 04/01/2014^M
> > [ 218.739787] RIP: 0010:memcg_page_state+0x9/0x30^M
> > [ 218.740299] Code: 0d b8 ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 <48> 8b 87 00 06 00 00 48 63 f6 31 d2 48 8b 04 f0 48 85 c0 48 0f 48^M
> > [ 218.740727] RSP: 0018:ffffb5fa808dfc10 EFLAGS: 00000202^M
> > [ 218.740862] RAX: 0000000000000000 RBX: ffffb5fa808dfce0 RCX: 0000000000000002^M
> > [ 218.741016] RDX: 0000000000000001 RSI: 0000000000000033 RDI: 0000000000000000^M
> > [ 218.741168] RBP: 0000000000000000 R08: ffff976681ff8000 R09: 0000000000000000^M
> > [ 218.741322] R10: 0000000000000001 R11: ffff9766833f9d00 R12: ffff9766ffffe780^M
> > [ 218.742167] R13: 0000000000000000 R14: ffff976680cc1800 R15: ffff976682204d80^M
> > [ 218.742376] FS: 00007f1479d9f540(0000) GS:ffff9766fbc00000(0000) knlGS:0000000000000000^M
> > [ 218.742569] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
> > [ 218.743256] CR2: 0000000000000600 CR3: 0000000103606000 CR4: 0000000000750ef0^M
> > [ 218.743494] PKRU: 55555554^M
> > [ 218.743593] Call Trace:^M
> > [ 218.743733] <TASK>^M
> > [ 218.743847] ? __die+0x23/0x70^M
> > [ 218.743957] ? page_fault_oops+0x171/0x4e0^M
> > [ 218.744056] ? free_unref_page+0xf6/0x180^M
> > [ 218.744458] ? exc_page_fault+0x7f/0x180^M
> > [ 218.744551] ? asm_exc_page_fault+0x26/0x30^M
> > [ 218.744684] ? memcg_page_state+0x9/0x30^M
> > [ 218.744779] zswap_shrinker_count+0x9d/0x110^M
> > [ 218.744896] do_shrink_slab+0x3a/0x360^M
> > [ 218.744990] shrink_slab+0xc7/0x3c0^M
> > [ 218.745609] drop_slab+0x85/0x140^M
> > [ 218.745691] drop_caches_sysctl_handler+0x7e/0xd0^M
> > [ 218.745799] proc_sys_call_handler+0x1c0/0x2e0^M
> > [ 218.745912] vfs_write+0x23d/0x400^M
> > [ 218.745998] ksys_write+0x6f/0xf0^M
> > [ 218.746080] do_syscall_64+0x64/0xe0^M
> > [ 218.746169] ? exit_to_user_mode_prepare+0x132/0x1f0^M
> > [ 218.746873] entry_SYSCALL_64_after_hwframe+0x6e/0x76^M
> >
> > The regression is present in the mainline kernel and also was
> > independently reported to the redhat bugtracker[1].
> >
> > I have bisected (see log[2]) the regression between v6.9-rc4 and v6.6
> > and have landed on the following results (removed unrelated test commit)
> > as remainders since some of the commits were not buildable for me:
> > - 7108cc3f765c ("mm: memcg: add per-memcg zswap writeback stat")
> > - a65b0e7607cc ("zswap: make shrinking memcg-aware")
> > - b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")
Hmmm, this smells like a memcg reference counting bug to me. The memcg
was successfully accessed just a couple of LoCs above, but failed
after the stats flushing.
My suspicion right now is the new memcg refcnt dance to select an
memcg for the global, capacity-based, shrinker, introduced in this
patch series. That dance looks solid to me tho - not sure where it
goes wrong. And, if it's a reference counting issue, that should pop
up in different places (i.e other mysterious memcg traces)...
On Tue, Apr 16, 2024 at 4:29 PM Nhat Pham <[email protected]> wrote:
>
> On Tue, Apr 16, 2024 at 3:14 PM Nhat Pham <[email protected]> wrote:
> >
> > On Tue, Apr 16, 2024 at 5:19 AM Christian Heusel <[email protected]> wrote:
> > >
> > > Hello everyone,
> >
> > Thanks for the report, Christian! Looking at it now.
> >
> > >
> > > while rebuilding a few packages in Arch Linux we have recently come
> > > across a regression in the linux kernel which was made visible by a test
> > > failure in libguestfs[0], where the booted kernel showed a Call Trace
> > > like the following one:
> > >
> > > [ 218.738568] CPU: 0 PID: 167 Comm: guestfsd Not tainted 6.7.0-rc4-1-mainline-00158-gb5ba474f3f51 #1 bf39861cf50acae7a79c534e25532f28afe4e593^M
> >
> > Is this one of the kernel versions that was broken? That looks a bit
> > odd, as zswap shrinker landed on 6.8...
>
> Ah ignore this - I understand the versioning now...
>
> >
> > > [ 218.739007] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.3-1-1 04/01/2014^M
> > > [ 218.739787] RIP: 0010:memcg_page_state+0x9/0x30^M
> > > [ 218.740299] Code: 0d b8 ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 <48> 8b 87 00 06 00 00 48 63 f6 31 d2 48 8b 04 f0 48 85 c0 48 0f 48^M
> > > [ 218.740727] RSP: 0018:ffffb5fa808dfc10 EFLAGS: 00000202^M
> > > [ 218.740862] RAX: 0000000000000000 RBX: ffffb5fa808dfce0 RCX: 0000000000000002^M
> > > [ 218.741016] RDX: 0000000000000001 RSI: 0000000000000033 RDI: 0000000000000000^M
> > > [ 218.741168] RBP: 0000000000000000 R08: ffff976681ff8000 R09: 0000000000000000^M
> > > [ 218.741322] R10: 0000000000000001 R11: ffff9766833f9d00 R12: ffff9766ffffe780^M
> > > [ 218.742167] R13: 0000000000000000 R14: ffff976680cc1800 R15: ffff976682204d80^M
> > > [ 218.742376] FS: 00007f1479d9f540(0000) GS:ffff9766fbc00000(0000) knlGS:0000000000000000^M
> > > [ 218.742569] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
> > > [ 218.743256] CR2: 0000000000000600 CR3: 0000000103606000 CR4: 0000000000750ef0^M
> > > [ 218.743494] PKRU: 55555554^M
> > > [ 218.743593] Call Trace:^M
> > > [ 218.743733] <TASK>^M
> > > [ 218.743847] ? __die+0x23/0x70^M
> > > [ 218.743957] ? page_fault_oops+0x171/0x4e0^M
> > > [ 218.744056] ? free_unref_page+0xf6/0x180^M
> > > [ 218.744458] ? exc_page_fault+0x7f/0x180^M
> > > [ 218.744551] ? asm_exc_page_fault+0x26/0x30^M
> > > [ 218.744684] ? memcg_page_state+0x9/0x30^M
> > > [ 218.744779] zswap_shrinker_count+0x9d/0x110^M
> > > [ 218.744896] do_shrink_slab+0x3a/0x360^M
> > > [ 218.744990] shrink_slab+0xc7/0x3c0^M
> > > [ 218.745609] drop_slab+0x85/0x140^M
> > > [ 218.745691] drop_caches_sysctl_handler+0x7e/0xd0^M
> > > [ 218.745799] proc_sys_call_handler+0x1c0/0x2e0^M
> > > [ 218.745912] vfs_write+0x23d/0x400^M
> > > [ 218.745998] ksys_write+0x6f/0xf0^M
> > > [ 218.746080] do_syscall_64+0x64/0xe0^M
> > > [ 218.746169] ? exit_to_user_mode_prepare+0x132/0x1f0^M
> > > [ 218.746873] entry_SYSCALL_64_after_hwframe+0x6e/0x76^M
> > >
Actually, inspecting the code a bit more - can memcg be null here?
Specifically, if mem_cgroup_disabled() is true, can we see null memcg
here? Looks like in this case, mem_cgroup_iter() can return null, and
the first iteration of drop_slab_node() can have null memcg if it's
returned by mem_cgroup_iter().
shrink_slab() will still proceed and call do_shrink_slab() if the
memcg is null - provided that mem_cgroup_disabled() holds:
if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
Inside zswap_shrink_count(), all the memcg accessors in this area seem
to always check for null memcg (mem_cgroup_lruvec,
mem_cgroup_zswap_writeback_enabled), *except* memcg_page_state, which
is the one line that fail.
If this is all to it, we can simply add a null check or
mem_cgroup_disabled() check here, and use pool stats instead?
On Tue, Apr 16, 2024 at 5:19 AM Christian Heusel <[email protected]> wrote:
>
> Hello everyone,
>
> while rebuilding a few packages in Arch Linux we have recently come
> across a regression in the linux kernel which was made visible by a test
> failure in libguestfs[0], where the booted kernel showed a Call Trace
> like the following one:
>
> [ 218.738568] CPU: 0 PID: 167 Comm: guestfsd Not tainted 6.7.0-rc4-1-mainline-00158-gb5ba474f3f51 #1 bf39861cf50acae7a79c534e25532f28afe4e593^M
> [ 218.739007] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.3-1-1 04/01/2014^M
> [ 218.739787] RIP: 0010:memcg_page_state+0x9/0x30^M
> [ 218.740299] Code: 0d b8 ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 <48> 8b 87 00 06 00 00 48 63 f6 31 d2 48 8b 04 f0 48 85 c0 48 0f 48^M
> [ 218.740727] RSP: 0018:ffffb5fa808dfc10 EFLAGS: 00000202^M
> [ 218.740862] RAX: 0000000000000000 RBX: ffffb5fa808dfce0 RCX: 0000000000000002^M
> [ 218.741016] RDX: 0000000000000001 RSI: 0000000000000033 RDI: 0000000000000000^M
> [ 218.741168] RBP: 0000000000000000 R08: ffff976681ff8000 R09: 0000000000000000^M
> [ 218.741322] R10: 0000000000000001 R11: ffff9766833f9d00 R12: ffff9766ffffe780^M
> [ 218.742167] R13: 0000000000000000 R14: ffff976680cc1800 R15: ffff976682204d80^M
> [ 218.742376] FS: 00007f1479d9f540(0000) GS:ffff9766fbc00000(0000) knlGS:0000000000000000^M
> [ 218.742569] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
> [ 218.743256] CR2: 0000000000000600 CR3: 0000000103606000 CR4: 0000000000750ef0^M
> [ 218.743494] PKRU: 55555554^M
> [ 218.743593] Call Trace:^M
> [ 218.743733] <TASK>^M
> [ 218.743847] ? __die+0x23/0x70^M
> [ 218.743957] ? page_fault_oops+0x171/0x4e0^M
> [ 218.744056] ? free_unref_page+0xf6/0x180^M
> [ 218.744458] ? exc_page_fault+0x7f/0x180^M
> [ 218.744551] ? asm_exc_page_fault+0x26/0x30^M
> [ 218.744684] ? memcg_page_state+0x9/0x30^M
> [ 218.744779] zswap_shrinker_count+0x9d/0x110^M
> [ 218.744896] do_shrink_slab+0x3a/0x360^M
> [ 218.744990] shrink_slab+0xc7/0x3c0^M
> [ 218.745609] drop_slab+0x85/0x140^M
> [ 218.745691] drop_caches_sysctl_handler+0x7e/0xd0^M
> [ 218.745799] proc_sys_call_handler+0x1c0/0x2e0^M
> [ 218.745912] vfs_write+0x23d/0x400^M
> [ 218.745998] ksys_write+0x6f/0xf0^M
> [ 218.746080] do_syscall_64+0x64/0xe0^M
> [ 218.746169] ? exit_to_user_mode_prepare+0x132/0x1f0^M
> [ 218.746873] entry_SYSCALL_64_after_hwframe+0x6e/0x76^M
>
> The regression is present in the mainline kernel and also was
> independently reported to the redhat bugtracker[1].
>
> I have bisected (see log[2]) the regression between v6.9-rc4 and v6.6
> and have landed on the following results (removed unrelated test commit)
> as remainders since some of the commits were not buildable for me:
> - 7108cc3f765c ("mm: memcg: add per-memcg zswap writeback stat")
> - a65b0e7607cc ("zswap: make shrinking memcg-aware")
> - b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")
>
> I have decided on good/bad commits with the relevant libguestfs tests,
> but I think the reproducer in the redhat bugzilla is simpler (although I
> only became aware of it during the bisection and therefore didn't test
> it myself):
>
> LIBGUESTFS_MEMSIZE=4096 LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1 make -C /build/libguestfs/src/libguestfs-1.52.0/tests -k check TESTS=c-api/tests
>
I have a suspect for the bug:
https://lore.kernel.org/all/CAKEwX=MWPUf1NMGdn+1AkRdOUf25ifAbPyoP9zppPTx3U3Tv2Q@mail.gmail.com/
Feel free to fact check me, but let me see if I can reproduce this bug
on my own setting based on this analysis and send a fix accordingly :)
On 2024/4/17 08:22, Nhat Pham wrote:
> On Tue, Apr 16, 2024 at 4:29 PM Nhat Pham <[email protected]> wrote:
>>
>> On Tue, Apr 16, 2024 at 3:14 PM Nhat Pham <[email protected]> wrote:
>>>
>>> On Tue, Apr 16, 2024 at 5:19 AM Christian Heusel <[email protected]> wrote:
>>>>
>>>> Hello everyone,
>>>
>>> Thanks for the report, Christian! Looking at it now.
>>>
>>>>
>>>> while rebuilding a few packages in Arch Linux we have recently come
>>>> across a regression in the linux kernel which was made visible by a test
>>>> failure in libguestfs[0], where the booted kernel showed a Call Trace
>>>> like the following one:
>>>>
>>>> [ 218.738568] CPU: 0 PID: 167 Comm: guestfsd Not tainted 6.7.0-rc4-1-mainline-00158-gb5ba474f3f51 #1 bf39861cf50acae7a79c534e25532f28afe4e593^M
>>>
>>> Is this one of the kernel versions that was broken? That looks a bit
>>> odd, as zswap shrinker landed on 6.8...
>>
>> Ah ignore this - I understand the versioning now...
>>
>>>
>>>> [ 218.739007] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.3-1-1 04/01/2014^M
>>>> [ 218.739787] RIP: 0010:memcg_page_state+0x9/0x30^M
>>>> [ 218.740299] Code: 0d b8 ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 <48> 8b 87 00 06 00 00 48 63 f6 31 d2 48 8b 04 f0 48 85 c0 48 0f 48^M
>>>> [ 218.740727] RSP: 0018:ffffb5fa808dfc10 EFLAGS: 00000202^M
>>>> [ 218.740862] RAX: 0000000000000000 RBX: ffffb5fa808dfce0 RCX: 0000000000000002^M
>>>> [ 218.741016] RDX: 0000000000000001 RSI: 0000000000000033 RDI: 0000000000000000^M
>>>> [ 218.741168] RBP: 0000000000000000 R08: ffff976681ff8000 R09: 0000000000000000^M
>>>> [ 218.741322] R10: 0000000000000001 R11: ffff9766833f9d00 R12: ffff9766ffffe780^M
>>>> [ 218.742167] R13: 0000000000000000 R14: ffff976680cc1800 R15: ffff976682204d80^M
>>>> [ 218.742376] FS: 00007f1479d9f540(0000) GS:ffff9766fbc00000(0000) knlGS:0000000000000000^M
>>>> [ 218.742569] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
>>>> [ 218.743256] CR2: 0000000000000600 CR3: 0000000103606000 CR4: 0000000000750ef0^M
>>>> [ 218.743494] PKRU: 55555554^M
>>>> [ 218.743593] Call Trace:^M
>>>> [ 218.743733] <TASK>^M
>>>> [ 218.743847] ? __die+0x23/0x70^M
>>>> [ 218.743957] ? page_fault_oops+0x171/0x4e0^M
>>>> [ 218.744056] ? free_unref_page+0xf6/0x180^M
>>>> [ 218.744458] ? exc_page_fault+0x7f/0x180^M
>>>> [ 218.744551] ? asm_exc_page_fault+0x26/0x30^M
>>>> [ 218.744684] ? memcg_page_state+0x9/0x30^M
>>>> [ 218.744779] zswap_shrinker_count+0x9d/0x110^M
>>>> [ 218.744896] do_shrink_slab+0x3a/0x360^M
>>>> [ 218.744990] shrink_slab+0xc7/0x3c0^M
>>>> [ 218.745609] drop_slab+0x85/0x140^M
>>>> [ 218.745691] drop_caches_sysctl_handler+0x7e/0xd0^M
>>>> [ 218.745799] proc_sys_call_handler+0x1c0/0x2e0^M
>>>> [ 218.745912] vfs_write+0x23d/0x400^M
>>>> [ 218.745998] ksys_write+0x6f/0xf0^M
>>>> [ 218.746080] do_syscall_64+0x64/0xe0^M
>>>> [ 218.746169] ? exit_to_user_mode_prepare+0x132/0x1f0^M
>>>> [ 218.746873] entry_SYSCALL_64_after_hwframe+0x6e/0x76^M
>>>>
>
> Actually, inspecting the code a bit more - can memcg be null here?
>
> Specifically, if mem_cgroup_disabled() is true, can we see null memcg
> here? Looks like in this case, mem_cgroup_iter() can return null, and
> the first iteration of drop_slab_node() can have null memcg if it's
> returned by mem_cgroup_iter().
>
> shrink_slab() will still proceed and call do_shrink_slab() if the
> memcg is null - provided that mem_cgroup_disabled() holds:
>
> if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
> return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>
Ah, I think your analysis is very right, here memcg is NULL but memcg_page_state
won't check.
> Inside zswap_shrink_count(), all the memcg accessors in this area seem
> to always check for null memcg (mem_cgroup_lruvec,
> mem_cgroup_zswap_writeback_enabled), *except* memcg_page_state, which
> is the one line that fail.
>
> If this is all to it, we can simply add a null check or
> mem_cgroup_disabled() check here, and use pool stats instead?
Both look ok to me. The VM could only set sc->memcg to NULL when memcg
disabled, right?
Thanks.
On Wed, Apr 17, 2024 at 11:44:45AM +0800, Chengming Zhou wrote:
> On 2024/4/17 08:22, Nhat Pham wrote:
> > On Tue, Apr 16, 2024 at 4:29 PM Nhat Pham <[email protected]> wrote:
> >>
> >> On Tue, Apr 16, 2024 at 3:14 PM Nhat Pham <[email protected]> wrote:
> >>>
> >>> On Tue, Apr 16, 2024 at 5:19 AM Christian Heusel <[email protected]> wrote:
> >>>>
> >>>> Hello everyone,
> >>>
> >>> Thanks for the report, Christian! Looking at it now.
> >>>
> >>>>
> >>>> while rebuilding a few packages in Arch Linux we have recently come
> >>>> across a regression in the linux kernel which was made visible by a test
> >>>> failure in libguestfs[0], where the booted kernel showed a Call Trace
> >>>> like the following one:
> >>>>
> >>>> [ 218.738568] CPU: 0 PID: 167 Comm: guestfsd Not tainted 6.7.0-rc4-1-mainline-00158-gb5ba474f3f51 #1 bf39861cf50acae7a79c534e25532f28afe4e593^M
> >>>
> >>> Is this one of the kernel versions that was broken? That looks a bit
> >>> odd, as zswap shrinker landed on 6.8...
> >>
> >> Ah ignore this - I understand the versioning now...
> >>
> >>>
> >>>> [ 218.739007] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.3-1-1 04/01/2014^M
> >>>> [ 218.739787] RIP: 0010:memcg_page_state+0x9/0x30^M
> >>>> [ 218.740299] Code: 0d b8 ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 <48> 8b 87 00 06 00 00 48 63 f6 31 d2 48 8b 04 f0 48 85 c0 48 0f 48^M
> >>>> [ 218.740727] RSP: 0018:ffffb5fa808dfc10 EFLAGS: 00000202^M
> >>>> [ 218.740862] RAX: 0000000000000000 RBX: ffffb5fa808dfce0 RCX: 0000000000000002^M
> >>>> [ 218.741016] RDX: 0000000000000001 RSI: 0000000000000033 RDI: 0000000000000000^M
> >>>> [ 218.741168] RBP: 0000000000000000 R08: ffff976681ff8000 R09: 0000000000000000^M
> >>>> [ 218.741322] R10: 0000000000000001 R11: ffff9766833f9d00 R12: ffff9766ffffe780^M
> >>>> [ 218.742167] R13: 0000000000000000 R14: ffff976680cc1800 R15: ffff976682204d80^M
> >>>> [ 218.742376] FS: 00007f1479d9f540(0000) GS:ffff9766fbc00000(0000) knlGS:0000000000000000^M
> >>>> [ 218.742569] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
> >>>> [ 218.743256] CR2: 0000000000000600 CR3: 0000000103606000 CR4: 0000000000750ef0^M
> >>>> [ 218.743494] PKRU: 55555554^M
> >>>> [ 218.743593] Call Trace:^M
> >>>> [ 218.743733] <TASK>^M
> >>>> [ 218.743847] ? __die+0x23/0x70^M
> >>>> [ 218.743957] ? page_fault_oops+0x171/0x4e0^M
> >>>> [ 218.744056] ? free_unref_page+0xf6/0x180^M
> >>>> [ 218.744458] ? exc_page_fault+0x7f/0x180^M
> >>>> [ 218.744551] ? asm_exc_page_fault+0x26/0x30^M
> >>>> [ 218.744684] ? memcg_page_state+0x9/0x30^M
> >>>> [ 218.744779] zswap_shrinker_count+0x9d/0x110^M
> >>>> [ 218.744896] do_shrink_slab+0x3a/0x360^M
> >>>> [ 218.744990] shrink_slab+0xc7/0x3c0^M
> >>>> [ 218.745609] drop_slab+0x85/0x140^M
> >>>> [ 218.745691] drop_caches_sysctl_handler+0x7e/0xd0^M
> >>>> [ 218.745799] proc_sys_call_handler+0x1c0/0x2e0^M
> >>>> [ 218.745912] vfs_write+0x23d/0x400^M
> >>>> [ 218.745998] ksys_write+0x6f/0xf0^M
> >>>> [ 218.746080] do_syscall_64+0x64/0xe0^M
> >>>> [ 218.746169] ? exit_to_user_mode_prepare+0x132/0x1f0^M
> >>>> [ 218.746873] entry_SYSCALL_64_after_hwframe+0x6e/0x76^M
> >>>>
> >
> > Actually, inspecting the code a bit more - can memcg be null here?
> >
> > Specifically, if mem_cgroup_disabled() is true, can we see null memcg
> > here? Looks like in this case, mem_cgroup_iter() can return null, and
> > the first iteration of drop_slab_node() can have null memcg if it's
> > returned by mem_cgroup_iter().
> >
> > shrink_slab() will still proceed and call do_shrink_slab() if the
> > memcg is null - provided that mem_cgroup_disabled() holds:
> >
> > if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
> > return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
> >
>
> Ah, I think your analysis is very right, here memcg is NULL but memcg_page_state
> won't check.
+1.
I could reproduce the NULL crash locally with cgroup_disable=memory,
the shrinker enabled, and echo 3 >/proc/sys/vm/drop_caches.
> > Inside zswap_shrink_count(), all the memcg accessors in this area seem
> > to always check for null memcg (mem_cgroup_lruvec,
> > mem_cgroup_zswap_writeback_enabled), *except* memcg_page_state, which
> > is the one line that fail.
> >
> > If this is all to it, we can simply add a null check or
> > mem_cgroup_disabled() check here, and use pool stats instead?
>
> Both look ok to me. The VM could only set sc->memcg to NULL when memcg
> disabled, right?
Christian, can you please test the below patch on top of current
upstream?
diff --git a/mm/zswap.c b/mm/zswap.c
index caed028945b0..6f8850c44b61 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1331,15 +1331,22 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
if (!gfp_has_io_fs(sc->gfp_mask))
return 0;
-#ifdef CONFIG_MEMCG_KMEM
- mem_cgroup_flush_stats(memcg);
- nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
- nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
-#else
- /* use pool stats instead of memcg stats */
- nr_backing = zswap_pool_total_size >> PAGE_SHIFT;
- nr_stored = atomic_read(&zswap_nr_stored);
-#endif
+ /*
+ * For memcg, use the cgroup-wide ZSWAP stats since we don't
+ * have them per-node and thus per-lruvec. Careful if memcg is
+ * runtime-disabled: we can get sc->memcg == NULL, which is ok
+ * for the lruvec, but not for memcg_page_state().
+ *
+ * Without memcg, use the zswap pool-wide metrics.
+ */
+ if (!mem_cgroup_disabled()) {
+ mem_cgroup_flush_stats(memcg);
+ nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
+ nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
+ } else {
+ nr_backing = zswap_pool_total_size >> PAGE_SHIFT;
+ nr_stored = atomic_read(&zswap_nr_stored);
+ }
if (!nr_stored)
return 0;
On Wed, Apr 17, 2024 at 10:33:24AM -0400, Johannes Weiner wrote:
> I could reproduce the NULL crash locally with cgroup_disable=memory,
> the shrinker enabled, and echo 3 >/proc/sys/vm/drop_caches.
Can confirm that libguestfs uses cgroup_disable=memory for its
appliance (as it saves some RAM and we don't need cgroups).
Rich.
--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
On 24/04/17 10:33AM, Johannes Weiner wrote:
>
> Christian, can you please test the below patch on top of current
> upstream?
>
Hey Johannes,
I have applied your patch on top of 6.9-rc4 and it did solve the crash for
me, thanks for hacking together a fix so quickly! ????
Tested-By: Christian Heusel <[email protected]>
Cheers,
Christian
On 24/04/16 03:14PM, Nhat Pham wrote:
> Is this one of the kernel versions that was broken? That looks a bit
> odd, as zswap shrinker landed on 6.8...
Yes that is the build at commit b5ba474f3f51 ("zswap: shrink zswap pool
based on memory pressure") which is one of the kernel versions that is
broken, although the version string seems a bit confusing. This is due
to the build script invoking "make -s kernelrelease" which gives the
following:
$ git checkout b5ba474f3f51
[...]
$ make -s kernelrelease
6.7.0-rc4-00158-gb5ba474f3f51
The commit is contained in v6.8+ only as you say, so its just confusing
in the version string:
$ git tag --contains b5ba474f3f51 | grep -v next
v6.8
v6.8-rc1
v6.8-rc2
v6.8-rc3
v6.8-rc4
v6.8-rc5
v6.8-rc6
v6.8-rc7
v6.9-rc1
v6.9-rc2
v6.9-rc3
v6.9-rc4
Please see my initial report for the detailed bisection results and the
link there to the bisection log.
Cheers and thanks for looking into this!
Christian
On Wed, Apr 17, 2024 at 07:18:14PM +0200, Christian Heusel wrote:
> On 24/04/17 10:33AM, Johannes Weiner wrote:
> >
> > Christian, can you please test the below patch on top of current
> > upstream?
> >
>
> Hey Johannes,
>
> I have applied your patch on top of 6.9-rc4 and it did solve the crash for
> me, thanks for hacking together a fix so quickly! ????
>
> Tested-By: Christian Heusel <[email protected]>
Thanks for confirming it, and sorry about the breakage.
Andrew, can you please use the updated changelog below?
---
From 52f67f5fab6a743c2aedfc8e04a582a9d1025c28 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <[email protected]>
Date: Thu, 18 Apr 2024 08:26:28 -0400
Subject: [PATCH] mm: zswap: fix shrinker NULL crash with cgroup_disable=memory
Christian reports a NULL deref in zswap that he bisected down to the
zswap shrinker. The issue also cropped up in the bug trackers of
libguestfs [1] and the Red Hat bugzilla [2].
The problem is that when memcg is disabled with the boot time flag,
the zswap shrinker might get called with sc->memcg == NULL. This is
okay in many places, like the lruvec operations. But it crashes in
memcg_page_state() - which is only used due to the non-node accounting
of cgroup's the zswap memory to begin with.
Nhat spotted that the memcg can be NULL in the memcg-disabled case,
and I was then able to reproduce the crash locally as well.
[1] https://github.com/libguestfs/libguestfs/issues/139
[2] https://bugzilla.redhat.com/show_bug.cgi?id=2275252
Fixes: b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")
Cc: [email protected] [v6.8]
Link: https://lkml.kernel.org/r/[email protected]
Reported-by: Christian Heusel <[email protected]>
Debugged-by: Nhat Pham <[email protected]>
Suggested-by: Nhat Pham <[email protected]>
Tested-By: Christian Heusel <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/zswap.c | 25 ++++++++++++++++---------
1 file changed, 16 insertions(+), 9 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index caed028945b0..6f8850c44b61 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1331,15 +1331,22 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
if (!gfp_has_io_fs(sc->gfp_mask))
return 0;
-#ifdef CONFIG_MEMCG_KMEM
- mem_cgroup_flush_stats(memcg);
- nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
- nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
-#else
- /* use pool stats instead of memcg stats */
- nr_backing = zswap_pool_total_size >> PAGE_SHIFT;
- nr_stored = atomic_read(&zswap_nr_stored);
-#endif
+ /*
+ * For memcg, use the cgroup-wide ZSWAP stats since we don't
+ * have them per-node and thus per-lruvec. Careful if memcg is
+ * runtime-disabled: we can get sc->memcg == NULL, which is ok
+ * for the lruvec, but not for memcg_page_state().
+ *
+ * Without memcg, use the zswap pool-wide metrics.
+ */
+ if (!mem_cgroup_disabled()) {
+ mem_cgroup_flush_stats(memcg);
+ nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
+ nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
+ } else {
+ nr_backing = zswap_pool_total_size >> PAGE_SHIFT;
+ nr_stored = atomic_read(&zswap_nr_stored);
+ }
if (!nr_stored)
return 0;
--
2.44.0
On 18.04.24 14:40, Johannes Weiner wrote:
> On Wed, Apr 17, 2024 at 07:18:14PM +0200, Christian Heusel wrote:
>> On 24/04/17 10:33AM, Johannes Weiner wrote:
> Christian reports a NULL deref in zswap that he bisected down to the
> zswap shrinker. The issue also cropped up in the bug trackers of
> libguestfs [1] and the Red Hat bugzilla [2].
>
> The problem is that when memcg is disabled with the boot time flag,
> the zswap shrinker might get called with sc->memcg == NULL. This is
> okay in many places, like the lruvec operations. But it crashes in
> memcg_page_state() - which is only used due to the non-node accounting
> of cgroup's the zswap memory to begin with.
>
> Nhat spotted that the memcg can be NULL in the memcg-disabled case,
> and I was then able to reproduce the crash locally as well.
Thx for the fix. Nitpicking:
> [1] https://github.com/libguestfs/libguestfs/issues/139
> [2] https://bugzilla.redhat.com/show_bug.cgi?id=2275252
FWIW, those should ideally look like this:
Link: https://github.com/libguestfs/libguestfs/issues/139 [1]
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2275252 [2]
> Fixes: b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")
> Cc: [email protected] [v6.8]
> Link: https://lkml.kernel.org/r/[email protected]
> Reported-by: Christian Heusel <[email protected]>
And here checkpatch.pl should have complained that the above line should
ideally be followed by a Link or Closes tag to the report, e.g.:
Closes:
https://lore.kernel.org/all/3iccc6vjl5gminut3lvpl4va2lbnsgku5ei2d7ylftoofy3n2v@gcfdvtsq6dx2/
Which in this case would be nice, as I'm tracking this regression, hence
regzbot will then track the patch and consider the regression resolved
once the fix lands in mainline.
Ciao, Thorsten
On Thu, Apr 18, 2024 at 5:40 AM Johannes Weiner <[email protected]> wrote:
>
> On Wed, Apr 17, 2024 at 07:18:14PM +0200, Christian Heusel wrote:
> > On 24/04/17 10:33AM, Johannes Weiner wrote:
> > >
> > > Christian, can you please test the below patch on top of current
> > > upstream?
> > >
> >
> > Hey Johannes,
> >
> > I have applied your patch on top of 6.9-rc4 and it did solve the crash for
> > me, thanks for hacking together a fix so quickly! ????
> >
> > Tested-By: Christian Heusel <[email protected]>
>
> Thanks for confirming it, and sorry about the breakage.
>
> Andrew, can you please use the updated changelog below?
>
> ---
>
> From 52f67f5fab6a743c2aedfc8e04a582a9d1025c28 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <[email protected]>
> Date: Thu, 18 Apr 2024 08:26:28 -0400
> Subject: [PATCH] mm: zswap: fix shrinker NULL crash with cgroup_disable=memory
>
> Christian reports a NULL deref in zswap that he bisected down to the
> zswap shrinker. The issue also cropped up in the bug trackers of
> libguestfs [1] and the Red Hat bugzilla [2].
>
> The problem is that when memcg is disabled with the boot time flag,
> the zswap shrinker might get called with sc->memcg == NULL. This is
> okay in many places, like the lruvec operations. But it crashes in
> memcg_page_state() - which is only used due to the non-node accounting
> of cgroup's the zswap memory to begin with.
>
> Nhat spotted that the memcg can be NULL in the memcg-disabled case,
> and I was then able to reproduce the crash locally as well.
>
> [1] https://github.com/libguestfs/libguestfs/issues/139
> [2] https://bugzilla.redhat.com/show_bug.cgi?id=2275252
>
> Fixes: b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")
> Cc: [email protected] [v6.8]
> Link: https://lkml.kernel.org/r/[email protected]
> Reported-by: Christian Heusel <[email protected]>
> Debugged-by: Nhat Pham <[email protected]>
> Suggested-by: Nhat Pham <[email protected]>
> Tested-By: Christian Heusel <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>
Thanks for fixing this. A couple of comments/questions below, but anyway LGTM:
Acked-by: Yosry Ahmed <[email protected]>
> ---
> mm/zswap.c | 25 ++++++++++++++++---------
> 1 file changed, 16 insertions(+), 9 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index caed028945b0..6f8850c44b61 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1331,15 +1331,22 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
> if (!gfp_has_io_fs(sc->gfp_mask))
> return 0;
>
> -#ifdef CONFIG_MEMCG_KMEM
> - mem_cgroup_flush_stats(memcg);
> - nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
> - nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
> -#else
> - /* use pool stats instead of memcg stats */
> - nr_backing = zswap_pool_total_size >> PAGE_SHIFT;
> - nr_stored = atomic_read(&zswap_nr_stored);
> -#endif
> + /*
> + * For memcg, use the cgroup-wide ZSWAP stats since we don't
> + * have them per-node and thus per-lruvec. Careful if memcg is
> + * runtime-disabled: we can get sc->memcg == NULL, which is ok
> + * for the lruvec, but not for memcg_page_state().
> + *
> + * Without memcg, use the zswap pool-wide metrics.
> + */
> + if (!mem_cgroup_disabled()) {
With the current shrinker code, it seems like we cannot get sc->memcg
== NULL unless mem_cgroup_disabled() is true indeed. However, maybe
it's better to check if sc->memcg is not NULL directly instead?
This would be more resilient in case the shrinker code changes and
passing sc->memcg == NULL becomes possible in other cases (e.g. during
global shrinking). It seems like other shrinkers do this, for example
see count_shadow_nodes() and deferred_split_count().
I am also wondering if we should also check !mem_cgroup_is_root()
here? We can avoid the expensive global flush and use the global stats
directly in this case. I could also send a follow up patch for this if
that's preferred.
> + mem_cgroup_flush_stats(memcg);
> + nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
> + nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
> + } else {
> + nr_backing = zswap_pool_total_size >> PAGE_SHIFT;
> + nr_stored = atomic_read(&zswap_nr_stored);
> + }
>
> if (!nr_stored)
> return 0;
> --
> 2.44.0
On Thu, Apr 18, 2024 at 01:09:22PM -0700, Yosry Ahmed wrote:
> On Thu, Apr 18, 2024 at 5:40 AM Johannes Weiner <[email protected]> wrote:
> >
> > On Wed, Apr 17, 2024 at 07:18:14PM +0200, Christian Heusel wrote:
> > > On 24/04/17 10:33AM, Johannes Weiner wrote:
> > > >
> > > > Christian, can you please test the below patch on top of current
> > > > upstream?
> > > >
> > >
> > > Hey Johannes,
> > >
> > > I have applied your patch on top of 6.9-rc4 and it did solve the crash for
> > > me, thanks for hacking together a fix so quickly! ????
> > >
> > > Tested-By: Christian Heusel <[email protected]>
> >
> > Thanks for confirming it, and sorry about the breakage.
> >
> > Andrew, can you please use the updated changelog below?
> >
> > ---
> >
> > From 52f67f5fab6a743c2aedfc8e04a582a9d1025c28 Mon Sep 17 00:00:00 2001
> > From: Johannes Weiner <[email protected]>
> > Date: Thu, 18 Apr 2024 08:26:28 -0400
> > Subject: [PATCH] mm: zswap: fix shrinker NULL crash with cgroup_disable=memory
> >
> > Christian reports a NULL deref in zswap that he bisected down to the
> > zswap shrinker. The issue also cropped up in the bug trackers of
> > libguestfs [1] and the Red Hat bugzilla [2].
> >
> > The problem is that when memcg is disabled with the boot time flag,
> > the zswap shrinker might get called with sc->memcg == NULL. This is
> > okay in many places, like the lruvec operations. But it crashes in
> > memcg_page_state() - which is only used due to the non-node accounting
> > of cgroup's the zswap memory to begin with.
> >
> > Nhat spotted that the memcg can be NULL in the memcg-disabled case,
> > and I was then able to reproduce the crash locally as well.
> >
> > [1] https://github.com/libguestfs/libguestfs/issues/139
> > [2] https://bugzilla.redhat.com/show_bug.cgi?id=2275252
> >
> > Fixes: b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")
> > Cc: [email protected] [v6.8]
> > Link: https://lkml.kernel.org/r/[email protected]
> > Reported-by: Christian Heusel <[email protected]>
> > Debugged-by: Nhat Pham <[email protected]>
> > Suggested-by: Nhat Pham <[email protected]>
> > Tested-By: Christian Heusel <[email protected]>
> > Signed-off-by: Johannes Weiner <[email protected]>
>
> Thanks for fixing this. A couple of comments/questions below, but anyway LGTM:
>
> Acked-by: Yosry Ahmed <[email protected]>
>
> > ---
> > mm/zswap.c | 25 ++++++++++++++++---------
> > 1 file changed, 16 insertions(+), 9 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index caed028945b0..6f8850c44b61 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1331,15 +1331,22 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
> > if (!gfp_has_io_fs(sc->gfp_mask))
> > return 0;
> >
> > -#ifdef CONFIG_MEMCG_KMEM
> > - mem_cgroup_flush_stats(memcg);
> > - nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
> > - nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
> > -#else
> > - /* use pool stats instead of memcg stats */
> > - nr_backing = zswap_pool_total_size >> PAGE_SHIFT;
> > - nr_stored = atomic_read(&zswap_nr_stored);
> > -#endif
> > + /*
> > + * For memcg, use the cgroup-wide ZSWAP stats since we don't
> > + * have them per-node and thus per-lruvec. Careful if memcg is
> > + * runtime-disabled: we can get sc->memcg == NULL, which is ok
> > + * for the lruvec, but not for memcg_page_state().
> > + *
> > + * Without memcg, use the zswap pool-wide metrics.
> > + */
> > + if (!mem_cgroup_disabled()) {
>
> With the current shrinker code, it seems like we cannot get sc->memcg
> == NULL unless mem_cgroup_disabled() is true indeed. However, maybe
> it's better to check if sc->memcg is not NULL directly instead?
>
> This would be more resilient in case the shrinker code changes and
> passing sc->memcg == NULL becomes possible in other cases (e.g. during
> global shrinking). It seems like other shrinkers do this, for example
> see count_shadow_nodes() and deferred_split_count().
Eh, I'm not sure it's better or worse, so it's a bit hard to care. We
shouldn't get NULL here when memcg is enabled, and if somebody
introduces that bug it's better to catch it early than run into subtle
priority inversions when the kernel is deployed to millions of hosts.
> I am also wondering if we should also check !mem_cgroup_is_root()
> here? We can avoid the expensive global flush and use the global stats
> directly in this case. I could also send a follow up patch for this if
> that's preferred.
I'd rather not proliferate more memcg internals in this code. If this
is a concern, optimizing it in the flush and stat functions would make
more sense. Reclaim already flushes the subtree before getting here,
so odds are good this is a no-op in most cases.