2024-05-21 20:17:42

by Mikhail Gavrilov

Subject: 6.10/bisected/regression - commit 8430557fc584 cause warning at mm/page_table_check.c:198 __page_table_check_ptes_set+0x306

Hi,
Yesterday, after the latest kernel snapshot update, I spotted a new warning
at mm/page_table_check.c:198 with the following stack trace:
[ 5.524572] debug_vm_pgtable: [debug_vm_pgtable ]:
Validating architecture page table helpers
[ 5.572473] ------------[ cut here ]------------
[ 5.572871] WARNING: CPU: 0 PID: 1 at mm/page_table_check.c:198
__page_table_check_ptes_set+0x306/0x3c0
[ 5.573364] Modules linked in:
[ 5.573604] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G W
------- ---
6.10.0-0.rc0.20240520giteb6a9339efeb.9.fc41.x86_64+debug #1
[ 5.574089] Hardware name: ASRock B650I Lightning WiFi/B650I
Lightning WiFi, BIOS 2.10 03/20/2024
[ 5.574339] RIP: 0010:__page_table_check_ptes_set+0x306/0x3c0
[ 5.574591] Code: 74 24 04 89 ea 48 89 df e8 e7 f3 ff ff e9 12 ff
ff ff 0f 1f 44 00 00 48 c1 e8 06 89 c5 83 e5 01 e9 b0 fe ff ff f6 c2
02 74 31 <0f> 0b e9 de fd ff ff 49 83 e7 f7 48 89 c1 4c 21 f9 89 ca 83
e1 02
[ 5.575434] RSP: 0018:ffffc9000018f9d0 EFLAGS: 00010246
[ 5.575739] RAX: fff0000000000fff RBX: ffff888124da5000 RCX: 0000000000000001
[ 5.576064] RDX: 0000000000000040 RSI: bffffffffffffff5 RDI: ffffc9000018fa00
[ 5.576395] RBP: ffff888124511e40 R08: 0000000000000000 R09: 0000000000000001
[ 5.576730] R10: ffffffff97f63527 R11: 0000000000000000 R12: ffffea0005000008
[ 5.577048] R13: 1ffff92000031f3c R14: 0000000000000000 R15: bffffffffffffff5
[ 5.577335] FS: 0000000000000000(0000) GS:ffff888df7e00000(0000)
knlGS:0000000000000000
[ 5.577631] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5.577925] CR2: ffff888a53601000 CR3: 0000000a4de98000 CR4: 0000000000f50ef0
[ 5.578208] PKRU: 55555554
[ 5.578483] Call Trace:
[ 5.578496] usb 1-3: new high-speed USB device number 2 using xhci_hcd
[ 5.578760] <TASK>
[ 5.579331] ? __warn.cold+0x5b/0x1af
[ 5.579618] ? __page_table_check_ptes_set+0x306/0x3c0
[ 5.579903] ? report_bug+0x1fc/0x3d0
[ 5.580188] ? handle_bug+0x3c/0x80
[ 5.580461] ? exc_invalid_op+0x17/0x40
[ 5.580731] ? asm_exc_invalid_op+0x1a/0x20
[ 5.581003] ? __page_table_check_ptes_set+0x306/0x3c0
[ 5.581274] ? __pfx___page_table_check_ptes_set+0x10/0x10
[ 5.581544] ? __pfx_check_pgprot+0x10/0x10
[ 5.581806] set_ptes.constprop.0+0x66/0xd0
[ 5.582072] ? __pfx_set_ptes.constprop.0+0x10/0x10
[ 5.582333] ? __pfx_pte_val+0x10/0x10
[ 5.582595] debug_vm_pgtable+0x1c04/0x3360
[ 5.582849] ? __pfx_debug_vm_pgtable+0x10/0x10
[ 5.583099] ? add_device_randomness+0xb8/0xf0
[ 5.583334] ? __pfx_add_device_randomness+0x10/0x10
[ 5.583573] ? __pfx_debug_vm_pgtable+0x10/0x10
[ 5.583804] do_one_initcall+0xd6/0x460
[ 5.584034] ? __pfx_do_one_initcall+0x10/0x10
[ 5.584252] ? kernel_init_freeable+0x4cb/0x750
[ 5.584465] kernel_init_freeable+0x6b4/0x750
[ 5.584674] ? __pfx_kernel_init_freeable+0x10/0x10
[ 5.584877] ? __pfx_kernel_init+0x10/0x10
[ 5.585068] ? __pfx_kernel_init+0x10/0x10
[ 5.585253] kernel_init+0x1c/0x150
[ 5.585434] ? __pfx_kernel_init+0x10/0x10
[ 5.585616] ret_from_fork+0x31/0x70
[ 5.585791] ? __pfx_kernel_init+0x10/0x10
[ 5.585971] ret_from_fork_asm+0x1a/0x30
[ 5.586146] </TASK>
[ 5.586312] irq event stamp: 1743772
[ 5.586475] hardirqs last enabled at (1743771):
[<ffffffff92c35f2e>] kasan_quarantine_put+0x12e/0x250
[ 5.586816] hardirqs last disabled at (1743772):
[<ffffffff9546895c>] _raw_spin_lock_irqsave+0x7c/0xa0
[ 5.587185] softirqs last enabled at (1742786):
[<ffffffff922721fb>] __irq_exit_rcu+0xbb/0x1c0
[ 5.587379] softirqs last disabled at (1742781):
[<ffffffff922721fb>] __irq_exit_rcu+0xbb/0x1c0
[ 5.587573] ---[ end trace 0000000000000000 ]---
[ 5.656111] page_owner is disabled

The bisect points to this commit:
8430557fc584657559bfbd5150b6ae1bb90f35a0
Author: Peter Xu <[email protected]>
Date: Wed Apr 17 17:25:49 2024 -0400

mm/page_table_check: support userfault wr-protect entries

Allow page_table_check hooks to check over userfaultfd wr-protect criteria
upon pgtable updates. The rule is no co-existance allowed for any
writable flag against userfault wr-protect flag.

This should be better than c2da319c2e, where we used to only sanitize such
issues during a pgtable walk, but when hitting such issue we don't have a
good chance to know where does that writable bit came from [1], so that
even the pgtable walk exposes a kernel bug (which is still helpful on
triaging) but not easy to track and debug.

Now we switch to track the source. It's much easier too with the recent
introduction of page table check.

There are some limitations with using the page table check here for
userfaultfd wr-protect purpose:

- It is only enabled with explicit enablement of page table check configs
and/or boot parameters, but should be good enough to track at least
syzbot issues, as syzbot should enable PAGE_TABLE_CHECK[_ENFORCED] for
x86 [1]. We used to have DEBUG_VM but it's now off for most distros,
while distros also normally not enable PAGE_TABLE_CHECK[_ENFORCED], which
is similar.

- It conditionally works with the ptep_modify_prot API. It will be
bypassed when e.g. XEN PV is enabled, however still work for most of the
rest scenarios, which should be the common cases so should be good
enough.

- Hugetlb check is a bit hairy, as the page table check cannot identify
hugetlb pte or normal pte via trapping at set_pte_at(), because of the
current design where hugetlb maps every layers to pte_t... For example,
the default set_huge_pte_at() can invoke set_pte_at() directly and lose
the hugetlb context, treating it the same as a normal pte_t. So far it's
fine because we have huge_pte_uffd_wp() always equals to pte_uffd_wp() as
long as supported (x86 only). It'll be a bigger problem when we'll
define _PAGE_UFFD_WP differently at various pgtable levels, because then
one huge_pte_uffd_wp() per-arch will stop making sense first.. as of now
we can leave this for later too.

This patch also removes commit c2da319c2e altogether, as we have something
better now.

[1] https://lore.kernel.org/all/[email protected]/

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: Pasha Tatashin <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Nadav Amit <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

Documentation/mm/page_table_check.rst | 9 ++++++++-
arch/x86/include/asm/pgtable.h | 18 +-----------------
mm/page_table_check.c | 30 ++++++++++++++++++++++++++++++
3 files changed, 39 insertions(+), 18 deletions(-)


To confirm that the bisect was correct, I reverted this commit and
tested the kernel snapshot again.
And yes, the warning message is gone.

I have also attached the full kernel log and build config below.

My hardware specs: https://linux-hardware.org/?probe=b34f0353df

Peter, could you please take a look?

--
Best Regards,
Mike Gavrilov.


2024-05-21 20:45:10

by Peter Xu

Subject: Re: 6.10/bisected/regression - commit 8430557fc584 cause warning at mm/page_table_check.c:198 __page_table_check_ptes_set+0x306

On Wed, May 22, 2024 at 01:17:19AM +0500, Mikhail Gavrilov wrote:
> Hi,

Hi,

> I also attach below a full kernel log and build config.
>
> My hardware specs: https://linux-hardware.org/?probe=b34f0353df
>
> Peter, can you look please.

Did you forget to attach the kernel config? If so, please attach it, I'll
try that, as my local config won't reproduce with CONFIG_DEBUG_VM_PGTABLE=y.

Thanks,

--
Peter Xu


2024-05-21 20:49:42

by Mikhail Gavrilov

Subject: Re: 6.10/bisected/regression - commit 8430557fc584 cause warning at mm/page_table_check.c:198 __page_table_check_ptes_set+0x306

On Wed, May 22, 2024 at 1:45 AM Peter Xu <[email protected]> wrote:
>
>
> Did you forget to attach the kernel config? If so, please attach it, I'll
> try that, as my local config won't reproduce with CONFIG_DEBUG_VM_PGTABLE=y.
>

Oh, sorry.
Now I've definitely attached it.

--
Best Regards,
Mike Gavrilov.


Attachments:
dmesg.zip (51.94 kB)
.config.zip (64.92 kB)

2024-05-21 21:37:21

by Peter Xu

Subject: Re: 6.10/bisected/regression - commit 8430557fc584 cause warning at mm/page_table_check.c:198 __page_table_check_ptes_set+0x306

On Wed, May 22, 2024 at 01:48:36AM +0500, Mikhail Gavrilov wrote:
> On Wed, May 22, 2024 at 1:45 AM Peter Xu <[email protected]> wrote:
> >
> >
> > Did you forget to attach the kernel config? If so, please attach it, I'll
> > try that, as my local config won't reproduce with CONFIG_DEBUG_VM_PGTABLE=y.
> >
>
> Oh, sorry.
> Now I've definitely attached it.

Hmm I still cannot reproduce. Weird.

Would it be possible for you to identify which line in debug_vm_pgtable.c
triggered that issue?

I think it should be some set_pte_at() but I'm not sure, as there aren't a
lot and all of them look benign so far. It could be that I missed
something important.

Thanks,

--
Peter Xu


2024-05-21 22:21:25

by Mikhail Gavrilov

Subject: Re: 6.10/bisected/regression - commit 8430557fc584 cause warning at mm/page_table_check.c:198 __page_table_check_ptes_set+0x306

On Wed, May 22, 2024 at 2:37 AM Peter Xu <[email protected]> wrote:
> Hmm I still cannot reproduce. Weird.
>
> Would it be possible for you to identify which line in debug_vm_pgtable.c
> triggered that issue?
>
> I think it should be some set_pte_at() but I'm not sure, as there aren't a
> lot and all of them look benign so far. It could be that I missed
> something important.

I hope this helps:

> sh /usr/src/kernels/(uname -r)/scripts/faddr2line /lib/debug/lib/modules/(uname -r)/vmlinux debug_vm_pgtable+0x1c04
debug_vm_pgtable+0x1c04/0x3360:
native_ptep_get_and_clear at arch/x86/include/asm/pgtable_64.h:94
(inlined by) ptep_get_and_clear at arch/x86/include/asm/pgtable.h:1262
(inlined by) ptep_clear at include/linux/pgtable.h:509
(inlined by) pte_clear_tests at mm/debug_vm_pgtable.c:643
(inlined by) debug_vm_pgtable at mm/debug_vm_pgtable.c:1392

> cat -n /usr/src/debug/kernel-6.9-10323-g8f6a15f095a6/linux-6.10.0-0.rc0.20240521git8f6a15f095a6.10.fc41.x86_64/mm/debug_vm_pgtable.c | sed -n '1387,1397 p'
1387 * Page table modifying tests. They need to hold
1388 * proper page table lock.
1389 */
1390
1391 args.ptep = pte_offset_map_lock(args.mm, args.pmdp, args.vaddr, &ptl);
1392 pte_clear_tests(&args);
1393 pte_advanced_tests(&args);
1394 if (args.ptep)
1395 pte_unmap_unlock(args.ptep, ptl);
1396
1397 ptl = pmd_lock(args.mm, args.pmdp);

--
Best Regards,
Mike Gavrilov.

2024-05-21 22:36:17

by Peter Xu

Subject: Re: 6.10/bisected/regression - commit 8430557fc584 cause warning at mm/page_table_check.c:198 __page_table_check_ptes_set+0x306

On Wed, May 22, 2024 at 03:21:04AM +0500, Mikhail Gavrilov wrote:
> On Wed, May 22, 2024 at 2:37 AM Peter Xu <[email protected]> wrote:
> > Hmm I still cannot reproduce. Weird.
> >
> > Would it be possible for you to identify which line in debug_vm_pgtable.c
> > triggered that issue?
> >
> > I think it should be some set_pte_at() but I'm not sure, as there aren't a
> > lot and all of them look benign so far. It could be that I missed
> > something important.
>
> I hope it's helps:

Thanks for providing this. For some reason, though, it doesn't look
consistent with what was reported.

>
> > sh /usr/src/kernels/(uname -r)/scripts/faddr2line /lib/debug/lib/modules/(uname -r)/vmlinux debug_vm_pgtable+0x1c04
> debug_vm_pgtable+0x1c04/0x3360:
> native_ptep_get_and_clear at arch/x86/include/asm/pgtable_64.h:94
> (inlined by) ptep_get_and_clear at arch/x86/include/asm/pgtable.h:1262
> (inlined by) ptep_clear at include/linux/pgtable.h:509

This is a pte_clear(), and pte_clear() shouldn't even do the set() checks,
and shouldn't stumble over what I added.

IOW, it doesn't match the stack dump reported previously:

[ 5.581003] ? __page_table_check_ptes_set+0x306/0x3c0
[ 5.581274] ? __pfx___page_table_check_ptes_set+0x10/0x10
[ 5.581544] ? __pfx_check_pgprot+0x10/0x10
[ 5.581806] set_ptes.constprop.0+0x66/0xd0
[ 5.582072] ? __pfx_set_ptes.constprop.0+0x10/0x10
[ 5.582333] ? __pfx_pte_val+0x10/0x10
[ 5.582595] debug_vm_pgtable+0x1c04/0x3360

Would it be possible that e.g. you recompiled the kernel so the vmlinux
didn't match?

> (inlined by) pte_clear_tests at mm/debug_vm_pgtable.c:643
> (inlined by) debug_vm_pgtable at mm/debug_vm_pgtable.c:1392
>
> > cat -n /usr/src/debug/kernel-6.9-10323-g8f6a15f095a6/linux-6.10.0-0.rc0.20240521git8f6a15f095a6.10.fc41.x86_64/mm/debug_vm_pgtable.c | sed -n '1387,1397 p'
> 1387 * Page table modifying tests. They need to hold
> 1388 * proper page table lock.
> 1389 */
> 1390
> 1391 args.ptep = pte_offset_map_lock(args.mm, args.pmdp, args.vaddr, &ptl);
> 1392 pte_clear_tests(&args);
> 1393 pte_advanced_tests(&args);
> 1394 if (args.ptep)
> 1395 pte_unmap_unlock(args.ptep, ptl);
> 1396
> 1397 ptl = pmd_lock(args.mm, args.pmdp);
>
> --
> Best Regards,
> Mike Gavrilov.
>

--
Peter Xu


2024-05-21 23:26:26

by Mikhail Gavrilov

Subject: Re: 6.10/bisected/regression - commit 8430557fc584 cause warning at mm/page_table_check.c:198 __page_table_check_ptes_set+0x306

On Wed, May 22, 2024 at 3:36 AM Peter Xu <[email protected]> wrote:
>
> On Wed, May 22, 2024 at 03:21:04AM +0500, Mikhail Gavrilov wrote:
> > On Wed, May 22, 2024 at 2:37 AM Peter Xu <[email protected]> wrote:
> > > Hmm I still cannot reproduce. Weird.
> > >
> > > Would it be possible for you to identify which line in debug_vm_pgtable.c
> > > triggered that issue?
> > >
> > > I think it should be some set_pte_at() but I'm not sure, as there aren't a
> > > lot and all of them look benign so far. It could be that I missed
> > > something important.
> >
> > I hope it's helps:
>
> Thanks for offering this, it's just that it doesn't look coherent with what
> was reported for some reason.
>

There can be no mistake here.
I just copied the console output without rebooting.

> sudo dmesg | grep "debug_vm_pgtable"
[ 8.043229] debug_vm_pgtable: [debug_vm_pgtable ]:
Validating architecture page table helpers
[ 8.103359] debug_vm_pgtable+0x1c04/0x3360
[ 8.103607] ? __pfx_debug_vm_pgtable+0x10/0x10
[ 8.104312] ? __pfx_debug_vm_pgtable+0x10/0x10

> sh /lib/modules/6.9.0-test-eb6a9339efeb+/build/scripts/faddr2line /lib/modules/6.9.0-test-eb6a9339efeb+/build/vmlinux debug_vm_pgtable+0x1c04
debug_vm_pgtable+0x1c04/0x3360:
native_ptep_get_and_clear at arch/x86/include/asm/pgtable_64.h:94
(inlined by) ptep_get_and_clear at arch/x86/include/asm/pgtable.h:1278
(inlined by) ptep_clear at include/linux/pgtable.h:509
(inlined by) pte_clear_tests at mm/debug_vm_pgtable.c:643
(inlined by) debug_vm_pgtable at mm/debug_vm_pgtable.c:1392

--
Best Regards,
Mike Gavrilov.


Attachments:
full-kernel-log.zip (51.21 kB)

2024-05-22 07:49:03

by David Hildenbrand

Subject: Re: 6.10/bisected/regression - commit 8430557fc584 cause warning at mm/page_table_check.c:198 __page_table_check_ptes_set+0x306

On 22.05.24 00:36, Peter Xu wrote:
> On Wed, May 22, 2024 at 03:21:04AM +0500, Mikhail Gavrilov wrote:
>> On Wed, May 22, 2024 at 2:37 AM Peter Xu <[email protected]> wrote:
>>> Hmm I still cannot reproduce. Weird.
>>>
>>> Would it be possible for you to identify which line in debug_vm_pgtable.c
>>> triggered that issue?
>>>
>>> I think it should be some set_pte_at() but I'm not sure, as there aren't a
>>> lot and all of them look benign so far. It could be that I missed
>>> something important.
>>
>> I hope it's helps:
>
> Thanks for offering this, it's just that it doesn't look coherent with what
> was reported for some reason.
>
>>
>>> sh /usr/src/kernels/(uname -r)/scripts/faddr2line /lib/debug/lib/modules/(uname -r)/vmlinux debug_vm_pgtable+0x1c04
>> debug_vm_pgtable+0x1c04/0x3360:
>> native_ptep_get_and_clear at arch/x86/include/asm/pgtable_64.h:94
>> (inlined by) ptep_get_and_clear at arch/x86/include/asm/pgtable.h:1262
>> (inlined by) ptep_clear at include/linux/pgtable.h:509
>
> This is a pte_clear(), and pte_clear() shouldn't even do the set() checks,
> and shouldn't stumble over what I added.
>
> IOW, it doesn't match with the real stack dump previously:
>
> [ 5.581003] ? __page_table_check_ptes_set+0x306/0x3c0
> [ 5.581274] ? __pfx___page_table_check_ptes_set+0x10/0x10
> [ 5.581544] ? __pfx_check_pgprot+0x10/0x10
> [ 5.581806] set_ptes.constprop.0+0x66/0xd0
> [ 5.582072] ? __pfx_set_ptes.constprop.0+0x10/0x10
> [ 5.582333] ? __pfx_pte_val+0x10/0x10
> [ 5.582595] debug_vm_pgtable+0x1c04/0x3360
>

Staring at pte_clear_tests():

#ifndef CONFIG_RISCV
pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
#endif
set_pte_at(args->mm, args->vaddr, args->ptep, pte);

So we set random PTE bits, probably setting the present, uffd and write
bits at the same time. That doesn't make too much sense when we want to
enforce that such combinations cannot exist.

In pmd_clear_tests() and friends we use WRITE_ONCE() instead, so there
we don't run into trouble.

--
Cheers,

David / dhildenb


2024-05-22 15:18:26

by Peter Xu

Subject: Re: 6.10/bisected/regression - commit 8430557fc584 cause warning at mm/page_table_check.c:198 __page_table_check_ptes_set+0x306

On Wed, May 22, 2024 at 09:48:51AM +0200, David Hildenbrand wrote:
> On 22.05.24 00:36, Peter Xu wrote:
> > On Wed, May 22, 2024 at 03:21:04AM +0500, Mikhail Gavrilov wrote:
> > > On Wed, May 22, 2024 at 2:37 AM Peter Xu <[email protected]> wrote:
> > > > Hmm I still cannot reproduce. Weird.
> > > >
> > > > Would it be possible for you to identify which line in debug_vm_pgtable.c
> > > > triggered that issue?
> > > >
> > > > I think it should be some set_pte_at() but I'm not sure, as there aren't a
> > > > lot and all of them look benign so far. It could be that I missed
> > > > something important.
> > >
> > > I hope it's helps:
> >
> > Thanks for offering this, it's just that it doesn't look coherent with what
> > was reported for some reason.
> >
> > >
> > > > sh /usr/src/kernels/(uname -r)/scripts/faddr2line /lib/debug/lib/modules/(uname -r)/vmlinux debug_vm_pgtable+0x1c04
> > > debug_vm_pgtable+0x1c04/0x3360:
> > > native_ptep_get_and_clear at arch/x86/include/asm/pgtable_64.h:94
> > > (inlined by) ptep_get_and_clear at arch/x86/include/asm/pgtable.h:1262
> > > (inlined by) ptep_clear at include/linux/pgtable.h:509
> >
> > This is a pte_clear(), and pte_clear() shouldn't even do the set() checks,
> > and shouldn't stumble over what I added.
> >
> > IOW, it doesn't match with the real stack dump previously:
> >
> > [ 5.581003] ? __page_table_check_ptes_set+0x306/0x3c0
> > [ 5.581274] ? __pfx___page_table_check_ptes_set+0x10/0x10
> > [ 5.581544] ? __pfx_check_pgprot+0x10/0x10
> > [ 5.581806] set_ptes.constprop.0+0x66/0xd0
> > [ 5.582072] ? __pfx_set_ptes.constprop.0+0x10/0x10
> > [ 5.582333] ? __pfx_pte_val+0x10/0x10
> > [ 5.582595] debug_vm_pgtable+0x1c04/0x3360
> >
>
> Staring at pte_clear_tests():
>
> #ifndef CONFIG_RISCV
> pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
> #endif
> set_pte_at(args->mm, args->vaddr, args->ptep, pte);
>
> So we set random PTE bits, probably setting the present, uffd and write bit
> at the same time. That doesn't make too much sense when we want to perform
> that such combinations cannot exist.

Here the issue is that I don't think it should set the W bit anyway, as we
init page_prot to be RWX but !shared:

args->page_prot = vm_get_page_prot(VM_ACCESS_FLAGS);

On x86_64 (Mikhail's system) it should have the W bit cleared afaict.
Meanwhile, RANDOM_ORVALUE won't touch the W bit because of S390_SKIP_MASK
(which happens to contain bit W / bit 1, which is another "accident"..).
So even with the random bits applied it should not trigger.. I think that's
also why I cannot reproduce this problem locally.

But I think applying random bits is indeed tricky, and I don't really know
why we did that. I can see that we want to set some non-empty pte, but
AFAIU this should be enough:

pte_t pte = pfn_pte(args->pte_pfn, args->page_prot);

As that should already be pte_none()==false, clearing it and then rechecking
pte_none() looks good enough already. Obviously that trick already broke
PPC64 and S390 before, hence the existence of PPC64_SKIP_MASK etc..

I guess it won't hurt in this case to double check, though. Mikhail, would
you mind marking this line to see whether it's the one that triggered your
WARNING? Perhaps also dump something more than that, something like:

===8<===
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index f1c9a2c5abc0..610b1996b2e9 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -635,7 +635,8 @@ static void __init pte_clear_tests(struct pgtable_debug_args *args)
return;

#ifndef CONFIG_RISCV
- pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
+ pr_info("page_prot=0x%lx\n", pgprot_val(args->page_prot));
+ pr_info("pteval|RANDOM_ORVALUE=0x%lx\n", pte_val(pte) | RANDOM_ORVALUE);
#endif
set_pte_at(args->mm, args->vaddr, args->ptep, pte);
flush_dcache_page(page);
===8<===

For me it dumps:

[ 2.249478] debug_vm_pgtable: [pte_clear_tests ]: page_prot=0x25
[ 2.250049] debug_vm_pgtable: [pte_clear_tests ]: pteval|RANDOM_ORVALUE=0xbffffffffffffff5

Logically you should see the same, but since faddr2line doesn't seem to
work properly for some reason, maybe we can try.
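
As a cross-check, the dumped value can be reproduced outside the kernel. The
following is a minimal userspace sketch, not kernel code: genmask64(),
random_orvalue() and bit_is_set() are hypothetical helper names introduced
here, and the mask definitions are copied from the RANDOM_ORVALUE macros in
mm/debug_vm_pgtable.c as they appear in this thread. It assumes a 64-bit
build, where BITS_PER_LONG is 64.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's GENMASK(h, l), fixed at 64 bits. */
static uint64_t genmask64(unsigned int h, unsigned int l)
{
	assert(l <= h && h < 64);
	return (~UINT64_C(0) << l) & (~UINT64_C(0) >> (63 - h));
}

/* Mirrors RANDOM_ORVALUE: all bits set except the s390 and ppc64 skip masks. */
static uint64_t random_orvalue(void)
{
	uint64_t s390_skip = genmask64(3, 0);    /* S390_SKIP_MASK */
	uint64_t ppc64_skip = genmask64(62, 62); /* PPC64_SKIP_MASK */

	return genmask64(63, 0) & ~(s390_skip | ppc64_skip);
}

/* Returns 1 when bit n of v is set, 0 otherwise. */
static int bit_is_set(uint64_t v, unsigned int n)
{
	return (int)((v >> n) & 1);
}
```

With page_prot=0x25 as in the dump above, `0x25 | random_orvalue()` evaluates
to 0xbffffffffffffff5, the same value that shows up in RSI and R15 of the
original report. In that value bit 1 (W) is clear, while bit 6 is set.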

>
> In pmd_clear_tests() and friends we use WRITE_ONCE() instead, so there we
> don't run into trouble.

Right, and I think they should probably use set_pmd_at() rather than
WRITE_ONCE() if we want to cover the helpers.. but that's another story.

Thanks,

--
Peter Xu


2024-05-22 15:34:35

by David Hildenbrand

Subject: Re: 6.10/bisected/regression - commit 8430557fc584 cause warning at mm/page_table_check.c:198 __page_table_check_ptes_set+0x306

On 22.05.24 17:18, Peter Xu wrote:
> On Wed, May 22, 2024 at 09:48:51AM +0200, David Hildenbrand wrote:
>> On 22.05.24 00:36, Peter Xu wrote:
>>> On Wed, May 22, 2024 at 03:21:04AM +0500, Mikhail Gavrilov wrote:
>>>> On Wed, May 22, 2024 at 2:37 AM Peter Xu <[email protected]> wrote:
>>>>> Hmm I still cannot reproduce. Weird.
>>>>>
>>>>> Would it be possible for you to identify which line in debug_vm_pgtable.c
>>>>> triggered that issue?
>>>>>
>>>>> I think it should be some set_pte_at() but I'm not sure, as there aren't a
>>>>> lot and all of them look benign so far. It could be that I missed
>>>>> something important.
>>>>
>>>> I hope it's helps:
>>>
>>> Thanks for offering this, it's just that it doesn't look coherent with what
>>> was reported for some reason.
>>>
>>>>
>>>>> sh /usr/src/kernels/(uname -r)/scripts/faddr2line /lib/debug/lib/modules/(uname -r)/vmlinux debug_vm_pgtable+0x1c04
>>>> debug_vm_pgtable+0x1c04/0x3360:
>>>> native_ptep_get_and_clear at arch/x86/include/asm/pgtable_64.h:94
>>>> (inlined by) ptep_get_and_clear at arch/x86/include/asm/pgtable.h:1262
>>>> (inlined by) ptep_clear at include/linux/pgtable.h:509
>>>
>>> This is a pte_clear(), and pte_clear() shouldn't even do the set() checks,
>>> and shouldn't stumble over what I added.
>>>
>>> IOW, it doesn't match with the real stack dump previously:
>>>
>>> [ 5.581003] ? __page_table_check_ptes_set+0x306/0x3c0
>>> [ 5.581274] ? __pfx___page_table_check_ptes_set+0x10/0x10
>>> [ 5.581544] ? __pfx_check_pgprot+0x10/0x10
>>> [ 5.581806] set_ptes.constprop.0+0x66/0xd0
>>> [ 5.582072] ? __pfx_set_ptes.constprop.0+0x10/0x10
>>> [ 5.582333] ? __pfx_pte_val+0x10/0x10
>>> [ 5.582595] debug_vm_pgtable+0x1c04/0x3360
>>>
>>
>> Staring at pte_clear_tests():
>>
>> #ifndef CONFIG_RISCV
>> pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
>> #endif
>> set_pte_at(args->mm, args->vaddr, args->ptep, pte);
>>
>> So we set random PTE bits, probably setting the present, uffd and write bit
>> at the same time. That doesn't make too much sense when we want to perform
>> that such combinations cannot exist.
>
> Here the issue is I don't think it should set W bit anyway, as we init
> page_prot to be RWX but !shared:
>
> args->page_prot = vm_get_page_prot(VM_ACCESS_FLAGS);
>
> On x86_64 (Mikhail's system) it should have W bit cleared afaict, meanwhile
> the RANDOM_ORVALUE won't touch bit W due to S390_SKIP_MASK (which contains
> bit W / bit 1, which is another "accident"..). Then even if with that it
> should not trigger.. I think that's also why I cannot reproduce this
> problem locally.

Why oh why are the skip masks applied independently of the architecture?

_PAGE_RW should indeed be masked out by RANDOM_ORVALUE.

But with shadow stacks we consider a PTE writable (see
pte_write()->pte_shstk()) if
(1) X86_FEATURE_SHSTK is enabled
(2) _PAGE_RW is clear
(3) _PAGE_DIRTY is set

_PAGE_DIRTY is bit 6.

Likely your CPU does not support shadow stacks.
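
The three conditions can be modelled in a small userspace sketch to see why
the same PTE value warns on one machine and not on another. This is only an
illustration under stated assumptions, not the kernel implementation: the
model_*() names are hypothetical, the shadow stack feature is passed in as a
plain flag instead of a CPU feature check, and bit 10 for the uffd-wp flag on
x86 is an assumption that is not taken from this thread (bits 1 and 6 are).

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT (UINT64_C(1) << 0)
#define PTE_RW      (UINT64_C(1) << 1)  /* _PAGE_RW */
#define PTE_DIRTY   (UINT64_C(1) << 6)  /* _PAGE_DIRTY */
#define PTE_UFFD_WP (UINT64_C(1) << 10) /* assumed position of the uffd-wp bit */

/* Shadow stack PTE: feature enabled, RW clear, DIRTY set. */
static bool model_pte_shstk(uint64_t pte, bool shstk_enabled)
{
	return shstk_enabled && (pte & (PTE_RW | PTE_DIRTY)) == PTE_DIRTY;
}

/* Writable if RW is set, or if the PTE counts as a shadow stack PTE. */
static bool model_pte_write(uint64_t pte, bool shstk_enabled)
{
	return (pte & PTE_RW) != 0 || model_pte_shstk(pte, shstk_enabled);
}

/* The rule from the commit message: uffd-wp and writable must not coexist. */
static bool model_check_warns(uint64_t pte, bool shstk_enabled)
{
	return (pte & PTE_PRESENT) != 0 && (pte & PTE_UFFD_WP) != 0 &&
	       model_pte_write(pte, shstk_enabled);
}
```

For the value 0xbffffffffffffff5 seen in RSI/R15 of the report,
model_check_warns() is true when shstk_enabled is true and false otherwise,
which would be consistent with the warning appearing only on hardware with
shadow stack support.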


--
Cheers,

David / dhildenb


2024-05-22 16:11:03

by Peter Xu

Subject: Re: 6.10/bisected/regression - commit 8430557fc584 cause warning at mm/page_table_check.c:198 __page_table_check_ptes_set+0x306

On Wed, May 22, 2024 at 05:34:21PM +0200, David Hildenbrand wrote:
> On 22.05.24 17:18, Peter Xu wrote:
> > On Wed, May 22, 2024 at 09:48:51AM +0200, David Hildenbrand wrote:
> > > On 22.05.24 00:36, Peter Xu wrote:
> > > > On Wed, May 22, 2024 at 03:21:04AM +0500, Mikhail Gavrilov wrote:
> > > > > On Wed, May 22, 2024 at 2:37 AM Peter Xu <[email protected]> wrote:
> > > > > > Hmm I still cannot reproduce. Weird.
> > > > > >
> > > > > > Would it be possible for you to identify which line in debug_vm_pgtable.c
> > > > > > triggered that issue?
> > > > > >
> > > > > > I think it should be some set_pte_at() but I'm not sure, as there aren't a
> > > > > > lot and all of them look benign so far. It could be that I missed
> > > > > > something important.
> > > > >
> > > > > I hope it's helps:
> > > >
> > > > Thanks for offering this, it's just that it doesn't look coherent with what
> > > > was reported for some reason.
> > > >
> > > > >
> > > > > > sh /usr/src/kernels/(uname -r)/scripts/faddr2line /lib/debug/lib/modules/(uname -r)/vmlinux debug_vm_pgtable+0x1c04
> > > > > debug_vm_pgtable+0x1c04/0x3360:
> > > > > native_ptep_get_and_clear at arch/x86/include/asm/pgtable_64.h:94
> > > > > (inlined by) ptep_get_and_clear at arch/x86/include/asm/pgtable.h:1262
> > > > > (inlined by) ptep_clear at include/linux/pgtable.h:509
> > > >
> > > > This is a pte_clear(), and pte_clear() shouldn't even do the set() checks,
> > > > and shouldn't stumble over what I added.
> > > >
> > > > IOW, it doesn't match with the real stack dump previously:
> > > >
> > > > [ 5.581003] ? __page_table_check_ptes_set+0x306/0x3c0
> > > > [ 5.581274] ? __pfx___page_table_check_ptes_set+0x10/0x10
> > > > [ 5.581544] ? __pfx_check_pgprot+0x10/0x10
> > > > [ 5.581806] set_ptes.constprop.0+0x66/0xd0
> > > > [ 5.582072] ? __pfx_set_ptes.constprop.0+0x10/0x10
> > > > [ 5.582333] ? __pfx_pte_val+0x10/0x10
> > > > [ 5.582595] debug_vm_pgtable+0x1c04/0x3360
> > > >
> > >
> > > Staring at pte_clear_tests():
> > >
> > > #ifndef CONFIG_RISCV
> > > pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
> > > #endif
> > > set_pte_at(args->mm, args->vaddr, args->ptep, pte);
> > >
> > > So we set random PTE bits, probably setting the present, uffd and write bit
> > > at the same time. That doesn't make too much sense when we want to perform
> > > that such combinations cannot exist.
> >
> > Here the issue is I don't think it should set W bit anyway, as we init
> > page_prot to be RWX but !shared:
> >
> > args->page_prot = vm_get_page_prot(VM_ACCESS_FLAGS);
> >
> > On x86_64 (Mikhail's system) it should have W bit cleared afaict, meanwhile
> > the RANDOM_ORVALUE won't touch bit W due to S390_SKIP_MASK (which contains
> > bit W / bit 1, which is another "accident"..). Then even if with that it
> > should not trigger.. I think that's also why I cannot reproduce this
> > problem locally.
>
> Why oh why are skip mask applied independently of the architecture.
>
> While _PAGE_RW should indeed be masked out by RANDOM_ORVALUE.
>
> But with shadow stacks we consider a PTE writable (see
> pte_write()->pte_shstk()) if
> (1) X86_FEATURE_SHSTK is enabled
> (2) _PAGE_RW is clear
> (3) _PAGE_DIRTY is set
>
> _PAGE_DIRTY is bit 6.
>
> Likely your CPU does not support shadow stacks.

Good point. My host has it, but I tested in a VM, which doesn't. I
suppose we can wait and double check whether Mikhail sees the issue
go away with the patch provided.

In this case, instead of continuing to fiddle with which random bits to
apply, and doing further work on top of per-arch random bits, I'd hope we
can simply drop the random mechanism, as I don't think it'll be pxx_none()
now. I attached a patch I plan to post. Does it look reasonable?

I also copied Anshuman, Gavin and Aneesh.

Thanks,

===8<===
From c10cde00b14d2d305390dd418a8a8855d3e6437f Mon Sep 17 00:00:00 2001
From: Peter Xu <[email protected]>
Date: Wed, 22 May 2024 12:04:33 -0400
Subject: [PATCH] drop RANDOM_ORVALUE bits

Signed-off-by: Peter Xu <[email protected]>
---
mm/debug_vm_pgtable.c | 30 ++++--------------------------
1 file changed, 4 insertions(+), 26 deletions(-)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index f1c9a2c5abc0..b5d7be05063a 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -40,22 +40,7 @@
* Please refer Documentation/mm/arch_pgtable_helpers.rst for the semantics
* expectations that are being validated here. All future changes in here
* or the documentation need to be in sync.
- *
- * On s390 platform, the lower 4 bits are used to identify given page table
- * entry type. But these bits might affect the ability to clear entries with
- * pxx_clear() because of how dynamic page table folding works on s390. So
- * while loading up the entries do not change the lower 4 bits. It does not
- * have affect any other platform. Also avoid the 62nd bit on ppc64 that is
- * used to mark a pte entry.
*/
-#define S390_SKIP_MASK GENMASK(3, 0)
-#if __BITS_PER_LONG == 64
-#define PPC64_SKIP_MASK GENMASK(62, 62)
-#else
-#define PPC64_SKIP_MASK 0x0
-#endif
-#define ARCH_SKIP_MASK (S390_SKIP_MASK | PPC64_SKIP_MASK)
-#define RANDOM_ORVALUE (GENMASK(BITS_PER_LONG - 1, 0) & ~ARCH_SKIP_MASK)
#define RANDOM_NZVALUE GENMASK(7, 0)

struct pgtable_debug_args {
@@ -511,8 +496,7 @@ static void __init pud_clear_tests(struct pgtable_debug_args *args)
return;

pr_debug("Validating PUD clear\n");
- pud = __pud(pud_val(pud) | RANDOM_ORVALUE);
- WRITE_ONCE(*args->pudp, pud);
+ WARN_ON(pud_none(pud));
pud_clear(args->pudp);
pud = READ_ONCE(*args->pudp);
WARN_ON(!pud_none(pud));
@@ -548,8 +532,7 @@ static void __init p4d_clear_tests(struct pgtable_debug_args *args)
return;

pr_debug("Validating P4D clear\n");
- p4d = __p4d(p4d_val(p4d) | RANDOM_ORVALUE);
- WRITE_ONCE(*args->p4dp, p4d);
+ WARN_ON(p4d_none(p4d));
p4d_clear(args->p4dp);
p4d = READ_ONCE(*args->p4dp);
WARN_ON(!p4d_none(p4d));
@@ -582,8 +565,7 @@ static void __init pgd_clear_tests(struct pgtable_debug_args *args)
return;

pr_debug("Validating PGD clear\n");
- pgd = __pgd(pgd_val(pgd) | RANDOM_ORVALUE);
- WRITE_ONCE(*args->pgdp, pgd);
+ WARN_ON(pgd_none(pgd));
pgd_clear(args->pgdp);
pgd = READ_ONCE(*args->pgdp);
WARN_ON(!pgd_none(pgd));
@@ -634,9 +616,6 @@ static void __init pte_clear_tests(struct pgtable_debug_args *args)
if (WARN_ON(!args->ptep))
return;

-#ifndef CONFIG_RISCV
- pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
-#endif
set_pte_at(args->mm, args->vaddr, args->ptep, pte);
flush_dcache_page(page);
barrier();
@@ -650,8 +629,7 @@ static void __init pmd_clear_tests(struct pgtable_debug_args *args)
pmd_t pmd = READ_ONCE(*args->pmdp);

pr_debug("Validating PMD clear\n");
- pmd = __pmd(pmd_val(pmd) | RANDOM_ORVALUE);
- WRITE_ONCE(*args->pmdp, pmd);
+ WARN_ON(pmd_none(pmd));
pmd_clear(args->pmdp);
pmd = READ_ONCE(*args->pmdp);
WARN_ON(!pmd_none(pmd));
--
2.45.0

--
Peter Xu


2024-05-22 16:14:02

by Peter Xu

Subject: Re: 6.10/bisected/regression - commit 8430557fc584 cause warning at mm/page_table_check.c:198 __page_table_check_ptes_set+0x306

On Wed, May 22, 2024 at 12:10:30PM -0400, Peter Xu wrote:
> On Wed, May 22, 2024 at 05:34:21PM +0200, David Hildenbrand wrote:
> > On 22.05.24 17:18, Peter Xu wrote:
> > > On Wed, May 22, 2024 at 09:48:51AM +0200, David Hildenbrand wrote:
> > > > On 22.05.24 00:36, Peter Xu wrote:
> > > > > On Wed, May 22, 2024 at 03:21:04AM +0500, Mikhail Gavrilov wrote:
> > > > > > On Wed, May 22, 2024 at 2:37 AM Peter Xu <[email protected]> wrote:
> > > > > > > Hmm I still cannot reproduce. Weird.
> > > > > > >
> > > > > > > Would it be possible for you to identify which line in debug_vm_pgtable.c
> > > > > > > triggered that issue?
> > > > > > >
> > > > > > > I think it should be some set_pte_at() but I'm not sure, as there aren't a
> > > > > > > lot and all of them look benign so far. It could be that I missed
> > > > > > > something important.
> > > > > >
> > > > > > > I hope this helps:
> > > > >
> > > > > Thanks for offering this, it's just that it doesn't look coherent with what
> > > > > was reported for some reason.
> > > > >
> > > > > >
> > > > > > > sh /usr/src/kernels/(uname -r)/scripts/faddr2line /lib/debug/lib/modules/(uname -r)/vmlinux debug_vm_pgtable+0x1c04
> > > > > > debug_vm_pgtable+0x1c04/0x3360:
> > > > > > native_ptep_get_and_clear at arch/x86/include/asm/pgtable_64.h:94
> > > > > > (inlined by) ptep_get_and_clear at arch/x86/include/asm/pgtable.h:1262
> > > > > > (inlined by) ptep_clear at include/linux/pgtable.h:509
> > > > >
> > > > > This is a pte_clear(), and pte_clear() shouldn't even do the set() checks,
> > > > > and shouldn't stumble over what I added.
> > > > >
> > > > > IOW, it doesn't match with the real stack dump previously:
> > > > >
> > > > > [ 5.581003] ? __page_table_check_ptes_set+0x306/0x3c0
> > > > > [ 5.581274] ? __pfx___page_table_check_ptes_set+0x10/0x10
> > > > > [ 5.581544] ? __pfx_check_pgprot+0x10/0x10
> > > > > [ 5.581806] set_ptes.constprop.0+0x66/0xd0
> > > > > [ 5.582072] ? __pfx_set_ptes.constprop.0+0x10/0x10
> > > > > [ 5.582333] ? __pfx_pte_val+0x10/0x10
> > > > > [ 5.582595] debug_vm_pgtable+0x1c04/0x3360
> > > > >
> > > >
> > > > Staring at pte_clear_tests():
> > > >
> > > > #ifndef CONFIG_RISCV
> > > > pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
> > > > #endif
> > > > set_pte_at(args->mm, args->vaddr, args->ptep, pte);
> > > >
> > > > So we set random PTE bits, probably setting the present, uffd and write bits
> > > > at the same time. That doesn't make much sense when we want to verify
> > > > that such combinations cannot exist.
> > >
> > > Here the issue is I don't think it should set W bit anyway, as we init
> > > page_prot to be RWX but !shared:
> > >
> > > args->page_prot = vm_get_page_prot(VM_ACCESS_FLAGS);
> > >
> > > On x86_64 (Mikhail's system) it should have W bit cleared afaict, meanwhile
> > > the RANDOM_ORVALUE won't touch bit W due to S390_SKIP_MASK (which contains
> > > bit W / bit 1, which is another "accident"..). Then even if with that it
> > > should not trigger.. I think that's also why I cannot reproduce this
> > > problem locally.
> >
> > Why oh why are the skip masks applied independently of the architecture?
> >
> > _PAGE_RW should indeed be masked out by RANDOM_ORVALUE.
> >
> > But with shadow stacks we consider a PTE writable (see
> > pte_write()->pte_shstk()) if
> > (1) X86_FEATURE_SHSTK is enabled
> > (2) _PAGE_RW is clear
> > (3) _PAGE_DIRTY is set
> >
> > _PAGE_DIRTY is bit 6.
> >
> > Likely your CPU does not support shadow stacks.
>
> Good point. My host has it, but I tested in a VM, which doesn't. I
> suppose we can wait and double-check whether Mikhail sees the issue
> go away with the patch provided.
>
> In this case, instead of continuing to fiddle with which random bits to
> apply, and doing further work on per-arch random bits, I'd hope we can
> simply drop the random mechanism, as I don't think the entries will be
> pxx_none() now. I attached a patch I plan to post. Does it look reasonable?
>
> I also copied Anshuman, Gavin and Aneesh.

No, I didn't. This message does copy them.

>
> Thanks,
>
> ===8<===
> From c10cde00b14d2d305390dd418a8a8855d3e6437f Mon Sep 17 00:00:00 2001
> From: Peter Xu <[email protected]>
> Date: Wed, 22 May 2024 12:04:33 -0400
> Subject: [PATCH] drop RANDOM_ORVALUE bits
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> mm/debug_vm_pgtable.c | 30 ++++--------------------------
> 1 file changed, 4 insertions(+), 26 deletions(-)
>
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index f1c9a2c5abc0..b5d7be05063a 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -40,22 +40,7 @@
> * Please refer Documentation/mm/arch_pgtable_helpers.rst for the semantics
> * expectations that are being validated here. All future changes in here
> * or the documentation need to be in sync.
> - *
> - * On s390 platform, the lower 4 bits are used to identify given page table
> - * entry type. But these bits might affect the ability to clear entries with
> - * pxx_clear() because of how dynamic page table folding works on s390. So
> - * while loading up the entries do not change the lower 4 bits. It does not
> - * have affect any other platform. Also avoid the 62nd bit on ppc64 that is
> - * used to mark a pte entry.
> */
> -#define S390_SKIP_MASK GENMASK(3, 0)
> -#if __BITS_PER_LONG == 64
> -#define PPC64_SKIP_MASK GENMASK(62, 62)
> -#else
> -#define PPC64_SKIP_MASK 0x0
> -#endif
> -#define ARCH_SKIP_MASK (S390_SKIP_MASK | PPC64_SKIP_MASK)
> -#define RANDOM_ORVALUE (GENMASK(BITS_PER_LONG - 1, 0) & ~ARCH_SKIP_MASK)
> #define RANDOM_NZVALUE GENMASK(7, 0)
>
> struct pgtable_debug_args {
> @@ -511,8 +496,7 @@ static void __init pud_clear_tests(struct pgtable_debug_args *args)
> return;
>
> pr_debug("Validating PUD clear\n");
> - pud = __pud(pud_val(pud) | RANDOM_ORVALUE);
> - WRITE_ONCE(*args->pudp, pud);
> + WARN_ON(pud_none(pud));
> pud_clear(args->pudp);
> pud = READ_ONCE(*args->pudp);
> WARN_ON(!pud_none(pud));
> @@ -548,8 +532,7 @@ static void __init p4d_clear_tests(struct pgtable_debug_args *args)
> return;
>
> pr_debug("Validating P4D clear\n");
> - p4d = __p4d(p4d_val(p4d) | RANDOM_ORVALUE);
> - WRITE_ONCE(*args->p4dp, p4d);
> + WARN_ON(p4d_none(p4d));
> p4d_clear(args->p4dp);
> p4d = READ_ONCE(*args->p4dp);
> WARN_ON(!p4d_none(p4d));
> @@ -582,8 +565,7 @@ static void __init pgd_clear_tests(struct pgtable_debug_args *args)
> return;
>
> pr_debug("Validating PGD clear\n");
> - pgd = __pgd(pgd_val(pgd) | RANDOM_ORVALUE);
> - WRITE_ONCE(*args->pgdp, pgd);
> + WARN_ON(pgd_none(pgd));
> pgd_clear(args->pgdp);
> pgd = READ_ONCE(*args->pgdp);
> WARN_ON(!pgd_none(pgd));
> @@ -634,9 +616,6 @@ static void __init pte_clear_tests(struct pgtable_debug_args *args)
> if (WARN_ON(!args->ptep))
> return;
>
> -#ifndef CONFIG_RISCV
> - pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
> -#endif
> set_pte_at(args->mm, args->vaddr, args->ptep, pte);
> flush_dcache_page(page);
> barrier();
> @@ -650,8 +629,7 @@ static void __init pmd_clear_tests(struct pgtable_debug_args *args)
> pmd_t pmd = READ_ONCE(*args->pmdp);
>
> pr_debug("Validating PMD clear\n");
> - pmd = __pmd(pmd_val(pmd) | RANDOM_ORVALUE);
> - WRITE_ONCE(*args->pmdp, pmd);
> + WARN_ON(pmd_none(pmd));
> pmd_clear(args->pmdp);
> pmd = READ_ONCE(*args->pmdp);
> WARN_ON(!pmd_none(pmd));
> --
> 2.45.0
>
> --
> Peter Xu

--
Peter Xu


2024-05-22 20:26:12

by David Hildenbrand

Subject: Re: 6.10/bisected/regression - commit 8430557fc584 cause warning at mm/page_table_check.c:198 __page_table_check_ptes_set+0x306

On 22.05.24 18:10, Peter Xu wrote:
> On Wed, May 22, 2024 at 05:34:21PM +0200, David Hildenbrand wrote:
>> On 22.05.24 17:18, Peter Xu wrote:
>>> On Wed, May 22, 2024 at 09:48:51AM +0200, David Hildenbrand wrote:
>>>> On 22.05.24 00:36, Peter Xu wrote:
>>>>> On Wed, May 22, 2024 at 03:21:04AM +0500, Mikhail Gavrilov wrote:
>>>>>> On Wed, May 22, 2024 at 2:37 AM Peter Xu <[email protected]> wrote:
>>>>>>> Hmm I still cannot reproduce. Weird.
>>>>>>>
>>>>>>> Would it be possible for you to identify which line in debug_vm_pgtable.c
>>>>>>> triggered that issue?
>>>>>>>
>>>>>>> I think it should be some set_pte_at() but I'm not sure, as there aren't a
>>>>>>> lot and all of them look benign so far. It could be that I missed
>>>>>>> something important.
>>>>>>
>>>>>> I hope this helps:
>>>>>
>>>>> Thanks for offering this, it's just that it doesn't look coherent with what
>>>>> was reported for some reason.
>>>>>
>>>>>>
>>>>>>> sh /usr/src/kernels/(uname -r)/scripts/faddr2line /lib/debug/lib/modules/(uname -r)/vmlinux debug_vm_pgtable+0x1c04
>>>>>> debug_vm_pgtable+0x1c04/0x3360:
>>>>>> native_ptep_get_and_clear at arch/x86/include/asm/pgtable_64.h:94
>>>>>> (inlined by) ptep_get_and_clear at arch/x86/include/asm/pgtable.h:1262
>>>>>> (inlined by) ptep_clear at include/linux/pgtable.h:509
>>>>>
>>>>> This is a pte_clear(), and pte_clear() shouldn't even do the set() checks,
>>>>> and shouldn't stumble over what I added.
>>>>>
>>>>> IOW, it doesn't match with the real stack dump previously:
>>>>>
>>>>> [ 5.581003] ? __page_table_check_ptes_set+0x306/0x3c0
>>>>> [ 5.581274] ? __pfx___page_table_check_ptes_set+0x10/0x10
>>>>> [ 5.581544] ? __pfx_check_pgprot+0x10/0x10
>>>>> [ 5.581806] set_ptes.constprop.0+0x66/0xd0
>>>>> [ 5.582072] ? __pfx_set_ptes.constprop.0+0x10/0x10
>>>>> [ 5.582333] ? __pfx_pte_val+0x10/0x10
>>>>> [ 5.582595] debug_vm_pgtable+0x1c04/0x3360
>>>>>
>>>>
>>>> Staring at pte_clear_tests():
>>>>
>>>> #ifndef CONFIG_RISCV
>>>> pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
>>>> #endif
>>>> set_pte_at(args->mm, args->vaddr, args->ptep, pte);
>>>>
>>>> So we set random PTE bits, probably setting the present, uffd and write bits
>>>> at the same time. That doesn't make much sense when we want to verify
>>>> that such combinations cannot exist.
>>>
>>> Here the issue is I don't think it should set W bit anyway, as we init
>>> page_prot to be RWX but !shared:
>>>
>>> args->page_prot = vm_get_page_prot(VM_ACCESS_FLAGS);
>>>
>>> On x86_64 (Mikhail's system) it should have W bit cleared afaict, meanwhile
>>> the RANDOM_ORVALUE won't touch bit W due to S390_SKIP_MASK (which contains
>>> bit W / bit 1, which is another "accident"..). Then even if with that it
>>> should not trigger.. I think that's also why I cannot reproduce this
>>> problem locally.
>>
>> Why oh why are the skip masks applied independently of the architecture?
>>
>> _PAGE_RW should indeed be masked out by RANDOM_ORVALUE.
>>
>> But with shadow stacks we consider a PTE writable (see
>> pte_write()->pte_shstk()) if
>> (1) X86_FEATURE_SHSTK is enabled
>> (2) _PAGE_RW is clear
>> (3) _PAGE_DIRTY is set
>>
>> _PAGE_DIRTY is bit 6.
>>
>> Likely your CPU does not support shadow stacks.
>
> Good point. My host has it, but I tested in a VM, which doesn't. I
> suppose we can wait and double-check whether Mikhail sees the issue
> go away with the patch provided.
>
> In this case, instead of continuing to fiddle with which random bits to
> apply, and doing further work on per-arch random bits, I'd hope we can
> simply drop the random mechanism, as I don't think the entries will be
> pxx_none() now. I attached a patch I plan to post. Does it look reasonable?

I doubt that randomness ever helped in finding a BUG. Clearing is just
too simple ... but I might just be wrong :)

I'd vote for removing that, this will likely not be the last issue we
run into once we add more sanity checks during set_pte_at().
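
To make the bit arithmetic discussed above concrete, here is a minimal userspace sketch (not kernel code). The mask definitions mirror the lines the patch removes, and the RW (bit 1) and DIRTY (bit 6) positions are the ones stated in this thread. The uffd-wp position (bit 10) is an assumption taken from the x86 headers, and GENMASK64(), model_pte_write() and model_trips_check() are hypothetical names that only approximate the rule described for pte_write() and the new check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's GENMASK(h, l) on 64-bit values. */
#define GENMASK64(h, l) \
	(((~UINT64_C(0)) << (l)) & ((~UINT64_C(0)) >> (63 - (h))))

/* Same definitions as the lines removed by the patch (64-bit case). */
#define S390_SKIP_MASK  GENMASK64(3, 0)
#define PPC64_SKIP_MASK GENMASK64(62, 62)
#define ARCH_SKIP_MASK  (S390_SKIP_MASK | PPC64_SKIP_MASK)
#define RANDOM_ORVALUE  (GENMASK64(63, 0) & ~ARCH_SKIP_MASK)

/* x86 PTE bits: RW and DIRTY as stated in the thread. */
#define X86_PAGE_RW      (UINT64_C(1) << 1)
#define X86_PAGE_DIRTY   (UINT64_C(1) << 6)
/* Assumption: uffd-wp is software bit 10 on x86. */
#define X86_PAGE_UFFD_WP (UINT64_C(1) << 10)

/* All bits set except bits 0-3 (s390) and bit 62 (ppc64). */
static_assert(RANDOM_ORVALUE == UINT64_C(0xbffffffffffffff0),
	      "unexpected RANDOM_ORVALUE");

/*
 * Simplified model of the writability rule described above: a PTE is
 * writable if RW is set, or if shadow stacks are enabled, RW is clear
 * and DIRTY is set.
 */
static bool model_pte_write(uint64_t pte, bool shstk_enabled)
{
	if (pte & X86_PAGE_RW)
		return true;
	return shstk_enabled && (pte & X86_PAGE_DIRTY);
}

/* Model of the combination the check rejects: uffd-wp plus writable. */
static bool model_trips_check(uint64_t pte, bool shstk_enabled)
{
	return (pte & X86_PAGE_UFFD_WP) && model_pte_write(pte, shstk_enabled);
}
```

With a base entry whose low bits are 0x5 (RW clear), ORing in RANDOM_ORVALUE gives 0xbffffffffffffff5. That happens to equal the value in RSI and R15 in the original report, although reading those registers as the PTE is an inference. In this model, model_trips_check() returns true for that value only when shstk_enabled is true, which is consistent with the warning not reproducing in a VM without shadow stacks.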

--
Cheers,

David / dhildenb


2024-05-23 06:35:08

by Mikhail Gavrilov

Subject: Re: 6.10/bisected/regression - commit 8430557fc584 cause warning at mm/page_table_check.c:198 __page_table_check_ptes_set+0x306

On Wed, May 22, 2024 at 9:10 PM Peter Xu <[email protected]> wrote:
>
> ===8<===
> From c10cde00b14d2d305390dd418a8a8855d3e6437f Mon Sep 17 00:00:00 2001
> From: Peter Xu <[email protected]>
> Date: Wed, 22 May 2024 12:04:33 -0400
> Subject: [PATCH] drop RANDOM_ORVALUE bits
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> mm/debug_vm_pgtable.c | 30 ++++--------------------------
> 1 file changed, 4 insertions(+), 26 deletions(-)
>
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index f1c9a2c5abc0..b5d7be05063a 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -40,22 +40,7 @@
> * Please refer Documentation/mm/arch_pgtable_helpers.rst for the semantics
> * expectations that are being validated here. All future changes in here
> * or the documentation need to be in sync.
> - *
> - * On s390 platform, the lower 4 bits are used to identify given page table
> - * entry type. But these bits might affect the ability to clear entries with
> - * pxx_clear() because of how dynamic page table folding works on s390. So
> - * while loading up the entries do not change the lower 4 bits. It does not
> - * have affect any other platform. Also avoid the 62nd bit on ppc64 that is
> - * used to mark a pte entry.
> */
> -#define S390_SKIP_MASK GENMASK(3, 0)
> -#if __BITS_PER_LONG == 64
> -#define PPC64_SKIP_MASK GENMASK(62, 62)
> -#else
> -#define PPC64_SKIP_MASK 0x0
> -#endif
> -#define ARCH_SKIP_MASK (S390_SKIP_MASK | PPC64_SKIP_MASK)
> -#define RANDOM_ORVALUE (GENMASK(BITS_PER_LONG - 1, 0) & ~ARCH_SKIP_MASK)
> #define RANDOM_NZVALUE GENMASK(7, 0)
>
> struct pgtable_debug_args {
> @@ -511,8 +496,7 @@ static void __init pud_clear_tests(struct pgtable_debug_args *args)
> return;
>
> pr_debug("Validating PUD clear\n");
> - pud = __pud(pud_val(pud) | RANDOM_ORVALUE);
> - WRITE_ONCE(*args->pudp, pud);
> + WARN_ON(pud_none(pud));
> pud_clear(args->pudp);
> pud = READ_ONCE(*args->pudp);
> WARN_ON(!pud_none(pud));
> @@ -548,8 +532,7 @@ static void __init p4d_clear_tests(struct pgtable_debug_args *args)
> return;
>
> pr_debug("Validating P4D clear\n");
> - p4d = __p4d(p4d_val(p4d) | RANDOM_ORVALUE);
> - WRITE_ONCE(*args->p4dp, p4d);
> + WARN_ON(p4d_none(p4d));
> p4d_clear(args->p4dp);
> p4d = READ_ONCE(*args->p4dp);
> WARN_ON(!p4d_none(p4d));
> @@ -582,8 +565,7 @@ static void __init pgd_clear_tests(struct pgtable_debug_args *args)
> return;
>
> pr_debug("Validating PGD clear\n");
> - pgd = __pgd(pgd_val(pgd) | RANDOM_ORVALUE);
> - WRITE_ONCE(*args->pgdp, pgd);
> + WARN_ON(pgd_none(pgd));
> pgd_clear(args->pgdp);
> pgd = READ_ONCE(*args->pgdp);
> WARN_ON(!pgd_none(pgd));
> @@ -634,9 +616,6 @@ static void __init pte_clear_tests(struct pgtable_debug_args *args)
> if (WARN_ON(!args->ptep))
> return;
>
> -#ifndef CONFIG_RISCV
> - pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
> -#endif
> set_pte_at(args->mm, args->vaddr, args->ptep, pte);
> flush_dcache_page(page);
> barrier();
> @@ -650,8 +629,7 @@ static void __init pmd_clear_tests(struct pgtable_debug_args *args)
> pmd_t pmd = READ_ONCE(*args->pmdp);
>
> pr_debug("Validating PMD clear\n");
> - pmd = __pmd(pmd_val(pmd) | RANDOM_ORVALUE);
> - WRITE_ONCE(*args->pmdp, pmd);
> + WARN_ON(pmd_none(pmd));
> pmd_clear(args->pmdp);
> pmd = READ_ONCE(*args->pmdp);
> WARN_ON(!pmd_none(pmd));
> --
> 2.45.0
>
> --
> Peter Xu
>

Good news: the patch works, and the warning at mm/page_table_check.c:198
__page_table_check_ptes_set+0x306 is gone.
Tested-by: Mikhail Gavrilov <[email protected]>

Bad news: the testing terminated with an old, annoying problem which
appeared during the 6.9 release cycle [1] and apparently has not
been fixed yet.
[24119.281379] BUG: Bad page state in process kcompactd0 pfn:3ae37e
[24119.281387] page: refcount:0 mapcount:0 mapping:00000000d16c2d75 index:0x272ea3200 pfn:0x3ae37e
[24119.281390] aops:btree_aops ino:1
[24119.281395] flags: 0x17ffffc000020c(referenced|uptodate|workingset|node=0|zone=2|lastcpupid=0x1fffff)
[24119.281400] raw: 0017ffffc000020c dead000000000100 dead000000000122 ffff888136ecd220
[24119.281402] raw: 0000000272ea3200 0000000000000000 00000000ffffffff 0000000000000000
[24119.281403] page dumped because: non-NULL mapping
[24119.281405] Modules linked in: overlay tun crypto_user uinput
snd_seq_dummy snd_hrtimer rfcomm nf_conntrack_netbios_ns
nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set
nf_tables qrtr uhid bnep sunrpc binfmt_misc amd_atl intel_rapl_msr
intel_rapl_common mt76x2u mt76x2_common mt7921e mt7921_common
mt76x02_usb mt76_usb mt792x_lib mt76x02_lib mt76_connac_lib vfat mt76
fat mac80211 snd_hda_codec_hdmi snd_hda_intel edac_mce_amd
snd_intel_dspcfg snd_intel_sdw_acpi snd_usb_audio uvcvideo
snd_hda_codec kvm_amd btusb snd_usbmidi_lib uvc snd_hda_core snd_ump
btrtl videobuf2_vmalloc btintel videobuf2_memops snd_rawmidi snd_hwdep
videobuf2_v4l2 btbcm btmtk snd_seq videobuf2_common libarc4
snd_seq_device kvm bluetooth ledtrig_netdev videodev snd_pcm cfg80211
joydev asus_nb_wmi eeepc_wmi mc snd_timer asus_wmi sparse_keymap rapl
apple_mfi_fastcharge snd wmi_bmof platform_profile
[24119.281465] pcspkr igc k10temp soundcore i2c_piix4 rfkill
gpio_amdpt gpio_generic loop nfnetlink zram amdgpu crct10dif_pclmul
crc32_pclmul crc32c_intel amdxcp polyval_clmulni i2c_algo_bit
polyval_generic drm_ttm_helper ttm nvme drm_exec ghash_clmulni_intel
gpu_sched drm_suballoc_helper sha512_ssse3 drm_buddy nvme_core ccp
sha256_ssse3 drm_display_helper sha1_ssse3 sp5100_tco video nvme_auth
wmi hid_apple ip6_tables ip_tables fuse
[24119.281496] CPU: 30 PID: 221 Comm: kcompactd0 Tainted: G W L 6.9.0-test-5f16eb0549ab-with-drop-RANDOM_ORVALUE-bits+ #34
[24119.281498] Hardware name: ASUS System Product Name/ROG STRIX B650E-I GAMING WIFI, BIOS 2611 04/07/2024
[24119.281500] Call Trace:
[24119.281502] <TASK>
[24119.281503] dump_stack_lvl+0x84/0xd0
[24119.281508] bad_page.cold+0xbe/0xe0
[24119.281510] ? __pfx_bad_page+0x10/0x10
[24119.281514] ? page_bad_reason+0x9d/0x1f0
[24119.281517] free_unref_page+0x838/0x10e0
[24119.281520] __folio_put+0x1ba/0x2b0
[24119.281523] ? __pfx___folio_put+0x10/0x10
[24119.281525] ? __pfx___might_resched+0x10/0x10
[24119.281528] ? migrate_folio_done+0x1de/0x2b0
[24119.281531] migrate_pages_batch+0xe73/0x2880
[24119.281534] ? __pfx_compaction_alloc+0x10/0x10
[24119.281536] ? __pfx_compaction_free+0x10/0x10
[24119.281539] ? __pfx_migrate_pages_batch+0x10/0x10
[24119.281543] ? rcu_is_watching+0x12/0xc0
[24119.281546] migrate_pages+0x194f/0x22f0
[24119.281548] ? __pfx_compaction_alloc+0x10/0x10
[24119.281550] ? __pfx_compaction_free+0x10/0x10
[24119.281553] ? __pfx_migrate_pages+0x10/0x10
[24119.281555] ? rcu_is_watching+0x12/0xc0
[24119.281557] ? isolate_migratepages_block+0x2b02/0x4560
[24119.281561] ? __pfx_isolate_migratepages_block+0x10/0x10
[24119.281563] ? folio_putback_lru+0x5e/0xb0
[24119.281566] compact_zone+0x1a7c/0x3860
[24119.281569] ? rcu_is_watching+0x12/0xc0
[24119.281571] ? __pfx___free_object+0x10/0x10
[24119.281575] ? __pfx_compact_zone+0x10/0x10
[24119.281577] ? rcu_is_watching+0x12/0xc0
[24119.281579] ? lock_acquire+0x457/0x540
[24119.281581] ? kcompactd+0x2fa/0xc70
[24119.281583] ? rcu_is_watching+0x12/0xc0
[24119.281585] compact_node+0x144/0x240
[24119.281588] ? __pfx_compact_node+0x10/0x10
[24119.281593] ? rcu_is_watching+0x12/0xc0
[24119.281595] kcompactd+0x686/0xc70
[24119.281598] ? __pfx_kcompactd+0x10/0x10
[24119.281600] ? __pfx_autoremove_wake_function+0x10/0x10
[24119.281603] ? __kthread_parkme+0xb1/0x1d0
[24119.281605] ? __pfx_kcompactd+0x10/0x10
[24119.281608] ? __pfx_kcompactd+0x10/0x10
[24119.281610] kthread+0x2d2/0x3a0
[24119.281612] ? _raw_spin_unlock_irq+0x28/0x60
[24119.281614] ? __pfx_kthread+0x10/0x10
[24119.281616] ret_from_fork+0x31/0x70
[24119.281618] ? __pfx_kthread+0x10/0x10
[24119.281620] ret_from_fork_asm+0x1a/0x30
[24119.281624] </TASK>
[24171.367867] watchdog: BUG: soft lockup - CPU#25 stuck for 26s! [kworker/u130:3:2474335]

I attached the full kernel log below.

[1] https://lore.kernel.org/linux-kernel/CABXGCsPktcHQOvKTbPaTwegMExije=Gpgci5NW=hqORo-s7diA@mail.gmail.com/

--
Best Regards,
Mike Gavrilov.


Attachments:
dmesg.zip (76.37 kB)

2024-05-23 13:28:06

by Peter Xu

Subject: Re: 6.10/bisected/regression - commit 8430557fc584 cause warning at mm/page_table_check.c:198 __page_table_check_ptes_set+0x306

On Thu, May 23, 2024 at 11:34:37AM +0500, Mikhail Gavrilov wrote:
> On Wed, May 22, 2024 at 9:10 PM Peter Xu <[email protected]> wrote:
> >
> > ===8<===
> > From c10cde00b14d2d305390dd418a8a8855d3e6437f Mon Sep 17 00:00:00 2001
> > From: Peter Xu <[email protected]>
> > Date: Wed, 22 May 2024 12:04:33 -0400
> > Subject: [PATCH] drop RANDOM_ORVALUE bits
> >
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > mm/debug_vm_pgtable.c | 30 ++++--------------------------
> > 1 file changed, 4 insertions(+), 26 deletions(-)
> >
> > diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> > index f1c9a2c5abc0..b5d7be05063a 100644
> > --- a/mm/debug_vm_pgtable.c
> > +++ b/mm/debug_vm_pgtable.c
> > @@ -40,22 +40,7 @@
> > * Please refer Documentation/mm/arch_pgtable_helpers.rst for the semantics
> > * expectations that are being validated here. All future changes in here
> > * or the documentation need to be in sync.
> > - *
> > - * On s390 platform, the lower 4 bits are used to identify given page table
> > - * entry type. But these bits might affect the ability to clear entries with
> > - * pxx_clear() because of how dynamic page table folding works on s390. So
> > - * while loading up the entries do not change the lower 4 bits. It does not
> > - * have affect any other platform. Also avoid the 62nd bit on ppc64 that is
> > - * used to mark a pte entry.
> > */
> > -#define S390_SKIP_MASK GENMASK(3, 0)
> > -#if __BITS_PER_LONG == 64
> > -#define PPC64_SKIP_MASK GENMASK(62, 62)
> > -#else
> > -#define PPC64_SKIP_MASK 0x0
> > -#endif
> > -#define ARCH_SKIP_MASK (S390_SKIP_MASK | PPC64_SKIP_MASK)
> > -#define RANDOM_ORVALUE (GENMASK(BITS_PER_LONG - 1, 0) & ~ARCH_SKIP_MASK)
> > #define RANDOM_NZVALUE GENMASK(7, 0)
> >
> > struct pgtable_debug_args {
> > @@ -511,8 +496,7 @@ static void __init pud_clear_tests(struct pgtable_debug_args *args)
> > return;
> >
> > pr_debug("Validating PUD clear\n");
> > - pud = __pud(pud_val(pud) | RANDOM_ORVALUE);
> > - WRITE_ONCE(*args->pudp, pud);
> > + WARN_ON(pud_none(pud));
> > pud_clear(args->pudp);
> > pud = READ_ONCE(*args->pudp);
> > WARN_ON(!pud_none(pud));
> > @@ -548,8 +532,7 @@ static void __init p4d_clear_tests(struct pgtable_debug_args *args)
> > return;
> >
> > pr_debug("Validating P4D clear\n");
> > - p4d = __p4d(p4d_val(p4d) | RANDOM_ORVALUE);
> > - WRITE_ONCE(*args->p4dp, p4d);
> > + WARN_ON(p4d_none(p4d));
> > p4d_clear(args->p4dp);
> > p4d = READ_ONCE(*args->p4dp);
> > WARN_ON(!p4d_none(p4d));
> > @@ -582,8 +565,7 @@ static void __init pgd_clear_tests(struct pgtable_debug_args *args)
> > return;
> >
> > pr_debug("Validating PGD clear\n");
> > - pgd = __pgd(pgd_val(pgd) | RANDOM_ORVALUE);
> > - WRITE_ONCE(*args->pgdp, pgd);
> > + WARN_ON(pgd_none(pgd));
> > pgd_clear(args->pgdp);
> > pgd = READ_ONCE(*args->pgdp);
> > WARN_ON(!pgd_none(pgd));
> > @@ -634,9 +616,6 @@ static void __init pte_clear_tests(struct pgtable_debug_args *args)
> > if (WARN_ON(!args->ptep))
> > return;
> >
> > -#ifndef CONFIG_RISCV
> > - pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
> > -#endif
> > set_pte_at(args->mm, args->vaddr, args->ptep, pte);
> > flush_dcache_page(page);
> > barrier();
> > @@ -650,8 +629,7 @@ static void __init pmd_clear_tests(struct pgtable_debug_args *args)
> > pmd_t pmd = READ_ONCE(*args->pmdp);
> >
> > pr_debug("Validating PMD clear\n");
> > - pmd = __pmd(pmd_val(pmd) | RANDOM_ORVALUE);
> > - WRITE_ONCE(*args->pmdp, pmd);
> > + WARN_ON(pmd_none(pmd));
> > pmd_clear(args->pmdp);
> > pmd = READ_ONCE(*args->pmdp);
> > WARN_ON(!pmd_none(pmd));
> > --
> > 2.45.0
> >
> > --
> > Peter Xu
> >
>
> Good news: the patch works, and the warning at mm/page_table_check.c:198
> __page_table_check_ptes_set+0x306 is gone.
> Tested-by: Mikhail Gavrilov <[email protected]>

Thanks.

>
> Bad news: the testing terminated with an old, annoying problem which
> appeared during the 6.9 release cycle [1] and apparently has not
> been fixed yet.
> [24119.281379] BUG: Bad page state in process kcompactd0 pfn:3ae37e
> [24119.281387] page: refcount:0 mapcount:0 mapping:00000000d16c2d75 index:0x272ea3200 pfn:0x3ae37e
> [24119.281390] aops:btree_aops ino:1
> [24119.281395] flags: 0x17ffffc000020c(referenced|uptodate|workingset|node=0|zone=2|lastcpupid=0x1fffff)
> [24119.281400] raw: 0017ffffc000020c dead000000000100 dead000000000122 ffff888136ecd220
> [24119.281402] raw: 0000000272ea3200 0000000000000000 00000000ffffffff 0000000000000000
> [24119.281403] page dumped because: non-NULL mapping
> [24119.281405] Modules linked in: overlay tun crypto_user uinput
> snd_seq_dummy snd_hrtimer rfcomm nf_conntrack_netbios_ns
> nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
> nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
> nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set
> nf_tables qrtr uhid bnep sunrpc binfmt_misc amd_atl intel_rapl_msr
> intel_rapl_common mt76x2u mt76x2_common mt7921e mt7921_common
> mt76x02_usb mt76_usb mt792x_lib mt76x02_lib mt76_connac_lib vfat mt76
> fat mac80211 snd_hda_codec_hdmi snd_hda_intel edac_mce_amd
> snd_intel_dspcfg snd_intel_sdw_acpi snd_usb_audio uvcvideo
> snd_hda_codec kvm_amd btusb snd_usbmidi_lib uvc snd_hda_core snd_ump
> btrtl videobuf2_vmalloc btintel videobuf2_memops snd_rawmidi snd_hwdep
> videobuf2_v4l2 btbcm btmtk snd_seq videobuf2_common libarc4
> snd_seq_device kvm bluetooth ledtrig_netdev videodev snd_pcm cfg80211
> joydev asus_nb_wmi eeepc_wmi mc snd_timer asus_wmi sparse_keymap rapl
> apple_mfi_fastcharge snd wmi_bmof platform_profile
> [24119.281465] pcspkr igc k10temp soundcore i2c_piix4 rfkill
> gpio_amdpt gpio_generic loop nfnetlink zram amdgpu crct10dif_pclmul
> crc32_pclmul crc32c_intel amdxcp polyval_clmulni i2c_algo_bit
> polyval_generic drm_ttm_helper ttm nvme drm_exec ghash_clmulni_intel
> gpu_sched drm_suballoc_helper sha512_ssse3 drm_buddy nvme_core ccp
> sha256_ssse3 drm_display_helper sha1_ssse3 sp5100_tco video nvme_auth
> wmi hid_apple ip6_tables ip_tables fuse
> [24119.281496] CPU: 30 PID: 221 Comm: kcompactd0 Tainted: G W L 6.9.0-test-5f16eb0549ab-with-drop-RANDOM_ORVALUE-bits+ #34
> [24119.281498] Hardware name: ASUS System Product Name/ROG STRIX B650E-I GAMING WIFI, BIOS 2611 04/07/2024
> [24119.281500] Call Trace:
> [24119.281502] <TASK>
> [24119.281503] dump_stack_lvl+0x84/0xd0
> [24119.281508] bad_page.cold+0xbe/0xe0
> [24119.281510] ? __pfx_bad_page+0x10/0x10
> [24119.281514] ? page_bad_reason+0x9d/0x1f0
> [24119.281517] free_unref_page+0x838/0x10e0
> [24119.281520] __folio_put+0x1ba/0x2b0
> [24119.281523] ? __pfx___folio_put+0x10/0x10
> [24119.281525] ? __pfx___might_resched+0x10/0x10
> [24119.281528] ? migrate_folio_done+0x1de/0x2b0
> [24119.281531] migrate_pages_batch+0xe73/0x2880
> [24119.281534] ? __pfx_compaction_alloc+0x10/0x10
> [24119.281536] ? __pfx_compaction_free+0x10/0x10
> [24119.281539] ? __pfx_migrate_pages_batch+0x10/0x10
> [24119.281543] ? rcu_is_watching+0x12/0xc0
> [24119.281546] migrate_pages+0x194f/0x22f0
> [24119.281548] ? __pfx_compaction_alloc+0x10/0x10
> [24119.281550] ? __pfx_compaction_free+0x10/0x10
> [24119.281553] ? __pfx_migrate_pages+0x10/0x10
> [24119.281555] ? rcu_is_watching+0x12/0xc0
> [24119.281557] ? isolate_migratepages_block+0x2b02/0x4560
> [24119.281561] ? __pfx_isolate_migratepages_block+0x10/0x10
> [24119.281563] ? folio_putback_lru+0x5e/0xb0
> [24119.281566] compact_zone+0x1a7c/0x3860
> [24119.281569] ? rcu_is_watching+0x12/0xc0
> [24119.281571] ? __pfx___free_object+0x10/0x10
> [24119.281575] ? __pfx_compact_zone+0x10/0x10
> [24119.281577] ? rcu_is_watching+0x12/0xc0
> [24119.281579] ? lock_acquire+0x457/0x540
> [24119.281581] ? kcompactd+0x2fa/0xc70
> [24119.281583] ? rcu_is_watching+0x12/0xc0
> [24119.281585] compact_node+0x144/0x240
> [24119.281588] ? __pfx_compact_node+0x10/0x10
> [24119.281593] ? rcu_is_watching+0x12/0xc0
> [24119.281595] kcompactd+0x686/0xc70
> [24119.281598] ? __pfx_kcompactd+0x10/0x10
> [24119.281600] ? __pfx_autoremove_wake_function+0x10/0x10
> [24119.281603] ? __kthread_parkme+0xb1/0x1d0
> [24119.281605] ? __pfx_kcompactd+0x10/0x10
> [24119.281608] ? __pfx_kcompactd+0x10/0x10
> [24119.281610] kthread+0x2d2/0x3a0
> [24119.281612] ? _raw_spin_unlock_irq+0x28/0x60
> [24119.281614] ? __pfx_kthread+0x10/0x10
> [24119.281616] ret_from_fork+0x31/0x70
> [24119.281618] ? __pfx_kthread+0x10/0x10
> [24119.281620] ret_from_fork_asm+0x1a/0x30
> [24119.281624] </TASK>
> [24171.367867] watchdog: BUG: soft lockup - CPU#25 stuck for 26s! [kworker/u130:3:2474335]
>
> I attached the full kernel log below.
>
> [1] https://lore.kernel.org/linux-kernel/CABXGCsPktcHQOvKTbPaTwegMExije=Gpgci5NW=hqORo-s7diA@mail.gmail.com/

Sorry to hear that nobody has looked at this for two months. However, I
think we'll need to fix the two issues separately anyway. Let me post a fix
for the known one first.
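
For reference, the shape of the simplified clear tests in the patch quoted above can be modelled in userspace roughly as follows. This is only a sketch: model_pud_t, model_pud_none(), model_pud_clear() and model_pud_clear_test() are hypothetical stand-ins rather than kernel helpers, and each WARN_ON() is modelled as a counter increment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a page table entry. */
typedef struct { uint64_t val; } model_pud_t;

/* Assumption for this model: an entry is "none" when it is all zeroes. */
static bool model_pud_none(model_pud_t pud)
{
	return pud.val == 0;
}

static void model_pud_clear(model_pud_t *pudp)
{
	pudp->val = 0;
}

/*
 * Mirrors the patched test flow: the entry must not be none before the
 * clear and must be none afterwards. Returns the number of checks that
 * would have fired a WARN_ON().
 */
static int model_pud_clear_test(model_pud_t *pudp)
{
	int warnings = 0;
	model_pud_t pud;

	assert(pudp != NULL);

	pud = *pudp;
	if (model_pud_none(pud))	/* WARN_ON(pud_none(pud)) */
		warnings++;

	model_pud_clear(pudp);

	pud = *pudp;
	if (!model_pud_none(pud))	/* WARN_ON(!pud_none(pud)) */
		warnings++;

	return warnings;
}
```

The point of this shape is that the test no longer writes arbitrary bit patterns into the entry. It relies on the entry already being populated and checks only that clearing it leaves a none entry.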

--
Peter Xu