2024-04-01 16:55:16

by Javier Pello

[permalink] [raw]
Subject: [PATCH 0/1] Split lock detected in kernel mode on x86-32 with PAE

Dear list,

I am hitting a problem with split locks after recently updating the
kernel on my system to 6.8.2. Some tasks occasionally get stuck on
program termination, and dmesg shows output like the following:

[0.000000] Split lock detected
[0.000000] : 0000 [#1] SMP
[0.000000] CPU: 10 PID: 1330 Comm: a.out Not tainted 6.8.2 #2
[0.000000] Hardware name: Gigabyte Technology Co., Ltd. H610M S2H DDR4/H610M S2H DDR4, BIOS FL 11/15/2022
[0.000000] EIP: __split_huge_pmd+0x468/0xa0c
[0.000000] Code: e0 0c 83 c8 67 89 55 bc 8b 55 d8 89 45 b8 8b 45 d4 8b 4d bc 89 55 b4 8b 55 b4 89 45 b0 8b 45 b8 89 7d b8 89 df 89 c3 8b 45 b0 <f0> 0f c7 4d d4 75 f9 8d 45 d4 89 fb 8b 7d b8 31 c9 89 fa e8 6c cf
[0.000000] EAX: c36797c0 EBX: b85cd067 ECX: 00000004 EDX: 00000000
[0.000000] ESI: ef0bf000 EDI: 00000000 EBP: cf3bfe28 ESP: cf3bfda4
[0.000000] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210282
[0.000000] CR0: 80050033 CR2: b7344000 CR3: 0a47a320 CR4: 00352ef0
[0.000000] Call Trace:
[0.000000] ? show_regs+0x70/0x78
[0.000000] ? die+0x29/0x74
[0.000000] ? exc_alignment_check+0x9e/0xa0
[0.000000] ? exc_stack_segment+0x3c/0x3c
[0.000000] ? handle_exception+0x14d/0x14d
[0.000000] ? copy_page_range+0x26b/0xc8c
[0.000000] ? exc_stack_segment+0x3c/0x3c
[0.000000] ? __split_huge_pmd+0x468/0xa0c
[0.000000] ? exc_stack_segment+0x3c/0x3c
[0.000000] ? __split_huge_pmd+0x468/0xa0c
[0.000000] vma_adjust_trans_huge+0xb8/0x160
[0.000000] __split_vma+0x1f9/0x2e0
[0.000000] do_vmi_align_munmap.isra.0+0x90/0x3ac
[0.000000] do_vmi_munmap+0x69/0xa4
[0.000000] __vm_munmap+0x6e/0xa8
[0.000000] __ia32_sys_munmap+0x12/0x14
[0.000000] __do_fast_syscall_32+0x5a/0xd8
[0.000000] do_fast_syscall_32+0x29/0x5c
[0.000000] do_SYSENTER_32+0x15/0x20
[0.000000] entry_SYSENTER_32+0xa2/0x102
[0.000000] EIP: 0xb7eec569
[0.000000] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
[0.000000] EAX: ffffffda EBX: b4e68000 ECX: 00198000 EDX: b4e68000
[0.000000] ESI: 00000000 EDI: b4ff8000 EBP: 00000002 ESP: bf91e23c
[0.000000] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200292
[0.000000] Modules linked in: kvm_intel kvm irqbypass
[0.000000] ---[ end trace 0000000000000000 ]---
[0.000000] EIP: __split_huge_pmd+0x468/0xa0c
[0.000000] Code: e0 0c 83 c8 67 89 55 bc 8b 55 d8 89 45 b8 8b 45 d4 8b 4d bc 89 55 b4 8b 55 b4 89 45 b0 8b 45 b8 89 7d b8 89 df 89 c3 8b 45 b0 <f0> 0f c7 4d d4 75 f9 8d 45 d4 89 fb 8b 7d b8 31 c9 89 fa e8 6c cf
[0.000000] EAX: c36797c0 EBX: b85cd067 ECX: 00000004 EDX: 00000000
[0.000000] ESI: ef0bf000 EDI: 00000000 EBP: cf3bfe28 ESP: cf3bfda4
[0.000000] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210282
[0.000000] CR0: 80050033 CR2: b7344000 CR3: 0a47a320 CR4: 00352ef0
[0.000000] note: a.out[1330] exited with irqs disabled

The offending process does not exit after this, although the issue
seems to be in kernel code. The problem is not perfectly reproducible,
but close enough: Most processes behave normally, but certain tasks
(a particular browser session, for instance) are prone to end up like
this more often than not.

I have done some analysis of the problem and the critical point seems
to be in __split_huge_pmd+0x468/0xa0c. Disassembly of this location
shows

2c28: f0 0f c7 4d d4 lock cmpxchg8b QWORD PTR [ebp-0x2c]
2c2d: 75 f9 jne 2c28 <__split_huge_pmd+0x468>

On my kernel this code maps to __split_huge_pmd_locked in
mm/huge_memory.c, around line 2581

pgtable = pgtable_trans_huge_withdraw(mm, pmd);
pmd_populate(mm, &_pmd, pgtable);

where pmd_populate seems to ultimately call native_set_pmd in
arch/x86/include/asm/pgtable-3level.h, as I am running an x86-32 kernel
with PAE, and native_set_pmd relies on the macro

#define pxx_xchg64(_pxx, _ptr, _val) ({ \
_pxx##val_t *_p = (_pxx##val_t *)_ptr; \
_pxx##val_t _o = *_p; \
do { } while (!try_cmpxchg64(_p, &_o, (_val))); \
native_make_##_pxx(_o); \
})

where the busy loop translates into the assembly above. Now, as seen
from the assembly, variable _pmd is a qword that lies at ebp-0x2c,
and the trace shows that ebp is 0xcf3bfe28 at that point, which means
that the 8 bytes of _pmd cross a 512-byte alignment boundary, and this
is consistent with its being split across cache lines (if they are of
that size or less, which I do not know).

I have a few other logs like this one, and all of them are similar in
that all of them happen at __split_huge_pmd+0x468/0xa0c and in all of
them the value of ebp ends in 0xe28.

Since the issue seems to be that native_set_pmd tries to perform an
atomic compare-and-exchange operation on the location of its first
(pointer) argument, and since such an operation triggers an exception
if the target location spans two cache lines, I patched the kernel
to avoid that by raising the alignment of pmd_t to 8 bytes, and did the
same to pte_t and pud_t, for the same reason. I have been using the
proposed patch for a few days and the exceptions are gone.

Note that the patch also raises the alignment of p4dval_t, pgdval_t and
pgprotval_t, even though I do not see any immediate reason to do this,
because, well, my goal was to get the bug fixed and I thought that I
would rather overshoot first and fine-tune later. Any insight into
whether the extra alignment is required also for these types would be
appreciated.

Javier Pello (1):
x86/mm/pae: Align up pteval_t, pmdval_t and pudval_t to avoid split locks

arch/x86/include/asm/pgtable-3level_types.h | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)

--
2.43.3


2024-04-01 16:57:40

by Javier Pello

[permalink] [raw]
Subject: [PATCH 1/1] x86/mm/pae: Align up pteval_t, pmdval_t and pudval_t to avoid split locks

From: Javier Pello <[email protected]>

When PAE is enabled on x86-32, the types pte_t, pmd_t and pud_t are
defined in terms of pteval_t, pmdval_t and pudval_t, respectively,
all of which are typedefs for u64, which is an 8-byte type with
4-byte alignment. However, variables of these types are subject to
compare-and-exchange operations in their respective native_set_*
functions, which can trigger split locks if a variable spans two
cache lines, for instance on the stack. Annotate these types with
__aligned(8) to avoid this.

Signed-off-by: Javier Pello <[email protected]>
Cc: [email protected]
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
---
arch/x86/include/asm/pgtable-3level_types.h | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/pgtable-3level_types.h b/arch/x86/include/asm/pgtable-3level_types.h
index 80911349..61f62ab4 100644
--- a/arch/x86/include/asm/pgtable-3level_types.h
+++ b/arch/x86/include/asm/pgtable-3level_types.h
@@ -5,12 +5,16 @@
#ifndef __ASSEMBLY__
#include <linux/types.h>

-typedef u64 pteval_t;
-typedef u64 pmdval_t;
-typedef u64 pudval_t;
-typedef u64 p4dval_t;
-typedef u64 pgdval_t;
-typedef u64 pgprotval_t;
+/*
+ * Variables of these types are subject to atomic compare-and-exchange
+ * operations, so they have to be properly aligned to avoid split locks.
+ */
+typedef u64 pteval_t __aligned(8);
+typedef u64 pmdval_t __aligned(8);
+typedef u64 pudval_t __aligned(8);
+typedef u64 p4dval_t __aligned(8);
+typedef u64 pgdval_t __aligned(8);
+typedef u64 pgprotval_t __aligned(8);

typedef union {
struct {
--
2.43.3

2024-04-01 17:56:38

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 1/1] x86/mm/pae: Align up pteval_t, pmdval_t and pudval_t to avoid split locks

On 4/1/24 09:57, Javier Pello wrote:
> -typedef u64 pteval_t;
> -typedef u64 pmdval_t;
> -typedef u64 pudval_t;
> -typedef u64 p4dval_t;
> -typedef u64 pgdval_t;
> -typedef u64 pgprotval_t;
> +/*
> + * Variables of these types are subject to atomic compare-and-exchange
> + * operations, so they have to be properly aligned to avoid split locks.
> + */
> +typedef u64 pteval_t __aligned(8);
> +typedef u64 pmdval_t __aligned(8);
> +typedef u64 pudval_t __aligned(8);
> +typedef u64 p4dval_t __aligned(8);
> +typedef u64 pgdval_t __aligned(8);
> +typedef u64 pgprotval_t __aligned(8);

First of all, how is it that you're running a PAE kernel on new, 64-bit
hardware? That's rather odd.

The case that you're hitting is actually an on-stack pmd_t. The fun
part is that it's not shared and doesn't even _need_ atomics. I think
it's just using pmd_populate() because it's convenient.

I'd honestly much rather just disable split lock support in 32-bit
builds than mess with this stuff. You really shouldn't be running
32-but kernels on this hardware.

2024-04-02 17:24:01

by Javier Pello

[permalink] [raw]
Subject: Re: [PATCH 1/1] x86/mm/pae: Align up pteval_t, pmdval_t and pudval_t to avoid split locks

On Mon, 1 Apr 2024 10:56:14 -0700 Dave Hansen wrote:

> First of all, how is it that you're running a PAE kernel on new,
> 64-bit hardware? That's rather odd.

I got this motherboard and cpu fairly recently to replace old
hardware, and I just plugged my old hard disk and went along with
it, because I did not feel like bootstrapping a 64-bit system.

> The case that you're hitting is actually an on-stack pmd_t. The
> fun part is that it's not shared and doesn't even _need_ atomics.
> I think it's just using pmd_populate() because it's convenient.

I see. So just annotating the variable on the stack with
__aligned(8) should do it? But the code is under mm/, so it should
be arch-agnostic, right? What would the correct fix be, then? I take
from your message that using atomics through pmd_populate() here is
not needed, but what accessors should be used instead? I am not
familiar at all with this part of the kernel.

> I'd honestly much rather just disable split lock support in 32-bit
> builds than mess with this stuff. You really shouldn't be running
> 32-but kernels on this hardware.

Why? Is it unsupported? The hardware runs fine, it is just a choice
made by the kernel to crash a task if a split lock is detected in
kernel mode, but I do not see any technical reason not to fix the
split lock. Disabling split lock detection would not make the split
locks go away, they would just go unnoticed instead.

Regards,
Javier

2024-04-02 17:43:42

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 1/1] x86/mm/pae: Align up pteval_t, pmdval_t and pudval_t to avoid split locks

On 4/2/24 10:23, Javier Pello wrote:
> On Mon, 1 Apr 2024 10:56:14 -0700 Dave Hansen wrote:
>> First of all, how is it that you're running a PAE kernel on new,
>> 64-bit hardware? That's rather odd.
>
> I got this motherboard and cpu fairly recently to replace old
> hardware, and I just plugged my old hard disk and went along with
> it, because I did not feel like bootstrapping a 64-bit system.

Fair enough. I can totally understand wanting the convenience. But
you're leaving _so_ much performance on the floor that split locks are
the least of your problems.

>> The case that you're hitting is actually an on-stack pmd_t. The
>> fun part is that it's not shared and doesn't even _need_ atomics.
>> I think it's just using pmd_populate() because it's convenient.
>
> I see. So just annotating the variable on the stack with
> __aligned(8) should do it? But the code is under mm/, so it should
> be arch-agnostic, right? What would the correct fix be, then? I take
> from your message that using atomics through pmd_populate() here is
> not needed, but what accessors should be used instead? I am not
> familiar at all with this part of the kernel.

I don't think there's a better accessor.

>> I'd honestly much rather just disable split lock support in 32-bit
>> builds than mess with this stuff. You really shouldn't be running
>> 32-but kernels on this hardware.
>
> Why? Is it unsupported?

Yes, it's effectively unsupported. We're not adding new hardware
features to 32-bit. The fact that split lock detection got enabled was
an accident.

> The hardware runs fine, it is just a choice made by the kernel to
> crash a task if a split lock is detected in kernel mode, but I do not
> see any technical reason not to fix the split lock. Disabling split
> lock detection would not make the split locks go away, they would
> just go unnoticed instead.

It's not a technical reason. It's a practical one: I don't want to
spend time reviewing the fixes and dealing with the fallout and
regressions that the fixes might cause.


2024-04-03 08:00:07

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 1/1] x86/mm/pae: Align up pteval_t, pmdval_t and pudval_t to avoid split locks


* Dave Hansen <[email protected]> wrote:

> On 4/2/24 10:23, Javier Pello wrote:
> > On Mon, 1 Apr 2024 10:56:14 -0700 Dave Hansen wrote:
> >> First of all, how is it that you're running a PAE kernel on new,
> >> 64-bit hardware? That's rather odd.
> >
> > I got this motherboard and cpu fairly recently to replace old
> > hardware, and I just plugged my old hard disk and went along with
> > it, because I did not feel like bootstrapping a 64-bit system.
>
> Fair enough. I can totally understand wanting the convenience. But
> you're leaving _so_ much performance on the floor that split locks are
> the least of your problems.
>
> >> The case that you're hitting is actually an on-stack pmd_t. The
> >> fun part is that it's not shared and doesn't even _need_ atomics.
> >> I think it's just using pmd_populate() because it's convenient.
> >
> > I see. So just annotating the variable on the stack with
> > __aligned(8) should do it? But the code is under mm/, so it should
> > be arch-agnostic, right? What would the correct fix be, then? I take
> > from your message that using atomics through pmd_populate() here is
> > not needed, but what accessors should be used instead? I am not
> > familiar at all with this part of the kernel.
>
> I don't think there's a better accessor.
>
> >> I'd honestly much rather just disable split lock support in 32-bit
> >> builds than mess with this stuff. You really shouldn't be running
> >> 32-but kernels on this hardware.
> >
> > Why? Is it unsupported?
>
> Yes, it's effectively unsupported. We're not adding new hardware
> features to 32-bit. The fact that split lock detection got enabled
> was an accident.

We do accept well-tested fixes and minor enablement patches though,
within reason - but indeed this page table entry alignment quirk added
for the sake of a split-lock debugging false positive doesn't seem to
be worth it.

> It's not a technical reason. It's a practical one: I don't want to
> spend time reviewing the fixes and dealing with the fallout and
> regressions that the fixes might cause.

Yeah, so it's an indirect technical argument: fixes *with tradeoffs*
like this one have a future maintenance & robustness cost. Fixes
without tradeoffs are fine of course.

Thanks,

Ingo

2024-04-03 11:11:53

by Brian Gerst

[permalink] [raw]
Subject: Re: [PATCH 1/1] x86/mm/pae: Align up pteval_t, pmdval_t and pudval_t to avoid split locks

On Wed, Apr 3, 2024 at 4:00 AM Ingo Molnar <[email protected]> wrote:
>
>
> * Dave Hansen <[email protected]> wrote:
>
> > On 4/2/24 10:23, Javier Pello wrote:
> > > On Mon, 1 Apr 2024 10:56:14 -0700 Dave Hansen wrote:
> > >> First of all, how is it that you're running a PAE kernel on new,
> > >> 64-bit hardware? That's rather odd.
> > >
> > > I got this motherboard and cpu fairly recently to replace old
> > > hardware, and I just plugged my old hard disk and went along with
> > > it, because I did not feel like bootstrapping a 64-bit system.
> >
> > Fair enough. I can totally understand wanting the convenience. But
> > you're leaving _so_ much performance on the floor that split locks are
> > the least of your problems.
> >
> > >> The case that you're hitting is actually an on-stack pmd_t. The
> > >> fun part is that it's not shared and doesn't even _need_ atomics.
> > >> I think it's just using pmd_populate() because it's convenient.
> > >
> > > I see. So just annotating the variable on the stack with
> > > __aligned(8) should do it? But the code is under mm/, so it should
> > > be arch-agnostic, right? What would the correct fix be, then? I take
> > > from your message that using atomics through pmd_populate() here is
> > > not needed, but what accessors should be used instead? I am not
> > > familiar at all with this part of the kernel.
> >
> > I don't think there's a better accessor.
> >
> > >> I'd honestly much rather just disable split lock support in 32-bit
> > >> builds than mess with this stuff. You really shouldn't be running
> > >> 32-but kernels on this hardware.
> > >
> > > Why? Is it unsupported?
> >
> > Yes, it's effectively unsupported. We're not adding new hardware
> > features to 32-bit. The fact that split lock detection got enabled
> > was an accident.
>
> We do accept well-tested fixes and minor enablement patches though,
> within reason - but indeed this page table entry alignment quirk added
> for the sake of a split-lock debugging false positive doesn't seem to
> be worth it.

What would happen if you ran a 32-bit VM on such hardware? If the
split lock detection on the guest were disabled, would the host get
the fault instead?

> > It's not a technical reason. It's a practical one: I don't want to
> > spend time reviewing the fixes and dealing with the fallout and
> > regressions that the fixes might cause.
>
> Yeah, so it's an indirect technical argument: fixes *with tradeoffs*
> like this one have a future maintenance & robustness cost. Fixes
> without tradeoffs are fine of course.

What tradeoffs are there with this patch? This would not affect the
page tables, since those are already properly aligned. Forcing
alignment of stack variables is only a problem if it tickles a
compiler bug.

Brian Gerst

2024-04-03 15:55:35

by Javier Pello

[permalink] [raw]
Subject: Re: [PATCH 1/1] x86/mm/pae: Align up pteval_t, pmdval_t and pudval_t to avoid split locks

On Tue, 2 Apr 2024 10:43:26 -0700, Dave Hansen wrote:

> Fair enough. I can totally understand wanting the convenience.
> But you're leaving _so_ much performance on the floor that split
> locks are the least of your problems.

Split locks may not be a problem, but tasks stuck in kernel mode
are. :-)

> > I see. So just annotating the variable on the stack with
> > __aligned(8) should do it? But the code is under mm/, so it
> > should be arch-agnostic, right? What would the correct fix be,
> > then? I take from your message that using atomics through
> > pmd_populate() here is not needed, but what accessors should
> > be used instead? I am not familiar at all with this part of the
> > kernel.
>
> I don't think there's a better accessor.

That is what I thought, too, after looking through the code.

My assessment, then, is the following: The code in mm/huge_memory.c
uses a variable of type pmd_t on the stack, and uses pmd_populate()
to initialise it. On x86-32 with PAE, that variable is 8 bytes in
size but has an alignment of 4 bytes, and pmd_populate(), through
native_set_pmd(), uses an 8-byte atomic instruction on it. If the
variable is unlucky enough to span two cache lines, this triggers
an exception in kernel mode, and the kernel oopses. This is not
merely a false positive: a task gets stuck (repeatedly, in my
case) with user-visible consequences, and that is not nice.

I can see three ways forward:

- The first way to handle this (and the one requiring most work)
is to get rid of the lock altogether. There are two places in
mm/huge_memory.c (in __split_huge_zero_page_pmd, around line 2078,
and in __split_huge_pmd_locked, around line 2237) where a
temporary pmd_t is passed to pmd_populate() just to get a pte_t
for an address:

pmd_populate(mm, &_pmd, pgtable);
pte = pte_offset_map(&_pmd, haddr);
VM_BUG_ON(!pte);

Variable _pmd is never used afterwards in either of the cases.
Such a temporary variable is not subject to concurrent access,
but (correct me if I am wrong) pmd_populate() is designed for the
general case and uses atomics, at least on x86. Avoiding atomics
here would benefit everyone, since they are not needed, as you
pointed out, but pmd_populate() is an arch hook, so the new
"pmd_populate+pte_offset_map without an intermediate pmd_t" would
also have to be. A possible course of action for this would be to
first add a generic pmd_populate+pte_offset_map implementation in
mm/pgtable-generic.c inside a __HAVE_ARCH_... guard and then add
per-arch specific variants. This would remove the need for
alignment because the new hook could dispense with atomics
entirely, knowing that it is dealing with a local variable.

- The second way would be to keep the use of pmd_populate() as it
is, but make the variable aligned so that it does not trip over a
split lock. This is what my original patch intended to do, because
I had not realised that the lock is not needed, as you pointed out.
On my system, this increases the size of the kernel code by 7K, but
increasing the alignment of a variable should not have any other
impact.

- The third way would be to disable split lock detection on x86-32.
This can be as simple as setting the default to "none" in
sld_state_setup(). Not the most elegant of solutions, but beats
having unresponsive tasks.

Would going for the first one be worth the trouble?

Regards,
Javier

2024-04-03 15:58:46

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 1/1] x86/mm/pae: Align up pteval_t, pmdval_t and pudval_t to avoid split locks

On 4/3/24 08:54, Javier Pello wrote:
> - The third way would be to disable split lock detection on x86-32.
> This can be as simple as setting the default to "none" in
> sld_state_setup(). Not the most elegant of solutions, but beats
> having unresponsive tasks.
>
> Would going for the first one be worth the trouble?

No, it's not worth it. Let's just disable split lock detection on
32-bit and move on with life.

2024-04-04 07:57:06

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 1/1] x86/mm/pae: Align up pteval_t, pmdval_t and pudval_t to avoid split locks


* Brian Gerst <[email protected]> wrote:

> > > It's not a technical reason. It's a practical one: I don't want
> > > to spend time reviewing the fixes and dealing with the fallout
> > > and regressions that the fixes might cause.
> >
> > Yeah, so it's an indirect technical argument: fixes *with
> > tradeoffs* like this one have a future maintenance & robustness
> > cost. Fixes without tradeoffs are fine of course.
>
> What tradeoffs are there with this patch? This would not affect the
> page tables, since those are already properly aligned. Forcing
> alignment of stack variables is only a problem if it tickles a
> compiler bug.

It creates extra constraints on stack layout that wasn't there before,
so it can only be an invariant if the compiler can reorder variables,
or make the stack layout worse (introducing more holes).

Thanks,

Ingo

2024-04-04 15:27:27

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 1/1] x86/mm/pae: Align up pteval_t, pmdval_t and pudval_t to avoid split locks

On Wed, Apr 03, 2024, Dave Hansen wrote:
> On 4/3/24 08:54, Javier Pello wrote:
> > - The third way would be to disable split lock detection on x86-32.
> > This can be as simple as setting the default to "none" in
> > sld_state_setup(). Not the most elegant of solutions, but beats
> > having unresponsive tasks.
> >
> > Would going for the first one be worth the trouble?
>
> No, it's not worth it. Let's just disable split lock detection on
> 32-bit and move on with life.

Please don't paper over the kernel flaw by disabling split lock detection. As
Brian alluded to with his question:

: What would happen if you ran a 32-bit VM on such hardware? If the
: split lock detection on the guest were disabled, would the host get
: the fault instead?

running these kernels under a hypervisor that has enabled split-lock detection,
or Intel's newfangled BUS_LOCK_DETECTION VM-Exit, will result in silently
degraded performance for the guest kernel. The split-lock will trap to the
hypervisor, which will likely throttle the vCPU to guard against a DoS attack,
e.g. under KVM, IIRC the default behavior for split-lock is to stall the task
for 10 _milliseconds_.

In other words, practically speaking this isn't about supporting a new hardware
feature on 32-bit kernels, it's about preserving performance in real world
scenarios when running 32-bit kernels on new hardware. And that means keeping
split-lock detection enabled on 32-bit kernels is a good thing, as kernels bugs
that would cause hard-to-debug _customer_ issues when running 32-bit Linux in a
VM will show up as easy-to-debug splats when running 32-bit kernels on bare
metal. I doubt there are many people that are still running 32-bit kernels on
bare metal, but any coverage we can get would be very nice to have.

2024-04-04 15:37:19

by Brian Gerst

[permalink] [raw]
Subject: Re: [PATCH 1/1] x86/mm/pae: Align up pteval_t, pmdval_t and pudval_t to avoid split locks

On Thu, Apr 4, 2024 at 3:56 AM Ingo Molnar <[email protected]> wrote:
>
>
> * Brian Gerst <[email protected]> wrote:
>
> > > > It's not a technical reason. It's a practical one: I don't want
> > > > to spend time reviewing the fixes and dealing with the fallout
> > > > and regressions that the fixes might cause.
> > >
> > > Yeah, so it's an indirect technical argument: fixes *with
> > > tradeoffs* like this one have a future maintenance & robustness
> > > cost. Fixes without tradeoffs are fine of course.
> >
> > What tradeoffs are there with this patch? This would not affect the
> > page tables, since those are already properly aligned. Forcing
> > alignment of stack variables is only a problem if it tickles a
> > compiler bug.
>
> It creates extra constraints on stack layout that wasn't there before,
> so it can only be an invariant if the compiler can reorder variables,
> or make the stack layout worse (introducing more holes).

For the record, stack alignments are not rare.

objdump -dr vmlinux | grep "and.*,%.sp$" | wc -l

64-bit defconfig: 646
32-bit defconfig: 449


Brian Gerst

2024-04-04 16:30:20

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 1/1] x86/mm/pae: Align up pteval_t, pmdval_t and pudval_t to avoid split locks

On 4/4/24 08:36, Brian Gerst wrote:
>> It creates extra constraints on stack layout that wasn't there before,
>> so it can only be an invariant if the compiler can reorder variables,
>> or make the stack layout worse (introducing more holes).
> For the record, stack alignments are not rare.
>
> objdump -dr vmlinux | grep "and.*,%.sp$" | wc -l
>
> 64-bit defconfig: 646
> 32-bit defconfig: 449

It's not about stack alignment for me.

The bigger question is: are we going to *maintain* the 32-bit code base
to keep it free of split locks.

It's not even just a question of how many more of these are lurking
today. Any future code that (for instance) checks
system_has_cmpxchg64() and then goes to do something fancy on the stack,
or with structure members that aren't carefully aligned will need
maintenance.

Sean is basically saying, "Please keep this feature so that it's easy to
find these on 32-bit and maintain the code base to preserve
performance". That's a perfectly reasonable argument, but only if we
decide that it's worth maintaining the 32-bit code base in this way.

2024-04-04 18:52:48

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 1/1] x86/mm/pae: Align up pteval_t, pmdval_t and pudval_t to avoid split locks

On 4/4/24 08:26, Sean Christopherson wrote:
> In other words, practically speaking this isn't about supporting a new hardware
> feature on 32-bit kernels, it's about preserving performance in real world
> scenarios when running 32-bit kernels on new hardware.

Realistically, most of the 32-bit kernels in the world are going to be
*OLD* distros, right? Old CentOS/RHEL/SLES kernels from before the
kernel had split lock detection, or split lock fixes. Those trip over
VMM split lock detection now, and presumably will forever.

I suspect new CentOS/RHEL/SLES kernels that have split lock detection
all happened after 32-bit support was dropped from those distros.

I think that basically leaves Debian. Someone would need to:

1. Make a *new* 32-bit Debian distro install (or one of the other
less common distros that still do 32-bit)
2. Run it on hardware with split lock detection
3. On a VMM that enables split lock detection
4. Stay close enough to mainline to get split lock fixes (like from
this thread)
5. Care about performance, despite *ACTIVELY* choosing a 32-bit distro
on 64-bit hardware in 2024

Those steps are certainly possible. I'm just not sure how much trouble
we want to go to in 2024 to support people that choose new 32-bit
distros and desire performance. It feels to me to be approaching "I
want a pony" territory.

Or am I just lacking empathy today? :)

2024-04-04 21:55:48

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 1/1] x86/mm/pae: Align up pteval_t, pmdval_t and pudval_t to avoid split locks

On Thu, Apr 04, 2024, Dave Hansen wrote:
> On 4/4/24 08:26, Sean Christopherson wrote:
> > In other words, practically speaking this isn't about supporting a new hardware
> > feature on 32-bit kernels, it's about preserving performance in real world
> > scenarios when running 32-bit kernels on new hardware.
>
> Realistically, most of the 32-bit kernels in the world are going to be
> *OLD* distros, right? Old CentOS/RHEL/SLES kernels from before the
> kernel had split lock detection, or split lock fixes. Those trip over
> VMM split lock detection now, and presumably will forever.
>
> I suspect new CentOS/RHEL/SLES kernels that have split lock detection
> all happened after 32-bit support was dropped from those distros.
>
> I think that basically leaves Debian. Someone would need to:
>
> 1. Make a *new* 32-bit Debian distro install (or one of the other
> less common distros that still do 32-bit)
> 2. Run it on hardware with split lock detection
> 3. On a VMM that enables split lock detection
> 4. Stay close enough to mainline to get split lock fixes (like from
> this thread)
> 5. Care about performance, despite *ACTIVELY* choosing a 32-bit distro
> on 64-bit hardware in 2024
>
> Those steps are certainly possible. I'm just not sure how much trouble
> we want to go to in 2024 to support people that choose new 32-bit
> distros and desire performance.

I'm worried about a scenario where the throttling is so bad that it's not a perf
issue, but a functional issue ("performance" was a bad choice of word).

I do agree that the probability of this being a real problem is super low, but at
the same time it doesn't seem too onerous to clean up. And in the unlikely case
that this does cause a problem, the pain on our end can be quite high.

> It feels to me to be approaching "I want a pony" territory.
>
> Or am I just lacking empathy today? :)

Nah, I'm probably asking for a pony, but AFAICT it's a pretty cheap pony.