This series fixes two regressions: a boot failure on AMD K7 and a
performance regression on everything.
I did a double-take here -- the regressions were reported by different
people, both named Krzysztof :)
Changes from v1:
- Fix MMX better -- MMX really does need FNINIT.
- Improve the EFI code.
- Rename the KFPU constants.
- Changelog improvements.
Andy Lutomirski (4):
x86/fpu: Add kernel_fpu_begin_mask() to selectively initialize state
x86/mmx: Use KFPU_387 for MMX string operations
x86/fpu: Make the EFI FPU calling convention explicit
x86/fpu/64: Don't FNINIT in kernel_fpu_begin()
arch/x86/include/asm/efi.h | 24 ++++++++++++++++++++----
arch/x86/include/asm/fpu/api.h | 27 +++++++++++++++++++++++++--
arch/x86/kernel/fpu/core.c | 17 +++++++++++------
arch/x86/lib/mmx_32.c | 20 +++++++++++++++-----
arch/x86/platform/efi/efi_64.c | 4 ++--
5 files changed, 73 insertions(+), 19 deletions(-)
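For convenience, the interface added by patch 1 looks roughly like this
(simplified sketch; bookkeeping elided, see the patch itself for the full
implementation in arch/x86/kernel/fpu/core.c):

  /* Kernel FPU states to initialize in kernel_fpu_begin_mask() */
  #define KFPU_387    _BITUL(0)   /* 387 state (FCW, stack) will be initialized */
  #define KFPU_MXCSR  _BITUL(1)   /* MXCSR will be reset to its default value */

  void kernel_fpu_begin_mask(unsigned int kfpu_mask)
  {
          /* Preemption and FPU state-saving logic elided. */

          if (boot_cpu_has(X86_FEATURE_XMM) && (kfpu_mask & KFPU_MXCSR)) {
                  unsigned int mxcsr = MXCSR_DEFAULT;

                  asm volatile("ldmxcsr %0" : : "m" (mxcsr));
          }

          if (boot_cpu_has(X86_FEATURE_FPU) && (kfpu_mask & KFPU_387))
                  asm volatile("fninit");
  }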
--
2.29.2
The default kernel_fpu_begin() doesn't work on systems that support XMM but
haven't yet enabled CR4.OSFXSR. This causes crashes when _mmx_memcpy() is
called too early because LDMXCSR generates #UD when the aforementioned bit
is clear.
Fix it by using kernel_fpu_begin_mask(KFPU_387) explicitly.
Fixes: 7ad816762f9b ("x86/fpu: Reset MXCSR to default in kernel_fpu_begin()")
Reported-by: Krzysztof Mazur <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/lib/mmx_32.c | 20 +++++++++++++++-----
1 file changed, 15 insertions(+), 5 deletions(-)
diff --git a/arch/x86/lib/mmx_32.c b/arch/x86/lib/mmx_32.c
index 4321fa02e18d..2a6ad7aa148a 100644
--- a/arch/x86/lib/mmx_32.c
+++ b/arch/x86/lib/mmx_32.c
@@ -26,6 +26,16 @@
#include <asm/fpu/api.h>
#include <asm/asm.h>
+/*
+ * For MMX, we use KFPU_387. MMX instructions are not affected by MXCSR,
+ * but both AMD and Intel documentation states that even integer MMX
+ * operations will result in #MF if an exception is pending in FCW.
+ *
+ * We don't need EMMS afterwards because, after we call kernel_fpu_end(),
+ * any subsequent user of the 387 stack will reinitialize it using
+ * KFPU_387.
+ */
+
void *_mmx_memcpy(void *to, const void *from, size_t len)
{
void *p;
@@ -37,7 +47,7 @@ void *_mmx_memcpy(void *to, const void *from, size_t len)
p = to;
i = len >> 6; /* len/64 */
- kernel_fpu_begin();
+ kernel_fpu_begin_mask(KFPU_387);
__asm__ __volatile__ (
"1: prefetch (%0)\n" /* This set is 28 bytes */
@@ -127,7 +137,7 @@ static void fast_clear_page(void *page)
{
int i;
- kernel_fpu_begin();
+ kernel_fpu_begin_mask(KFPU_387);
__asm__ __volatile__ (
" pxor %%mm0, %%mm0\n" : :
@@ -160,7 +170,7 @@ static void fast_copy_page(void *to, void *from)
{
int i;
- kernel_fpu_begin();
+ kernel_fpu_begin_mask(KFPU_387);
/*
* maybe the prefetch stuff can go before the expensive fnsave...
@@ -247,7 +257,7 @@ static void fast_clear_page(void *page)
{
int i;
- kernel_fpu_begin();
+ kernel_fpu_begin_mask(KFPU_387);
__asm__ __volatile__ (
" pxor %%mm0, %%mm0\n" : :
@@ -282,7 +292,7 @@ static void fast_copy_page(void *to, void *from)
{
int i;
- kernel_fpu_begin();
+ kernel_fpu_begin_mask(KFPU_387);
__asm__ __volatile__ (
"1: prefetch (%0)\n"
--
2.29.2
EFI uses kernel_fpu_begin() to conform to the UEFI calling convention.
This specifically requires initializing FCW (the x87 FPU Control Word), whereas no sane 64-bit kernel
code should use legacy 387 operations that reference FCW.
This should enable us to safely change the default semantics of
kernel_fpu_begin() to stop initializing FCW on 64-bit kernels.
Cc: Arnd Bergmann <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/efi.h | 24 ++++++++++++++++++++----
arch/x86/platform/efi/efi_64.c | 4 ++--
2 files changed, 22 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index bc9758ef292e..f3201544fbf8 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -68,17 +68,33 @@ extern unsigned long efi_fw_vendor, efi_config_table;
#f " called with too many arguments (" #p ">" #n ")"); \
})
+static inline void efi_fpu_begin(void)
+{
+ /*
+ * The UEFI calling convention (UEFI spec 2.3.2 and 2.3.4) requires
+ * that FCW and MXCSR (64-bit) must be initialized prior to calling
+ * UEFI code. (Oddly the spec does not require that the FPU stack
+ * be empty.)
+ */
+ kernel_fpu_begin_mask(KFPU_387 | KFPU_MXCSR);
+}
+
+static inline void efi_fpu_end(void)
+{
+ kernel_fpu_end();
+}
+
#ifdef CONFIG_X86_32
#define arch_efi_call_virt_setup() \
({ \
- kernel_fpu_begin(); \
+ efi_fpu_begin(); \
firmware_restrict_branch_speculation_start(); \
})
#define arch_efi_call_virt_teardown() \
({ \
firmware_restrict_branch_speculation_end(); \
- kernel_fpu_end(); \
+ efi_fpu_end(); \
})
#define arch_efi_call_virt(p, f, args...) p->f(args)
@@ -107,7 +123,7 @@ struct efi_scratch {
#define arch_efi_call_virt_setup() \
({ \
efi_sync_low_kernel_mappings(); \
- kernel_fpu_begin(); \
+ efi_fpu_begin(); \
firmware_restrict_branch_speculation_start(); \
efi_switch_mm(&efi_mm); \
})
@@ -119,7 +135,7 @@ struct efi_scratch {
({ \
efi_switch_mm(efi_scratch.prev_mm); \
firmware_restrict_branch_speculation_end(); \
- kernel_fpu_end(); \
+ efi_fpu_end(); \
})
#ifdef CONFIG_KASAN
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 8f5759df7776..f042040b5da1 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -848,7 +848,7 @@ efi_set_virtual_address_map(unsigned long memory_map_size,
virtual_map);
efi_switch_mm(&efi_mm);
- kernel_fpu_begin();
+ efi_fpu_begin();
/* Disable interrupts around EFI calls: */
local_irq_save(flags);
@@ -857,7 +857,7 @@ efi_set_virtual_address_map(unsigned long memory_map_size,
descriptor_version, virtual_map);
local_irq_restore(flags);
- kernel_fpu_end();
+ efi_fpu_end();
/* grab the virtually remapped EFI runtime services table pointer */
efi.runtime = READ_ONCE(systab->runtime);
--
2.29.2
The remaining callers of kernel_fpu_begin() in 64-bit kernels don't use 387
instructions, so there's no need to sanitize the FPU state. Skip it to get
most of the performance we lost back.
Reported-by: Krzysztof Olędzki <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/fpu/api.h | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index 38f4936045ab..435bc59d539b 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -32,7 +32,19 @@ extern void fpregs_mark_activate(void);
/* Code that is unaware of kernel_fpu_begin_mask() can use this */
static inline void kernel_fpu_begin(void)
{
+#ifdef CONFIG_X86_64
+ /*
+ * Any 64-bit code that uses 387 instructions must explicitly request
+ * KFPU_387.
+ */
+ kernel_fpu_begin_mask(KFPU_MXCSR);
+#else
+ /*
+ * 32-bit kernel code may use 387 operations as well as SSE2, etc,
+ * as long as it checks that the CPU has the required capability.
+ */
kernel_fpu_begin_mask(KFPU_387 | KFPU_MXCSR);
+#endif
}
/*
--
2.29.2
On 2021-01-19 at 09:38, Andy Lutomirski wrote:
> This series fixes two regressions: a boot failure on AMD K7 and a
> performance regression on everything.
>
> I did a double-take here -- the regressions were reported by different
> people, both named Krzysztof :)
>
> Changes from v1:
> - Fix MMX better -- MMX really does need FNINIT.
> - Improve the EFI code.
> - Rename the KFPU constants.
> - Changelog improvements.
>
> Andy Lutomirski (4):
> x86/fpu: Add kernel_fpu_begin_mask() to selectively initialize state
> x86/mmx: Use KFPU_387 for MMX string operations
> x86/fpu: Make the EFI FPU calling convention explicit
> x86/fpu/64: Don't FNINIT in kernel_fpu_begin()
Hi Andy.
I have quickly tested the new version on E3-1280 V2.
* 5.10.9 + 7ad816762f9bf89e940e618ea40c43138b479e10 reverted (aka unfixed)
xor: measuring software checksum speed
avx : 38616 MB/sec
prefetch64-sse : 25797 MB/sec
generic_sse : 23147 MB/sec
xor: using function: avx (38616 MB/sec)
* 5.10.9 (the original)
xor: measuring software checksum speed
avx : 23318 MB/sec
prefetch64-sse : 22562 MB/sec
generic_sse : 20431 MB/sec
xor: using function: avx (23318 MB/sec)
* 5.10.9 + "Reduce unnecessary FNINIT and MXCSR usage" v2
xor: measuring software checksum speed
avx : 26451 MB/sec
prefetch64-sse : 25777 MB/sec
generic_sse : 23178 MB/sec
xor: using function: avx (26451 MB/sec)
Overall, kernel xor benchmark reports better performance on 5.10.9 than
on 5.4.90 (see my prev e-mail), but the general trend is the same.
The "unfixed" kernel is much faster for all of avx, prefetch64-sse and
generic_sse. With the fix, we see the expected perf regression.
Now, with your patchset, both prefetch64-sse and generic_sse are able to
recover the full performance, as seen on 5.4. However, this is not the
case for avx. While there is still an improvement, it is nowhere close
to where it used to be.
I wonder why AVX still sees a regression and if anything more can be
done about it?
Will do more tests tomorrow.
Thanks,
Krzysztof
On Tue, Jan 19, 2021 at 09:39:02AM -0800, Andy Lutomirski wrote:
> The remaining callers of kernel_fpu_begin() in 64-bit kernels don't use 387
> instructions, so there's no need to sanitize the FPU state. Skip it to get
> most of the performance we lost back.
>
> Reported-by: Krzysztof Olędzki <[email protected]>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/include/asm/fpu/api.h | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
> index 38f4936045ab..435bc59d539b 100644
> --- a/arch/x86/include/asm/fpu/api.h
> +++ b/arch/x86/include/asm/fpu/api.h
> @@ -32,7 +32,19 @@ extern void fpregs_mark_activate(void);
> /* Code that is unaware of kernel_fpu_begin_mask() can use this */
> static inline void kernel_fpu_begin(void)
> {
> +#ifdef CONFIG_X86_64
> + /*
> + * Any 64-bit code that uses 387 instructions must explicitly request
> + * KFPU_387.
> + */
> + kernel_fpu_begin_mask(KFPU_MXCSR);
I'm also still sitting on this:
git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git x86/fpu
what do we do with that?
On Tue, Jan 19, 2021 at 09:39:00AM -0800, Andy Lutomirski wrote:
> The default kernel_fpu_begin() doesn't work on systems that support XMM but
> haven't yet enabled CR4.OSFXSR. This causes crashes when _mmx_memcpy() is
> called too early because LDMXCSR generates #UD when the aforementioned bit
> is clear.
>
> Fix it by using kernel_fpu_begin_mask(KFPU_387) explicitly.
>
> Fixes: 7ad816762f9b ("x86/fpu: Reset MXCSR to default in kernel_fpu_begin()")
> Reported-by: Krzysztof Mazur <[email protected]>
> Signed-off-by: Andy Lutomirski <[email protected]>
Cc: <[email protected]> I guess.
> ---
> arch/x86/lib/mmx_32.c | 20 +++++++++++++++-----
> 1 file changed, 15 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/lib/mmx_32.c b/arch/x86/lib/mmx_32.c
> index 4321fa02e18d..2a6ad7aa148a 100644
> --- a/arch/x86/lib/mmx_32.c
> +++ b/arch/x86/lib/mmx_32.c
> @@ -26,6 +26,16 @@
> #include <asm/fpu/api.h>
> #include <asm/asm.h>
>
> +/*
> + * For MMX, we use KFPU_387. MMX instructions are not affected by MXCSR,
> + * but both AMD and Intel documentation states that even integer MMX
> + * operations will result in #MF if an exception is pending in FCW.
> + *
> + * We don't need EMMS afterwards because, after we call kernel_fpu_end(),
> + * any subsequent user of the 387 stack will reinitialize it using
> + * KFPU_387.
Please use passive voice and convert the "we" to something impersonal.
For example:
"Use KFPU_387 for MMX. MMX instructions are not affected by MXCSR, but
both AMD and Intel documentation states that even integer MMX operations
will result in #MF if an exception is pending in FCW.
EMMS afterwards is not needed because, after kernel_fpu_end(), any
subsequent user of the 387 stack will reinitialize it using KFPU_387."
Voila, de-we-fied!
:-)
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Wed, Jan 20, 2021 at 11:07:11AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 19, 2021 at 09:39:02AM -0800, Andy Lutomirski wrote:
> > The remaining callers of kernel_fpu_begin() in 64-bit kernels don't use 387
> > instructions, so there's no need to sanitize the FPU state. Skip it to get
> > most of the performance we lost back.
> >
> > Reported-by: Krzysztof Olędzki <[email protected]>
> > Signed-off-by: Andy Lutomirski <[email protected]>
> > ---
> > arch/x86/include/asm/fpu/api.h | 12 ++++++++++++
> > 1 file changed, 12 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
> > index 38f4936045ab..435bc59d539b 100644
> > --- a/arch/x86/include/asm/fpu/api.h
> > +++ b/arch/x86/include/asm/fpu/api.h
> > @@ -32,7 +32,19 @@ extern void fpregs_mark_activate(void);
> > /* Code that is unaware of kernel_fpu_begin_mask() can use this */
> > static inline void kernel_fpu_begin(void)
> > {
> > +#ifdef CONFIG_X86_64
> > + /*
> > + * Any 64-bit code that uses 387 instructions must explicitly request
> > + * KFPU_387.
> > + */
> > + kernel_fpu_begin_mask(KFPU_MXCSR);
>
> I'm also still sitting on this:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git x86/fpu
>
> what do we do with that?
Yah, I'd prefer an actual explicit check infra for stuff like that
instead of us expecting callers to know what bits they would need to
supply in the mask and then inadvertently goofing it up, leading to
funky context corruption bugs...
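Something along these lines is the kind of thing I mean -- an untested
sketch, all names made up for illustration:

  /*
   * Hypothetical check infra: remember what kernel_fpu_begin_mask() was
   * asked to initialize and let in-kernel FPU users assert the bits
   * they rely on.
   */
  static DEFINE_PER_CPU(unsigned int, kfpu_active_mask);

  void kernel_fpu_begin_mask(unsigned int kfpu_mask)
  {
          /* ... existing begin logic, which disables preemption ... */
          this_cpu_write(kfpu_active_mask, kfpu_mask);
  }

  static inline void kfpu_assert_mask(unsigned int needed)
  {
          /* Catch callers that forgot to request e.g. KFPU_387. */
          WARN_ON_ONCE((this_cpu_read(kfpu_active_mask) & needed) != needed);
  }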
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Tue, Jan 19, 2021 at 11:51 PM Krzysztof Olędzki <[email protected]> wrote:
>
> On 2021-01-19 at 09:38, Andy Lutomirski wrote:
> > This series fixes two regressions: a boot failure on AMD K7 and a
> > performance regression on everything.
> >
> > I did a double-take here -- the regressions were reported by different
> > people, both named Krzysztof :)
> >
> > Changes from v1:
> > - Fix MMX better -- MMX really does need FNINIT.
> > - Improve the EFI code.
> > - Rename the KFPU constants.
> > - Changelog improvements.
> >
> > Andy Lutomirski (4):
> > x86/fpu: Add kernel_fpu_begin_mask() to selectively initialize state
> > x86/mmx: Use KFPU_387 for MMX string operations
> > x86/fpu: Make the EFI FPU calling convention explicit
> > x86/fpu/64: Don't FNINIT in kernel_fpu_begin()
>
> Hi Andy.
>
> I have quickly tested the new version on E3-1280 V2.
>
> * 5.10.9 + 7ad816762f9bf89e940e618ea40c43138b479e10 reverted (aka unfixed)
> xor: measuring software checksum speed
> avx : 38616 MB/sec
> prefetch64-sse : 25797 MB/sec
> generic_sse : 23147 MB/sec
> xor: using function: avx (38616 MB/sec)
>
> * 5.10.9 (the original)
> xor: measuring software checksum speed
> avx : 23318 MB/sec
> prefetch64-sse : 22562 MB/sec
> generic_sse : 20431 MB/sec
> xor: using function: avx (23318 MB/sec)
>
> * 5.10.9 + "Reduce unnecessary FNINIT and MXCSR usage" v2
> xor: measuring software checksum speed
> avx : 26451 MB/sec
> prefetch64-sse : 25777 MB/sec
> generic_sse : 23178 MB/sec
> xor: using function: avx (26451 MB/sec)
>
> Overall, kernel xor benchmark reports better performance on 5.10.9 than
> on 5.4.90 (see my prev e-mail), but the general trend is the same.
>
> The "unfixed" kernel is much faster for all of avx, prefetch64-sse and
> generic_sse. With the fix, we see the expected perf regression.
>
> Now, with your patchset, both prefetch64-sse and generic_sse are able to
> recover the full performance, as seen on 5.4. However, this is not the
> case for avx. While there is still an improvement, it is nowhere close
> to where it used to be.
>
> I wonder why AVX still sees a regression and if anything more can be
> done about it?
>
> Will do more tests tomorrow.
I'm moderately confident that the problem is that LDMXCSR is
considered a "legacy SSE" instruction and it's triggering the
VEX-to-SSE and SSE-to-VEX penalties. perf could tell you for sure,
and testing with VLDMXCSR might be informative.
I'm not sure whether the best solution is to try to use VLDMXCSR, to
throw some VZEROUPPER instructions in, or to try to get rid of MXCSR
initialization entirely for integer code. VZEROUPPER might be good
regardless, since for all we know, we're coming from user mode and
user mode could have been using SSE. If we add VZEROUPPER, we should
probably arrange to do it just once per user-FP-to-kernel-FP
transition.
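As a rough illustration of the VLDMXCSR option (untested sketch, helper
name invented here):

  static inline void kfpu_reset_mxcsr(void)
  {
          unsigned int mxcsr = MXCSR_DEFAULT;

          /*
           * Use the VEX-encoded VLDMXCSR when AVX is usable so that
           * resetting MXCSR doesn't mix legacy-SSE and VEX code; fall
           * back to the legacy encoding otherwise.
           */
          if (boot_cpu_has(X86_FEATURE_AVX))
                  asm volatile("vldmxcsr %0" : : "m" (mxcsr));
          else
                  asm volatile("ldmxcsr %0" : : "m" (mxcsr));
  }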
--Andy