This series fixes two regressions: a boot failure on AMD K7 and a
performance regression on everything.
I did a double-take here -- the regressions were reported by different
people, both named Krzysztof :)
Andy Lutomirski (4):
x86/fpu: Add kernel_fpu_begin_mask() to selectively initialize state
x86/mmx: Use KFPU_MMX for MMX string operations
x86/fpu: Make the EFI FPU calling convention explicit
x86/fpu/64: Don't FNINIT in kernel_fpu_begin()
arch/x86/include/asm/efi.h | 4 ++--
arch/x86/include/asm/fpu/api.h | 35 ++++++++++++++++++++++++++++++++--
arch/x86/kernel/fpu/core.c | 17 +++++++++++------
arch/x86/lib/mmx_32.c | 10 +++++-----
arch/x86/platform/efi/efi_64.c | 2 +-
5 files changed, 52 insertions(+), 16 deletions(-)
--
2.29.2
EFI uses kernel_fpu_begin() to conform to the UEFI calling convention.
This specifically requires initializing FCW, whereas no sane 64-bit kernel
code should use legacy 387 operations that reference FCW.
Add KFPU_EFI to make this self-documenting, and use it in the EFI code.
This should enable us to safely change the default semantics of
kernel_fpu_begin() to stop initializing FCW on 64-bit kernels.
Cc: Arnd Bergmann <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/efi.h | 4 ++--
arch/x86/include/asm/fpu/api.h | 7 +++++++
arch/x86/platform/efi/efi_64.c | 2 +-
3 files changed, 10 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index bc9758ef292e..c60be69a5c82 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -71,7 +71,7 @@ extern unsigned long efi_fw_vendor, efi_config_table;
#ifdef CONFIG_X86_32
#define arch_efi_call_virt_setup() \
({ \
- kernel_fpu_begin(); \
+ kernel_fpu_begin_mask(KFPU_EFI); \
firmware_restrict_branch_speculation_start(); \
})
@@ -107,7 +107,7 @@ struct efi_scratch {
#define arch_efi_call_virt_setup() \
({ \
efi_sync_low_kernel_mappings(); \
- kernel_fpu_begin(); \
+ kernel_fpu_begin_mask(KFPU_EFI); \
firmware_restrict_branch_speculation_start(); \
efi_switch_mm(&efi_mm); \
})
diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index 133907a200ef..e95a06845443 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -25,6 +25,13 @@
#define KFPU_XYZMM _BITUL(1) /* MXCSR will be initialized */
#define KFPU_MMX 0 /* nothing gets initialized */
+/*
+ * The UEFI calling convention (UEFI spec 2.3.2 and 2.3.4) requires
+ * that FCW (32-bit and 64-bit) and MXCSR (64-bit) be initialized
+ * prior to calling UEFI code.
+ */
+#define KFPU_EFI (KFPU_387 | KFPU_XYZMM)
+
extern void kernel_fpu_begin_mask(unsigned int kfpu_mask);
extern void kernel_fpu_end(void);
extern bool irq_fpu_usable(void);
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 8f5759df7776..c304c8da862b 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -848,7 +848,7 @@ efi_set_virtual_address_map(unsigned long memory_map_size,
virtual_map);
efi_switch_mm(&efi_mm);
- kernel_fpu_begin();
+ kernel_fpu_begin_mask(KFPU_EFI);
/* Disable interrupts around EFI calls: */
local_irq_save(flags);
--
2.29.2
The remaining callers of kernel_fpu_begin() in 64-bit kernels don't use 387
instructions, so there's no need to sanitize FCW. Skip it to get back the
performance we lost.
Reported-by: Krzysztof Olędzki <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/fpu/api.h | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index e95a06845443..6e826796a734 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -40,7 +40,19 @@ extern void fpregs_mark_activate(void);
/* Code that is unaware of kernel_fpu_begin_mask() can use this */
static inline void kernel_fpu_begin(void)
{
+#ifdef CONFIG_X86_64
+ /*
+ * Any 64-bit code that uses 387 instructions must explicitly request
+ * KFPU_387.
+ */
+ kernel_fpu_begin_mask(KFPU_XYZMM);
+#else
+ /*
+ * 32-bit kernel code may use 387 operations as well as SSE2, etc,
+ * as long as it checks that the CPU has the required capability.
+ */
kernel_fpu_begin_mask(KFPU_387 | KFPU_XYZMM);
+#endif
}
/*
--
2.29.2
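To illustrate the new 64-bit default from the patch above, here is a minimal,
hypothetical sketch (not part of the series) of 64-bit code that still uses
legacy 387 instructions and must now request FCW initialization explicitly:

#include <asm/fpu/api.h>

/*
 * Hypothetical example: after this patch, plain kernel_fpu_begin()
 * no longer executes FNINIT on x86-64, so 64-bit code that depends
 * on a sane FCW must ask for KFPU_387 itself.
 */
static void x87_user_sketch(void)
{
	kernel_fpu_begin_mask(KFPU_387 | KFPU_XYZMM);
	/* ... legacy 387 operations that rely on the default FCW ... */
	kernel_fpu_end();
}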
The default kernel_fpu_begin() doesn't work on systems that support XMM but
haven't yet enabled OSFXSR. This causes crashes when _mmx_memcpy() is
called too early.
Fix it by using kernel_fpu_begin_mask(KFPU_MMX) explicitly. This should also
be faster, since it skips both the reasonably fast LDMXCSR and the rather
slow FNINIT instructions.
Fixes: 7ad816762f9b ("x86/fpu: Reset MXCSR to default in kernel_fpu_begin()")
Reported-by: Krzysztof Mazur <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/lib/mmx_32.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/arch/x86/lib/mmx_32.c b/arch/x86/lib/mmx_32.c
index 4321fa02e18d..daa80fa005fb 100644
--- a/arch/x86/lib/mmx_32.c
+++ b/arch/x86/lib/mmx_32.c
@@ -37,7 +37,7 @@ void *_mmx_memcpy(void *to, const void *from, size_t len)
p = to;
i = len >> 6; /* len/64 */
- kernel_fpu_begin();
+ kernel_fpu_begin_mask(KFPU_MMX);
__asm__ __volatile__ (
"1: prefetch (%0)\n" /* This set is 28 bytes */
@@ -127,7 +127,7 @@ static void fast_clear_page(void *page)
{
int i;
- kernel_fpu_begin();
+ kernel_fpu_begin_mask(KFPU_MMX);
__asm__ __volatile__ (
" pxor %%mm0, %%mm0\n" : :
@@ -160,7 +160,7 @@ static void fast_copy_page(void *to, void *from)
{
int i;
- kernel_fpu_begin();
+ kernel_fpu_begin_mask(KFPU_MMX);
/*
* maybe the prefetch stuff can go before the expensive fnsave...
@@ -247,7 +247,7 @@ static void fast_clear_page(void *page)
{
int i;
- kernel_fpu_begin();
+ kernel_fpu_begin_mask(KFPU_MMX);
__asm__ __volatile__ (
" pxor %%mm0, %%mm0\n" : :
@@ -282,7 +282,7 @@ static void fast_copy_page(void *to, void *from)
{
int i;
- kernel_fpu_begin();
+ kernel_fpu_begin_mask(KFPU_MMX);
__asm__ __volatile__ (
"1: prefetch (%0)\n"
--
2.29.2
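For reference, a minimal sketch of the pattern the patch above adopts, with
the failure mode it avoids spelled out in comments (the function itself is
illustrative, not from the patch):

#include <asm/fpu/api.h>

/*
 * Sketch: on a CPU that has XMM, code running before CR4.OSFXSR is
 * set (early boot) cannot execute LDMXCSR (it raises #UD), so any
 * kernel_fpu_begin() variant that initializes MXCSR would crash.
 * KFPU_MMX requests no FCW or MXCSR initialization at all, which
 * the patch asserts is safe because MMX code uses neither.
 */
static void mmx_user_sketch(void)
{
	kernel_fpu_begin_mask(KFPU_MMX);
	/* ... MMX-only work in %mm0-%mm7 ... */
	kernel_fpu_end();
}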
Currently, requesting kernel FPU access doesn't distinguish which parts of
the extended ("FPU") state are needed. This is nice for simplicity, but
there are a few cases in which it's suboptimal:
- The vast majority of in-kernel FPU users want XMM/YMM/ZMM state but do
not use legacy 387 state. These users want MXCSR initialized but don't
care about the FPU control word. Skipping FNINIT would save time.
(Empirically, FNINIT is several times slower than LDMXCSR.)
- Code that wants MMX doesn't want need MXCSR or FCW initialized.
_mmx_memcpy(), for example, can run before CR4.OSFXSR gets set, and
initializing MXCSR will fail.
- Any future in-kernel users of XFD (eXtended Feature Disable)-capable
dynamic states will need special handling.
This patch adds a more specific API that allows callers to specify exactly
what they want.
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/fpu/api.h | 16 ++++++++++++++--
arch/x86/kernel/fpu/core.c | 17 +++++++++++------
2 files changed, 25 insertions(+), 8 deletions(-)
diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index dcd9503b1098..133907a200ef 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -16,14 +16,26 @@
* Use kernel_fpu_begin/end() if you intend to use FPU in kernel context. It
* disables preemption so be careful if you intend to use it for long periods
* of time.
- * If you intend to use the FPU in softirq you need to check first with
+ * If you intend to use the FPU in irq/softirq you need to check first with
* irq_fpu_usable() if it is possible.
*/
-extern void kernel_fpu_begin(void);
+
+/* Kernel FPU states to initialize in kernel_fpu_begin_mask() */
+#define KFPU_387 _BITUL(0) /* FCW will be initialized */
+#define KFPU_XYZMM _BITUL(1) /* MXCSR will be initialized */
+#define KFPU_MMX 0 /* nothing gets initialized */
+
+extern void kernel_fpu_begin_mask(unsigned int kfpu_mask);
extern void kernel_fpu_end(void);
extern bool irq_fpu_usable(void);
extern void fpregs_mark_activate(void);
+/* Code that is unaware of kernel_fpu_begin_mask() can use this */
+static inline void kernel_fpu_begin(void)
+{
+ kernel_fpu_begin_mask(KFPU_387 | KFPU_XYZMM);
+}
+
/*
* Use fpregs_lock() while editing CPU's FPU registers or fpu->state.
* A context switch will (and softirq might) save CPU's FPU registers to
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index eb86a2b831b1..52d05c806aa6 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -121,7 +121,7 @@ int copy_fpregs_to_fpstate(struct fpu *fpu)
}
EXPORT_SYMBOL(copy_fpregs_to_fpstate);
-void kernel_fpu_begin(void)
+void kernel_fpu_begin_mask(unsigned int kfpu_mask)
{
preempt_disable();
@@ -141,13 +141,18 @@ void kernel_fpu_begin(void)
}
__cpu_invalidate_fpregs_state();
- if (boot_cpu_has(X86_FEATURE_XMM))
- ldmxcsr(MXCSR_DEFAULT);
+ /* Put sane initial values into the control registers. */
+ if (likely(kfpu_mask & KFPU_XYZMM)) {
+ if (boot_cpu_has(X86_FEATURE_XMM))
+ ldmxcsr(MXCSR_DEFAULT);
+ }
- if (boot_cpu_has(X86_FEATURE_FPU))
- asm volatile ("fninit");
+ if (unlikely(kfpu_mask & KFPU_387)) {
+ if (boot_cpu_has(X86_FEATURE_FPU))
+ asm volatile ("fninit");
+ }
}
-EXPORT_SYMBOL_GPL(kernel_fpu_begin);
+EXPORT_SYMBOL_GPL(kernel_fpu_begin_mask);
void kernel_fpu_end(void)
{
--
2.29.2
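For orientation, a minimal usage sketch of the API this patch introduces (the
caller is hypothetical; the mask names and the irq_fpu_usable() rule are from
the patch itself):

#include <asm/fpu/api.h>

static void simd_user_sketch(void)
{
	/* In irq/softirq context, availability must be checked first. */
	if (!irq_fpu_usable())
		return;

	/* MXCSR gets initialized; FNINIT is skipped. */
	kernel_fpu_begin_mask(KFPU_XYZMM);
	/* ... XMM/YMM/ZMM code; preemption stays disabled until ... */
	kernel_fpu_end();
}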
On 2021-01-17 at 22:20, Andy Lutomirski wrote:
> This series fixes two regressions: a boot failure on AMD K7 and a
> performance regression on everything.
>
> I did a double-take here -- the regressions were reported by different
> people, both named Krzysztof :)
>
> Andy Lutomirski (4):
> x86/fpu: Add kernel_fpu_begin_mask() to selectively initialize state
> x86/mmx: Use KFPU_MMX for MMX string operations
> x86/fpu: Make the EFI FPU calling convention explicit
> x86/fpu/64: Don't FNINIT in kernel_fpu_begin()
Thank you so much Andy!
What a coincidence! Sadly, my AMD K7 is sitting somewhere in a closet, on a
different continent, and last ran Linux over 10 years ago. :/ However, I can
offer some testing on different AMD & Intel CPUs.
Now... It is 12 AM here, so I tested very quickly and only on 5.4-stable,
where I initially noticed the problem. The patches apply almost cleanly to
this release; the only conflict is that arch/x86/platform/efi/efi_64.c does
not have a kernel_fpu_begin() call to update. The kernel compiles and boots.
Here are the results for:
Intel(R) Xeon(R) CPU E3-1280 V2 @ 3.60GHz (family: 0x6, model: 0x3a,
stepping: 0x9)
5.4-stable (with "Reset MXCSR to default in kernel_fpu_begin"):
avx : 21072.000 MB/sec
prefetch64-sse: 20392.000 MB/sec
generic_sse: 18572.000 MB/sec
xor: using function: avx (21072.000 MB/sec)
5.4-stable-c4db485dd3f2378b4923503aed995f7816e265b7-revert:
avx : 33764.000 MB/sec
prefetch64-sse: 23432.000 MB/sec
generic_sse: 21036.000 MB/sec
xor: using function: avx (33764.000 MB/sec)
5.4-stable-kernel_fpu_begin_mask:
avx : 23576.000 MB/sec
prefetch64-sse: 23024.000 MB/sec
generic_sse: 20880.000 MB/sec
xor: using function: avx (23576.000 MB/sec)
So, the performance regression for prefetch64-sse and generic_sse is almost
gone, but the AVX code is still impacted. It is not hit as hard as before,
but the slowdown is still noticeable, and AVX is now barely faster than the
fixed prefetch64-sse.
I'm going to test the patches on 5.10 / 5.11-rc, and on different CPUs, to
make sure that what I saw on 5.4 is not due to incorrect backporting.
However, this may have to wait until Tuesday / Wednesday due to family
duties, as Monday is a holiday here.
Best regards,
Krzysztof Olędzki
On Sun, Jan 17, 2021 at 10:20:40PM -0800, Andy Lutomirski wrote:
> EFI uses kernel_fpu_begin() to conform to the UEFI calling convention.
> This specifically requires initializing FCW, whereas no sane 64-bit kernel
> code should use legacy 387 operations that reference FCW.
>
> Add KFPU_EFI to make this self-documenting, and use it in the EFI code.
I'd prefer if you slap a comment over the kernel_fpu_begin() calls in
efi instead of adding a separate define etc.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
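A sketch of the alternative Boris suggests (hypothetical; the series as
posted keeps the KFPU_EFI define):

#include <asm/fpu/api.h>

static void efi_fpu_begin_sketch(void)
{
	/*
	 * UEFI calling convention (spec 2.3.2 and 2.3.4): FCW and
	 * MXCSR must be initialized before calling UEFI code, so
	 * spell the mask out at each call site instead of defining
	 * KFPU_EFI.
	 */
	kernel_fpu_begin_mask(KFPU_387 | KFPU_XYZMM);
	/* ... firmware call ... */
	kernel_fpu_end();
}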
On Sun, Jan 17, 2021 at 10:20:38PM -0800, Andy Lutomirski wrote:
> - Code that wants MMX doesn't want need MXCSR or FCW initialized.
> _mmx_memcpy(), for example, can run before CR4.OSFXSR gets set, and
> initializing MXCSR will fail.
> +#define KFPU_MMX 0 /* nothing gets initialized */
This... why is that correct?
From: Andy Lutomirski
> Sent: 18 January 2021 06:21
>
> Currently, requesting kernel FPU access doesn't distinguish which parts of
> the extended ("FPU") state are needed. This is nice for simplicity, but
> there are a few cases in which it's suboptimal:
>
> - The vast majority of in-kernel FPU users want XMM/YMM/ZMM state but do
> not use legacy 387 state. These users want MXCSR initialized but don't
> care about the FPU control word. Skipping FNINIT would save time.
> (Empirically, FNINIT is several times slower than LDMXCSR.)
>
> - Code that wants MMX doesn't want need MXCSR or FCW initialized.
> _mmx_memcpy(), for example, can run before CR4.OSFXSR gets set, and
> initializing MXCSR will fail.
>
> - Any future in-kernel users of XFD (eXtended Feature Disable)-capable
> dynamic states will need special handling.
>
> This patch adds a more specific API that allows callers to specify exactly
> what they want.
Is it worth returning whether the required fpu feature is available?
Or, maybe optionally, available cheaply?
There are also code fragments that really just want one or two
[xyz]mm registers to speed something up.
For instance, PCIe reads can be a lot faster if a wide register
can be used.
David
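A rough sketch of the kind of use David describes (entirely hypothetical:
the function, the SSE2 assumption, and the choice of KFPU_MMX as
"initialize nothing" are illustrative only):

#include <asm/fpu/api.h>

/*
 * Hypothetical 16-byte MMIO read through one XMM register (assumes
 * SSE2 is present and usable). Plain register moves raise no SIMD
 * FP exceptions, so MXCSR does not need to be initialized and
 * KFPU_MMX ("initialize nothing") would suffice.
 */
static void pcie_read_128_sketch(void *dst, const volatile void *src)
{
	kernel_fpu_begin_mask(KFPU_MMX);
	asm volatile("movdqu (%1), %%xmm0\n\t"
		     "movdqu %%xmm0, (%0)"
		     : : "r" (dst), "r" (src)
		     : "xmm0", "memory");
	kernel_fpu_end();
}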
Just nitpicks:
On Sun, Jan 17, 2021 at 10:20:38PM -0800, Andy Lutomirski wrote:
> Currently, requesting kernel FPU access doesn't distinguish which parts of
> the extended ("FPU") state are needed. This is nice for simplicity, but
> there are a few cases in which it's suboptimal:
>
> - The vast majority of in-kernel FPU users want XMM/YMM/ZMM state but do
> not use legacy 387 state. These users want MXCSR initialized but don't
> care about the FPU control word. Skipping FNINIT would save time.
> (Empirically, FNINIT is several times slower than LDMXCSR.)
>
> - Code that wants MMX doesn't want need MXCSR or FCW initialized.
"want/need" ?
> _mmx_memcpy(), for example, can run before CR4.OSFXSR gets set, and
> initializing MXCSR will fail.
"... because LDMXCSR generates an #UD when the aforementioned CR4 bit is
not set."
> - Any future in-kernel users of XFD (eXtended Feature Disable)-capable
> dynamic states will need special handling.
>
> This patch adds a more specific API that allows callers to specify exactly
s/This patch adds/Add/
> what they want.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette