2022-04-05 03:03:02

by Thomas Gleixner

Subject: [patch 3/3] x86/fpu/xsave: Optimize XSAVEC/S when XGETBV1 is supported

XSAVEC/S store the FPU state in compacted format, which avoids holes in the
memory image. The kernel uses this feature in a very naive way and merely
avoids the holes which come from unsupported features, like PT. That's a
marginal saving of 128 bytes vs. the uncompacted format on a SKL-X.

The first 576 bytes are fixed: 512 bytes of legacy (FP/SSE) state plus the
64-byte XSAVE header. On a SKL-X machine the other components are stored at
the following offsets:

xstate_offset[2]: 576, xstate_sizes[2]: 256
xstate_offset[3]: 832, xstate_sizes[3]: 64
xstate_offset[4]: 896, xstate_sizes[4]: 64
xstate_offset[5]: 960, xstate_sizes[5]: 64
xstate_offset[6]: 1024, xstate_sizes[6]: 512
xstate_offset[7]: 1536, xstate_sizes[7]: 1024
xstate_offset[9]: 2560, xstate_sizes[9]: 8

XSAVEC/S use the init optimization, which does not write the data of a
component when that component is in its init state. Which components were
actually written is recorded in the XSTATE_BV bitmap of the XSTATE header.

The kernel requests saving of all enabled components, which results in a
suboptimal write/read pattern when the set of active components is sparse.

A typical scenario is an active set of 0x202 (PKRU + SSE) out of the full
supported set of 0x2FF. That means XSAVEC/S writes and XRSTOR[S] reads:

- SSE in the legacy area (0-511)
- Part of the XSTATE header (512-575)
- PKRU at offset 2560

which is suboptimal. Prefetch works better when the access is linear. But
what's worse is that PKRU can be located in a different page which
obviously affects dTLB.

XSAVEC/S allow the memory footprint to be reduced further when the active
feature set is sparse and the CPU supports XGETBV1. XGETBV1 reports the
in-use state of the XSTATE components as a bitmap. This bitmap can be fed
into XSAVEC/S to request storage of only the active components, which
changes the layout of the state buffer to:

- SSE in the legacy area (0-511)
- Part of the XSTATE header (512-575)
- PKRU at offset 576
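
For illustration (this sketch is not part of the patch), the compacted
offsets are just a running sum over the sizes of the enabled components
starting at 576; the optional 64-byte alignment attribute is ignored here,
which is consistent with the SKL-X offsets listed above. Feeding in 0x2FF
reproduces the table above, while 0x202 moves PKRU down to offset 576:

/*
 * Illustrative user-space sketch, not kernel code. The sizes are the
 * SKL-X values from the changelog above.
 */
#include <stdio.h>

static const unsigned int xstate_size[10] = {
	[2] = 256, [3] = 64, [4] = 64, [5] = 64,
	[6] = 512, [7] = 1024, [9] = 8,
};

static void print_compacted_offsets(unsigned long long rfbm)
{
	unsigned int offset = 576;	/* 512 bytes legacy + 64 bytes XSAVE header */
	int i;

	printf("layout for RFBM 0x%llx:\n", rfbm);
	for (i = 2; i < 10; i++) {
		if (!(rfbm & (1ULL << i)))
			continue;
		printf("  component %d at offset %4u, size %4u\n",
		       i, offset, xstate_size[i]);
		offset += xstate_size[i];
	}
}

int main(void)
{
	print_compacted_offsets(0x2FFULL);	/* full set: PKRU (9) ends up at 2560 */
	print_compacted_offsets(0x202ULL);	/* SSE + PKRU only: PKRU lands at 576 */
	return 0;
}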

This optimization does not gain much for e.g. a kernel build, but for
context-switch-heavy applications it is very visible. Perf stats from
hackbench:

Before:

242,618.89 msec task-clock # 102.928 CPUs utilized ( +- 0.20% )
1,038,988 context-switches # 0.004 M/sec ( +- 0.54% )
460,081 cpu-migrations # 0.002 M/sec ( +- 0.56% )
10,813 page-faults # 0.045 K/sec ( +- 0.62% )
506,912,353,968 cycles # 2.089 GHz ( +- 0.20% )
167,267,811,210 instructions # 0.33 insn per cycle ( +- 0.04% )
34,481,978,727 branches # 142.124 M/sec ( +- 0.04% )
305,975,304 branch-misses # 0.89% of all branches ( +- 0.09% )

2.35717 +- 0.00607 seconds time elapsed ( +- 0.26% )

506,064,738,921 cycles ( +- 0.43% )
3,334,160,871 L1-dcache-load-misses ( +- 0.77% )
135,271,979 dTLB-load-misses ( +- 2.12% )
18,169,634 dTLB-store-misses ( +- 1.78% )

2.3323 +- 0.0117 seconds time elapsed ( +- 0.50% )

After:

222,252.90 msec task-clock # 103.800 CPUs utilized ( +- 0.51% )
1,004,665 context-switches # 0.005 M/sec ( +- 0.42% )
459,123 cpu-migrations # 0.002 M/sec ( +- 0.33% )
10,677 page-faults # 0.048 K/sec ( +- 0.79% )
464,356,465,870 cycles # 2.089 GHz ( +- 0.51% )
166,615,501,152 instructions # 0.36 insn per cycle ( +- 0.05% )
34,355,848,663 branches # 154.580 M/sec ( +- 0.05% )
300,049,704 branch-misses # 0.87% of all branches ( +- 0.14% )

2.1412 +- 0.0117 seconds time elapsed ( +- 0.55% )

473,864,807,936 cycles ( +- 0.64% )
3,198,078,809 L1-dcache-load-misses ( +- 0.24% )
27,798,721 dTLB-load-misses ( +- 2.33% )
4,981,069 dTLB-store-misses ( +- 1.80% )

2.1733 +- 0.0132 seconds time elapsed ( +- 0.61% )

The most significant change is in dTLB misses.

The effect depends on the application scenario, the kernel configuration
and the allocation placement of task_struct, so it might not be noticeable
at all. As the XGETBV1 optimization does not introduce measurable overhead,
it's worth using when the hardware supports it.

Enable it when available with a static key and mask out the non-active
states in the requested bitmap for XSAVEC/S.

Signed-off-by: Thomas Gleixner <[email protected]>
---
arch/x86/kernel/fpu/xstate.c | 10 ++++++++--
arch/x86/kernel/fpu/xstate.h | 16 +++++++++++++---
2 files changed, 21 insertions(+), 5 deletions(-)

--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -86,6 +86,8 @@ static unsigned int xstate_flags[XFEATUR
#define XSTATE_FLAG_SUPERVISOR BIT(0)
#define XSTATE_FLAG_ALIGNED64 BIT(1)

+DEFINE_STATIC_KEY_FALSE(__xsave_use_xgetbv1);
+
/*
* Return whether the system supports a given xfeature.
*
@@ -1481,7 +1483,7 @@ void xfd_validate_state(struct fpstate *
}
#endif /* CONFIG_X86_DEBUG_FPU */

-static int __init xfd_update_static_branch(void)
+static int __init fpu_update_static_branches(void)
{
/*
* If init_fpstate.xfd has bits set then dynamic features are
@@ -1489,9 +1491,13 @@ static int __init xfd_update_static_bran
*/
if (init_fpstate.xfd)
static_branch_enable(&__fpu_state_size_dynamic);
+
+ if (cpu_feature_enabled(X86_FEATURE_XGETBV1) &&
+ cpu_feature_enabled(X86_FEATURE_XCOMPACTED))
+ static_branch_enable(&__xsave_use_xgetbv1);
return 0;
}
-arch_initcall(xfd_update_static_branch)
+arch_initcall(fpu_update_static_branches)

void fpstate_free(struct fpu *fpu)
{
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -10,7 +10,12 @@
DECLARE_PER_CPU(u64, xfd_state);
#endif

-static inline bool xsave_use_xgetbv1(void) { return false; }
+DECLARE_STATIC_KEY_FALSE(__xsave_use_xgetbv1);
+
+static __always_inline __pure bool xsave_use_xgetbv1(void)
+{
+ return static_branch_likely(&__xsave_use_xgetbv1);
+}

static inline void __xstate_init_xcomp_bv(struct xregs_state *xsave, u64 mask)
{
@@ -185,13 +190,18 @@ static inline int __xfd_enable_feature(u
static inline void os_xsave(struct fpstate *fpstate)
{
u64 mask = fpstate->xfeatures;
- u32 lmask = mask;
- u32 hmask = mask >> 32;
+ u32 lmask, hmask;
int err;

WARN_ON_FPU(!alternatives_patched);
xfd_validate_state(fpstate, mask, false);

+ if (xsave_use_xgetbv1())
+ mask &= xgetbv(1);
+
+ lmask = mask;
+ hmask = mask >> 32;
+
XSTATE_XSAVE(&fpstate->regs.xsave, lmask, hmask, err);

/* We should never fault when copying to a kernel buffer: */


2022-04-16 02:40:39

by Dave Hansen

Subject: Re: [patch 3/3] x86/fpu/xsave: Optimize XSAVEC/S when XGETBV1 is supported

On 4/4/22 05:11, Thomas Gleixner wrote:
> A typical scenario is an active set of 0x202 (PKRU + SSE) out of the full
> supported set of 0x2FF. That means XSAVEC/S writes and XRSTOR[S] reads:

It might be worth reminding folks why PKRU is a special snowflake:

The default PKRU enforced by the kernel is its most restrictive possible
value (0xfffffffc). This means that PKRU defaults to being in its
non-init state even for tasks which do nothing protection-keys-related.
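
For reference, PKRU holds two bits per protection key, which is why any
non-zero kernel default pins it in the non-init state for every task; a
tiny decoder (editorial sketch, not from the original mail):

/*
 * Decode a PKRU value: for protection key i, bit 2*i is access-disable
 * (AD) and bit 2*i+1 is write-disable (WD). The init state of PKRU is
 * all zeros, so the restrictive default above keeps it non-init.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t pkru = 0xfffffffc;	/* the default cited above */
	int key;

	for (key = 0; key < 16; key++)
		printf("pkey %2d: AD=%d WD=%d\n", key,
		       (int)((pkru >> (2 * key)) & 1),
		       (int)((pkru >> (2 * key + 1)) & 1));
	return 0;
}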


> which is suboptimal. Prefetch works better when the access is linear. But
> what's worse is that PKRU can be located in a different page which
> obviously affects dTLB.

The numbers don't lie, but I'm still surprised by this. Was this in a
VM that isn't backed with large pages? task_struct.thread.fpu is
kmem_cache_alloc()'d and is in the direct map, which should be 2M/1G
pages almost all the time.
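
One coarse way to double-check the large-page assumption (an editorial
aside, not part of the mail) is to look at the DirectMap counters which
x86 exposes in /proc/meminfo:

/*
 * Print the DirectMap4k/2M/1G counters, which show how much of the
 * kernel direct map is covered by each page size. A direct map
 * dominated by 4k entries would make the dTLB numbers less surprising.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f) {
		perror("/proc/meminfo");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, "DirectMap", 9))
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}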

> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -86,6 +86,8 @@ static unsigned int xstate_flags[XFEATUR
> #define XSTATE_FLAG_SUPERVISOR BIT(0)
> #define XSTATE_FLAG_ALIGNED64 BIT(1)
>
> +DEFINE_STATIC_KEY_FALSE(__xsave_use_xgetbv1);
> +
> /*
> * Return whether the system supports a given xfeature.
> *
> @@ -1481,7 +1483,7 @@ void xfd_validate_state(struct fpstate *
> }
> #endif /* CONFIG_X86_DEBUG_FPU */
>
> -static int __init xfd_update_static_branch(void)
> +static int __init fpu_update_static_branches(void)
> {
> /*
> * If init_fpstate.xfd has bits set then dynamic features are
> @@ -1489,9 +1491,13 @@ static int __init xfd_update_static_bran
> */
> if (init_fpstate.xfd)
> static_branch_enable(&__fpu_state_size_dynamic);
> +
> + if (cpu_feature_enabled(X86_FEATURE_XGETBV1) &&
> + cpu_feature_enabled(X86_FEATURE_XCOMPACTED))
> + static_branch_enable(&__xsave_use_xgetbv1);
> return 0;
> }
> -arch_initcall(xfd_update_static_branch)
> +arch_initcall(fpu_update_static_branches)
>
> void fpstate_free(struct fpu *fpu)
> {
> --- a/arch/x86/kernel/fpu/xstate.h
> +++ b/arch/x86/kernel/fpu/xstate.h
> @@ -10,7 +10,12 @@
> DECLARE_PER_CPU(u64, xfd_state);
> #endif
>
> -static inline bool xsave_use_xgetbv1(void) { return false; }
> +DECLARE_STATIC_KEY_FALSE(__xsave_use_xgetbv1);
> +
> +static __always_inline __pure bool xsave_use_xgetbv1(void)
> +{
> + return static_branch_likely(&__xsave_use_xgetbv1);
> +}
>
> static inline void __xstate_init_xcomp_bv(struct xregs_state *xsave, u64 mask)
> {
> @@ -185,13 +190,18 @@ static inline int __xfd_enable_feature(u
> static inline void os_xsave(struct fpstate *fpstate)
> {
> u64 mask = fpstate->xfeatures;
> - u32 lmask = mask;
> - u32 hmask = mask >> 32;
> + u32 lmask, hmask;
> int err;
>
> WARN_ON_FPU(!alternatives_patched);
> xfd_validate_state(fpstate, mask, false);
>
> + if (xsave_use_xgetbv1())
> + mask &= xgetbv(1);

How about this comment for the masking operation:

/*
* Remove features in their init state from the mask. This
* makes the XSAVE{S,C} writes less sparse and quicker for
* the CPU.
*/

> + lmask = mask;
> + hmask = mask >> 32;
> +
> XSTATE_XSAVE(&fpstate->regs.xsave, lmask, hmask, err);
>
> /* We should never fault when copying to a kernel buffer: */

2022-04-19 16:47:33

by Thomas Gleixner

Subject: Re: [patch 3/3] x86/fpu/xsave: Optimize XSAVEC/S when XGETBV1 is supported

On Thu, Apr 14 2022 at 10:24, Dave Hansen wrote:
> On 4/4/22 05:11, Thomas Gleixner wrote:
>> which is suboptimal. Prefetch works better when the access is linear. But
>> what's worse is that PKRU can be located in a different page which
>> obviously affects dTLB.
>
> The numbers don't lie, but I'm still surprised by this. Was this in a
> VM that isn't backed with large pages? task_struct.thread.fpu is
> kmem_cache_alloc()'d and is in the direct map, which should be 2M/1G
> pages almost all the time.

Hmm. Indeed, that's weird.

That was bare metal and I just checked that this was a production config
and not some weird debug muck which breaks large pages. I'll look deeper
into that.

Thanks,

tglx



2022-04-20 03:03:25

by Thomas Gleixner

Subject: Re: [patch 3/3] x86/fpu/xsave: Optimize XSAVEC/S when XGETBV1 is supported

On Tue, Apr 19 2022 at 15:43, Thomas Gleixner wrote:
> On Thu, Apr 14 2022 at 10:24, Dave Hansen wrote:
>> On 4/4/22 05:11, Thomas Gleixner wrote:
>>> which is suboptimal. Prefetch works better when the access is linear. But
>>> what's worse is that PKRU can be located in a different page which
>>> obviously affects dTLB.
>>
>> The numbers don't lie, but I'm still surprised by this. Was this in a
>> VM that isn't backed with large pages? task_struct.thread.fpu is
>> kmem_cache_alloc()'d and is in the direct map, which should be 2M/1G
>> pages almost all the time.
>
> Hmm. Indeed, that's weird.
>
> That was bare metal and I just checked that this was a production config
> and not some weird debug muck which breaks large pages. I'll look deeper
> into that.

I can't find any reasonable explanation. The pages are definitely large
pages, so yes the dTLB miss count does not make sense, but it's
consistently faster and it's always the dTLB miss count which makes the
big difference according to perf.

For enhanced fun, I ran the lot on an AMD Zen3 machine with the same test
case (hackbench -l 10000) repeated 10 times by perf stat, and there it is
consistently slower than the non-optimized variant. There is at least an
explanation for that: a tight loop of 1 million xgetbv(1) invocations takes
9 million cycles on a SKL-X and 50 million cycles on an AMD Zen3.

XSAVE is wonderful, isn't it?
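
A rough user-space reproduction of that micro-measurement might look like
the sketch below; it is illustrative only and not the test used above:

/*
 * Time 1 million XGETBV(1) executions. This uses raw TSC ticks rather
 * than the perf-measured core cycles quoted above, and assumes XGETBV1
 * support was already verified via CPUID.(EAX=0DH,ECX=1):EAX bit 2.
 */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

static inline uint64_t xgetbv1(void)
{
	uint32_t eax, edx;

	asm volatile("xgetbv" : "=a" (eax), "=d" (edx) : "c" (1));
	return ((uint64_t)edx << 32) | eax;
}

int main(void)
{
	uint64_t start, acc = 0;
	int i;

	start = __rdtsc();
	for (i = 0; i < 1000000; i++)
		acc |= xgetbv1();

	printf("in-use mask 0x%llx, ~%llu TSC ticks per xgetbv(1)\n",
	       (unsigned long long)acc,
	       (unsigned long long)((__rdtsc() - start) / 1000000));
	return 0;
}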

Thanks,

tglx

2022-04-22 17:16:38

by Tom Lendacky

Subject: Re: [patch 3/3] x86/fpu/xsave: Optimize XSAVEC/S when XGETBV1 is supported

On 4/19/22 16:22, Thomas Gleixner wrote:
> On Tue, Apr 19 2022 at 15:43, Thomas Gleixner wrote:
>> On Thu, Apr 14 2022 at 10:24, Dave Hansen wrote:
>>> On 4/4/22 05:11, Thomas Gleixner wrote:
>>>> which is suboptimal. Prefetch works better when the access is linear. But
>>>> what's worse is that PKRU can be located in a different page which
>>>> obviously affects dTLB.
>>>
>>> The numbers don't lie, but I'm still surprised by this. Was this in a
>>> VM that isn't backed with large pages? task_struct.thread.fpu is
>>> kmem_cache_alloc()'d and is in the direct map, which should be 2M/1G
>>> pages almost all the time.
>>
>> Hmm. Indeed, that's weird.
>>
>> That was bare metal and I just checked that this was a production config
>> and not some weird debug muck which breaks large pages. I'll look deeper
>> into that.
>
> I can't find any reasonable explanation. The pages are definitely large
> pages, so yes the dTLB miss count does not make sense, but it's
> consistently faster and it's always the dTLB miss count which makes the
> big difference according to perf.
>
> For enhanced fun, I ran the lot on an AMD Zen3 machine with the same test
> case (hackbench -l 10000) repeated 10 times by perf stat, and there it is
> consistently slower than the non-optimized variant. There is at least an
> explanation for that: a tight loop of 1 million xgetbv(1) invocations takes
> 9 million cycles on a SKL-X and 50 million cycles on an AMD Zen3.

I'll take a look at this and see what I find. It might be interesting to
see whether the actual XSAVES is slower or quicker, too, based on the input
mask.

If the performance slowdown shows up in real-world benchmarks, we might
want to consider not using the xgetbv() call on AMD.
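
Since XSAVES itself is supervisor-only, a user-space approximation of that
comparison could time XSAVEC with different requested-feature bitmaps. The
sketch below is illustrative only; the masks, buffer size and iteration
count are placeholders rather than anything from this thread:

/*
 * Compare the save cost for a full vs. a sparse requested-feature
 * bitmap, with user-mode XSAVEC as a stand-in for XSAVES.
 * Build with: gcc -O2 -mxsave -mxsavec
 * A robust test would size the buffer from CPUID.(EAX=0DH,ECX=0):EBX
 * and pin the task to one CPU.
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <x86intrin.h>

static unsigned char buf[16384] __attribute__((aligned(64)));

static uint64_t time_xsavec(uint64_t rfbm, int iters)
{
	uint64_t start = __rdtsc();
	int i;

	for (i = 0; i < iters; i++)
		_xsavec(buf, rfbm);
	return (__rdtsc() - start) / iters;
}

int main(void)
{
	memset(buf, 0, sizeof(buf));

	/* The CPU ANDs RFBM with XCR0, so ~0 requests all enabled components. */
	printf("full mask:   ~%llu ticks/XSAVEC\n",
	       (unsigned long long)time_xsavec(~0ULL, 1000000));
	printf("FP|SSE only: ~%llu ticks/XSAVEC\n",
	       (unsigned long long)time_xsavec(0x3ULL, 1000000));
	return 0;
}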

Thanks,
Tom

>
> XSAVE is wonderful, isn't it?
>
> Thanks,
>
> tglx

2022-04-22 22:29:58

by Thomas Gleixner

Subject: Re: [patch 3/3] x86/fpu/xsave: Optimize XSAVEC/S when XGETBV1 is supported

On Wed, Apr 20 2022 at 13:15, Tom Lendacky wrote:
> On 4/19/22 16:22, Thomas Gleixner wrote:
>>> That was bare metal and I just checked that this was a production config
>>> and not some weird debug muck which breaks large pages. I'll look deeper
>>> into that.
>>
>> I can't find any reasonable explanation. The pages are definitely large
>> pages, so yes the dTLB miss count does not make sense, but it's
>> consistently faster and it's always the dTLB miss count which makes the
>> big difference according to perf.
>>
>> For enhanced fun, I ran the lot on an AMD Zen3 machine with the same test
>> case (hackbench -l 10000) repeated 10 times by perf stat, and there it is
>> consistently slower than the non-optimized variant. There is at least an
>> explanation for that: a tight loop of 1 million xgetbv(1) invocations takes
>> 9 million cycles on a SKL-X and 50 million cycles on an AMD Zen3.
>
> I'll take a look into this and see what I find. Might be interesting to
> see if the actual XSAVES is slower or quicker, too, based on the input mask.
>
> If the performance slowdown shows up in real world benchmarks, we might
> want to consider not using the xgetbv() call on AMD.

As things stand now, I'm not going to pursue this further at the moment.

The effect on SKL-X is not explainable; especially the dTLB miss count
decrease does not make any sense. Aside from that, I just figured out that
it is very sensitive to the kernel configuration and I have no idea yet
which exact screw to turn to make the effect come and go.

So I just go and add the XSAVEC support alone as that's actually
something which _is_ beneficial for guests.

Thanks,

tglx

2022-04-23 21:29:42

by Dave Hansen

Subject: Re: [patch 3/3] x86/fpu/xsave: Optimize XSAVEC/S when XGETBV1 is supported

On 4/22/22 12:30, Thomas Gleixner wrote:
> So I just go and add the XSAVEC support alone as that's actually
> something which _is_ beneficial for guests.

Yeah, agreed.

When I went to test these patches, a tight loop with XSAVEC was ~10%
faster than XSAVEOPT. This system has XSAVES, so it wouldn't have been
using XSAVEOPT in the kernel in the first place. But this is at least
one more data point in favor of XSAVEC.