* Yi Sun <[email protected]> wrote:
> @@ -113,7 +116,7 @@ static inline u64 xfeatures_mask_independent(void)
> * original instruction which gets replaced. We need to use it here as the
> * address of the instruction where we might get an exception at.
> */
> -#define XSTATE_XSAVE(st, lmask, hmask, err) \
> +#define __XSTATE_XSAVE(st, lmask, hmask, err) \
> asm volatile(ALTERNATIVE_3(XSAVE, \
> XSAVEOPT, X86_FEATURE_XSAVEOPT, \
> XSAVEC, X86_FEATURE_XSAVEC, \
> @@ -130,7 +133,7 @@ static inline u64 xfeatures_mask_independent(void)
> * Use XRSTORS to restore context if it is enabled. XRSTORS supports compact
> * XSAVE area format.
> */
> -#define XSTATE_XRESTORE(st, lmask, hmask) \
> +#define __XSTATE_XRESTORE(st, lmask, hmask) \
> asm volatile(ALTERNATIVE(XRSTOR, \
> XRSTORS, X86_FEATURE_XSAVES) \
> "\n" \
> @@ -140,6 +143,35 @@ static inline u64 xfeatures_mask_independent(void)
> : "D" (st), "m" (*st), "a" (lmask), "d" (hmask) \
> : "memory")
>
> +#if defined(CONFIG_X86_DEBUG_FPU)
> +#define XSTATE_XSAVE(fps, lmask, hmask, err) \
> + do { \
> + struct fpstate *f = fps; \
> + u64 tc = -1; \
> + if (tracepoint_enabled(x86_fpu_latency_xsave)) \
> + tc = trace_clock(); \
> + __XSTATE_XSAVE(&f->regs.xsave, lmask, hmask, err); \
> + if (tracepoint_enabled(x86_fpu_latency_xsave)) \
> + trace_x86_fpu_latency_xsave(f, trace_clock() - tc);\
> + } while (0)
> +
> +#define XSTATE_XRESTORE(fps, lmask, hmask) \
> + do { \
> + struct fpstate *f = fps; \
> + u64 tc = -1; \
> + if (tracepoint_enabled(x86_fpu_latency_xrstor)) \
> + tc = trace_clock(); \
> + __XSTATE_XRESTORE(&f->regs.xsave, lmask, hmask); \
> + if (tracepoint_enabled(x86_fpu_latency_xrstor)) \
> + trace_x86_fpu_latency_xrstor(f, trace_clock() - tc);\
This v7 version does not adequately address the review feedback I gave for
v6: it adds tracing overhead to potential hot paths, and putting it behind
CONFIG_X86_DEBUG_FPU is not a solution either: it's default-y, so de-facto
enabled on all major distros...
It seems unnecessarily complex: why does it have to measure latency
directly? Tracepoints *by default* come with event timestamps. A latency
measurement tool should be able to subtract two timestamps to extract the
latency between two tracepoints...
In fact, function tracing is enabled on all major Linux distros:
kepler:~/tip> grep FUNCTION_TRACER /boot/config-6.2.0-33-generic
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_FUNCTION_TRACER=y
Why not just enable function tracing for the affected FPU context switching
functions?
The relevant functions are already standalone in a typical kernel:
xsave: # ffffffff8103cfe0 T save_fpregs_to_fpstate
xrstor: # ffffffff8103d160 T restore_fpregs_from_fpstate
xrstor_supervisor: # ffffffff8103dc50 T fpu__clear_user_states
... and FPU context switching overhead dominates the cost of these
functions.
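For example, something like this (a sketch, assuming tracefs is mounted at
/sys/kernel/tracing and run as root; the function names are the ones listed
above):

```shell
# Trace only the FPU save/restore functions with the function_graph
# tracer, which reports a per-call duration in the DURATION column.
cd /sys/kernel/tracing
echo save_fpregs_to_fpstate restore_fpregs_from_fpstate \
     fpu__clear_user_states > set_graph_function
echo function_graph > current_tracer
echo 1 > tracing_on
sleep 1
echo 0 > tracing_on
head -40 trace
# Reset when done:
echo nop > current_tracer
echo > set_graph_function
```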
Thanks,
Ingo
On 21.09.2023 09:03, Ingo Molnar wrote:
>
>* Yi Sun <[email protected]> wrote:
>
>> -#define XSTATE_XRESTORE(st, lmask, hmask) \
>> +#define __XSTATE_XRESTORE(st, lmask, hmask) \
>> asm volatile(ALTERNATIVE(XRSTOR, \
>> XRSTORS, X86_FEATURE_XSAVES) \
>> "\n" \
>> @@ -140,6 +143,35 @@ static inline u64 xfeatures_mask_independent(void)
>> : "D" (st), "m" (*st), "a" (lmask), "d" (hmask) \
>> : "memory")
>>
>> +#if defined(CONFIG_X86_DEBUG_FPU)
>> +#define XSTATE_XSAVE(fps, lmask, hmask, err) \
>> + do { \
>> + struct fpstate *f = fps; \
>> + u64 tc = -1; \
>> + if (tracepoint_enabled(x86_fpu_latency_xsave)) \
>> + tc = trace_clock(); \
>> + __XSTATE_XSAVE(&f->regs.xsave, lmask, hmask, err); \
>> + if (tracepoint_enabled(x86_fpu_latency_xsave)) \
>> + trace_x86_fpu_latency_xsave(f, trace_clock() - tc);\
>> + } while (0)
>> +
>> +#define XSTATE_XRESTORE(fps, lmask, hmask) \
>> + do { \
>> + struct fpstate *f = fps; \
>> + u64 tc = -1; \
>> + if (tracepoint_enabled(x86_fpu_latency_xrstor)) \
>> + tc = trace_clock(); \
>> + __XSTATE_XRESTORE(&f->regs.xsave, lmask, hmask); \
>> + if (tracepoint_enabled(x86_fpu_latency_xrstor)) \
>> + trace_x86_fpu_latency_xrstor(f, trace_clock() - tc);\
>
>This v7 version does not adequately address the review feedback I gave for
>v6: it adds tracing overhead to potential hot paths, and putting it behind
>CONFIG_X86_DEBUG_FPU is not a solution either: it's default-y, so de-facto
>enabled on all major distros...
Hi Ingo,
Appreciate your comments!
What do you think about adding a new config option here instead of
CONFIG_X86_DEBUG_FPU, with a default value of n? That way, all of the
overhead would be removed from the hot paths:
```
#else
#define XSTATE_XSAVE(fps, lmask, hmask, err) \
	__XSTATE_XSAVE(&(fps)->regs.xsave, lmask, hmask, err)
#define XSTATE_XRESTORE(fps, lmask, hmask) \
	__XSTATE_XRESTORE(&(fps)->regs.xsave, lmask, hmask)
#endif
```
>
>It seems unnecessarily complex: why does it have to measure latency
>directly? Tracepoints *by default* come with event timestamps. A latency
>measurement tool should be able to subtract two timestamps to extract the
>latency between two tracepoints...
>
>In fact, function tracing is enabled on all major Linux distros:
>
> kepler:~/tip> grep FUNCTION_TRACER /boot/config-6.2.0-33-generic
> CONFIG_HAVE_FUNCTION_TRACER=y
> CONFIG_FUNCTION_TRACER=y
>
>Why not just enable function tracing for the affected FPU context switching
>functions?
>
>The relevant functions are already standalone in a typical kernel:
>
> xsave: # ffffffff8103cfe0 T save_fpregs_to_fpstate
> xrstor: # ffffffff8103d160 T restore_fpregs_from_fpstate
> xrstor_supervisor: # ffffffff8103dc50 T fpu__clear_user_states
>
>... and FPU context switching overhead dominates the cost of these
>functions.
>
I am aware of these tracing facilities, but we added new tracepoints for
two reasons:
1. As described in the patch comments, it is more accurate to calculate
the latency within a single trace event, which does NOT include the
latency of the trace functions themselves. In our experiments, the cost
of the trace functions is much higher than the latency of XSAVES itself.
Calculating latency by subtracting two timestamps necessarily includes
the overhead of the tracepoints, which would make the actual latency
data less prominent.
2. We want to measure at a finer granularity. The newly added tracepoints
dump XINUSE and RFBM each time they are called. By doing so, we can
collect the latency and analyze the data grouped by each FPU state
component indicated in the XCR0 bits.
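The overhead argument in reason 1 can be sketched numerically (all costs
below are hypothetical, chosen purely for illustration, not measured
values):

```python
# Hypothetical costs, illustrating why measuring inside a single trace
# event gives a cleaner number than subtracting two event timestamps.
XSAVES_NS = 100       # assumed true latency of the instruction
TRACEPOINT_NS = 250   # assumed per-tracepoint overhead (entry + logging)
CLOCK_READ_NS = 20    # assumed cost of one trace_clock() read

# Two-timestamp method: the interval between an entry event and an exit
# event inevitably includes the cost of firing the events themselves.
two_timestamp = XSAVES_NS + TRACEPOINT_NS

# Single-event method: trace_clock() is sampled right around the
# instruction, so only the clock read pollutes the measurement.
single_event = XSAVES_NS + CLOCK_READ_NS

print(two_timestamp, single_event)  # the first overshoots far more
```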
Thanks
--Sun, Yi
> It seems unnecessarily complex: why does it have to measure latency
> directly? Tracepoints *by default* come with event timestamps. A latency
> measurement tool should be able to subtract two timestamps to extract the
> latency between two tracepoints...
>
> In fact, function tracing is enabled on all major Linux distros:
>
> kepler:~/tip> grep FUNCTION_TRACER /boot/config-6.2.0-33-generic
> CONFIG_HAVE_FUNCTION_TRACER=y
> CONFIG_FUNCTION_TRACER=y
>
> Why not just enable function tracing for the affected FPU context switching
> functions?
Or use PT address filters to get it even more accurately, as described
in [1]. In any case I agree the tracepoints are not needed.
-Andi
[1] https://lore.kernel.org/lkml/ZPOIVmC6aY9GBtdJ@tassilo/
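A rough sketch of the PT approach (assumes a CPU and perf build with Intel
PT support, run as root; the function name is from the thread above):

```shell
# Use an Intel PT address filter so only the FPU save path is traced,
# then decode the trace with timestamps to extract per-call latency.
perf record -e intel_pt// --filter 'filter save_fpregs_to_fpstate' \
    -a -- sleep 1
perf script --call-trace
```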
On 21.09.2023 09:52, Andi Kleen wrote:
>> It seems unnecessarily complex: why does it have to measure latency
>> directly? Tracepoints *by default* come with event timestamps. A latency
>> measurement tool should be able to subtract two timestamps to extract the
>> latency between two tracepoints...
>>
>> In fact, function tracing is enabled on all major Linux distros:
>>
>> kepler:~/tip> grep FUNCTION_TRACER /boot/config-6.2.0-33-generic
>> CONFIG_HAVE_FUNCTION_TRACER=y
>> CONFIG_FUNCTION_TRACER=y
>>
>> Why not just enable function tracing for the affected FPU context switching
>> functions?
>
>Or use PT address filters to get it even more accurately, as described
>in [1]. In any case I agree the tracepoints are not needed.
>
Hi Andi,
Here, we have implemented latency measurement for all x86 platforms, not
just Intel but also other vendors such as AMD. This allows us to obtain
comparable data across vendors.
Thanks
--Sun, Yi