2020-03-20 04:14:13

by Kyung Min Park

Subject: [PATCH v2 2/2] x86/delay: Introduce TPAUSE delay

TPAUSE instructs the processor to enter an implementation-dependent
optimized state. The instruction execution wakes up when the time-stamp
counter reaches or exceeds the implicit EDX:EAX 64-bit input value.
The instruction execution also wakes up due to the expiration of
the operating system time-limit or by an external interrupt
or exceptions such as a debug exception or a machine check exception.

TPAUSE offers a choice of two lower power states:
1. Light-weight power/performance optimized state C0.1
2. Improved power/performance optimized state C0.2
This way, it can save power with low wake-up latency in comparison to
spinloop based delay. The selection between the two is governed by the
input register.

TPAUSE is available on processors with X86_FEATURE_WAITPKG.

Reviewed-by: Tony Luck <[email protected]>
Co-developed-by: Fenghua Yu <[email protected]>
Signed-off-by: Fenghua Yu <[email protected]>
Signed-off-by: Kyung Min Park <[email protected]>
---
arch/x86/include/asm/mwait.h | 17 +++++++++++++++++
arch/x86/lib/delay.c | 27 ++++++++++++++++++++++++++-
2 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
index aaf6643..fd59db0 100644
--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -22,6 +22,8 @@
#define MWAITX_ECX_TIMER_ENABLE BIT(1)
#define MWAITX_MAX_WAIT_CYCLES UINT_MAX
#define MWAITX_DISABLE_CSTATES 0xf0
+#define TPAUSE_C01_STATE 1
+#define TPAUSE_C02_STATE 0

static inline void __monitor(const void *eax, unsigned long ecx,
unsigned long edx)
@@ -120,4 +122,19 @@ static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
current_clr_polling();
}

+/*
+ * Caller can specify whether to enter C0.1 (low latency, less
+ * power saving) or C0.2 state (saves more power, but longer wakeup
+ * latency). This may be overridden by the IA32_UMWAIT_CONTROL MSR
+ * which can force requests for C0.2 to be downgraded to C0.1.
+ */
+static inline void __tpause(unsigned int ecx, unsigned int edx,
+			    unsigned int eax)
+{
+	/* "tpause %ecx, %edx, %eax;" */
+	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf1\t\n"
+		     :
+		     : "c"(ecx), "d"(edx), "a"(eax));
+}
+
#endif /* _ASM_X86_MWAIT_H */
diff --git a/arch/x86/lib/delay.c b/arch/x86/lib/delay.c
index e6db855..5f11f0a 100644
--- a/arch/x86/lib/delay.c
+++ b/arch/x86/lib/delay.c
@@ -97,6 +97,27 @@ static void delay_tsc(u64 cycles)
}

/*
+ * On Intel the TPAUSE instruction waits until any of:
+ * 1) the TSC counter exceeds the value provided in EDX:EAX
+ * 2) global timeout in IA32_UMWAIT_CONTROL is exceeded
+ * 3) an external interrupt occurs
+ */
+static void delay_halt_tpause(u64 start, u64 cycles)
+{
+	u64 until = start + cycles;
+	unsigned int eax, edx;
+
+	eax = (unsigned int)(until & 0xffffffff);
+	edx = (unsigned int)(until >> 32);
+
+	/*
+	 * Hard code the deeper (C0.2) sleep state because exit latency is
+	 * small compared to the "microseconds" that usleep() will delay.
+	 */
+	__tpause(TPAUSE_C02_STATE, edx, eax);
+}
+
+/*
* On some AMD platforms, MWAITX has a configurable 32-bit timer, that
* counts with TSC frequency. The input value is the number of TSC cycles
* to wait. MWAITX will also exit when the timer expires.
@@ -152,8 +173,12 @@ static void delay_halt(u64 __cycles)

 void use_tsc_delay(void)
 {
-	if (delay_fn == delay_loop)
+	if (static_cpu_has(X86_FEATURE_WAITPKG)) {
+		delay_halt_fn = delay_halt_tpause;
+		delay_fn = delay_halt;
+	} else if (delay_fn == delay_loop) {
 		delay_fn = delay_tsc;
+	}
 }

void use_mwaitx_delay(void)
--
2.7.4
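
Because TPAUSE can return before the requested deadline (OS time-limit, interrupt), the caller has to re-check the clock and re-arm the wait. The following is only a userspace sketch of that retry pattern, not the kernel code: fake_tsc, FAKE_OS_LIMIT, fake_tpause() and delay_halt_model() are hypothetical names, and the "clock" only advances when the fake wait runs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the TSC; it only moves when we "wait". */
static uint64_t fake_tsc;

/* Hypothetical cap modelling the IA32_UMWAIT_CONTROL time limit. */
#define FAKE_OS_LIMIT 1000u

/* Counts how often the fake wait primitive returned. */
static unsigned int wakeups;

/*
 * Models a TPAUSE-like wait: returns when the deadline (start + cycles)
 * is reached or when the OS limit expires, whichever comes first.
 */
static void fake_tpause(uint64_t start, uint64_t cycles)
{
	uint64_t until = start + cycles;
	uint64_t limit = fake_tsc + FAKE_OS_LIMIT;

	if (until <= fake_tsc)
		return;		/* deadline already passed */
	fake_tsc = until < limit ? until : limit;
}

/* Re-arm the wait until the whole delay has elapsed. */
static void delay_halt_model(uint64_t cycles)
{
	uint64_t start, end;

	if (!cycles)
		return;

	start = fake_tsc;
	for (;;) {
		fake_tpause(start, cycles);
		wakeups++;

		end = fake_tsc;
		assert(end >= start);	/* the model clock is monotonic */
		if (cycles <= end - start)
			break;		/* waited long enough */

		/* Woke up early: wait only for the remainder. */
		cycles -= end - start;
		start = end;
	}
}
```

With the limit of 1000 used here, delay_halt_model(2500) starting from fake_tsc == 0 needs three wake-ups (1000 + 1000 + 500 cycles) before it returns.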


2020-03-20 04:24:48

by Andy Lutomirski

Subject: Re: [PATCH v2 2/2] x86/delay: Introduce TPAUSE delay

On Thu, Mar 19, 2020 at 9:13 PM Kyung Min Park <[email protected]> wrote:
>
> TPAUSE instructs the processor to enter an implementation-dependent
> optimized state. The instruction execution wakes up when the time-stamp
> counter reaches or exceeds the implicit EDX:EAX 64-bit input value.
> The instruction execution also wakes up due to the expiration of
> the operating system time-limit or by an external interrupt
> or exceptions such as a debug exception or a machine check exception.
>
> TPAUSE offers a choice of two lower power states:
> 1. Light-weight power/performance optimized state C0.1
> 2. Improved power/performance optimized state C0.2
> This way, it can save power with low wake-up latency in comparison to
> spinloop based delay. The selection between the two is governed by the
> input register.
>
> TPAUSE is available on processors with X86_FEATURE_WAITPKG.
>
> Reviewed-by: Tony Luck <[email protected]>
> Co-developed-by: Fenghua Yu <[email protected]>
> Signed-off-by: Fenghua Yu <[email protected]>
> Signed-off-by: Kyung Min Park <[email protected]>
> ---
> arch/x86/include/asm/mwait.h | 17 +++++++++++++++++
> arch/x86/lib/delay.c | 27 ++++++++++++++++++++++++++-
> 2 files changed, 43 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
> index aaf6643..fd59db0 100644
> --- a/arch/x86/include/asm/mwait.h
> +++ b/arch/x86/include/asm/mwait.h
> @@ -22,6 +22,8 @@
> #define MWAITX_ECX_TIMER_ENABLE BIT(1)
> #define MWAITX_MAX_WAIT_CYCLES UINT_MAX
> #define MWAITX_DISABLE_CSTATES 0xf0
> +#define TPAUSE_C01_STATE 1
> +#define TPAUSE_C02_STATE 0
>
> static inline void __monitor(const void *eax, unsigned long ecx,
> unsigned long edx)
> @@ -120,4 +122,19 @@ static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
> current_clr_polling();
> }
>
> +/*
> + * Caller can specify whether to enter C0.1 (low latency, less
> + * power saving) or C0.2 state (saves more power, but longer wakeup
> + * latency). This may be overridden by the IA32_UMWAIT_CONTROL MSR
> + * which can force requests for C0.2 to be downgraded to C0.1.
> + */
> +static inline void __tpause(unsigned int ecx, unsigned int edx,
> + unsigned int eax)
> +{
> + /* "tpause %ecx, %edx, %eax;" */
> + asm volatile(".byte 0x66, 0x0f, 0xae, 0xf1\t\n"
> + :
> + : "c"(ecx), "d"(edx), "a"(eax));
> +}
> +
> #endif /* _ASM_X86_MWAIT_H */
> diff --git a/arch/x86/lib/delay.c b/arch/x86/lib/delay.c
> index e6db855..5f11f0a 100644
> --- a/arch/x86/lib/delay.c
> +++ b/arch/x86/lib/delay.c
> @@ -97,6 +97,27 @@ static void delay_tsc(u64 cycles)
> }
>
> /*
> + * On Intel the TPAUSE instruction waits until any of:
> + * 1) the TSC counter exceeds the value provided in EAX:EDX
> + * 2) global timeout in IA32_UMWAIT_CONTROL is exceeded
> + * 3) an external interrupt occurs
> + */
> +static void delay_halt_tpause(u64 start, u64 cycles)
> +{
> + u64 until = start + cycles;
> + unsigned int eax, edx;
> +
> + eax = (unsigned int)(until & 0xffffffff);
> + edx = (unsigned int)(until >> 32);
> +
> + /*
> + * Hard code the deeper (C0.2) sleep state because exit latency is
> + * small compared to the "microseconds" that usleep() will delay.
> + */
> + __tpause(TPAUSE_C02_STATE, edx, eax);
> +}
> +
> +/*
> * On some AMD platforms, MWAITX has a configurable 32-bit timer, that
> * counts with TSC frequency. The input value is the number of TSC cycles
> * to wait. MWAITX will also exit when the timer expires.
> @@ -152,8 +173,12 @@ static void delay_halt(u64 __cycles)
>
> void use_tsc_delay(void)
> {
> - if (delay_fn == delay_loop)
> + if (static_cpu_has(X86_FEATURE_WAITPKG)) {
> + delay_halt_fn = delay_halt_tpause;
> + delay_fn = delay_halt;
> + } else if (delay_fn == delay_loop) {
> delay_fn = delay_tsc;
> + }
> }

This is an odd way to dispatch: you're using static_cpu_has(), but
you're using it once to populate a function pointer. Why not just put
the static_cpu_has() directly into delay_halt() and open-code the
three variants? That will also make it a lot easier to understand the
oddity with start and cycles.

--Andy

> --
> 2.7.4
>
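
For illustration, the open-coded dispatch suggested above might look roughly like this when modelled in userspace. This is a sketch under assumptions: struct cpu_model, enum wait_kind and pick_wait() are hypothetical, and the two booleans merely stand in for static_cpu_has() checks on X86_FEATURE_WAITPKG and X86_FEATURE_MWAITX.

```c
#include <assert.h>
#include <stdbool.h>

/* The three variants discussed: TPAUSE, MWAITX, or plain TSC polling. */
enum wait_kind { WAIT_TSC_SPIN, WAIT_MWAITX, WAIT_TPAUSE };

/* Hypothetical model of the CPU feature bits. */
struct cpu_model {
	bool has_waitpkg;	/* stands in for X86_FEATURE_WAITPKG */
	bool has_mwaitx;	/* stands in for X86_FEATURE_MWAITX */
};

/*
 * Open-coded dispatch: the feature is tested where the wait happens,
 * instead of being tested once to fill in a function pointer.
 */
static enum wait_kind pick_wait(const struct cpu_model *cpu)
{
	assert(cpu);

	if (cpu->has_waitpkg)
		return WAIT_TPAUSE;
	if (cpu->has_mwaitx)
		return WAIT_MWAITX;
	return WAIT_TSC_SPIN;	/* neither feature available */
}
```

In the kernel the branches would be patched at boot by static_cpu_has(), so no indirect call is needed on the delay path.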

2020-03-20 10:00:48

by Thomas Gleixner

Subject: Re: [PATCH v2 2/2] x86/delay: Introduce TPAUSE delay

Andy Lutomirski <[email protected]> writes:
> On Thu, Mar 19, 2020 at 9:13 PM Kyung Min Park <[email protected]> wrote:
>> void use_tsc_delay(void)
>> {
>> - if (delay_fn == delay_loop)
>> + if (static_cpu_has(X86_FEATURE_WAITPKG)) {
>> + delay_halt_fn = delay_halt_tpause;
>> + delay_fn = delay_halt;
>> + } else if (delay_fn == delay_loop) {
>> delay_fn = delay_tsc;
>> + }
>> }
>
> This is an odd way to dispatch: you're using static_cpu_has(), but
> you're using it once to populate a function pointer. Why not just put
> the static_cpu_has() directly into delay_halt() and open-code the
> three variants?

Two: mwaitx and tpause.

> That will also make it a lot easier to understand the oddity with
> start and cycles.

Indeed. That makes sense. Should have thought about it :)

Thanks,

tglx

2020-03-20 10:09:55

by Joe Perches

Subject: Re: [PATCH v2 2/2] x86/delay: Introduce TPAUSE delay

On Thu, 2020-03-19 at 21:13 -0700, Kyung Min Park wrote:
> TPAUSE instructs the processor to enter an implementation-dependent
> optimized state. The instruction execution wakes up when the time-stamp
> counter reaches or exceeds the implicit EDX:EAX 64-bit input value.
> The instruction execution also wakes up due to the expiration of
> the operating system time-limit or by an external interrupt
> or exceptions such as a debug exception or a machine check exception.
[]
> diff --git a/arch/x86/lib/delay.c b/arch/x86/lib/delay.c
[]
> @@ -97,6 +97,27 @@ static void delay_tsc(u64 cycles)
> }
>
> /*
> + * On Intel the TPAUSE instruction waits until any of:
> + * 1) the TSC counter exceeds the value provided in EAX:EDX
> + * 2) global timeout in IA32_UMWAIT_CONTROL is exceeded
> + * 3) an external interrupt occurs
> + */
> +static void delay_halt_tpause(u64 start, u64 cycles)
> +{
> + u64 until = start + cycles;
> + unsigned int eax, edx;
> +
> + eax = (unsigned int)(until & 0xffffffff);
> + edx = (unsigned int)(until >> 32);

trivia:

perhaps lower_32_bits and upper_32_bits
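
A minimal userspace sketch of what that would look like. The two macros below are stand-ins modelled on the kernel helpers of the same name, and struct tpause_deadline and split_deadline() are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins modelled on the kernel helpers of the same name. */
#define lower_32_bits(n) ((uint32_t)((n) & 0xffffffff))
#define upper_32_bits(n) ((uint32_t)(((n) >> 16) >> 16))

/* Hypothetical holder for the two register halves of the deadline. */
struct tpause_deadline {
	uint32_t eax;	/* low 32 bits of the TSC deadline */
	uint32_t edx;	/* high 32 bits of the TSC deadline */
};

/* Split the 64-bit deadline (start + cycles) into its EDX:EAX halves. */
static struct tpause_deadline split_deadline(uint64_t start, uint64_t cycles)
{
	uint64_t until = start + cycles;
	struct tpause_deadline d = {
		.eax = lower_32_bits(until),
		.edx = upper_32_bits(until),
	};

	/* Recombining the halves must give back the deadline. */
	assert((((uint64_t)d.edx << 32) | d.eax) == until);
	return d;
}
```

The helpers replace the explicit mask, shift and casts, so the intent (low half in EAX, high half in EDX) is visible at a glance.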


2020-03-20 21:53:14

by Andy Lutomirski

Subject: Re: [PATCH v2 2/2] x86/delay: Introduce TPAUSE delay

On Fri, Mar 20, 2020 at 3:00 AM Thomas Gleixner <[email protected]> wrote:
>
> Andy Lutomirski <[email protected]> writes:
> > On Thu, Mar 19, 2020 at 9:13 PM Kyung Min Park <[email protected]> wrote:
> >> void use_tsc_delay(void)
> >> {
> >> - if (delay_fn == delay_loop)
> >> + if (static_cpu_has(X86_FEATURE_WAITPKG)) {
> >> + delay_halt_fn = delay_halt_tpause;
> >> + delay_fn = delay_halt;
> >> + } else if (delay_fn == delay_loop) {
> >> delay_fn = delay_tsc;
> >> + }
> >> }
> >
> > This is an odd way to dispatch: you're using static_cpu_has(), but
> > you're using it once to populate a function pointer. Why not just put
> > the static_cpu_has() directly into delay_halt() and open-code the
> > three variants?
>
> Two: mwaitx and tpause.

I was imagining there would also be a variant for systems with neither feature.

2020-03-20 23:24:37

by Thomas Gleixner

Subject: Re: [PATCH v2 2/2] x86/delay: Introduce TPAUSE delay

Andy Lutomirski <[email protected]> writes:

> On Fri, Mar 20, 2020 at 3:00 AM Thomas Gleixner <[email protected]> wrote:
>>
>> Andy Lutomirski <[email protected]> writes:
>> > On Thu, Mar 19, 2020 at 9:13 PM Kyung Min Park <[email protected]> wrote:
>> >> void use_tsc_delay(void)
>> >> {
>> >> - if (delay_fn == delay_loop)
>> >> + if (static_cpu_has(X86_FEATURE_WAITPKG)) {
>> >> + delay_halt_fn = delay_halt_tpause;
>> >> + delay_fn = delay_halt;
>> >> + } else if (delay_fn == delay_loop) {
>> >> delay_fn = delay_tsc;
>> >> + }
>> >> }
>> >
>> > This is an odd way to dispatch: you're using static_cpu_has(), but
>> > you're using it once to populate a function pointer. Why not just put
>> > the static_cpu_has() directly into delay_halt() and open-code the
>> > three variants?
>>
>> Two: mwaitx and tpause.
>
> I was imagining there would also be a variant for systems with neither feature.

Oh I see, you want to get rid of both function pointers. That's tricky.

The boot time function is delay_loop() which is using the magic (1 << 12)
boot time value until calibration in one way or the other happens and
something calls use_tsc_delay() or use_mwaitx_delay(). Yes, that's all
horrible but X86_FEATURE_TSC is unusable for this.

Let me think about it.

Thanks,

tglx


2020-03-20 23:58:13

by Andy Lutomirski

Subject: Re: [PATCH v2 2/2] x86/delay: Introduce TPAUSE delay

On Fri, Mar 20, 2020 at 4:23 PM Thomas Gleixner <[email protected]> wrote:
>
> Andy Lutomirski <[email protected]> writes:
>
> > On Fri, Mar 20, 2020 at 3:00 AM Thomas Gleixner <[email protected]> wrote:
> >>
> >> Andy Lutomirski <[email protected]> writes:
> >> > On Thu, Mar 19, 2020 at 9:13 PM Kyung Min Park <[email protected]> wrote:
> >> >> void use_tsc_delay(void)
> >> >> {
> >> >> - if (delay_fn == delay_loop)
> >> >> + if (static_cpu_has(X86_FEATURE_WAITPKG)) {
> >> >> + delay_halt_fn = delay_halt_tpause;
> >> >> + delay_fn = delay_halt;
> >> >> + } else if (delay_fn == delay_loop) {
> >> >> delay_fn = delay_tsc;
> >> >> + }
> >> >> }
> >> >
> >> > This is an odd way to dispatch: you're using static_cpu_has(), but
> >> > you're using it once to populate a function pointer. Why not just put
> >> > the static_cpu_has() directly into delay_halt() and open-code the
> >> > three variants?
> >>
> >> Two: mwaitx and tpause.
> >
> > I was imagining there would also be a variant for systems with neither feature.
>
> Oh I see, you want to get rid of both function pointers. That's tricky.
>
> The boot time function is delay_loop() which is using the magic (1 << 12)
> boot time value until calibration in one way or the other happens and
> something calls use_tsc_delay() or use_mwaitx_delay(). Yes, that's all
> horrible but X86_FEATURE_TSC is unusable for this.
>
> Let me think about it.

This is definitely not worth overoptimizing. It's a *delay* function
-- the retpoline isn't going to kill us :)

>
> Thanks,
>
> tglx
>

2020-03-23 05:19:56

by Kyung Min Park

Subject: RE: [PATCH v2 2/2] x86/delay: Introduce TPAUSE delay

Hi Joe,

> -----Original Message-----
> From: Joe Perches <[email protected]>
> Sent: Friday, March 20, 2020 3:07 AM
> To: Park, Kyung Min <[email protected]>; [email protected]; linux-
> [email protected]
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; Luck, Tony
> <[email protected]>; Raj, Ashok <[email protected]>; Shankar, Ravi V
> <[email protected]>; Yu, Fenghua <[email protected]>
> Subject: Re: [PATCH v2 2/2] x86/delay: Introduce TPAUSE delay
>
> On Thu, 2020-03-19 at 21:13 -0700, Kyung Min Park wrote:
> > TPAUSE instructs the processor to enter an implementation-dependent
> > optimized state. The instruction execution wakes up when the
> > time-stamp counter reaches or exceeds the implicit EDX:EAX 64-bit input value.
> > The instruction execution also wakes up due to the expiration of the
> > operating system time-limit or by an external interrupt or exceptions
> > such as a debug exception or a machine check exception.
> []
> > diff --git a/arch/x86/lib/delay.c b/arch/x86/lib/delay.c
> []
> > @@ -97,6 +97,27 @@ static void delay_tsc(u64 cycles) }
> >
> > /*
> > + * On Intel the TPAUSE instruction waits until any of:
> > + * 1) the TSC counter exceeds the value provided in EAX:EDX
> > + * 2) global timeout in IA32_UMWAIT_CONTROL is exceeded
> > + * 3) an external interrupt occurs
> > + */
> > +static void delay_halt_tpause(u64 start, u64 cycles) {
> > + u64 until = start + cycles;
> > + unsigned int eax, edx;
> > +
> > + eax = (unsigned int)(until & 0xffffffff);
> > + edx = (unsigned int)(until >> 32);
>
> trivia:
>
> perhaps lower_32_bits and upper_32_bits

Thank you for your comment. I'll update in the next patch.

2020-03-30 23:43:45

by Kyung Min Park

Subject: RE: [PATCH v2 2/2] x86/delay: Introduce TPAUSE delay

Hi Andy/Thomas,

On Fri, Mar 20, 2020 at 4:23 PM Thomas Gleixner <[email protected]> wrote:
> >
> > Andy Lutomirski <[email protected]> writes:
> >
> > > On Fri, Mar 20, 2020 at 3:00 AM Thomas Gleixner <[email protected]>
> wrote:
> > >>
> > >> Andy Lutomirski <[email protected]> writes:
> > >> > On Thu, Mar 19, 2020 at 9:13 PM Kyung Min Park
> <[email protected]> wrote:
> > >> >> void use_tsc_delay(void)
> > >> >> {
> > >> >> - if (delay_fn == delay_loop)
> > >> >> + if (static_cpu_has(X86_FEATURE_WAITPKG)) {
> > >> >> + delay_halt_fn = delay_halt_tpause;
> > >> >> + delay_fn = delay_halt;
> > >> >> + } else if (delay_fn == delay_loop) {
> > >> >> delay_fn = delay_tsc;
> > >> >> + }
> > >> >> }
> > >> >
> > >> > This is an odd way to dispatch: you're using static_cpu_has(),
> > >> > but you're using it once to populate a function pointer. Why not
> > >> > just put the static_cpu_has() directly into delay_halt() and
> > >> > open-code the three variants?
> > >>
> > >> Two: mwaitx and tpause.
> > >
> > > I was imagining there would also be a variant for systems with neither
> feature.
> >
> > Oh I see, you want to get rid of both function pointers. That's tricky.
> >
> > The boot time function is delay_loop() which is using the magic (1 <<
> > 12) boot time value until calibration in one way or the other happens
> > and something calls use_tsc_delay() or use_mwaitx_delay(). Yes, that's
> > all horrible but X86_FEATURE_TSC is unusable for this.
> >
> > Let me think about it.
>
> This is definitely not worth overoptimizing. It's a *delay* function
> -- the retpoline isn't going to kill us :)

Since use_tsc_delay() is called just once, from the __init function tsc_init(),
how about adding "__init" to use_tsc_delay() and keeping these function pointers?