2023-11-15 15:14:14

by Frederic Weisbecker

Subject: [PATCH 0/4] x86/cpuidle fixes and optimization

This is the x86 part of a cpuidle series I posted a while ago. There
is a fix from Peter (Not-yet-signed-off-by, btw), the rest is comments
and optimizations.

Frederic Weisbecker (3):
x86: Add a comment about the "magic" behind shadow sti before mwait
x86: Remove __current_clr_polling() from mwait_idle()
x86: Remove the current_clr_polling() call upon mwait exit

Peter Zijlstra (1):
x86: Fix CPUIDLE_FLAG_IRQ_ENABLE leaking timer reprogram

arch/x86/include/asm/mwait.h | 21 ++++++++++++++++++---
arch/x86/kernel/process.c | 1 -
drivers/acpi/acpi_pad.c | 1 +
drivers/idle/intel_idle.c | 19 +++++++------------
4 files changed, 26 insertions(+), 16 deletions(-)

--
2.42.1


2023-11-15 15:14:17

by Frederic Weisbecker

Subject: [PATCH 1/4] x86: Add a comment about the "magic" behind shadow sti before mwait

Add a note to make sure we never overlook and break the requirements
behind it.
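
For reference, the constraint documented here relies on the STI shadow:
STI keeps interrupts inhibited for exactly one more instruction, so an
IRQ can only be delivered once the CPU is already inside MWAIT. A
minimal sketch of the pairing (simplified; the actual helper below uses
.byte encodings for old assemblers and clears CPU buffers first):

	static inline void sti_mwait_sketch(unsigned long eax, unsigned long ecx)
	{
		/* Nothing may sit between "sti" and "mwait". */
		asm volatile("sti; mwait" :: "a" (eax), "c" (ecx));
	}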

Acked-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
arch/x86/include/asm/mwait.h | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
index 778df05f8539..341ee4f1d91e 100644
--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -87,6 +87,15 @@ static __always_inline void __mwaitx(unsigned long eax, unsigned long ebx,
:: "a" (eax), "b" (ebx), "c" (ecx));
}

+/*
+ * Re-enable interrupts right upon calling mwait in such a way that
+ * no interrupt can fire _before_ the execution of mwait, i.e. no
+ * instruction may be placed between "sti" and "mwait".
+ *
+ * This is necessary because if an interrupt queues a timer before
+ * executing mwait, it would otherwise go unnoticed and the next tick
+ * would not be reprogrammed accordingly before mwait ever wakes up.
+ */
static __always_inline void __sti_mwait(unsigned long eax, unsigned long ecx)
{
mds_idle_clear_cpu_buffers();
--
2.42.1

2023-11-15 15:14:23

by Frederic Weisbecker

Subject: [PATCH 2/4] x86: Fix CPUIDLE_FLAG_IRQ_ENABLE leaking timer reprogram

From: Peter Zijlstra <[email protected]>

intel_idle_irq() re-enables IRQs very early. As a result, an interrupt
may fire before mwait() is eventually called. If such an interrupt queues
a timer, it may go unnoticed until mwait returns and the idle loop
handles the tick re-evaluation. And monitoring TIF_NEED_RESCHED doesn't
help because a local timer enqueue doesn't set that flag.

The issue is mitigated by the fact that this idle handler is only invoked
for shallow C-states when, presumably, the next tick is supposed to be
close enough. There may still be rare cases though when the next tick
is far away and the selected C-state is shallow, resulting in a timer
getting ignored for a while.

Fix this by using __sti_mwait(), whose IRQ re-enablement only takes
effect upon entering mwait, dealing with the race while keeping the
interrupt latency within acceptable bounds.
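
A minimal sketch of the two orderings, using the helpers from the diff
below (simplified, with the surrounding checks omitted):

	/* Old, racy entry for CPUIDLE_FLAG_IRQ_ENABLE states: */
	raw_local_irq_enable();
	/* An IRQ may fire here; a timer it queues goes unnoticed. */
	__monitor((void *)&current_thread_info()->flags, 0, 0);
	if (!need_resched())
		__mwait(eax, ecx);
	raw_local_irq_disable();

	/* New: IRQs are re-enabled only by the STI immediately
	 * preceding MWAIT, so a pending IRQ wakes mwait itself. */
	__monitor((void *)&current_thread_info()->flags, 0, 0);
	if (!need_resched())
		__sti_mwait(eax, ecx);
	raw_local_irq_disable();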

Fixes: c227233ad64c ("intel_idle: enable interrupts before C1 on Xeons")
Not-yet-signed-off-by: Peter Zijlstra <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
arch/x86/include/asm/mwait.h | 11 +++++++++--
drivers/idle/intel_idle.c | 19 +++++++------------
2 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
index 341ee4f1d91e..920426d691ce 100644
--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -124,8 +124,15 @@ static __always_inline void mwait_idle_with_hints(unsigned long eax, unsigned lo
}

__monitor((void *)&current_thread_info()->flags, 0, 0);
- if (!need_resched())
- __mwait(eax, ecx);
+
+ if (!need_resched()) {
+ if (ecx & 1) {
+ __mwait(eax, ecx);
+ } else {
+ __sti_mwait(eax, ecx);
+ raw_local_irq_disable();
+ }
+ }
}
current_clr_polling();
}
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index dcda0afecfc5..3e01a6b23e75 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -131,11 +131,12 @@ static unsigned int mwait_substates __initdata;
#define MWAIT2flg(eax) ((eax & 0xFF) << 24)

static __always_inline int __intel_idle(struct cpuidle_device *dev,
- struct cpuidle_driver *drv, int index)
+ struct cpuidle_driver *drv,
+ int index, bool irqoff)
{
struct cpuidle_state *state = &drv->states[index];
unsigned long eax = flg2MWAIT(state->flags);
- unsigned long ecx = 1; /* break on interrupt flag */
+ unsigned long ecx = 1*irqoff; /* break on interrupt flag */

mwait_idle_with_hints(eax, ecx);

@@ -159,19 +160,13 @@ static __always_inline int __intel_idle(struct cpuidle_device *dev,
static __cpuidle int intel_idle(struct cpuidle_device *dev,
struct cpuidle_driver *drv, int index)
{
- return __intel_idle(dev, drv, index);
+ return __intel_idle(dev, drv, index, true);
}

static __cpuidle int intel_idle_irq(struct cpuidle_device *dev,
struct cpuidle_driver *drv, int index)
{
- int ret;
-
- raw_local_irq_enable();
- ret = __intel_idle(dev, drv, index);
- raw_local_irq_disable();
-
- return ret;
+ return __intel_idle(dev, drv, index, false);
}

static __cpuidle int intel_idle_ibrs(struct cpuidle_device *dev,
@@ -184,7 +179,7 @@ static __cpuidle int intel_idle_ibrs(struct cpuidle_device *dev,
if (smt_active)
__update_spec_ctrl(0);

- ret = __intel_idle(dev, drv, index);
+ ret = __intel_idle(dev, drv, index, true);

if (smt_active)
__update_spec_ctrl(spec_ctrl);
@@ -196,7 +191,7 @@ static __cpuidle int intel_idle_xstate(struct cpuidle_device *dev,
struct cpuidle_driver *drv, int index)
{
fpu_idle_fpregs();
- return __intel_idle(dev, drv, index);
+ return __intel_idle(dev, drv, index, true);
}

/**
--
2.42.1

2023-11-15 15:14:27

by Frederic Weisbecker

Subject: [PATCH 4/4] x86: Remove the current_clr_polling() call upon mwait exit

mwait_idle_with_hints() is mostly called from cpuidle, which already
sets TIF_NR_POLLING back and folds TIF_NEED_RESCHED if necessary.

The only non-cpuidle caller is power_saving_thread() which acts as an idle
loop and is the only reason why mwait_idle_with_hints() carries a costly
fully ordered atomic operation upon idle exit.

Move this overhead into power_saving_thread() instead, which already
duplicates quite a bit of cpuidle code.
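
The overhead in question, as a simplified sketch of current_clr_polling()
(the real helper lives in include/linux/sched/idle.h):

	static inline void clr_polling_sketch(void)
	{
		clear_thread_flag(TIF_POLLING_NRFLAG);	/* atomic RmW */
		/*
		 * Fully ordered so that a remote CPU either still sees
		 * the polling bit (and skips the resched IPI) or its
		 * TIF_NEED_RESCHED write is seen by the check below.
		 */
		smp_mb();
		preempt_fold_need_resched();
	}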

Acked-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
arch/x86/include/asm/mwait.h | 1 -
drivers/acpi/acpi_pad.c | 1 +
2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
index 920426d691ce..c1be3775b94a 100644
--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -134,7 +134,6 @@ static __always_inline void mwait_idle_with_hints(unsigned long eax, unsigned lo
}
}
}
- current_clr_polling();
}

/*
diff --git a/drivers/acpi/acpi_pad.c b/drivers/acpi/acpi_pad.c
index bd1ad07f0290..86b57123713f 100644
--- a/drivers/acpi/acpi_pad.c
+++ b/drivers/acpi/acpi_pad.c
@@ -175,6 +175,7 @@ static int power_saving_thread(void *data)
stop_critical_timings();

mwait_idle_with_hints(power_saving_mwait_eax, 1);
+ current_clr_polling();

start_critical_timings();
tick_broadcast_exit();
--
2.42.1

2023-11-15 15:14:42

by Frederic Weisbecker

Subject: [PATCH 3/4] x86: Remove __current_clr_polling() from mwait_idle()

mwait_idle() is only ever called by cpuidle, either from
default_idle_call() or from cpuidle_enter(). In either case
cpuidle_idle_call() sets TIF_NR_POLLING again after calling it, so there
is no point in this atomic operation upon idle exit.
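
Roughly sketched, the redundancy in the call flow:

	do_idle()
	    cpuidle_idle_call()
	        default_idle_call() or cpuidle_enter()
	            mwait_idle()	/* clears TIF_NR_POLLING on exit */
	        exit_idle:
	            __current_set_polling()	/* ...only to set it again */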

Acked-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
arch/x86/kernel/process.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index b6f4e8399fca..fc7a38084606 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -930,7 +930,6 @@ static __cpuidle void mwait_idle(void)
raw_local_irq_disable();
}
}
- __current_clr_polling();
}

void select_idle_routine(const struct cpuinfo_x86 *c)
--
2.42.1

2023-11-15 15:54:29

by Peter Zijlstra

Subject: Re: [PATCH 2/4] x86: Fix CPUIDLE_FLAG_IRQ_ENABLE leaking timer reprogram

On Wed, Nov 15, 2023 at 10:13:23AM -0500, Frederic Weisbecker wrote:
> From: Peter Zijlstra <[email protected]>
>
> intel_idle_irq() re-enables IRQs very early. As a result, an interrupt
> may fire before mwait() is eventually called. If such an interrupt queues
> a timer, it may go unnoticed until mwait returns and the idle loop
> handles the tick re-evaluation. And monitoring TIF_NEED_RESCHED doesn't
> help because a local timer enqueue doesn't set that flag.
>
> The issue is mitigated by the fact that this idle handler is only invoked
> for shallow C-states when, presumably, the next tick is supposed to be
> close enough. There may still be rare cases though when the next tick
> is far away and the selected C-state is shallow, resulting in a timer
> getting ignored for a while.
>
> Fix this by using __sti_mwait(), whose IRQ re-enablement only takes
> effect upon entering mwait, dealing with the race while keeping the
> interrupt latency within acceptable bounds.
>
> Fixes: c227233ad64c ("intel_idle: enable interrupts before C1 on Xeons")
> Not-yet-signed-off-by: Peter Zijlstra <[email protected]>

Feel free to change it to a normal SOB; I'm assuming it actually compiles
and works by now :-)

> Acked-by: Rafael J. Wysocki <[email protected]>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> ---
> arch/x86/include/asm/mwait.h | 11 +++++++++--
> drivers/idle/intel_idle.c | 19 +++++++------------
> 2 files changed, 16 insertions(+), 14 deletions(-)
>
> diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
> index 341ee4f1d91e..920426d691ce 100644
> --- a/arch/x86/include/asm/mwait.h
> +++ b/arch/x86/include/asm/mwait.h
> @@ -124,8 +124,15 @@ static __always_inline void mwait_idle_with_hints(unsigned long eax, unsigned lo
> }
>
> __monitor((void *)&current_thread_info()->flags, 0, 0);
> - if (!need_resched())
> - __mwait(eax, ecx);
> +
> + if (!need_resched()) {
> + if (ecx & 1) {
> + __mwait(eax, ecx);
> + } else {
> + __sti_mwait(eax, ecx);
> + raw_local_irq_disable();
> + }
> + }
> }
> current_clr_polling();
> }
> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
> index dcda0afecfc5..3e01a6b23e75 100644
> --- a/drivers/idle/intel_idle.c
> +++ b/drivers/idle/intel_idle.c
> @@ -131,11 +131,12 @@ static unsigned int mwait_substates __initdata;
> #define MWAIT2flg(eax) ((eax & 0xFF) << 24)
>
> static __always_inline int __intel_idle(struct cpuidle_device *dev,
> - struct cpuidle_driver *drv, int index)
> + struct cpuidle_driver *drv,
> + int index, bool irqoff)
> {
> struct cpuidle_state *state = &drv->states[index];
> unsigned long eax = flg2MWAIT(state->flags);
> - unsigned long ecx = 1; /* break on interrupt flag */
> + unsigned long ecx = 1*irqoff; /* break on interrupt flag */
>
> mwait_idle_with_hints(eax, ecx);
>
> @@ -159,19 +160,13 @@ static __always_inline int __intel_idle(struct cpuidle_device *dev,
> static __cpuidle int intel_idle(struct cpuidle_device *dev,
> struct cpuidle_driver *drv, int index)
> {
> - return __intel_idle(dev, drv, index);
> + return __intel_idle(dev, drv, index, true);
> }
>
> static __cpuidle int intel_idle_irq(struct cpuidle_device *dev,
> struct cpuidle_driver *drv, int index)
> {
> - int ret;
> -
> - raw_local_irq_enable();
> - ret = __intel_idle(dev, drv, index);
> - raw_local_irq_disable();
> -
> - return ret;
> + return __intel_idle(dev, drv, index, false);
> }
>
> static __cpuidle int intel_idle_ibrs(struct cpuidle_device *dev,
> @@ -184,7 +179,7 @@ static __cpuidle int intel_idle_ibrs(struct cpuidle_device *dev,
> if (smt_active)
> __update_spec_ctrl(0);
>
> - ret = __intel_idle(dev, drv, index);
> + ret = __intel_idle(dev, drv, index, true);
>
> if (smt_active)
> __update_spec_ctrl(spec_ctrl);
> @@ -196,7 +191,7 @@ static __cpuidle int intel_idle_xstate(struct cpuidle_device *dev,
> struct cpuidle_driver *drv, int index)
> {
> fpu_idle_fpregs();
> - return __intel_idle(dev, drv, index);
> + return __intel_idle(dev, drv, index, true);
> }
>
> /**
> --
> 2.42.1
>

2023-11-15 15:58:05

by Frederic Weisbecker

Subject: Re: [PATCH 2/4] x86: Fix CPUIDLE_FLAG_IRQ_ENABLE leaking timer reprogram

On Wed, Nov 15, 2023 at 04:52:32PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 15, 2023 at 10:13:23AM -0500, Frederic Weisbecker wrote:
> > From: Peter Zijlstra <[email protected]>
> >
> > intel_idle_irq() re-enables IRQs very early. As a result, an interrupt
> > may fire before mwait() is eventually called. If such an interrupt queues
> > a timer, it may go unnoticed until mwait returns and the idle loop
> > handles the tick re-evaluation. And monitoring TIF_NEED_RESCHED doesn't
> > help because a local timer enqueue doesn't set that flag.
> >
> > The issue is mitigated by the fact that this idle handler is only invoked
> > for shallow C-states when, presumably, the next tick is supposed to be
> > close enough. There may still be rare cases though when the next tick
> > is far away and the selected C-state is shallow, resulting in a timer
> > getting ignored for a while.
> >
> > Fix this by using __sti_mwait(), whose IRQ re-enablement only takes
> > effect upon entering mwait, dealing with the race while keeping the
> > interrupt latency within acceptable bounds.
> >
> > Fixes: c227233ad64c ("intel_idle: enable interrupts before C1 on Xeons")
> > Not-yet-signed-off-by: Peter Zijlstra <[email protected]>
>
> Feel free to change it to a normal SOB; I'm assuming it actually compiles
> and works by now :-)

Not sure, I might have tested it at some point ;-)

Thanks!

2023-11-16 15:13:48

by Peter Zijlstra

Subject: Re: [PATCH 3/4] x86: Remove __current_clr_polling() from mwait_idle()

On Wed, Nov 15, 2023 at 10:13:24AM -0500, Frederic Weisbecker wrote:
> mwait_idle() is only ever called by cpuidle, either from
> default_idle_call() or from cpuidle_enter(). In either case
> cpuidle_idle_call() sets TIF_NR_POLLING again after calling it, so there
> is no point in this atomic operation upon idle exit.
>
> Acked-by: Rafael J. Wysocki <[email protected]>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> ---
> arch/x86/kernel/process.c | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index b6f4e8399fca..fc7a38084606 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -930,7 +930,6 @@ static __cpuidle void mwait_idle(void)
> raw_local_irq_disable();
> }
> }
> - __current_clr_polling();
> }
>
> void select_idle_routine(const struct cpuinfo_x86 *c)


Urgh at this and the next one... That is, yes we can do this, but it
makes these functions asymmetric and doesn't actually solve the
underlying problem that all of the polling stuff is inside-out.

Idle loop sets polling, then clears polling because it assumes all
arch/driver idle loops are non-polling, then individual drivers re-set
polling, and to be symmetric (above) clear it again, for the generic
code to set it again, only to clear it again when leaving idle.

Follow that? ;-)

Anyway, drivers ought to tell up-front if they're polling and then we
can avoid the whole dance and everything is better.

Something like the very crude below.

---
arch/x86/include/asm/mwait.h | 3 +--
arch/x86/kernel/process.c | 3 +--
drivers/acpi/processor_idle.c | 2 +-
drivers/idle/intel_idle.c | 3 +++
include/linux/cpuidle.h | 6 ++++++
include/linux/sched/idle.h | 7 ++++++-
kernel/sched/idle.c | 22 +++++++++++++++++-----
7 files changed, 35 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
index 778df05f8539..c0c60fde7a4d 100644
--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -107,7 +107,7 @@ static __always_inline void __sti_mwait(unsigned long eax, unsigned long ecx)
*/
static __always_inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
{
- if (static_cpu_has_bug(X86_BUG_MONITOR) || !current_set_polling_and_test()) {
+ if (static_cpu_has_bug(X86_BUG_MONITOR) || !need_resched()) {
if (static_cpu_has_bug(X86_BUG_CLFLUSH_MONITOR)) {
mb();
clflush((void *)&current_thread_info()->flags);
@@ -118,7 +118,6 @@ static __always_inline void mwait_idle_with_hints(unsigned long eax, unsigned lo
if (!need_resched())
__mwait(eax, ecx);
}
- current_clr_polling();
}

/*
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index b6f4e8399fca..73baa82584c2 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -917,7 +917,7 @@ static int prefer_mwait_c1_over_halt(const struct cpuinfo_x86 *c)
*/
static __cpuidle void mwait_idle(void)
{
- if (!current_set_polling_and_test()) {
+ if (!need_resched()) {
if (this_cpu_has(X86_BUG_CLFLUSH_MONITOR)) {
mb(); /* quirk */
clflush((void *)&current_thread_info()->flags);
@@ -930,7 +930,6 @@ static __cpuidle void mwait_idle(void)
raw_local_irq_disable();
}
}
- __current_clr_polling();
}

void select_idle_routine(const struct cpuinfo_x86 *c)
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 3a34a8c425fe..58765557b1b8 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -1219,7 +1219,7 @@ static int acpi_processor_setup_lpi_states(struct acpi_processor *pr)
state->target_residency = lpi->min_residency;
state->flags |= arch_get_idle_state_flags(lpi->arch_flags);
if (i != 0 && lpi->entry_method == ACPI_CSTATE_FFH)
- state->flags |= CPUIDLE_FLAG_RCU_IDLE;
+ state->flags |= CPUIDLE_FLAG_RCU_IDLE | CPUIDLE_FLAG_NO_IPI;
state->enter = acpi_idle_lpi_enter;
drv->safe_state_index = i;
}
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index dcda0afecfc5..496b6a309f4f 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -1569,6 +1569,7 @@ static void __init intel_idle_init_cstates_acpi(struct cpuidle_driver *drv)
state->target_residency *= 3;

state->flags = MWAIT2flg(cx->address);
+ state->flags |= CPUIDLE_FLAG_NO_IPI;
if (cx->type > ACPI_STATE_C2)
state->flags |= CPUIDLE_FLAG_TLB_FLUSHED;

@@ -1841,6 +1842,8 @@ static bool __init intel_idle_verify_cstate(unsigned int mwait_hint)

static void state_update_enter_method(struct cpuidle_state *state, int cstate)
{
+ state->flags |= CPUIDLE_FLAG_NO_IPI;
+
if (state->flags & CPUIDLE_FLAG_INIT_XSTATE) {
/*
* Combining with XSTATE with IBRS or IRQ_ENABLE flags
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 3183aeb7f5b4..623d88bd7658 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -85,6 +85,7 @@ struct cpuidle_state {
#define CPUIDLE_FLAG_OFF BIT(4) /* disable this state by default */
#define CPUIDLE_FLAG_TLB_FLUSHED BIT(5) /* idle-state flushes TLBs */
#define CPUIDLE_FLAG_RCU_IDLE BIT(6) /* idle-state takes care of RCU */
+#define CPUIDLE_FLAG_NO_IPI BIT(7) /* has TIF_POLLING_NRFLAG */

struct cpuidle_device_kobj;
struct cpuidle_state_kobj;
@@ -168,6 +169,11 @@ struct cpuidle_driver {
};

#ifdef CONFIG_CPU_IDLE
+static inline bool cpuidle_state_needs_ipi(struct cpuidle_driver *drv, int state)
+{
+ return !(drv->states[state].flags & CPUIDLE_FLAG_NO_IPI);
+}
+
extern void disable_cpuidle(void);
extern bool cpuidle_not_available(struct cpuidle_driver *drv,
struct cpuidle_device *dev);
diff --git a/include/linux/sched/idle.h b/include/linux/sched/idle.h
index 478084f9105e..24b29bb5d43b 100644
--- a/include/linux/sched/idle.h
+++ b/include/linux/sched/idle.h
@@ -68,6 +68,8 @@ static __always_inline bool __must_check current_set_polling_and_test(void)

static __always_inline bool __must_check current_clr_polling_and_test(void)
{
+ bool ret;
+
__current_clr_polling();

/*
@@ -76,7 +78,10 @@ static __always_inline bool __must_check current_clr_polling_and_test(void)
*/
smp_mb__after_atomic();

- return unlikely(tif_need_resched());
+ ret = unlikely(tif_need_resched());
+ if (ret)
+ __current_set_polling();
+ return ret;
}

#else
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 565f8374ddbb..65e2474cb82b 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -94,11 +94,12 @@ void __cpuidle default_idle_call(void)
stop_critical_timings();

ct_cpuidle_enter();
- arch_cpu_idle();
+ arch_cpu_idle(); // XXX assumes !polling
ct_cpuidle_exit();

start_critical_timings();
trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
+ __current_set_polling();
}
local_irq_enable();
instrumentation_end();
@@ -107,20 +108,27 @@ void __cpuidle default_idle_call(void)
static int call_cpuidle_s2idle(struct cpuidle_driver *drv,
struct cpuidle_device *dev)
{
+ int ret;
+
if (current_clr_polling_and_test())
return -EBUSY;

- return cpuidle_enter_s2idle(drv, dev);
+ ret = cpuidle_enter_s2idle(drv, dev);
+ __current_set_polling();
+ return ret;
}

static int call_cpuidle(struct cpuidle_driver *drv, struct cpuidle_device *dev,
int next_state)
{
+ bool need_ipi = cpuidle_state_needs_ipi(drv, next_state);
+ int ret;
+
/*
* The idle task must be scheduled, it is pointless to go to idle, just
* update no idle residency and return.
*/
- if (current_clr_polling_and_test()) {
+ if (need_ipi && current_clr_polling_and_test()) {
dev->last_residency_ns = 0;
local_irq_enable();
return -EBUSY;
@@ -131,7 +139,12 @@ static int call_cpuidle(struct cpuidle_driver *drv, struct cpuidle_device *dev,
* This function will block until an interrupt occurs and will take
* care of re-enabling the local interrupts
*/
- return cpuidle_enter(drv, dev, next_state);
+ ret = cpuidle_enter(drv, dev, next_state);
+
+ if (need_ipi)
+ __current_set_polling();
+
+ return ret;
}

/**
@@ -220,7 +233,6 @@ static void cpuidle_idle_call(void)
}

exit_idle:
- __current_set_polling();

/*
* It is up to the idle functions to reenable local interrupts

2023-11-16 18:49:17

by Frederic Weisbecker

Subject: Re: [PATCH 3/4] x86: Remove __current_clr_polling() from mwait_idle()

On Thu, Nov 16, 2023 at 04:13:16PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 15, 2023 at 10:13:24AM -0500, Frederic Weisbecker wrote:
> > mwait_idle() is only ever called by cpuidle, either from
> > default_idle_call() or from cpuidle_enter(). In either case
> > cpuidle_idle_call() sets TIF_NR_POLLING again after calling it, so there
> > is no point in this atomic operation upon idle exit.
> >
> > Acked-by: Rafael J. Wysocki <[email protected]>
> > Signed-off-by: Frederic Weisbecker <[email protected]>
> > ---
> > arch/x86/kernel/process.c | 1 -
> > 1 file changed, 1 deletion(-)
> >
> > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> > index b6f4e8399fca..fc7a38084606 100644
> > --- a/arch/x86/kernel/process.c
> > +++ b/arch/x86/kernel/process.c
> > @@ -930,7 +930,6 @@ static __cpuidle void mwait_idle(void)
> > raw_local_irq_disable();
> > }
> > }
> > - __current_clr_polling();
> > }
> >
> > void select_idle_routine(const struct cpuinfo_x86 *c)
>
>
> Urgh at this and the next one... That is, yes we can do this, but it
> makes these function asymmetric and doesn't actually solve the
> underlying problem that all of the polling stuff is inside-out.
>
> Idle loop sets polling, then clears polling because it assumes all
> arch/driver idle loops are non-polling, then individual drivers re-set
> polling, and to be symmetric (above) clear it again, for the generic
> code to set it again, only to clear it again when leaving idle.
>
> Follow that? ;-)

That's right :-)

>
> Anyway, drivers ought to tell up-front if they're polling and then we
> can avoid the whole dance and everything is better.
>
> Something like the very crude below.

Yeah that makes perfect sense (can I use your SoB right away?)

Though I sometimes wonder why we even bother setting TIF_NR_POLLING
for some short sections of the generic idle loop, even on !mwait and
!cpuidle-state-polling states.

For example, why do we bother setting TIF_NR_POLLING just for the
portion of the generic idle loop that looks up the cpuidle state and
stops the tick, only to clear TIF_NR_POLLING before calling wfi on ARM?

Or maybe it's a frequent pattern for a remote wakeup to happen while
entering the idle loop?
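
(For context, the consumer of the flag: when waking a remote CPU the
scheduler does a test-and-set of TIF_NEED_RESCHED and skips the IPI if
the target advertised itself as polling. Roughly, from resched_curr()
in kernel/sched/core.c:)

	if (set_nr_and_not_polling(curr))	/* target isn't polling: */
		smp_send_reschedule(cpu);	/* the IPI is needed */
	else
		trace_sched_wake_idle_without_ipi(cpu);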

Thanks.

Subject: [tip: x86/core] x86: Fix CPUIDLE_FLAG_IRQ_ENABLE leaking timer reprogram

The following commit has been merged into the x86/core branch of tip:

Commit-ID: edc8fc01f608108b0b7580cb2c29dfb5135e5f0e
Gitweb: https://git.kernel.org/tip/edc8fc01f608108b0b7580cb2c29dfb5135e5f0e
Author: Peter Zijlstra <[email protected]>
AuthorDate: Wed, 15 Nov 2023 10:13:23 -05:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Wed, 29 Nov 2023 15:44:01 +01:00

x86: Fix CPUIDLE_FLAG_IRQ_ENABLE leaking timer reprogram

intel_idle_irq() re-enables IRQs very early. As a result, an interrupt
may fire before mwait() is eventually called. If such an interrupt queues
a timer, it may go unnoticed until mwait returns and the idle loop
handles the tick re-evaluation. And monitoring TIF_NEED_RESCHED doesn't
help because a local timer enqueue doesn't set that flag.

The issue is mitigated by the fact that this idle handler is only invoked
for shallow C-states when, presumably, the next tick is supposed to be
close enough. There may still be rare cases though when the next tick
is far away and the selected C-state is shallow, resulting in a timer
getting ignored for a while.

Fix this by using __sti_mwait(), whose IRQ re-enablement only takes
effect upon entering mwait, dealing with the race while keeping the
interrupt latency within acceptable bounds.

Fixes: c227233ad64c ("intel_idle: enable interrupts before C1 on Xeons")
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/include/asm/mwait.h | 11 +++++++++--
drivers/idle/intel_idle.c | 19 +++++++------------
2 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
index 341ee4f..920426d 100644
--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -124,8 +124,15 @@ static __always_inline void mwait_idle_with_hints(unsigned long eax, unsigned lo
}

__monitor((void *)&current_thread_info()->flags, 0, 0);
- if (!need_resched())
- __mwait(eax, ecx);
+
+ if (!need_resched()) {
+ if (ecx & 1) {
+ __mwait(eax, ecx);
+ } else {
+ __sti_mwait(eax, ecx);
+ raw_local_irq_disable();
+ }
+ }
}
current_clr_polling();
}
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index dcda0af..3e01a6b 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -131,11 +131,12 @@ static unsigned int mwait_substates __initdata;
#define MWAIT2flg(eax) ((eax & 0xFF) << 24)

static __always_inline int __intel_idle(struct cpuidle_device *dev,
- struct cpuidle_driver *drv, int index)
+ struct cpuidle_driver *drv,
+ int index, bool irqoff)
{
struct cpuidle_state *state = &drv->states[index];
unsigned long eax = flg2MWAIT(state->flags);
- unsigned long ecx = 1; /* break on interrupt flag */
+ unsigned long ecx = 1*irqoff; /* break on interrupt flag */

mwait_idle_with_hints(eax, ecx);

@@ -159,19 +160,13 @@ static __always_inline int __intel_idle(struct cpuidle_device *dev,
static __cpuidle int intel_idle(struct cpuidle_device *dev,
struct cpuidle_driver *drv, int index)
{
- return __intel_idle(dev, drv, index);
+ return __intel_idle(dev, drv, index, true);
}

static __cpuidle int intel_idle_irq(struct cpuidle_device *dev,
struct cpuidle_driver *drv, int index)
{
- int ret;
-
- raw_local_irq_enable();
- ret = __intel_idle(dev, drv, index);
- raw_local_irq_disable();
-
- return ret;
+ return __intel_idle(dev, drv, index, false);
}

static __cpuidle int intel_idle_ibrs(struct cpuidle_device *dev,
@@ -184,7 +179,7 @@ static __cpuidle int intel_idle_ibrs(struct cpuidle_device *dev,
if (smt_active)
__update_spec_ctrl(0);

- ret = __intel_idle(dev, drv, index);
+ ret = __intel_idle(dev, drv, index, true);

if (smt_active)
__update_spec_ctrl(spec_ctrl);
@@ -196,7 +191,7 @@ static __cpuidle int intel_idle_xstate(struct cpuidle_device *dev,
struct cpuidle_driver *drv, int index)
{
fpu_idle_fpregs();
- return __intel_idle(dev, drv, index);
+ return __intel_idle(dev, drv, index, true);
}

/**

Subject: [tip: x86/core] x86: Add a comment about the "magic" behind shadow sti before mwait

The following commit has been merged into the x86/core branch of tip:

Commit-ID: 7d09a052a3bdb62de9a86d43359d6c22eeaf105a
Gitweb: https://git.kernel.org/tip/7d09a052a3bdb62de9a86d43359d6c22eeaf105a
Author: Frederic Weisbecker <[email protected]>
AuthorDate: Wed, 15 Nov 2023 10:13:22 -05:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Wed, 29 Nov 2023 15:44:01 +01:00

x86: Add a comment about the "magic" behind shadow sti before mwait

Add a note to make sure we never overlook and break the requirements
behind it.

Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/include/asm/mwait.h | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
index 778df05..341ee4f 100644
--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -87,6 +87,15 @@ static __always_inline void __mwaitx(unsigned long eax, unsigned long ebx,
:: "a" (eax), "b" (ebx), "c" (ecx));
}

+/*
+ * Re-enable interrupts right upon calling mwait in such a way that
+ * no interrupt can fire _before_ the execution of mwait, i.e. no
+ * instruction may be placed between "sti" and "mwait".
+ *
+ * This is necessary because if an interrupt queues a timer before
+ * executing mwait, it would otherwise go unnoticed and the next tick
+ * would not be reprogrammed accordingly before mwait ever wakes up.
+ */
static __always_inline void __sti_mwait(unsigned long eax, unsigned long ecx)
{
mds_idle_clear_cpu_buffers();

2023-11-30 11:16:00

by Peter Zijlstra

Subject: Re: [tip: x86/core] x86: Fix CPUIDLE_FLAG_IRQ_ENABLE leaking timer reprogram

On Wed, Nov 29, 2023 at 02:55:55PM -0000, tip-bot2 for Peter Zijlstra wrote:
> diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
> index 341ee4f..920426d 100644
> --- a/arch/x86/include/asm/mwait.h
> +++ b/arch/x86/include/asm/mwait.h
> @@ -124,8 +124,15 @@ static __always_inline void mwait_idle_with_hints(unsigned long eax, unsigned lo
> }
>
> __monitor((void *)&current_thread_info()->flags, 0, 0);
> - if (!need_resched())
> - __mwait(eax, ecx);
> +
> + if (!need_resched()) {
> + if (ecx & 1) {
> + __mwait(eax, ecx);
> + } else {
> + __sti_mwait(eax, ecx);
> + raw_local_irq_disable();
> + }
> + }

Andrew noted that this is only safe if it precludes #DB from happening
on mwait, because #DB can wreck the STI shadow thing.

> @@ -159,19 +160,13 @@ static __always_inline int __intel_idle(struct cpuidle_device *dev,
> static __cpuidle int intel_idle(struct cpuidle_device *dev,
> struct cpuidle_driver *drv, int index)
> {
> + return __intel_idle(dev, drv, index, true);
> }
>
> static __cpuidle int intel_idle_irq(struct cpuidle_device *dev,
> struct cpuidle_driver *drv, int index)
> {
> + return __intel_idle(dev, drv, index, false);
> }
>
> static __cpuidle int intel_idle_ibrs(struct cpuidle_device *dev,
> @@ -184,7 +179,7 @@ static __cpuidle int intel_idle_ibrs(struct cpuidle_device *dev,
> if (smt_active)
> __update_spec_ctrl(0);
>
> + ret = __intel_idle(dev, drv, index, true);
>
> if (smt_active)
> __update_spec_ctrl(spec_ctrl);
> @@ -196,7 +191,7 @@ static __cpuidle int intel_idle_xstate(struct cpuidle_device *dev,
> struct cpuidle_driver *drv, int index)
> {
> fpu_idle_fpregs();
> + return __intel_idle(dev, drv, index, true);
> }

This is so because all mwait users should be in the __cpuidle section,
which itself is part of the noinstr section, and as such
kprobes/hw-breakpoints etc. are disallowed.

Notably, vmlinux.lds.h has:

#define NOINSTR_TEXT \
ALIGN_FUNCTION(); \
__noinstr_text_start = .; \
*(.noinstr.text) \
__cpuidle_text_start = .; \
*(.cpuidle.text) \
__cpuidle_text_end = .; \
__noinstr_text_end = .;
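
For reference, the annotation that places code in that range is roughly
(a sketch; see include/linux/cpu.h):

	#define __cpuidle __noinstr_section(".cpuidle.text")

so anything marked __cpuidle lands between __cpuidle_text_start and
__cpuidle_text_end above, inside the noinstr region that the kprobes
and hw-breakpoint blacklists refuse to instrument.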

2023-12-12 13:47:12

by Frederic Weisbecker

Subject: Re: [tip: x86/core] x86: Fix CPUIDLE_FLAG_IRQ_ENABLE leaking timer reprogram

On Thu, Nov 30, 2023 at 12:15:19PM +0100, Peter Zijlstra wrote:
> This is so, because all mwait users should be in __cpuidle section,
> which itself is part of the noinstr section and as such
> kprobes/hw-breakpoints etc.. are disallowed.
>
> Notable vmlinux.lds.h has:
>
> #define NOINSTR_TEXT \
> ALIGN_FUNCTION(); \
> __noinstr_text_start = .; \
> *(.noinstr.text) \
> __cpuidle_text_start = .; \
> *(.cpuidle.text) \
> __cpuidle_text_end = .; \
> __noinstr_text_end = .;

So #DB isn't supposed to happen then, right? Or did you notice an mwait
user that doesn't have __cpuidle?

Thanks.