Hi All,
Thanks a lot for the discussion and testing so far!
This is a total respin of the whole series, so please look at it afresh.
Patches 2 and 3 are the most similar to their previous versions, but
they are still different enough to deserve a fresh look.
The summary below still applies:
On Sunday, March 4, 2018 11:21:30 PM CET Rafael J. Wysocki wrote:
>
> The problem is that if we stop the sched tick in
> tick_nohz_idle_enter() and then the idle governor predicts short idle
> duration, we lose regardless of whether or not it is right.
>
> If it is right, we've lost already, because we stopped the tick
> unnecessarily. If it is not right, we'll lose going forward, because
> the idle state selected by the governor is going to be too shallow and
> we'll draw too much power (that has been reported recently to actually
> happen often enough for people to care).
>
> This patch series is an attempt to improve the situation and the idea
> here is to make the decision whether or not to stop the tick deeper in
> the idle loop and in particular after running the idle state selection
> in the path where the idle governor is invoked. This way the problem
> can be avoided, because the idle duration predicted by the idle governor
> can be used to decide whether or not to stop the tick so that the tick
> is only stopped if that value is large enough (and, consequently, the
> idle state selected by the governor is deep enough).
>
> The series tries to avoid adding too much new code, rather reordering the
> existing code and making it more fine-grained.
Patch 1 prepares the tick-sched code for the subsequent modifications and it
doesn't change the code's functionality (at least not intentionally).
Patch 2 starts pushing the tick stopping decision deeper into the idle
loop, but it is limited to do_idle() and tick_nohz_irq_exit().
Patch 3 makes cpuidle_idle_call() decide whether or not to stop the tick
and sets the stage for the subsequent changes.
Patch 4 adds a bool pointer argument to cpuidle_select() and the ->select
governor callback, allowing them to return a "nohz" hint to the caller on
whether or not to stop the tick. It also adds code to the menu governor to
decide what value to return as "nohz".
Patch 5 reorders the idle state selection with respect to the stopping of the
tick and causes the additional "nohz" hint from cpuidle_select() to be used
for deciding whether or not to stop the tick.
Patch 6 causes the menu governor to refine the state selection in case the
tick is not going to be stopped and the already selected state may not fit
before the next tick time.
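The resulting decision flow can be sketched as a standalone C model (the state table, tick length, and function names here are made up for illustration; this is not the kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative tick length (1000 us, i.e. HZ=1000); not a kernel constant. */
#define TOY_TICK_US 1000U

struct toy_cstate {
	unsigned int target_residency_us;
};

/*
 * Governor stand-in: pick the deepest state whose target residency fits
 * the predicted idle duration, and report through *nohz whether the sleep
 * is expected to be long enough to justify stopping the tick.
 */
static int toy_select(const struct toy_cstate *states, int count,
		      unsigned int predicted_us, bool *nohz)
{
	int idx = 0;
	int i;

	for (i = 1; i < count; i++)
		if (states[i].target_residency_us <= predicted_us)
			idx = i;

	/* Only ask for the tick to be stopped for tick-spanning sleeps. */
	*nohz = states[idx].target_residency_us >= TOY_TICK_US;
	return idx;
}
```

The caller would then stop the tick only when *nohz comes back true, which is the ordering patch 5 introduces.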
And the two paragraphs below still apply:
> I have tested these patches on a couple of machines, including the very laptop
> I'm sending them from, without any obvious issues, but please give them a go
> if you can, especially if you have an easy way to reproduce the problem they
> are targeting.
>
> The above said, this is just RFC, so no pets close to the machines running it,
> please.
The patches are on top of 4.16-rc4 (if you need a git branch with them for
easier testing, please let me know).
Thanks,
Rafael
From: Rafael J. Wysocki <[email protected]>
In order to address the issue with short idle duration predictions
by the idle governor after the tick has been stopped, reorder the
code in cpuidle_idle_call() so that the governor idle state selection
runs before tick_nohz_idle_stop_tick() and use the "nohz" hint returned
by cpuidle_select() to decide whether or not to stop the tick.
This isn't straightforward, because menu_select() invokes
tick_nohz_get_sleep_length() to get the time to the next timer
event and the number returned by the latter comes from
__tick_nohz_idle_stop_tick(). Fortunately, however, it is possible
to compute that number without actually stopping the tick and with
the help of the existing code.
Namely, notice that tick_nohz_stop_sched_tick() already computes the
next timer event time to reprogram the scheduler tick hrtimer and
that time can be used as a proxy for the actual next timer event
time in the idle duration prediction. Moreover, it is possible
to split tick_nohz_stop_sched_tick() into two separate routines,
one computing the time to the next timer event and the other
simply stopping the tick when the time to the next timer event
is known.
Accordingly, split tick_nohz_stop_sched_tick() into
__tick_nohz_next_event() and __tick_nohz_stop_tick() and use the
former in tick_nohz_get_sleep_length(). Add two new fields,
timer_expires and timer_expires_basemono, to struct tick_sched for
passing data between these two functions and one more field,
tick_may_stop, to indicate that __tick_nohz_next_event() has run
and __tick_nohz_stop_tick() can be called now. Also drop the now
redundant sleep_length field from there.
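As a rough standalone illustration of that split (all "toy_" names and fields are hypothetical stand-ins, not the kernel's struct tick_sched handling):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical miniature of the two-phase split; not struct tick_sched. */
struct toy_tick_sched {
	unsigned long long timer_expires;	/* cached next event time */
	bool tick_may_stop;			/* phase one has run */
	bool tick_stopped;
};

/* Phase one: compute and cache the next event, do not touch the tick. */
static unsigned long long toy_next_event(struct toy_tick_sched *ts,
					 unsigned long long next_timer)
{
	ts->timer_expires = next_timer;
	ts->tick_may_stop = true;	/* phase two may run now */
	return ts->timer_expires;
}

/* Phase two: consume the cached value and actually stop the tick. */
static void toy_stop_tick(struct toy_tick_sched *ts)
{
	if (!ts->tick_may_stop)
		return;			/* phase one hasn't run */

	ts->tick_may_stop = false;	/* don't stop it twice in a row */
	ts->tick_stopped = true;	/* hrtimer reprogramming would go here */
}
```

Phase one can thus serve tick_nohz_get_sleep_length() without side effects, and phase two reuses its cached result instead of recomputing it.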
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
include/linux/tick.h | 2
kernel/sched/idle.c | 11 ++-
kernel/time/tick-sched.c | 146 ++++++++++++++++++++++++++++++-----------------
kernel/time/tick-sched.h | 6 +
4 files changed, 110 insertions(+), 55 deletions(-)
Index: linux-pm/kernel/time/tick-sched.h
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.h
+++ linux-pm/kernel/time/tick-sched.h
@@ -30,6 +30,7 @@ enum tick_nohz_mode {
* when the CPU returns from nohz sleep.
* @next_tick: Next tick to be fired when in dynticks mode.
* @tick_stopped: Indicator that the idle tick has been stopped
+ * @tick_may_stop: Indicator that the idle tick may be stopped shortly
* @idle_jiffies: jiffies at the entry to idle for idle time accounting
* @idle_calls: Total number of idle calls
* @idle_sleeps: Number of idle calls, where the sched tick was stopped
@@ -38,7 +39,6 @@ enum tick_nohz_mode {
* @idle_exittime: Time when the idle state was left
* @idle_sleeptime: Sum of the time slept in idle with sched tick stopped
* @iowait_sleeptime: Sum of the time slept in idle with sched tick stopped, with IO outstanding
- * @sleep_length: Duration of the current idle sleep
* @do_timer_lst: CPU was the last one doing do_timer before going idle
*/
struct tick_sched {
@@ -49,6 +49,7 @@ struct tick_sched {
ktime_t next_tick;
int inidle;
int tick_stopped;
+ int tick_may_stop;
unsigned long idle_jiffies;
unsigned long idle_calls;
unsigned long idle_sleeps;
@@ -58,8 +59,9 @@ struct tick_sched {
ktime_t idle_exittime;
ktime_t idle_sleeptime;
ktime_t iowait_sleeptime;
- ktime_t sleep_length;
unsigned long last_jiffies;
+ u64 timer_expires;
+ u64 timer_expires_basemono;
u64 next_timer;
ktime_t idle_expires;
int do_timer_last;
Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -190,13 +190,18 @@ static void cpuidle_idle_call(void)
} else {
bool nohz = true;
- tick_nohz_idle_stop_tick();
- rcu_idle_enter();
-
/*
* Ask the cpuidle framework to choose a convenient idle state.
*/
next_state = cpuidle_select(drv, dev, &nohz);
+
+ if (nohz)
+ tick_nohz_idle_stop_tick();
+ else
+ tick_nohz_idle_retain_tick();
+
+ rcu_idle_enter();
+
entered_state = call_cpuidle(drv, dev, next_state);
/*
* Give the governor an opportunity to reflect on the outcome
Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -652,13 +652,10 @@ static inline bool local_timer_softirq_p
return local_softirq_pending() & TIMER_SOFTIRQ;
}
-static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
- ktime_t now, int cpu)
+static ktime_t __tick_nohz_next_event(struct tick_sched *ts, int cpu)
{
- struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
u64 basemono, next_tick, next_tmr, next_rcu, delta, expires;
unsigned long seq, basejiff;
- ktime_t tick;
/* Read jiffies and the time when jiffies were updated last */
do {
@@ -667,6 +664,7 @@ static ktime_t tick_nohz_stop_sched_tick
basejiff = jiffies;
} while (read_seqretry(&jiffies_lock, seq));
ts->last_jiffies = basejiff;
+ ts->timer_expires_basemono = basemono;
/*
* Keep the periodic tick, when RCU, architecture or irq_work
@@ -711,31 +709,24 @@ static ktime_t tick_nohz_stop_sched_tick
* next period, so no point in stopping it either, bail.
*/
if (!ts->tick_stopped) {
- tick = 0;
+ ts->timer_expires = 0;
goto out;
}
}
/*
- * If this CPU is the one which updates jiffies, then give up
- * the assignment and let it be taken by the CPU which runs
- * the tick timer next, which might be this CPU as well. If we
- * don't drop this here the jiffies might be stale and
- * do_timer() never invoked. Keep track of the fact that it
- * was the one which had the do_timer() duty last. If this CPU
- * is the one which had the do_timer() duty last, we limit the
- * sleep time to the timekeeping max_deferment value.
+ * If this CPU is the one which had the do_timer() duty last, we limit
+ * the sleep time to the timekeeping max_deferment value.
* Otherwise we can sleep as long as we want.
*/
delta = timekeeping_max_deferment();
- if (cpu == tick_do_timer_cpu) {
- tick_do_timer_cpu = TICK_DO_TIMER_NONE;
- ts->do_timer_last = 1;
- } else if (tick_do_timer_cpu != TICK_DO_TIMER_NONE) {
- delta = KTIME_MAX;
- ts->do_timer_last = 0;
- } else if (!ts->do_timer_last) {
- delta = KTIME_MAX;
+ if (cpu != tick_do_timer_cpu) {
+ if (tick_do_timer_cpu != TICK_DO_TIMER_NONE) {
+ delta = KTIME_MAX;
+ ts->do_timer_last = 0;
+ } else if (!ts->do_timer_last) {
+ delta = KTIME_MAX;
+ }
}
#ifdef CONFIG_NO_HZ_FULL
@@ -750,14 +741,41 @@ static ktime_t tick_nohz_stop_sched_tick
else
expires = KTIME_MAX;
- expires = min_t(u64, expires, next_tick);
- tick = expires;
+ ts->timer_expires = min_t(u64, expires, next_tick);
+
+out:
+ ts->tick_may_stop = 1;
+ return ts->timer_expires;
+}
+
+static void __tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
+{
+ struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
+ u64 basemono = ts->timer_expires_basemono;
+ u64 expires = ts->timer_expires;
+ ktime_t tick = expires;
+
+ /* Make sure we won't be trying to stop it twice in a row. */
+ ts->tick_may_stop = 0;
+
+ /*
+ * If this CPU is the one which updates jiffies, then give up
+ * the assignment and let it be taken by the CPU which runs
+ * the tick timer next, which might be this CPU as well. If we
+ * don't drop this here the jiffies might be stale and
+ * do_timer() never invoked. Keep track of the fact that it
+ * was the one which had the do_timer() duty last.
+ */
+ if (cpu == tick_do_timer_cpu) {
+ tick_do_timer_cpu = TICK_DO_TIMER_NONE;
+ ts->do_timer_last = 1;
+ }
/* Skip reprogram of event if its not changed */
if (ts->tick_stopped && (expires == ts->next_tick)) {
/* Sanity check: make sure clockevent is actually programmed */
if (tick == KTIME_MAX || ts->next_tick == hrtimer_get_expires(&ts->sched_timer))
- goto out;
+ return;
WARN_ON_ONCE(1);
printk_once("basemono: %llu ts->next_tick: %llu dev->next_event: %llu timer->active: %d timer->expires: %llu\n",
@@ -791,7 +809,7 @@ static ktime_t tick_nohz_stop_sched_tick
if (unlikely(expires == KTIME_MAX)) {
if (ts->nohz_mode == NOHZ_MODE_HIGHRES)
hrtimer_cancel(&ts->sched_timer);
- goto out;
+ return;
}
hrtimer_set_expires(&ts->sched_timer, tick);
@@ -800,15 +818,18 @@ static ktime_t tick_nohz_stop_sched_tick
hrtimer_start_expires(&ts->sched_timer, HRTIMER_MODE_ABS_PINNED);
else
tick_program_event(tick, 1);
-out:
- /*
- * Update the estimated sleep length until the next timer
- * (not only the tick).
- */
- ts->sleep_length = ktime_sub(dev->next_event, now);
- return tick;
}
+#ifdef CONFIG_NO_HZ_FULL
+static void tick_nohz_stop_sched_tick(struct tick_sched *ts, int cpu)
+{
+ if (__tick_nohz_next_event(ts, cpu))
+ __tick_nohz_stop_tick(ts, cpu);
+ else
+ ts->tick_may_stop = 0;
+}
+#endif /* CONFIG_NO_HZ_FULL */
+
static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
{
/* Update jiffies first */
@@ -844,7 +865,7 @@ static void tick_nohz_full_update_tick(s
return;
if (can_stop_full_tick(cpu, ts))
- tick_nohz_stop_sched_tick(ts, ktime_get(), cpu);
+ tick_nohz_stop_sched_tick(ts, cpu);
else if (ts->tick_stopped)
tick_nohz_restart_sched_tick(ts, ktime_get());
#endif
@@ -870,10 +891,8 @@ static bool can_stop_idle_tick(int cpu,
return false;
}
- if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE)) {
- ts->sleep_length = NSEC_PER_SEC / HZ;
+ if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
return false;
- }
if (need_resched())
return false;
@@ -913,25 +932,33 @@ static void __tick_nohz_idle_stop_tick(s
ktime_t expires;
int cpu = smp_processor_id();
- if (can_stop_idle_tick(cpu, ts)) {
+ /*
+ * If tick_nohz_get_sleep_length() ran __tick_nohz_next_event(), the
+ * tick timer expiration time is known already.
+ */
+ if (ts->tick_may_stop)
+ expires = ts->timer_expires;
+ else if (can_stop_idle_tick(cpu, ts))
+ expires = __tick_nohz_next_event(ts, cpu);
+ else
+ return;
+
+ ts->idle_calls++;
+
+ if (expires > 0LL) {
int was_stopped = ts->tick_stopped;
- ts->idle_calls++;
+ __tick_nohz_stop_tick(ts, cpu);
- /*
- * The idle entry time should be a sufficient approximation of
- * the current time at this point.
- */
- expires = tick_nohz_stop_sched_tick(ts, ts->idle_entrytime, cpu);
- if (expires > 0LL) {
- ts->idle_sleeps++;
- ts->idle_expires = expires;
- }
+ ts->idle_sleeps++;
+ ts->idle_expires = expires;
if (!was_stopped && ts->tick_stopped) {
ts->idle_jiffies = ts->last_jiffies;
nohz_balance_enter_idle(cpu);
}
+ } else {
+ ts->tick_may_stop = 0;
}
}
@@ -945,6 +972,11 @@ void tick_nohz_idle_stop_tick(void)
__tick_nohz_idle_stop_tick(this_cpu_ptr(&tick_cpu_sched));
}
+void tick_nohz_idle_retain_tick(void)
+{
+ __this_cpu_write(tick_cpu_sched.tick_may_stop, 0);
+}
+
/**
* tick_nohz_idle_enter - prepare for entering idle on the current CPU
*
@@ -957,7 +989,7 @@ void tick_nohz_idle_enter(void)
lockdep_assert_irqs_enabled();
/*
* Update the idle state in the scheduler domain hierarchy
- * when tick_nohz_stop_sched_tick() is called from the idle loop.
+ * when __tick_nohz_stop_tick() is called from the idle loop.
* State will be updated to busy during the first busy tick after
* exiting idle.
*/
@@ -991,15 +1023,29 @@ void tick_nohz_irq_exit(void)
}
/**
- * tick_nohz_get_sleep_length - return the length of the current sleep
+ * tick_nohz_get_sleep_length - return the expected length of the current sleep
*
* Called from power state control code with interrupts disabled
*/
ktime_t tick_nohz_get_sleep_length(void)
{
+ struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
+ int cpu = smp_processor_id();
+ /*
+ * The idle entry time is expected to be a sufficient approximation of
+ * the current time at this point.
+ */
+ ktime_t now = ts->idle_entrytime;
+
+ if (can_stop_idle_tick(cpu, ts)) {
+ ktime_t next_event = __tick_nohz_next_event(ts, cpu);
+
+ if (next_event)
+ return ktime_sub(next_event, now);
+ }
- return ts->sleep_length;
+ return ktime_sub(dev->next_event, now);
}
/**
Index: linux-pm/include/linux/tick.h
===================================================================
--- linux-pm.orig/include/linux/tick.h
+++ linux-pm/include/linux/tick.h
@@ -115,6 +115,7 @@ enum tick_dep_bits {
extern bool tick_nohz_enabled;
extern int tick_nohz_tick_stopped(void);
extern void tick_nohz_idle_stop_tick(void);
+extern void tick_nohz_idle_retain_tick(void);
extern void tick_nohz_idle_enter(void);
extern void tick_nohz_idle_exit(void);
extern void tick_nohz_irq_exit(void);
@@ -127,6 +128,7 @@ extern u64 get_cpu_iowait_time_us(int cp
#define tick_nohz_enabled (0)
static inline int tick_nohz_tick_stopped(void) { return 0; }
static inline void tick_nohz_idle_stop_tick(void) { }
+static inline void tick_nohz_idle_retain_tick(void) { }
static inline void tick_nohz_idle_enter(void) { }
static inline void tick_nohz_idle_exit(void) { }
From: Rafael J. Wysocki <[email protected]>
Prepare the scheduler tick code for reworking the idle loop to
avoid stopping the tick in some cases.
Move away the tick_nohz_start_idle() invocation from
__tick_nohz_idle_enter(), rename the latter to
__tick_nohz_idle_stop_tick() and define tick_nohz_idle_stop_tick()
as a wrapper around it for calling it from the outside.
Make tick_nohz_idle_enter() only call tick_nohz_start_idle() instead
of calling the entire __tick_nohz_idle_enter(), add another wrapper
disabling and enabling interrupts around tick_nohz_idle_stop_tick()
and make the current callers of tick_nohz_idle_enter() call it too
to retain their current functionality.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
arch/x86/xen/smp_pv.c | 1 +
include/linux/tick.h | 9 +++++++++
kernel/sched/idle.c | 1 +
kernel/time/tick-sched.c | 46 +++++++++++++++++++++++++---------------------
4 files changed, 36 insertions(+), 21 deletions(-)
Index: linux-pm/include/linux/tick.h
===================================================================
--- linux-pm.orig/include/linux/tick.h
+++ linux-pm/include/linux/tick.h
@@ -114,6 +114,7 @@ enum tick_dep_bits {
#ifdef CONFIG_NO_HZ_COMMON
extern bool tick_nohz_enabled;
extern int tick_nohz_tick_stopped(void);
+extern void tick_nohz_idle_stop_tick(void);
extern void tick_nohz_idle_enter(void);
extern void tick_nohz_idle_exit(void);
extern void tick_nohz_irq_exit(void);
@@ -125,6 +126,7 @@ extern u64 get_cpu_iowait_time_us(int cp
#else /* !CONFIG_NO_HZ_COMMON */
#define tick_nohz_enabled (0)
static inline int tick_nohz_tick_stopped(void) { return 0; }
+static inline void tick_nohz_idle_stop_tick(void) { }
static inline void tick_nohz_idle_enter(void) { }
static inline void tick_nohz_idle_exit(void) { }
@@ -136,6 +138,13 @@ static inline u64 get_cpu_idle_time_us(i
static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
#endif /* !CONFIG_NO_HZ_COMMON */
+static inline void tick_nohz_idle_stop_tick_protected(void)
+{
+ local_irq_disable();
+ tick_nohz_idle_stop_tick();
+ local_irq_enable();
+}
+
#ifdef CONFIG_NO_HZ_FULL
extern bool tick_nohz_full_running;
extern cpumask_var_t tick_nohz_full_mask;
Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -539,14 +539,11 @@ static void tick_nohz_stop_idle(struct t
sched_clock_idle_wakeup_event();
}
-static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
+static void tick_nohz_start_idle(struct tick_sched *ts)
{
- ktime_t now = ktime_get();
-
- ts->idle_entrytime = now;
+ ts->idle_entrytime = ktime_get();
ts->idle_active = 1;
sched_clock_idle_sleep_event();
- return now;
}
/**
@@ -911,19 +908,21 @@ static bool can_stop_idle_tick(int cpu,
return true;
}
-static void __tick_nohz_idle_enter(struct tick_sched *ts)
+static void __tick_nohz_idle_stop_tick(struct tick_sched *ts)
{
- ktime_t now, expires;
+ ktime_t expires;
int cpu = smp_processor_id();
- now = tick_nohz_start_idle(ts);
-
if (can_stop_idle_tick(cpu, ts)) {
int was_stopped = ts->tick_stopped;
ts->idle_calls++;
- expires = tick_nohz_stop_sched_tick(ts, now, cpu);
+ /*
+ * The idle entry time should be a sufficient approximation of
+ * the current time at this point.
+ */
+ expires = tick_nohz_stop_sched_tick(ts, ts->idle_entrytime, cpu);
if (expires > 0LL) {
ts->idle_sleeps++;
ts->idle_expires = expires;
@@ -937,16 +936,19 @@ static void __tick_nohz_idle_enter(struc
}
/**
- * tick_nohz_idle_enter - stop the idle tick from the idle task
+ * tick_nohz_idle_stop_tick - stop the idle tick from the idle task
*
* When the next event is more than a tick into the future, stop the idle tick
- * Called when we start the idle loop.
- *
- * The arch is responsible of calling:
+ */
+void tick_nohz_idle_stop_tick(void)
+{
+ __tick_nohz_idle_stop_tick(this_cpu_ptr(&tick_cpu_sched));
+}
+
+/**
+ * tick_nohz_idle_enter - prepare for entering idle on the current CPU
*
- * - rcu_idle_enter() after its last use of RCU before the CPU is put
- * to sleep.
- * - rcu_idle_exit() before the first use of RCU after the CPU is woken up.
+ * Called when we start the idle loop.
*/
void tick_nohz_idle_enter(void)
{
@@ -965,7 +967,7 @@ void tick_nohz_idle_enter(void)
ts = this_cpu_ptr(&tick_cpu_sched);
ts->inidle = 1;
- __tick_nohz_idle_enter(ts);
+ tick_nohz_start_idle(ts);
local_irq_enable();
}
@@ -982,10 +984,12 @@ void tick_nohz_irq_exit(void)
{
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
- if (ts->inidle)
- __tick_nohz_idle_enter(ts);
- else
+ if (ts->inidle) {
+ tick_nohz_start_idle(ts);
+ __tick_nohz_idle_stop_tick(ts);
+ } else {
tick_nohz_full_update_tick(ts);
+ }
}
/**
Index: linux-pm/arch/x86/xen/smp_pv.c
===================================================================
--- linux-pm.orig/arch/x86/xen/smp_pv.c
+++ linux-pm/arch/x86/xen/smp_pv.c
@@ -425,6 +425,7 @@ static void xen_pv_play_dead(void) /* us
* data back is to call:
*/
tick_nohz_idle_enter();
+ tick_nohz_idle_stop_tick_protected();
cpuhp_online_idle(CPUHP_AP_ONLINE_IDLE);
}
Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -221,6 +221,7 @@ static void do_idle(void)
__current_set_polling();
tick_nohz_idle_enter();
+ tick_nohz_idle_stop_tick_protected();
while (!need_resched()) {
check_pgt_cache();
From: Rafael J. Wysocki <[email protected]>
If the tick isn't stopped, the target residency of the state selected
by the menu governor may be greater than the actual time to the next
tick and that means lost energy.
To avoid that, make tick_nohz_get_sleep_length() return the current
time to the next event (before stopping the tick) in addition to
the estimated one via an extra pointer argument and make
menu_select() use that value to refine the state selection
when necessary.
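A minimal sketch of that refinement step, assuming a simplified state table (the "toy_" names are illustrative, not the cpuidle structures):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified state table; not the cpuidle_driver/cpuidle_device layout. */
struct toy_cstate {
	unsigned int target_residency_us;
	bool disabled;
};

/*
 * With the tick left running, walk back from the previously chosen state
 * until one is found that is enabled and fits before the next tick.
 */
static int toy_refine_for_tick(const struct toy_cstate *states, int idx,
			       unsigned int tick_us)
{
	while (idx > 0 &&
	       (states[idx].target_residency_us > tick_us ||
		states[idx].disabled))
		idx--;

	return idx;
}
```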
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
drivers/cpuidle/governors/menu.c | 21 ++++++++++++++++++---
include/linux/tick.h | 2 +-
kernel/time/tick-sched.c | 7 +++++--
3 files changed, 24 insertions(+), 6 deletions(-)
Index: linux-pm/include/linux/tick.h
===================================================================
--- linux-pm.orig/include/linux/tick.h
+++ linux-pm/include/linux/tick.h
@@ -119,7 +119,7 @@ extern void tick_nohz_idle_retain_tick(v
extern void tick_nohz_idle_enter(void);
extern void tick_nohz_idle_exit(void);
extern void tick_nohz_irq_exit(void);
-extern ktime_t tick_nohz_get_sleep_length(void);
+extern ktime_t tick_nohz_get_sleep_length(ktime_t *cur_ret);
extern unsigned long tick_nohz_get_idle_calls(void);
extern unsigned long tick_nohz_get_idle_calls_cpu(int cpu);
extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -1024,10 +1024,11 @@ void tick_nohz_irq_exit(void)
/**
* tick_nohz_get_sleep_length - return the expected length of the current sleep
+ * @cur_ret: pointer for returning the current time to the next event
*
* Called from power state control code with interrupts disabled
*/
-ktime_t tick_nohz_get_sleep_length(void)
+ktime_t tick_nohz_get_sleep_length(ktime_t *cur_ret)
{
struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
@@ -1038,6 +1039,8 @@ ktime_t tick_nohz_get_sleep_length(void)
*/
ktime_t now = ts->idle_entrytime;
+ *cur_ret = ktime_sub(dev->next_event, now);
+
if (can_stop_idle_tick(cpu, ts)) {
ktime_t next_event = __tick_nohz_next_event(ts, cpu);
@@ -1045,7 +1048,7 @@ ktime_t tick_nohz_get_sleep_length(void)
return ktime_sub(next_event, now);
}
- return ktime_sub(dev->next_event, now);
+ return *cur_ret;
}
/**
Index: linux-pm/drivers/cpuidle/governors/menu.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/menu.c
+++ linux-pm/drivers/cpuidle/governors/menu.c
@@ -296,6 +296,7 @@ static int menu_select(struct cpuidle_dr
unsigned int expected_interval;
unsigned long nr_iowaiters, cpu_load;
int resume_latency = dev_pm_qos_raw_read_value(device);
+ ktime_t tick_time;
if (data->needs_update) {
menu_update(drv, dev);
@@ -313,7 +314,7 @@ static int menu_select(struct cpuidle_dr
}
/* determine the expected residency time, round up */
- data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length());
+ data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length(&tick_time));
get_iowait_load(&nr_iowaiters, &cpu_load);
data->bucket = which_bucket(data->next_timer_us, nr_iowaiters);
@@ -400,8 +401,22 @@ static int menu_select(struct cpuidle_dr
* within the tick period range that could be used if longer
* idle duration was predicted.
*/
- *nohz_ret = !(first_idx > idx &&
- drv->states[first_idx].target_residency < TICK_USEC_HZ);
+ if (first_idx > idx &&
+ drv->states[first_idx].target_residency < TICK_USEC_HZ) {
+ unsigned int tick_us = ktime_to_us(tick_time);
+
+ /*
+ * Find a state with target residency less than the
+ * time to the next timer event including the tick.
+ */
+ while (idx > 0 &&
+ (drv->states[idx].target_residency > tick_us ||
+ drv->states[idx].disabled ||
+ dev->states_usage[idx].disable))
+ idx--;
+
+ *nohz_ret = false;
+ }
}
data->last_state_idx = idx;
From: Rafael J. Wysocki <[email protected]>
Push the decision whether or not to stop the tick somewhat deeper
into the idle loop.
Stopping the tick upfront leads to unpleasant outcomes in case the
idle governor doesn't agree with the timekeeping code on the duration
of the upcoming idle period. Specifically, if the tick has been
stopped and the idle governor predicts short idle, the situation is
bad regardless of whether or not the prediction is accurate. If it
is accurate, the tick has been stopped unnecessarily, which means
excessive overhead. If it is not accurate, the CPU is likely to
spend too much time in the (shallow, because short idle has been
predicted) idle state selected by the governor [1].
As the first step towards addressing this problem, change the code
to make the tick stopping decision inside the loop in do_idle().
In particular, do not stop the tick in the cpu_idle_poll() code path.
Also, do not stop it in tick_nohz_irq_exit(), which doesn't really have
enough information on whether or not to stop the tick.
Link: https://marc.info/?l=linux-pm&m=150116085925208&w=2 # [1]
Link: https://tu-dresden.de/zih/forschung/ressourcen/dateien/projekte/haec/powernightmares.pdf
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
kernel/sched/idle.c | 8 +++++---
kernel/time/tick-sched.c | 6 ++----
2 files changed, 7 insertions(+), 7 deletions(-)
Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -221,13 +221,13 @@ static void do_idle(void)
__current_set_polling();
tick_nohz_idle_enter();
- tick_nohz_idle_stop_tick_protected();
while (!need_resched()) {
check_pgt_cache();
rmb();
if (cpu_is_offline(cpu)) {
+ tick_nohz_idle_stop_tick_protected();
cpuhp_report_idle_dead();
arch_cpu_idle_dead();
}
@@ -241,10 +241,12 @@ static void do_idle(void)
* broadcast device expired for us, we don't want to go deep
* idle as we know that the IPI is going to arrive right away.
*/
- if (cpu_idle_force_poll || tick_check_broadcast_expired())
+ if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
cpu_idle_poll();
- else
+ } else {
+ tick_nohz_idle_stop_tick();
cpuidle_idle_call();
+ }
arch_cpu_idle_exit();
}
Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -984,12 +984,10 @@ void tick_nohz_irq_exit(void)
{
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
- if (ts->inidle) {
+ if (ts->inidle)
tick_nohz_start_idle(ts);
- __tick_nohz_idle_stop_tick(ts);
- } else {
+ else
tick_nohz_full_update_tick(ts);
- }
}
/**
From: Rafael J. Wysocki <[email protected]>
Add a new pointer argument to cpuidle_select() and to the ->select
cpuidle governor callback to allow a boolean value indicating
whether or not the tick should be stopped before entering the
selected state to be returned from there.
Make the ladder governor ignore that pointer (to preserve its
current behavior) and make the menu governor return "false" through
it if:
(1) the idle exit latency is constrained at 0,
(2) the selected state is a polling one, or
(3) the target residency of the selected state is less than the
length of the tick period and there is at least one deeper
idle state available within the tick period range.
Since the value returned through the new argument pointer is not
used yet, this change is not expected to alter the functionality of
the code.
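Those three conditions can be modeled in a small standalone function (the names and the HZ=1000 tick length are assumptions for the example, not the menu governor's actual code):

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed tick length for the example (HZ=1000); not the kernel macro. */
#define TOY_TICK_US 1000U

/*
 * Model of the three "don't stop the tick" conditions listed above;
 * all parameter names are illustrative.
 */
static bool toy_nohz_hint(unsigned int latency_req_us, bool polling,
			  unsigned int target_residency_us,
			  bool deeper_state_fits_in_tick)
{
	if (latency_req_us == 0)	/* (1) strict zero-latency constraint */
		return false;

	if (polling)			/* (2) polling state selected */
		return false;

	/* (3) shallow state chosen, but a deeper one fits within the tick */
	if (target_residency_us < TOY_TICK_US && deeper_state_fits_in_tick)
		return false;

	return true;
}
```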
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
drivers/cpuidle/cpuidle.c | 10 ++++++++--
drivers/cpuidle/governors/ladder.c | 3 ++-
drivers/cpuidle/governors/menu.c | 32 +++++++++++++++++++++++++++++---
include/linux/cpuidle.h | 6 ++++--
kernel/sched/idle.c | 4 +++-
5 files changed, 46 insertions(+), 9 deletions(-)
Index: linux-pm/include/linux/cpuidle.h
===================================================================
--- linux-pm.orig/include/linux/cpuidle.h
+++ linux-pm/include/linux/cpuidle.h
@@ -131,7 +131,8 @@ extern bool cpuidle_not_available(struct
struct cpuidle_device *dev);
extern int cpuidle_select(struct cpuidle_driver *drv,
- struct cpuidle_device *dev);
+ struct cpuidle_device *dev,
+ bool *nohz_ret);
extern int cpuidle_enter(struct cpuidle_driver *drv,
struct cpuidle_device *dev, int index);
extern void cpuidle_reflect(struct cpuidle_device *dev, int index);
@@ -246,7 +247,8 @@ struct cpuidle_governor {
struct cpuidle_device *dev);
int (*select) (struct cpuidle_driver *drv,
- struct cpuidle_device *dev);
+ struct cpuidle_device *dev,
+ bool *nohz_ret);
void (*reflect) (struct cpuidle_device *dev, int index);
};
Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -188,13 +188,15 @@ static void cpuidle_idle_call(void)
next_state = cpuidle_find_deepest_state(drv, dev);
call_cpuidle(drv, dev, next_state);
} else {
+ bool nohz = true;
+
tick_nohz_idle_stop_tick();
rcu_idle_enter();
/*
* Ask the cpuidle framework to choose a convenient idle state.
*/
- next_state = cpuidle_select(drv, dev);
+ next_state = cpuidle_select(drv, dev, &nohz);
entered_state = call_cpuidle(drv, dev, next_state);
/*
* Give the governor an opportunity to reflect on the outcome
Index: linux-pm/drivers/cpuidle/cpuidle.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/cpuidle.c
+++ linux-pm/drivers/cpuidle/cpuidle.c
@@ -263,12 +263,18 @@ int cpuidle_enter_state(struct cpuidle_d
*
* @drv: the cpuidle driver
* @dev: the cpuidle device
+ * @nohz_ret: indication on whether or not to stop the tick
*
* Returns the index of the idle state. The return value must not be negative.
+ *
+ * The memory location pointed to by @nohz_ret is expected to be written the
+ * 'false' boolean value if the scheduler tick should not be stopped before
+ * entering the returned state.
*/
-int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
+ bool *nohz_ret)
{
- return cpuidle_curr_governor->select(drv, dev);
+ return cpuidle_curr_governor->select(drv, dev, nohz_ret);
}
/**
Index: linux-pm/drivers/cpuidle/governors/ladder.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/ladder.c
+++ linux-pm/drivers/cpuidle/governors/ladder.c
@@ -63,9 +63,10 @@ static inline void ladder_do_selection(s
* ladder_select_state - selects the next state to enter
* @drv: cpuidle driver
* @dev: the CPU
+ * @dummy: not used
*/
static int ladder_select_state(struct cpuidle_driver *drv,
- struct cpuidle_device *dev)
+ struct cpuidle_device *dev, bool *dummy)
{
struct ladder_device *ldev = this_cpu_ptr(&ladder_devices);
struct device *device = get_cpu_device(dev->cpu);
Index: linux-pm/drivers/cpuidle/governors/menu.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/menu.c
+++ linux-pm/drivers/cpuidle/governors/menu.c
@@ -275,12 +275,16 @@ again:
goto again;
}
+#define TICK_USEC_HZ ((USEC_PER_SEC + HZ/2) / HZ)
+
/**
* menu_select - selects the next idle state to enter
* @drv: cpuidle driver containing state data
* @dev: the CPU
+ * @nohz_ret: indication on whether or not to stop the tick
*/
-static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
+ bool *nohz_ret)
{
struct menu_device *data = this_cpu_ptr(&menu_devices);
struct device *device = get_cpu_device(dev->cpu);
@@ -303,8 +307,10 @@ static int menu_select(struct cpuidle_dr
latency_req = resume_latency;
/* Special case when user has set very strict latency requirement */
- if (unlikely(latency_req == 0))
+ if (unlikely(latency_req == 0)) {
+ *nohz_ret = false;
return 0;
+ }
/* determine the expected residency time, round up */
data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length());
@@ -369,7 +375,7 @@ static int menu_select(struct cpuidle_dr
idx = i; /* first enabled state */
if (s->target_residency > data->predicted_us)
break;
- if (s->exit_latency > latency_req)
+ if (s->exit_latency > latency_req)
break;
idx = i;
@@ -378,6 +384,26 @@ static int menu_select(struct cpuidle_dr
if (idx == -1)
idx = 0; /* No states enabled. Must use 0. */
+ if (drv->states[idx].flags & CPUIDLE_FLAG_POLLING) {
+ *nohz_ret = false;
+ } else if (drv->states[idx].target_residency < TICK_USEC_HZ) {
+ first_idx = idx;
+ for (i = idx + 1; i < drv->state_count; i++)
+ if (!drv->states[i].disabled &&
+ !dev->states_usage[i].disable) {
+ first_idx = i;
+ break;
+ }
+
+ /*
+ * Do not stop the tick if there is at least one more state
+ * within the tick period range that could be used if longer
+ * idle duration was predicted.
+ */
+ *nohz_ret = !(first_idx > idx &&
+ drv->states[first_idx].target_residency < TICK_USEC_HZ);
+ }
+
data->last_state_idx = idx;
return data->last_state_idx;
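For illustration, the tick-retention logic in the hunk above can be sketched as standalone C. This is a simplified model, not kernel code: the state struct, the function name, and the fixed TICK_USEC value (4000 us, i.e. HZ=250) are made up for the example.

```c
#include <stdbool.h>

#define TICK_USEC 4000u /* tick period for HZ=250; illustration only */

struct state {
	unsigned int target_residency_us;
	bool polling;
	bool disabled;
};

/* Keep the tick for polling states, and also when the chosen state is
 * shallow and the next enabled state would still fit within the tick
 * period (a longer prediction could have picked it). */
static bool may_stop_tick(const struct state *s, int count, int idx)
{
	if (s[idx].polling)
		return false;

	if (s[idx].target_residency_us >= TICK_USEC)
		return true;

	for (int i = idx + 1; i < count; i++)
		if (!s[i].disabled)
			return s[i].target_residency_us >= TICK_USEC;

	return true; /* no deeper enabled state exists */
}
```

The point of the scan is the safety net: if a deeper state exists that a longer prediction could have selected, the tick is kept running so a misprediction cannot strand the CPU in the shallow state indefinitely.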
From: Rafael J. Wysocki <[email protected]>
Make cpuidle_idle_call() decide whether or not to stop the tick.
First, the cpuidle_enter_s2idle() path deals with the tick (and with
the entire timekeeping for that matter) by itself and it doesn't need
the tick to be stopped beforehand.
Second, to address the issue with short idle duration predictions
by the idle governor after the tick has been stopped, it will be
necessary to change the ordering of cpuidle_select() with respect
to tick_nohz_idle_stop_tick(). To prepare for that, put a
tick_nohz_idle_stop_tick() call in the same branch in which
cpuidle_select() is called.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
kernel/sched/idle.c | 25 ++++++++++++++++++-------
1 file changed, 18 insertions(+), 7 deletions(-)
Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -146,13 +146,15 @@ static void cpuidle_idle_call(void)
}
/*
- * Tell the RCU framework we are entering an idle section,
- * so no more rcu read side critical sections and one more
+ * The RCU framework needs to be told that we are entering an idle
+ * section, so no more rcu read side critical sections and one more
* step to the grace period
*/
- rcu_idle_enter();
if (cpuidle_not_available(drv, dev)) {
+ tick_nohz_idle_stop_tick();
+ rcu_idle_enter();
+
default_idle_call();
goto exit_idle;
}
@@ -169,16 +171,26 @@ static void cpuidle_idle_call(void)
if (idle_should_enter_s2idle() || dev->use_deepest_state) {
if (idle_should_enter_s2idle()) {
+ rcu_idle_enter();
+
entered_state = cpuidle_enter_s2idle(drv, dev);
if (entered_state > 0) {
local_irq_enable();
goto exit_idle;
}
+
+ rcu_idle_exit();
}
+ tick_nohz_idle_stop_tick();
+ rcu_idle_enter();
+
next_state = cpuidle_find_deepest_state(drv, dev);
call_cpuidle(drv, dev, next_state);
} else {
+ tick_nohz_idle_stop_tick();
+ rcu_idle_enter();
+
/*
* Ask the cpuidle framework to choose a convenient idle state.
*/
@@ -241,12 +253,11 @@ static void do_idle(void)
* broadcast device expired for us, we don't want to go deep
* idle as we know that the IPI is going to arrive right away.
*/
- if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
+ if (cpu_idle_force_poll || tick_check_broadcast_expired())
cpu_idle_poll();
- } else {
- tick_nohz_idle_stop_tick();
+ else
cpuidle_idle_call();
- }
+
arch_cpu_idle_exit();
}
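The reordered branch structure of cpuidle_idle_call() can be sketched as a standalone happy-path model. The stub helpers below only record the call order; the helper names mirror the patch, but everything else (signatures, the boolean flags) is illustrative, and error paths such as a failed s2idle entry are omitted.

```c
#include <stdbool.h>
#include <string.h>

/* Call-order log for the stubbed helpers below. */
static char call_log[128];
static void record(const char *name)
{
	strcat(call_log, name);
	strcat(call_log, ";");
}

/* Stubs standing in for the real helpers. */
static void tick_nohz_idle_stop_tick(void) { record("stop_tick"); }
static void rcu_idle_enter(void) { record("rcu_enter"); }
static void default_idle_call(void) { record("default_idle"); }
static void cpuidle_enter_s2idle(void) { record("s2idle"); }
static void enter_selected_state(void) { record("enter_state"); }

/* Happy-path sketch of the branch structure after the patch: every
 * path decides about the tick (and RCU) on its own, and the s2idle
 * path does not stop the tick at all. */
static void cpuidle_idle_call(bool not_available, bool s2idle)
{
	call_log[0] = '\0';
	if (not_available) {
		tick_nohz_idle_stop_tick();
		rcu_idle_enter();
		default_idle_call();
		return;
	}
	if (s2idle) {
		rcu_idle_enter();
		cpuidle_enter_s2idle();
		return;
	}
	tick_nohz_idle_stop_tick();
	rcu_idle_enter();
	enter_selected_state();
}
```

This is exactly what the changelog means by pushing the decision deeper: do_idle() no longer stops the tick up front, so each branch of cpuidle_idle_call() is free to decide.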
On Fri, 2018-03-09 at 10:34 +0100, Rafael J. Wysocki wrote:
> Hi All,
>
> Thanks a lot for the discussion and testing so far!
>
> This is a total respin of the whole series, so please look at it
> afresh.
> Patches 2 and 3 are the most similar to their previous versions, but
> still they are different enough.
This series gives no RCU errors on startup,
and no CPUs seem to be getting stuck any more.
I will run some performance tests with these
patches.
--
All Rights Reversed.
On Fri, 2018-03-09 at 10:34 +0100, Rafael J. Wysocki wrote:
> Hi All,
>
> Thanks a lot for the discussion and testing so far!
>
> This is a total respin of the whole series, so please look at it afresh.
> Patches 2 and 3 are the most similar to their previous versions, but
> still they are different enough.
Respin of testdrive...
i4790 booted nopti nospectre_v2
30 sec tbench
4.16.0.g1b88acc-master (virgin)
Throughput 559.279 MB/sec 1 clients 1 procs max_latency=0.046 ms
Throughput 997.119 MB/sec 2 clients 2 procs max_latency=0.246 ms
Throughput 1693.04 MB/sec 4 clients 4 procs max_latency=4.309 ms
Throughput 3597.2 MB/sec 8 clients 8 procs max_latency=6.760 ms
Throughput 3474.55 MB/sec 16 clients 16 procs max_latency=6.743 ms
4.16.0.g1b88acc-master (+ v2)
Throughput 588.929 MB/sec 1 clients 1 procs max_latency=0.291 ms
Throughput 1080.93 MB/sec 2 clients 2 procs max_latency=0.639 ms
Throughput 1826.3 MB/sec 4 clients 4 procs max_latency=0.647 ms
Throughput 3561.01 MB/sec 8 clients 8 procs max_latency=1.279 ms
Throughput 3382.98 MB/sec 16 clients 16 procs max_latency=4.817 ms
4.16.0.g1b88acc-master (+ v3)
Throughput 588.711 MB/sec 1 clients 1 procs max_latency=0.067 ms
Throughput 1077.71 MB/sec 2 clients 2 procs max_latency=0.298 ms
Throughput 1803.47 MB/sec 4 clients 4 procs max_latency=0.667 ms
Throughput 3591.4 MB/sec 8 clients 8 procs max_latency=4.999 ms
Throughput 3444.74 MB/sec 16 clients 16 procs max_latency=1.995 ms
4.16.0.g1b88acc-master (+ my local patches)
Throughput 722.559 MB/sec 1 clients 1 procs max_latency=0.087 ms
Throughput 1208.59 MB/sec 2 clients 2 procs max_latency=0.289 ms
Throughput 2071.94 MB/sec 4 clients 4 procs max_latency=0.654 ms
Throughput 3784.91 MB/sec 8 clients 8 procs max_latency=0.974 ms
Throughput 3644.4 MB/sec 16 clients 16 procs max_latency=5.620 ms
turbostat -q -- firefox /root/tmp/video/BigBuckBunny-DivXPlusHD.mkv & sleep 300;killall firefox
PkgWatt
1 2 3
4.16.0.g1b88acc-master 6.95 7.03 6.91 (virgin)
4.16.0.g1b88acc-master 7.20 7.25 7.26 (+v2)
4.16.0.g1b88acc-master 7.04 6.97 7.07 (+v3)
4.16.0.g1b88acc-master 6.90 7.06 6.95 (+my patches)
No change wrt nohz high frequency cross core scheduling overhead, but
the light load power consumption oddity did go away.
(btw, don't read anything into max_latency numbers, that's GUI noise)
-Mike
On 2018.03.09 07:19 Rik van Riel wrote:
> On Fri, 2018-03-09 at 10:34 +0100, Rafael J. Wysocki wrote:
>> Hi All,
>>
>> Thanks a lot for the discussion and testing so far!
>>
>> This is a total respin of the whole series, so please look at it
>> afresh.
>> Patches 2 and 3 are the most similar to their previous versions, but
>> still they are different enough.
>
> This series gives no RCU errors on startup,
> and no CPUs seem to be getting stuck any more.
Confirmed on my test server. Boot is normal and no other errors, so far.
Part 1: Idle test:
I was able to repeat Mike's higher power issue under very light load,
well no load in my case, with V2.
V3 is much better.
A one hour trace on my very idle server was 22 times smaller with V3
than with V2, mainly because idle state 4 no longer exits and re-enters
every tick for long periods of time.
Disclaimer: From past experience, 1 hour is not nearly long enough
for this test. Issues tend to come in bunches, sometimes many hours
apart.
V2:
Idle State 4: Entries: 1359560
CPU: 0: Entries: 125305
CPU: 1: Entries: 62489
CPU: 2: Entries: 10203
CPU: 3: Entries: 108107
CPU: 4: Entries: 19915
CPU: 5: Entries: 430253
CPU: 6: Entries: 564650
CPU: 7: Entries: 38638
V3:
Idle State 4: Entries: 64505
CPU: 0: Entries: 13060
CPU: 1: Entries: 5266
CPU: 2: Entries: 15744
CPU: 3: Entries: 5574
CPU: 4: Entries: 8425
CPU: 5: Entries: 6270
CPU: 6: Entries: 5592
CPU: 7: Entries: 4574
Kernel 4.16-rc4:
Idle State 4: Entries: 61390
CPU: 0: Entries: 9529
CPU: 1: Entries: 10556
CPU: 2: Entries: 5478
CPU: 3: Entries: 5991
CPU: 4: Entries: 3686
CPU: 5: Entries: 7610
CPU: 6: Entries: 11074
CPU: 7: Entries: 7466
With apologies to those that do not like the term "PowerNightmares",
it has become very ingrained in my tools:
V2:
1 hour idle Summary:
Idle State 0: Total Entries: 113 : PowerNightmares: 56 : Not PN time (seconds): 0.001224 : PN time: 65.543239 : Ratio: 53548.397792
Idle State 1: Total Entries: 1015 : PowerNightmares: 42 : Not PN time (seconds): 0.053986 : PN time: 21.054470 : Ratio: 389.998703
Idle State 2: Total Entries: 1382 : PowerNightmares: 17 : Not PN time (seconds): 0.728686 : PN time: 6.046906 : Ratio: 8.298370
Idle State 3: Total Entries: 113 : PowerNightmares: 13 : Not PN time (seconds): 0.069055 : PN time: 6.021458 : Ratio: 87.198002
V3:
1 hour idle Summary: Average processor package power 3.78 watts
Idle State 0: Total Entries: 134 : PowerNightmares: 109 : Not PN time (seconds): 0.000477 : PN time: 144.719723 : Ratio: 303395.646541
Idle State 1: Total Entries: 1104 : PowerNightmares: 84 : Not PN time (seconds): 0.052639 : PN time: 74.639142 : Ratio: 1417.943768
Idle State 2: Total Entries: 968 : PowerNightmares: 141 : Not PN time (seconds): 0.325953 : PN time: 128.235137 : Ratio: 393.416035
Idle State 3: Total Entries: 295 : PowerNightmares: 103 : Not PN time (seconds): 0.164884 : PN time: 97.159421 : Ratio: 589.259243
Kernel 4.16-rc4: Average processor package power (excluding a few minutes of abnormal power) 3.70 watts.
1 hour idle Summary:
Idle State 0: Total Entries: 168 : PowerNightmares: 59 : Not PN time (seconds): 0.001323 : PN time: 81.802197 : Ratio: 61830.836545
Idle State 1: Total Entries: 1669 : PowerNightmares: 78 : Not PN time (seconds): 0.022003 : PN time: 37.477413 : Ratio: 1703.286509
Idle State 2: Total Entries: 1447 : PowerNightmares: 30 : Not PN time (seconds): 0.502672 : PN time: 0.789344 : Ratio: 1.570296
Idle State 3: Total Entries: 176 : PowerNightmares: 0 : Not PN time (seconds): 0.259425 : PN time: 0.000000 : Ratio: 0.000000
Part 2: 100% load on one CPU test. Test duration 4 hours
V3: Summary: Average processor package power 26.75 watts
Idle State 0: Total Entries: 10039 : PowerNightmares: 7186 : Not PN time (seconds): 0.067477 : PN time: 6215.220295 : Ratio: 92108.722903
Idle State 1: Total Entries: 17268 : PowerNightmares: 195 : Not PN time (seconds): 0.213049 : PN time: 55.905323 : Ratio: 262.405939
Idle State 2: Total Entries: 5858 : PowerNightmares: 676 : Not PN time (seconds): 2.578006 : PN time: 167.282069 : Ratio: 64.888161
Idle State 3: Total Entries: 1500 : PowerNightmares: 488 : Not PN time (seconds): 0.772463 : PN time: 125.514015 : Ratio: 162.485472
Kernel 4.16-rc4: Summary: Average processor package power 27.41 watts
Idle State 0: Total Entries: 9096 : PowerNightmares: 6540 : Not PN time (seconds): 0.051532 : PN time: 7886.309553 : Ratio: 153037.133492
Idle State 1: Total Entries: 28731 : PowerNightmares: 215 : Not PN time (seconds): 0.211999 : PN time: 77.395467 : Ratio: 365.074679
Idle State 2: Total Entries: 4474 : PowerNightmares: 97 : Not PN time (seconds): 1.959059 : PN time: 0.874112 : Ratio: 0.446190
Idle State 3: Total Entries: 2319 : PowerNightmares: 0 : Not PN time (seconds): 1.663376 : PN time: 0.000000 : Ratio: 0.000000
Graph of package power versus time: http://fast.smythies.com/rjwv3_100.png
... Doug
On Saturday, March 10, 2018 8:41:39 AM CET Doug Smythies wrote:
> On 2018.03.09 07:19 Rik van Riel wrote:
> > On Fri, 2018-03-09 at 10:34 +0100, Rafael J. Wysocki wrote:
> >> Hi All,
> >>
> >> Thanks a lot for the discussion and testing so far!
> >>
> >> This is a total respin of the whole series, so please look at it
> >> afresh.
> >> Patches 2 and 3 are the most similar to their previous versions, but
> >> still they are different enough.
> >
> > This series gives no RCU errors on startup,
> > and no CPUs seem to be getting stuck any more.
>
> Confirmed on my test server. Boot is normal and no other errors, so far.
Thanks for testing, much appreciated!
> Part 1: Idle test:
>
> I was able to repeat Mike's higher power issue under very light load,
> well no load in my case, with V2.
>
> V3 is much better.
>
> A one hour trace on my very idle server was 22 times smaller with V3
> than V2, and mainly due to idle state 4 not exiting and re-entering
> every tick time for great periods of time.
>
> Disclaimer: From past experience, 1 hour is not nearly long enough
> for this test. Issues tend to come in bunches, sometimes many hours
> apart.
>
> V2:
> Idle State 4: Entries: 1359560
> CPU: 0: Entries: 125305
> CPU: 1: Entries: 62489
> CPU: 2: Entries: 10203
> CPU: 3: Entries: 108107
> CPU: 4: Entries: 19915
> CPU: 5: Entries: 430253
> CPU: 6: Entries: 564650
> CPU: 7: Entries: 38638
>
> V3:
> Idle State 4: Entries: 64505
> CPU: 0: Entries: 13060
> CPU: 1: Entries: 5266
> CPU: 2: Entries: 15744
> CPU: 3: Entries: 5574
> CPU: 4: Entries: 8425
> CPU: 5: Entries: 6270
> CPU: 6: Entries: 5592
> CPU: 7: Entries: 4574
>
> Kernel 4.16-rc4:
> Idle State 4: Entries: 61390
> CPU: 0: Entries: 9529
> CPU: 1: Entries: 10556
> CPU: 2: Entries: 5478
> CPU: 3: Entries: 5991
> CPU: 4: Entries: 3686
> CPU: 5: Entries: 7610
> CPU: 6: Entries: 11074
> CPU: 7: Entries: 7466
>
> With apologies to those that do not like the term "PowerNightmares",
OK, and what exactly do you count as "PowerNightmares"?
> it has become very ingrained in my tools:
>
> V2:
> 1 hour idle Summary:
>
> Idle State 0: Total Entries: 113 : PowerNightmares: 56 : Not PN time (seconds): 0.001224 : PN time: 65.543239 : Ratio: 53548.397792
> Idle State 1: Total Entries: 1015 : PowerNightmares: 42 : Not PN time (seconds): 0.053986 : PN time: 21.054470 : Ratio: 389.998703
> Idle State 2: Total Entries: 1382 : PowerNightmares: 17 : Not PN time (seconds): 0.728686 : PN time: 6.046906 : Ratio: 8.298370
> Idle State 3: Total Entries: 113 : PowerNightmares: 13 : Not PN time (seconds): 0.069055 : PN time: 6.021458 : Ratio: 87.198002
The V2 had a serious bug, please discard it entirely.
>
> V3:
> 1 hour idle Summary: Average processor package power 3.78 watts
>
> Idle State 0: Total Entries: 134 : PowerNightmares: 109 : Not PN time (seconds): 0.000477 : PN time: 144.719723 : Ratio: 303395.646541
> Idle State 1: Total Entries: 1104 : PowerNightmares: 84 : Not PN time (seconds): 0.052639 : PN time: 74.639142 : Ratio: 1417.943768
> Idle State 2: Total Entries: 968 : PowerNightmares: 141 : Not PN time (seconds): 0.325953 : PN time: 128.235137 : Ratio: 393.416035
> Idle State 3: Total Entries: 295 : PowerNightmares: 103 : Not PN time (seconds): 0.164884 : PN time: 97.159421 : Ratio: 589.259243
>
> Kernel 4.16-rc4: Average processor package power (excluding a few minutes of abnormal power) 3.70 watts.
> 1 hour idle Summary:
>
> Idle State 0: Total Entries: 168 : PowerNightmares: 59 : Not PN time (seconds): 0.001323 : PN time: 81.802197 : Ratio: 61830.836545
> Idle State 1: Total Entries: 1669 : PowerNightmares: 78 : Not PN time (seconds): 0.022003 : PN time: 37.477413 : Ratio: 1703.286509
> Idle State 2: Total Entries: 1447 : PowerNightmares: 30 : Not PN time (seconds): 0.502672 : PN time: 0.789344 : Ratio: 1.570296
> Idle State 3: Total Entries: 176 : PowerNightmares: 0 : Not PN time (seconds): 0.259425 : PN time: 0.000000 : Ratio: 0.000000
>
> Part 2: 100% load on one CPU test. Test duration 4 hours
>
> V3: Summary: Average processor package power 26.75 watts
>
> Idle State 0: Total Entries: 10039 : PowerNightmares: 7186 : Not PN time (seconds): 0.067477 : PN time: 6215.220295 : Ratio: 92108.722903
> Idle State 1: Total Entries: 17268 : PowerNightmares: 195 : Not PN time (seconds): 0.213049 : PN time: 55.905323 : Ratio: 262.405939
> Idle State 2: Total Entries: 5858 : PowerNightmares: 676 : Not PN time (seconds): 2.578006 : PN time: 167.282069 : Ratio: 64.888161
> Idle State 3: Total Entries: 1500 : PowerNightmares: 488 : Not PN time (seconds): 0.772463 : PN time: 125.514015 : Ratio: 162.485472
>
> Kernel 4.16-rc4: Summary: Average processor package power 27.41 watts
>
> Idle State 0: Total Entries: 9096 : PowerNightmares: 6540 : Not PN time (seconds): 0.051532 : PN time: 7886.309553 : Ratio: 153037.133492
> Idle State 1: Total Entries: 28731 : PowerNightmares: 215 : Not PN time (seconds): 0.211999 : PN time: 77.395467 : Ratio: 365.074679
> Idle State 2: Total Entries: 4474 : PowerNightmares: 97 : Not PN time (seconds): 1.959059 : PN time: 0.874112 : Ratio: 0.446190
> Idle State 3: Total Entries: 2319 : PowerNightmares: 0 : Not PN time (seconds): 1.663376 : PN time: 0.000000 : Ratio: 0.000000
>
> Graph of package power verses time: http://fast.smythies.com/rjwv3_100.png
The graph actually shows an improvement to my eyes, as the blue line is quite
consistently above the red one except for a few regions (and I don't really
understand the drop in the blue line by the end of the test window).
Thanks!
On Saturday, March 10, 2018 6:01:31 AM CET Mike Galbraith wrote:
> On Fri, 2018-03-09 at 10:34 +0100, Rafael J. Wysocki wrote:
> > Hi All,
> >
> > Thanks a lot for the discussion and testing so far!
> >
> > This is a total respin of the whole series, so please look at it afresh.
> > Patches 2 and 3 are the most similar to their previous versions, but
> > still they are different enough.
>
> Respin of testdrive...
Appreciated, thanks!
> i4790 booted nopti nospectre_v2
>
> 30 sec tbench
> 4.16.0.g1b88acc-master (virgin)
> Throughput 559.279 MB/sec 1 clients 1 procs max_latency=0.046 ms
> Throughput 997.119 MB/sec 2 clients 2 procs max_latency=0.246 ms
> Throughput 1693.04 MB/sec 4 clients 4 procs max_latency=4.309 ms
> Throughput 3597.2 MB/sec 8 clients 8 procs max_latency=6.760 ms
> Throughput 3474.55 MB/sec 16 clients 16 procs max_latency=6.743 ms
>
> 4.16.0.g1b88acc-master (+ v2)
> Throughput 588.929 MB/sec 1 clients 1 procs max_latency=0.291 ms
> Throughput 1080.93 MB/sec 2 clients 2 procs max_latency=0.639 ms
> Throughput 1826.3 MB/sec 4 clients 4 procs max_latency=0.647 ms
> Throughput 3561.01 MB/sec 8 clients 8 procs max_latency=1.279 ms
> Throughput 3382.98 MB/sec 16 clients 16 procs max_latency=4.817 ms
>
> 4.16.0.g1b88acc-master (+ v3)
> Throughput 588.711 MB/sec 1 clients 1 procs max_latency=0.067 ms
> Throughput 1077.71 MB/sec 2 clients 2 procs max_latency=0.298 ms
This is a bit better than "raw". Around 8-9% I'd say.
> Throughput 1803.47 MB/sec 4 clients 4 procs max_latency=0.667 ms
This one is too, but not as much.
> Throughput 3591.4 MB/sec 8 clients 8 procs max_latency=4.999 ms
> Throughput 3444.74 MB/sec 16 clients 16 procs max_latency=1.995 ms
And these are slightly worse, but just slightly.
> 4.16.0.g1b88acc-master (+ my local patches)
> Throughput 722.559 MB/sec 1 clients 1 procs max_latency=0.087 ms
> Throughput 1208.59 MB/sec 2 clients 2 procs max_latency=0.289 ms
> Throughput 2071.94 MB/sec 4 clients 4 procs max_latency=0.654 ms
> Throughput 3784.91 MB/sec 8 clients 8 procs max_latency=0.974 ms
> Throughput 3644.4 MB/sec 16 clients 16 procs max_latency=5.620 ms
>
> turbostat -q -- firefox /root/tmp/video/BigBuckBunny-DivXPlusHD.mkv & sleep 300;killall firefox
>
> PkgWatt
> 1 2 3
> 4.16.0.g1b88acc-master 6.95 7.03 6.91 (virgin)
> 4.16.0.g1b88acc-master 7.20 7.25 7.26 (+v2)
> 4.16.0.g1b88acc-master 7.04 6.97 7.07 (+v3)
> 4.16.0.g1b88acc-master 6.90 7.06 6.95 (+my patches)
>
> No change wrt nohz high frequency cross core scheduling overhead, but
> the light load power consumption oddity did go away.
OK
> (btw, don't read anything into max_latency numbers, that's GUI noise)
I see.
Thanks!
On 2018.03.10 01:00 Rafael J. Wysocki wrote:
> On Saturday, March 10, 2018 8:41:39 AM CET Doug Smythies wrote:
>>
>> With apologies to those that do not like the term "PowerNightmares",
>
> OK, and what exactly do you count as "PowerNightmares"?
I'll snip some below and then explain:
...[snip]...
>>
>> Kernel 4.16-rc4: Summary: Average processor package power 27.41 watts
>>
>> Idle State 0: Total Entries: 9096 : PowerNightmares: 6540 : Not PN time (seconds): 0.051532 : PN time: 7886.309553 : Ratio: 153037.133492
>> Idle State 1: Total Entries: 28731 : PowerNightmares: 215 : Not PN time (seconds): 0.211999 : PN time: 77.395467 : Ratio: 365.074679
>> Idle State 2: Total Entries: 4474 : PowerNightmares: 97 : Not PN time (seconds): 1.959059 : PN time: 0.874112 : Ratio: 0.446190
>> Idle State 3: Total Entries: 2319 : PowerNightmares: 0 : Not PN time (seconds): 1.663376 : PN time: 0.000000 : Ratio: 0.000000
O.K. let's go deeper than the summary, and focus on idle state 0, which has been my area of interest in this saga.
Idle State 0:
CPU: 0: Entries: 2093 : PowerNightmares: 1136 : Not PN time (seconds): 0.024840 : PN time: 1874.417840 : Ratio: 75459.655439
CPU: 1: Entries: 1051 : PowerNightmares: 721 : Not PN time (seconds): 0.004668 : PN time: 198.845193 : Ratio: 42597.513425
CPU: 2: Entries: 759 : PowerNightmares: 583 : Not PN time (seconds): 0.003299 : PN time: 1099.069256 : Ratio: 333152.246028
CPU: 3: Entries: 1033 : PowerNightmares: 1008 : Not PN time (seconds): 0.000361 : PN time: 1930.340683 : Ratio: 5347203.995237
CPU: 4: Entries: 1310 : PowerNightmares: 1025 : Not PN time (seconds): 0.006214 : PN time: 1332.497114 : Ratio: 214434.682950
CPU: 5: Entries: 1097 : PowerNightmares: 848 : Not PN time (seconds): 0.005029 : PN time: 785.366864 : Ratio: 156167.601340
CPU: 6: Entries: 1753 : PowerNightmares: 1219 : Not PN time (seconds): 0.007121 : PN time: 665.772603 : Ratio: 93494.256958
Note: CPU 7 is busy and doesn't go into idle at all.
And also look at the histograms of the times spent in idle state 0:
CPU 3 might be the most interesting:
Idle State: 0 CPU: 3:
4 1
5 3
7 2
11 1
12 1
13 2
14 3
15 3
17 4
18 1
19 2
28 2
7563 1
8012 1
9999 1006
Where:
Column 1 is the time in microseconds spent in that idle state, capped
at 9999 microseconds (that bucket includes anything longer).
Column 2 is the number of occurrences of that time.
Notice that 1008 times out of the 1033, it spent an excessive amount
of time in idle state 0, leading to excessive power consumption.
I adopted Thomas Ilsche's "Powernightmare" term for this several
months ago.
This CPU 3 example was pretty clear, but sometimes it is not so
obvious. I admit that my thresholds for whether or not something is a
"powernightmare" are somewhat arbitrary, and I'll change them to
whatever anybody wants. Currently:
#define THRESHOLD_0 100 /* Idle state 0 PowerNightmare threshold in microseconds */
#define THRESHOLD_1 1000 /* Idle state 1 PowerNightmare threshold in microseconds */
#define THRESHOLD_2 2000 /* Idle state 2 PowerNightmare threshold in microseconds */
#define THRESHOLD_3 4000 /* Idle state 3 PowerNightmare threshold in microseconds */
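For what it's worth, the accounting described here could be sketched roughly like this. This is a hypothetical standalone model of the classification, not Doug's actual tool: the struct and function names are made up, and only the threshold values come from the message above.

```c
/* Per-state PowerNightmare thresholds quoted above, in microseconds,
 * indexed by idle state number. */
static const unsigned int threshold_us[4] = { 100, 1000, 2000, 4000 };

struct pn_stats {
	unsigned long entries;
	unsigned long nightmares;
	double pn_time_s;     /* total time spent in PN residencies */
	double not_pn_time_s; /* total time spent in normal residencies */
};

/* Classify one idle-state residency: anything longer than the state's
 * threshold counts as a PowerNightmare, and its time is accumulated
 * separately from the well-behaved residencies. */
static void account(struct pn_stats *st, int state, unsigned int residency_us)
{
	st->entries++;
	if (residency_us > threshold_us[state]) {
		st->nightmares++;
		st->pn_time_s += residency_us / 1e6;
	} else {
		st->not_pn_time_s += residency_us / 1e6;
	}
}
```

The "Ratio" column in the summaries above would then simply be pn_time_s divided by not_pn_time_s.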
Let's have a look at another example from the same test run:
Idle State 1:
CPU: 0: Entries: 3104 : PowerNightmares: 41 : Not PN time (seconds): 0.012196 : PN time: 10.841577 : Ratio: 888.945312
CPU: 1: Entries: 2637 : PowerNightmares: 40 : Not PN time (seconds): 0.013135 : PN time: 11.334686 : Ratio: 862.937649
CPU: 2: Entries: 1618 : PowerNightmares: 41 : Not PN time (seconds): 0.008351 : PN time: 10.193641 : Ratio: 1220.649147
CPU: 3: Entries: 10180 : PowerNightmares: 31 : Not PN time (seconds): 0.087596 : PN time: 14.748787 : Ratio: 168.372836
CPU: 4: Entries: 3878 : PowerNightmares: 22 : Not PN time (seconds): 0.040360 : PN time: 14.207233 : Ratio: 352.012710
CPU: 5: Entries: 3658 : PowerNightmares: 1 : Not PN time (seconds): 0.031188 : PN time: 0.604176 : Ratio: 19.372066
CPU: 6: Entries: 3656 : PowerNightmares: 39 : Not PN time (seconds): 0.019173 : PN time: 15.465367 : Ratio: 806.622179
Idle State: 1 CPU: 2:
0 230
1 566
2 161
3 86
4 61
5 13
6 32
7 37
8 42
9 41
10 4
11 41
12 38
13 24
14 27
15 26
16 5
17 21
18 16
19 17
20 15
21 1
22 12
23 17
24 16
25 11
26 2
27 5
28 5
29 3
35 1
47 1
1733 1
1850 1
2027 1
3929 1
9999 37
The 41 "PowerNightmares" out of 1618 seem correct to me.
Even if someone claims that the threshold should have been >3929 uSec,
there are still 37 "PowerNightmares".
>>>
>> Graph of package power verses time: http://fast.smythies.com/rjwv3_100.png
>
> The graph actually shows an improvement to my eyes, as the blue line is quite
> consistently above the red one except for a few regions (and I don't really
> understand the drop in the blue line by the end of the test window).
Agreed, it shows improvement, as does the average package power.
The roughly 22 minute drop in the reference test, unmodified kernel 4.16-rc4,
the blue line, is something that has been driving me crazy since the start
of this work and is the reason my tests run for so long (even 4 hours is short).
It all has to do with subtle timing changes of events. Sometimes events line up
in such a way that there are lots of these "powernightmares" and sometimes
events line up in such a way that there are not. These conditions can persist for
hours at a time. By "events" I mean ones that ask for a resched upon exit and ones
that don't (I think; I am going from memory here and haven't found my notes).
I have the trace data and can look in more detail if you want.
... Doug
On Saturday, March 10, 2018 5:07:36 PM CET Doug Smythies wrote:
> On 2018.03.10 01:00 Rafael J. Wysocki wrote:
> > On Saturday, March 10, 2018 8:41:39 AM CET Doug Smythies wrote:
> >>
> >> With apologies to those that do not like the term "PowerNightmares",
> >
> > OK, and what exactly do you count as "PowerNightmares"?
>
> I'll snip some below and then explain:
>
> ...[snip]...
>
> >>
> >> Kernel 4.16-rc4: Summary: Average processor package power 27.41 watts
> >>
> >> Idle State 0: Total Entries: 9096 : PowerNightmares: 6540 : Not PN time (seconds): 0.051532 : PN time: 7886.309553 : Ratio: 153037.133492
> >> Idle State 1: Total Entries: 28731 : PowerNightmares: 215 : Not PN time (seconds): 0.211999 : PN time: 77.395467 : Ratio: 365.074679
> >> Idle State 2: Total Entries: 4474 : PowerNightmares: 97 : Not PN time (seconds): 1.959059 : PN time: 0.874112 : Ratio: 0.446190
> >> Idle State 3: Total Entries: 2319 : PowerNightmares: 0 : Not PN time (seconds): 1.663376 : PN time: 0.000000 : Ratio: 0.000000
>
> O.K. let's go deeper than the summary, and focus on idle state 0, which has been my area of interest in this saga.
>
> Idle State 0:
> CPU: 0: Entries: 2093 : PowerNightmares: 1136 : Not PN time (seconds): 0.024840 : PN time: 1874.417840 : Ratio: 75459.655439
> CPU: 1: Entries: 1051 : PowerNightmares: 721 : Not PN time (seconds): 0.004668 : PN time: 198.845193 : Ratio: 42597.513425
> CPU: 2: Entries: 759 : PowerNightmares: 583 : Not PN time (seconds): 0.003299 : PN time: 1099.069256 : Ratio: 333152.246028
> CPU: 3: Entries: 1033 : PowerNightmares: 1008 : Not PN time (seconds): 0.000361 : PN time: 1930.340683 : Ratio: 5347203.995237
> CPU: 4: Entries: 1310 : PowerNightmares: 1025 : Not PN time (seconds): 0.006214 : PN time: 1332.497114 : Ratio: 214434.682950
> CPU: 5: Entries: 1097 : PowerNightmares: 848 : Not PN time (seconds): 0.005029 : PN time: 785.366864 : Ratio: 156167.601340
> CPU: 6: Entries: 1753 : PowerNightmares: 1219 : Not PN time (seconds): 0.007121 : PN time: 665.772603 : Ratio: 93494.256958
>
> Note: CPU 7 is busy and doesn't go into idle at all.
>
> And also look at the histograms of the times spent in idle state 0:
> CPU 3 might be the most interesting:
>
> Idle State: 0 CPU: 3:
> 4 1
> 5 3
> 7 2
> 11 1
> 12 1
> 13 2
> 14 3
> 15 3
> 17 4
> 18 1
> 19 2
> 28 2
> 7563 1
> 8012 1
> 9999 1006
>
> Where:
> Column 1 is the time in microseconds it was in that idle state
> up to 9999 microseconds, which includes anything more.
> Column 2 is the number of occurrences of that time.
>
> Notice that 1008 times out of the 1033, it spent an excessive amount
> of time in idle state 0, leading to excessive power consumption.
> I adopted Thomas Ilsche's "Powernightmare" term for this several
> months ago.
>
> This CPU 3 example was pretty clear, but sometimes it is not so
> obvious. I admit that my thresholds for is it a "powernigthmare" or
> not are somewhat arbitrary, and I'll change them to whatever anybody
> wants. Currently:
>
> #define THRESHOLD_0 100 /* Idle state 0 PowerNightmare threshold in microseconds */
> #define THRESHOLD_1 1000 /* Idle state 1 PowerNightmare threshold in microseconds */
> #define THRESHOLD_2 2000 /* Idle state 2 PowerNightmare threshold in microseconds */
> #define THRESHOLD_3 4000 /* Idle state 3 PowerNightmare threshold in microseconds */
That's clear, thanks!
Well, the main reason why I have a problem with the "powernightmare" term
is that it is somewhat arbitrary overall. After all, you ended up having to
explain what you meant in detail even though you had used it in the
previous message, so it doesn't appear all that useful to me. :-)
Also, the current work isn't even concerned about idle times below the
length of the tick period, so the information that some CPUs spent over
100 us in state 0 a certain number of times during the test is not that
relevant here. The information that they often spend more than a tick
period in state 0 in one go *is* relevant, though.
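For reference, the tick period being compared against depends on the kernel's HZ setting; the rounded conversion (the same arithmetic as the TICK_USEC_HZ macro used in the series) is:

```c
/* Tick period in microseconds for a given HZ, rounded to nearest,
 * mirroring (USEC_PER_SEC + HZ/2) / HZ from the series. */
static unsigned int tick_period_us(unsigned int hz)
{
	return (1000000u + hz / 2) / hz;
}
```

With a common HZ=250 configuration this gives 4000 us, so the 9999 us histogram bucket above is well beyond the tick period on such a kernel.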
The $subject patch series is about two things: adding a safety net for
possible governor mispredictions using the existing tick infrastructure
on the one hand, and avoiding unnecessary timer manipulation overhead
related to stopping and starting the tick on the other. Of course, the
safety net will not improve the accuracy of governor predictions; it can
only reduce their impact.
That said, it doesn't catch one case which turns out to be quite
significant: when the tick is stopped already and the governor predicts
a short idle duration. That, again, may cause the CPU to spend a long
time in a shallow idle state, which then will qualify as a
"powernightmare" I suppose. If I'm reading your data correctly, that is
the main reason for the majority of cases in which CPUs spend 9999 us
or more in state 0 on your system.
That issue can be dealt with in a couple of ways and the patch below is
a rather straightforward attempt to do that. The idea, basically, is to
discard the result of the governor's prediction if the tick has been
stopped already and the predicted idle duration is within the tick
range.
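That rule can be sketched as a small standalone helper (illustration only; the function name and the fixed TICK_USEC value are assumptions, and in the real governor the time to the next tick comes from tick_nohz_get_sleep_length()):

```c
#include <stdbool.h>

#define TICK_USEC 4000u /* tick period for HZ=250; illustration only */

/* If the tick is already stopped and the prediction is shorter than
 * the tick period, don't trust it: assume the CPU will stay idle
 * until the point where the (stopped) tick would have fired, unless
 * that point is really soon anyway. */
static unsigned int adjust_prediction(bool tick_stopped,
				      unsigned int predicted_us,
				      unsigned int time_to_tick_us)
{
	if (tick_stopped && predicted_us < TICK_USEC)
		return TICK_USEC < time_to_tick_us ? TICK_USEC : time_to_tick_us;
	return predicted_us;
}
```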
Please try it on top of the v3 and tell me if you see an improvement.
Thanks!
---
From: Rafael J. Wysocki <[email protected]>
Subject: [PATCH] cpuidle: menu: Avoid selecting shallow states with stopped tick
If the scheduler tick has been stopped already and the governor
selects a shallow idle state, the CPU can spend a long time in that
state if the selection is based on an inaccurate prediction of idle
period duration. That effect turns out to occur relatively often,
so it needs to be mitigated.
To that end, modify the menu governor to discard the result of the
idle period duration prediction if it is less than the tick period
duration and the tick is stopped, unless the tick timer is going to
expire soon.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
drivers/cpuidle/governors/menu.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
Index: linux-pm/drivers/cpuidle/governors/menu.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/menu.c
+++ linux-pm/drivers/cpuidle/governors/menu.c
@@ -297,6 +297,7 @@ static int menu_select(struct cpuidle_dr
unsigned long nr_iowaiters, cpu_load;
int resume_latency = dev_pm_qos_raw_read_value(device);
ktime_t tick_time;
+ unsigned int tick_us;
if (data->needs_update) {
menu_update(drv, dev);
@@ -315,6 +316,7 @@ static int menu_select(struct cpuidle_dr
/* determine the expected residency time, round up */
data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length(&tick_time));
+ tick_us = ktime_to_us(tick_time);
get_iowait_load(&nr_iowaiters, &cpu_load);
data->bucket = which_bucket(data->next_timer_us, nr_iowaiters);
@@ -354,6 +356,17 @@ static int menu_select(struct cpuidle_dr
data->predicted_us = min(data->predicted_us, expected_interval);
/*
+ * If the tick is already stopped, the cost of possible misprediction is
+ * much higher, because the CPU may be stuck in a shallow idle state for
+ * a long time as a result of it. For this reason, if that happens, say
+ * we might mispredict and try to force the CPU into a state for which
+ * we would have stopped the tick, unless the tick timer is going to
+ * expire really soon anyway.
+ */
+ if (tick_nohz_tick_stopped() && data->predicted_us < TICK_USEC_HZ)
+ data->predicted_us = min_t(unsigned int, TICK_USEC_HZ, tick_us);
+
+ /*
* Use the performance multiplier and the user-configurable
* latency_req to determine the maximum exit latency.
*/
@@ -403,8 +416,6 @@ static int menu_select(struct cpuidle_dr
*/
if (first_idx > idx &&
drv->states[first_idx].target_residency < TICK_USEC_HZ) {
- unsigned int tick_us = ktime_to_us(tick_time);
-
/*
* Find a state with target residency less than the
* time to the next timer event including the tick.
On Fri, Mar 09, 2018 at 10:46:55AM +0100, Rafael J. Wysocki wrote:
> --- linux-pm.orig/kernel/time/tick-sched.h
> +++ linux-pm/kernel/time/tick-sched.h
> @@ -30,6 +30,7 @@ enum tick_nohz_mode {
> * when the CPU returns from nohz sleep.
> * @next_tick: Next tick to be fired when in dynticks mode.
> * @tick_stopped: Indicator that the idle tick has been stopped
> + * @tick_may_stop: Indicator that the idle tick may be stopped shortly
Perhaps we can set timer_expires to 0 instead when we want to invalidate
the last value?
> * @idle_jiffies: jiffies at the entry to idle for idle time accounting
> * @idle_calls: Total number of idle calls
> * @idle_sleeps: Number of idle calls, where the sched tick was stopped
> @@ -38,7 +39,6 @@ enum tick_nohz_mode {
> * @idle_exittime: Time when the idle state was left
> * @idle_sleeptime: Sum of the time slept in idle with sched tick stopped
> * @iowait_sleeptime: Sum of the time slept in idle with sched tick stopped, with IO outstanding
> - * @sleep_length: Duration of the current idle sleep
> * @do_timer_lst: CPU was the last one doing do_timer before going idle
> */
> struct tick_sched {
> @@ -49,6 +49,7 @@ struct tick_sched {
> ktime_t next_tick;
> int inidle;
> int tick_stopped;
> + int tick_may_stop;
> unsigned long idle_jiffies;
> unsigned long idle_calls;
> unsigned long idle_sleeps;
> @@ -58,8 +59,9 @@ struct tick_sched {
> ktime_t idle_exittime;
> ktime_t idle_sleeptime;
> ktime_t iowait_sleeptime;
> - ktime_t sleep_length;
> unsigned long last_jiffies;
> + u64 timer_expires;
> + u64 timer_expires_basemono;
We may need documentation for the above fields too.
> Index: linux-pm/kernel/time/tick-sched.c
> ===================================================================
> --- linux-pm.orig/kernel/time/tick-sched.c
> +++ linux-pm/kernel/time/tick-sched.c
> @@ -652,13 +652,10 @@ static inline bool local_timer_softirq_p
> return local_softirq_pending() & TIMER_SOFTIRQ;
> }
>
> -static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
> - ktime_t now, int cpu)
> +static ktime_t __tick_nohz_next_event(struct tick_sched *ts, int cpu)
Since we don't seem to have a lower level version, can we remove the underscores?
> {
> - struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
> u64 basemono, next_tick, next_tmr, next_rcu, delta, expires;
> unsigned long seq, basejiff;
> - ktime_t tick;
>
> /* Read jiffies and the time when jiffies were updated last */
> do {
[...]
> +static void __tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
(Same comment here about the underscores).
> +{
> + struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
> + u64 basemono = ts->timer_expires_basemono;
> + u64 expires = ts->timer_expires;
> + ktime_t tick = expires;
> +
> + /* Make sure we won't be trying to stop it twice in a row. */
> + ts->tick_may_stop = 0;
> +
> + /*
> + * If this CPU is the one which updates jiffies, then give up
> + * the assignment and let it be taken by the CPU which runs
> + * the tick timer next, which might be this CPU as well. If we
> + * don't drop this here the jiffies might be stale and
> + * do_timer() never invoked. Keep track of the fact that it
> + * was the one which had the do_timer() duty last.
> + */
> + if (cpu == tick_do_timer_cpu) {
> + tick_do_timer_cpu = TICK_DO_TIMER_NONE;
> + ts->do_timer_last = 1;
> + }
>
> /* Skip reprogram of event if its not changed */
> if (ts->tick_stopped && (expires == ts->next_tick)) {
> /* Sanity check: make sure clockevent is actually programmed */
> if (tick == KTIME_MAX || ts->next_tick == hrtimer_get_expires(&ts->sched_timer))
> - goto out;
> + return;
>
> WARN_ON_ONCE(1);
> printk_once("basemono: %llu ts->next_tick: %llu dev->next_event: %llu timer->active: %d timer->expires: %llu\n",
[...]
> +void tick_nohz_idle_retain_tick(void)
> +{
> + __this_cpu_write(tick_cpu_sched.tick_may_stop, 0);
> +}
> +
So, I've become overly paranoid about cached expirations in nohz code; we've
run into bugs before that took months to debug. It seems that the cached value
shouldn't leak in any way there; still, can we have checks such as this in
tick_nohz_idle_enter/exit()?
WARN_ON_ONCE(__this_cpu_read(tick_cpu_sched.tick_may_stop));
Otherwise a leaking cached expiration may mislead further nohz tick stop and
bypass calls to tick_nohz_next_event().
Also let's make sure we never call tick_nohz_get_sleep_length() outside idle:
WARN_ON_ONCE(!__this_cpu_read(tick_cpu_sched.inidle));
Thanks!
On 2018.03.10 15:55 Rafael J. Wysocki wrote:
>On Saturday, March 10, 2018 5:07:36 PM CET Doug Smythies wrote:
>> On 2018.03.10 01:00 Rafael J. Wysocki wrote:
>
... [snip] ...
> The information that they often spend more time than a tick
> period in state 0 in one go *is* relevant, though.
>
>
> That issue can be dealt with in a couple of ways and the patch below is a
> rather straightforward attempt to do that. The idea, basically, is to discard
> the result of governor prediction if the tick has been stopped already and
> the predicted idle duration is within the tick range.
>
> Please try it on top of the v3 and tell me if you see an improvement.
It seems pretty good so far.
See a new line added to the previous graph, "rjwv3plus".
http://fast.smythies.com/rjwv3plus_100.png
I'll do another 100% load on one CPU test overnight, this time with
a trace.
... Doug
On Sunday, March 11, 2018 8:43:02 AM CET Doug Smythies wrote:
> On 2018.03.10 15:55 Rafael J. Wysocki wrote:
> >On Saturday, March 10, 2018 5:07:36 PM CET Doug Smythies wrote:
> >> On 2018.03.10 01:00 Rafael J. Wysocki wrote:
> >
> ... [snip] ...
>
> > The information that they often spend more time than a tick
> > period in state 0 in one go *is* relevant, though.
> >
> >
> > That issue can be dealt with in a couple of ways and the patch below is a
> > rather straightforward attempt to do that. The idea, basically, is to discard
> > the result of governor prediction if the tick has been stopped already and
> > the predicted idle duration is within the tick range.
> >
> > Please try it on top of the v3 and tell me if you see an improvement.
>
> It seems pretty good so far.
> See a new line added to the previous graph, "rjwv3plus".
>
> http://fast.smythies.com/rjwv3plus_100.png
OK, cool!
Below is a respin of the last patch which also prevents shallow states from
being chosen due to interactivity_req when the tick is stopped.
You may also add a poll_idle() fix I've just posted:
https://patchwork.kernel.org/patch/10274595/
on top of this. It makes quite a bit of a difference for me. :-)
> I'll do another 100% load on one CPU test overnight, this time with
> a trace.
Thanks!
On Sunday, March 11, 2018 2:44:37 AM CET Frederic Weisbecker wrote:
> On Fri, Mar 09, 2018 at 10:46:55AM +0100, Rafael J. Wysocki wrote:
> > --- linux-pm.orig/kernel/time/tick-sched.h
> > +++ linux-pm/kernel/time/tick-sched.h
> > @@ -30,6 +30,7 @@ enum tick_nohz_mode {
> > * when the CPU returns from nohz sleep.
> > * @next_tick: Next tick to be fired when in dynticks mode.
> > * @tick_stopped: Indicator that the idle tick has been stopped
> > + * @tick_may_stop: Indicator that the idle tick may be stopped shortly
>
> Perhaps we can set timer_expires to 0 instead when we want to invalidate
> the last value?
timer_expires can be 0 for other reasons.
I can use timer_expires_basemono for that (I actually did that in one version
of my patches, but it looked odd and I decided to use a new field for clarity)
if you prefer to avoid adding an extra field.
> > * @idle_jiffies: jiffies at the entry to idle for idle time accounting
> > * @idle_calls: Total number of idle calls
> > * @idle_sleeps: Number of idle calls, where the sched tick was stopped
> > @@ -38,7 +39,6 @@ enum tick_nohz_mode {
> > * @idle_exittime: Time when the idle state was left
> > * @idle_sleeptime: Sum of the time slept in idle with sched tick stopped
> > * @iowait_sleeptime: Sum of the time slept in idle with sched tick stopped, with IO outstanding
> > - * @sleep_length: Duration of the current idle sleep
> > * @do_timer_lst: CPU was the last one doing do_timer before going idle
> > */
> > struct tick_sched {
> > @@ -49,6 +49,7 @@ struct tick_sched {
> > ktime_t next_tick;
> > int inidle;
> > int tick_stopped;
> > + int tick_may_stop;
> > unsigned long idle_jiffies;
> > unsigned long idle_calls;
> > unsigned long idle_sleeps;
> > @@ -58,8 +59,9 @@ struct tick_sched {
> > ktime_t idle_exittime;
> > ktime_t idle_sleeptime;
> > ktime_t iowait_sleeptime;
> > - ktime_t sleep_length;
> > unsigned long last_jiffies;
> > + u64 timer_expires;
> > + u64 timer_expires_basemono;
>
> We may need documentation for the above fields too.
OK
> > Index: linux-pm/kernel/time/tick-sched.c
> > ===================================================================
> > --- linux-pm.orig/kernel/time/tick-sched.c
> > +++ linux-pm/kernel/time/tick-sched.c
> > @@ -652,13 +652,10 @@ static inline bool local_timer_softirq_p
> > return local_softirq_pending() & TIMER_SOFTIRQ;
> > }
> >
> > -static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
> > - ktime_t now, int cpu)
> > +static ktime_t __tick_nohz_next_event(struct tick_sched *ts, int cpu)
>
> Since we don't seem to have a lower level version, can we remove the underscores?
Sure, will do.
> > {
> > - struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
> > u64 basemono, next_tick, next_tmr, next_rcu, delta, expires;
> > unsigned long seq, basejiff;
> > - ktime_t tick;
> >
> > /* Read jiffies and the time when jiffies were updated last */
> > do {
> [...]
>
>
> > +static void __tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
>
> (Same comment here about the underscores).
Yup.
> > +{
> > + struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
> > + u64 basemono = ts->timer_expires_basemono;
> > + u64 expires = ts->timer_expires;
> > + ktime_t tick = expires;
> > +
> > + /* Make sure we won't be trying to stop it twice in a row. */
> > + ts->tick_may_stop = 0;
> > +
> > + /*
> > + * If this CPU is the one which updates jiffies, then give up
> > + * the assignment and let it be taken by the CPU which runs
> > + * the tick timer next, which might be this CPU as well. If we
> > + * don't drop this here the jiffies might be stale and
> > + * do_timer() never invoked. Keep track of the fact that it
> > + * was the one which had the do_timer() duty last.
> > + */
> > + if (cpu == tick_do_timer_cpu) {
> > + tick_do_timer_cpu = TICK_DO_TIMER_NONE;
> > + ts->do_timer_last = 1;
> > + }
> >
> > /* Skip reprogram of event if its not changed */
> > if (ts->tick_stopped && (expires == ts->next_tick)) {
> > /* Sanity check: make sure clockevent is actually programmed */
> > if (tick == KTIME_MAX || ts->next_tick == hrtimer_get_expires(&ts->sched_timer))
> > - goto out;
> > + return;
> >
> > WARN_ON_ONCE(1);
> > printk_once("basemono: %llu ts->next_tick: %llu dev->next_event: %llu timer->active: %d timer->expires: %llu\n",
> [...]
> > +void tick_nohz_idle_retain_tick(void)
> > +{
> > + __this_cpu_write(tick_cpu_sched.tick_may_stop, 0);
> > +}
> > +
>
> So, I've become overly paranoid about cached expirations in nohz code; we've
> run into bugs before that took months to debug. It seems that the cached value
> shouldn't leak in any way there; still, can we have checks such as this in
> tick_nohz_idle_enter/exit()?
>
> WARN_ON_ONCE(__this_cpu_read(tick_cpu_sched.tick_may_stop));
OK
> Otherwise a leaking cached expiration may mislead further nohz tick stop and
> bypass calls to tick_nohz_next_event().
>
> Also let's make sure we never call tick_nohz_get_sleep_length() outside idle:
>
> WARN_ON_ONCE(!__this_cpu_read(tick_cpu_sched.inidle));
OK
I'll respin the series with the above comments addressed early next week.
Thanks!
On Sunday, March 11, 2018 11:21:36 AM CET Rafael J. Wysocki wrote:
> On Sunday, March 11, 2018 8:43:02 AM CET Doug Smythies wrote:
> > On 2018.03.10 15:55 Rafael J. Wysocki wrote:
> > >On Saturday, March 10, 2018 5:07:36 PM CET Doug Smythies wrote:
> > >> On 2018.03.10 01:00 Rafael J. Wysocki wrote:
> > >
> > ... [snip] ...
> >
> > > The information that they often spend more time than a tick
> > > period in state 0 in one go *is* relevant, though.
> > >
> > >
> > > That issue can be dealt with in a couple of ways and the patch below is a
> > > rather straightforward attempt to do that. The idea, basically, is to discard
> > > the result of governor prediction if the tick has been stopped already and
> > > the predicted idle duration is within the tick range.
> > >
> > > Please try it on top of the v3 and tell me if you see an improvement.
> >
> > It seems pretty good so far.
> > See a new line added to the previous graph, "rjwv3plus".
> >
> > http://fast.smythies.com/rjwv3plus_100.png
>
> OK, cool!
>
> Below is a respin of the last patch which also prevents shallow states from
> being chosen due to interactivity_req when the tick is stopped.
Actually appending the patch this time, sorry.
> You may also add a poll_idle() fix I've just posted:
>
> https://patchwork.kernel.org/patch/10274595/
>
> on top of this. It makes quite a bit of a difference for me. :-)
>
> > I'll do another 100% load on one CPU test overnight, this time with
> > a trace.
>
> Thanks!
---
From: Rafael J. Wysocki <[email protected]>
Subject: [PATCH] cpuidle: menu: Avoid selecting shallow states with stopped tick
If the scheduler tick has been stopped already and the governor
selects a shallow idle state, the CPU can spend a long time in that
state if the selection is based on an inaccurate prediction of idle
period duration. That effect turns out to occur relatively often,
so it needs to be mitigated.
To that end, modify the menu governor to discard the result of the
idle period duration prediction if it is less than the tick period
duration and the tick is stopped, unless the tick timer is going to
expire soon.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
drivers/cpuidle/governors/menu.c | 26 +++++++++++++++++++-------
1 file changed, 19 insertions(+), 7 deletions(-)
Index: linux-pm/drivers/cpuidle/governors/menu.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/menu.c
+++ linux-pm/drivers/cpuidle/governors/menu.c
@@ -297,6 +297,7 @@ static int menu_select(struct cpuidle_dr
unsigned long nr_iowaiters, cpu_load;
int resume_latency = dev_pm_qos_raw_read_value(device);
ktime_t tick_time;
+ unsigned int tick_us;
if (data->needs_update) {
menu_update(drv, dev);
@@ -315,6 +316,7 @@ static int menu_select(struct cpuidle_dr
/* determine the expected residency time, round up */
data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length(&tick_time));
+ tick_us = ktime_to_us(tick_time);
get_iowait_load(&nr_iowaiters, &cpu_load);
data->bucket = which_bucket(data->next_timer_us, nr_iowaiters);
@@ -354,12 +356,24 @@ static int menu_select(struct cpuidle_dr
data->predicted_us = min(data->predicted_us, expected_interval);
/*
- * Use the performance multiplier and the user-configurable
- * latency_req to determine the maximum exit latency.
+ * If the tick is already stopped, the cost of possible misprediction is
+ * much higher, because the CPU may be stuck in a shallow idle state for
+ * a long time as a result of it. For this reason, if that happens, say
+ * we might mispredict and try to force the CPU into a state for which
+ * we would have stopped the tick, unless the tick timer is going to
+ * expire really soon anyway.
*/
- interactivity_req = data->predicted_us / performance_multiplier(nr_iowaiters, cpu_load);
- if (latency_req > interactivity_req)
- latency_req = interactivity_req;
+ if (tick_nohz_tick_stopped() && data->predicted_us < TICK_USEC_HZ) {
+ data->predicted_us = min_t(unsigned int, TICK_USEC_HZ, tick_us);
+ } else {
+ /*
+ * Use the performance multiplier and the user-configurable
+ * latency_req to determine the maximum exit latency.
+ */
+ interactivity_req = data->predicted_us / performance_multiplier(nr_iowaiters, cpu_load);
+ if (latency_req > interactivity_req)
+ latency_req = interactivity_req;
+ }
/*
* Find the idle state with the lowest power while satisfying
@@ -403,8 +417,6 @@ static int menu_select(struct cpuidle_dr
*/
if (first_idx > idx &&
drv->states[first_idx].target_residency < TICK_USEC_HZ) {
- unsigned int tick_us = ktime_to_us(tick_time);
-
/*
* Find a state with target residency less than the
* time to the next timer event including the tick.
On 2018.03.11 03:22 Rafael J. Wysocki wrote:
> On Sunday, March 11, 2018 8:43:02 AM CET Doug Smythies wrote:
>> On 2018.03.10 15:55 Rafael J. Wysocki wrote:
>>>On Saturday, March 10, 2018 5:07:36 PM CET Doug Smythies wrote:
>>>> On 2018.03.10 01:00 Rafael J. Wysocki wrote:
>>>
>> ... [snip] ...
>>
>>>> The information that they often spend more time than a tick
>>>> period in state 0 in one go *is* relevant, though.
>>>
>>>
>>> That issue can be dealt with in a couple of ways and the patch below is a
>>> rather straightforward attempt to do that. The idea, basically, is to discard
>>> the result of governor prediction if the tick has been stopped already and
>>> the predicted idle duration is within the tick range.
>>>
>>> Please try it on top of the v3 and tell me if you see an improvement.
>>
>> It seems pretty good so far.
>> See a new line added to the previous graph, "rjwv3plus".
>>
>> http://fast.smythies.com/rjwv3plus_100.png
>
> OK, cool!
>
> Below is a respin of the last patch which also prevents shallow states from
> being chosen due to interactivity_req when the tick is stopped.
>
> You may also add a poll_idle() fix I've just posted:
>
> https://patchwork.kernel.org/patch/10274595/
>
> on top of this. It makes quite a bit of a difference for me. :-)
I will add and test, but I already know from testing previous versions
of this patch, from Rik van Riel and myself, that the results will be
awesome.
>
>> I'll do another 100% load on one CPU test overnight, this time with
>> a trace.
The only thing I'll add from the 7-hour overnight test with trace is that
there were 0 occurrences of excessive times spent in idle states above 0.
The histograms show those idle states being almost entirely limited to
one tick period (I am using a 1000 Hz kernel). Exceptions:
Idle State: 3 CPU: 0: 1 occurrence of 1790 uSec (which is O.K. anyhow)
Idle State: 3 CPU: 6: 1 occurrence of 2372 uSec (which is O.K. anyhow)
... Doug
On 2018.03.11 08:52 Doug Smythies wrote:
> On 2018.03.11 03:22 Rafael J. Wysocki wrote:
>> On Sunday, March 11, 2018 8:43:02 AM CET Doug Smythies wrote:
>>> On 2018.03.10 15:55 Rafael J. Wysocki wrote:
>>>> On Saturday, March 10, 2018 5:07:36 PM CET Doug Smythies wrote:
>>>>> On 2018.03.10 01:00 Rafael J. Wysocki wrote:
>>>
>>> ... [snip] ...
>>>
>>>> The information that they often spend more time than a tick
>>>> period in state 0 in one go *is* relevant, though.
>>>>
>>>>
>>>> That issue can be dealt with in a couple of ways and the patch below is a
>>>> rather straightforward attempt to do that. The idea, basically, is to discard
>>>> the result of governor prediction if the tick has been stopped already and
>>>> the predicted idle duration is within the tick range.
>>>
>>>> Please try it on top of the v3 and tell me if you see an improvement.
>>>
>>> It seems pretty good so far.
>>> See a new line added to the previous graph, "rjwv3plus".
>>>
>>> http://fast.smythies.com/rjwv3plus_100.png
>>
>> OK, cool!
>>
>> Below is a respin of the last patch which also prevents shallow states from
>> being chosen due to interactivity_req when the tick is stopped.
>>
>> You may also add a poll_idle() fix I've just posted:
>>
>> https://patchwork.kernel.org/patch/10274595/
>>
>> on top of this. It makes quite a bit of a difference for me. :-)
>
> I will add and test, but I already know from testing previous versions
> of this patch, from Rik van Riel and myself, that the results will be
> awesome.
And the results are indeed awesome.
A four-hour 100% load on one CPU test was run, with trace; however,
there is nothing to report, as everything is great.
The same graph as the last couple of days, with a new line added for
V3 + the respin of patch 7 of 6 + the poll-idle fix, called rjwv3pp,
is here:
http://fast.smythies.com/rjwv3pp_100.png
... Doug
On Monday, March 12, 2018 12:02:14 AM CET Doug Smythies wrote:
> On 2018.03.11 08:52 Doug Smythies wrote:
> > On 2018.03.11 03:22 Rafael J. Wysocki wrote:
> >> On Sunday, March 11, 2018 8:43:02 AM CET Doug Smythies wrote:
> >>> On 2018.03.10 15:55 Rafael J. Wysocki wrote:
> >>>> On Saturday, March 10, 2018 5:07:36 PM CET Doug Smythies wrote:
> >>>>> On 2018.03.10 01:00 Rafael J. Wysocki wrote:
> >>>
> >>> ... [snip] ...
> >>>
> >>>> The information that they often spend more time than a tick
> >>>> period in state 0 in one go *is* relevant, though.
> >>>>
> >>>>
> >>>> That issue can be dealt with in a couple of ways and the patch below is a
> >>>> rather straightforward attempt to do that. The idea, basically, is to discard
> >>>> the result of governor prediction if the tick has been stopped already and
> >>>> the predicted idle duration is within the tick range.
> >>>
> >>>> Please try it on top of the v3 and tell me if you see an improvement.
> >>>
> >>> It seems pretty good so far.
> >>> See a new line added to the previous graph, "rjwv3plus".
> >>>
> >>> http://fast.smythies.com/rjwv3plus_100.png
> >>
> >> OK, cool!
> >>
> >> Below is a respin of the last patch which also prevents shallow states from
> >> being chosen due to interactivity_req when the tick is stopped.
> >>
> >> You may also add a poll_idle() fix I've just posted:
> >>
> >> https://patchwork.kernel.org/patch/10274595/
> >>
> >> on top of this. It makes quite a bit of a difference for me. :-)
> >
> > I will add and test, but I already know from testing previous versions
> > of this patch, from Rik van Riel and myself, that the results will be
> > awesome.
>
> And the results are indeed awesome.
>
> A four-hour 100% load on one CPU test was run, with trace; however,
> there is nothing to report, as everything is great.
>
> The same graph as the last couple of days, with a new line added for
> V3 + the respin of patch 7 of 6 + the poll-idle fix, called rjwv3pp,
> is here:
>
> http://fast.smythies.com/rjwv3pp_100.png
That looks great, thank you!