2018-03-06 09:18:25

by Rafael J. Wysocki

[permalink] [raw]
Subject: [RFC/RFT][PATCH v2 0/6] sched/cpuidle: Idle loop rework

Hi All,

Thanks a lot for the discussion so far!

Here's a new version of the series addressing some comments from the
discussion and (most importantly) replacing patches 4 and 5 with another
(simpler) patch.

The summary below still applies:

On Sunday, March 4, 2018 11:21:30 PM CET Rafael J. Wysocki wrote:
>
> The problem is that if we stop the sched tick in
> tick_nohz_idle_enter() and then the idle governor predicts short idle
> duration, we lose regardless of whether or not it is right.
>
> If it is right, we've lost already, because we stopped the tick
> unnecessarily. If it is not right, we'll lose going forward, because
> the idle state selected by the governor is going to be too shallow and
> we'll draw too much power (that has been reported recently to actually
> happen often enough for people to care).
>
> This patch series is an attempt to improve the situation and the idea
> here is to make the decision whether or not to stop the tick deeper in
> the idle loop and in particular after running the idle state selection
> in the path where the idle governor is invoked. This way the problem
> can be avoided, because the idle duration predicted by the idle governor
> can be used to decide whether or not to stop the tick so that the tick
> is only stopped if that value is large enough (and, consequently, the
> idle state selected by the governor is deep enough).
>
> The series tires to avoid adding too much new code, rather reorder the
> existing code and make it more fine-grained.
>
> Patch 1 prepares the tick-sched code for the subsequent modifications and it
> doesn't change the code's functionality (at least not intentionally).
>
> Patch 2 starts pushing the tick stopping decision deeper into the idle
> loop, but it is limited to do_idle() and tick_nohz_irq_exit().
>
> Patch 3 makes cpuidle_idle_call() decide whether or not to stop the tick
> and sets the stage for the changes in patch 6.

Patch 4 adds a bool pointer argument to cpuidle_select() and the ->select
governor callback allowing them to return a "nohz" hint on whether or not to
stop the tick to the caller.

Patch 5 reorders the idle state selection with respect to the stopping of the
tick and causes the additional "nohz" hint from cpuidle_select() to be used
for deciding whether or not to stop the tick.

Patch 6 cleans up the code to avoid running one piece of it twice in a row
in some cases.

And the two paragraphs below still apply:

> I have tested these patches on a couple of machines, including the very laptop
> I'm sending them from, without any obvious issues, but please give them a go
> if you can, especially if you have an easy way to reproduce the problem they
> are targeting. The patches are on top of 4.16-rc3 (if you need a git branch
> with them for easier testing, please let me know).
>
> The above said, this is just RFC, so no pets close to the machines running it,
> please, and I'm kind of expecting Peter and Thomas to tear it into pieces. :-)

Thanks,
Rafael



2018-03-06 09:13:50

by Rafael J. Wysocki

[permalink] [raw]
Subject: [RFC/RFT][PATCH v2 6/6] time: tick-sched: Avoid running the same code twice in a row

From: Rafael J. Wysocki <[email protected]>

To avoid running the same piece of code twice in a row, move the
tick stopping part of __tick_nohz_next_event() into a new function
called __tick_nohz_stop_tick() and invoke them both separately.

Make __tick_nohz_idle_enter() avoid calling __tick_nohz_next_event()
if it has been called already by tick_nohz_get_sleep_length() and
use the new next_idle_tick field in struct tick_sched to pass the
next event time value between tick_nohz_get_sleep_length() and
__tick_nohz_idle_enter().

Signed-off-by: Rafael J. Wysocki <[email protected]>
---

-> v2: No changes.

---
kernel/time/tick-sched.c | 130 ++++++++++++++++++++++++++---------------------
kernel/time/tick-sched.h | 1
2 files changed, 73 insertions(+), 58 deletions(-)

Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -655,13 +655,10 @@ static inline bool local_timer_softirq_p
return local_softirq_pending() & TIMER_SOFTIRQ;
}

-static ktime_t __tick_nohz_next_event(struct tick_sched *ts, int cpu,
- bool stop_tick)
+static ktime_t __tick_nohz_next_event(struct tick_sched *ts, int cpu)
{
- struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
u64 basemono, next_tick, next_tmr, next_rcu, delta, expires;
unsigned long seq, basejiff;
- ktime_t tick;

/* Read jiffies and the time when jiffies were updated last */
do {
@@ -714,34 +711,23 @@ static ktime_t __tick_nohz_next_event(st
* We've not stopped the tick yet, and there's a timer in the
* next period, so no point in stopping it either, bail.
*/
- if (!ts->tick_stopped) {
- tick = 0;
- goto out;
- }
+ if (!ts->tick_stopped)
+ return 0;
}

/*
- * If this CPU is the one which updates jiffies, then give up
- * the assignment and let it be taken by the CPU which runs
- * the tick timer next, which might be this CPU as well. If we
- * don't drop this here the jiffies might be stale and
- * do_timer() never invoked. Keep track of the fact that it
- * was the one which had the do_timer() duty last. If this CPU
- * is the one which had the do_timer() duty last, we limit the
- * sleep time to the timekeeping max_deferment value.
+ * If this CPU is the one which had the do_timer() duty last, we limit
+ * the sleep time to the timekeeping max_deferment value.
* Otherwise we can sleep as long as we want.
*/
delta = timekeeping_max_deferment();
- if (cpu == tick_do_timer_cpu) {
- if (stop_tick) {
- tick_do_timer_cpu = TICK_DO_TIMER_NONE;
- ts->do_timer_last = 1;
+ if (cpu != tick_do_timer_cpu) {
+ if (tick_do_timer_cpu != TICK_DO_TIMER_NONE) {
+ delta = KTIME_MAX;
+ ts->do_timer_last = 0;
+ } else if (!ts->do_timer_last) {
+ delta = KTIME_MAX;
}
- } else if (tick_do_timer_cpu != TICK_DO_TIMER_NONE) {
- delta = KTIME_MAX;
- ts->do_timer_last = 0;
- } else if (!ts->do_timer_last) {
- delta = KTIME_MAX;
}

#ifdef CONFIG_NO_HZ_FULL
@@ -756,24 +742,37 @@ static ktime_t __tick_nohz_next_event(st
else
expires = KTIME_MAX;

- expires = min_t(u64, expires, next_tick);
- tick = expires;
+ ts->next_idle_tick = min_t(u64, expires, next_tick);
+ return ts->next_idle_tick;
+}

- if (!stop_tick) {
- /* Undo the effect of get_next_timer_interrupt(). */
- timer_clear_idle();
- goto out;
+static void __tick_nohz_stop_tick(struct tick_sched *ts, int cpu, u64 expires)
+{
+ struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
+ ktime_t tick = expires;
+
+ /*
+ * If this CPU is the one which updates jiffies, then give up
+ * the assignment and let it be taken by the CPU which runs
+ * the tick timer next, which might be this CPU as well. If we
+ * don't drop this here the jiffies might be stale and
+ * do_timer() never invoked. Keep track of the fact that it
+ * was the one which had the do_timer() duty last.
+ */
+ if (cpu == tick_do_timer_cpu) {
+ tick_do_timer_cpu = TICK_DO_TIMER_NONE;
+ ts->do_timer_last = 1;
}

/* Skip reprogram of event if its not changed */
if (ts->tick_stopped && (expires == ts->next_tick)) {
/* Sanity check: make sure clockevent is actually programmed */
if (tick == KTIME_MAX || ts->next_tick == hrtimer_get_expires(&ts->sched_timer))
- goto out;
+ return;

WARN_ON_ONCE(1);
printk_once("basemono: %llu ts->next_tick: %llu dev->next_event: %llu timer->active: %d timer->expires: %llu\n",
- basemono, ts->next_tick, dev->next_event,
+ ts->last_jiffies_update, ts->next_tick, dev->next_event,
hrtimer_active(&ts->sched_timer), hrtimer_get_expires(&ts->sched_timer));
}

@@ -803,7 +802,7 @@ static ktime_t __tick_nohz_next_event(st
if (unlikely(expires == KTIME_MAX)) {
if (ts->nohz_mode == NOHZ_MODE_HIGHRES)
hrtimer_cancel(&ts->sched_timer);
- goto out;
+ return;
}

hrtimer_set_expires(&ts->sched_timer, tick);
@@ -812,14 +811,18 @@ static ktime_t __tick_nohz_next_event(st
hrtimer_start_expires(&ts->sched_timer, HRTIMER_MODE_ABS_PINNED);
else
tick_program_event(tick, 1);
-out:
- return tick;
}

-static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts, int cpu)
+#ifdef CONFIG_NO_HZ_FULL
+static void tick_nohz_stop_sched_tick(struct tick_sched *ts, int cpu)
{
- return __tick_nohz_next_event(ts, cpu, true);
+ u64 next_tick;
+
+ next_tick = __tick_nohz_next_event(ts, cpu);
+ if (next_tick)
+ __tick_nohz_stop_tick(ts, cpu, next_tick);
}
+#endif /* CONFIG_NO_HZ_FULL */

static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
{
@@ -921,32 +924,43 @@ static bool can_stop_idle_tick(int cpu,
static void __tick_nohz_idle_enter(struct tick_sched *ts, bool stop_tick)
{
int cpu = smp_processor_id();
+ ktime_t expires;
+ int was_stopped;

- if (!ts->last_jiffies_update) {
- /* tick_nohz_get_sleep_length() has not run. */
+ if (ts->last_jiffies_update) {
+ if (!stop_tick)
+ goto out;
+
+ /*
+ * tick_nohz_get_sleep_length() has run, so the tick timer
+ * expiration time has been computed already.
+ */
+ expires = ts->next_idle_tick;
+ } else {
tick_nohz_start_idle(ts);
- if (!can_stop_idle_tick(cpu, ts))
+ if (!can_stop_idle_tick(cpu, ts) || !stop_tick)
return;
+
+ expires = __tick_nohz_next_event(ts, cpu);
}

- if (stop_tick) {
- int was_stopped = ts->tick_stopped;
- ktime_t expires;
-
- ts->idle_calls++;
-
- expires = tick_nohz_stop_sched_tick(ts, cpu);
- if (expires > 0LL) {
- ts->idle_sleeps++;
- ts->idle_expires = expires;
- }
+ ts->idle_calls++;

- if (!was_stopped && ts->tick_stopped) {
- ts->idle_jiffies = ts->last_jiffies;
- nohz_balance_enter_idle(cpu);
- }
+ was_stopped = ts->tick_stopped;
+
+ if (expires > 0LL) {
+ __tick_nohz_stop_tick(ts, cpu, expires);
+
+ ts->idle_sleeps++;
+ ts->idle_expires = expires;
}

+ if (!was_stopped && ts->tick_stopped) {
+ ts->idle_jiffies = ts->last_jiffies;
+ nohz_balance_enter_idle(cpu);
+ }
+
+out:
ts->last_jiffies_update = 0;
}

@@ -955,7 +969,7 @@ void __tick_nohz_idle_prepare(void)
lockdep_assert_irqs_enabled();
/*
* Update the idle state in the scheduler domain hierarchy
- * when tick_nohz_stop_sched_tick() is called from the idle loop.
+ * when __tick_nohz_stop_tick() is called from the idle loop.
* State will be updated to busy during the first busy tick after
* exiting idle.
*/
@@ -1040,7 +1054,7 @@ ktime_t tick_nohz_get_sleep_length(void)
now = tick_nohz_start_idle(ts);

if (can_stop_idle_tick(cpu, ts)) {
- next_event = __tick_nohz_next_event(ts, cpu, false);
+ next_event = __tick_nohz_next_event(ts, cpu);
} else {
struct clock_event_device *dev;

Index: linux-pm/kernel/time/tick-sched.h
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.h
+++ linux-pm/kernel/time/tick-sched.h
@@ -59,6 +59,7 @@ struct tick_sched {
ktime_t iowait_sleeptime;
unsigned long last_jiffies;
u64 last_jiffies_update;
+ u64 next_idle_tick;
u64 next_timer;
ktime_t idle_expires;
int do_timer_last;


2018-03-06 09:14:29

by Rafael J. Wysocki

[permalink] [raw]
Subject: [RFC/RFT][PATCH v2 4/6] cpuidle: Return nohz hint from cpuidle_select()

From: Rafael J. Wysocki <[email protected]>

Add a new pointer argument to cpuidle_select() and to the ->select
cpuidle governor callback to allow a boolean value indicating
whether or not the tick should be stopped before entering the
selected state to be returned from there.

Make the ladder governor ignore that pointer (to preserve its
current behavior) and make the menu governor return 'false" through
it if:
(1) the idle exit latency is constrained at 0,
(2) the selected state is a polling one, or
(3) the target residency of the selected state is less than the
length of the tick period and there is at least one deeper
idle state available within the tick period range.

Since the value returned through the new argument pointer is not
used yet, this change is not expected to alter the functionality of
the code.

Signed-off-by: Rafael J. Wysocki <[email protected]>
---

-> v2: New patch.

---
drivers/cpuidle/cpuidle.c | 10 ++++++++--
drivers/cpuidle/governors/ladder.c | 3 ++-
drivers/cpuidle/governors/menu.c | 26 ++++++++++++++++++++++----
include/linux/cpuidle.h | 6 ++++--
kernel/sched/idle.c | 4 +++-
5 files changed, 39 insertions(+), 10 deletions(-)

Index: linux-pm/include/linux/cpuidle.h
===================================================================
--- linux-pm.orig/include/linux/cpuidle.h
+++ linux-pm/include/linux/cpuidle.h
@@ -131,7 +131,8 @@ extern bool cpuidle_not_available(struct
struct cpuidle_device *dev);

extern int cpuidle_select(struct cpuidle_driver *drv,
- struct cpuidle_device *dev);
+ struct cpuidle_device *dev,
+ bool *nohz_ret);
extern int cpuidle_enter(struct cpuidle_driver *drv,
struct cpuidle_device *dev, int index);
extern void cpuidle_reflect(struct cpuidle_device *dev, int index);
@@ -246,7 +247,8 @@ struct cpuidle_governor {
struct cpuidle_device *dev);

int (*select) (struct cpuidle_driver *drv,
- struct cpuidle_device *dev);
+ struct cpuidle_device *dev,
+ bool *nohz_ret);
void (*reflect) (struct cpuidle_device *dev, int index);
};

Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -186,13 +186,15 @@ static void cpuidle_idle_call(void)
next_state = cpuidle_find_deepest_state(drv, dev);
call_cpuidle(drv, dev, next_state);
} else {
+ bool nohz = true;
+
tick_nohz_idle_go_idle(true);
rcu_idle_enter();

/*
* Ask the cpuidle framework to choose a convenient idle state.
*/
- next_state = cpuidle_select(drv, dev);
+ next_state = cpuidle_select(drv, dev, &nohz);
entered_state = call_cpuidle(drv, dev, next_state);
/*
* Give the governor an opportunity to reflect on the outcome
Index: linux-pm/drivers/cpuidle/cpuidle.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/cpuidle.c
+++ linux-pm/drivers/cpuidle/cpuidle.c
@@ -263,12 +263,18 @@ int cpuidle_enter_state(struct cpuidle_d
*
* @drv: the cpuidle driver
* @dev: the cpuidle device
+ * @nohz_ret: indication on whether or not to stop the tick
*
* Returns the index of the idle state. The return value must not be negative.
+ *
+ * The memory location pointed to by @nohz_ret is expected to be written the
+ * 'false' boolean value if the scheduler tick should not be stopped before
+ * entering the returned state.
*/
-int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
+ bool *nohz_ret)
{
- return cpuidle_curr_governor->select(drv, dev);
+ return cpuidle_curr_governor->select(drv, dev, nohz_ret);
}

/**
Index: linux-pm/drivers/cpuidle/governors/ladder.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/ladder.c
+++ linux-pm/drivers/cpuidle/governors/ladder.c
@@ -63,9 +63,10 @@ static inline void ladder_do_selection(s
* ladder_select_state - selects the next state to enter
* @drv: cpuidle driver
* @dev: the CPU
+ * @dummy: not used
*/
static int ladder_select_state(struct cpuidle_driver *drv,
- struct cpuidle_device *dev)
+ struct cpuidle_device *dev, bool *dummy)
{
struct ladder_device *ldev = this_cpu_ptr(&ladder_devices);
struct device *device = get_cpu_device(dev->cpu);
Index: linux-pm/drivers/cpuidle/governors/menu.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/menu.c
+++ linux-pm/drivers/cpuidle/governors/menu.c
@@ -279,8 +279,10 @@ again:
* menu_select - selects the next idle state to enter
* @drv: cpuidle driver containing state data
* @dev: the CPU
+ * @nohz_ret: indication on whether or not to stop the tick
*/
-static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
+ bool *nohz_ret)
{
struct menu_device *data = this_cpu_ptr(&menu_devices);
struct device *device = get_cpu_device(dev->cpu);
@@ -303,8 +305,10 @@ static int menu_select(struct cpuidle_dr
latency_req = resume_latency;

/* Special case when user has set very strict latency requirement */
- if (unlikely(latency_req == 0))
+ if (unlikely(latency_req == 0)) {
+ *nohz_ret = false;
return 0;
+ }

/* determine the expected residency time, round up */
data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length());
@@ -367,10 +371,12 @@ static int menu_select(struct cpuidle_dr
continue;
if (idx == -1)
idx = i; /* first enabled state */
- if (s->target_residency > data->predicted_us)
- break;
if (s->exit_latency > latency_req)
break;
+ if (s->target_residency > data->predicted_us) {
+ first_idx = i;
+ break;
+ }

idx = i;
}
@@ -378,6 +384,18 @@ static int menu_select(struct cpuidle_dr
if (idx == -1)
idx = 0; /* No states enabled. Must use 0. */

+ if (drv->states[idx].flags & CPUIDLE_FLAG_POLLING) {
+ *nohz_ret = false;
+ } else if (drv->states[idx].target_residency < TICK_USEC) {
+ /*
+ * Do not stop the tick if there is at least one more state
+ * within the tick period range that could be used if longer
+ * idle duration was predicted.
+ */
+ *nohz_ret = first_idx > idx &&
+ drv->states[first_idx].target_residency < TICK_USEC;
+ }
+
data->last_state_idx = idx;

return data->last_state_idx;


2018-03-06 09:14:52

by Rafael J. Wysocki

[permalink] [raw]
Subject: [RFC/RFT][PATCH v2 2/6] sched: idle: Do not stop the tick upfront in the idle loop

From: Rafael J. Wysocki <[email protected]>

Push the decision whether or not to stop the tick somewhat deeper
into the idle loop.

Stopping the tick upfront leads to unpleasant outcomes in case the
idle governor doesn't agree with the timekeeping code on the duration
of the upcoming idle period. Specifically, if the tick has been
stopped and the idle governor predicts short idle, the situation is
bad regardless of whether or not the prediction is accurate. If it
is accurate, the tick has been stopped unnecessarily which means
excessive overhead. If it is not accurate, the CPU is likely to
spend too much time in the (shallow, because short idle has been
predicted) idle state selected by the governor [1].

As the first step towards addressing this problem, change the code
to make the tick stopping decision inside of the loop in do_idle().
In particular, do not stop the tick in the cpu_idle_poll() code path.
Also don't do that in tick_nohz_irq_exit() which doesn't really have
information to whether or not to stop the tick.

Link: https://marc.info/?l=linux-pm&m=150116085925208&w=2 # [1]
Link: https://tu-dresden.de/zih/forschung/ressourcen/dateien/projekte/haec/powernightmares.pdf
Signed-off-by: Rafael J. Wysocki <[email protected]>
---

-> v2: No changes.

---
kernel/sched/idle.c | 13 ++++++++++---
kernel/time/tick-sched.c | 2 +-
2 files changed, 11 insertions(+), 4 deletions(-)

Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -220,13 +220,17 @@ static void do_idle(void)
*/

__current_set_polling();
- tick_nohz_idle_enter();
+ tick_nohz_idle_prepare();

while (!need_resched()) {
check_pgt_cache();
rmb();

if (cpu_is_offline(cpu)) {
+ local_irq_disable();
+ tick_nohz_idle_go_idle(true);
+ local_irq_enable();
+
cpuhp_report_idle_dead();
arch_cpu_idle_dead();
}
@@ -240,10 +244,13 @@ static void do_idle(void)
* broadcast device expired for us, we don't want to go deep
* idle as we know that the IPI is going to arrive right away.
*/
- if (cpu_idle_force_poll || tick_check_broadcast_expired())
+ if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
+ tick_nohz_idle_go_idle(false);
cpu_idle_poll();
- else
+ } else {
+ tick_nohz_idle_go_idle(true);
cpuidle_idle_call();
+ }
arch_cpu_idle_exit();
}

Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -1007,7 +1007,7 @@ void tick_nohz_irq_exit(void)
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);

if (ts->inidle)
- __tick_nohz_idle_enter(ts, true);
+ __tick_nohz_idle_enter(ts, false);
else
tick_nohz_full_update_tick(ts);
}


2018-03-06 09:15:32

by Rafael J. Wysocki

[permalink] [raw]
Subject: [RFC/RFT][PATCH v2 3/6] sched: idle: Do not stop the tick before cpuidle_idle_call()

From: Rafael J. Wysocki <[email protected]>

Make cpuidle_idle_call() decide whether or not to stop the tick.

First, there are code paths in cpuidle_idle_call() that don't need
the tick to be stopped. In particular, the cpuidle_enter_s2idle()
path deals with the tick (and with the entire timekeeping for that
matter) by itself and it doesn't need the tick to be stopped
beforehand.

Second, to address the issue with short idle duration predictions
by the idle governor after the tick has been stopped, it will be
necessary to change the ordering of some governor code with respect
to some code in tick_nohz_idle_go_idle(). For this purpose, call
tick_nohz_idle_go_idle() in the same branch as cpuidle_select().

Signed-off-by: Rafael J. Wysocki <[email protected]>
---

-> v2: No changes.

---
kernel/sched/idle.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)

Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -146,13 +146,14 @@ static void cpuidle_idle_call(void)
}

/*
- * Tell the RCU framework we are entering an idle section,
- * so no more rcu read side critical sections and one more
+ * The RCU framework needs to be told that we are entering an idle
+ * section, so no more rcu read side critical sections and one more
* step to the grace period
*/
- rcu_idle_enter();

if (cpuidle_not_available(drv, dev)) {
+ tick_nohz_idle_go_idle(false);
+ rcu_idle_enter();
default_idle_call();
goto exit_idle;
}
@@ -169,6 +170,9 @@ static void cpuidle_idle_call(void)

if (idle_should_enter_s2idle() || dev->use_deepest_state) {
if (idle_should_enter_s2idle()) {
+ tick_nohz_idle_go_idle(false);
+ rcu_idle_enter();
+
entered_state = cpuidle_enter_s2idle(drv, dev);
if (entered_state > 0) {
local_irq_enable();
@@ -176,9 +180,15 @@ static void cpuidle_idle_call(void)
}
}

+ tick_nohz_idle_go_idle(true);
+ rcu_idle_enter();
+
next_state = cpuidle_find_deepest_state(drv, dev);
call_cpuidle(drv, dev, next_state);
} else {
+ tick_nohz_idle_go_idle(true);
+ rcu_idle_enter();
+
/*
* Ask the cpuidle framework to choose a convenient idle state.
*/
@@ -248,7 +258,6 @@ static void do_idle(void)
tick_nohz_idle_go_idle(false);
cpu_idle_poll();
} else {
- tick_nohz_idle_go_idle(true);
cpuidle_idle_call();
}
arch_cpu_idle_exit();


2018-03-06 09:15:35

by Rafael J. Wysocki

[permalink] [raw]
Subject: [RFC/RFT][PATCH v2 1/6] time: tick-sched: Reorganize idle tick management code

From: Rafael J. Wysocki <[email protected]>

Prepare two pieces of code in tick_nohz_idle_enter() for being
called separately from each other.

First, make it possible to call the initial preparatory part of
tick_nohz_idle_enter() without the tick-stopping part following
it and introduce the tick_nohz_idle_prepare() wrapper for that
(that will be used in the next set of changes).

Second, add a new stop_tick argument to __tick_nohz_idle_enter()
tell it whether or not to stop the tick (that is always set for
now) and add a wrapper allowing this function to be called from
the outside of tick-sched.c.

Just the code reorganization and two new wrapper functions, no
intended functional changes.

Signed-off-by: Rafael J. Wysocki <[email protected]>
---

-> v2: Eliminate assumetry in enabling/disabling interrupts.

---
include/linux/tick.h | 2 +
kernel/time/tick-sched.c | 62 ++++++++++++++++++++++++++++++++---------------
2 files changed, 45 insertions(+), 19 deletions(-)

Index: linux-pm/include/linux/tick.h
===================================================================
--- linux-pm.orig/include/linux/tick.h
+++ linux-pm/include/linux/tick.h
@@ -114,6 +114,8 @@ enum tick_dep_bits {
#ifdef CONFIG_NO_HZ_COMMON
extern bool tick_nohz_enabled;
extern int tick_nohz_tick_stopped(void);
+extern void tick_nohz_idle_prepare(void);
+extern void tick_nohz_idle_go_idle(bool stop_tick);
extern void tick_nohz_idle_enter(void);
extern void tick_nohz_idle_exit(void);
extern void tick_nohz_irq_exit(void);
Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -911,14 +911,14 @@ static bool can_stop_idle_tick(int cpu,
return true;
}

-static void __tick_nohz_idle_enter(struct tick_sched *ts)
+static void __tick_nohz_idle_enter(struct tick_sched *ts, bool stop_tick)
{
ktime_t now, expires;
int cpu = smp_processor_id();

now = tick_nohz_start_idle(ts);

- if (can_stop_idle_tick(cpu, ts)) {
+ if (can_stop_idle_tick(cpu, ts) && stop_tick) {
int was_stopped = ts->tick_stopped;

ts->idle_calls++;
@@ -936,22 +936,8 @@ static void __tick_nohz_idle_enter(struc
}
}

-/**
- * tick_nohz_idle_enter - stop the idle tick from the idle task
- *
- * When the next event is more than a tick into the future, stop the idle tick
- * Called when we start the idle loop.
- *
- * The arch is responsible of calling:
- *
- * - rcu_idle_enter() after its last use of RCU before the CPU is put
- * to sleep.
- * - rcu_idle_exit() before the first use of RCU after the CPU is woken up.
- */
-void tick_nohz_idle_enter(void)
+void __tick_nohz_idle_prepare(void)
{
- struct tick_sched *ts;
-
lockdep_assert_irqs_enabled();
/*
* Update the idle state in the scheduler domain hierarchy
@@ -960,12 +946,50 @@ void tick_nohz_idle_enter(void)
* exiting idle.
*/
set_cpu_sd_state_idle();
+}
+
+/**
+ * tick_nohz_idle_prepare - prepare for entering idle on the current CPU.
+ *
+ * Called when we start the idle loop.
+ */
+void tick_nohz_idle_prepare(void)
+{
+ struct tick_sched *ts;
+
+ __tick_nohz_idle_prepare();
+
+ local_irq_disable();
+
+ ts = this_cpu_ptr(&tick_cpu_sched);
+ ts->inidle = 1;
+
+ local_irq_enable();
+}
+
+/**
+ * tick_nohz_idle_go_idle - start idle period on the current CPU.
+ * @stop_tick: Whether or not to stop the idle tick.
+ *
+ * When @stop_tick is set and the next event is more than a tick into the
+ * future, stop the idle tick.
+ */
+void tick_nohz_idle_go_idle(bool stop_tick)
+{
+ __tick_nohz_idle_enter(this_cpu_ptr(&tick_cpu_sched), stop_tick);
+}
+
+void tick_nohz_idle_enter(void)
+{
+ struct tick_sched *ts;
+
+ __tick_nohz_idle_prepare();

local_irq_disable();

ts = this_cpu_ptr(&tick_cpu_sched);
ts->inidle = 1;
- __tick_nohz_idle_enter(ts);
+ __tick_nohz_idle_enter(ts, true);

local_irq_enable();
}
@@ -983,7 +1007,7 @@ void tick_nohz_irq_exit(void)
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);

if (ts->inidle)
- __tick_nohz_idle_enter(ts);
+ __tick_nohz_idle_enter(ts, true);
else
tick_nohz_full_update_tick(ts);
}


2018-03-06 09:18:51

by Rafael J. Wysocki

[permalink] [raw]
Subject: [RFC/RFT][PATCH v2 5/6] sched: idle: Select idle state before stopping the tick

From: Rafael J. Wysocki <[email protected]>

In order to address the issue with short idle duration predictions
by the idle governor after the tick has been stopped, reorder the
code in cpuidle_idle_call() so that the governor idle state selection
runs before tick_nohz_idle_go_idle() and use the "nohz" hint returned
by cpuidle_select() to tell tick_nohz_idle_go_idle() whether or not
to stop the tick.

This isn't straightforward, because menu_predict() invokes
tick_nohz_get_sleep_length() to get the time to the next timer
event and the number returned by the latter comes from
__tick_nohz_idle_enter(). Fortunately, however, it is possible
to compute that number without actually stopping the tick and with
the help of the existing code.

Namely, notice that tick_nohz_stop_sched_tick() already computes the
next timer event time to reprogram the scheduler tick hrtimer and
that time can be used as a proxy for the actual next timer event
time in the idle duration predicition.

Accordingly, rename the original tick_nohz_stop_sched_tick() to
__tick_nohz_next_event() and add the stop_tick argument indicating
whether or not to stop the tick to it. If that argument is 'true',
the function will work like the original tick_nohz_stop_sched_tick(),
but otherwise it will just compute the next event time without
stopping the tick. Next, redefine tick_nohz_stop_sched_tick() as
a wrapper around the new function.

Following that, make tick_nohz_get_sleep_length() call
__tick_nohz_next_event() to compute the next timer event time
and make it use the new last_jiffies_update field in struct
tick_sched to tell __tick_nohz_idle_enter() to skip some code
that has run already.

[After this change the __tick_nohz_next_event() code computing the
next event time will run twice in a row if the expected idle period
duration coming from cpuidle_select() is large enough which is sort
of ugly, but the next set of changes deals with that separately.
To do that, it uses the value of the last_jiffies_update field in
struct tick_sched introduced here, among other things.]

Finally, drop the now redundant sleep_length field from struct
tick_sched.

Signed-off-by: Rafael J. Wysocki <[email protected]>
---

-> v2: Use the "nohz" hint from cpuidle_select() instead of the expected
idle duration.

---
kernel/sched/idle.c | 7 ++---
kernel/time/tick-sched.c | 64 +++++++++++++++++++++++++++++++++--------------
kernel/time/tick-sched.h | 3 --
3 files changed, 50 insertions(+), 24 deletions(-)

Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -188,13 +188,14 @@ static void cpuidle_idle_call(void)
} else {
bool nohz = true;

- tick_nohz_idle_go_idle(true);
- rcu_idle_enter();
-
/*
* Ask the cpuidle framework to choose a convenient idle state.
*/
next_state = cpuidle_select(drv, dev, &nohz);
+
+ tick_nohz_idle_go_idle(nohz);
+ rcu_idle_enter();
+
entered_state = call_cpuidle(drv, dev, next_state);
/*
* Give the governor an opportunity to reflect on the outcome
Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -655,8 +655,8 @@ static inline bool local_timer_softirq_p
return local_softirq_pending() & TIMER_SOFTIRQ;
}

-static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
- ktime_t now, int cpu)
+static ktime_t __tick_nohz_next_event(struct tick_sched *ts, int cpu,
+ bool stop_tick)
{
struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
u64 basemono, next_tick, next_tmr, next_rcu, delta, expires;
@@ -670,6 +670,7 @@ static ktime_t tick_nohz_stop_sched_tick
basejiff = jiffies;
} while (read_seqretry(&jiffies_lock, seq));
ts->last_jiffies = basejiff;
+ ts->last_jiffies_update = basemono;

/*
* Keep the periodic tick, when RCU, architecture or irq_work
@@ -732,8 +733,10 @@ static ktime_t tick_nohz_stop_sched_tick
*/
delta = timekeeping_max_deferment();
if (cpu == tick_do_timer_cpu) {
- tick_do_timer_cpu = TICK_DO_TIMER_NONE;
- ts->do_timer_last = 1;
+ if (stop_tick) {
+ tick_do_timer_cpu = TICK_DO_TIMER_NONE;
+ ts->do_timer_last = 1;
+ }
} else if (tick_do_timer_cpu != TICK_DO_TIMER_NONE) {
delta = KTIME_MAX;
ts->do_timer_last = 0;
@@ -756,6 +759,12 @@ static ktime_t tick_nohz_stop_sched_tick
expires = min_t(u64, expires, next_tick);
tick = expires;

+ if (!stop_tick) {
+ /* Undo the effect of get_next_timer_interrupt(). */
+ timer_clear_idle();
+ goto out;
+ }
+
/* Skip reprogram of event if its not changed */
if (ts->tick_stopped && (expires == ts->next_tick)) {
/* Sanity check: make sure clockevent is actually programmed */
@@ -804,14 +813,14 @@ static ktime_t tick_nohz_stop_sched_tick
else
tick_program_event(tick, 1);
out:
- /*
- * Update the estimated sleep length until the next timer
- * (not only the tick).
- */
- ts->sleep_length = ktime_sub(dev->next_event, now);
return tick;
}

+static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts, int cpu)
+{
+ return __tick_nohz_next_event(ts, cpu, true);
+}
+
static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
{
/* Update jiffies first */
@@ -847,7 +856,7 @@ static void tick_nohz_full_update_tick(s
return;

if (can_stop_full_tick(cpu, ts))
- tick_nohz_stop_sched_tick(ts, ktime_get(), cpu);
+ tick_nohz_stop_sched_tick(ts, cpu);
else if (ts->tick_stopped)
tick_nohz_restart_sched_tick(ts, ktime_get());
#endif
@@ -873,10 +882,8 @@ static bool can_stop_idle_tick(int cpu,
return false;
}

- if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE)) {
- ts->sleep_length = NSEC_PER_SEC / HZ;
+ if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
return false;
- }

if (need_resched())
return false;
@@ -913,17 +920,22 @@ static bool can_stop_idle_tick(int cpu,

static void __tick_nohz_idle_enter(struct tick_sched *ts, bool stop_tick)
{
- ktime_t now, expires;
int cpu = smp_processor_id();

- now = tick_nohz_start_idle(ts);
+ if (!ts->last_jiffies_update) {
+ /* tick_nohz_get_sleep_length() has not run. */
+ tick_nohz_start_idle(ts);
+ if (!can_stop_idle_tick(cpu, ts))
+ return;
+ }

- if (can_stop_idle_tick(cpu, ts) && stop_tick) {
+ if (stop_tick) {
int was_stopped = ts->tick_stopped;
+ ktime_t expires;

ts->idle_calls++;

- expires = tick_nohz_stop_sched_tick(ts, now, cpu);
+ expires = tick_nohz_stop_sched_tick(ts, cpu);
if (expires > 0LL) {
ts->idle_sleeps++;
ts->idle_expires = expires;
@@ -934,6 +946,8 @@ static void __tick_nohz_idle_enter(struc
nohz_balance_enter_idle(cpu);
}
}
+
+ ts->last_jiffies_update = 0;
}

void __tick_nohz_idle_prepare(void)
@@ -1013,15 +1027,27 @@ void tick_nohz_irq_exit(void)
}

/**
- * tick_nohz_get_sleep_length - return the length of the current sleep
+ * tick_nohz_get_sleep_length - return the expected length of the current sleep
*
* Called from power state control code with interrupts disabled
*/
ktime_t tick_nohz_get_sleep_length(void)
{
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
+ int cpu = smp_processor_id();
+ ktime_t now, next_event;

- return ts->sleep_length;
+ now = tick_nohz_start_idle(ts);
+
+ if (can_stop_idle_tick(cpu, ts)) {
+ next_event = __tick_nohz_next_event(ts, cpu, false);
+ } else {
+ struct clock_event_device *dev;
+
+ dev = __this_cpu_read(tick_cpu_device.evtdev);
+ next_event = dev->next_event;
+ }
+ return ktime_sub(next_event, now);;
}

/**
Index: linux-pm/kernel/time/tick-sched.h
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.h
+++ linux-pm/kernel/time/tick-sched.h
@@ -38,7 +38,6 @@ enum tick_nohz_mode {
* @idle_exittime: Time when the idle state was left
* @idle_sleeptime: Sum of the time slept in idle with sched tick stopped
* @iowait_sleeptime: Sum of the time slept in idle with sched tick stopped, with IO outstanding
- * @sleep_length: Duration of the current idle sleep
* @do_timer_lst: CPU was the last one doing do_timer before going idle
*/
struct tick_sched {
@@ -58,8 +57,8 @@ struct tick_sched {
ktime_t idle_exittime;
ktime_t idle_sleeptime;
ktime_t iowait_sleeptime;
- ktime_t sleep_length;
unsigned long last_jiffies;
+ u64 last_jiffies_update;
u64 next_timer;
ktime_t idle_expires;
int do_timer_last;


2018-03-06 09:32:07

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC/RFT][PATCH v2 4/6] cpuidle: Return nohz hint from cpuidle_select()

Bummer. :-(

On Tue, Mar 6, 2018 at 10:05 AM, Rafael J. Wysocki <[email protected]> wrote:

> + if (drv->states[idx].flags & CPUIDLE_FLAG_POLLING) {
> + *nohz_ret = false;
> + } else if (drv->states[idx].target_residency < TICK_USEC) {
> + /*
> + * Do not stop the tick if there is at least one more state
> + * within the tick period range that could be used if longer
> + * idle duration was predicted.
> + */
> + *nohz_ret = first_idx > idx &&
> + drv->states[first_idx].target_residency < TICK_USEC;

This is reversed, sent a wrong version of the patch.

I'll resend with this fixed shortly.

2018-03-06 10:07:07

by Rafael J. Wysocki

[permalink] [raw]
Subject: [Update][RFC/RFT][PATCH v2 4/6] cpuidle: Return nohz hint from cpuidle_select()

From: Rafael J. Wysocki <[email protected]>

Add a new pointer argument to cpuidle_select() and to the ->select
cpuidle governor callback to allow a boolean value indicating
whether or not the tick should be stopped before entering the
selected state to be returned from there.

Make the ladder governor ignore that pointer (to preserve its
current behavior) and make the menu governor return 'false" through
it if:
(1) the idle exit latency is constrained at 0,
(2) the selected state is a polling one, or
(3) the target residency of the selected state is less than the
length of the tick period and there is at least one deeper
idle state available within the tick period range.

Since the value returned through the new argument pointer is not
used yet, this change is not expected to alter the functionality of
the code.

Signed-off-by: Rafael J. Wysocki <[email protected]>
---

Update: Fix a reversed value of *nohz_ret in menu_select().

---
drivers/cpuidle/cpuidle.c | 10 ++++++++--
drivers/cpuidle/governors/ladder.c | 3 ++-
drivers/cpuidle/governors/menu.c | 26 ++++++++++++++++++++++----
include/linux/cpuidle.h | 6 ++++--
kernel/sched/idle.c | 4 +++-
5 files changed, 39 insertions(+), 10 deletions(-)

Index: linux-pm/include/linux/cpuidle.h
===================================================================
--- linux-pm.orig/include/linux/cpuidle.h
+++ linux-pm/include/linux/cpuidle.h
@@ -131,7 +131,8 @@ extern bool cpuidle_not_available(struct
struct cpuidle_device *dev);

extern int cpuidle_select(struct cpuidle_driver *drv,
- struct cpuidle_device *dev);
+ struct cpuidle_device *dev,
+ bool *nohz_ret);
extern int cpuidle_enter(struct cpuidle_driver *drv,
struct cpuidle_device *dev, int index);
extern void cpuidle_reflect(struct cpuidle_device *dev, int index);
@@ -246,7 +247,8 @@ struct cpuidle_governor {
struct cpuidle_device *dev);

int (*select) (struct cpuidle_driver *drv,
- struct cpuidle_device *dev);
+ struct cpuidle_device *dev,
+ bool *nohz_ret);
void (*reflect) (struct cpuidle_device *dev, int index);
};

Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -186,13 +186,15 @@ static void cpuidle_idle_call(void)
next_state = cpuidle_find_deepest_state(drv, dev);
call_cpuidle(drv, dev, next_state);
} else {
+ bool nohz = true;
+
tick_nohz_idle_go_idle(true);
rcu_idle_enter();

/*
* Ask the cpuidle framework to choose a convenient idle state.
*/
- next_state = cpuidle_select(drv, dev);
+ next_state = cpuidle_select(drv, dev, &nohz);
entered_state = call_cpuidle(drv, dev, next_state);
/*
* Give the governor an opportunity to reflect on the outcome
Index: linux-pm/drivers/cpuidle/cpuidle.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/cpuidle.c
+++ linux-pm/drivers/cpuidle/cpuidle.c
@@ -263,12 +263,18 @@ int cpuidle_enter_state(struct cpuidle_d
*
* @drv: the cpuidle driver
* @dev: the cpuidle device
+ * @nohz_ret: indication on whether or not to stop the tick
*
* Returns the index of the idle state. The return value must not be negative.
+ *
+ * The memory location pointed to by @nohz_ret is expected to be written the
+ * 'false' boolean value if the scheduler tick should not be stopped before
+ * entering the returned state.
*/
-int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
+ bool *nohz_ret)
{
- return cpuidle_curr_governor->select(drv, dev);
+ return cpuidle_curr_governor->select(drv, dev, nohz_ret);
}

/**
Index: linux-pm/drivers/cpuidle/governors/ladder.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/ladder.c
+++ linux-pm/drivers/cpuidle/governors/ladder.c
@@ -63,9 +63,10 @@ static inline void ladder_do_selection(s
* ladder_select_state - selects the next state to enter
* @drv: cpuidle driver
* @dev: the CPU
+ * @dummy: not used
*/
static int ladder_select_state(struct cpuidle_driver *drv,
- struct cpuidle_device *dev)
+ struct cpuidle_device *dev, bool *dummy)
{
struct ladder_device *ldev = this_cpu_ptr(&ladder_devices);
struct device *device = get_cpu_device(dev->cpu);
Index: linux-pm/drivers/cpuidle/governors/menu.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/menu.c
+++ linux-pm/drivers/cpuidle/governors/menu.c
@@ -279,8 +279,10 @@ again:
* menu_select - selects the next idle state to enter
* @drv: cpuidle driver containing state data
* @dev: the CPU
+ * @nohz_ret: indication on whether or not to stop the tick
*/
-static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
+ bool *nohz_ret)
{
struct menu_device *data = this_cpu_ptr(&menu_devices);
struct device *device = get_cpu_device(dev->cpu);
@@ -303,8 +305,10 @@ static int menu_select(struct cpuidle_dr
latency_req = resume_latency;

/* Special case when user has set very strict latency requirement */
- if (unlikely(latency_req == 0))
+ if (unlikely(latency_req == 0)) {
+ *nohz_ret = false;
return 0;
+ }

/* determine the expected residency time, round up */
data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length());
@@ -367,10 +371,12 @@ static int menu_select(struct cpuidle_dr
continue;
if (idx == -1)
idx = i; /* first enabled state */
- if (s->target_residency > data->predicted_us)
- break;
if (s->exit_latency > latency_req)
break;
+ if (s->target_residency > data->predicted_us) {
+ first_idx = i;
+ break;
+ }

idx = i;
}
@@ -378,6 +384,18 @@ static int menu_select(struct cpuidle_dr
if (idx == -1)
idx = 0; /* No states enabled. Must use 0. */

+ if (drv->states[idx].flags & CPUIDLE_FLAG_POLLING) {
+ *nohz_ret = false;
+ } else if (drv->states[idx].target_residency < TICK_USEC) {
+ /*
+ * Do not stop the tick if there is at least one more state
+ * within the tick period range that could be used if longer
+ * idle duration was predicted.
+ */
+ *nohz_ret = !(first_idx > idx &&
+ drv->states[first_idx].target_residency < TICK_USEC);
+ }
+
data->last_state_idx = idx;

return data->last_state_idx;


2018-03-07 17:05:52

by Doug Smythies

[permalink] [raw]
Subject: RE: [RFC/RFT][PATCH v2 0/6] sched/cpuidle: Idle loop rework

On 2018.03.06 12:57 Rafael J. Wysocki wrote:

...[snip]...

> And the two paragraphs below still apply:

>> I have tested these patches on a couple of machines, including the very laptop
>> I'm sending them from, without any obvious issues, but please give them a go
>> if you can, especially if you have an easy way to reproduce the problem they
>> are targeting. The patches are on top of 4.16-rc3 (if you need a git branch
>> with them for easier testing, please let me know).

Hi,

I am still having some boot troubles with V2. However, and because my system
did eventually boot, seemingly O.K., I didn't re-boot a bunch of times for
further testing.

I ran my 100% load on one CPU test, which is for idle state 0 issues, on
my otherwise extremely idle test server. I never did have very good ways
to test issues with the other idle states (Thomas Ilsche's specialty).

During the test I got some messages (I also got some with the V1 patch set):

[16246.655148] rcu_preempt kthread starved for 60005 jiffies! g10557 c10556
f0x0 RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=5
[19556.565007] rcu_preempt kthread starved for 60003 jiffies! g12126 c12125
f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=5
[20223.066251] clocksource: timekeeping watchdog on CPU4: Marking clocksource
'tsc' as unstable because the skew is too large:
[20223.066260] clocksource: 'hpet' wd_now: 6b02e6a0
wd_last: c70685ef mask: ffffffff
[20223.066262] clocksource: 'tsc' cs_now: 3ed0d6f109f5
cs_last: 3e383b5c058d mask: ffffffffffffffff
[20223.066264] tsc: Marking TSC unstable due to clocksource watchdog
[26720.509156] rcu_preempt kthread starved for 60003 jiffies! g16640
c16639 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=5
[29058.215330] rcu_preempt kthread starved for 60004 jiffies! g17522
c17521 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=4
...

The other observation is sometimes the number of irqs (turbostat) jumps
a lot. This did not occur with the V1 patch set. An increase in irqs is
expected, but I don't think that much.
Note: I am unable to show a correlation between the above log entries
and the jumps in irqs.

Test results (some previous restated):
Observe 24.44 average processor package watts. 100% load on CPU 7.
Test Duration 10 hours and 9 minutes. Peak power 26.88 Watts
Reference: K4.16-rc3 + rjw V1 patchset: 24.77 Watts. Peak power 32.8 watts
Reference: K4.16-rc3: 26.41 Watts (short test, 3.53 hours)
Reference: K4.15-rc1: 27.34 Watts
Reference: K4.15-rc1, idle states 0-3 disabled: 23.92 Watts
Reference: K4.16-rc3 + rjw v1 patch set, idle states 0-3 disabled: ~23.65 Watts

References (Graphs):
http://fast.smythies.com/rjw416rc3v2_irq.png
http://fast.smythies.com/rjw416rc3v2_pwr.png

... Doug



2018-03-07 22:12:47

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC/RFT][PATCH v2 0/6] sched/cpuidle: Idle loop rework

On Wed, Mar 7, 2018 at 6:04 PM, Doug Smythies <[email protected]> wrote:
> On 2018.03.06 12:57 Rafael J. Wysocki wrote:
>
> ...[snip]...
>
>> And the two paragraphs below still apply:
>
>>> I have tested these patches on a couple of machines, including the very laptop
>>> I'm sending them from, without any obvious issues, but please give them a go
>>> if you can, especially if you have an easy way to reproduce the problem they
>>> are targeting. The patches are on top of 4.16-rc3 (if you need a git branch
>>> with them for easier testing, please let me know).
>
> Hi,
>
> I am still having some boot troubles with V2. However, and because my system
> did eventually boot, seemingly O.K., I didn't re-boot a bunch of times for
> further testing.

OK, thanks for letting me know.

> I ran my 100% load on one CPU test, which is for idle state 0 issues, on
> my otherwise extremely idle test server. I never did have very good ways
> to test issues with the other idle states (Thomas Ilsche's specialty).
>
> During the test I got some messages (I also got some with the V1 patch set):
>
> [16246.655148] rcu_preempt kthread starved for 60005 jiffies! g10557 c10556
> f0x0 RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=5
> [19556.565007] rcu_preempt kthread starved for 60003 jiffies! g12126 c12125
> f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=5
> [20223.066251] clocksource: timekeeping watchdog on CPU4: Marking clocksource
> 'tsc' as unstable because the skew is too large:
> [20223.066260] clocksource: 'hpet' wd_now: 6b02e6a0
> wd_last: c70685ef mask: ffffffff
> [20223.066262] clocksource: 'tsc' cs_now: 3ed0d6f109f5
> cs_last: 3e383b5c058d mask: ffffffffffffffff
> [20223.066264] tsc: Marking TSC unstable due to clocksource watchdog
> [26720.509156] rcu_preempt kthread starved for 60003 jiffies! g16640
> c16639 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=5
> [29058.215330] rcu_preempt kthread starved for 60004 jiffies! g17522
> c17521 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=4
> ...

Can you please check if that's reporoducible with just the first three
patches in the series applied?

> The other observation is sometimes the number of irqs (turbostat) jumps
> a lot. This did not occur with the V1 patch set. An increase in irqs is
> expected, but I don't think that much.

Right.

> Note: I am unable to show a correlation between the above log entries
> and the jumps in irqs.

Thanks,
Rafael

2018-03-07 23:19:40

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC/RFT][PATCH v2 1/6] time: tick-sched: Reorganize idle tick management code

On Tue, Mar 06, 2018 at 10:02:01AM +0100, Rafael J. Wysocki wrote:
> Index: linux-pm/kernel/time/tick-sched.c
> ===================================================================
> --- linux-pm.orig/kernel/time/tick-sched.c
> +++ linux-pm/kernel/time/tick-sched.c
> +
> +/**
> + * tick_nohz_idle_prepare - prepare for entering idle on the current CPU.
> + *
> + * Called when we start the idle loop.
> + */
> +void tick_nohz_idle_prepare(void)
> +{
> + struct tick_sched *ts;
> +
> + __tick_nohz_idle_prepare();
> +
> + local_irq_disable();
> +
> + ts = this_cpu_ptr(&tick_cpu_sched);
> + ts->inidle = 1;
> +
> + local_irq_enable();
> +}

Why not calling tick_nohz_start_idle() from there? This is going to
simplify the rest, you won't need to call tick_nohz_idle_go_idle()
from places that don't want to stop the tick and you can then remove
the stop_tick argument.

> +
> +/**
> + * tick_nohz_idle_go_idle - start idle period on the current CPU.
> + * @stop_tick: Whether or not to stop the idle tick.
> + *
> + * When @stop_tick is set and the next event is more than a tick into the
> + * future, stop the idle tick.
> + */
> +void tick_nohz_idle_go_idle(bool stop_tick)
> +{
> + __tick_nohz_idle_enter(this_cpu_ptr(&tick_cpu_sched), stop_tick);
> +}
> +
> +void tick_nohz_idle_enter(void)
> +{
> + struct tick_sched *ts;
> +
> + __tick_nohz_idle_prepare();
>
> local_irq_disable();
>
> ts = this_cpu_ptr(&tick_cpu_sched);
> ts->inidle = 1;
> - __tick_nohz_idle_enter(ts);
> + __tick_nohz_idle_enter(ts, true);
>
> local_irq_enable();
> }

Ah I see you're keeping tick_nohz_idle_enter() around because of the callsite
in xen. It looks like a hack (from the xen part), I'll need to have a look at
it, just a note for myself...

Thanks.

2018-03-07 23:40:34

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC/RFT][PATCH v2 2/6] sched: idle: Do not stop the tick upfront in the idle loop

On Tue, Mar 06, 2018 at 10:02:15AM +0100, Rafael J. Wysocki wrote:
> Index: linux-pm/kernel/sched/idle.c
> ===================================================================
> --- linux-pm.orig/kernel/sched/idle.c
> +++ linux-pm/kernel/sched/idle.c
> @@ -220,13 +220,17 @@ static void do_idle(void)
> */
>
> __current_set_polling();
> - tick_nohz_idle_enter();
> + tick_nohz_idle_prepare();

Since we leave tick_nohz_idle_exit() unchanged, can we keep tick_nohz_idle_prepare()
under the name tick_nohz_idle_enter() so that we stay symetric? And then make xen call
the two functions:

tick_nohz_idle_enter();
tick_nohz_idle_go_idle();

Also can we rename tick_nohz_idle_go_idle() to tick_nohz_idle_stop_tick() ?
This will be more self-explanatory.

Thanks.

2018-03-08 01:30:27

by Doug Smythies

[permalink] [raw]
Subject: RE: [RFC/RFT][PATCH v2 0/6] sched/cpuidle: Idle loop rework

On 2018.03.07 14:12 Rafael J. Wysocki wrote:
> On Wed, Mar 7, 2018 at 6:04 PM, Doug Smythies <[email protected]> wrote:
>> On 2018.03.06 12:57 Rafael J. Wysocki wrote:
>
>> During the test I got some messages (I also got some with the V1 patch set):

> Can you please check if that's reporoducible with just the first three
> patches in the series applied?

I will.

>> The other observation is sometimes the number of irqs (turbostat) jumps
>> a lot. This did not occur with the V1 patch set. An increase in irqs is
>> expected, but I don't think that much.

> Right.

I did a trace for 1 hour, and got about 12 occurrence of very high IRQs.
I was able to correlate trace results with high IRQs per turbostat sampling
time (1 minute).

The extreme jumps in IRQs is due to CPUs wanting to be in idle state 4
(the deepest for my older i7 processor, C6), but the tick is not stopping,
(or so I am guessing) thus they exit idle state 4 every millisecond
(I am using a 1000 Hz kernel), and then pretty much immediately to go
back into idle state 4.
When this occurs, it seems to occur for a very long time, but it does
seem to eventually sort itself out.

Example:
5 2.0 [006] d..1 780.447005: cpu_idle: state=4 cpu_id=6
993 2.0 [006] d..1 780.447999: cpu_idle: state=4294967295 cpu_id=6
6 2.0 [006] d..1 780.448006: cpu_idle: state=4 cpu_id=6
992 2.0 [006] d..1 780.448999: cpu_idle: state=4294967295 cpu_id=6
6 2.0 [006] d..1 780.449005: cpu_idle: state=4 cpu_id=6

Where:
column 1 is the time difference in microseconds from the previous sample.
column 2 is the elapsed time of the test in minutes (for ease of correlating
with the other once per minute data.
Other columns are unmodified from the raw trace data, but this file is
only CPU 6 and only idle entry/exit.

And the same area from the raw trace file:

<idle>-0 [006] d..1 780.447005: cpu_idle: state=4 cpu_id=6
test1-2664 [007] d.h. 780.447999: local_timer_entry: vector=237
<idle>-0 [006] d..1 780.447999: cpu_idle: state=4294967295 cpu_id=6
test1-2664 [007] d.h. 780.447999: hrtimer_expire_entry: hrtimer=00000000ea612c0e function=tick_sched_timer now=780435000915
<idle>-0 [006] d.h1 780.448000: local_timer_entry: vector=237
<idle>-0 [006] d.h1 780.448001: hrtimer_expire_entry: hrtimer=000000004b84020f function=tick_sched_timer now=780435002540
<idle>-0 [006] d.h1 780.448002: hrtimer_expire_exit: hrtimer=000000004b84020f
test1-2664 [007] d.h. 780.448003: hrtimer_expire_exit: hrtimer=00000000ea612c0e
<idle>-0 [006] d.h1 780.448003: local_timer_exit: vector=237
test1-2664 [007] d.h. 780.448003: local_timer_exit: vector=237
<idle>-0 [006] d..1 780.448006: cpu_idle: state=4 cpu_id=6

... Doug



2018-03-08 09:07:14

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC/RFT][PATCH v2 2/6] sched: idle: Do not stop the tick upfront in the idle loop

On Thursday, March 8, 2018 12:39:12 AM CET Frederic Weisbecker wrote:
> On Tue, Mar 06, 2018 at 10:02:15AM +0100, Rafael J. Wysocki wrote:
> > Index: linux-pm/kernel/sched/idle.c
> > ===================================================================
> > --- linux-pm.orig/kernel/sched/idle.c
> > +++ linux-pm/kernel/sched/idle.c
> > @@ -220,13 +220,17 @@ static void do_idle(void)
> > */
> >
> > __current_set_polling();
> > - tick_nohz_idle_enter();
> > + tick_nohz_idle_prepare();
>
> Since we leave tick_nohz_idle_exit() unchanged, can we keep tick_nohz_idle_prepare()
> under the name tick_nohz_idle_enter() so that we stay symetric? And then make xen call
> the two functions:
>
> tick_nohz_idle_enter();
> tick_nohz_idle_go_idle();

No problem with that.

> Also can we rename tick_nohz_idle_go_idle() to tick_nohz_idle_stop_tick() ?
> This will be more self-explanatory.

But it doesn't always stop the tick which is why I chose the other name.


2018-03-08 09:23:33

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC/RFT][PATCH v2 1/6] time: tick-sched: Reorganize idle tick management code

On Thursday, March 8, 2018 12:18:29 AM CET Frederic Weisbecker wrote:
> On Tue, Mar 06, 2018 at 10:02:01AM +0100, Rafael J. Wysocki wrote:
> > Index: linux-pm/kernel/time/tick-sched.c
> > ===================================================================
> > --- linux-pm.orig/kernel/time/tick-sched.c
> > +++ linux-pm/kernel/time/tick-sched.c
> > +
> > +/**
> > + * tick_nohz_idle_prepare - prepare for entering idle on the current CPU.
> > + *
> > + * Called when we start the idle loop.
> > + */
> > +void tick_nohz_idle_prepare(void)
> > +{
> > + struct tick_sched *ts;
> > +
> > + __tick_nohz_idle_prepare();
> > +
> > + local_irq_disable();
> > +
> > + ts = this_cpu_ptr(&tick_cpu_sched);
> > + ts->inidle = 1;
> > +
> > + local_irq_enable();
> > +}
>
> Why not calling tick_nohz_start_idle() from there? This is going to
> simplify the rest, you won't need to call tick_nohz_idle_go_idle()
> from places that don't want to stop the tick and you can then remove
> the stop_tick argument.

So I guess I would then use ts->idle_entrytime as "now" in the
tick_nohz_get_sleep_length() computation, right?


2018-03-08 10:32:51

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC/RFT][PATCH v2 0/6] sched/cpuidle: Idle loop rework

On Tue, 2018-03-06 at 09:57 +0100, Rafael J. Wysocki wrote:
> Hi All,

Greetings,

> Thanks a lot for the discussion so far!
>
> Here's a new version of the series addressing some comments from the
> discussion and (most importantly) replacing patches 4 and 5 with another
> (simpler) patch.

Oddity: these patches seemingly manage to cost a bit of power when
lightly loaded. ?(but didn't cut cross core nohz cost much.. darn)

i4790 booted nopti nospectre_v2

30 sec tbench
4.16.0.g1b88acc-master (virgin)
Throughput 559.279 MB/sec 1 clients 1 procs max_latency=0.046 ms
Throughput 997.119 MB/sec 2 clients 2 procs max_latency=0.246 ms
Throughput 1693.04 MB/sec 4 clients 4 procs max_latency=4.309 ms
Throughput 3597.2 MB/sec 8 clients 8 procs max_latency=6.760 ms
Throughput 3474.55 MB/sec 16 clients 16 procs max_latency=6.743 ms

4.16.0.g1b88acc-master (+v2)
Throughput 588.929 MB/sec 1 clients 1 procs max_latency=0.291 ms
Throughput 1080.93 MB/sec 2 clients 2 procs max_latency=0.639 ms
Throughput 1826.3 MB/sec 4 clients 4 procs max_latency=0.647 ms
Throughput 3561.01 MB/sec 8 clients 8 procs max_latency=1.279 ms
Throughput 3382.98 MB/sec 16 clients 16 procs max_latency=4.817 ms

4.16.0.g1b88acc-master (+local nohz mitigation etc for reference [1])
Throughput 722.559 MB/sec 1 clients 1 procs max_latency=0.087 ms
Throughput 1208.59 MB/sec 2 clients 2 procs max_latency=0.289 ms
Throughput 2071.94 MB/sec 4 clients 4 procs max_latency=0.654 ms
Throughput 3784.91 MB/sec 8 clients 8 procs max_latency=0.974 ms
Throughput 3644.4 MB/sec 16 clients 16 procs max_latency=5.620 ms

turbostat -q -- firefox /root/tmp/video/BigBuckBunny-DivXPlusHD.mkv & sleep 300;killall firefox

PkgWatt
1 2 3
4.16.0.g1b88acc-master 6.95 7.03 6.91 (virgin)
4.16.0.g1b88acc-master 7.20 7.25 7.26 (+v2)
4.16.0.g1b88acc-master 6.90 7.06 6.95 (+local)

Why would v2 charge the light firefox load a small but consistent fee?

-Mike

1. see low end, that's largely due to nohz throttling

2018-03-08 11:13:21

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC/RFT][PATCH v2 0/6] sched/cpuidle: Idle loop rework

On Thu, Mar 8, 2018 at 11:31 AM, Mike Galbraith <[email protected]> wrote:
> On Tue, 2018-03-06 at 09:57 +0100, Rafael J. Wysocki wrote:
>> Hi All,
>
> Greetings,

Hi,

>> Thanks a lot for the discussion so far!
>>
>> Here's a new version of the series addressing some comments from the
>> discussion and (most importantly) replacing patches 4 and 5 with another
>> (simpler) patch.
>
> Oddity: these patches seemingly manage to cost a bit of power when
> lightly loaded. (but didn't cut cross core nohz cost much.. darn)
>
> i4790 booted nopti nospectre_v2
>
> 30 sec tbench
> 4.16.0.g1b88acc-master (virgin)
> Throughput 559.279 MB/sec 1 clients 1 procs max_latency=0.046 ms
> Throughput 997.119 MB/sec 2 clients 2 procs max_latency=0.246 ms
> Throughput 1693.04 MB/sec 4 clients 4 procs max_latency=4.309 ms
> Throughput 3597.2 MB/sec 8 clients 8 procs max_latency=6.760 ms
> Throughput 3474.55 MB/sec 16 clients 16 procs max_latency=6.743 ms
>
> 4.16.0.g1b88acc-master (+v2)
> Throughput 588.929 MB/sec 1 clients 1 procs max_latency=0.291 ms
> Throughput 1080.93 MB/sec 2 clients 2 procs max_latency=0.639 ms
> Throughput 1826.3 MB/sec 4 clients 4 procs max_latency=0.647 ms
> Throughput 3561.01 MB/sec 8 clients 8 procs max_latency=1.279 ms
> Throughput 3382.98 MB/sec 16 clients 16 procs max_latency=4.817 ms

max_latency is much lower here for >2 clients/procs, but at the same
time it is much higher for <=2 clients/procs (which then may be
related to the somewhat higher throughput). Interesting.

> 4.16.0.g1b88acc-master (+local nohz mitigation etc for reference [1])
> Throughput 722.559 MB/sec 1 clients 1 procs max_latency=0.087 ms
> Throughput 1208.59 MB/sec 2 clients 2 procs max_latency=0.289 ms
> Throughput 2071.94 MB/sec 4 clients 4 procs max_latency=0.654 ms
> Throughput 3784.91 MB/sec 8 clients 8 procs max_latency=0.974 ms
> Throughput 3644.4 MB/sec 16 clients 16 procs max_latency=5.620 ms
>
> turbostat -q -- firefox /root/tmp/video/BigBuckBunny-DivXPlusHD.mkv & sleep 300;killall firefox
>
> PkgWatt
> 1 2 3
> 4.16.0.g1b88acc-master 6.95 7.03 6.91 (virgin)
> 4.16.0.g1b88acc-master 7.20 7.25 7.26 (+v2)
> 4.16.0.g1b88acc-master 6.90 7.06 6.95 (+local)
>
> Why would v2 charge the light firefox load a small but consistent fee?

Two effects may come into play here I think.

One is that allowing the tick to run biases the menu governor's
predictions towards the lower end, so we may use shallow states more
as a result then (Peter was talking about that).

The second one may be that intermediate states are used quite a bit
"by nature" in this workload (that should be quite straightforward to
verify) and stopping the tick for them saves some energy on idle
entry/exit.

Thanks!

2018-03-08 13:41:59

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC/RFT][PATCH v2 0/6] sched/cpuidle: Idle loop rework

On Thu, 2018-03-08 at 12:10 +0100, Rafael J. Wysocki wrote:
> On Thu, Mar 8, 2018 at 11:31 AM, Mike Galbraith <[email protected]> wrote:
> 1 2 3
> > 4.16.0.g1b88acc-master 6.95 7.03 6.91 (virgin)
> > 4.16.0.g1b88acc-master 7.20 7.25 7.26 (+v2)
> > 4.16.0.g1b88acc-master 6.90 7.06 6.95 (+local)
> >
> > Why would v2 charge the light firefox load a small but consistent fee?
>
> Two effects may come into play here I think.
>
> One is that allowing the tick to run biases the menu governor's
> predictions towards the lower end, so we may use shallow states more
> as a result then (Peter was talking about that).

Hm, I'd expect that to show up in +local as well then, as it keeps the
tick running when avg_idle < sched_migration_cost (convenient magic
number), but the firefox load runs at the same wattage as virgin. ?I'm
also doing this...

--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -335,7 +335,7 @@ static int menu_select(struct cpuidle_dr
* C1's exit latency exceeds the user configured limit.
*/
polling_threshold = max_t(unsigned int, 20, s->target_residency);
- if (data->next_timer_us > polling_threshold &&
+ if (expected_interval > polling_threshold &&
latency_req > s->exit_latency && !s->disabled &&
!dev->states_usage[1].disable)
first_idx = 1;

...to help out high frequency cross core throughput, but the firefox
load apparently doesn't tickle that, as significant polling would
surely show in the wattage.

-Mike

2018-03-08 15:15:37

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC/RFT][PATCH v2 1/6] time: tick-sched: Reorganize idle tick management code

On Thu, Mar 08, 2018 at 10:22:26AM +0100, Rafael J. Wysocki wrote:
> On Thursday, March 8, 2018 12:18:29 AM CET Frederic Weisbecker wrote:
> > On Tue, Mar 06, 2018 at 10:02:01AM +0100, Rafael J. Wysocki wrote:
> > > Index: linux-pm/kernel/time/tick-sched.c
> > > ===================================================================
> > > --- linux-pm.orig/kernel/time/tick-sched.c
> > > +++ linux-pm/kernel/time/tick-sched.c
> > > +
> > > +/**
> > > + * tick_nohz_idle_prepare - prepare for entering idle on the current CPU.
> > > + *
> > > + * Called when we start the idle loop.
> > > + */
> > > +void tick_nohz_idle_prepare(void)
> > > +{
> > > + struct tick_sched *ts;
> > > +
> > > + __tick_nohz_idle_prepare();
> > > +
> > > + local_irq_disable();
> > > +
> > > + ts = this_cpu_ptr(&tick_cpu_sched);
> > > + ts->inidle = 1;
> > > +
> > > + local_irq_enable();
> > > +}
> >
> > Why not calling tick_nohz_start_idle() from there? This is going to
> > simplify the rest, you won't need to call tick_nohz_idle_go_idle()
> > from places that don't want to stop the tick and you can then remove
> > the stop_tick argument.
>
> So I guess I would then use ts->idle_entrytime as "now" in the
> tick_nohz_get_sleep_length() computation, right?

Ah right, I missed the need for ktime_get().

You can't use ts->idle_entrytime in tick_nohz_get_sleep_length() because
full dynticks doesn't rely on it.

But I think you can just do the following, with a comment explaining that
idle_entrytime is expected to be fresh enough at this point:

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 57b3de4..8e61796 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -923,7 +923,7 @@ static void __tick_nohz_idle_enter(struct tick_sched *ts, bool stop_tick)

ts->idle_calls++;

- expires = tick_nohz_stop_sched_tick(ts, now, cpu);
+ expires = tick_nohz_stop_sched_tick(ts, ts->idle_entrytime, cpu);
if (expires > 0LL) {
ts->idle_sleeps++;
ts->idle_expires = expires;


2018-03-08 15:20:38

by Doug Smythies

[permalink] [raw]
Subject: RE: [RFC/RFT][PATCH v2 0/6] sched/cpuidle: Idle loop rework

On 2018.03.07 17:29 Doug wrote:
> On 2018.03.07 14:12 Rafael J. Wysocki wrote:
>> On Wed, Mar 7, 2018 at 6:04 PM, Doug Smythies <[email protected]> wrote:
>>> On 2018.03.06 12:57 Rafael J. Wysocki wrote:
>>
>>> During the test I got some messages (I also got some with the V1 patch set):
>
>> Can you please check if that's reporoducible with just the first three
>> patches in the series applied?
>
> I will.

No issues. Test duration: 7 hours and 26 minutes.
Boot seems normal. Times re-booted 4.

... Doug



2018-03-08 16:19:08

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC/RFT][PATCH v2 0/6] sched/cpuidle: Idle loop rework

On Thu, 2018-03-08 at 07:18 -0800, Doug Smythies wrote:
> On 2018.03.07 17:29 Doug wrote:
> > On 2018.03.07 14:12 Rafael J. Wysocki wrote:
> > > On Wed, Mar 7, 2018 at 6:04 PM, Doug Smythies <[email protected]
> > > t> wrote:
> > > > On 2018.03.06 12:57 Rafael J. Wysocki wrote:
> > > > During the test I got some messages (I also got some with the
> > > > V1 patch set):
> > > Can you please check if that's reporoducible with just the first
> > > three
> > > patches in the series applied?
> >
> > I will.
>
> No issues. Test duration: 7 hours and 26 minutes.
> Boot seems normal. Times re-booted 4.

I am observing the same issues here.

I'll comb through the code to see if I can
spot the issue :)

--
All Rights Reversed.


Attachments:
signature.asc (499.00 B)
This is a digitally signed message part

2018-03-08 16:35:39

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC/RFT][PATCH v2 1/6] time: tick-sched: Reorganize idle tick management code

On Thu, Mar 8, 2018 at 4:14 PM, Frederic Weisbecker <[email protected]> wrote:
> On Thu, Mar 08, 2018 at 10:22:26AM +0100, Rafael J. Wysocki wrote:
>> On Thursday, March 8, 2018 12:18:29 AM CET Frederic Weisbecker wrote:
>> > On Tue, Mar 06, 2018 at 10:02:01AM +0100, Rafael J. Wysocki wrote:
>> > > Index: linux-pm/kernel/time/tick-sched.c
>> > > ===================================================================
>> > > --- linux-pm.orig/kernel/time/tick-sched.c
>> > > +++ linux-pm/kernel/time/tick-sched.c
>> > > +
>> > > +/**
>> > > + * tick_nohz_idle_prepare - prepare for entering idle on the current CPU.
>> > > + *
>> > > + * Called when we start the idle loop.
>> > > + */
>> > > +void tick_nohz_idle_prepare(void)
>> > > +{
>> > > + struct tick_sched *ts;
>> > > +
>> > > + __tick_nohz_idle_prepare();
>> > > +
>> > > + local_irq_disable();
>> > > +
>> > > + ts = this_cpu_ptr(&tick_cpu_sched);
>> > > + ts->inidle = 1;
>> > > +
>> > > + local_irq_enable();
>> > > +}
>> >
>> > Why not calling tick_nohz_start_idle() from there? This is going to
>> > simplify the rest, you won't need to call tick_nohz_idle_go_idle()
>> > from places that don't want to stop the tick and you can then remove
>> > the stop_tick argument.
>>
>> So I guess I would then use ts->idle_entrytime as "now" in the
>> tick_nohz_get_sleep_length() computation, right?
>
> Ah right, I missed the need for ktime_get().
>
> You can't use ts->idle_entrytime in tick_nohz_get_sleep_length() because
> full dynticks doesn't rely on it.
>
> But I think you can just do the following, with a comment explaining that
> idle_entrytime is expected to be fresh enough at this point:
>
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 57b3de4..8e61796 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -923,7 +923,7 @@ static void __tick_nohz_idle_enter(struct tick_sched *ts, bool stop_tick)
>
> ts->idle_calls++;
>
> - expires = tick_nohz_stop_sched_tick(ts, now, cpu);
> + expires = tick_nohz_stop_sched_tick(ts, ts->idle_entrytime, cpu);
> if (expires > 0LL) {
> ts->idle_sleeps++;
> ts->idle_expires = expires;
>

That's what I was thinking about, but ktime_get() seems to be working too.

2018-03-08 16:37:13

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC/RFT][PATCH v2 0/6] sched/cpuidle: Idle loop rework

On Thu, Mar 8, 2018 at 5:16 PM, Rik van Riel <[email protected]> wrote:
> On Thu, 2018-03-08 at 07:18 -0800, Doug Smythies wrote:
>> On 2018.03.07 17:29 Doug wrote:
>> > On 2018.03.07 14:12 Rafael J. Wysocki wrote:
>> > > On Wed, Mar 7, 2018 at 6:04 PM, Doug Smythies <[email protected]
>> > > t> wrote:
>> > > > On 2018.03.06 12:57 Rafael J. Wysocki wrote:
>> > > > During the test I got some messages (I also got some with the
>> > > > V1 patch set):
>> > > Can you please check if that's reporoducible with just the first
>> > > three
>> > > patches in the series applied?
>> >
>> > I will.
>>
>> No issues. Test duration: 7 hours and 26 minutes.
>> Boot seems normal. Times re-booted 4.
>
> I am observing the same issues here.
>
> I'll comb through the code to see if I can
> spot the issue :)

I found a problem that may be causing this.

You can very well wait for a respin of the series, should be tomorrow. :-)

2018-03-08 16:39:08

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC/RFT][PATCH v2 0/6] sched/cpuidle: Idle loop rework

On Thu, Mar 8, 2018 at 4:18 PM, Doug Smythies <[email protected]> wrote:
> On 2018.03.07 17:29 Doug wrote:
>> On 2018.03.07 14:12 Rafael J. Wysocki wrote:
>>> On Wed, Mar 7, 2018 at 6:04 PM, Doug Smythies <[email protected]> wrote:
>>>> On 2018.03.06 12:57 Rafael J. Wysocki wrote:
>>>
>>>> During the test I got some messages (I also got some with the V1 patch set):
>>
>>> Can you please check if that's reporoducible with just the first three
>>> patches in the series applied?
>>
>> I will.
>
> No issues. Test duration: 7 hours and 26 minutes.
> Boot seems normal. Times re-booted 4.

Cool, thanks!

This means that the other patches introduce the problem, most likely.

2018-03-08 17:03:25

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC/RFT][PATCH v2 1/6] time: tick-sched: Reorganize idle tick management code

On Thu, Mar 08, 2018 at 05:34:20PM +0100, Rafael J. Wysocki wrote:
> On Thu, Mar 8, 2018 at 4:14 PM, Frederic Weisbecker <[email protected]> wrote:
> > On Thu, Mar 08, 2018 at 10:22:26AM +0100, Rafael J. Wysocki wrote:
> >> On Thursday, March 8, 2018 12:18:29 AM CET Frederic Weisbecker wrote:
> >> > On Tue, Mar 06, 2018 at 10:02:01AM +0100, Rafael J. Wysocki wrote:
> >> > > Index: linux-pm/kernel/time/tick-sched.c
> >> > > ===================================================================
> >> > > --- linux-pm.orig/kernel/time/tick-sched.c
> >> > > +++ linux-pm/kernel/time/tick-sched.c
> >> > > +
> >> > > +/**
> >> > > + * tick_nohz_idle_prepare - prepare for entering idle on the current CPU.
> >> > > + *
> >> > > + * Called when we start the idle loop.
> >> > > + */
> >> > > +void tick_nohz_idle_prepare(void)
> >> > > +{
> >> > > + struct tick_sched *ts;
> >> > > +
> >> > > + __tick_nohz_idle_prepare();
> >> > > +
> >> > > + local_irq_disable();
> >> > > +
> >> > > + ts = this_cpu_ptr(&tick_cpu_sched);
> >> > > + ts->inidle = 1;
> >> > > +
> >> > > + local_irq_enable();
> >> > > +}
> >> >
> >> > Why not calling tick_nohz_start_idle() from there? This is going to
> >> > simplify the rest, you won't need to call tick_nohz_idle_go_idle()
> >> > from places that don't want to stop the tick and you can then remove
> >> > the stop_tick argument.
> >>
> >> So I guess I would then use ts->idle_entrytime as "now" in the
> >> tick_nohz_get_sleep_length() computation, right?
> >
> > Ah right, I missed the need for ktime_get().
> >
> > You can't use ts->idle_entrytime in tick_nohz_get_sleep_length() because
> > full dynticks doesn't rely on it.
> >
> > But I think you can just do the following, with a comment explaining that
> > idle_entrytime is expected to be fresh enough at this point:
> >
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index 57b3de4..8e61796 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -923,7 +923,7 @@ static void __tick_nohz_idle_enter(struct tick_sched *ts, bool stop_tick)
> >
> > ts->idle_calls++;
> >
> > - expires = tick_nohz_stop_sched_tick(ts, now, cpu);
> > + expires = tick_nohz_stop_sched_tick(ts, ts->idle_entrytime, cpu);
> > if (expires > 0LL) {
> > ts->idle_sleeps++;
> > ts->idle_expires = expires;
> >
>
> That's what I was thinking about, but ktime_get() seems to be working too.

Well, ktime_get() has its share of overhead, so if we can avoid a call and use
a cache...

2018-03-09 10:00:03

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC/RFT][PATCH v2 0/6] sched/cpuidle: Idle loop rework

On Thu, Mar 8, 2018 at 2:40 PM, Mike Galbraith <[email protected]> wrote:
> On Thu, 2018-03-08 at 12:10 +0100, Rafael J. Wysocki wrote:
>> On Thu, Mar 8, 2018 at 11:31 AM, Mike Galbraith <[email protected]> wrote:
>> 1 2 3
>> > 4.16.0.g1b88acc-master 6.95 7.03 6.91 (virgin)
>> > 4.16.0.g1b88acc-master 7.20 7.25 7.26 (+v2)
>> > 4.16.0.g1b88acc-master 6.90 7.06 6.95 (+local)
>> >
>> > Why would v2 charge the light firefox load a small but consistent fee?
>>
>> Two effects may come into play here I think.
>>
>> One is that allowing the tick to run biases the menu governor's
>> predictions towards the lower end, so we may use shallow states more
>> as a result then (Peter was talking about that).
>
> Hm, I'd expect that to show up in +local as well then, as it keeps the
> tick running when avg_idle < sched_migration_cost (convenient magic
> number), but the firefox load runs at the same wattage as virgin. I'm
> also doing this...
>
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -335,7 +335,7 @@ static int menu_select(struct cpuidle_dr
> * C1's exit latency exceeds the user configured limit.
> */
> polling_threshold = max_t(unsigned int, 20, s->target_residency);
> - if (data->next_timer_us > polling_threshold &&
> + if (expected_interval > polling_threshold &&
> latency_req > s->exit_latency && !s->disabled &&
> !dev->states_usage[1].disable)
> first_idx = 1;
>
> ...to help out high frequency cross core throughput, but the firefox
> load apparently doesn't tickle that, as significant polling would
> surely show in the wattage.

OK, so the second reason sounds more likely to me.

Anyway, please retest with the v3 I've just posted. The previous
iteration had a rather serious issue that might very well influence
the results (it was using stale values sometimes).