2018-04-04 08:55:17

by Rafael J. Wysocki

Subject: [PATCH v9 00/10] sched/cpuidle: Idle loop rework

Hi All,

Thanks a lot for the feedback so far!

For the motivation/summary, please refer to the BZ entry at

https://bugzilla.kernel.org/show_bug.cgi?id=199227

created for collecting information related to this patch series. Some v7.3
testing results from Len and Doug are in there already.

The testing so far shows significant idle power improvements, both in terms of
reducing the average idle power (about 10% on some systems) and in terms of
reducing the idle power noise (in the vast majority of cases, with this series
applied the idle power is mostly stable around the power floor of the system).
The average power is also reduced in some non-idle workloads, and some of
them show performance improvements as well.

It has also been reported that the series generally addresses the problem
that motivated it (i.e. the "powernightmares" issue).

This revision is mostly a re-send of the v8 with three patches changed as
follows.

> Patch 1 prepares the tick-sched code for the subsequent modifications and it
> doesn't change the code's functionality (at least not intentionally).
>
> Patch 2 starts pushing the tick stopping decision deeper into the idle
> loop, but that is limited to do_idle() and tick_nohz_irq_exit().
>
> Patch 3 makes cpuidle_idle_call() decide whether or not to stop the tick
> and sets the stage for the subsequent changes.
>
> Patch 4 is a new one just for the TICK_USEC definition changes.
>
> Patch 5 adds a bool pointer argument to cpuidle_select() and the ->select
> governor callback allowing them to return to the caller a "nohz" hint on
> whether or not to stop the tick. It also adds code to the menu governor to
> decide what value to return as "nohz" and modifies its correction factor
> computations to take the running tick into account if need be.
>
> Patch 6 (which is new) contains some changes that were previously included
> in the big reordering patch (patch [6/8] in the v7). Essentially, it does
> more tick-sched code reorganization in preparation for the subsequent changes
> (and should not modify the functionality).

Patch 7 is a new version of its v8 counterpart. It makes fewer changes to the
existing code and adds a special function to handle the use case it addresses.
It still modifies the hrtimer code, though, to allow the time to the next event
to be returned with one timer excluded (which needs to be done with respect to
the tick timer).

Patch 8 reorders the idle state selection with respect to the stopping of
the tick and causes the additional "nohz" hint from cpuidle_select() to be
used for deciding whether or not to stop the tick. It is a rebased version
of its v8 counterpart.

Patch 9 causes the menu governor to refine the state selection in case the
tick is not going to be stopped and the already selected state does not fit
the interval before the next tick time. It is a new version that avoids
using state 0 if it has been disabled (if state 0 has been disabled, the
governor should only use it when no states are enabled at all).

> Patch 10 deals with the situation in which the tick was stopped previously,
> but the idle governor still predicts short idle (it has not changed).

This series is complementary to the poll_idle() patches discussed recently

https://patchwork.kernel.org/patch/10282237/
https://patchwork.kernel.org/patch/10311775/

that have been merged for v4.17 already.

There is a new git branch containing the current series at

git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
idle-loop-v9

Thanks,
Rafael



2018-04-04 08:54:28

by Rafael J. Wysocki

Subject: [PATCH v9 10/10] cpuidle: menu: Avoid selecting shallow states with stopped tick

From: Rafael J. Wysocki <[email protected]>

If the scheduler tick has been stopped already and the governor
selects a shallow idle state, the CPU can spend a long time in that
state if the selection is based on an inaccurate prediction of idle
time. That effect turns out to be relevant, so it needs to be
mitigated.

To that end, modify the menu governor to discard the result of the
idle time prediction if the tick is stopped and the predicted idle
time is less than the tick period length, unless the tick timer is
going to expire soon.

Signed-off-by: Rafael J. Wysocki <[email protected]>
---

v8 -> v9: No changes.

---
drivers/cpuidle/governors/menu.c | 29 ++++++++++++++++++++++-------
1 file changed, 22 insertions(+), 7 deletions(-)

Index: linux-pm/drivers/cpuidle/governors/menu.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/menu.c
+++ linux-pm/drivers/cpuidle/governors/menu.c
@@ -352,13 +352,28 @@ static int menu_select(struct cpuidle_dr
*/
data->predicted_us = min(data->predicted_us, expected_interval);

- /*
- * Use the performance multiplier and the user-configurable
- * latency_req to determine the maximum exit latency.
- */
- interactivity_req = data->predicted_us / performance_multiplier(nr_iowaiters, cpu_load);
- if (latency_req > interactivity_req)
- latency_req = interactivity_req;
+ if (tick_nohz_tick_stopped()) {
+ /*
+ * If the tick is already stopped, the cost of possible short
+ * idle duration misprediction is much higher, because the CPU
+ * may be stuck in a shallow idle state for a long time as a
+ * result of it. In that case say we might mispredict and try
+ * to force the CPU into a state for which we would have stopped
+ * the tick, unless the tick timer is going to expire really
+ * soon anyway.
+ */
+ if (data->predicted_us < TICK_USEC)
+ data->predicted_us = min_t(unsigned int, TICK_USEC,
+ ktime_to_us(delta_next));
+ } else {
+ /*
+ * Use the performance multiplier and the user-configurable
+ * latency_req to determine the maximum exit latency.
+ */
+ interactivity_req = data->predicted_us / performance_multiplier(nr_iowaiters, cpu_load);
+ if (latency_req > interactivity_req)
+ latency_req = interactivity_req;
+ }

expected_interval = data->predicted_us;
/*


2018-04-04 08:54:30

by Rafael J. Wysocki

Subject: [PATCH v9 02/10] sched: idle: Do not stop the tick upfront in the idle loop

From: Rafael J. Wysocki <[email protected]>

Push the decision whether or not to stop the tick somewhat deeper
into the idle loop.

Stopping the tick upfront leads to unpleasant outcomes in case the
idle governor doesn't agree with the nohz code on the duration of the
upcoming idle period. Specifically, if the tick has been stopped and
the idle governor predicts short idle, the situation is bad regardless
of whether or not the prediction is accurate. If it is accurate, the
tick has been stopped unnecessarily which means excessive overhead.
If it is not accurate, the CPU is likely to spend too much time in
the (shallow, because short idle has been predicted) idle state
selected by the governor [1].

As the first step towards addressing this problem, change the code
to make the tick stopping decision inside of the loop in do_idle().
In particular, do not stop the tick in the cpu_idle_poll() code path.
Also don't do that in tick_nohz_irq_exit(), which doesn't really have
enough information to decide whether or not to stop the tick.

Link: https://marc.info/?l=linux-pm&m=150116085925208&w=2 # [1]
Link: https://tu-dresden.de/zih/forschung/ressourcen/dateien/projekte/haec/powernightmares.pdf
Suggested-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
---

v8 -> v9:
* No changes in the patch.
* Tag from Frederic.

---
include/linux/tick.h | 2 ++
kernel/sched/idle.c | 9 ++++++---
kernel/time/tick-sched.c | 26 ++++++++++++++++++--------
3 files changed, 26 insertions(+), 11 deletions(-)

Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -221,13 +221,13 @@ static void do_idle(void)

__current_set_polling();
tick_nohz_idle_enter();
- tick_nohz_idle_stop_tick_protected();

while (!need_resched()) {
check_pgt_cache();
rmb();

if (cpu_is_offline(cpu)) {
+ tick_nohz_idle_stop_tick_protected();
cpuhp_report_idle_dead();
arch_cpu_idle_dead();
}
@@ -241,10 +241,13 @@ static void do_idle(void)
* broadcast device expired for us, we don't want to go deep
* idle as we know that the IPI is going to arrive right away.
*/
- if (cpu_idle_force_poll || tick_check_broadcast_expired())
+ if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
+ tick_nohz_idle_restart_tick();
cpu_idle_poll();
- else
+ } else {
+ tick_nohz_idle_stop_tick();
cpuidle_idle_call();
+ }
arch_cpu_idle_exit();
}

Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -984,12 +984,10 @@ void tick_nohz_irq_exit(void)
{
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);

- if (ts->inidle) {
+ if (ts->inidle)
tick_nohz_start_idle(ts);
- __tick_nohz_idle_stop_tick(ts);
- } else {
+ else
tick_nohz_full_update_tick(ts);
- }
}

/**
@@ -1050,6 +1048,20 @@ static void tick_nohz_account_idle_ticks
#endif
}

+static void __tick_nohz_idle_restart_tick(struct tick_sched *ts, ktime_t now)
+{
+ tick_nohz_restart_sched_tick(ts, now);
+ tick_nohz_account_idle_ticks(ts);
+}
+
+void tick_nohz_idle_restart_tick(void)
+{
+ struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
+
+ if (ts->tick_stopped)
+ __tick_nohz_idle_restart_tick(ts, ktime_get());
+}
+
/**
* tick_nohz_idle_exit - restart the idle tick from the idle task
*
@@ -1074,10 +1086,8 @@ void tick_nohz_idle_exit(void)
if (ts->idle_active)
tick_nohz_stop_idle(ts, now);

- if (ts->tick_stopped) {
- tick_nohz_restart_sched_tick(ts, now);
- tick_nohz_account_idle_ticks(ts);
- }
+ if (ts->tick_stopped)
+ __tick_nohz_idle_restart_tick(ts, now);

local_irq_enable();
}
Index: linux-pm/include/linux/tick.h
===================================================================
--- linux-pm.orig/include/linux/tick.h
+++ linux-pm/include/linux/tick.h
@@ -115,6 +115,7 @@ enum tick_dep_bits {
extern bool tick_nohz_enabled;
extern int tick_nohz_tick_stopped(void);
extern void tick_nohz_idle_stop_tick(void);
+extern void tick_nohz_idle_restart_tick(void);
extern void tick_nohz_idle_enter(void);
extern void tick_nohz_idle_exit(void);
extern void tick_nohz_irq_exit(void);
@@ -135,6 +136,7 @@ static inline void tick_nohz_idle_stop_t
#define tick_nohz_enabled (0)
static inline int tick_nohz_tick_stopped(void) { return 0; }
static inline void tick_nohz_idle_stop_tick(void) { }
+static inline void tick_nohz_idle_restart_tick(void) { }
static inline void tick_nohz_idle_enter(void) { }
static inline void tick_nohz_idle_exit(void) { }



2018-04-04 08:55:06

by Rafael J. Wysocki

Subject: [PATCH v9 01/10] time: tick-sched: Reorganize idle tick management code

From: Rafael J. Wysocki <[email protected]>

Prepare the scheduler tick code for reworking the idle loop to
avoid stopping the tick in some cases.

The idea is to split the nohz idle entry call to decouple the idle
time stats accounting and preparatory work from the actual tick stop
code, in order to later be able to delay the tick stop once we reach
more power-knowledgeable callers.

Move away the tick_nohz_start_idle() invocation from
__tick_nohz_idle_enter(), rename the latter to
__tick_nohz_idle_stop_tick() and define tick_nohz_idle_stop_tick()
as a wrapper around it for calling it from the outside.

Make tick_nohz_idle_enter() only call tick_nohz_start_idle() instead
of calling the entire __tick_nohz_idle_enter(), add another wrapper
disabling and enabling interrupts around tick_nohz_idle_stop_tick()
and make the current callers of tick_nohz_idle_enter() call it too
to retain their current functionality.

Signed-off-by: Rafael J. Wysocki <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
---

v8 -> v9:
* No changes in the patch.
* Tag from Frederic.

---
arch/x86/xen/smp_pv.c | 1 +
include/linux/tick.h | 12 ++++++++++++
kernel/sched/idle.c | 1 +
kernel/time/tick-sched.c | 46 +++++++++++++++++++++++++---------------------
4 files changed, 39 insertions(+), 21 deletions(-)

Index: linux-pm/include/linux/tick.h
===================================================================
--- linux-pm.orig/include/linux/tick.h
+++ linux-pm/include/linux/tick.h
@@ -114,6 +114,7 @@ enum tick_dep_bits {
#ifdef CONFIG_NO_HZ_COMMON
extern bool tick_nohz_enabled;
extern int tick_nohz_tick_stopped(void);
+extern void tick_nohz_idle_stop_tick(void);
extern void tick_nohz_idle_enter(void);
extern void tick_nohz_idle_exit(void);
extern void tick_nohz_irq_exit(void);
@@ -122,9 +123,18 @@ extern unsigned long tick_nohz_get_idle_
extern unsigned long tick_nohz_get_idle_calls_cpu(int cpu);
extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
+
+static inline void tick_nohz_idle_stop_tick_protected(void)
+{
+ local_irq_disable();
+ tick_nohz_idle_stop_tick();
+ local_irq_enable();
+}
+
#else /* !CONFIG_NO_HZ_COMMON */
#define tick_nohz_enabled (0)
static inline int tick_nohz_tick_stopped(void) { return 0; }
+static inline void tick_nohz_idle_stop_tick(void) { }
static inline void tick_nohz_idle_enter(void) { }
static inline void tick_nohz_idle_exit(void) { }

@@ -134,6 +144,8 @@ static inline ktime_t tick_nohz_get_slee
}
static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
+
+static inline void tick_nohz_idle_stop_tick_protected(void) { }
#endif /* !CONFIG_NO_HZ_COMMON */

#ifdef CONFIG_NO_HZ_FULL
Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -539,14 +539,11 @@ static void tick_nohz_stop_idle(struct t
sched_clock_idle_wakeup_event();
}

-static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
+static void tick_nohz_start_idle(struct tick_sched *ts)
{
- ktime_t now = ktime_get();
-
- ts->idle_entrytime = now;
+ ts->idle_entrytime = ktime_get();
ts->idle_active = 1;
sched_clock_idle_sleep_event();
- return now;
}

/**
@@ -911,19 +908,21 @@ static bool can_stop_idle_tick(int cpu,
return true;
}

-static void __tick_nohz_idle_enter(struct tick_sched *ts)
+static void __tick_nohz_idle_stop_tick(struct tick_sched *ts)
{
- ktime_t now, expires;
+ ktime_t expires;
int cpu = smp_processor_id();

- now = tick_nohz_start_idle(ts);
-
if (can_stop_idle_tick(cpu, ts)) {
int was_stopped = ts->tick_stopped;

ts->idle_calls++;

- expires = tick_nohz_stop_sched_tick(ts, now, cpu);
+ /*
+ * The idle entry time should be a sufficient approximation of
+ * the current time at this point.
+ */
+ expires = tick_nohz_stop_sched_tick(ts, ts->idle_entrytime, cpu);
if (expires > 0LL) {
ts->idle_sleeps++;
ts->idle_expires = expires;
@@ -937,16 +936,19 @@ static void __tick_nohz_idle_enter(struc
}

/**
- * tick_nohz_idle_enter - stop the idle tick from the idle task
+ * tick_nohz_idle_stop_tick - stop the idle tick from the idle task
*
* When the next event is more than a tick into the future, stop the idle tick
- * Called when we start the idle loop.
- *
- * The arch is responsible of calling:
+ */
+void tick_nohz_idle_stop_tick(void)
+{
+ __tick_nohz_idle_stop_tick(this_cpu_ptr(&tick_cpu_sched));
+}
+
+/**
+ * tick_nohz_idle_enter - prepare for entering idle on the current CPU
*
- * - rcu_idle_enter() after its last use of RCU before the CPU is put
- * to sleep.
- * - rcu_idle_exit() before the first use of RCU after the CPU is woken up.
+ * Called when we start the idle loop.
*/
void tick_nohz_idle_enter(void)
{
@@ -965,7 +967,7 @@ void tick_nohz_idle_enter(void)

ts = this_cpu_ptr(&tick_cpu_sched);
ts->inidle = 1;
- __tick_nohz_idle_enter(ts);
+ tick_nohz_start_idle(ts);

local_irq_enable();
}
@@ -982,10 +984,12 @@ void tick_nohz_irq_exit(void)
{
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);

- if (ts->inidle)
- __tick_nohz_idle_enter(ts);
- else
+ if (ts->inidle) {
+ tick_nohz_start_idle(ts);
+ __tick_nohz_idle_stop_tick(ts);
+ } else {
tick_nohz_full_update_tick(ts);
+ }
}

/**
Index: linux-pm/arch/x86/xen/smp_pv.c
===================================================================
--- linux-pm.orig/arch/x86/xen/smp_pv.c
+++ linux-pm/arch/x86/xen/smp_pv.c
@@ -425,6 +425,7 @@ static void xen_pv_play_dead(void) /* us
* data back is to call:
*/
tick_nohz_idle_enter();
+ tick_nohz_idle_stop_tick_protected();

cpuhp_online_idle(CPUHP_AP_ONLINE_IDLE);
}
Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -221,6 +221,7 @@ static void do_idle(void)

__current_set_polling();
tick_nohz_idle_enter();
+ tick_nohz_idle_stop_tick_protected();

while (!need_resched()) {
check_pgt_cache();


2018-04-04 08:55:34

by Rafael J. Wysocki

Subject: [PATCH v9 09/10] cpuidle: menu: Refine idle state selection for running tick

From: Rafael J. Wysocki <[email protected]>

If the tick isn't stopped, the target residency of the state selected
by the menu governor may be greater than the actual time to the next
tick and that means lost energy.

To avoid that, make tick_nohz_get_sleep_length() return the current
time to the next event (before stopping the tick) in addition to the
estimated one via an extra pointer argument and make menu_select()
use that value to refine the state selection when necessary.

Signed-off-by: Rafael J. Wysocki <[email protected]>
---

v8 -> v9:
* Avoid using idle state 0 if it is disabled.
* Rebase.

---
drivers/cpuidle/governors/menu.c | 27 +++++++++++++++++++++++++--
include/linux/tick.h | 7 ++++---
kernel/time/tick-sched.c | 12 ++++++------
3 files changed, 35 insertions(+), 11 deletions(-)

Index: linux-pm/include/linux/tick.h
===================================================================
--- linux-pm.orig/include/linux/tick.h
+++ linux-pm/include/linux/tick.h
@@ -121,7 +121,7 @@ extern void tick_nohz_idle_enter(void);
extern void tick_nohz_idle_exit(void);
extern void tick_nohz_irq_exit(void);
extern bool tick_nohz_idle_got_tick(void);
-extern ktime_t tick_nohz_get_sleep_length(void);
+extern ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next);
extern unsigned long tick_nohz_get_idle_calls(void);
extern unsigned long tick_nohz_get_idle_calls_cpu(int cpu);
extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
@@ -144,9 +144,10 @@ static inline void tick_nohz_idle_enter(
static inline void tick_nohz_idle_exit(void) { }
static inline bool tick_nohz_idle_got_tick(void) { return false; }

-static inline ktime_t tick_nohz_get_sleep_length(void)
+static inline ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
{
- return NSEC_PER_SEC / HZ;
+ *delta_next = NSEC_PER_SEC / HZ;
+ return *delta_next;
}
static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -1047,10 +1047,11 @@ bool tick_nohz_idle_got_tick(void)

/**
* tick_nohz_get_sleep_length - return the expected length of the current sleep
+ * @delta_next: duration until the next event if the tick cannot be stopped
*
* Called from power state control code with interrupts disabled
*/
-ktime_t tick_nohz_get_sleep_length(void)
+ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
{
struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
@@ -1064,12 +1065,14 @@ ktime_t tick_nohz_get_sleep_length(void)

WARN_ON_ONCE(!ts->inidle);

+ *delta_next = ktime_sub(dev->next_event, now);
+
if (!can_stop_idle_tick(cpu, ts))
- goto out_dev;
+ return *delta_next;

next_event = tick_nohz_next_event(ts, cpu);
if (!next_event)
- goto out_dev;
+ return *delta_next;

/*
* If the next highres timer to expire is earlier than next_event, the
@@ -1079,9 +1082,6 @@ ktime_t tick_nohz_get_sleep_length(void)
hrtimer_next_event_without(&ts->sched_timer));

return ktime_sub(next_event, now);
-
-out_dev:
- return ktime_sub(dev->next_event, now);
}

/**
Index: linux-pm/drivers/cpuidle/governors/menu.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/menu.c
+++ linux-pm/drivers/cpuidle/governors/menu.c
@@ -295,6 +295,7 @@ static int menu_select(struct cpuidle_dr
unsigned int expected_interval;
unsigned long nr_iowaiters, cpu_load;
int resume_latency = dev_pm_qos_raw_read_value(device);
+ ktime_t delta_next;

if (data->needs_update) {
menu_update(drv, dev);
@@ -312,7 +313,7 @@ static int menu_select(struct cpuidle_dr
}

/* determine the expected residency time, round up */
- data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length());
+ data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length(&delta_next));

get_iowait_load(&nr_iowaiters, &cpu_load);
data->bucket = which_bucket(data->next_timer_us, nr_iowaiters);
@@ -396,9 +397,31 @@ static int menu_select(struct cpuidle_dr
* expected idle duration is shorter than the tick period length.
*/
if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) ||
- expected_interval < TICK_USEC)
+ expected_interval < TICK_USEC) {
+ unsigned int delta_next_us = ktime_to_us(delta_next);
+
*stop_tick = false;

+ if (!tick_nohz_tick_stopped() && idx > 0 &&
+ drv->states[idx].target_residency > delta_next_us) {
+ /*
+ * The tick is not going to be stopped and the target
+ * residency of the state to be returned is not within
+ * the time until the next timer event including the
+ * tick, so try to correct that.
+ */
+ for (i = idx - 1; i >= 0; i--) {
+ if (drv->states[i].disabled ||
+ dev->states_usage[i].disable)
+ continue;
+
+ idx = i;
+ if (drv->states[i].target_residency <= delta_next_us)
+ break;
+ }
+ }
+ }
+
data->last_state_idx = idx;

return data->last_state_idx;


2018-04-04 08:55:41

by Rafael J. Wysocki

Subject: [PATCH v9 04/10] jiffies: Introduce USER_TICK_USEC and redefine TICK_USEC

From: Rafael J. Wysocki <[email protected]>

Since the subsequent changes will need a TICK_USEC definition
analogous to TICK_NSEC, rename the existing TICK_USEC as
USER_TICK_USEC, update its users and redefine TICK_USEC
accordingly.

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
---

v8 -> v9: No changes.

---
drivers/net/ethernet/sfc/mcdi.c | 2 +-
include/linux/jiffies.h | 7 +++++--
kernel/time/ntp.c | 2 +-
3 files changed, 7 insertions(+), 4 deletions(-)

Index: linux-pm/include/linux/jiffies.h
===================================================================
--- linux-pm.orig/include/linux/jiffies.h
+++ linux-pm/include/linux/jiffies.h
@@ -62,8 +62,11 @@ extern int register_refined_jiffies(long
/* TICK_NSEC is the time between ticks in nsec assuming SHIFTED_HZ */
#define TICK_NSEC ((NSEC_PER_SEC+HZ/2)/HZ)

-/* TICK_USEC is the time between ticks in usec assuming fake USER_HZ */
-#define TICK_USEC ((1000000UL + USER_HZ/2) / USER_HZ)
+/* TICK_USEC is the time between ticks in usec assuming SHIFTED_HZ */
+#define TICK_USEC ((USEC_PER_SEC + HZ/2) / HZ)
+
+/* USER_TICK_USEC is the time between ticks in usec assuming fake USER_HZ */
+#define USER_TICK_USEC ((1000000UL + USER_HZ/2) / USER_HZ)

#ifndef __jiffy_arch_data
#define __jiffy_arch_data
Index: linux-pm/drivers/net/ethernet/sfc/mcdi.c
===================================================================
--- linux-pm.orig/drivers/net/ethernet/sfc/mcdi.c
+++ linux-pm/drivers/net/ethernet/sfc/mcdi.c
@@ -375,7 +375,7 @@ static int efx_mcdi_poll(struct efx_nic
* because generally mcdi responses are fast. After that, back off
* and poll once a jiffy (approximately)
*/
- spins = TICK_USEC;
+ spins = USER_TICK_USEC;
finish = jiffies + MCDI_RPC_TIMEOUT;

while (1) {
Index: linux-pm/kernel/time/ntp.c
===================================================================
--- linux-pm.orig/kernel/time/ntp.c
+++ linux-pm/kernel/time/ntp.c
@@ -31,7 +31,7 @@


/* USER_HZ period (usecs): */
-unsigned long tick_usec = TICK_USEC;
+unsigned long tick_usec = USER_TICK_USEC;

/* SHIFTED_HZ period (nsecs): */
unsigned long tick_nsec;


2018-04-04 08:55:51

by Rafael J. Wysocki

Subject: [PATCH v9 03/10] sched: idle: Do not stop the tick before cpuidle_idle_call()

From: Rafael J. Wysocki <[email protected]>

Make cpuidle_idle_call() decide whether or not to stop the tick.

First, the cpuidle_enter_s2idle() path deals with the tick (and with
the entire timekeeping for that matter) by itself and it doesn't need
the tick to be stopped beforehand.

Second, to address the issue with short idle duration predictions
by the idle governor after the tick has been stopped, it will be
necessary to change the ordering of cpuidle_select() with respect
to tick_nohz_idle_stop_tick(). To prepare for that, put a
tick_nohz_idle_stop_tick() call in the same branch in which
cpuidle_select() is called.

Signed-off-by: Rafael J. Wysocki <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
---

v8 -> v9:
* No changes in the patch.
* Tag from Frederic.

---
kernel/sched/idle.c | 19 +++++++++++++++----
1 file changed, 15 insertions(+), 4 deletions(-)

Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -146,13 +146,15 @@ static void cpuidle_idle_call(void)
}

/*
- * Tell the RCU framework we are entering an idle section,
- * so no more rcu read side critical sections and one more
+ * The RCU framework needs to be told that we are entering an idle
+ * section, so no more rcu read side critical sections and one more
* step to the grace period
*/
- rcu_idle_enter();

if (cpuidle_not_available(drv, dev)) {
+ tick_nohz_idle_stop_tick();
+ rcu_idle_enter();
+
default_idle_call();
goto exit_idle;
}
@@ -169,16 +171,26 @@ static void cpuidle_idle_call(void)

if (idle_should_enter_s2idle() || dev->use_deepest_state) {
if (idle_should_enter_s2idle()) {
+ rcu_idle_enter();
+
entered_state = cpuidle_enter_s2idle(drv, dev);
if (entered_state > 0) {
local_irq_enable();
goto exit_idle;
}
+
+ rcu_idle_exit();
}

+ tick_nohz_idle_stop_tick();
+ rcu_idle_enter();
+
next_state = cpuidle_find_deepest_state(drv, dev);
call_cpuidle(drv, dev, next_state);
} else {
+ tick_nohz_idle_stop_tick();
+ rcu_idle_enter();
+
/*
* Ask the cpuidle framework to choose a convenient idle state.
*/
@@ -245,7 +257,6 @@ static void do_idle(void)
tick_nohz_idle_restart_tick();
cpu_idle_poll();
} else {
- tick_nohz_idle_stop_tick();
cpuidle_idle_call();
}
arch_cpu_idle_exit();


2018-04-04 08:56:42

by Rafael J. Wysocki

Subject: [PATCH v9 06/10] time: tick-sched: Split tick_nohz_stop_sched_tick()

From: Rafael J. Wysocki <[email protected]>

In order to address the issue with short idle duration predictions
by the idle governor after the scheduler tick has been stopped, split
tick_nohz_stop_sched_tick() into two separate routines, one computing
the time to the next timer event and the other simply stopping the
tick when the time to the next timer event is known.

Prepare these two routines to be called separately, as one of them
will be called by the idle governor in the cpuidle_select() code
path after subsequent changes.

Update the former callers of tick_nohz_stop_sched_tick() to use the
new routines, tick_nohz_next_event() and tick_nohz_stop_tick(), and
move the updates of the sleep_length field in struct tick_sched into
__tick_nohz_idle_stop_tick(), as that field doesn't need to be
updated anywhere else.

There should be no intentional visible changes in functionality
resulting from this change.

Signed-off-by: Rafael J. Wysocki <[email protected]>
---

v8 -> v9: No changes.

---
kernel/time/tick-sched.c | 128 +++++++++++++++++++++++++++++------------------
kernel/time/tick-sched.h | 4 +
2 files changed, 84 insertions(+), 48 deletions(-)

Index: linux-pm/kernel/time/tick-sched.h
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.h
+++ linux-pm/kernel/time/tick-sched.h
@@ -39,6 +39,8 @@ enum tick_nohz_mode {
* @idle_sleeptime: Sum of the time slept in idle with sched tick stopped
* @iowait_sleeptime: Sum of the time slept in idle with sched tick stopped, with IO outstanding
* @sleep_length: Duration of the current idle sleep
+ * @timer_expires: Anticipated timer expiration time (in case sched tick is stopped)
+ * @timer_expires_base: Base time clock monotonic for @timer_expires
* @do_timer_lst: CPU was the last one doing do_timer before going idle
*/
struct tick_sched {
@@ -60,6 +62,8 @@ struct tick_sched {
ktime_t iowait_sleeptime;
ktime_t sleep_length;
unsigned long last_jiffies;
+ u64 timer_expires;
+ u64 timer_expires_base;
u64 next_timer;
ktime_t idle_expires;
int do_timer_last;
Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -652,13 +652,10 @@ static inline bool local_timer_softirq_p
return local_softirq_pending() & TIMER_SOFTIRQ;
}

-static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
- ktime_t now, int cpu)
+static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
{
- struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
u64 basemono, next_tick, next_tmr, next_rcu, delta, expires;
unsigned long seq, basejiff;
- ktime_t tick;

/* Read jiffies and the time when jiffies were updated last */
do {
@@ -667,6 +664,7 @@ static ktime_t tick_nohz_stop_sched_tick
basejiff = jiffies;
} while (read_seqretry(&jiffies_lock, seq));
ts->last_jiffies = basejiff;
+ ts->timer_expires_base = basemono;

/*
* Keep the periodic tick, when RCU, architecture or irq_work
@@ -711,32 +709,20 @@ static ktime_t tick_nohz_stop_sched_tick
* next period, so no point in stopping it either, bail.
*/
if (!ts->tick_stopped) {
- tick = 0;
+ ts->timer_expires = 0;
goto out;
}
}

/*
- * If this CPU is the one which updates jiffies, then give up
- * the assignment and let it be taken by the CPU which runs
- * the tick timer next, which might be this CPU as well. If we
- * don't drop this here the jiffies might be stale and
- * do_timer() never invoked. Keep track of the fact that it
- * was the one which had the do_timer() duty last. If this CPU
- * is the one which had the do_timer() duty last, we limit the
- * sleep time to the timekeeping max_deferment value.
+ * If this CPU is the one which had the do_timer() duty last, we limit
+ * the sleep time to the timekeeping max_deferment value.
* Otherwise we can sleep as long as we want.
*/
delta = timekeeping_max_deferment();
- if (cpu == tick_do_timer_cpu) {
- tick_do_timer_cpu = TICK_DO_TIMER_NONE;
- ts->do_timer_last = 1;
- } else if (tick_do_timer_cpu != TICK_DO_TIMER_NONE) {
- delta = KTIME_MAX;
- ts->do_timer_last = 0;
- } else if (!ts->do_timer_last) {
+ if (cpu != tick_do_timer_cpu &&
+ (tick_do_timer_cpu != TICK_DO_TIMER_NONE || !ts->do_timer_last))
delta = KTIME_MAX;
- }

#ifdef CONFIG_NO_HZ_FULL
/* Limit the tick delta to the maximum scheduler deferment */
@@ -750,14 +736,42 @@ static ktime_t tick_nohz_stop_sched_tick
else
expires = KTIME_MAX;

- expires = min_t(u64, expires, next_tick);
- tick = expires;
+ ts->timer_expires = min_t(u64, expires, next_tick);
+
+out:
+ return ts->timer_expires;
+}
+
+static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
+{
+ struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
+ u64 basemono = ts->timer_expires_base;
+ u64 expires = ts->timer_expires;
+ ktime_t tick = expires;
+
+ /* Make sure we won't be trying to stop it twice in a row. */
+ ts->timer_expires_base = 0;
+
+ /*
+ * If this CPU is the one which updates jiffies, then give up
+ * the assignment and let it be taken by the CPU which runs
+ * the tick timer next, which might be this CPU as well. If we
+ * don't drop this here the jiffies might be stale and
+ * do_timer() never invoked. Keep track of the fact that it
+ * was the one which had the do_timer() duty last.
+ */
+ if (cpu == tick_do_timer_cpu) {
+ tick_do_timer_cpu = TICK_DO_TIMER_NONE;
+ ts->do_timer_last = 1;
+ } else if (tick_do_timer_cpu != TICK_DO_TIMER_NONE) {
+ ts->do_timer_last = 0;
+ }

/* Skip reprogram of event if its not changed */
if (ts->tick_stopped && (expires == ts->next_tick)) {
/* Sanity check: make sure clockevent is actually programmed */
if (tick == KTIME_MAX || ts->next_tick == hrtimer_get_expires(&ts->sched_timer))
- goto out;
+ return;

WARN_ON_ONCE(1);
printk_once("basemono: %llu ts->next_tick: %llu dev->next_event: %llu timer->active: %d timer->expires: %llu\n",
@@ -791,7 +805,7 @@ static ktime_t tick_nohz_stop_sched_tick
if (unlikely(expires == KTIME_MAX)) {
if (ts->nohz_mode == NOHZ_MODE_HIGHRES)
hrtimer_cancel(&ts->sched_timer);
- goto out;
+ return;
}

hrtimer_set_expires(&ts->sched_timer, tick);
@@ -800,15 +814,23 @@ static ktime_t tick_nohz_stop_sched_tick
hrtimer_start_expires(&ts->sched_timer, HRTIMER_MODE_ABS_PINNED);
else
tick_program_event(tick, 1);
-out:
- /*
- * Update the estimated sleep length until the next timer
- * (not only the tick).
- */
- ts->sleep_length = ktime_sub(dev->next_event, now);
- return tick;
}

+static void tick_nohz_retain_tick(struct tick_sched *ts)
+{
+ ts->timer_expires_base = 0;
+}
+
+#ifdef CONFIG_NO_HZ_FULL
+static void tick_nohz_stop_sched_tick(struct tick_sched *ts, int cpu)
+{
+ if (tick_nohz_next_event(ts, cpu))
+ tick_nohz_stop_tick(ts, cpu);
+ else
+ tick_nohz_retain_tick(ts);
+}
+#endif /* CONFIG_NO_HZ_FULL */
+
static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
{
/* Update jiffies first */
@@ -844,7 +866,7 @@ static void tick_nohz_full_update_tick(s
return;

if (can_stop_full_tick(cpu, ts))
- tick_nohz_stop_sched_tick(ts, ktime_get(), cpu);
+ tick_nohz_stop_sched_tick(ts, cpu);
else if (ts->tick_stopped)
tick_nohz_restart_sched_tick(ts, ktime_get());
#endif
@@ -870,10 +892,8 @@ static bool can_stop_idle_tick(int cpu,
return false;
}

- if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE)) {
- ts->sleep_length = NSEC_PER_SEC / HZ;
+ if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
return false;
- }

if (need_resched())
return false;
@@ -910,29 +930,37 @@ static bool can_stop_idle_tick(int cpu,

static void __tick_nohz_idle_stop_tick(struct tick_sched *ts)
{
+ struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
ktime_t expires;
int cpu = smp_processor_id();

- if (can_stop_idle_tick(cpu, ts)) {
+ WARN_ON_ONCE(ts->timer_expires_base);
+
+ if (!can_stop_idle_tick(cpu, ts))
+ goto out;
+
+ expires = tick_nohz_next_event(ts, cpu);
+
+ ts->idle_calls++;
+
+ if (expires > 0LL) {
int was_stopped = ts->tick_stopped;

- ts->idle_calls++;
+ tick_nohz_stop_tick(ts, cpu);

- /*
- * The idle entry time should be a sufficient approximation of
- * the current time at this point.
- */
- expires = tick_nohz_stop_sched_tick(ts, ts->idle_entrytime, cpu);
- if (expires > 0LL) {
- ts->idle_sleeps++;
- ts->idle_expires = expires;
- }
+ ts->idle_sleeps++;
+ ts->idle_expires = expires;

if (!was_stopped && ts->tick_stopped) {
ts->idle_jiffies = ts->last_jiffies;
nohz_balance_enter_idle(cpu);
}
+ } else {
+ tick_nohz_retain_tick(ts);
}
+
+out:
+ ts->sleep_length = ktime_sub(dev->next_event, ts->idle_entrytime);
}

/**
@@ -957,7 +985,7 @@ void tick_nohz_idle_enter(void)
lockdep_assert_irqs_enabled();
/*
* Update the idle state in the scheduler domain hierarchy
- * when tick_nohz_stop_sched_tick() is called from the idle loop.
+ * when tick_nohz_stop_tick() is called from the idle loop.
* State will be updated to busy during the first busy tick after
* exiting idle.
*/
@@ -966,6 +994,9 @@ void tick_nohz_idle_enter(void)
local_irq_disable();

ts = this_cpu_ptr(&tick_cpu_sched);
+
+ WARN_ON_ONCE(ts->timer_expires_base);
+
ts->inidle = 1;
tick_nohz_start_idle(ts);

@@ -1091,6 +1122,7 @@ void tick_nohz_idle_exit(void)
local_irq_disable();

WARN_ON_ONCE(!ts->inidle);
+ WARN_ON_ONCE(ts->timer_expires_base);

ts->inidle = 0;



2018-04-04 08:57:07

by Rafael J. Wysocki

Subject: [PATCH v9 05/10] cpuidle: Return nohz hint from cpuidle_select()

From: Rafael J. Wysocki <[email protected]>

Add a new pointer argument to cpuidle_select() and to the ->select
cpuidle governor callback to allow them to return a boolean value
indicating whether or not the tick should be stopped before entering
the selected state.

Make the ladder governor ignore that pointer (to preserve its
current behavior) and make the menu governor return "false" through
it if:
(1) the idle exit latency is constrained at 0, or
(2) the selected state is a polling one, or
(3) the expected idle period duration is within the tick period
range.
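The three conditions above can be sketched in userspace C as follows; TICK_USEC, the polling flag value, and the function name are illustrative stand-ins chosen for this sketch, not the kernel's actual definitions:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for kernel constants; values are assumptions. */
#define TICK_USEC 1000u              /* tick period in us, assuming HZ=1000 */
#define CPUIDLE_FLAG_POLLING 0x01u

/* Sketch of the "stop the tick?" decision described above. */
static bool menu_stop_tick(unsigned int latency_req_us,
			   unsigned int state_flags,
			   unsigned int expected_interval_us)
{
	if (latency_req_us == 0)		/* (1) strict latency limit */
		return false;
	if (state_flags & CPUIDLE_FLAG_POLLING)	/* (2) polling state */
		return false;
	if (expected_interval_us < TICK_USEC)	/* (3) short expected idle */
		return false;
	return true;
}
```

In all three cases the cost of keeping the tick running is low (the CPU is expected to wake up soon anyway), while stopping it would add reprogramming overhead.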

In addition to that, the correction factor computations in the menu
governor need to take the possibility that the tick may not be
stopped into account to avoid artificially small correction factor
values. To that end, add a mechanism to record tick wakeups, as
suggested by Peter Zijlstra, and use it to modify the menu_update()
behavior when tick wakeup occurs. Namely, if the CPU is woken up by
the tick and the return value of tick_nohz_get_sleep_length() is not
within the tick boundary, the predicted idle duration is likely too
short, so make menu_update() try to compensate for that by updating
the governor statistics as though the CPU was idle for a long time.
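A rough userspace sketch of that menu_update() adjustment follows; the constants and the helper name are illustrative assumptions mirroring the diff below, not authoritative kernel definitions:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values; MAX_INTERESTING mirrors the menu governor's cap. */
#define TICK_USEC 1000u
#define MAX_INTERESTING 50000u

/* Sketch of how menu_update() picks the "measured" interval as described
 * above: on a tick wakeup that contradicts a long next-timer estimate,
 * pretend the CPU could have been idle for a long (but finite) time. */
static unsigned int pick_measured_us(bool tick_wakeup,
				     unsigned int next_timer_us,
				     unsigned int last_residency_us,
				     unsigned int exit_latency_us)
{
	if (tick_wakeup && next_timer_us > TICK_USEC)
		return 9 * MAX_INTERESTING / 10;
	if (last_residency_us > 2 * exit_latency_us)
		return last_residency_us - exit_latency_us; /* deduct exit latency */
	return last_residency_us / 2;
}
```

Feeding a large value into the statistics in the tick-wakeup case keeps the correction factors from being dragged down by residencies that were cut short only because the tick was still running.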

Since the value returned through the new argument pointer of
cpuidle_select() is not used by its caller yet, this change by
itself is not expected to alter the functionality of the code.

Signed-off-by: Rafael J. Wysocki <[email protected]>
---

v8 -> v9: No changes.

---
drivers/cpuidle/cpuidle.c | 10 +++++-
drivers/cpuidle/governors/ladder.c | 3 +
drivers/cpuidle/governors/menu.c | 59 +++++++++++++++++++++++++++++--------
include/linux/cpuidle.h | 8 +++--
include/linux/tick.h | 2 +
kernel/sched/idle.c | 4 +-
kernel/time/tick-sched.c | 20 ++++++++++++
7 files changed, 87 insertions(+), 19 deletions(-)

Index: linux-pm/include/linux/cpuidle.h
===================================================================
--- linux-pm.orig/include/linux/cpuidle.h
+++ linux-pm/include/linux/cpuidle.h
@@ -135,7 +135,8 @@ extern bool cpuidle_not_available(struct
struct cpuidle_device *dev);

extern int cpuidle_select(struct cpuidle_driver *drv,
- struct cpuidle_device *dev);
+ struct cpuidle_device *dev,
+ bool *stop_tick);
extern int cpuidle_enter(struct cpuidle_driver *drv,
struct cpuidle_device *dev, int index);
extern void cpuidle_reflect(struct cpuidle_device *dev, int index);
@@ -167,7 +168,7 @@ static inline bool cpuidle_not_available
struct cpuidle_device *dev)
{return true; }
static inline int cpuidle_select(struct cpuidle_driver *drv,
- struct cpuidle_device *dev)
+ struct cpuidle_device *dev, bool *stop_tick)
{return -ENODEV; }
static inline int cpuidle_enter(struct cpuidle_driver *drv,
struct cpuidle_device *dev, int index)
@@ -250,7 +251,8 @@ struct cpuidle_governor {
struct cpuidle_device *dev);

int (*select) (struct cpuidle_driver *drv,
- struct cpuidle_device *dev);
+ struct cpuidle_device *dev,
+ bool *stop_tick);
void (*reflect) (struct cpuidle_device *dev, int index);
};

Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -188,13 +188,15 @@ static void cpuidle_idle_call(void)
next_state = cpuidle_find_deepest_state(drv, dev);
call_cpuidle(drv, dev, next_state);
} else {
+ bool stop_tick = true;
+
tick_nohz_idle_stop_tick();
rcu_idle_enter();

/*
* Ask the cpuidle framework to choose a convenient idle state.
*/
- next_state = cpuidle_select(drv, dev);
+ next_state = cpuidle_select(drv, dev, &stop_tick);
entered_state = call_cpuidle(drv, dev, next_state);
/*
* Give the governor an opportunity to reflect on the outcome
Index: linux-pm/drivers/cpuidle/cpuidle.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/cpuidle.c
+++ linux-pm/drivers/cpuidle/cpuidle.c
@@ -272,12 +272,18 @@ int cpuidle_enter_state(struct cpuidle_d
*
* @drv: the cpuidle driver
* @dev: the cpuidle device
+ * @stop_tick: indication on whether or not to stop the tick
*
* Returns the index of the idle state. The return value must not be negative.
+ *
+ * The memory location pointed to by @stop_tick is expected to be written the
+ * 'false' boolean value if the scheduler tick should not be stopped before
+ * entering the returned state.
*/
-int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
+ bool *stop_tick)
{
- return cpuidle_curr_governor->select(drv, dev);
+ return cpuidle_curr_governor->select(drv, dev, stop_tick);
}

/**
Index: linux-pm/drivers/cpuidle/governors/ladder.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/ladder.c
+++ linux-pm/drivers/cpuidle/governors/ladder.c
@@ -63,9 +63,10 @@ static inline void ladder_do_selection(s
* ladder_select_state - selects the next state to enter
* @drv: cpuidle driver
* @dev: the CPU
+ * @dummy: not used
*/
static int ladder_select_state(struct cpuidle_driver *drv,
- struct cpuidle_device *dev)
+ struct cpuidle_device *dev, bool *dummy)
{
struct ladder_device *ldev = this_cpu_ptr(&ladder_devices);
struct device *device = get_cpu_device(dev->cpu);
Index: linux-pm/drivers/cpuidle/governors/menu.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/menu.c
+++ linux-pm/drivers/cpuidle/governors/menu.c
@@ -123,6 +123,7 @@
struct menu_device {
int last_state_idx;
int needs_update;
+ int tick_wakeup;

unsigned int next_timer_us;
unsigned int predicted_us;
@@ -279,8 +280,10 @@ again:
* menu_select - selects the next idle state to enter
* @drv: cpuidle driver containing state data
* @dev: the CPU
+ * @stop_tick: indication on whether or not to stop the tick
*/
-static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
+ bool *stop_tick)
{
struct menu_device *data = this_cpu_ptr(&menu_devices);
struct device *device = get_cpu_device(dev->cpu);
@@ -303,8 +306,10 @@ static int menu_select(struct cpuidle_dr
latency_req = resume_latency;

/* Special case when user has set very strict latency requirement */
- if (unlikely(latency_req == 0))
+ if (unlikely(latency_req == 0)) {
+ *stop_tick = false;
return 0;
+ }

/* determine the expected residency time, round up */
data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length());
@@ -354,6 +359,7 @@ static int menu_select(struct cpuidle_dr
if (latency_req > interactivity_req)
latency_req = interactivity_req;

+ expected_interval = data->predicted_us;
/*
* Find the idle state with the lowest power while satisfying
* our constraints.
@@ -369,15 +375,30 @@ static int menu_select(struct cpuidle_dr
idx = i; /* first enabled state */
if (s->target_residency > data->predicted_us)
break;
- if (s->exit_latency > latency_req)
+ if (s->exit_latency > latency_req) {
+ /*
+ * If we break out of the loop for latency reasons, use
+ * the target residency of the selected state as the
+ * expected idle duration so that the tick is retained
+ * as long as that target residency is low enough.
+ */
+ expected_interval = drv->states[idx].target_residency;
break;
-
+ }
idx = i;
}

if (idx == -1)
idx = 0; /* No states enabled. Must use 0. */

+ /*
+ * Don't stop the tick if the selected state is a polling one or if the
+ * expected idle duration is shorter than the tick period length.
+ */
+ if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) ||
+ expected_interval < TICK_USEC)
+ *stop_tick = false;
+
data->last_state_idx = idx;

return data->last_state_idx;
@@ -397,6 +418,7 @@ static void menu_reflect(struct cpuidle_

data->last_state_idx = index;
data->needs_update = 1;
+ data->tick_wakeup = tick_nohz_idle_got_tick();
}

/**
@@ -427,14 +449,27 @@ static void menu_update(struct cpuidle_d
* assume the state was never reached and the exit latency is 0.
*/

- /* measured value */
- measured_us = cpuidle_get_last_residency(dev);
-
- /* Deduct exit latency */
- if (measured_us > 2 * target->exit_latency)
- measured_us -= target->exit_latency;
- else
- measured_us /= 2;
+ if (data->tick_wakeup && data->next_timer_us > TICK_USEC) {
+ /*
+ * The nohz code said that there wouldn't be any events within
+ * the tick boundary (if the tick was stopped), but the idle
+ * duration predictor had a differing opinion. Since the CPU
+ * was woken up by a tick (that wasn't stopped after all), the
+ * predictor was not quite right, so assume that the CPU could
+ * have been idle long (but not forever) to help the idle
+ * duration predictor do a better job next time.
+ */
+ measured_us = 9 * MAX_INTERESTING / 10;
+ } else {
+ /* measured value */
+ measured_us = cpuidle_get_last_residency(dev);
+
+ /* Deduct exit latency */
+ if (measured_us > 2 * target->exit_latency)
+ measured_us -= target->exit_latency;
+ else
+ measured_us /= 2;
+ }

/* Make sure our coefficients do not exceed unity */
if (measured_us > data->next_timer_us)
Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -991,6 +991,20 @@ void tick_nohz_irq_exit(void)
}

/**
+ * tick_nohz_idle_got_tick - Check whether or not the tick handler has run
+ */
+bool tick_nohz_idle_got_tick(void)
+{
+ struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
+
+ if (ts->inidle > 1) {
+ ts->inidle = 1;
+ return true;
+ }
+ return false;
+}
+
+/**
* tick_nohz_get_sleep_length - return the length of the current sleep
*
* Called from power state control code with interrupts disabled
@@ -1101,6 +1115,9 @@ static void tick_nohz_handler(struct clo
struct pt_regs *regs = get_irq_regs();
ktime_t now = ktime_get();

+ if (ts->inidle)
+ ts->inidle = 2;
+
dev->next_event = KTIME_MAX;

tick_sched_do_timer(now);
@@ -1198,6 +1215,9 @@ static enum hrtimer_restart tick_sched_t
struct pt_regs *regs = get_irq_regs();
ktime_t now = ktime_get();

+ if (ts->inidle)
+ ts->inidle = 2;
+
tick_sched_do_timer(now);

/*
Index: linux-pm/include/linux/tick.h
===================================================================
--- linux-pm.orig/include/linux/tick.h
+++ linux-pm/include/linux/tick.h
@@ -119,6 +119,7 @@ extern void tick_nohz_idle_restart_tick(
extern void tick_nohz_idle_enter(void);
extern void tick_nohz_idle_exit(void);
extern void tick_nohz_irq_exit(void);
+extern bool tick_nohz_idle_got_tick(void);
extern ktime_t tick_nohz_get_sleep_length(void);
extern unsigned long tick_nohz_get_idle_calls(void);
extern unsigned long tick_nohz_get_idle_calls_cpu(int cpu);
@@ -139,6 +140,7 @@ static inline void tick_nohz_idle_stop_t
static inline void tick_nohz_idle_restart_tick(void) { }
static inline void tick_nohz_idle_enter(void) { }
static inline void tick_nohz_idle_exit(void) { }
+static inline bool tick_nohz_idle_got_tick(void) { return false; }

static inline ktime_t tick_nohz_get_sleep_length(void)
{


2018-04-04 08:57:33

by Rafael J. Wysocki

[permalink] [raw]
Subject: [PATCH v9 07/10] time: hrtimer: Introduce hrtimer_next_event_without()

From: Rafael J. Wysocki <[email protected]>

The next set of changes will need to compute the time to the next
hrtimer event over all hrtimers except for the scheduler tick one.

To that end introduce a new helper function,
hrtimer_next_event_without(), for computing the time until the next
hrtimer event over all timers except for one and modify the underlying
code in __hrtimer_next_event_base() to prepare it for being called by
that new function.

No intentional changes in functionality.
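The idea can be sketched in userspace as follows, modeling the per-base timerqueue as an array sorted by expiry and identifying the excluded timer by address, as __hrtimer_next_event_base() does; the names here are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define EXPIRES_NONE UINT64_MAX

/* Sketch of "next expiry over all timers except one": walk the sorted
 * queue and skip the entry whose address matches the excluded timer. */
static uint64_t next_expiry_without(const uint64_t *expires, size_t n,
				    const uint64_t *exclude)
{
	for (size_t i = 0; i < n; i++) {
		if (&expires[i] == exclude)
			continue;	/* skip the excluded timer */
		return expires[i];	/* sorted: first remaining entry wins */
	}
	return EXPIRES_NONE;		/* nothing pending */
}
```

Because the queue is sorted, at most one extra step past the head is ever needed, which is why the real patch only has to look at rb_next() of the first node.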

Signed-off-by: Rafael J. Wysocki <[email protected]>
---

v8 -> v9:
* Make fewer changes to the existing code.
* Add a new helper function for the handling of the use case at hand.

---
include/linux/hrtimer.h | 1
kernel/time/hrtimer.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 54 insertions(+), 2 deletions(-)

Index: linux-pm/include/linux/hrtimer.h
===================================================================
--- linux-pm.orig/include/linux/hrtimer.h
+++ linux-pm/include/linux/hrtimer.h
@@ -426,6 +426,7 @@ static inline ktime_t hrtimer_get_remain
}

extern u64 hrtimer_get_next_event(void);
+extern u64 hrtimer_next_event_without(const struct hrtimer *exclude);

extern bool hrtimer_active(const struct hrtimer *timer);

Index: linux-pm/kernel/time/hrtimer.c
===================================================================
--- linux-pm.orig/kernel/time/hrtimer.c
+++ linux-pm/kernel/time/hrtimer.c
@@ -490,6 +490,7 @@ __next_base(struct hrtimer_cpu_base *cpu
while ((base = __next_base((cpu_base), &(active))))

static ktime_t __hrtimer_next_event_base(struct hrtimer_cpu_base *cpu_base,
+ const struct hrtimer *exclude,
unsigned int active,
ktime_t expires_next)
{
@@ -502,9 +503,24 @@ static ktime_t __hrtimer_next_event_base

next = timerqueue_getnext(&base->active);
timer = container_of(next, struct hrtimer, node);
+ if (timer == exclude) {
+ /* Get to the next timer in the queue. */
+ struct rb_node *rbn = rb_next(&next->node);
+
+ next = rb_entry_safe(rbn, struct timerqueue_node, node);
+ if (!next)
+ continue;
+
+ timer = container_of(next, struct hrtimer, node);
+ }
expires = ktime_sub(hrtimer_get_expires(timer), base->offset);
if (expires < expires_next) {
expires_next = expires;
+
+ /* Skip cpu_base update if a timer is being excluded. */
+ if (exclude)
+ continue;
+
if (timer->is_soft)
cpu_base->softirq_next_timer = timer;
else
@@ -548,7 +564,8 @@ __hrtimer_get_next_event(struct hrtimer_
if (!cpu_base->softirq_activated && (active_mask & HRTIMER_ACTIVE_SOFT)) {
active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
cpu_base->softirq_next_timer = NULL;
- expires_next = __hrtimer_next_event_base(cpu_base, active, KTIME_MAX);
+ expires_next = __hrtimer_next_event_base(cpu_base, NULL,
+ active, KTIME_MAX);

next_timer = cpu_base->softirq_next_timer;
}
@@ -556,7 +573,8 @@ __hrtimer_get_next_event(struct hrtimer_
if (active_mask & HRTIMER_ACTIVE_HARD) {
active = cpu_base->active_bases & HRTIMER_ACTIVE_HARD;
cpu_base->next_timer = next_timer;
- expires_next = __hrtimer_next_event_base(cpu_base, active, expires_next);
+ expires_next = __hrtimer_next_event_base(cpu_base, NULL, active,
+ expires_next);
}

return expires_next;
@@ -1200,6 +1218,39 @@ u64 hrtimer_get_next_event(void)

raw_spin_unlock_irqrestore(&cpu_base->lock, flags);

+ return expires;
+}
+
+/**
+ * hrtimer_next_event_without - time until next expiry event w/o one timer
+ * @exclude: timer to exclude
+ *
+ * Returns the next expiry time over all timers except for the @exclude one or
+ * KTIME_MAX if none of them is pending.
+ */
+u64 hrtimer_next_event_without(const struct hrtimer *exclude)
+{
+ struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
+ u64 expires = KTIME_MAX;
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&cpu_base->lock, flags);
+
+ if (__hrtimer_hres_active(cpu_base)) {
+ unsigned int active;
+
+ if (!cpu_base->softirq_activated) {
+ active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
+ expires = __hrtimer_next_event_base(cpu_base, exclude,
+ active, KTIME_MAX);
+ }
+ active = cpu_base->active_bases & HRTIMER_ACTIVE_HARD;
+ expires = __hrtimer_next_event_base(cpu_base, exclude, active,
+ expires);
+ }
+
+ raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
+
return expires;
}
#endif


2018-04-04 08:58:41

by Rafael J. Wysocki

[permalink] [raw]
Subject: [PATCH v9 08/10] sched: idle: Select idle state before stopping the tick

From: Rafael J. Wysocki <[email protected]>

In order to address the issue with short idle duration predictions
by the idle governor after the scheduler tick has been stopped,
reorder the code in cpuidle_idle_call() so that the governor idle
state selection runs before tick_nohz_idle_stop_tick() and use the
"nohz" hint returned by cpuidle_select() to decide whether or not
to stop the tick.

This isn't straightforward, because menu_select() invokes
tick_nohz_get_sleep_length() to get the time to the next timer
event and the number returned by the latter comes from
__tick_nohz_idle_stop_tick(). Fortunately, however, it is possible
to compute that number without actually stopping the tick and with
the help of the existing code.

Namely, tick_nohz_get_sleep_length() can be made to call
tick_nohz_next_event(), introduced earlier, to get the time to the
next non-highres timer event. If that happens, tick_nohz_next_event()
need not be called by __tick_nohz_idle_stop_tick() again.

If it turns out that the scheduler tick cannot be stopped going
forward or the next timer event is too close for the tick to be
stopped, tick_nohz_get_sleep_length() can simply return the time to
the next event currently programmed into the corresponding clock
event device.

In addition to knowing the return value of tick_nohz_next_event(),
however, tick_nohz_get_sleep_length() needs to know the time to the
next highres timer event, but with the scheduler tick timer excluded,
which can be computed with the help of hrtimer_next_event_without().

The minimum of that number and the tick_nohz_next_event() return
value is the total time to the next timer event with the assumption
that the tick will be stopped. It can be returned to the idle
governor which can use it for predicting idle duration (under the
assumption that the tick will be stopped) and deciding whether or
not it makes sense to stop the tick before putting the CPU into the
selected idle state.
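As a minimal sketch of that combination (the function name and the use of plain nanosecond counters are assumptions made for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the combined estimate described above: the expected sleep
 * length is the earlier of the event computed by tick_nohz_next_event()
 * and the next hrtimer event with the tick timer excluded, taken
 * relative to "now". All values are monotonic-clock nanoseconds. */
static uint64_t expected_sleep_ns(uint64_t tick_next_event,
				  uint64_t hrtimer_next_wo_tick,
				  uint64_t now)
{
	uint64_t next = tick_next_event < hrtimer_next_wo_tick ?
			tick_next_event : hrtimer_next_wo_tick;
	return next - now;
}
```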

With the above, the sleep_length field in struct tick_sched is not
necessary any more, so drop it.

Signed-off-by: Rafael J. Wysocki <[email protected]>
---

v8 -> v9: Rebase on top of the new [07/10].

---
include/linux/tick.h | 2 +
kernel/sched/idle.c | 11 ++++++--
kernel/time/tick-sched.c | 61 +++++++++++++++++++++++++++++++++++++----------
kernel/time/tick-sched.h | 2 -
4 files changed, 59 insertions(+), 17 deletions(-)

Index: linux-pm/include/linux/tick.h
===================================================================
--- linux-pm.orig/include/linux/tick.h
+++ linux-pm/include/linux/tick.h
@@ -115,6 +115,7 @@ enum tick_dep_bits {
extern bool tick_nohz_enabled;
extern int tick_nohz_tick_stopped(void);
extern void tick_nohz_idle_stop_tick(void);
+extern void tick_nohz_idle_retain_tick(void);
extern void tick_nohz_idle_restart_tick(void);
extern void tick_nohz_idle_enter(void);
extern void tick_nohz_idle_exit(void);
@@ -137,6 +138,7 @@ static inline void tick_nohz_idle_stop_t
#define tick_nohz_enabled (0)
static inline int tick_nohz_tick_stopped(void) { return 0; }
static inline void tick_nohz_idle_stop_tick(void) { }
+static inline void tick_nohz_idle_retain_tick(void) { }
static inline void tick_nohz_idle_restart_tick(void) { }
static inline void tick_nohz_idle_enter(void) { }
static inline void tick_nohz_idle_exit(void) { }
Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -190,13 +190,18 @@ static void cpuidle_idle_call(void)
} else {
bool stop_tick = true;

- tick_nohz_idle_stop_tick();
- rcu_idle_enter();
-
/*
* Ask the cpuidle framework to choose a convenient idle state.
*/
next_state = cpuidle_select(drv, dev, &stop_tick);
+
+ if (stop_tick)
+ tick_nohz_idle_stop_tick();
+ else
+ tick_nohz_idle_retain_tick();
+
+ rcu_idle_enter();
+
entered_state = call_cpuidle(drv, dev, next_state);
/*
* Give the governor an opportunity to reflect on the outcome
Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -930,16 +930,19 @@ static bool can_stop_idle_tick(int cpu,

static void __tick_nohz_idle_stop_tick(struct tick_sched *ts)
{
- struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
ktime_t expires;
int cpu = smp_processor_id();

- WARN_ON_ONCE(ts->timer_expires_base);
-
- if (!can_stop_idle_tick(cpu, ts))
- goto out;
-
- expires = tick_nohz_next_event(ts, cpu);
+ /*
+ * If tick_nohz_get_sleep_length() ran tick_nohz_next_event(), the
+ * tick timer expiration time is known already.
+ */
+ if (ts->timer_expires_base)
+ expires = ts->timer_expires;
+ else if (can_stop_idle_tick(cpu, ts))
+ expires = tick_nohz_next_event(ts, cpu);
+ else
+ return;

ts->idle_calls++;

@@ -958,9 +961,6 @@ static void __tick_nohz_idle_stop_tick(s
} else {
tick_nohz_retain_tick(ts);
}
-
-out:
- ts->sleep_length = ktime_sub(dev->next_event, ts->idle_entrytime);
}

/**
@@ -973,6 +973,16 @@ void tick_nohz_idle_stop_tick(void)
__tick_nohz_idle_stop_tick(this_cpu_ptr(&tick_cpu_sched));
}

+void tick_nohz_idle_retain_tick(void)
+{
+ tick_nohz_retain_tick(this_cpu_ptr(&tick_cpu_sched));
+ /*
+ * Undo the effect of get_next_timer_interrupt() called from
+ * tick_nohz_next_event().
+ */
+ timer_clear_idle();
+}
+
/**
* tick_nohz_idle_enter - prepare for entering idle on the current CPU
*
@@ -1036,15 +1046,42 @@ bool tick_nohz_idle_got_tick(void)
}

/**
- * tick_nohz_get_sleep_length - return the length of the current sleep
+ * tick_nohz_get_sleep_length - return the expected length of the current sleep
*
* Called from power state control code with interrupts disabled
*/
ktime_t tick_nohz_get_sleep_length(void)
{
+ struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
+ int cpu = smp_processor_id();
+ /*
+ * The idle entry time is expected to be a sufficient approximation of
+ * the current time at this point.
+ */
+ ktime_t now = ts->idle_entrytime;
+ ktime_t next_event;
+
+ WARN_ON_ONCE(!ts->inidle);
+
+ if (!can_stop_idle_tick(cpu, ts))
+ goto out_dev;
+
+ next_event = tick_nohz_next_event(ts, cpu);
+ if (!next_event)
+ goto out_dev;
+
+ /*
+ * If the next highres timer to expire is earlier than next_event, the
+ * idle governor needs to know that.
+ */
+ next_event = min_t(u64, next_event,
+ hrtimer_next_event_without(&ts->sched_timer));
+
+ return ktime_sub(next_event, now);

- return ts->sleep_length;
+out_dev:
+ return ktime_sub(dev->next_event, now);
}

/**
Index: linux-pm/kernel/time/tick-sched.h
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.h
+++ linux-pm/kernel/time/tick-sched.h
@@ -38,7 +38,6 @@ enum tick_nohz_mode {
* @idle_exittime: Time when the idle state was left
* @idle_sleeptime: Sum of the time slept in idle with sched tick stopped
* @iowait_sleeptime: Sum of the time slept in idle with sched tick stopped, with IO outstanding
- * @sleep_length: Duration of the current idle sleep
* @timer_expires: Anticipated timer expiration time (in case sched tick is stopped)
* @timer_expires_base: Base time clock monotonic for @timer_expires
* @do_timer_lst: CPU was the last one doing do_timer before going idle
@@ -60,7 +59,6 @@ struct tick_sched {
ktime_t idle_exittime;
ktime_t idle_sleeptime;
ktime_t iowait_sleeptime;
- ktime_t sleep_length;
unsigned long last_jiffies;
u64 timer_expires;
u64 timer_expires_base;


2018-04-05 12:29:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v9 09/10] cpuidle: menu: Refine idle state selection for running tick

On Wed, Apr 04, 2018 at 10:49:34AM +0200, Rafael J. Wysocki wrote:
> -static inline ktime_t tick_nohz_get_sleep_length(void)
> +static inline ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
> {
> - return NSEC_PER_SEC / HZ;
> + *delta_next = NSEC_PER_SEC / HZ;
> + return *delta_next;

Shouldn't that be TICK_NSEC ?

2018-04-05 12:33:59

by Peter Zijlstra

Subject: Re: [PATCH v9 09/10] cpuidle: menu: Refine idle state selection for running tick

On Wed, Apr 04, 2018 at 10:49:34AM +0200, Rafael J. Wysocki wrote:
> + data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length(&delta_next));
> + unsigned int delta_next_us = ktime_to_us(delta_next);

We really should be looking at removing all that us nonsense from the
idle code, but that'll be another series.

2018-04-05 12:49:32

by Peter Zijlstra

Subject: Re: [PATCH v9 10/10] cpuidle: menu: Avoid selecting shallow states with stopped tick

On Wed, Apr 04, 2018 at 10:50:36AM +0200, Rafael J. Wysocki wrote:
> + if (tick_nohz_tick_stopped()) {
> + /*
> + * If the tick is already stopped, the cost of possible short
> + * idle duration misprediction is much higher, because the CPU
> + * may be stuck in a shallow idle state for a long time as a
> + * result of it. In that case say we might mispredict and try
> + * to force the CPU into a state for which we would have stopped
> + * the tick, unless the tick timer is going to expire really
> + * soon anyway.

Wait what; the tick was stopped, therefore it _cannot_ expire soon.

*confused*

Did you mean s/tick/a/ ?

> + */
> + if (data->predicted_us < TICK_USEC)
> + data->predicted_us = min_t(unsigned int, TICK_USEC,
> + ktime_to_us(delta_next));
> + } else {
> + /*
> + * Use the performance multiplier and the user-configurable
> + * latency_req to determine the maximum exit latency.
> + */
> + interactivity_req = data->predicted_us / performance_multiplier(nr_iowaiters, cpu_load);
> + if (latency_req > interactivity_req)
> + latency_req = interactivity_req;
> + }
>
> expected_interval = data->predicted_us;
> /*
>

2018-04-05 13:50:55

by Rafael J. Wysocki

Subject: Re: [PATCH v9 10/10] cpuidle: menu: Avoid selecting shallow states with stopped tick

On Thu, Apr 5, 2018 at 2:47 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Apr 04, 2018 at 10:50:36AM +0200, Rafael J. Wysocki wrote:
>> + if (tick_nohz_tick_stopped()) {
>> + /*
>> + * If the tick is already stopped, the cost of possible short
>> + * idle duration misprediction is much higher, because the CPU
>> + * may be stuck in a shallow idle state for a long time as a
>> + * result of it. In that case say we might mispredict and try
>> + * to force the CPU into a state for which we would have stopped
>> + * the tick, unless the tick timer is going to expire really
>> + * soon anyway.
>
> Wait what; the tick was stopped, therefore it _cannot_ expire soon.
>
> *confused*
>
> Did you mean s/tick/a/ ?

Yeah, that should be "a timer".

2018-04-05 13:53:45

by Rafael J. Wysocki

Subject: Re: [PATCH v9 09/10] cpuidle: menu: Refine idle state selection for running tick

On Thu, Apr 5, 2018 at 2:27 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Apr 04, 2018 at 10:49:34AM +0200, Rafael J. Wysocki wrote:
>> -static inline ktime_t tick_nohz_get_sleep_length(void)
>> +static inline ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
>> {
>> - return NSEC_PER_SEC / HZ;
>> + *delta_next = NSEC_PER_SEC / HZ;
>> + return *delta_next;
>
> Shouldn't that be TICK_NSEC ?

Yes, it should.

2018-04-05 13:54:12

by Rafael J. Wysocki

Subject: Re: [PATCH v9 09/10] cpuidle: menu: Refine idle state selection for running tick

On Thu, Apr 5, 2018 at 2:32 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Apr 04, 2018 at 10:49:34AM +0200, Rafael J. Wysocki wrote:
>> + data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length(&delta_next));
>> + unsigned int delta_next_us = ktime_to_us(delta_next);
>
> We really should be looking at removing all that us nonsense from the
> idle code, but that'll be another series.

Right.

2018-04-05 14:13:47

by Peter Zijlstra

Subject: Re: [PATCH v9 10/10] cpuidle: menu: Avoid selecting shallow states with stopped tick

On Thu, Apr 05, 2018 at 03:49:32PM +0200, Rafael J. Wysocki wrote:
> On Thu, Apr 5, 2018 at 2:47 PM, Peter Zijlstra <[email protected]> wrote:
> > On Wed, Apr 04, 2018 at 10:50:36AM +0200, Rafael J. Wysocki wrote:
> >> + if (tick_nohz_tick_stopped()) {
> >> + /*
> >> + * If the tick is already stopped, the cost of possible short
> >> + * idle duration misprediction is much higher, because the CPU
> >> + * may be stuck in a shallow idle state for a long time as a
> >> + * result of it. In that case say we might mispredict and try
> >> + * to force the CPU into a state for which we would have stopped
> >> + * the tick, unless the tick timer is going to expire really
> >> + * soon anyway.
> >
> > Wait what; the tick was stopped, therefore it _cannot_ expire soon.
> >
> > *confused*
> >
> > Did you mean s/tick/a/ ?
>
> Yeah, that should be "a timer".

*phew* ok, that makes a lot more sense ;-)

My only concern with this is that we can now be overly pessimistic. The
predictor might know that statistically it's very likely a device
interrupt will arrive soon, but because the tick is already disabled, we
don't dare trust it, causing possible excessive latencies.

Would an alternative be to make @stop_tick be an enum capable of forcing
the tick back on?

enum tick_action {
	NOHZ_TICK_STOP,
	NOHZ_TICK_RETAIN,
	NOHZ_TICK_START,
};

enum tick_action tick_action = NOHZ_TICK_STOP;

state = cpuidle_select(..., &tick_action);

switch (tick_action) {
case NOHZ_TICK_STOP:
	tick_nohz_stop_tick();
	break;

case NOHZ_TICK_RETAIN:
	tick_nohz_retain_tick();
	break;

case NOHZ_TICK_START:
	tick_nohz_start_tick();
	break;
};


Or something along those lines?

2018-04-05 14:15:31

by Peter Zijlstra

Subject: Re: [PATCH v9 10/10] cpuidle: menu: Avoid selecting shallow states with stopped tick

On Thu, Apr 05, 2018 at 04:11:27PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 05, 2018 at 03:49:32PM +0200, Rafael J. Wysocki wrote:
> > On Thu, Apr 5, 2018 at 2:47 PM, Peter Zijlstra <[email protected]> wrote:
> > > On Wed, Apr 04, 2018 at 10:50:36AM +0200, Rafael J. Wysocki wrote:
> > >> + if (tick_nohz_tick_stopped()) {
> > >> + /*
> > >> + * If the tick is already stopped, the cost of possible short
> > >> + * idle duration misprediction is much higher, because the CPU
> > >> + * may be stuck in a shallow idle state for a long time as a
> > >> + * result of it. In that case say we might mispredict and try
> > >> + * to force the CPU into a state for which we would have stopped
> > >> + * the tick, unless the tick timer is going to expire really
> > >> + * soon anyway.
> > >
> > > Wait what; the tick was stopped, therefore it _cannot_ expire soon.
> > >
> > > *confused*
> > >
> > > Did you mean s/tick/a/ ?
> >
> > Yeah, that should be "a timer".
>
> *phew* ok, that makes a lot more sense ;-)
>
> My only concern with this is that we can now be overly pessimistic. The
> predictor might know that statistically it's very likely a device
> interrupt will arrive soon, but because the tick is already disabled, we
> don't dare trust it, causing possible excessive latencies.
>
> Would an alternative be to make @stop_tick be an enum capable of forcing
> the tick back on?
>
> enum tick_action {
> NOHZ_TICK_STOP,
> NOHZ_TICK_RETAIN,
> NOHZ_TICK_START,
> };
>
> enum tick_action tick_action = NOHZ_TICK_STOP;
>
> state = cpuidle_select(..., &tick_action);
>
> switch (tick_action) {
> case NOHZ_TICK_STOP:
> tick_nohz_stop_tick();
> break;
>
> case NOHZ_TICK_RETAIN:
> tick_nohz_retain_tick();
> break;
>
> case NOHZ_TICK_START:
> tick_nohz_start_tick();
> break;
> };
>
>
> Or something along those lines?

To clarify, RETAIN keeps the status quo: if it's off, it stays off; if
it's on, it stays on.

2018-04-05 14:31:28

by Rafael J. Wysocki

Subject: Re: [PATCH v9 10/10] cpuidle: menu: Avoid selecting shallow states with stopped tick

On Thu, Apr 5, 2018 at 4:13 PM, Peter Zijlstra <[email protected]> wrote:
> On Thu, Apr 05, 2018 at 04:11:27PM +0200, Peter Zijlstra wrote:
>> On Thu, Apr 05, 2018 at 03:49:32PM +0200, Rafael J. Wysocki wrote:
>> > On Thu, Apr 5, 2018 at 2:47 PM, Peter Zijlstra <[email protected]> wrote:
>> > > On Wed, Apr 04, 2018 at 10:50:36AM +0200, Rafael J. Wysocki wrote:
>> > >> + if (tick_nohz_tick_stopped()) {
>> > >> + /*
>> > >> + * If the tick is already stopped, the cost of possible short
>> > >> + * idle duration misprediction is much higher, because the CPU
>> > >> + * may be stuck in a shallow idle state for a long time as a
>> > >> + * result of it. In that case say we might mispredict and try
>> > >> + * to force the CPU into a state for which we would have stopped
>> > >> + * the tick, unless the tick timer is going to expire really
>> > >> + * soon anyway.
>> > >
>> > > Wait what; the tick was stopped, therefore it _cannot_ expire soon.
>> > >
>> > > *confused*
>> > >
>> > > Did you mean s/tick/a/ ?
>> >
>> > Yeah, that should be "a timer".
>>
>> *phew* ok, that makes a lot more sense ;-)
>>
>> My only concern with this is that we can now be overly pessimistic. The
>> predictor might know that statistically it's very likely a device
>> interrupt will arrive soon, but because the tick is already disabled, we
>> don't dare trust it, causing possible excessive latencies.

That's possible, but then we already stopped the tick before and the
CPU was in a deep idle state (or we wouldn't have got here with the
tick stopped), so the extra bit of latency coming from this is not
likely to matter IMO.

And the code can stay simpler this way. :-)

>> Would an alternative be to make @stop_tick be an enum capable of forcing
>> the tick back on?
>>
>> enum tick_action {
>> NOHZ_TICK_STOP,
>> NOHZ_TICK_RETAIN,
>> NOHZ_TICK_START,
>> };
>>
>> enum tick_action tick_action = NOHZ_TICK_STOP;
>>
>> state = cpuidle_select(..., &tick_action);
>>
>> switch (tick_action) {
>> case NOHZ_TICK_STOP:
>> tick_nohz_stop_tick();
>> break;
>>
>> case NOHZ_TICK_RETAIN:
>> tick_nohz_retain_tick();
>> break;
>>
>> case NOHZ_TICK_START:
>> tick_nohz_start_tick();
>> break;
>> };
>>
>>
>> Or something along those lines?
>
> To clarify, RETAIN keeps the status quo: if it's off, it stays off; if
> it's on, it stays on.

That could be done, but I'm not sure if the menu governor has a good
way to decide whether to use NOHZ_TICK_RETAIN or NOHZ_TICK_START.

Doing NOHZ_TICK_START every time the predicted idle duration is within
the tick range may be wasteful.

Besides, enum tick_action and so on can be introduced later if really
needed, but for now it looks like the simpler code gets the job done.

2018-04-06 01:11:18

by Frederic Weisbecker

Subject: Re: [PATCH v9 04/10] jiffies: Introduce USER_TICK_USEC and redefine TICK_USEC

On Wed, Apr 04, 2018 at 10:38:34AM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> Since the subsequent changes will need a TICK_USEC definition
> analogous to TICK_NSEC, rename the existing TICK_USEC as
> USER_TICK_USEC, update its users and redefine TICK_USEC
> accordingly.
>
> Suggested-by: Peter Zijlstra <[email protected]>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
> ---
>
> v8 -> v9: No changes.
>
> ---
> drivers/net/ethernet/sfc/mcdi.c | 2 +-
> include/linux/jiffies.h | 7 +++++--
> kernel/time/ntp.c | 2 +-
> 3 files changed, 7 insertions(+), 4 deletions(-)
>
> Index: linux-pm/include/linux/jiffies.h
> ===================================================================
> --- linux-pm.orig/include/linux/jiffies.h
> +++ linux-pm/include/linux/jiffies.h
> @@ -62,8 +62,11 @@ extern int register_refined_jiffies(long
> /* TICK_NSEC is the time between ticks in nsec assuming SHIFTED_HZ */
> #define TICK_NSEC ((NSEC_PER_SEC+HZ/2)/HZ)
>
> -/* TICK_USEC is the time between ticks in usec assuming fake USER_HZ */
> -#define TICK_USEC ((1000000UL + USER_HZ/2) / USER_HZ)
> +/* TICK_USEC is the time between ticks in usec assuming SHIFTED_HZ */
> +#define TICK_USEC ((USEC_PER_SEC + HZ/2) / HZ)

Nit: SHIFTED_HZ doesn't seem to exist anymore.

Reviewed-by: Frederic Weisbecker <[email protected]>

2018-04-06 02:47:04

by Frederic Weisbecker

Subject: Re: [PATCH v9 05/10] cpuidle: Return nohz hint from cpuidle_select()

On Wed, Apr 04, 2018 at 10:39:50AM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
> Index: linux-pm/kernel/time/tick-sched.c
> ===================================================================
> --- linux-pm.orig/kernel/time/tick-sched.c
> +++ linux-pm/kernel/time/tick-sched.c
> @@ -991,6 +991,20 @@ void tick_nohz_irq_exit(void)
> }
>
> /**
> + * tick_nohz_idle_got_tick - Check whether or not the tick handler has run
> + */
> +bool tick_nohz_idle_got_tick(void)
> +{
> + struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
> +
> + if (ts->inidle > 1) {
> + ts->inidle = 1;
> + return true;
> + }
> + return false;
> +}
> +
> +/**
> * tick_nohz_get_sleep_length - return the length of the current sleep
> *
> * Called from power state control code with interrupts disabled
> @@ -1101,6 +1115,9 @@ static void tick_nohz_handler(struct clo
> struct pt_regs *regs = get_irq_regs();
> ktime_t now = ktime_get();
>
> + if (ts->inidle)
> + ts->inidle = 2;
> +

You can move that to tick_sched_do_timer() to avoid code duplication.

Also these constants are very opaque. And even with proper symbols it wouldn't look
right to extend ts->inidle that way.

Perhaps you should add a field such as ts->got_idle_tick under the boolean fields
after the below patch:

--
From c7b2ca5a4c512517ddfeb9f922d5999f82542ced Mon Sep 17 00:00:00 2001
From: Frederic Weisbecker <[email protected]>
Date: Fri, 6 Apr 2018 04:32:37 +0200
Subject: [PATCH] nohz: Gather tick_sched booleans under a common flag field

This optimizes the space and leaves plenty of room for further flags.

Signed-off-by: Frederic Weisbecker <[email protected]>
---
kernel/time/tick-sched.h | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/time/tick-sched.h b/kernel/time/tick-sched.h
index 954b43d..38f24dc 100644
--- a/kernel/time/tick-sched.h
+++ b/kernel/time/tick-sched.h
@@ -45,14 +45,17 @@ struct tick_sched {
struct hrtimer sched_timer;
unsigned long check_clocks;
enum tick_nohz_mode nohz_mode;
+
+ unsigned int inidle : 1;
+ unsigned int tick_stopped : 1;
+ unsigned int idle_active : 1;
+ unsigned int do_timer_last : 1;
+
ktime_t last_tick;
ktime_t next_tick;
- int inidle;
- int tick_stopped;
unsigned long idle_jiffies;
unsigned long idle_calls;
unsigned long idle_sleeps;
- int idle_active;
ktime_t idle_entrytime;
ktime_t idle_waketime;
ktime_t idle_exittime;
@@ -62,7 +65,6 @@ struct tick_sched {
unsigned long last_jiffies;
u64 next_timer;
ktime_t idle_expires;
- int do_timer_last;
atomic_t tick_dep_mask;
};

--
2.7.4




2018-04-06 07:21:43

by Rafael J. Wysocki

Subject: Re: [PATCH v9 04/10] jiffies: Introduce USER_TICK_USEC and redefine TICK_USEC

On Friday, April 6, 2018 3:09:55 AM CEST Frederic Weisbecker wrote:
> On Wed, Apr 04, 2018 at 10:38:34AM +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <[email protected]>
> >
> > Since the subsequent changes will need a TICK_USEC definition
> > analogous to TICK_NSEC, rename the existing TICK_USEC as
> > USER_TICK_USEC, update its users and redefine TICK_USEC
> > accordingly.
> >
> > Suggested-by: Peter Zijlstra <[email protected]>
> > Signed-off-by: Rafael J. Wysocki <[email protected]>
> > ---
> >
> > v8 -> v9: No changes.
> >
> > ---
> > drivers/net/ethernet/sfc/mcdi.c | 2 +-
> > include/linux/jiffies.h | 7 +++++--
> > kernel/time/ntp.c | 2 +-
> > 3 files changed, 7 insertions(+), 4 deletions(-)
> >
> > Index: linux-pm/include/linux/jiffies.h
> > ===================================================================
> > --- linux-pm.orig/include/linux/jiffies.h
> > +++ linux-pm/include/linux/jiffies.h
> > @@ -62,8 +62,11 @@ extern int register_refined_jiffies(long
> > /* TICK_NSEC is the time between ticks in nsec assuming SHIFTED_HZ */
> > #define TICK_NSEC ((NSEC_PER_SEC+HZ/2)/HZ)
> >
> > -/* TICK_USEC is the time between ticks in usec assuming fake USER_HZ */
> > -#define TICK_USEC ((1000000UL + USER_HZ/2) / USER_HZ)
> > +/* TICK_USEC is the time between ticks in usec assuming SHIFTED_HZ */
> > +#define TICK_USEC ((USEC_PER_SEC + HZ/2) / HZ)
>
> Nit: SHIFTED_HZ doesn't seem to exist anymore.

Well, fair enough, but that would need to be changed along with the TICK_NSEC
comment IMO.

> Reviewed-by: Frederic Weisbecker <[email protected]>

Thanks!


2018-04-06 07:25:58

by Rafael J. Wysocki

Subject: Re: [PATCH v9 05/10] cpuidle: Return nohz hint from cpuidle_select()

On Friday, April 6, 2018 4:44:14 AM CEST Frederic Weisbecker wrote:
> On Wed, Apr 04, 2018 at 10:39:50AM +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <[email protected]>
> > Index: linux-pm/kernel/time/tick-sched.c
> > ===================================================================
> > --- linux-pm.orig/kernel/time/tick-sched.c
> > +++ linux-pm/kernel/time/tick-sched.c
> > @@ -991,6 +991,20 @@ void tick_nohz_irq_exit(void)
> > }
> >
> > /**
> > + * tick_nohz_idle_got_tick - Check whether or not the tick handler has run
> > + */
> > +bool tick_nohz_idle_got_tick(void)
> > +{
> > + struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
> > +
> > + if (ts->inidle > 1) {
> > + ts->inidle = 1;
> > + return true;
> > + }
> > + return false;
> > +}
> > +
> > +/**
> > * tick_nohz_get_sleep_length - return the length of the current sleep
> > *
> > * Called from power state control code with interrupts disabled
> > @@ -1101,6 +1115,9 @@ static void tick_nohz_handler(struct clo
> > struct pt_regs *regs = get_irq_regs();
> > ktime_t now = ktime_get();
> >
> > + if (ts->inidle)
> > + ts->inidle = 2;
> > +
>
> You can move that to tick_sched_do_timer() to avoid code duplication.

Right.

> Also these constants are very opaque. And even with proper symbols it wouldn't look
> right to extend ts->inidle that way.

Well, this was Peter's idea. :-)

> Perhaps you should add a field such as ts->got_idle_tick under the boolean fields
> after the below patch:

OK, but at this point I'd prefer to make such changes on top of the existing
set, because that's got quite some testing coverage already and honestly this
is cosmetic in my view (albeit important).

> --
> From c7b2ca5a4c512517ddfeb9f922d5999f82542ced Mon Sep 17 00:00:00 2001
> From: Frederic Weisbecker <[email protected]>
> Date: Fri, 6 Apr 2018 04:32:37 +0200
> Subject: [PATCH] nohz: Gather tick_sched booleans under a common flag field
>
> This optimizes the space and leaves plenty of room for further flags.
>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> ---
> kernel/time/tick-sched.h | 10 ++++++----
> 1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/time/tick-sched.h b/kernel/time/tick-sched.h
> index 954b43d..38f24dc 100644
> --- a/kernel/time/tick-sched.h
> +++ b/kernel/time/tick-sched.h
> @@ -45,14 +45,17 @@ struct tick_sched {
> struct hrtimer sched_timer;
> unsigned long check_clocks;
> enum tick_nohz_mode nohz_mode;
> +
> + unsigned int inidle : 1;
> + unsigned int tick_stopped : 1;
> + unsigned int idle_active : 1;
> + unsigned int do_timer_last : 1;
> +
> ktime_t last_tick;
> ktime_t next_tick;
> - int inidle;
> - int tick_stopped;
> unsigned long idle_jiffies;
> unsigned long idle_calls;
> unsigned long idle_sleeps;
> - int idle_active;
> ktime_t idle_entrytime;
> ktime_t idle_waketime;
> ktime_t idle_exittime;
> @@ -62,7 +65,6 @@ struct tick_sched {
> unsigned long last_jiffies;
> u64 next_timer;
> ktime_t idle_expires;
> - int do_timer_last;
> atomic_t tick_dep_mask;
> };
>
>



2018-04-06 08:00:40

by Peter Zijlstra

Subject: Re: [PATCH v9 05/10] cpuidle: Return nohz hint from cpuidle_select()

On Fri, Apr 06, 2018 at 04:44:14AM +0200, Frederic Weisbecker wrote:
> You can move that to tick_sched_do_timer() to avoid code duplication.

I expect the reason I didn't was that it didn't have @ts, but that's
easily fixable.

> Also these constants are very opaque. And even with proper symbols it wouldn't look
> right to extend ts->inidle that way.
>
> Perhaps you should add a field such as ts->got_idle_tick under the boolean fields
> after the below patch:

> @@ -45,14 +45,17 @@ struct tick_sched {
> struct hrtimer sched_timer;
> unsigned long check_clocks;
> enum tick_nohz_mode nohz_mode;
> +
> + unsigned int inidle : 1;
> + unsigned int tick_stopped : 1;
> + unsigned int idle_active : 1;
> + unsigned int do_timer_last : 1;

That would generate worse code, but yes, the C might be prettier.

2018-04-06 08:13:39

by Rafael J. Wysocki

Subject: Re: [PATCH v9 05/10] cpuidle: Return nohz hint from cpuidle_select()

On Friday, April 6, 2018 4:44:14 AM CEST Frederic Weisbecker wrote:
> On Wed, Apr 04, 2018 at 10:39:50AM +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <[email protected]>
> > Index: linux-pm/kernel/time/tick-sched.c
> > ===================================================================
> > --- linux-pm.orig/kernel/time/tick-sched.c
> > +++ linux-pm/kernel/time/tick-sched.c
> > @@ -991,6 +991,20 @@ void tick_nohz_irq_exit(void)
> > }
> >
> > /**
> > + * tick_nohz_idle_got_tick - Check whether or not the tick handler has run
> > + */
> > +bool tick_nohz_idle_got_tick(void)
> > +{
> > + struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
> > +
> > + if (ts->inidle > 1) {
> > + ts->inidle = 1;
> > + return true;
> > + }
> > + return false;
> > +}
> > +
> > +/**
> > * tick_nohz_get_sleep_length - return the length of the current sleep
> > *
> > * Called from power state control code with interrupts disabled
> > @@ -1101,6 +1115,9 @@ static void tick_nohz_handler(struct clo
> > struct pt_regs *regs = get_irq_regs();
> > ktime_t now = ktime_get();
> >
> > + if (ts->inidle)
> > + ts->inidle = 2;
> > +
>
> You can move that to tick_sched_do_timer() to avoid code duplication.
>
> Also these constants are very opaque. And even with proper symbols it wouldn't look
> right to extend ts->inidle that way.
>
> Perhaps you should add a field such as ts->got_idle_tick under the boolean fields
> after the below patch:
>
> --
> From c7b2ca5a4c512517ddfeb9f922d5999f82542ced Mon Sep 17 00:00:00 2001
> From: Frederic Weisbecker <[email protected]>
> Date: Fri, 6 Apr 2018 04:32:37 +0200
> Subject: [PATCH] nohz: Gather tick_sched booleans under a common flag field
>
> > This optimizes the space and leaves plenty of room for further flags.
>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> ---
> kernel/time/tick-sched.h | 10 ++++++----
> 1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/time/tick-sched.h b/kernel/time/tick-sched.h
> index 954b43d..38f24dc 100644
> --- a/kernel/time/tick-sched.h
> +++ b/kernel/time/tick-sched.h
> @@ -45,14 +45,17 @@ struct tick_sched {
> struct hrtimer sched_timer;
> unsigned long check_clocks;
> enum tick_nohz_mode nohz_mode;
> +
> + unsigned int inidle : 1;
> + unsigned int tick_stopped : 1;

This particular change breaks the build, because tick_stopped is
accessed via __this_cpu_read() in tick_nohz_tick_stopped().

> + unsigned int idle_active : 1;
> + unsigned int do_timer_last : 1;
> +
> ktime_t last_tick;
> ktime_t next_tick;
> - int inidle;
> - int tick_stopped;
> unsigned long idle_jiffies;
> unsigned long idle_calls;
> unsigned long idle_sleeps;
> - int idle_active;
> ktime_t idle_entrytime;
> ktime_t idle_waketime;
> ktime_t idle_exittime;
> @@ -62,7 +65,6 @@ struct tick_sched {
> unsigned long last_jiffies;
> u64 next_timer;
> ktime_t idle_expires;
> - int do_timer_last;
> atomic_t tick_dep_mask;
> };
>
>

So what about this? And moving the duplicated piece of got_idle_tick
manipulation on top of it?

---
From: Frederic Weisbecker <[email protected]>
Subject: [PATCH] nohz: Gather tick_sched booleans under a common flag field

Optimize the space and leave plenty of room for further flags.

Signed-off-by: Frederic Weisbecker <[email protected]>
[ rjw: Do not use __this_cpu_read() to access tick_stopped and add
got_idle_tick to avoid overloading inidle ]
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
kernel/time/tick-sched.c | 12 +++++++-----
kernel/time/tick-sched.h | 12 ++++++++----
2 files changed, 15 insertions(+), 9 deletions(-)

Index: linux-pm/kernel/time/tick-sched.h
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.h
+++ linux-pm/kernel/time/tick-sched.h
@@ -41,19 +41,24 @@ enum tick_nohz_mode {
* @timer_expires: Anticipated timer expiration time (in case sched tick is stopped)
* @timer_expires_base: Base time clock monotonic for @timer_expires
* @do_timer_lst: CPU was the last one doing do_timer before going idle
+ * @got_idle_tick: Tick timer function has run with @inidle set
*/
struct tick_sched {
struct hrtimer sched_timer;
unsigned long check_clocks;
enum tick_nohz_mode nohz_mode;
+
+ unsigned int inidle : 1;
+ unsigned int tick_stopped : 1;
+ unsigned int idle_active : 1;
+ unsigned int do_timer_last : 1;
+ unsigned int got_idle_tick : 1;
+
ktime_t last_tick;
ktime_t next_tick;
- int inidle;
- int tick_stopped;
unsigned long idle_jiffies;
unsigned long idle_calls;
unsigned long idle_sleeps;
- int idle_active;
ktime_t idle_entrytime;
ktime_t idle_waketime;
ktime_t idle_exittime;
@@ -64,7 +69,6 @@ struct tick_sched {
u64 timer_expires_base;
u64 next_timer;
ktime_t idle_expires;
- int do_timer_last;
atomic_t tick_dep_mask;
};

Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -465,7 +465,9 @@ __setup("nohz=", setup_tick_nohz);

bool tick_nohz_tick_stopped(void)
{
- return __this_cpu_read(tick_cpu_sched.tick_stopped);
+ struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
+
+ return ts->tick_stopped;
}

bool tick_nohz_tick_stopped_cpu(int cpu)
@@ -1014,8 +1016,8 @@ bool tick_nohz_idle_got_tick(void)
{
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);

- if (ts->inidle > 1) {
- ts->inidle = 1;
+ if (ts->got_idle_tick) {
+ ts->got_idle_tick = 0;
return true;
}
return false;
@@ -1161,7 +1163,7 @@ static void tick_nohz_handler(struct clo
ktime_t now = ktime_get();

if (ts->inidle)
- ts->inidle = 2;
+ ts->got_idle_tick = 1;

dev->next_event = KTIME_MAX;

@@ -1261,7 +1263,7 @@ static enum hrtimer_restart tick_sched_t
ktime_t now = ktime_get();

if (ts->inidle)
- ts->inidle = 2;
+ ts->got_idle_tick = 1;

tick_sched_do_timer(now);



2018-04-06 12:58:04

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v9 05/10] cpuidle: Return nohz hint from cpuidle_select()

On Friday, April 6, 2018 10:11:04 AM CEST Rafael J. Wysocki wrote:
> On Friday, April 6, 2018 4:44:14 AM CEST Frederic Weisbecker wrote:
> > On Wed, Apr 04, 2018 at 10:39:50AM +0200, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <[email protected]>
> > > Index: linux-pm/kernel/time/tick-sched.c
> > > ===================================================================
> > > --- linux-pm.orig/kernel/time/tick-sched.c
> > > +++ linux-pm/kernel/time/tick-sched.c
> > > @@ -991,6 +991,20 @@ void tick_nohz_irq_exit(void)
> > > }

[cut]

> So what about this? And moving the duplicated piece of got_idle_tick
> manipulation on top of it?

And appended is a patch to get rid of the code duplication (on top of the one
below), for completeness.

> ---
> From: Frederic Weisbecker <[email protected]>
> Subject: [PATCH] nohz: Gather tick_sched booleans under a common flag field
>
> Optimize the space and leave plenty of room for further flags.
>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> [ rjw: Do not use __this_cpu_read() to access tick_stopped and add
> got_idle_tick to avoid overloading inidle ]
> Signed-off-by: Rafael J. Wysocki <[email protected]>
> ---
> kernel/time/tick-sched.c | 12 +++++++-----
> kernel/time/tick-sched.h | 12 ++++++++----
> 2 files changed, 15 insertions(+), 9 deletions(-)
>
> Index: linux-pm/kernel/time/tick-sched.h
> ===================================================================
> --- linux-pm.orig/kernel/time/tick-sched.h
> +++ linux-pm/kernel/time/tick-sched.h
> @@ -41,19 +41,24 @@ enum tick_nohz_mode {
> * @timer_expires: Anticipated timer expiration time (in case sched tick is stopped)
> * @timer_expires_base: Base time clock monotonic for @timer_expires
> * @do_timer_lst: CPU was the last one doing do_timer before going idle
> + * @got_idle_tick: Tick timer function has run with @inidle set
> */
> struct tick_sched {
> struct hrtimer sched_timer;
> unsigned long check_clocks;
> enum tick_nohz_mode nohz_mode;
> +
> + unsigned int inidle : 1;
> + unsigned int tick_stopped : 1;
> + unsigned int idle_active : 1;
> + unsigned int do_timer_last : 1;
> + unsigned int got_idle_tick : 1;
> +
> ktime_t last_tick;
> ktime_t next_tick;
> - int inidle;
> - int tick_stopped;
> unsigned long idle_jiffies;
> unsigned long idle_calls;
> unsigned long idle_sleeps;
> - int idle_active;
> ktime_t idle_entrytime;
> ktime_t idle_waketime;
> ktime_t idle_exittime;
> @@ -64,7 +69,6 @@ struct tick_sched {
> u64 timer_expires_base;
> u64 next_timer;
> ktime_t idle_expires;
> - int do_timer_last;
> atomic_t tick_dep_mask;
> };
>
> Index: linux-pm/kernel/time/tick-sched.c
> ===================================================================
> --- linux-pm.orig/kernel/time/tick-sched.c
> +++ linux-pm/kernel/time/tick-sched.c
> @@ -465,7 +465,9 @@ __setup("nohz=", setup_tick_nohz);
>
> bool tick_nohz_tick_stopped(void)
> {
> - return __this_cpu_read(tick_cpu_sched.tick_stopped);
> + struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
> +
> + return ts->tick_stopped;
> }
>
> bool tick_nohz_tick_stopped_cpu(int cpu)
> @@ -1014,8 +1016,8 @@ bool tick_nohz_idle_got_tick(void)
> {
> struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
>
> - if (ts->inidle > 1) {
> - ts->inidle = 1;
> + if (ts->got_idle_tick) {
> + ts->got_idle_tick = 0;
> return true;
> }
> return false;
> @@ -1161,7 +1163,7 @@ static void tick_nohz_handler(struct clo
> ktime_t now = ktime_get();
>
> if (ts->inidle)
> - ts->inidle = 2;
> + ts->got_idle_tick = 1;
>
> dev->next_event = KTIME_MAX;
>
> @@ -1261,7 +1263,7 @@ static enum hrtimer_restart tick_sched_t
> ktime_t now = ktime_get();
>
> if (ts->inidle)
> - ts->inidle = 2;
> + ts->got_idle_tick = 1;
>
> tick_sched_do_timer(now);
>

---
From: Rafael J. Wysocki <[email protected]>
Subject: [PATCH] nohz: Avoid duplication of code related to got_idle_tick

Move the code setting ts->got_idle_tick into tick_sched_do_timer() to
avoid code duplication.

No intentional changes in functionality.

Suggested-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
kernel/time/tick-sched.c | 16 ++++++----------
1 file changed, 6 insertions(+), 10 deletions(-)

Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -113,8 +113,7 @@ static ktime_t tick_init_jiffy_update(vo
return period;
}

-
-static void tick_sched_do_timer(ktime_t now)
+static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
{
int cpu = smp_processor_id();

@@ -134,6 +133,9 @@ static void tick_sched_do_timer(ktime_t
/* Check, if the jiffies need an update */
if (tick_do_timer_cpu == cpu)
tick_do_update_jiffies64(now);
+
+ if (ts->inidle)
+ ts->got_idle_tick = 1;
}

static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
@@ -1162,12 +1164,9 @@ static void tick_nohz_handler(struct clo
struct pt_regs *regs = get_irq_regs();
ktime_t now = ktime_get();

- if (ts->inidle)
- ts->got_idle_tick = 1;
-
dev->next_event = KTIME_MAX;

- tick_sched_do_timer(now);
+ tick_sched_do_timer(ts, now);
tick_sched_handle(ts, regs);

/* No need to reprogram if we are running tickless */
@@ -1262,10 +1261,7 @@ static enum hrtimer_restart tick_sched_t
struct pt_regs *regs = get_irq_regs();
ktime_t now = ktime_get();

- if (ts->inidle)
- ts->got_idle_tick = 1;
-
- tick_sched_do_timer(now);
+ tick_sched_do_timer(ts, now);

/*
* Do not call, when we are not in irq context and have


2018-04-06 14:21:01

by Frederic Weisbecker

Subject: Re: [PATCH v9 05/10] cpuidle: Return nohz hint from cpuidle_select()

On Fri, Apr 06, 2018 at 09:24:42AM +0200, Rafael J. Wysocki wrote:
> On Friday, April 6, 2018 4:44:14 AM CEST Frederic Weisbecker wrote:
> > On Wed, Apr 04, 2018 at 10:39:50AM +0200, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <[email protected]>
> > > Index: linux-pm/kernel/time/tick-sched.c
> > > ===================================================================
> > > --- linux-pm.orig/kernel/time/tick-sched.c
> > > +++ linux-pm/kernel/time/tick-sched.c
> > > @@ -991,6 +991,20 @@ void tick_nohz_irq_exit(void)
> > > }
> > >
> > > /**
> > > + * tick_nohz_idle_got_tick - Check whether or not the tick handler has run
> > > + */
> > > +bool tick_nohz_idle_got_tick(void)
> > > +{
> > > + struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
> > > +
> > > + if (ts->inidle > 1) {
> > > + ts->inidle = 1;
> > > + return true;
> > > + }
> > > + return false;
> > > +}
> > > +
> > > +/**
> > > * tick_nohz_get_sleep_length - return the length of the current sleep
> > > *
> > > * Called from power state control code with interrupts disabled
> > > @@ -1101,6 +1115,9 @@ static void tick_nohz_handler(struct clo
> > > struct pt_regs *regs = get_irq_regs();
> > > ktime_t now = ktime_get();
> > >
> > > + if (ts->inidle)
> > > + ts->inidle = 2;
> > > +
> >
> > You can move that to tick_sched_do_timer() to avoid code duplication.
>
> Right.
>
> > Also these constants are very opaque. And even with proper symbols it wouldn't look
> > right to extend ts->inidle that way.
>
> Well, this was Peter's idea. :-)
>
> > Perhaps you should add a field such as ts->got_idle_tick under the boolean fields
> > after the below patch:
>
> OK, but at this point I'd prefer to make such changes on top of the existing
> set, because that's got quite some testing coverage already and honestly this
> is cosmetic in my view (albeit important).

Sure!

Thanks.

2018-04-06 14:25:27

by Frederic Weisbecker

Subject: Re: [PATCH v9 05/10] cpuidle: Return nohz hint from cpuidle_select()

On Fri, Apr 06, 2018 at 09:58:37AM +0200, Peter Zijlstra wrote:
> On Fri, Apr 06, 2018 at 04:44:14AM +0200, Frederic Weisbecker wrote:
> > You can move that to tick_sched_do_timer() to avoid code duplication.
>
> I expect the reason I didn't was that it didn't have @ts, but that's
> easily fixable.
>
> > Also these constants are very opaque. And even with proper symbols it wouldn't look
> > right to extend ts->inidle that way.
> >
> > Perhaps you should add a field such as ts->got_idle_tick under the boolean fields
> > after the below patch:
>
> > @@ -45,14 +45,17 @@ struct tick_sched {
> > struct hrtimer sched_timer;
> > unsigned long check_clocks;
> > enum tick_nohz_mode nohz_mode;
> > +
> > + unsigned int inidle : 1;
> > + unsigned int tick_stopped : 1;
> > + unsigned int idle_active : 1;
> > + unsigned int do_timer_last : 1;
>
> That would generate worse code, but yes, the C might be prettier.

Well, not too bad I think. MOVs become OR/AND; the rest is just
TESTs...

2018-04-06 14:29:47

by Frederic Weisbecker

Subject: Re: [PATCH v9 05/10] cpuidle: Return nohz hint from cpuidle_select()

On Fri, Apr 06, 2018 at 10:11:04AM +0200, Rafael J. Wysocki wrote:
> On Friday, April 6, 2018 4:44:14 AM CEST Frederic Weisbecker wrote:
> > On Wed, Apr 04, 2018 at 10:39:50AM +0200, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <[email protected]>
> > > Index: linux-pm/kernel/time/tick-sched.c
> > > ===================================================================
> > > --- linux-pm.orig/kernel/time/tick-sched.c
> > > +++ linux-pm/kernel/time/tick-sched.c
> > > @@ -991,6 +991,20 @@ void tick_nohz_irq_exit(void)
> > > }
> > >
> > > /**
> > > + * tick_nohz_idle_got_tick - Check whether or not the tick handler has run
> > > + */
> > > +bool tick_nohz_idle_got_tick(void)
> > > +{
> > > + struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
> > > +
> > > + if (ts->inidle > 1) {
> > > + ts->inidle = 1;
> > > + return true;
> > > + }
> > > + return false;
> > > +}
> > > +
> > > +/**
> > > * tick_nohz_get_sleep_length - return the length of the current sleep
> > > *
> > > * Called from power state control code with interrupts disabled
> > > @@ -1101,6 +1115,9 @@ static void tick_nohz_handler(struct clo
> > > struct pt_regs *regs = get_irq_regs();
> > > ktime_t now = ktime_get();
> > >
> > > + if (ts->inidle)
> > > + ts->inidle = 2;
> > > +
> >
> > You can move that to tick_sched_do_timer() to avoid code duplication.
> >
> > Also these constants are very opaque. And even with proper symbols it wouldn't look
> > right to extend ts->inidle that way.
> >
> > Perhaps you should add a field such as ts->got_idle_tick under the boolean fields
> > after the below patch:
> >
> > --
> > From c7b2ca5a4c512517ddfeb9f922d5999f82542ced Mon Sep 17 00:00:00 2001
> > From: Frederic Weisbecker <[email protected]>
> > Date: Fri, 6 Apr 2018 04:32:37 +0200
> > Subject: [PATCH] nohz: Gather tick_sched booleans under a common flag field
> >
> > This optimizes the space and leaves plenty of room for further flags.
> >
> > Signed-off-by: Frederic Weisbecker <[email protected]>
> > ---
> > kernel/time/tick-sched.h | 10 ++++++----
> > 1 file changed, 6 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/time/tick-sched.h b/kernel/time/tick-sched.h
> > index 954b43d..38f24dc 100644
> > --- a/kernel/time/tick-sched.h
> > +++ b/kernel/time/tick-sched.h
> > @@ -45,14 +45,17 @@ struct tick_sched {
> > struct hrtimer sched_timer;
> > unsigned long check_clocks;
> > enum tick_nohz_mode nohz_mode;
> > +
> > + unsigned int inidle : 1;
> > + unsigned int tick_stopped : 1;
>
> This particular change breaks the build, because tick_stopped is
> accessed via __this_cpu_read() in tick_nohz_tick_stopped().

Oops...

>
> > + unsigned int idle_active : 1;
> > + unsigned int do_timer_last : 1;
> > +
> > ktime_t last_tick;
> > ktime_t next_tick;
> > - int inidle;
> > - int tick_stopped;
> > unsigned long idle_jiffies;
> > unsigned long idle_calls;
> > unsigned long idle_sleeps;
> > - int idle_active;
> > ktime_t idle_entrytime;
> > ktime_t idle_waketime;
> > ktime_t idle_exittime;
> > @@ -62,7 +65,6 @@ struct tick_sched {
> > unsigned long last_jiffies;
> > u64 next_timer;
> > ktime_t idle_expires;
> > - int do_timer_last;
> > atomic_t tick_dep_mask;
> > };
> >
> >
>
> So what about this? And moving the duplicated piece of got_idle_tick
> manipulation on top of it?
>
> ---
> From: Frederic Weisbecker <[email protected]>
> Subject: [PATCH] nohz: Gather tick_sched booleans under a common flag field
>
> Optimize the space and leave plenty of room for further flags.
>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> [ rjw: Do not use __this_cpu_read() to access tick_stopped and add
> got_idle_tick to avoid overloading inidle ]
> Signed-off-by: Rafael J. Wysocki <[email protected]>

Yeah looks good, thanks!

2018-04-06 15:30:19

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH v9 05/10] cpuidle: Return nohz hint from cpuidle_select()

On Fri, Apr 06, 2018 at 02:56:36PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
> Subject: [PATCH] nohz: Avoid duplication of code related to got_idle_tick
>
> Move the code setting ts->got_idle_tick into tick_sched_do_timer() to
> avoid code duplication.
>
> No intentional changes in functionality.
>
> Suggested-by: Frederic Weisbecker <[email protected]>
> Signed-off-by: Rafael J. Wysocki <[email protected]>

Reviewed-by: Frederic Weisbecker <[email protected]>

Thanks!

2018-04-07 02:44:58

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH v9 06/10] time: tick-sched: Split tick_nohz_stop_sched_tick()

On Wed, Apr 04, 2018 at 10:41:13AM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> In order to address the issue with short idle duration predictions
> by the idle governor after the scheduler tick has been stopped, split
> tick_nohz_stop_sched_tick() into two separate routines, one computing
> the time to the next timer event and the other simply stopping the
> tick when the time to the next timer event is known.
>
> Prepare these two routines to be called separately, as one of them
> will be called by the idle governor in the cpuidle_select() code
> path after subsequent changes.
>
> Update the former callers of tick_nohz_stop_sched_tick() to use
> the new routines, tick_nohz_next_event() and tick_nohz_stop_tick(),
> instead of it and move the updates of the sleep_length field in
> struct tick_sched into __tick_nohz_idle_stop_tick() as it doesn't
> need to be updated anywhere else.
>
> There should be no intentional visible changes in functionality
> resulting from this change.
>
> Signed-off-by: Rafael J. Wysocki <[email protected]>

Reviewed-by: Frederic Weisbecker <[email protected]>

Thanks! And sorry for the slow reviews, the changes are sensitive
and I want to make sure we are not breaking some subtlety.

2018-04-07 14:51:26

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH v9 07/10] time: hrtimer: Introduce hrtimer_next_event_without()

On Wed, Apr 04, 2018 at 10:45:39AM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> The next set of changes will need to compute the time to the next
> hrtimer event over all hrtimers except for the scheduler tick one.
>
> To that end introduce a new helper function,
> hrtimer_next_event_without(), for computing the time until the next
> hrtimer event over all timers except for one and modify the underlying
> code in __hrtimer_next_event_base() to prepare it for being called by
> that new function.
>
> No intentional changes in functionality.
>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
> ---
>
> v8 -> v9:
> * Make fewer changes to the existing code.
> * Add a new helper function for the handling of the use case at hand.
>
> ---
> include/linux/hrtimer.h | 1
> kernel/time/hrtimer.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++--
> 2 files changed, 54 insertions(+), 2 deletions(-)
>
> Index: linux-pm/include/linux/hrtimer.h
> ===================================================================
> --- linux-pm.orig/include/linux/hrtimer.h
> +++ linux-pm/include/linux/hrtimer.h
> @@ -426,6 +426,7 @@ static inline ktime_t hrtimer_get_remain
> }
>
> extern u64 hrtimer_get_next_event(void);
> +extern u64 hrtimer_next_event_without(const struct hrtimer *exclude);
>
> extern bool hrtimer_active(const struct hrtimer *timer);
>
> Index: linux-pm/kernel/time/hrtimer.c
> ===================================================================
> --- linux-pm.orig/kernel/time/hrtimer.c
> +++ linux-pm/kernel/time/hrtimer.c
> @@ -490,6 +490,7 @@ __next_base(struct hrtimer_cpu_base *cpu
> while ((base = __next_base((cpu_base), &(active))))
>
> static ktime_t __hrtimer_next_event_base(struct hrtimer_cpu_base *cpu_base,
> + const struct hrtimer *exclude,
> unsigned int active,
> ktime_t expires_next)
> {
> @@ -502,9 +503,24 @@ static ktime_t __hrtimer_next_event_base
>
> next = timerqueue_getnext(&base->active);
> timer = container_of(next, struct hrtimer, node);
> + if (timer == exclude) {
> + /* Get to the next timer in the queue. */
> + struct rb_node *rbn = rb_next(&next->node);
> +
> + next = rb_entry_safe(rbn, struct timerqueue_node, node);
> + if (!next)
> + continue;

Minor cosmetic detail again, timerqueue_iterate_next() would do the job and
avoid browsing timerqueue details.

Reviewed-by: Frederic Weisbecker <[email protected]>

Thanks.

2018-04-07 16:40:35

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v9 06/10] time: tick-sched: Split tick_nohz_stop_sched_tick()

On Saturday, April 7, 2018 4:36:25 AM CEST Frederic Weisbecker wrote:
> On Wed, Apr 04, 2018 at 10:41:13AM +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <[email protected]>
> >
> > In order to address the issue with short idle duration predictions
> > by the idle governor after the scheduler tick has been stopped, split
> > tick_nohz_stop_sched_tick() into two separate routines, one computing
> > the time to the next timer event and the other simply stopping the
> > tick when the time to the next timer event is known.
> >
> > Prepare these two routines to be called separately, as one of them
> > will be called by the idle governor in the cpuidle_select() code
> > path after subsequent changes.
> >
> > Update the former callers of tick_nohz_stop_sched_tick() to use
> > the new routines, tick_nohz_next_event() and tick_nohz_stop_tick(),
> > instead of it and move the updates of the sleep_length field in
> > struct tick_sched into __tick_nohz_idle_stop_tick() as it doesn't
> > need to be updated anywhere else.
> >
> > There should be no intentional visible changes in functionality
> > resulting from this change.
> >
> > Signed-off-by: Rafael J. Wysocki <[email protected]>
>
> Reviewed-by: Frederic Weisbecker <[email protected]>
>
> Thanks! And sorry for the slow reviews, the changes are sensitive

Right, very much so.

> and I want to make sure we are not breaking some subtlety.

Thanks a lot for doing that!


2018-04-08 08:47:57

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v9 07/10] time: hrtimer: Introduce hrtimer_next_event_without()

On Saturday, April 7, 2018 4:46:38 PM CEST Frederic Weisbecker wrote:
> On Wed, Apr 04, 2018 at 10:45:39AM +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <[email protected]>
> >
> > The next set of changes will need to compute the time to the next
> > hrtimer event over all hrtimers except for the scheduler tick one.
> >
> > To that end introduce a new helper function,
> > hrtimer_next_event_without(), for computing the time until the next
> > hrtimer event over all timers except for one and modify the underlying
> > code in __hrtimer_next_event_base() to prepare it for being called by
> > that new function.
> >
> > No intentional changes in functionality.
> >
> > Signed-off-by: Rafael J. Wysocki <[email protected]>
> > ---
> >
> > v8 -> v9:
> > * Make fewer changes to the existing code.
> > * Add a new helper function for the handling of the use case at hand.
> >
> > ---
> > include/linux/hrtimer.h | 1
> > kernel/time/hrtimer.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++--
> > 2 files changed, 54 insertions(+), 2 deletions(-)
> >
> > Index: linux-pm/include/linux/hrtimer.h
> > ===================================================================
> > --- linux-pm.orig/include/linux/hrtimer.h
> > +++ linux-pm/include/linux/hrtimer.h
> > @@ -426,6 +426,7 @@ static inline ktime_t hrtimer_get_remain
> > }
> >
> > extern u64 hrtimer_get_next_event(void);
> > +extern u64 hrtimer_next_event_without(const struct hrtimer *exclude);
> >
> > extern bool hrtimer_active(const struct hrtimer *timer);
> >
> > Index: linux-pm/kernel/time/hrtimer.c
> > ===================================================================
> > --- linux-pm.orig/kernel/time/hrtimer.c
> > +++ linux-pm/kernel/time/hrtimer.c
> > @@ -490,6 +490,7 @@ __next_base(struct hrtimer_cpu_base *cpu
> > while ((base = __next_base((cpu_base), &(active))))
> >
> > static ktime_t __hrtimer_next_event_base(struct hrtimer_cpu_base *cpu_base,
> > + const struct hrtimer *exclude,
> > unsigned int active,
> > ktime_t expires_next)
> > {
> > @@ -502,9 +503,24 @@ static ktime_t __hrtimer_next_event_base
> >
> > next = timerqueue_getnext(&base->active);
> > timer = container_of(next, struct hrtimer, node);
> > + if (timer == exclude) {
> > + /* Get to the next timer in the queue. */
> > + struct rb_node *rbn = rb_next(&next->node);
> > +
> > + next = rb_entry_safe(rbn, struct timerqueue_node, node);
> > + if (!next)
> > + continue;
>
> > Minor cosmetic detail again, timerqueue_iterate_next() would do the job and
> avoid browsing timerqueue details.

And below is a patch to make this change on top of the original.

---
From: Rafael J. Wysocki <[email protected]>
Subject: [PATCH] time: hrtimer: Use timerqueue_iterate_next() to get to the next timer

Use timerqueue_iterate_next() to get to the next timer in
__hrtimer_next_event_base() without browsing the timerqueue
details directly.

No intentional changes in functionality.

Suggested-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
kernel/time/hrtimer.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

Index: linux-pm/kernel/time/hrtimer.c
===================================================================
--- linux-pm.orig/kernel/time/hrtimer.c
+++ linux-pm/kernel/time/hrtimer.c
@@ -505,9 +505,7 @@ static ktime_t __hrtimer_next_event_base
timer = container_of(next, struct hrtimer, node);
if (timer == exclude) {
/* Get to the next timer in the queue. */
- struct rb_node *rbn = rb_next(&next->node);
-
- next = rb_entry_safe(rbn, struct timerqueue_node, node);
+ next = timerqueue_iterate_next(next);
if (!next)
continue;



2018-04-08 17:23:09

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v9 00/10] sched/cpuidle: Idle loop rework

On Wednesday, April 4, 2018 10:32:12 AM CEST Rafael J. Wysocki wrote:
> Hi All,
>
> Thanks a lot for the feedback so far!
>
> For the motivation/summary, please refer to the BZ entry at
>
> https://bugzilla.kernel.org/show_bug.cgi?id=199227
>
> created for collecting information related to this patch series. Some v7.3
> testing results from Len and Doug are in there already.
>
> The testing so far shows significant idle power improvements, both in terms of
> reducing the average idle power (about 10% on some systems) and in terms of
> reducing the idle power noise (in the vast majority of cases, with this series
> applied the idle power is mostly stable around the power floor of the system).
> The average power is also reduced in some non-idle workloads and there are
> some performance improvements in them.
>
> It also is reported that the series generally addresses the problem it has been
> motivated by (ie. the "powernightmares" issue).
>
> This revision is mostly a re-send of the v8 with three patches changed as
> follows.
>
> > Patch 1 prepares the tick-sched code for the subsequent modifications and it
> > doesn't change the code's functionality (at least not intentionally).
> >
> > Patch 2 starts pushing the tick stopping decision deeper into the idle
> > loop, but that is limited to do_idle() and tick_nohz_irq_exit().
> >
> > Patch 3 makes cpuidle_idle_call() decide whether or not to stop the tick
> > and sets the stage for the subsequent changes.
> >
> > Patch 4 is a new one just for the TICK_USEC definition changes.
> >
> > Patch 5 adds a bool pointer argument to cpuidle_select() and the ->select
> > governor callback allowing them to return a "nohz" hint on whether or not to
> > stop the tick to the caller. It also adds code to decide what value to
> > return as "nohz" to the menu governor and modifies its correction factor
> > computations to take running tick into account if need be.
> >
> > Patch 6 (which is new) contains some changes that previously were included
> > into the big reordering patch (patch [6/8] in the v7). Essentially, it does
> > more tick-sched code reorganization in preparation for the subsequent changes
> > (and should not modify the functionality).
>
> Patch 7 is a new version of its v8 counterpart. It makes fewer changes to the
> existing code and adds a special function for the handling of the use case it
> is about. It still makes some hrtimer code modifications allowing it to return
> the time to the next event with one timer excluded (which needs to be done with
> respect to the tick timer), though.
>
> Patch 8 reorders the idle state selection with respect to the stopping of
> the tick and causes the additional "nohz" hint from cpuidle_select() to be
> used for deciding whether or not to stop the tick. It is a rebased version
> of its v8 counterpart.
>
> Patch 9 causes the menu governor to refine the state selection in case the
> tick is not going to be stopped and the already selected state does not fit
> the interval before the next tick time. It is a new version that avoids
> using state 0 if it has been disabled (if state 0 has been disabled, the
> governor should only use it when no states are enabled at all).
>
> > Patch 10 deals with the situation in which the tick was stopped previously,
> > but the idle governor still predicts short idle (it has not changed).
>
> This series is complementary to the poll_idle() patches discussed recently
>
> https://patchwork.kernel.org/patch/10282237/
> https://patchwork.kernel.org/patch/10311775/
>
> that have been merged for v4.17 already.
>
> There is a new git branch containing the current series at
>
> git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
> idle-loop-v9

The v9 along with some cleanups suggested by Frederic on top of it and with
ACKs from Peter (obtained on IRC) is now available from the pm-cpuidle branch
in the linux-pm.git tree.

It has been added to my linux-next branch, so it probably will be picked up by
linux-next tomorrow and I have a plan to push it for v4.17 in the second half
of the next week unless a major issue with it is found in the meantime.

Thanks,
Rafael


2018-04-08 18:43:23

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH v9 07/10] time: hrtimer: Introduce hrtimer_next_event_without()

On Sun, Apr 08, 2018 at 10:20:30AM +0200, Rafael J. Wysocki wrote:
> On Saturday, April 7, 2018 4:46:38 PM CEST Frederic Weisbecker wrote:
> > On Wed, Apr 04, 2018 at 10:45:39AM +0200, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <[email protected]>
> > >
> > > The next set of changes will need to compute the time to the next
> > > hrtimer event over all hrtimers except for the scheduler tick one.
> > >
> > > To that end introduce a new helper function,
> > > hrtimer_next_event_without(), for computing the time until the next
> > > hrtimer event over all timers except for one and modify the underlying
> > > code in __hrtimer_next_event_base() to prepare it for being called by
> > > that new function.
> > >
> > > No intentional changes in functionality.
> > >
> > > Signed-off-by: Rafael J. Wysocki <[email protected]>
> > > ---
> > >
> > > v8 -> v9:
> > > * Make fewer changes to the existing code.
> > > * Add a new helper function for the handling of the use case at hand.
> > >
> > > ---
> > > include/linux/hrtimer.h | 1
> > > kernel/time/hrtimer.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++--
> > > 2 files changed, 54 insertions(+), 2 deletions(-)
> > >
> > > Index: linux-pm/include/linux/hrtimer.h
> > > ===================================================================
> > > --- linux-pm.orig/include/linux/hrtimer.h
> > > +++ linux-pm/include/linux/hrtimer.h
> > > @@ -426,6 +426,7 @@ static inline ktime_t hrtimer_get_remain
> > > }
> > >
> > > extern u64 hrtimer_get_next_event(void);
> > > +extern u64 hrtimer_next_event_without(const struct hrtimer *exclude);
> > >
> > > extern bool hrtimer_active(const struct hrtimer *timer);
> > >
> > > Index: linux-pm/kernel/time/hrtimer.c
> > > ===================================================================
> > > --- linux-pm.orig/kernel/time/hrtimer.c
> > > +++ linux-pm/kernel/time/hrtimer.c
> > > @@ -490,6 +490,7 @@ __next_base(struct hrtimer_cpu_base *cpu
> > > while ((base = __next_base((cpu_base), &(active))))
> > >
> > > static ktime_t __hrtimer_next_event_base(struct hrtimer_cpu_base *cpu_base,
> > > + const struct hrtimer *exclude,
> > > unsigned int active,
> > > ktime_t expires_next)
> > > {
> > > @@ -502,9 +503,24 @@ static ktime_t __hrtimer_next_event_base
> > >
> > > next = timerqueue_getnext(&base->active);
> > > timer = container_of(next, struct hrtimer, node);
> > > + if (timer == exclude) {
> > > + /* Get to the next timer in the queue. */
> > > + struct rb_node *rbn = rb_next(&next->node);
> > > +
> > > + next = rb_entry_safe(rbn, struct timerqueue_node, node);
> > > + if (!next)
> > > + continue;
> >
> > Minor cosmetic detail again, timerqueue_iterate_next() would do the job and
> > avoid browsing timerqueue details.
>
> And below is a patch to make this change on top of the original.
>
> ---
> From: Rafael J. Wysocki <[email protected]>
> Subject: [PATCH] time: hrtimer: Use timerqueue_iterate_next() to get to the next timer
>
> Use timerqueue_iterate_next() to get to the next timer in
> __hrtimer_next_event_base() without browsing the timerqueue
> details directly.
>
> No intentional changes in functionality.
>
> Suggested-by: Frederic Weisbecker <[email protected]>
> Signed-off-by: Rafael J. Wysocki <[email protected]>

Thanks!

2018-04-09 03:12:08

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH v9 08/10] sched: idle: Select idle state before stopping the tick

On Wed, Apr 04, 2018 at 10:47:06AM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> In order to address the issue with short idle duration predictions
> by the idle governor after the scheduler tick has been stopped,
> reorder the code in cpuidle_idle_call() so that the governor idle
> state selection runs before tick_nohz_idle_stop_tick() and use the
> "nohz" hint returned by cpuidle_select() to decide whether or not
> to stop the tick.
>
> This isn't straightforward, because menu_select() invokes
> tick_nohz_get_sleep_length() to get the time to the next timer
> event and the number returned by the latter comes from
> __tick_nohz_idle_stop_tick(). Fortunately, however, it is possible
> to compute that number without actually stopping the tick and with
> the help of the existing code.
>
> Namely, tick_nohz_get_sleep_length() can be made to call
> tick_nohz_next_event(), introduced earlier, to get the time to the
> next non-highres timer event. If that happens, tick_nohz_next_event()
> need not be called by __tick_nohz_idle_stop_tick() again.
>
> If it turns out that the scheduler tick cannot be stopped going
> forward or the next timer event is too close for the tick to be
> stopped, tick_nohz_get_sleep_length() can simply return the time to
> the next event currently programmed into the corresponding clock
> event device.
>
> In addition to knowing the return value of tick_nohz_next_event(),
> however, tick_nohz_get_sleep_length() needs to know the time to the
> next highres timer event, but with the scheduler tick timer excluded,
> which can be computed with the help of hrtimer_get_next_event().
>
> The minimum of that number and the tick_nohz_next_event() return
> value is the total time to the next timer event with the assumption
> that the tick will be stopped. It can be returned to the idle
> governor which can use it for predicting idle duration (under the
> assumption that the tick will be stopped) and deciding whether or
> not it makes sense to stop the tick before putting the CPU into the
> selected idle state.
>
> With the above, the sleep_length field in struct tick_sched is not
> necessary any more, so drop it.
>
> Signed-off-by: Rafael J. Wysocki <[email protected]>

Reviewed-by: Frederic Weisbecker <[email protected]>

2018-04-09 16:03:25

by Thomas Ilsche

[permalink] [raw]
Subject: Re: [PATCH v9 00/10] sched/cpuidle: Idle loop rework

On 2018-04-08 18:32, Rafael J. Wysocki wrote:
> The v9 along with some cleanups suggested by Frederic on top of it and with
> ACKs from Peter (obtained on IRC) is now available from the pm-cpuidle branch
> in the linux-pm.git tree.
>
> It has been added to my linux-next branch, so it probably will be picked up by
> linux-next tomorrow and I have a plan to push it for v4.17 in the second half
> of the next week unless a major issue with it is found in the meantime.

Great to hear that. Thanks for all your work.

I'm finishing up some analysis of corner cases, but nothing major.
So I'm glad to see this is moving along. I've been nitpicking a lot,
but this is clearly a huge improvement and there are practical
limitations against a theoretically perfect solution. In any case the
changes will also make future policy adaptations much easier.

Thanks,
Thomas



2018-04-10 07:09:53

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v9 00/10] sched/cpuidle: Idle loop rework

On Monday, April 9, 2018 5:58:32 PM CEST Thomas Ilsche wrote:
>
> On 2018-04-08 18:32, Rafael J. Wysocki wrote:
> > The v9 along with some cleanups suggested by Frederic on top of it and with
> > ACKs from Peter (obtained on IRC) is now available from the pm-cpuidle branch
> > in the linux-pm.git tree.
> >
> > It has been added to my linux-next branch, so it probably will be picked up by
> > linux-next tomorrow and I have a plan to push it for v4.17 in the second half
> > of the next week unless a major issue with it is found in the meantime.
>
>
> Great to hear that. Thanks for all your work.

You're very welcome!

> I'm finishing up some analysis of corner cases, but nothing major.
> So I'm glad to see this is moving along. I've been nitpicking a lot,
> but this is clearly a huge improvement and there are practical
> limitations against a theoretically perfect solution. In any case the
> changes will also make future policy adaptations much easier.

Thanks for your feedback, it would have taken much more time to get to this
point without it.

Thanks,
Rafael