2014-07-19 00:44:29

by Frederic Weisbecker

[permalink] [raw]
Subject: [RFC PATCH 00/10] nohz: Support sysidle (and some more cleanups)

Currently when nohz full is active, the CPU 0 handles timekeeping on
behalf of all other CPUs. This prevents it from ever entering in dynticks
idle mode.

This patchset uses the RCU sysidle feature to allow that. The CPU 0 can
know safely when to sleep and when to wake up, the sysidle code determines
that and takes care of races along the way.

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
nohz/sysidle

Thanks,
Frederic
---

Frederic Weisbecker (10):
irq_work: Introduce void irq work
nohz: Kick full dynticks timer targets with an empty IPI
rcu: Kick full dynticks CPU on extended grace period with a void IRQ
nohz: Appropriate timekeeper kick on sysidle break
smp: Fast path check on IPI list
nohz: Define meaningful symbol for nohz full timekeeper
nohz: Enforce timekeeping on CPU 0
nohz: Fetch timekeeping max deferment only for timekeeper
nohz: Switch nohz full timekeeper to dynticks idle on top of sysidle detection
nohz: Warn on illegal timekeeper switch in nohz full


include/linux/irq_work.h | 1 +
kernel/irq_work.c | 21 +++++++++++++
kernel/rcu/tree_plugin.h | 10 ++++---
kernel/sched/core.c | 2 +-
kernel/smp.c | 11 ++++++-
kernel/time/tick-common.c | 11 ++++---
kernel/time/tick-internal.h | 8 +++++
kernel/time/tick-sched.c | 72 ++++++++++++++++++++++-----------------------
8 files changed, 88 insertions(+), 48 deletions(-)


2014-07-19 00:44:36

by Frederic Weisbecker

[permalink] [raw]
Subject: [PATCH 01/10] irq_work: Introduce void irq work

Being able to trigger an empty IPI appears to be useless in the first
place. Yet it is expected to be very useful for callers who just need
to execute irq_enter() or irq_exit() to a remote target.

More precisely this is going to be useful for the nohz subsystem which
often needs a remote CPU to reschedule its tick when some state changed
and require periodicity for any reason. Similar cases have been handled
with random IPIs until now. But they surely bring unwanted overhead
all along since they are all dedicated for specific tasks.

Triggering an irq work/smp_call_function IPI should be enough to solve
that. If competing and spurious IPIs become a problem, we can still
optimize that later.

Cc: Ingo Molnar <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Viresh Kumar <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
include/linux/irq_work.h | 1 +
kernel/irq_work.c | 21 +++++++++++++++++++++
2 files changed, 22 insertions(+)

diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
index bf9422c..b2ad065 100644
--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -36,6 +36,7 @@ bool irq_work_queue(struct irq_work *work);

#ifdef CONFIG_SMP
bool irq_work_queue_on(struct irq_work *work, int cpu);
+void irq_work_void_on(int cpu);
#endif

void irq_work_run(void);
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 4b0a890..36b7fb2 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -81,6 +81,27 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
return true;
}
EXPORT_SYMBOL_GPL(irq_work_queue_on);
+
+/**
+ * irq_work_void_on(): Run a void IRQ on the target
+ * @cpu: The cpu to run the IRQ on
+ *
+ * Run a void IRQ for its own sake on the target. It's generally
+ * useful for callers which want to run irq_enter() or irq_exit()
+ * on a remote CPU.
+ */
+void irq_work_void_on(int cpu)
+{
+ /*
+ * NOTE: we could optimize that and spare some IPIs
+ * after checking that raised_list isn't empty before raising.
+ * This can't be done properly without cmpxchg() though so
+ * it may make things worse after all. But lets leave that
+ * possibility open in case people report such issue in the
+ * future.
+ */
+ arch_send_call_function_single_ipi(cpu);
+}
#endif

/* Enqueue the irq work @work on the current CPU */
--
1.8.3.1

2014-07-19 00:44:51

by Frederic Weisbecker

[permalink] [raw]
Subject: [PATCH 08/10] nohz: Fetch timekeeping max deferment only for timekeeper

We fetch it unconditionally from the tick stop code whereas only
the timekeeper, or the CPU that carried that duty last, needs it.

Fetching the timekeeping max deferment should be lightweight but it
still involves a few read side barriers and a seqcount that may well
be cache cold for non-timekeepers.

So lets spare it when possible by inverting the way we handle timekeeper
deferment state machine.

Cc: Ingo Molnar <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Viresh Kumar <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
kernel/time/tick-sched.c | 57 ++++++++++++++++++++++++++----------------------
1 file changed, 31 insertions(+), 26 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 2ea2143..bcba79d 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -529,6 +529,36 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
}
EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);

+/*
+ * If this cpu is the one which updates jiffies, then
+ * give up the assignment and let it be taken by the
+ * cpu which runs the tick timer next, which might be
+ * this cpu as well. If we don't drop this here the
+ * jiffies might be stale and do_timer() never
+ * invoked. Keep track of the fact that it was the one
+ * which had the do_timer() duty last. If this cpu is
+ * the one which had the do_timer() duty last, we
+ * limit the sleep time to the timekeeping max deferement
+ * value. Otherwise we can sleep as long as we want.
+ */
+static u64 timekeeping_deferment(struct tick_sched *ts, int cpu)
+{
+ u64 time_delta = KTIME_MAX;
+
+ if (tick_do_timer_cpu == cpu) {
+ time_delta = timekeeping_max_deferment();
+ tick_do_timer_cpu = TICK_DO_TIMER_NONE;
+ ts->do_timer_last = 1;
+ } else if (ts->do_timer_last) {
+ if (tick_do_timer_cpu == TICK_DO_TIMER_NONE)
+ time_delta = timekeeping_max_deferment();
+ else
+ ts->do_timer_last = 0;
+ }
+
+ return time_delta;
+}
+
static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
ktime_t now, int cpu)
{
@@ -536,9 +566,6 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
ktime_t last_update, expires, ret = { .tv64 = 0 };
unsigned long rcu_delta_jiffies;
struct clock_event_device *dev = __get_cpu_var(tick_cpu_device).evtdev;
- u64 time_delta;
-
- time_delta = timekeeping_max_deferment();

/* Read jiffies and the time when jiffies were updated last */
do {
@@ -570,29 +597,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,

/* Schedule the tick, if we are at least one jiffie off */
if ((long)delta_jiffies >= 1) {
-
- /*
- * If this cpu is the one which updates jiffies, then
- * give up the assignment and let it be taken by the
- * cpu which runs the tick timer next, which might be
- * this cpu as well. If we don't drop this here the
- * jiffies might be stale and do_timer() never
- * invoked. Keep track of the fact that it was the one
- * which had the do_timer() duty last. If this cpu is
- * the one which had the do_timer() duty last, we
- * limit the sleep time to the timekeeping
- * max_deferement value which we retrieved
- * above. Otherwise we can sleep as long as we want.
- */
- if (cpu == tick_do_timer_cpu) {
- tick_do_timer_cpu = TICK_DO_TIMER_NONE;
- ts->do_timer_last = 1;
- } else if (tick_do_timer_cpu != TICK_DO_TIMER_NONE) {
- time_delta = KTIME_MAX;
- ts->do_timer_last = 0;
- } else if (!ts->do_timer_last) {
- time_delta = KTIME_MAX;
- }
+ u64 time_delta = timekeeping_deferment(ts, cpu);

#ifdef CONFIG_NO_HZ_FULL
if (!ts->inidle) {
--
1.8.3.1

2014-07-19 00:44:53

by Frederic Weisbecker

[permalink] [raw]
Subject: [PATCH 09/10] nohz: Switch nohz full timekeeper to dynticks idle on top of sysidle detection

In full dynticks, the CPU 0 carries the timekeeping duty on behalf
of all other CPUs in the system. This way full dynticks are left
undisturbed on this regard.

Of course this prevents CPU 0 from entering in dynticks idle mode
because any CPU may need uptodate timekeeping at any time.

Theoretically though, we could put CPU 0 in dynticks idle mode once we
are sure that all other CPUs are dynticks idle as well. Then when a
CPU wakes up and finds the timekeeper idle, it would send an IPI to
wake it up on its duty.

Such a machine state needs to take care of all the races in the way, make
sure that CPU 0 is neither stuck accidentally to sleep for ever, nor
stuck in periodic mode when it could sleep. Also given the amount of
shared data this involves and their access frequency, this must be built
on top of lockless low-overhead state machine.

This is what sysidle provides. The feature is ready for a while, we
were just waiting for the nohz susbsystem to support it. And we just
reached that state.

So lets defer the last call for CPU 0 to enter in dynticks idle to when
we find a full system idle state. And lets wake it up when its duty is
needed.

Cc: Ingo Molnar <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Viresh Kumar <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
kernel/time/tick-sched.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index bcba79d..845aaff 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -547,8 +547,10 @@ static u64 timekeeping_deferment(struct tick_sched *ts, int cpu)

if (tick_do_timer_cpu == cpu) {
time_delta = timekeeping_max_deferment();
- tick_do_timer_cpu = TICK_DO_TIMER_NONE;
ts->do_timer_last = 1;
+ /* In full dynticks mode, CPU 0 always keeps the duty */
+ if (!tick_nohz_full_enabled())
+ tick_do_timer_cpu = TICK_DO_TIMER_NONE;
} else if (ts->do_timer_last) {
if (tick_do_timer_cpu == TICK_DO_TIMER_NONE)
time_delta = timekeeping_max_deferment();
@@ -745,7 +747,7 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
* if there are full dynticks CPUs around
*/
if (tick_do_timer_cpu == cpu)
- return false;
+ return rcu_sys_is_idle();
}

return true;
--
1.8.3.1

2014-07-19 00:44:49

by Frederic Weisbecker

[permalink] [raw]
Subject: [PATCH 07/10] nohz: Enforce timekeeping on CPU 0

The timekeeper gets initialized to the value of the CPU where the
first clockevent device is setup. This works well because the timekeeper
can be any online CPU in most configs.

Full dynticks has its own requirement though and needs the timekeeper
to always be 0. And this requirement seem to accomodate pretty well with
the above described boot timekeeper setting because the first clockevent
device happens to be initialized, most of the time, on the boot CPU
(which should be CPU 0).

However there is no mention of such a guarantee anywhere. This assumption
might well be defeated on some corner case now or in the future.

So lets wipe out the FUD and force tick_do_timer_cpu to CPU 0 on boot
when full dynticks is used.

This way we can even remove some corner case code that handled scenarios
where all clockevent devices were setup on full dynticks CPUs.

Cc: Ingo Molnar <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Viresh Kumar <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
kernel/time/tick-common.c | 6 +++---
kernel/time/tick-sched.c | 6 ------
2 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index 0a0608e..cb57bce 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -179,10 +179,10 @@ static void tick_setup_device(struct tick_device *td,
* this cpu:
*/
if (tick_do_timer_cpu == TICK_DO_TIMER_BOOT) {
- if (!tick_nohz_full_cpu(cpu))
- tick_do_timer_cpu = cpu;
+ if (tick_nohz_full_enabled())
+ tick_do_timer_cpu = TICK_DO_TIMER_DEFAULT;
else
- tick_do_timer_cpu = TICK_DO_TIMER_NONE;
+ tick_do_timer_cpu = cpu;
tick_next_period = ktime_get();
tick_period = ktime_set(0, NSEC_PER_SEC / HZ);
}
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 3d63944..2ea2143 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -741,12 +741,6 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
*/
if (tick_do_timer_cpu == cpu)
return false;
- /*
- * Boot safety: make sure the timekeeping duty has been
- * assigned before entering dyntick-idle mode,
- */
- if (tick_do_timer_cpu == TICK_DO_TIMER_NONE)
- return false;
}

return true;
--
1.8.3.1

2014-07-19 00:44:48

by Frederic Weisbecker

[permalink] [raw]
Subject: [PATCH 06/10] nohz: Define meaningful symbol for nohz full timekeeper

The nohz full timekeeper is always CPU 0. Lets add it to the list of
special tick_do_timer_cpu symbols for more self explanatory code.

Cc: Ingo Molnar <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Viresh Kumar <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
kernel/time/tick-internal.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
index 7ab92b1..6b592a8 100644
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -10,6 +10,7 @@ extern seqlock_t jiffies_lock;

#ifdef CONFIG_GENERIC_CLOCKEVENTS_BUILD

+#define TICK_DO_TIMER_DEFAULT 0
#define TICK_DO_TIMER_NONE -1
#define TICK_DO_TIMER_BOOT -2

--
1.8.3.1

2014-07-19 00:44:46

by Frederic Weisbecker

[permalink] [raw]
Subject: [PATCH 03/10] rcu: Kick full dynticks CPU on extended grace period with a void IRQ

When a full dynticks CPU stays for too long in the kernel, it may fail
to report quiescent states due to it missing ticks and therefore it can
delay the completion of grace periods.

A way to solve this is to send an IPI to the incriminated CPU such that
it can check rcu_needs_cpu() and reschedule some ticks accordingly to
poll again on quiescent states reports.

This is what we try to achieve while calling rcu_kick_nohz_cpu() but it
has no effect because we trigger a scheduler IPI which is actually a
no-op when not used for scheduler internal purpose, ie: it doesn't call
irq_exit() when not used for remote wakeup or other specifics.

No need to tweak the scheduler IPI further though. Lets keep it clean
of external noise and use an empty IPI instead. We hereby get sure that
we will call irq_exit() on the target without much overhead nor added
noise on scheduler fast path.

Cc: Ingo Molnar <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Viresh Kumar <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
kernel/rcu/tree_plugin.h | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index cbc2c45..395c14d 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2401,14 +2401,16 @@ static bool init_nocb_callback_list(struct rcu_data *rdp)
* off. RCU will be paying attention to this CPU because it is in the
* kernel, but the CPU cannot be guaranteed to be executing the RCU state
* machine because the scheduling-clock tick has been disabled. Therefore,
- * if an adaptive-ticks CPU is failing to respond to the current grace
- * period and has not be idle from an RCU perspective, kick it.
+ * if an full dynticks CPU is failing to respond to the current grace
+ * period and has not be idle from an RCU perspective, kick it with a
+ * void IRQ so that it can check that RCU needs its tick from rcu_needs_cpu()
+ * on irq exit.
*/
static void rcu_kick_nohz_cpu(int cpu)
{
#ifdef CONFIG_NO_HZ_FULL
if (tick_nohz_full_cpu(cpu))
- smp_send_reschedule(cpu);
+ irq_work_void_on(cpu);
#endif /* #ifdef CONFIG_NO_HZ_FULL */
}

--
1.8.3.1

2014-07-19 00:44:44

by Frederic Weisbecker

[permalink] [raw]
Subject: [PATCH 02/10] nohz: Kick full dynticks timer targets with an empty IPI

While we enqueue a new timer on a dynticks target, we must send it an
IPI so that it reschedules the next tick accordingly.

Now all we need for that is to run irq_exit() on the target. Hence
an empty IRQ is way enough. We don't need to run tick_nohz_full_check()
and all the overhead that comes with the several cmpxchg() necessary for
irq work list and state progress.

So lets use a void irq work instead.

Cc: Ingo Molnar <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Viresh Kumar <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
kernel/sched/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f3063c..f3e48b8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -693,7 +693,7 @@ static bool wake_up_full_nohz_cpu(int cpu)
if (tick_nohz_full_cpu(cpu)) {
if (cpu != smp_processor_id() ||
tick_nohz_tick_stopped())
- tick_nohz_full_kick_cpu(cpu);
+ irq_work_void_on(cpu);
return true;
}

--
1.8.3.1

2014-07-19 00:46:46

by Frederic Weisbecker

[permalink] [raw]
Subject: [PATCH 04/10] nohz: Appropriate timekeeper kick on sysidle break

When a CPU wakes up from idle and finds out that the timekeeper is
sleeping, we need to kick it such that it switches from dynticks to
periodic mode to maintain its timekeeping duty on behalf of the newly
awoken CPU.

However we aren't using the right API for that. rcu_kick_nohz_cpu() is
aimed at waking full dynticks CPUs and not the timekeeper.

Moreover the timekeeper must perform a new dynticks cycle to check the
new sysidle state and restart the tick if necessary. A simple call
to irq_exit() isn't enough.

wake_up_nohz_cpu() is a good fit for that job because it pulls the
target out of the idle loop and restart the tick. Then if no other
task waits for the CPU, it will reenter the idle loop and then the
new sysidle state will be visible and well handled.

Cc: Ingo Molnar <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Viresh Kumar <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
kernel/rcu/tree_plugin.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 395c14d..b65da1a 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2494,7 +2494,7 @@ void rcu_sysidle_force_exit(void)
oldstate, RCU_SYSIDLE_NOT);
if (oldstate == newoldstate &&
oldstate == RCU_SYSIDLE_FULL_NOTED) {
- rcu_kick_nohz_cpu(tick_do_timer_cpu);
+ wake_up_nohz_cpu(tick_do_timer_cpu);
return; /* We cleared it, done! */
}
oldstate = newoldstate;
--
1.8.3.1

2014-07-19 00:46:44

by Frederic Weisbecker

[permalink] [raw]
Subject: [PATCH 10/10] nohz: Warn on illegal timekeeper switch in nohz full

In full dynticks idle mode, the timekeeper should never be set to another
CPU than 0.

Lets enforce that through a dedicated mutator.

Cc: Ingo Molnar <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Viresh Kumar <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
kernel/time/tick-common.c | 9 ++++-----
kernel/time/tick-internal.h | 7 +++++++
kernel/time/tick-sched.c | 7 +++----
3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index cb57bce..72ccaf2 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -180,9 +180,9 @@ static void tick_setup_device(struct tick_device *td,
*/
if (tick_do_timer_cpu == TICK_DO_TIMER_BOOT) {
if (tick_nohz_full_enabled())
- tick_do_timer_cpu = TICK_DO_TIMER_DEFAULT;
+ tick_do_timer_cpu_set(TICK_DO_TIMER_DEFAULT);
else
- tick_do_timer_cpu = cpu;
+ tick_do_timer_cpu_set(cpu);
tick_next_period = ktime_get();
tick_period = ktime_set(0, NSEC_PER_SEC / HZ);
}
@@ -341,9 +341,8 @@ void tick_handover_do_timer(int *cpup)
{
if (*cpup == tick_do_timer_cpu) {
int cpu = cpumask_first(cpu_online_mask);
-
- tick_do_timer_cpu = (cpu < nr_cpu_ids) ? cpu :
- TICK_DO_TIMER_NONE;
+ tick_do_timer_cpu_set((cpu < nr_cpu_ids) ? cpu :
+ TICK_DO_TIMER_NONE);
}
}

diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
index 6b592a8..eadfd89 100644
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -23,6 +23,13 @@ extern void tick_setup_periodic(struct clock_event_device *dev, int broadcast);
extern void tick_handle_periodic(struct clock_event_device *dev);
extern void tick_check_new_device(struct clock_event_device *dev);
extern void tick_handover_do_timer(int *cpup);
+
+static inline void tick_do_timer_cpu_set(int cpu)
+{
+ WARN_ON_ONCE(tick_nohz_full_enabled() && cpu != TICK_DO_TIMER_DEFAULT);
+ tick_do_timer_cpu = cpu;
+}
+
extern void tick_shutdown(unsigned int *cpup);
extern void tick_suspend(void);
extern void tick_resume(void);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 845aaff..3b5102a 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -123,7 +123,7 @@ static void tick_sched_do_timer(ktime_t now)
*/
if (unlikely(tick_do_timer_cpu == TICK_DO_TIMER_NONE)
&& !tick_nohz_full_cpu(cpu))
- tick_do_timer_cpu = cpu;
+ tick_do_timer_cpu_set(cpu);
#endif

/* Check, if the jiffies need an update */
@@ -550,7 +550,7 @@ static u64 timekeeping_deferment(struct tick_sched *ts, int cpu)
ts->do_timer_last = 1;
/* In full dynticks mode, CPU 0 always keeps the duty */
if (!tick_nohz_full_enabled())
- tick_do_timer_cpu = TICK_DO_TIMER_NONE;
+ tick_do_timer_cpu_set(TICK_DO_TIMER_NONE);
} else if (ts->do_timer_last) {
if (tick_do_timer_cpu == TICK_DO_TIMER_NONE)
time_delta = timekeeping_max_deferment();
@@ -600,7 +600,6 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
/* Schedule the tick, if we are at least one jiffie off */
if ((long)delta_jiffies >= 1) {
u64 time_delta = timekeeping_deferment(ts, cpu);
-
#ifdef CONFIG_NO_HZ_FULL
if (!ts->inidle) {
time_delta = min(time_delta,
@@ -717,7 +716,7 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
*/
if (unlikely(!cpu_online(cpu))) {
if (cpu == tick_do_timer_cpu)
- tick_do_timer_cpu = TICK_DO_TIMER_NONE;
+ tick_do_timer_cpu_set(TICK_DO_TIMER_NONE);
return false;
}

--
1.8.3.1

2014-07-19 00:47:23

by Frederic Weisbecker

[permalink] [raw]
Subject: [PATCH 05/10] smp: Fast path check on IPI list

When we enqueue a remote irq work, we trigger the same IPI as those
raised by smp_call_function_*() family.

So when we receive such IPI, we check both irq_work and smp_call_function
queues. Thus if we trigger a remote irq work, we'll likely find the
smp_call_function queue empty unless we collide with concurrent enqueuers
but the probability is low.

Meanwhile, checking the smp_call_function queue can be costly because
we use llist_del_all() which relies on cmpxchg().

We can reduce this overhead by doing a fast path check with llist_empty().
Given the implicit IPI ordering:

Enqueuer Dequeuer
--------- --------
llist_add(csd, queue) get_IPI() {
send_IPI() if (llist_empty(queue)
...
When the IPI is sent, we are guaranteed that the IPI receiver will
see the new csd.

So lets do the fast path check to optimize non smp_call_function() related
jobs.

Cc: Ingo Molnar <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Viresh Kumar <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
kernel/smp.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index a1812d1..34378d4 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -184,11 +184,19 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
*/
void generic_smp_call_function_single_interrupt(void)
{
+ struct llist_head *head = &__get_cpu_var(call_single_queue);
struct llist_node *entry;
struct call_single_data *csd, *csd_next;
static bool warned;

- entry = llist_del_all(&__get_cpu_var(call_single_queue));
+ /*
+ * Fast check: in case of irq work remote queue, the IPI list
+ * is likely empty. We can spare the expensive llist_del_all().
+ */
+ if (llist_empty(head))
+ goto irq_work;
+
+ entry = llist_del_all(head);
entry = llist_reverse_order(entry);

/*
@@ -212,6 +220,7 @@ void generic_smp_call_function_single_interrupt(void)
csd_unlock(csd);
}

+irq_work:
/*
* Handle irq works queued remotely by irq_work_queue_on().
* Smp functions above are typically synchronous so they
--
1.8.3.1

2014-07-19 07:20:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 02/10] nohz: Kick full dynticks timer targets with an empty IPI

On Sat, Jul 19, 2014 at 02:44:13AM +0200, Frederic Weisbecker wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 7f3063c..f3e48b8 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -693,7 +693,7 @@ static bool wake_up_full_nohz_cpu(int cpu)
> if (tick_nohz_full_cpu(cpu)) {
> if (cpu != smp_processor_id() ||
> tick_nohz_tick_stopped())
> - tick_nohz_full_kick_cpu(cpu);
> + irq_work_void_on(cpu);

So no, while the previous function name was descriptive, the new one
leave one wondering.


Attachments:
(No filename) (543.00 B)
(No filename) (836.00 B)
Download all attachments

2014-07-19 13:18:59

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 02/10] nohz: Kick full dynticks timer targets with an empty IPI

On Sat, Jul 19, 2014 at 09:19:51AM +0200, Peter Zijlstra wrote:
> On Sat, Jul 19, 2014 at 02:44:13AM +0200, Frederic Weisbecker wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 7f3063c..f3e48b8 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -693,7 +693,7 @@ static bool wake_up_full_nohz_cpu(int cpu)
> > if (tick_nohz_full_cpu(cpu)) {
> > if (cpu != smp_processor_id() ||
> > tick_nohz_tick_stopped())
> > - tick_nohz_full_kick_cpu(cpu);
> > + irq_work_void_on(cpu);
>
> So no, while the previous function name was descriptive, the new one
> leave one wondering.

Yeah, I suppose I can wrap that to some new function: tick_nohz_full_reschedule_cpu()
for example (although that looks a bit like a sched thing).

That said I plan to gather tick stop and tick restart in the same place, in the
end of the interrupt. So in the end we'll have only one tick_nohz_full_kick_cpu()
for all kick uses, and this will just wrapp to irq_work_void_on().

2014-07-19 13:47:47

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 02/10] nohz: Kick full dynticks timer targets with an empty IPI

On Sat, Jul 19, 2014 at 03:18:47PM +0200, Frederic Weisbecker wrote:
> On Sat, Jul 19, 2014 at 09:19:51AM +0200, Peter Zijlstra wrote:
> > On Sat, Jul 19, 2014 at 02:44:13AM +0200, Frederic Weisbecker wrote:
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 7f3063c..f3e48b8 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -693,7 +693,7 @@ static bool wake_up_full_nohz_cpu(int cpu)
> > > if (tick_nohz_full_cpu(cpu)) {
> > > if (cpu != smp_processor_id() ||
> > > tick_nohz_tick_stopped())
> > > - tick_nohz_full_kick_cpu(cpu);
> > > + irq_work_void_on(cpu);
> >
> > So no, while the previous function name was descriptive, the new one
> > leave one wondering.
>
> Yeah, I suppose I can wrap that to some new function: tick_nohz_full_reschedule_cpu()
> for example (although that looks a bit like a sched thing).
>
> That said I plan to gather tick stop and tick restart in the same place, in the
> end of the interrupt. So in the end we'll have only one tick_nohz_full_kick_cpu()
> for all kick uses, and this will just wrapp to irq_work_void_on().

So there's already kick_all_cpus_sync() in kernel/smp.c, I would suggest
implementing your irq_work_void_on() as kick_cpu() right next to it.

And yes, I would suggest also keeping tick_nohz_full_kick_cpu() as a
'pointless' wrapper purely for 'documentation' value.

2014-07-19 13:54:21

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 02/10] nohz: Kick full dynticks timer targets with an empty IPI

On Sat, Jul 19, 2014 at 03:47:41PM +0200, Peter Zijlstra wrote:
> On Sat, Jul 19, 2014 at 03:18:47PM +0200, Frederic Weisbecker wrote:
> > On Sat, Jul 19, 2014 at 09:19:51AM +0200, Peter Zijlstra wrote:
> > > On Sat, Jul 19, 2014 at 02:44:13AM +0200, Frederic Weisbecker wrote:
> > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > > index 7f3063c..f3e48b8 100644
> > > > --- a/kernel/sched/core.c
> > > > +++ b/kernel/sched/core.c
> > > > @@ -693,7 +693,7 @@ static bool wake_up_full_nohz_cpu(int cpu)
> > > > if (tick_nohz_full_cpu(cpu)) {
> > > > if (cpu != smp_processor_id() ||
> > > > tick_nohz_tick_stopped())
> > > > - tick_nohz_full_kick_cpu(cpu);
> > > > + irq_work_void_on(cpu);
> > >
> > > So no, while the previous function name was descriptive, the new one
> > > leave one wondering.
> >
> > Yeah, I suppose I can wrap that to some new function: tick_nohz_full_reschedule_cpu()
> > for example (although that looks a bit like a sched thing).
> >
> > That said I plan to gather tick stop and tick restart in the same place, in the
> > end of the interrupt. So in the end we'll have only one tick_nohz_full_kick_cpu()
> > for all kick uses, and this will just wrapp to irq_work_void_on().
>
> So there's already kick_all_cpus_sync() in kernel/smp.c, I would suggest
> implementing your irq_work_void_on() as kick_cpu() right next to it.

Good point!

>
> And yes, I would suggest also keeping tick_nohz_full_kick_cpu() as a
> 'pointless' wrapper purely for 'documentation' value.

Ok, granted!

Thanks!

2014-07-19 17:31:28

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [PATCH 07/10] nohz: Enforce timekeeping on CPU 0

On Sat, 19 Jul 2014, Frederic Weisbecker wrote:

> The timekeeper gets initialized to the value of the CPU where the
> first clockevent device is setup. This works well because the timekeeper
> can be any online CPU in most configs.
>
> Full dynticks has its own requirement though and needs the timekeeper
> to always be 0. And this requirement seem to accomodate pretty well with
> the above described boot timekeeper setting because the first clockevent
> device happens to be initialized, most of the time, on the boot CPU
> (which should be CPU 0).

This might have been discussed before... but this isn't ARM big.LITTLE
friendly at all.

Could we accommodate for any arbitrary CPU instead of making CPU 0
"special" other than its role as the boot CPU please? It doesn't have
to be completely dynamic, but CPU 0 might be a really bad choice for
ongoing periodic duties in some configurations. For example, we might
highly prefer to do this on CPU 4 for power efficiency reasons once it
is online and keep CPU 0 in a deep C-state as much as possible.


Nicolas

2014-07-19 18:31:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 07/10] nohz: Enforce timekeeping on CPU 0

On Sat, Jul 19, 2014 at 01:31:25PM -0400, Nicolas Pitre wrote:
> On Sat, 19 Jul 2014, Frederic Weisbecker wrote:
>
> > The timekeeper gets initialized to the value of the CPU where the
> > first clockevent device is setup. This works well because the timekeeper
> > can be any online CPU in most configs.
> >
> > Full dynticks has its own requirement though and needs the timekeeper
> > to always be 0. And this requirement seem to accomodate pretty well with
> > the above described boot timekeeper setting because the first clockevent
> > device happens to be initialized, most of the time, on the boot CPU
> > (which should be CPU 0).
>
> This might have been discussed before... but this isn't ARM big.LITTLE
> friendly at all.
>
> Could we accommodate for any arbitrary CPU instead of making CPU 0
> "special" other than its role as the boot CPU please? It doesn't have
> to be completely dynamic, but CPU 0 might be a really bad choice for
> ongoing periodic duties in some configurations. For example, we might
> highly prefer to do this on CPU 4 for power efficiency reasons once it
> is online and keep CPU 0 in a deep C-state as much as possible.

This is because CPU0 can be a big core, right? IIRC this is done because
a big core as boot cpu, boots faster and some people think boot time is
relevant.

2014-07-19 18:46:59

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [PATCH 07/10] nohz: Enforce timekeeping on CPU 0

On Sat, 19 Jul 2014, Peter Zijlstra wrote:

> On Sat, Jul 19, 2014 at 01:31:25PM -0400, Nicolas Pitre wrote:
> > On Sat, 19 Jul 2014, Frederic Weisbecker wrote:
> >
> > > The timekeeper gets initialized to the value of the CPU where the
> > > first clockevent device is setup. This works well because the timekeeper
> > > can be any online CPU in most configs.
> > >
> > > Full dynticks has its own requirement though and needs the timekeeper
> > > to always be 0. And this requirement seem to accomodate pretty well with
> > > the above described boot timekeeper setting because the first clockevent
> > > device happens to be initialized, most of the time, on the boot CPU
> > > (which should be CPU 0).
> >
> > This might have been discussed before... but this isn't ARM big.LITTLE
> > friendly at all.
> >
> > Could we accommodate for any arbitrary CPU instead of making CPU 0
> > "special" other than its role as the boot CPU please? It doesn't have
> > to be completely dynamic, but CPU 0 might be a really bad choice for
> > ongoing periodic duties in some configurations. For example, we might
> > highly prefer to do this on CPU 4 for power efficiency reasons once it
> > is online and keep CPU 0 in a deep C-state as much as possible.
>
> This is because CPU0 can be a big core, right? IIRC this is done because
> a big core as boot cpu, boots faster and some people think boot time is
> relevant.

People think all sorts of things. And it becomes very irritating when
thoughtful assumptions get burned into ROM for example. We should be
able to do better in the kernel.


Nicolas

2014-07-19 19:45:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 07/10] nohz: Enforce timekeeping on CPU 0

On Sat, Jul 19, 2014 at 02:46:56PM -0400, Nicolas Pitre wrote:
> People think all sorts of things. And it becomes very irritating when
> thoughtful assumptions get burned into ROM for example. We should be
> able to do better in the kernel.

Agreed. There used to be an x86 subarch where the boot cpu wasn't cpu0,
that always gave a lot of 'joy' too :-)

2014-07-20 01:07:29

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 07/10] nohz: Enforce timekeeping on CPU 0

On Sat, Jul 19, 2014 at 01:31:25PM -0400, Nicolas Pitre wrote:
> On Sat, 19 Jul 2014, Frederic Weisbecker wrote:
>
> > The timekeeper gets initialized to the value of the CPU where the
> > first clockevent device is setup. This works well because the timekeeper
> > can be any online CPU in most configs.
> >
> > Full dynticks has its own requirement though and needs the timekeeper
> > to always be 0. And this requirement seem to accomodate pretty well with
> > the above described boot timekeeper setting because the first clockevent
> > device happens to be initialized, most of the time, on the boot CPU
> > (which should be CPU 0).
>
> This might have been discussed before... but this isn't ARM big.LITTLE
> friendly at all.
>
> Could we accommodate for any arbitrary CPU instead of making CPU 0
> "special" other than its role as the boot CPU please? It doesn't have
> to be completely dynamic, but CPU 0 might be a really bad choice for
> ongoing periodic duties in some configurations. For example, we might
> highly prefer to do this on CPU 4 for power efficiency reasons once it
> is online and keep CPU 0 in a deep C-state as much as possible.

I can certainly arrange for setting user defined timekeeper on boot
time. Just one constraint: since the timekeeper is fixed, we can't
offline it. That's usually fine with CPU 0 but on other CPUs, rejecting
a CPU offlining blocks the suspend process.

I imagine I can handover the timekeeping to another CPU in this case
and accept the offlining. But there are chances that the only online
CPUs remaining are full dynticks and it's not tempting to give them
a timekeeping duty.

2014-07-21 17:50:47

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 09/10] nohz: Switch nohz full timekeeper to dynticks idle on top of sysidle detection

On Sat, Jul 19, 2014 at 02:44:20AM +0200, Frederic Weisbecker wrote:
> In full dynticks, the CPU 0 carries the timekeeping duty on behalf
> of all other CPUs in the system. This way full dynticks are left
> undisturbed on this regard.
>
> Of course this prevents CPU 0 from entering in dynticks idle mode
> because any CPU may need uptodate timekeeping at any time.
>
> Theoretically though, we could put CPU 0 in dynticks idle mode once we
> are sure that all other CPUs are dynticks idle as well. Then when a
> CPU wakes up and finds the timekeeper idle, it would send an IPI to
> wake it up on its duty.
>
> Such a machine state needs to take care of all the races in the way, make
> sure that CPU 0 is neither stuck accidentally to sleep for ever, nor
> stuck in periodic mode when it could sleep. Also given the amount of
> shared data this involves and their access frequency, this must be built
> on top of lockless low-overhead state machine.
>
> This is what sysidle provides. The feature is ready for a while, we
> were just waiting for the nohz susbsystem to support it. And we just
> reached that state.
>
> So lets defer the last call for CPU 0 to enter in dynticks idle to when
> we find a full system idle state. And lets wake it up when its duty is
> needed.
>
> Cc: Ingo Molnar <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Signed-off-by: Frederic Weisbecker <[email protected]>

OK, it looks like this calls rcu_sys_is_idle() only if there actually
are some nohz_full= CPUs, which is good. I therefore only need
tick_nohz_full_enabled() checks on the internal sysidle machinery, and
even then these checks only have effect on performance, not on semantics.
Which is also good. ;-)

Reviewed-by: Paul E. McKenney <[email protected]>

> ---
> kernel/time/tick-sched.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index bcba79d..845aaff 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -547,8 +547,10 @@ static u64 timekeeping_deferment(struct tick_sched *ts, int cpu)
>
> if (tick_do_timer_cpu == cpu) {
> time_delta = timekeeping_max_deferment();
> - tick_do_timer_cpu = TICK_DO_TIMER_NONE;
> ts->do_timer_last = 1;
> + /* In full dynticks mode, CPU 0 always keeps the duty */
> + if (!tick_nohz_full_enabled())
> + tick_do_timer_cpu = TICK_DO_TIMER_NONE;
> } else if (ts->do_timer_last) {
> if (tick_do_timer_cpu == TICK_DO_TIMER_NONE)
> time_delta = timekeeping_max_deferment();
> @@ -745,7 +747,7 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
> * if there are full dynticks CPUs around
> */
> if (tick_do_timer_cpu == cpu)
> - return false;
> + return rcu_sys_is_idle();
> }
>
> return true;
> --
> 1.8.3.1
>

2014-07-21 17:51:57

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 10/10] nohz: Warn on illegal timekeeper switch in nohz full

On Sat, Jul 19, 2014 at 02:44:21AM +0200, Frederic Weisbecker wrote:
> In full dynticks idle mode, the timekeeper should never be set to another
> CPU than 0.
>
> Lets enforce that through a dedicated mutator.
>
> Cc: Ingo Molnar <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Signed-off-by: Frederic Weisbecker <[email protected]>

Acked-by: Paul E. McKenney <[email protected]>

> ---
> kernel/time/tick-common.c | 9 ++++-----
> kernel/time/tick-internal.h | 7 +++++++
> kernel/time/tick-sched.c | 7 +++----
> 3 files changed, 14 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
> index cb57bce..72ccaf2 100644
> --- a/kernel/time/tick-common.c
> +++ b/kernel/time/tick-common.c
> @@ -180,9 +180,9 @@ static void tick_setup_device(struct tick_device *td,
> */
> if (tick_do_timer_cpu == TICK_DO_TIMER_BOOT) {
> if (tick_nohz_full_enabled())
> - tick_do_timer_cpu = TICK_DO_TIMER_DEFAULT;
> + tick_do_timer_cpu_set(TICK_DO_TIMER_DEFAULT);
> else
> - tick_do_timer_cpu = cpu;
> + tick_do_timer_cpu_set(cpu);
> tick_next_period = ktime_get();
> tick_period = ktime_set(0, NSEC_PER_SEC / HZ);
> }
> @@ -341,9 +341,8 @@ void tick_handover_do_timer(int *cpup)
> {
> if (*cpup == tick_do_timer_cpu) {
> int cpu = cpumask_first(cpu_online_mask);
> -
> - tick_do_timer_cpu = (cpu < nr_cpu_ids) ? cpu :
> - TICK_DO_TIMER_NONE;
> + tick_do_timer_cpu_set((cpu < nr_cpu_ids) ? cpu :
> + TICK_DO_TIMER_NONE);
> }
> }
>
> diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
> index 6b592a8..eadfd89 100644
> --- a/kernel/time/tick-internal.h
> +++ b/kernel/time/tick-internal.h
> @@ -23,6 +23,13 @@ extern void tick_setup_periodic(struct clock_event_device *dev, int broadcast);
> extern void tick_handle_periodic(struct clock_event_device *dev);
> extern void tick_check_new_device(struct clock_event_device *dev);
> extern void tick_handover_do_timer(int *cpup);
> +
> +static inline void tick_do_timer_cpu_set(int cpu)
> +{
> + WARN_ON_ONCE(tick_nohz_full_enabled() && cpu != TICK_DO_TIMER_DEFAULT);
> + tick_do_timer_cpu = cpu;
> +}
> +
> extern void tick_shutdown(unsigned int *cpup);
> extern void tick_suspend(void);
> extern void tick_resume(void);
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 845aaff..3b5102a 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -123,7 +123,7 @@ static void tick_sched_do_timer(ktime_t now)
> */
> if (unlikely(tick_do_timer_cpu == TICK_DO_TIMER_NONE)
> && !tick_nohz_full_cpu(cpu))
> - tick_do_timer_cpu = cpu;
> + tick_do_timer_cpu_set(cpu);
> #endif
>
> /* Check, if the jiffies need an update */
> @@ -550,7 +550,7 @@ static u64 timekeeping_deferment(struct tick_sched *ts, int cpu)
> ts->do_timer_last = 1;
> /* In full dynticks mode, CPU 0 always keeps the duty */
> if (!tick_nohz_full_enabled())
> - tick_do_timer_cpu = TICK_DO_TIMER_NONE;
> + tick_do_timer_cpu_set(TICK_DO_TIMER_NONE);
> } else if (ts->do_timer_last) {
> if (tick_do_timer_cpu == TICK_DO_TIMER_NONE)
> time_delta = timekeeping_max_deferment();
> @@ -600,7 +600,6 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
> /* Schedule the tick, if we are at least one jiffie off */
> if ((long)delta_jiffies >= 1) {
> u64 time_delta = timekeeping_deferment(ts, cpu);
> -
> #ifdef CONFIG_NO_HZ_FULL
> if (!ts->inidle) {
> time_delta = min(time_delta,
> @@ -717,7 +716,7 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
> */
> if (unlikely(!cpu_online(cpu))) {
> if (cpu == tick_do_timer_cpu)
> - tick_do_timer_cpu = TICK_DO_TIMER_NONE;
> + tick_do_timer_cpu_set(TICK_DO_TIMER_NONE);
> return false;
> }
>
> --
> 1.8.3.1
>

2014-07-21 18:11:26

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 06/10] nohz: Define meaningful symbol for nohz full timekeeper

On Sat, Jul 19, 2014 at 02:44:17AM +0200, Frederic Weisbecker wrote:
> The nohz full timekeeper is always CPU 0. Lets add it to the list of
> special tick_do_timer_cpu symbols for more self explanatory code.
>
> Cc: Ingo Molnar <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Signed-off-by: Frederic Weisbecker <[email protected]>

Reviewed-by: Paul E. McKenney <[email protected]>

> ---
> kernel/time/tick-internal.h | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
> index 7ab92b1..6b592a8 100644
> --- a/kernel/time/tick-internal.h
> +++ b/kernel/time/tick-internal.h
> @@ -10,6 +10,7 @@ extern seqlock_t jiffies_lock;
>
> #ifdef CONFIG_GENERIC_CLOCKEVENTS_BUILD
>
> +#define TICK_DO_TIMER_DEFAULT 0
> #define TICK_DO_TIMER_NONE -1
> #define TICK_DO_TIMER_BOOT -2
>
> --
> 1.8.3.1
>

2014-07-21 18:13:00

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 06/10] nohz: Define meaningful symbol for nohz full timekeeper

On Sat, Jul 19, 2014 at 02:44:17AM +0200, Frederic Weisbecker wrote:
> The nohz full timekeeper is always CPU 0. Lets add it to the list of
> special tick_do_timer_cpu symbols for more self explanatory code.
>
> Cc: Ingo Molnar <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Signed-off-by: Frederic Weisbecker <[email protected]>

Sounds like Nicolas Pitre would like to be able to specify the default,
perhaps at build time, but aside from that requested enhancement:

Reviewed-by: Paul E. McKenney <[email protected]>

> ---
> kernel/time/tick-internal.h | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
> index 7ab92b1..6b592a8 100644
> --- a/kernel/time/tick-internal.h
> +++ b/kernel/time/tick-internal.h
> @@ -10,6 +10,7 @@ extern seqlock_t jiffies_lock;
>
> #ifdef CONFIG_GENERIC_CLOCKEVENTS_BUILD
>
> +#define TICK_DO_TIMER_DEFAULT 0
> #define TICK_DO_TIMER_NONE -1
> #define TICK_DO_TIMER_BOOT -2
>
> --
> 1.8.3.1
>

2014-07-21 18:15:36

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 04/10] nohz: Appropriate timekeeper kick on sysidle break

On Sat, Jul 19, 2014 at 02:44:15AM +0200, Frederic Weisbecker wrote:
> When a CPU wakes up from idle and finds out that the timekeeper is
> sleeping, we need to kick it such that it switches from dynticks to
> periodic mode to maintain its timekeeping duty on behalf of the newly
> awoken CPU.
>
> However we aren't using the right API for that. rcu_kick_nohz_cpu() is
> aimed at waking full dynticks CPUs and not the timekeeper.
>
> Moreover the timekeeper must perform a new dynticks cycle to check the
> new sysidle state and restart the tick if necessary. A simple call
> to irq_exit() isn't enough.
>
> wake_up_nohz_cpu() is a good fit for that job because it pulls the
> target out of the idle loop and restart the tick. Then if no other
> task waits for the CPU, it will reenter the idle loop and then the
> new sysidle state will be visible and well handled.
>
> Cc: Ingo Molnar <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Signed-off-by: Frederic Weisbecker <[email protected]>

Either name is fine by me... ;-)

Reviewed-by: Paul E. McKenney <[email protected]>

> ---
> kernel/rcu/tree_plugin.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 395c14d..b65da1a 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -2494,7 +2494,7 @@ void rcu_sysidle_force_exit(void)
> oldstate, RCU_SYSIDLE_NOT);
> if (oldstate == newoldstate &&
> oldstate == RCU_SYSIDLE_FULL_NOTED) {
> - rcu_kick_nohz_cpu(tick_do_timer_cpu);
> + wake_up_nohz_cpu(tick_do_timer_cpu);
> return; /* We cleared it, done! */
> }
> oldstate = newoldstate;
> --
> 1.8.3.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2014-07-21 18:16:19

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 03/10] rcu: Kick full dynticks CPU on extended grace period with a void IRQ

On Sat, Jul 19, 2014 at 02:44:14AM +0200, Frederic Weisbecker wrote:
> When a full dynticks CPU stays for too long in the kernel, it may fail
> to report quiescent states due to it missing ticks and therefore it can
> delay the completion of grace periods.
>
> A way to solve this is to send an IPI to the incriminated CPU such that
> it can check rcu_needs_cpu() and reschedule some ticks accordingly to
> poll again on quiescent states reports.
>
> This is what we try to achieve while calling rcu_kick_nohz_cpu() but it
> has no effect because we trigger a scheduler IPI which is actually a
> no-op when not used for scheduler internal purpose, ie: it doesn't call
> irq_exit() when not used for remote wakeup or other specifics.
>
> No need to tweak the scheduler IPI further though. Lets keep it clean
> of external noise and use an empty IPI instead. We hereby get sure that
> we will call irq_exit() on the target without much overhead nor added
> noise on scheduler fast path.
>
> Cc: Ingo Molnar <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Signed-off-by: Frederic Weisbecker <[email protected]>

Reviewed-by: Paul E. McKenney <[email protected]>

> ---
> kernel/rcu/tree_plugin.h | 8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index cbc2c45..395c14d 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -2401,14 +2401,16 @@ static bool init_nocb_callback_list(struct rcu_data *rdp)
> * off. RCU will be paying attention to this CPU because it is in the
> * kernel, but the CPU cannot be guaranteed to be executing the RCU state
> * machine because the scheduling-clock tick has been disabled. Therefore,
> - * if an adaptive-ticks CPU is failing to respond to the current grace
> - * period and has not be idle from an RCU perspective, kick it.
> + * if an full dynticks CPU is failing to respond to the current grace
> + * period and has not be idle from an RCU perspective, kick it with a
> + * void IRQ so that it can check that RCU needs its tick from rcu_needs_cpu()
> + * on irq exit.
> */
> static void rcu_kick_nohz_cpu(int cpu)
> {
> #ifdef CONFIG_NO_HZ_FULL
> if (tick_nohz_full_cpu(cpu))
> - smp_send_reschedule(cpu);
> + irq_work_void_on(cpu);
> #endif /* #ifdef CONFIG_NO_HZ_FULL */
> }
>
> --
> 1.8.3.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2014-07-21 18:16:43

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 01/10] irq_work: Introduce void irq work

On Sat, Jul 19, 2014 at 02:44:12AM +0200, Frederic Weisbecker wrote:
> Being able to trigger an empty IPI appears to be useless in the first
> place. Yet it is expected to be very useful for callers who just need
> to execute irq_enter() or irq_exit() to a remote target.
>
> More precisely this is going to be useful for the nohz subsystem which
> often needs a remote CPU to reschedule its tick when some state changed
> and require periodicity for any reason. Similar cases have been handled
> with random IPIs until now. But they surely bring unwanted overhead
> all along since they are all dedicated for specific tasks.
>
> Triggering an irq work/smp_call_function IPI should be enough to solve
> that. If competing and spurious IPIs become a problem, we can still
> optimize that later.
>
> Cc: Ingo Molnar <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Signed-off-by: Frederic Weisbecker <[email protected]>

Reviewed-by: Paul E. McKenney <[email protected]>

> ---
> include/linux/irq_work.h | 1 +
> kernel/irq_work.c | 21 +++++++++++++++++++++
> 2 files changed, 22 insertions(+)
>
> diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
> index bf9422c..b2ad065 100644
> --- a/include/linux/irq_work.h
> +++ b/include/linux/irq_work.h
> @@ -36,6 +36,7 @@ bool irq_work_queue(struct irq_work *work);
>
> #ifdef CONFIG_SMP
> bool irq_work_queue_on(struct irq_work *work, int cpu);
> +void irq_work_void_on(int cpu);
> #endif
>
> void irq_work_run(void);
> diff --git a/kernel/irq_work.c b/kernel/irq_work.c
> index 4b0a890..36b7fb2 100644
> --- a/kernel/irq_work.c
> +++ b/kernel/irq_work.c
> @@ -81,6 +81,27 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
> return true;
> }
> EXPORT_SYMBOL_GPL(irq_work_queue_on);
> +
> +/**
> + * irq_work_void_on(): Run a void IRQ on the target
> + * @cpu: The cpu to run the IRQ on
> + *
> + * Run a void IRQ for its own sake on the target. It's generally
> + * useful for callers which want to run irq_enter() or irq_exit()
> + * on a remote CPU.
> + */
> +void irq_work_void_on(int cpu)
> +{
> + /*
> + * NOTE: we could optimize that and spare some IPIs
> + * after checking that raised_list isn't empty before raising.
> + * This can't be done properly without cmpxchg() though so
> + * it may make things worse after all. But lets leave that
> + * possibility open in case people report such issue in the
> + * future.
> + */
> + arch_send_call_function_single_ipi(cpu);
> +}
> #endif
>
> /* Enqueue the irq work @work on the current CPU */
> --
> 1.8.3.1
>

2014-07-25 21:27:30

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 06/10] nohz: Define meaningful symbol for nohz full timekeeper

On Mon, Jul 21, 2014 at 11:12:55AM -0700, Paul E. McKenney wrote:
> On Sat, Jul 19, 2014 at 02:44:17AM +0200, Frederic Weisbecker wrote:
> > The nohz full timekeeper is always CPU 0. Lets add it to the list of
> > special tick_do_timer_cpu symbols for more self explanatory code.
> >
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Paul E. McKenney <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Steven Rostedt <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: Viresh Kumar <[email protected]>
> > Signed-off-by: Frederic Weisbecker <[email protected]>
>
> Sounds like Nicolas Pitre would like to be able to specify the default,
> perhaps at build time, but aside from that requested enhancement:

Ok. That's outside the scope of this patchset though as it doesn't change much things
in this regard: CPU 0 has been the timekeeper since the early days of nohz full.
And we still need a default timekeeper in any case, whether we can override it or not
in the future. So we still need that patch.

Then if big.LITTLE people are really interested in nohz full, we can still enhance
that in the future.

Thanks!