2013-08-20 02:47:11

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 0/9] v2 sysidle changes for 3.12

Hello!

Whenever there is at least one non-idle CPU, it is necessary to
periodically update timekeeping information. Before NO_HZ_FULL, this
updating was carried out by the scheduling-clock tick, which ran on
every non-idle CPU. With the advent of NO_HZ_FULL, it is possible
to have non-idle CPUs that are not receiving scheduling-clock ticks.
This possibility is handled by assigning a timekeeping CPU that continues
taking scheduling-clock ticks.
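
For orientation, the timekeeping CPU is the one recorded in
tick_do_timer_cpu, so "am I the timekeeping CPU?" is a one-line check.
This is a minimal sketch, not code from this series, and it assumes
the caller has preemption disabled:

        static bool am_timekeeping_cpu(void)
        {
                return smp_processor_id() == ACCESS_ONCE(tick_do_timer_cpu);
        }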

Unfortunately, the timekeeping CPU continues taking scheduling-clock
interrupts even when all other CPUs are completely idle, which is
not so good for energy efficiency and battery lifetime. Clearly, it
would be good to turn off the timekeeping CPU's scheduling-clock tick
when all CPUs are completely idle. This is conceptually simple, but
we also need good performance and scalability on large systems, which
rules out implementations based on frequently updated global counts of
non-idle CPUs as well as implementations that frequently scan all CPUs.
Nevertheless, we need a single global indicator in order to keep the
overhead of checking acceptably low.

The chosen approach is to enforce hysteresis on the non-idle to
full-system-idle transition, with the amount of hysteresis increasing
linearly with the number of CPUs, thus keeping contention acceptably low.
This approach piggybacks on RCU's existing force-quiescent-state scanning
of idle CPUs, which has the advantage of avoiding the scan entirely on
busy systems that have high levels of multiprogramming. This scan
takes per-CPU idleness information and feeds it into a state machine
that applies the level of hysteresis required to arrive at a single
full-system-idle indicator.
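
To make "hysteresis increasing linearly with the number of CPUs"
concrete, patch 8 of this series computes the required idle-dwell
time as follows (reproduced here for reference):

        /*
         * Delay before advancing the sysidle state machine, scaling
         * linearly with the number of CPUs.  For example, given
         * nr_cpu_ids = 4096, HZ = 1000, and rcu_fanout_leaf = 16,
         * this works out to DIV_ROUND_UP(4096 * 1000, 16 * 1000),
         * or 256 jiffies (256 ms at HZ=1000).
         */
        static unsigned long rcu_sysidle_delay(void)
        {
                if (nr_cpu_ids <= CONFIG_NO_HZ_FULL_SYSIDLE_SMALL)
                        return 0;
                return DIV_ROUND_UP(nr_cpu_ids * HZ, rcu_fanout_leaf * 1000);
        }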

The individual patches are as follows:

1. Eliminate unused APIs that were intended for adaptive ticks.

2. Add documentation covering the testing of nohz_full.

3. Add a CONFIG_NO_HZ_FULL_SYSIDLE Kconfig parameter to enable
this feature. Kernels built with CONFIG_NO_HZ_FULL_SYSIDLE=n
act exactly as they do today.

4. Add new fields to the rcu_dynticks structure that track CPU-idle
information. These fields consider CPUs running usermode to be
non-idle, in contrast with the existing fields in that structure.

5. Track per-CPU idle states.

6. Add full-system idle states and state variables.

7. Expand force_qs_rnp(), dyntick_save_progress_counter(), and
rcu_implicit_dynticks_qs() APIs to enable passing full-system
idle state information.

8. Add full-system-idle state machine.

9. Force RCU's grace-period kthreads onto the timekeeping CPU.

Changes since v4 (https://lkml.org/lkml/2013/8/17/108):

o Move small-system cutoff to Kconfig symbol as suggested
by Josh Triplett.

o Provide names (not just types) for function pointer argument
as suggested by Josh Triplett.

Changes since v3 (https://lkml.org/lkml/2013/7/8/441):

o Updated Kconfig help text as suggested by Frederic.

o Added a few adaptive-ticks-related fixes (#1 and #2 above).

Changes since v2:

o Completed removing NMI support (thanks to Frederic for spotting
the remaining cruft).

o Fixed a state-machine bug, again spotted by Frederic. See
http://lists-archives.com/linux-kernel/27865835-nohz_full-add-full-system-idle-state-machine.html
for the full details of the bug.

o Updated commit log and comment as suggested by Josh Triplett.

Changes since v1:

o Removed NMI support because NMI handlers cannot safely read
the time anyway (thanks to Thomas Gleixner and Peter Zijlstra).

Thanx, Paul


b/Documentation/timers/NO_HZ.txt | 44 +++-
b/include/linux/rcupdate.h | 22 +-
b/kernel/rcutree.c | 94 +++-----
b/kernel/rcutree.h | 17 +
b/kernel/rcutree_plugin.h | 420 ++++++++++++++++++++++++++++++++++++++-
b/kernel/time/Kconfig | 50 ++++
6 files changed, 575 insertions(+), 72 deletions(-)


2013-08-20 02:47:40

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 4/9] nohz_full: Add rcu_dyntick data for scalable detection of all-idle state

From: "Paul E. McKenney" <[email protected]>

This commit adds fields to the rcu_dyntick structure that are used to
detect idle CPUs. These new fields differ from the existing ones in
that the existing ones consider a CPU executing in user mode to be idle,
whereas the new ones consider CPUs executing in user mode to be busy.
The handling of these new fields is otherwise quite similar to that for
the existing fields. This commit also adds the initialization required
for these fields.
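
As with the existing ->dynticks counter, the new ->dynticks_idle
counter publishes idleness through its parity: even means idle, odd
means non-idle. A minimal sketch of a remote sampler (illustration
only; the series' real sampler is rcu_sysidle_check_cpu() in patch 8):

        static bool sample_cpu_sysidle(struct rcu_dynticks *rdtp)
        {
                /* Even counter value: CPU idle, and not in usermode. */
                return (atomic_read(&rdtp->dynticks_idle) & 0x1) == 0;
        }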

So, why is usermode execution treated differently, with RCU considering
it a quiescent state equivalent to idle, while in contrast the new
full-system idle state detection considers usermode execution to be
non-idle?

It turns out that although one of RCU's quiescent states is usermode
execution, it is not a full-system idle state. This is because the
purpose of the full-system idle state is not RCU, but rather determining
when accurate timekeeping can safely be disabled. Whenever accurate
timekeeping is required in a CONFIG_NO_HZ_FULL kernel, at least one
CPU must keep the scheduling-clock tick going. If even one CPU is
executing in user mode, accurate timekeeping is required, particularly for
architectures where gettimeofday() and friends do not enter the kernel.
Only when all CPUs are really and truly idle can accurate timekeeping be
disabled, allowing all CPUs to turn off the scheduling clock interrupt,
thus greatly improving energy efficiency.

This naturally raises the question "Why is this code in RCU rather than in
timekeeping?", and the answer is that RCU has the data and infrastructure
to efficiently make this determination.

Signed-off-by: Paul E. McKenney <[email protected]>
Acked-by: Frederic Weisbecker <[email protected]>
Cc: Steven Rostedt <[email protected]>
Reviewed-by: Josh Triplett <[email protected]>
---
kernel/rcutree.c | 5 +++++
kernel/rcutree.h | 9 +++++++++
kernel/rcutree_plugin.h | 19 +++++++++++++++++++
3 files changed, 33 insertions(+)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 8807019..4f27b85 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -224,6 +224,10 @@ EXPORT_SYMBOL_GPL(rcu_note_context_switch);
DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks) = {
.dynticks_nesting = DYNTICK_TASK_EXIT_IDLE,
.dynticks = ATOMIC_INIT(1),
+#ifdef CONFIG_NO_HZ_FULL_SYSIDLE
+ .dynticks_idle_nesting = DYNTICK_TASK_NEST_VALUE,
+ .dynticks_idle = ATOMIC_INIT(1),
+#endif /* #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
};

static long blimit = 10; /* Maximum callbacks per rcu_do_batch. */
@@ -2904,6 +2908,7 @@ rcu_init_percpu_data(int cpu, struct rcu_state *rsp, int preemptible)
rdp->blimit = blimit;
init_callback_list(rdp); /* Re-enable callbacks on this CPU. */
rdp->dynticks->dynticks_nesting = DYNTICK_TASK_EXIT_IDLE;
+ rcu_sysidle_init_percpu_data(rdp->dynticks);
atomic_set(&rdp->dynticks->dynticks,
(atomic_read(&rdp->dynticks->dynticks) & ~0x1) + 1);
raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index cbdeac6..52d1be1 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -88,6 +88,14 @@ struct rcu_dynticks {
/* Process level is worth LLONG_MAX/2. */
int dynticks_nmi_nesting; /* Track NMI nesting level. */
atomic_t dynticks; /* Even value for idle, else odd. */
+#ifdef CONFIG_NO_HZ_FULL_SYSIDLE
+ long long dynticks_idle_nesting;
+ /* irq/process nesting level from idle. */
+ atomic_t dynticks_idle; /* Even value for idle, else odd. */
+ /* "Idle" excludes userspace execution. */
+ unsigned long dynticks_idle_jiffies;
+ /* End of last non-NMI non-idle period. */
+#endif /* #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
#ifdef CONFIG_RCU_FAST_NO_HZ
bool all_lazy; /* Are all CPU's CBs lazy? */
unsigned long nonlazy_posted;
@@ -545,6 +553,7 @@ static void rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp);
static void rcu_spawn_nocb_kthreads(struct rcu_state *rsp);
static void rcu_kick_nohz_cpu(int cpu);
static bool init_nocb_callback_list(struct rcu_data *rdp);
+static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp);

#endif /* #ifndef RCU_TREE_NONCORE */

diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index dff86f5..e5baccb 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -2373,3 +2373,22 @@ static void rcu_kick_nohz_cpu(int cpu)
smp_send_reschedule(cpu);
#endif /* #ifdef CONFIG_NO_HZ_FULL */
}
+
+
+#ifdef CONFIG_NO_HZ_FULL_SYSIDLE
+
+/*
+ * Initialize dynticks sysidle state for CPUs coming online.
+ */
+static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp)
+{
+ rdtp->dynticks_idle_nesting = DYNTICK_TASK_NEST_VALUE;
+}
+
+#else /* #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
+
+static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp)
+{
+}
+
+#endif /* #else #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
--
1.8.1.5

2013-08-20 02:47:38

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 2/9] nohz_full: Add testing information to documentation

From: "Paul E. McKenney" <[email protected]>

This commit adds information about testing nohz_full, and also emphasizes
that you need a multi-CPU system to get any benefit from nohz_full.

Signed-off-by: Paul E. McKenney <[email protected]>
Acked-by: Frederic Weisbecker <[email protected]>
Reviewed-by: Josh Triplett <[email protected]>
---
Documentation/timers/NO_HZ.txt | 44 ++++++++++++++++++++++++++++++++----------
1 file changed, 34 insertions(+), 10 deletions(-)

diff --git a/Documentation/timers/NO_HZ.txt b/Documentation/timers/NO_HZ.txt
index 8869758..cca122f 100644
--- a/Documentation/timers/NO_HZ.txt
+++ b/Documentation/timers/NO_HZ.txt
@@ -24,8 +24,8 @@ There are three main ways of managing scheduling-clock interrupts
workloads, you will normally -not- want this option.

These three cases are described in the following three sections, followed
-by a third section on RCU-specific considerations and a fourth and final
-section listing known issues.
+by a third section on RCU-specific considerations, a fourth section
+discussing testing, and a fifth and final section listing known issues.


NEVER OMIT SCHEDULING-CLOCK TICKS
@@ -121,14 +121,15 @@ boot parameter specifies the adaptive-ticks CPUs. For example,
"nohz_full=1,6-8" says that CPUs 1, 6, 7, and 8 are to be adaptive-ticks
CPUs. Note that you are prohibited from marking all of the CPUs as
adaptive-tick CPUs: At least one non-adaptive-tick CPU must remain
-online to handle timekeeping tasks in order to ensure that system calls
-like gettimeofday() returns accurate values on adaptive-tick CPUs.
-(This is not an issue for CONFIG_NO_HZ_IDLE=y because there are no
-running user processes to observe slight drifts in clock rate.)
-Therefore, the boot CPU is prohibited from entering adaptive-ticks
-mode. Specifying a "nohz_full=" mask that includes the boot CPU will
-result in a boot-time error message, and the boot CPU will be removed
-from the mask.
+online to handle timekeeping tasks in order to ensure that system
+calls like gettimeofday() returns accurate values on adaptive-tick CPUs.
+(This is not an issue for CONFIG_NO_HZ_IDLE=y because there are no running
+user processes to observe slight drifts in clock rate.) Therefore, the
+boot CPU is prohibited from entering adaptive-ticks mode. Specifying a
+"nohz_full=" mask that includes the boot CPU will result in a boot-time
+error message, and the boot CPU will be removed from the mask. Note that
+this means that your system must have at least two CPUs in order for
+CONFIG_NO_HZ_FULL=y to do anything for you.

Alternatively, the CONFIG_NO_HZ_FULL_ALL=y Kconfig parameter specifies
that all CPUs other than the boot CPU are adaptive-ticks CPUs. This
@@ -232,6 +233,29 @@ scheduler will decide where to run them, which might or might not be
where you want them to run.


+TESTING
+
+So you enable all the OS-jitter features described in this document,
+but do not see any change in your workload's behavior. Is this because
+your workload isn't affected that much by OS jitter, or is it because
+something else is in the way? This section helps answer this question
+by providing a simple OS-jitter test suite, which is available on branch
+master of the following git archive:
+
+git://git.kernel.org/pub/scm/linux/kernel/git/frederic/dynticks-testing.git
+
+Clone this archive and follow the instructions in the README file.
+This test procedure will produce a trace that will allow you to evaluate
+whether or not you have succeeded in removing OS jitter from your system.
+If this trace shows that you have removed OS jitter as much as is
+possible, then you can conclude that your workload is not all that
+sensitive to OS jitter.
+
+Note: this test requires that your system have at least two CPUs.
+We do not currently have a good way to remove OS jitter from single-CPU
+systems.
+
+
KNOWN ISSUES

o Dyntick-idle slows transitions to and from idle slightly.
--
1.8.1.5

2013-08-20 02:47:37

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 1/9] rcu: Eliminate unused APIs intended for adaptive ticks

From: "Paul E. McKenney" <[email protected]>

The rcu_user_enter_after_irq() and rcu_user_exit_after_irq()
functions were intended for use by adaptive ticks, but changes
in implementation have rendered them unnecessary. This commit
therefore removes them.

Reported-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Reviewed-by: Josh Triplett <[email protected]>
---
include/linux/rcupdate.h | 4 ----
kernel/rcutree.c | 43 -------------------------------------------
2 files changed, 47 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 0c38abb..30bea9c 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -229,13 +229,9 @@ extern void rcu_irq_exit(void);
#ifdef CONFIG_RCU_USER_QS
extern void rcu_user_enter(void);
extern void rcu_user_exit(void);
-extern void rcu_user_enter_after_irq(void);
-extern void rcu_user_exit_after_irq(void);
#else
static inline void rcu_user_enter(void) { }
static inline void rcu_user_exit(void) { }
-static inline void rcu_user_enter_after_irq(void) { }
-static inline void rcu_user_exit_after_irq(void) { }
static inline void rcu_user_hooks_switch(struct task_struct *prev,
struct task_struct *next) { }
#endif /* CONFIG_RCU_USER_QS */
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 338f1d1..8807019 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -444,27 +444,6 @@ void rcu_user_enter(void)
{
rcu_eqs_enter(1);
}
-
-/**
- * rcu_user_enter_after_irq - inform RCU that we are going to resume userspace
- * after the current irq returns.
- *
- * This is similar to rcu_user_enter() but in the context of a non-nesting
- * irq. After this call, RCU enters into idle mode when the interrupt
- * returns.
- */
-void rcu_user_enter_after_irq(void)
-{
- unsigned long flags;
- struct rcu_dynticks *rdtp;
-
- local_irq_save(flags);
- rdtp = &__get_cpu_var(rcu_dynticks);
- /* Ensure this irq is interrupting a non-idle RCU state. */
- WARN_ON_ONCE(!(rdtp->dynticks_nesting & DYNTICK_TASK_MASK));
- rdtp->dynticks_nesting = 1;
- local_irq_restore(flags);
-}
#endif /* CONFIG_RCU_USER_QS */

/**
@@ -581,28 +560,6 @@ void rcu_user_exit(void)
{
rcu_eqs_exit(1);
}
-
-/**
- * rcu_user_exit_after_irq - inform RCU that we won't resume to userspace
- * idle mode after the current non-nesting irq returns.
- *
- * This is similar to rcu_user_exit() but in the context of an irq.
- * This is called when the irq has interrupted a userspace RCU idle mode
- * context. When the current non-nesting interrupt returns after this call,
- * the CPU won't restore the RCU idle mode.
- */
-void rcu_user_exit_after_irq(void)
-{
- unsigned long flags;
- struct rcu_dynticks *rdtp;
-
- local_irq_save(flags);
- rdtp = &__get_cpu_var(rcu_dynticks);
- /* Ensure we are interrupting an RCU idle mode. */
- WARN_ON_ONCE(rdtp->dynticks_nesting & DYNTICK_TASK_NEST_MASK);
- rdtp->dynticks_nesting += DYNTICK_TASK_EXIT_IDLE;
- local_irq_restore(flags);
-}
#endif /* CONFIG_RCU_USER_QS */

/**
--
1.8.1.5

2013-08-20 02:47:35

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 5/9] nohz_full: Add per-CPU idle-state tracking

From: "Paul E. McKenney" <[email protected]>

This commit adds the code that updates the rcu_dyntick structure's
new fields to track the per-CPU idle state based on interrupts and
transitions into and out of the idle loop (NMIs are ignored because NMI
handlers cannot cleanly read out the time anyway). This code is similar
to the code that maintains RCU's idea of per-CPU idleness, but differs
in that RCU treats CPUs running in user mode as idle, whereas this new
code does not.
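
The essential discipline is a per-CPU nesting count whose zero/nonzero
transitions flip the parity of ->dynticks_idle. A simplified sketch,
ignoring the DYNTICK_TASK_NEST_VALUE encoding and the memory barriers
that the real code needs for safe remote sampling:

        static void sysidle_irq_enter(struct rcu_dynticks *rdtp)
        {
                if (rdtp->dynticks_idle_nesting++ == 0)
                        atomic_inc(&rdtp->dynticks_idle); /* Now odd: non-idle. */
        }

        static void sysidle_irq_exit(struct rcu_dynticks *rdtp)
        {
                if (--rdtp->dynticks_idle_nesting == 0)
                        atomic_inc(&rdtp->dynticks_idle); /* Now even: idle. */
        }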

Signed-off-by: Paul E. McKenney <[email protected]>
Acked-by: Frederic Weisbecker <[email protected]>
Cc: Steven Rostedt <[email protected]>
Reviewed-by: Josh Triplett <[email protected]>
---
kernel/rcutree.c | 4 +++
kernel/rcutree.h | 2 ++
kernel/rcutree_plugin.h | 79 +++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 85 insertions(+)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 4f27b85..b0d2cc3 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -431,6 +431,7 @@ void rcu_idle_enter(void)

local_irq_save(flags);
rcu_eqs_enter(false);
+ rcu_sysidle_enter(&__get_cpu_var(rcu_dynticks), 0);
local_irq_restore(flags);
}
EXPORT_SYMBOL_GPL(rcu_idle_enter);
@@ -481,6 +482,7 @@ void rcu_irq_exit(void)
trace_rcu_dyntick(TPS("--="), oldval, rdtp->dynticks_nesting);
else
rcu_eqs_enter_common(rdtp, oldval, true);
+ rcu_sysidle_enter(rdtp, 1);
local_irq_restore(flags);
}

@@ -549,6 +551,7 @@ void rcu_idle_exit(void)

local_irq_save(flags);
rcu_eqs_exit(false);
+ rcu_sysidle_exit(&__get_cpu_var(rcu_dynticks), 0);
local_irq_restore(flags);
}
EXPORT_SYMBOL_GPL(rcu_idle_exit);
@@ -600,6 +603,7 @@ void rcu_irq_enter(void)
trace_rcu_dyntick(TPS("++="), oldval, rdtp->dynticks_nesting);
else
rcu_eqs_exit_common(rdtp, oldval, true);
+ rcu_sysidle_exit(rdtp, 1);
local_irq_restore(flags);
}

diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 52d1be1..9dd8b17 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -553,6 +553,8 @@ static void rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp);
static void rcu_spawn_nocb_kthreads(struct rcu_state *rsp);
static void rcu_kick_nohz_cpu(int cpu);
static bool init_nocb_callback_list(struct rcu_data *rdp);
+static void rcu_sysidle_enter(struct rcu_dynticks *rdtp, int irq);
+static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq);
static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp);

#endif /* #ifndef RCU_TREE_NONCORE */
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index e5baccb..eab81da 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -2378,6 +2378,77 @@ static void rcu_kick_nohz_cpu(int cpu)
#ifdef CONFIG_NO_HZ_FULL_SYSIDLE

/*
+ * Invoked to note exit from irq or task transition to idle. Note that
+ * usermode execution does -not- count as idle here! After all, we want
+ * to detect full-system idle states, not RCU quiescent states and grace
+ * periods. The caller must have disabled interrupts.
+ */
+static void rcu_sysidle_enter(struct rcu_dynticks *rdtp, int irq)
+{
+ unsigned long j;
+
+ /* Adjust nesting, check for fully idle. */
+ if (irq) {
+ rdtp->dynticks_idle_nesting--;
+ WARN_ON_ONCE(rdtp->dynticks_idle_nesting < 0);
+ if (rdtp->dynticks_idle_nesting != 0)
+ return; /* Still not fully idle. */
+ } else {
+ if ((rdtp->dynticks_idle_nesting & DYNTICK_TASK_NEST_MASK) ==
+ DYNTICK_TASK_NEST_VALUE) {
+ rdtp->dynticks_idle_nesting = 0;
+ } else {
+ rdtp->dynticks_idle_nesting -= DYNTICK_TASK_NEST_VALUE;
+ WARN_ON_ONCE(rdtp->dynticks_idle_nesting < 0);
+ return; /* Still not fully idle. */
+ }
+ }
+
+ /* Record start of fully idle period. */
+ j = jiffies;
+ ACCESS_ONCE(rdtp->dynticks_idle_jiffies) = j;
+ smp_mb__before_atomic_inc();
+ atomic_inc(&rdtp->dynticks_idle);
+ smp_mb__after_atomic_inc();
+ WARN_ON_ONCE(atomic_read(&rdtp->dynticks_idle) & 0x1);
+}
+
+/*
+ * Invoked to note entry to irq or task transition from idle. Note that
+ * usermode execution does -not- count as idle here! The caller must
+ * have disabled interrupts.
+ */
+static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq)
+{
+ /* Adjust nesting, check for already non-idle. */
+ if (irq) {
+ rdtp->dynticks_idle_nesting++;
+ WARN_ON_ONCE(rdtp->dynticks_idle_nesting <= 0);
+ if (rdtp->dynticks_idle_nesting != 1)
+ return; /* Already non-idle. */
+ } else {
+ /*
+ * Allow for irq misnesting. Yes, it really is possible
+ * to enter an irq handler then never leave it, and maybe
+ * also vice versa. Handle both possibilities.
+ */
+ if (rdtp->dynticks_idle_nesting & DYNTICK_TASK_NEST_MASK) {
+ rdtp->dynticks_idle_nesting += DYNTICK_TASK_NEST_VALUE;
+ WARN_ON_ONCE(rdtp->dynticks_idle_nesting <= 0);
+ return; /* Already non-idle. */
+ } else {
+ rdtp->dynticks_idle_nesting = DYNTICK_TASK_EXIT_IDLE;
+ }
+ }
+
+ /* Record end of idle period. */
+ smp_mb__before_atomic_inc();
+ atomic_inc(&rdtp->dynticks_idle);
+ smp_mb__after_atomic_inc();
+ WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks_idle) & 0x1));
+}
+
+/*
* Initialize dynticks sysidle state for CPUs coming online.
*/
static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp)
@@ -2387,6 +2458,14 @@ static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp)

#else /* #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */

+static void rcu_sysidle_enter(struct rcu_dynticks *rdtp, int irq)
+{
+}
+
+static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq)
+{
+}
+
static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp)
{
}
--
1.8.1.5

2013-08-20 02:47:34

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 3/9] nohz_full: Add Kconfig parameter for scalable detection of all-idle state

From: "Paul E. McKenney" <[email protected]>

At least one CPU must keep the scheduling-clock tick running for
timekeeping purposes whenever there is a non-idle CPU. However, with
the new nohz_full adaptive-idle machinery, it is difficult to distinguish
between all CPUs really being idle and all non-idle CPUs merely being
in adaptive-ticks mode. This commit therefore adds a Kconfig parameter
as a first step towards enabling a scalable detection of full-system
idle state.
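
For reference, a kernel built to exercise this feature would carry a
.config fragment like the following (assuming NO_HZ_FULL's own
prerequisites are already satisfied):

        CONFIG_NO_HZ_FULL=y
        CONFIG_NO_HZ_FULL_SYSIDLE=y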

Signed-off-by: Paul E. McKenney <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Steven Rostedt <[email protected]>
[ paulmck: Update help text per Frederic Weisbecker. ]
Reviewed-by: Josh Triplett <[email protected]>
---
kernel/time/Kconfig | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)

diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 70f27e8..c7d2fd6 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -134,6 +134,29 @@ config NO_HZ_FULL_ALL
Note the boot CPU will still be kept outside the range to
handle the timekeeping duty.

+config NO_HZ_FULL_SYSIDLE
+ bool "Detect full-system idle state for full dynticks system"
+ depends on NO_HZ_FULL
+ default n
+ help
+ At least one CPU must keep the scheduling-clock tick running for
+ timekeeping purposes whenever there is a non-idle CPU, where
+ "non-idle" also includes dynticks CPUs as long as they are
+ running non-idle tasks. Because the underlying adaptive-tick
+ support cannot distinguish between all CPUs being idle and
+ all CPUs each running a single task in dynticks mode, the
+ underlying support simply ensures that there is always a CPU
+ handling the scheduling-clock tick, whether or not all CPUs
+ are idle. This Kconfig option enables scalable detection of
+ the all-CPUs-idle state, thus allowing the scheduling-clock
+ tick to be disabled when all CPUs are idle. Note that scalable
+ detection of the all-CPUs-idle state means that larger systems
+ will be slower to declare the all-CPUs-idle state.
+
+ Say Y if you would like to help debug all-CPUs-idle detection.
+
+ Say N if you are unsure.
+
config NO_HZ
bool "Old Idle dynticks config"
depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS
--
1.8.1.5

2013-08-20 02:48:54

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 9/9] nohz_full: Force RCU's grace-period kthreads onto timekeeping CPU

From: "Paul E. McKenney" <[email protected]>

Because RCU's quiescent-state-forcing mechanism is used to drive the
full-system-idle state machine, and because this mechanism is executed
by RCU's grace-period kthreads, this commit forces these kthreads to
run on the timekeeping CPU (tick_do_timer_cpu). To do otherwise would
mean that the RCU grace-period kthreads would force the system into
non-idle state every time they drove the state machine, which would
be just a bit on the futile side.

Signed-off-by: Paul E. McKenney <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Reviewed-by: Josh Triplett <[email protected]>
---
kernel/rcutree.c | 1 +
kernel/rcutree.h | 1 +
kernel/rcutree_plugin.h | 21 ++++++++++++++++++++-
3 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index eca70f44..64eaafb 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1303,6 +1303,7 @@ static int rcu_gp_init(struct rcu_state *rsp)
struct rcu_data *rdp;
struct rcu_node *rnp = rcu_get_root(rsp);

+ rcu_bind_gp_kthread();
raw_spin_lock_irq(&rnp->lock);
rsp->gp_flags = 0; /* Clear all flags: New grace period. */

diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 6fd3659..5f97eab 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -560,6 +560,7 @@ static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
static bool is_sysidle_rcu_state(struct rcu_state *rsp);
static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
unsigned long maxj);
+static void rcu_bind_gp_kthread(void);
static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp);

#endif /* #ifndef RCU_TREE_NONCORE */
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 90c3fba..dc7f0d4 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -2531,7 +2531,8 @@ static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
if (!*isidle || rdp->rsp != rcu_sysidle_state ||
cpu_is_offline(rdp->cpu) || rdp->cpu == tick_do_timer_cpu)
return;
- /* WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu); */
+ if (rcu_gp_in_progress(rdp->rsp))
+ WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu);

/* Pick up current idle and NMI-nesting counter and check. */
cur = atomic_read(&rdtp->dynticks_idle);
@@ -2557,6 +2558,20 @@ static bool is_sysidle_rcu_state(struct rcu_state *rsp)
}

/*
+ * Bind the grace-period kthread for the sysidle flavor of RCU to the
+ * timekeeping CPU.
+ */
+static void rcu_bind_gp_kthread(void)
+{
+ int cpu = ACCESS_ONCE(tick_do_timer_cpu);
+
+ if (cpu < 0 || cpu >= nr_cpu_ids)
+ return;
+ if (raw_smp_processor_id() != cpu)
+ set_cpus_allowed_ptr(current, cpumask_of(cpu));
+}
+
+/*
* Return a delay in jiffies based on the number of CPUs, rcu_node
* leaf fanout, and jiffies tick rate. The idea is to allow larger
* systems more time to transition to full-idle state in order to
@@ -2754,6 +2769,10 @@ static bool is_sysidle_rcu_state(struct rcu_state *rsp)
return false;
}

+static void rcu_bind_gp_kthread(void)
+{
+}
+
static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
unsigned long maxj)
{
--
1.8.1.5

2013-08-20 02:49:17

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 6/9] nohz_full: Add full-system idle states and variables

From: "Paul E. McKenney" <[email protected]>

This commit adds control variables and states for full-system idle.
The system will progress through the states in numerical order when
the system is fully idle (other than the timekeeping CPU), and reset
down to the initial state if any non-timekeeping CPU goes non-idle.
The current state is kept in full_sysidle_state.

One flavor of RCU will be in charge of driving the state machine,
defined by rcu_sysidle_state. This should be the busiest flavor of RCU.
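
Schematically, using the names defined in the patch below:

        RCU_SYSIDLE_NOT -> RCU_SYSIDLE_SHORT -> RCU_SYSIDLE_LONG
                        -> RCU_SYSIDLE_FULL -> RCU_SYSIDLE_FULL_NOTED

Any non-timekeeping CPU going non-idle resets the state to
RCU_SYSIDLE_NOT.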

Signed-off-by: Paul E. McKenney <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Steven Rostedt <[email protected]>
Reviewed-by: Josh Triplett <[email protected]>
---
kernel/rcutree_plugin.h | 17 +++++++++++++++++
1 file changed, 17 insertions(+)

diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index eab81da..a7419ce 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -2378,6 +2378,23 @@ static void rcu_kick_nohz_cpu(int cpu)
#ifdef CONFIG_NO_HZ_FULL_SYSIDLE

/*
+ * Define RCU flavor that holds sysidle state. This needs to be the
+ * most active flavor of RCU.
+ */
+#ifdef CONFIG_PREEMPT_RCU
+static struct rcu_state __maybe_unused *rcu_sysidle_state = &rcu_preempt_state;
+#else /* #ifdef CONFIG_PREEMPT_RCU */
+static struct rcu_state __maybe_unused *rcu_sysidle_state = &rcu_sched_state;
+#endif /* #else #ifdef CONFIG_PREEMPT_RCU */
+
+static int __maybe_unused full_sysidle_state; /* Current system-idle state. */
+#define RCU_SYSIDLE_NOT 0 /* Some CPU is not idle. */
+#define RCU_SYSIDLE_SHORT 1 /* All CPUs idle for brief period. */
+#define RCU_SYSIDLE_LONG 2 /* All CPUs idle for long enough. */
+#define RCU_SYSIDLE_FULL 3 /* All CPUs idle, ready for sysidle. */
+#define RCU_SYSIDLE_FULL_NOTED 4 /* Actually entered sysidle state. */
+
+/*
* Invoked to note exit from irq or task transition to idle. Note that
* usermode execution does -not- count as idle here! After all, we want
* to detect full-system idle states, not RCU quiescent states and grace
--
1.8.1.5

2013-08-20 02:50:03

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 8/9] nohz_full: Add full-system-idle state machine

From: "Paul E. McKenney" <[email protected]>

This commit adds the state machine that takes the per-CPU idle data
as input and produces a full-system-idle indication as output. This
state machine is driven out of RCU's quiescent-state-forcing
mechanism, which invokes rcu_sysidle_check_cpu() to collect per-CPU
idle state and then rcu_sysidle_report() to drive the state machine.

The full-system-idle state is sampled using rcu_sys_is_idle(), which
also drives the state machine if RCU is idle (and does so by forcing
RCU to become non-idle). This function returns true if all but the
timekeeping CPU (tick_do_timer_cpu) are idle and have been idle long
enough to avoid memory contention on the full_sysidle_state state
variable. rcu_sysidle_force_exit() may be called externally
to reset the state machine back into non-idle state.

For large systems the state machine is driven out of RCU's
force-quiescent-state logic, which provides good scalability at the price
of millisecond-scale latencies on the transition to full-system-idle
state. This is not so good for battery-powered systems, which are usually
small enough that they don't need to care about scalability, but which
do care deeply about energy efficiency. Small systems therefore drive
the state machine directly out of the idle-entry code. The number of
CPUs in a "small" system is defined by a new NO_HZ_FULL_SYSIDLE_SMALL
Kconfig parameter, which defaults to 8. Note that this is a build-time
definition.
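
A hedged sketch of how adaptive-ticks code on the timekeeping CPU
might consume this API (the wrapper below is hypothetical;
rcu_sys_is_idle() and rcu_sysidle_force_exit() are the hooks this
patch exports):

        /*
         * Hypothetical caller: the timekeeping CPU, with interrupts
         * disabled, asks whether it may now stop its own tick.
         */
        static bool timekeeping_cpu_may_stop_tick(void)
        {
                return rcu_sys_is_idle();
        }

Note that rcu_sys_is_idle() must be invoked on the timekeeping CPU
with interrupts disabled, and that the timekeeping CPU must call
rcu_sysidle_force_exit() itself if it does anything more than take
scheduling-clock interrupts.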

Signed-off-by: Paul E. McKenney <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Lai Jiangshan <[email protected]>
[ paulmck: Use true and false for boolean constants per Lai Jiangshan. ]
Reviewed-by: Josh Triplett <[email protected]>
---
include/linux/rcupdate.h | 18 +++
kernel/rcutree.c | 16 ++-
kernel/rcutree.h | 5 +
kernel/rcutree_plugin.h | 284 ++++++++++++++++++++++++++++++++++++++++++++++-
kernel/time/Kconfig | 27 +++++
5 files changed, 343 insertions(+), 7 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 30bea9c..f1f1bc3 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -1011,4 +1011,22 @@ static inline bool rcu_is_nocb_cpu(int cpu) { return false; }
#endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */


+/* Only for use by adaptive-ticks code. */
+#ifdef CONFIG_NO_HZ_FULL_SYSIDLE
+extern bool rcu_sys_is_idle(void);
+extern void rcu_sysidle_force_exit(void);
+#else /* #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
+
+static inline bool rcu_sys_is_idle(void)
+{
+ return false;
+}
+
+static inline void rcu_sysidle_force_exit(void)
+{
+}
+
+#endif /* #else #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
+
+
#endif /* __LINUX_RCUPDATE_H */
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 7b5be56..eca70f44 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -734,6 +734,7 @@ static int dyntick_save_progress_counter(struct rcu_data *rdp,
bool *isidle, unsigned long *maxj)
{
rdp->dynticks_snap = atomic_add_return(0, &rdp->dynticks->dynticks);
+ rcu_sysidle_check_cpu(rdp, isidle, maxj);
return (rdp->dynticks_snap & 0x1) == 0;
}

@@ -1373,11 +1374,17 @@ int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
rsp->n_force_qs++;
if (fqs_state == RCU_SAVE_DYNTICK) {
/* Collect dyntick-idle snapshots. */
+ if (is_sysidle_rcu_state(rsp)) {
+ isidle = 1;
+ maxj = jiffies - ULONG_MAX / 4;
+ }
force_qs_rnp(rsp, dyntick_save_progress_counter,
&isidle, &maxj);
+ rcu_sysidle_report_gp(rsp, isidle, maxj);
fqs_state = RCU_FORCE_QS;
} else {
/* Handle dyntick-idle and offline CPUs. */
+ isidle = 0;
force_qs_rnp(rsp, rcu_implicit_dynticks_qs, &isidle, &maxj);
}
/* Clear flag to prevent immediate re-entry. */
@@ -2103,9 +2110,12 @@ static void force_qs_rnp(struct rcu_state *rsp,
cpu = rnp->grplo;
bit = 1;
for (; cpu <= rnp->grphi; cpu++, bit <<= 1) {
- if ((rnp->qsmask & bit) != 0 &&
- f(per_cpu_ptr(rsp->rda, cpu), isidle, maxj))
- mask |= bit;
+ if ((rnp->qsmask & bit) != 0) {
+ if ((rnp->qsmaskinit & bit) != 0)
+ *isidle = 0;
+ if (f(per_cpu_ptr(rsp->rda, cpu), isidle, maxj))
+ mask |= bit;
+ }
}
if (mask != 0) {

diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 9dd8b17..6fd3659 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -555,6 +555,11 @@ static void rcu_kick_nohz_cpu(int cpu);
static bool init_nocb_callback_list(struct rcu_data *rdp);
static void rcu_sysidle_enter(struct rcu_dynticks *rdtp, int irq);
static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq);
+static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
+ unsigned long *maxj);
+static bool is_sysidle_rcu_state(struct rcu_state *rsp);
+static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
+ unsigned long maxj);
static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp);

#endif /* #ifndef RCU_TREE_NONCORE */
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index a7419ce..90c3fba 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -28,7 +28,7 @@
#include <linux/gfp.h>
#include <linux/oom.h>
#include <linux/smpboot.h>
-#include <linux/tick.h>
+#include "time/tick-internal.h"

#define RCU_KTHREAD_PRIO 1

@@ -2382,12 +2382,12 @@ static void rcu_kick_nohz_cpu(int cpu)
* most active flavor of RCU.
*/
#ifdef CONFIG_PREEMPT_RCU
-static struct rcu_state __maybe_unused *rcu_sysidle_state = &rcu_preempt_state;
+static struct rcu_state *rcu_sysidle_state = &rcu_preempt_state;
#else /* #ifdef CONFIG_PREEMPT_RCU */
-static struct rcu_state __maybe_unused *rcu_sysidle_state = &rcu_sched_state;
+static struct rcu_state *rcu_sysidle_state = &rcu_sched_state;
#endif /* #else #ifdef CONFIG_PREEMPT_RCU */

-static int __maybe_unused full_sysidle_state; /* Current system-idle state. */
+static int full_sysidle_state; /* Current system-idle state. */
#define RCU_SYSIDLE_NOT 0 /* Some CPU is not idle. */
#define RCU_SYSIDLE_SHORT 1 /* All CPUs idle for brief period. */
#define RCU_SYSIDLE_LONG 2 /* All CPUs idle for long enough. */
@@ -2431,6 +2431,38 @@ static void rcu_sysidle_enter(struct rcu_dynticks *rdtp, int irq)
}

/*
+ * Unconditionally force exit from full system-idle state. This is
+ * invoked when a normal CPU exits idle, but must be called separately
+ * for the timekeeping CPU (tick_do_timer_cpu). The reason for this
+ * is that the timekeeping CPU is permitted to take scheduling-clock
+ * interrupts while the system is in system-idle state, and of course
+ * rcu_sysidle_exit() has no way of distinguishing a scheduling-clock
+ * interrupt from any other type of interrupt.
+ */
+void rcu_sysidle_force_exit(void)
+{
+ int oldstate = ACCESS_ONCE(full_sysidle_state);
+ int newoldstate;
+
+ /*
+ * Each pass through the following loop attempts to exit full
+ * system-idle state. If contention proves to be a problem,
+ * a trylock-based contention tree could be used here.
+ */
+ while (oldstate > RCU_SYSIDLE_SHORT) {
+ newoldstate = cmpxchg(&full_sysidle_state,
+ oldstate, RCU_SYSIDLE_NOT);
+ if (oldstate == newoldstate &&
+ oldstate == RCU_SYSIDLE_FULL_NOTED) {
+ rcu_kick_nohz_cpu(tick_do_timer_cpu);
+ return; /* We cleared it, done! */
+ }
+ oldstate = newoldstate;
+ }
+ smp_mb(); /* Order initial oldstate fetch vs. later non-idle work. */
+}
+
+/*
* Invoked to note entry to irq or task transition from idle. Note that
* usermode execution does -not- count as idle here! The caller must
* have disabled interrupts.
@@ -2463,6 +2495,235 @@ static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq)
atomic_inc(&rdtp->dynticks_idle);
smp_mb__after_atomic_inc();
WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks_idle) & 0x1));
+
+ /*
+ * If we are the timekeeping CPU, we are permitted to be non-idle
+ * during a system-idle state. This must be the case, because
+ * the timekeeping CPU has to take scheduling-clock interrupts
+ * during the time that the system is transitioning to full
+ * system-idle state. This means that the timekeeping CPU must
+ * invoke rcu_sysidle_force_exit() directly if it does anything
+ * more than take a scheduling-clock interrupt.
+ */
+ if (smp_processor_id() == tick_do_timer_cpu)
+ return;
+
+ /* Update system-idle state: We are clearly no longer fully idle! */
+ rcu_sysidle_force_exit();
+}
+
+/*
+ * Check to see if the current CPU is idle. Note that usermode execution
+ * does not count as idle. The caller must have disabled interrupts.
+ */
+static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
+ unsigned long *maxj)
+{
+ int cur;
+ unsigned long j;
+ struct rcu_dynticks *rdtp = rdp->dynticks;
+
+ /*
+ * If some other CPU has already reported non-idle, if this is
+ * not the flavor of RCU that tracks sysidle state, or if this
+ * is an offline or the timekeeping CPU, nothing to do.
+ */
+ if (!*isidle || rdp->rsp != rcu_sysidle_state ||
+ cpu_is_offline(rdp->cpu) || rdp->cpu == tick_do_timer_cpu)
+ return;
+ /* WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu); */
+
+ /* Pick up current idle and NMI-nesting counter and check. */
+ cur = atomic_read(&rdtp->dynticks_idle);
+ if (cur & 0x1) {
+ *isidle = false; /* We are not idle! */
+ return;
+ }
+ smp_mb(); /* Read counters before timestamps. */
+
+ /* Pick up timestamps. */
+ j = ACCESS_ONCE(rdtp->dynticks_idle_jiffies);
+ /* If this CPU entered idle more recently, update maxj timestamp. */
+ if (ULONG_CMP_LT(*maxj, j))
+ *maxj = j;
+}
+
+/*
+ * Is this the flavor of RCU that is handling full-system idle?
+ */
+static bool is_sysidle_rcu_state(struct rcu_state *rsp)
+{
+ return rsp == rcu_sysidle_state;
+}
+
+/*
+ * Return a delay in jiffies based on the number of CPUs, rcu_node
+ * leaf fanout, and jiffies tick rate. The idea is to allow larger
+ * systems more time to transition to full-idle state in order to
+ * avoid the cache thrashing that otherwise occur on the state variable.
+ * Really small systems (less than a couple of tens of CPUs) should
+ * instead use a single global atomically incremented counter, and later
+ * versions of this will automatically reconfigure themselves accordingly.
+ */
+static unsigned long rcu_sysidle_delay(void)
+{
+ if (nr_cpu_ids <= CONFIG_NO_HZ_FULL_SYSIDLE_SMALL)
+ return 0;
+ return DIV_ROUND_UP(nr_cpu_ids * HZ, rcu_fanout_leaf * 1000);
+}
+
+/*
+ * Advance the full-system-idle state. This is invoked when all of
+ * the non-timekeeping CPUs are idle.
+ */
+static void rcu_sysidle(unsigned long j)
+{
+ /* Check the current state. */
+ switch (ACCESS_ONCE(full_sysidle_state)) {
+ case RCU_SYSIDLE_NOT:
+
+ /* First time all are idle, so note a short idle period. */
+ ACCESS_ONCE(full_sysidle_state) = RCU_SYSIDLE_SHORT;
+ break;
+
+ case RCU_SYSIDLE_SHORT:
+
+ /*
+ * Idle for a bit, time to advance to next state?
+ * cmpxchg failure means race with non-idle, let them win.
+ */
+ if (ULONG_CMP_GE(jiffies, j + rcu_sysidle_delay()))
+ (void)cmpxchg(&full_sysidle_state,
+ RCU_SYSIDLE_SHORT, RCU_SYSIDLE_LONG);
+ break;
+
+ case RCU_SYSIDLE_LONG:
+
+ /*
+ * Do an additional check pass before advancing to full.
+ * cmpxchg failure means race with non-idle, let them win.
+ */
+ if (ULONG_CMP_GE(jiffies, j + rcu_sysidle_delay()))
+ (void)cmpxchg(&full_sysidle_state,
+ RCU_SYSIDLE_LONG, RCU_SYSIDLE_FULL);
+ break;
+
+ default:
+ break;
+ }
+}
+
+/*
+ * Found a non-idle non-timekeeping CPU, so kick the system-idle state
+ * back to the beginning.
+ */
+static void rcu_sysidle_cancel(void)
+{
+ smp_mb();
+ ACCESS_ONCE(full_sysidle_state) = RCU_SYSIDLE_NOT;
+}
+
+/*
+ * Update the sysidle state based on the results of a force-quiescent-state
+ * scan of the CPUs' dyntick-idle state.
+ */
+static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
+ unsigned long maxj, bool gpkt)
+{
+ if (rsp != rcu_sysidle_state)
+ return; /* Wrong flavor, ignore. */
+ if (isidle) {
+ if (gpkt && nr_cpu_ids > CONFIG_NO_HZ_FULL_SYSIDLE_SMALL)
+ rcu_sysidle(maxj); /* More idle! */
+ } else {
+ rcu_sysidle_cancel(); /* Idle is over. */
+ }
+}
+
+static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
+ unsigned long maxj)
+{
+ rcu_sysidle_report(rsp, isidle, maxj, true);
+}
+
+/* Callback and function for forcing an RCU grace period. */
+struct rcu_sysidle_head {
+ struct rcu_head rh;
+ int inuse;
+};
+
+static void rcu_sysidle_cb(struct rcu_head *rhp)
+{
+ struct rcu_sysidle_head *rshp;
+
+ smp_mb(); /* grace period precedes setting inuse. */
+ rshp = container_of(rhp, struct rcu_sysidle_head, rh);
+ ACCESS_ONCE(rshp->inuse) = 0;
+}
+
+/*
+ * Check to see if the system is fully idle, other than the timekeeping CPU.
+ * The caller must have disabled interrupts.
+ */
+bool rcu_sys_is_idle(void)
+{
+ static struct rcu_sysidle_head rsh;
+ int rss = ACCESS_ONCE(full_sysidle_state);
+
+ if (WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu))
+ return false;
+
+ /* Handle small-system case by doing a full scan of CPUs. */
+ if (nr_cpu_ids <= CONFIG_NO_HZ_FULL_SYSIDLE_SMALL) {
+ int oldrss = rss - 1;
+
+ /*
+ * One pass to advance to each state up to _FULL.
+ * Give up if any pass fails to advance the state.
+ */
+ while (rss < RCU_SYSIDLE_FULL && oldrss < rss) {
+ int cpu;
+ bool isidle = true;
+ unsigned long maxj = jiffies - ULONG_MAX / 4;
+ struct rcu_data *rdp;
+
+ /* Scan all the CPUs looking for nonidle CPUs. */
+ for_each_possible_cpu(cpu) {
+ rdp = per_cpu_ptr(rcu_sysidle_state->rda, cpu);
+ rcu_sysidle_check_cpu(rdp, &isidle, &maxj);
+ if (!isidle)
+ break;
+ }
+ rcu_sysidle_report(rcu_sysidle_state,
+ isidle, maxj, false);
+ oldrss = rss;
+ rss = ACCESS_ONCE(full_sysidle_state);
+ }
+ }
+
+ /* If this is the first observation of an idle period, record it. */
+ if (rss == RCU_SYSIDLE_FULL) {
+ rss = cmpxchg(&full_sysidle_state,
+ RCU_SYSIDLE_FULL, RCU_SYSIDLE_FULL_NOTED);
+ return rss == RCU_SYSIDLE_FULL;
+ }
+
+ smp_mb(); /* ensure rss load happens before later caller actions. */
+
+ /* If already fully idle, tell the caller (in case of races). */
+ if (rss == RCU_SYSIDLE_FULL_NOTED)
+ return true;
+
+ /*
+ * If we aren't there yet, and a grace period is not in flight,
+ * initiate a grace period. Either way, tell the caller that
+ * we are not there yet.
+ */
+ if (nr_cpu_ids > CONFIG_NO_HZ_FULL_SYSIDLE_SMALL &&
+ !rcu_gp_in_progress(rcu_sysidle_state) &&
+ !rsh.inuse && xchg(&rsh.inuse, 1) == 0)
+ call_rcu(&rsh.rh, rcu_sysidle_cb);
+ return false;
}

/*
@@ -2483,6 +2744,21 @@ static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq)
{
}

+static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
+ unsigned long *maxj)
+{
+}
+
+static bool is_sysidle_rcu_state(struct rcu_state *rsp)
+{
+ return false;
+}
+
+static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
+ unsigned long maxj)
+{
+}
+
static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp)
{
}
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index c7d2fd6..3381f09 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -157,6 +157,33 @@ config NO_HZ_FULL_SYSIDLE

Say N if you are unsure.

+config NO_HZ_FULL_SYSIDLE_SMALL
+ int "Number of CPUs above which large-system approach is used"
+ depends on NO_HZ_FULL_SYSIDLE
+ range 1 NR_CPUS
+ default 8
+ help
+ The full-system idle detection mechanism takes a lazy approach
+ on large systems, as is required to attain decent scalability.
+ However, on smaller systems, scalability is not anywhere near as
+ large a concern as is energy efficiency. The sysidle subsystem
+ therefore uses a fast but non-scalable algorithm for small
+ systems and a lazier but scalable algorithm for large systems.
+ This Kconfig parameter defines the number of CPUs in the largest
+ system that will be considered to be "small".
+
+ The default value will be fine in most cases. Battery-powered
+ systems that (1) enable NO_HZ_FULL_SYSIDLE, (2) have larger
+ numbers of CPUs, and (3) are suffering from battery-lifetime
+ problems due to long sysidle latencies might wish to experiment
+ with larger values for this Kconfig parameter. On the other
+ hand, they might be even better served by disabling NO_HZ_FULL
+ entirely, given that NO_HZ_FULL is intended for HPC and
+ real-time workloads that at present do not tend to be run on
+ battery-powered systems.
+
+ Take the default if you are unsure.
+
config NO_HZ
bool "Old Idle dynticks config"
depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS
--
1.8.1.5

2013-08-20 02:50:00

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 7/9] nohz_full: Add full-system-idle arguments to API

From: "Paul E. McKenney" <[email protected]>

This commit adds an isidle and jiffies argument to force_qs_rnp(),
dyntick_save_progress_counter(), and rcu_implicit_dynticks_qs() to enable
RCU's force-quiescent-state process to check for full-system idle.

Signed-off-by: Paul E. McKenney <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Lai Jiangshan <[email protected]>
[ paulmck: Use true and false for boolean constants per Lai Jiangshan. ]
Reviewed-by: Josh Triplett <[email protected]>
---
kernel/rcutree.c | 25 ++++++++++++++++++-------
1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index b0d2cc3..7b5be56 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -246,7 +246,10 @@ module_param(jiffies_till_next_fqs, ulong, 0644);

static void rcu_start_gp_advanced(struct rcu_state *rsp, struct rcu_node *rnp,
struct rcu_data *rdp);
-static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *));
+static void force_qs_rnp(struct rcu_state *rsp,
+ int (*f)(struct rcu_data *rsp, bool *isidle,
+ unsigned long *maxj),
+ bool *isidle, unsigned long *maxj);
static void force_quiescent_state(struct rcu_state *rsp);
static int rcu_pending(int cpu);

@@ -727,7 +730,8 @@ static int rcu_is_cpu_rrupt_from_idle(void)
* credit them with an implicit quiescent state. Return 1 if this CPU
* is in dynticks idle mode, which is an extended quiescent state.
*/
-static int dyntick_save_progress_counter(struct rcu_data *rdp)
+static int dyntick_save_progress_counter(struct rcu_data *rdp,
+ bool *isidle, unsigned long *maxj)
{
rdp->dynticks_snap = atomic_add_return(0, &rdp->dynticks->dynticks);
return (rdp->dynticks_snap & 0x1) == 0;
@@ -739,7 +743,8 @@ static int dyntick_save_progress_counter(struct rcu_data *rdp)
* idle state since the last call to dyntick_save_progress_counter()
* for this same CPU, or by virtue of having been offline.
*/
-static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
+static int rcu_implicit_dynticks_qs(struct rcu_data *rdp,
+ bool *isidle, unsigned long *maxj)
{
unsigned int curr;
unsigned int snap;
@@ -1361,16 +1366,19 @@ static int rcu_gp_init(struct rcu_state *rsp)
int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
{
int fqs_state = fqs_state_in;
+ bool isidle = false;
+ unsigned long maxj;
struct rcu_node *rnp = rcu_get_root(rsp);

rsp->n_force_qs++;
if (fqs_state == RCU_SAVE_DYNTICK) {
/* Collect dyntick-idle snapshots. */
- force_qs_rnp(rsp, dyntick_save_progress_counter);
+ force_qs_rnp(rsp, dyntick_save_progress_counter,
+ &isidle, &maxj);
fqs_state = RCU_FORCE_QS;
} else {
/* Handle dyntick-idle and offline CPUs. */
- force_qs_rnp(rsp, rcu_implicit_dynticks_qs);
+ force_qs_rnp(rsp, rcu_implicit_dynticks_qs, &isidle, &maxj);
}
/* Clear flag to prevent immediate re-entry. */
if (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
@@ -2069,7 +2077,10 @@ void rcu_check_callbacks(int cpu, int user)
*
* The caller must have suppressed start of new grace periods.
*/
-static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
+static void force_qs_rnp(struct rcu_state *rsp,
+ int (*f)(struct rcu_data *rsp, bool *isidle,
+ unsigned long *maxj),
+ bool *isidle, unsigned long *maxj)
{
unsigned long bit;
int cpu;
@@ -2093,7 +2104,7 @@ static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
bit = 1;
for (; cpu <= rnp->grphi; cpu++, bit <<= 1) {
if ((rnp->qsmask & bit) != 0 &&
- f(per_cpu_ptr(rsp->rda, cpu)))
+ f(per_cpu_ptr(rsp->rda, cpu), isidle, maxj))
mask |= bit;
}
if (mask != 0) {
--
1.8.1.5

2013-08-26 05:41:07

by Lai Jiangshan

Subject: Re: [PATCH tip/core/rcu 8/9] nohz_full: Add full-system-idle state machine

On 08/20/2013 10:47 AM, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <[email protected]>
>
> This commit adds the state machine that takes the per-CPU idle data
> as input and produces a full-system-idle indication as output. This
> state machine is driven out of RCU's quiescent-state-forcing
> mechanism, which invokes rcu_sysidle_check_cpu() to collect per-CPU
> idle state and then rcu_sysidle_report() to drive the state machine.
>
> The full-system-idle state is sampled using rcu_sys_is_idle(), which
> also drives the state machine if RCU is idle (and does so by forcing
> RCU to become non-idle). This function returns true if all but the
> timekeeping CPU (tick_do_timer_cpu) are idle and have been idle long
> enough to avoid memory contention on the full_sysidle_state state
> variable. The rcu_sysidle_force_exit() may be called externally
> to reset the state machine back into non-idle state.
>
> For large systems the state machine is driven out of RCU's
> force-quiescent-state logic, which provides good scalability at the price
> of millisecond-scale latencies on the transition to full-system-idle
> state. This is not so good for battery-powered systems, which are usually
> small enough that they don't need to care about scalability, but which
> do care deeply about energy efficiency. Small systems therefore drive
> the state machine directly out of the idle-entry code. The number of
> CPUs in a "small" system is defined by a new NO_HZ_FULL_SYSIDLE_SMALL
> Kconfig parameter, which defaults to 8. Note that this is a build-time
> definition.
>
> Signed-off-by: Paul E. McKenney <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Lai Jiangshan <[email protected]>
> [ paulmck: Use true and false for boolean constants per Lai Jiangshan. ]
> Reviewed-by: Josh Triplett <[email protected]>
> ---
> include/linux/rcupdate.h | 18 +++
> kernel/rcutree.c | 16 ++-
> kernel/rcutree.h | 5 +
> kernel/rcutree_plugin.h | 284 ++++++++++++++++++++++++++++++++++++++++++++++-
> kernel/time/Kconfig | 27 +++++
> 5 files changed, 343 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 30bea9c..f1f1bc3 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -1011,4 +1011,22 @@ static inline bool rcu_is_nocb_cpu(int cpu) { return false; }
> #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
>
>
> +/* Only for use by adaptive-ticks code. */
> +#ifdef CONFIG_NO_HZ_FULL_SYSIDLE
> +extern bool rcu_sys_is_idle(void);
> +extern void rcu_sysidle_force_exit(void);
> +#else /* #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
> +
> +static inline bool rcu_sys_is_idle(void)
> +{
> + return false;
> +}
> +
> +static inline void rcu_sysidle_force_exit(void)
> +{
> +}
> +
> +#endif /* #else #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
> +
> +
> #endif /* __LINUX_RCUPDATE_H */
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index 7b5be56..eca70f44 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -734,6 +734,7 @@ static int dyntick_save_progress_counter(struct rcu_data *rdp,
> bool *isidle, unsigned long *maxj)
> {
> rdp->dynticks_snap = atomic_add_return(0, &rdp->dynticks->dynticks);
> + rcu_sysidle_check_cpu(rdp, isidle, maxj);
> return (rdp->dynticks_snap & 0x1) == 0;
> }
>
> @@ -1373,11 +1374,17 @@ int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
> rsp->n_force_qs++;
> if (fqs_state == RCU_SAVE_DYNTICK) {
> /* Collect dyntick-idle snapshots. */
> + if (is_sysidle_rcu_state(rsp)) {
> + isidle = 1;
> + maxj = jiffies - ULONG_MAX / 4;
> + }
> force_qs_rnp(rsp, dyntick_save_progress_counter,
> &isidle, &maxj);
> + rcu_sysidle_report_gp(rsp, isidle, maxj);
> fqs_state = RCU_FORCE_QS;
> } else {
> /* Handle dyntick-idle and offline CPUs. */
> + isidle = 0;
> force_qs_rnp(rsp, rcu_implicit_dynticks_qs, &isidle, &maxj);
> }
> /* Clear flag to prevent immediate re-entry. */
> @@ -2103,9 +2110,12 @@ static void force_qs_rnp(struct rcu_state *rsp,
> cpu = rnp->grplo;
> bit = 1;
> for (; cpu <= rnp->grphi; cpu++, bit <<= 1) {
> - if ((rnp->qsmask & bit) != 0 &&
> - f(per_cpu_ptr(rsp->rda, cpu), isidle, maxj))
> - mask |= bit;
> + if ((rnp->qsmask & bit) != 0) {
> + if ((rnp->qsmaskinit & bit) != 0)
> + *isidle = 0;
> + if (f(per_cpu_ptr(rsp->rda, cpu), isidle, maxj))
> + mask |= bit;
> + }
> }
> if (mask != 0) {
>
> diff --git a/kernel/rcutree.h b/kernel/rcutree.h
> index 9dd8b17..6fd3659 100644
> --- a/kernel/rcutree.h
> +++ b/kernel/rcutree.h
> @@ -555,6 +555,11 @@ static void rcu_kick_nohz_cpu(int cpu);
> static bool init_nocb_callback_list(struct rcu_data *rdp);
> static void rcu_sysidle_enter(struct rcu_dynticks *rdtp, int irq);
> static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq);
> +static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
> + unsigned long *maxj);
> +static bool is_sysidle_rcu_state(struct rcu_state *rsp);
> +static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
> + unsigned long maxj);
> static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp);
>
> #endif /* #ifndef RCU_TREE_NONCORE */
> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> index a7419ce..90c3fba 100644
> --- a/kernel/rcutree_plugin.h
> +++ b/kernel/rcutree_plugin.h
> @@ -28,7 +28,7 @@
> #include <linux/gfp.h>
> #include <linux/oom.h>
> #include <linux/smpboot.h>
> -#include <linux/tick.h>
> +#include "time/tick-internal.h"
>
> #define RCU_KTHREAD_PRIO 1
>
> @@ -2382,12 +2382,12 @@ static void rcu_kick_nohz_cpu(int cpu)
> * most active flavor of RCU.
> */
> #ifdef CONFIG_PREEMPT_RCU
> -static struct rcu_state __maybe_unused *rcu_sysidle_state = &rcu_preempt_state;
> +static struct rcu_state *rcu_sysidle_state = &rcu_preempt_state;
> #else /* #ifdef CONFIG_PREEMPT_RCU */
> -static struct rcu_state __maybe_unused *rcu_sysidle_state = &rcu_sched_state;
> +static struct rcu_state *rcu_sysidle_state = &rcu_sched_state;
> #endif /* #else #ifdef CONFIG_PREEMPT_RCU */
>
> -static int __maybe_unused full_sysidle_state; /* Current system-idle state. */
> +static int full_sysidle_state; /* Current system-idle state. */
> #define RCU_SYSIDLE_NOT 0 /* Some CPU is not idle. */
> #define RCU_SYSIDLE_SHORT 1 /* All CPUs idle for brief period. */
> #define RCU_SYSIDLE_LONG 2 /* All CPUs idle for long enough. */
> @@ -2431,6 +2431,38 @@ static void rcu_sysidle_enter(struct rcu_dynticks *rdtp, int irq)
> }
>
> /*
> + * Unconditionally force exit from full system-idle state. This is
> + * invoked when a normal CPU exits idle, but must be called separately
> + * for the timekeeping CPU (tick_do_timer_cpu). The reason for this
> + * is that the timekeeping CPU is permitted to take scheduling-clock
> + * interrupts while the system is in system-idle state, and of course
> + * rcu_sysidle_exit() has no way of distinguishing a scheduling-clock
> + * interrupt from any other type of interrupt.
> + */
> +void rcu_sysidle_force_exit(void)
> +{
> + int oldstate = ACCESS_ONCE(full_sysidle_state);
> + int newoldstate;
> +
> + /*
> + * Each pass through the following loop attempts to exit full
> + * system-idle state. If contention proves to be a problem,
> + * a trylock-based contention tree could be used here.
> + */
> + while (oldstate > RCU_SYSIDLE_SHORT) {
> + newoldstate = cmpxchg(&full_sysidle_state,
> + oldstate, RCU_SYSIDLE_NOT);
> + if (oldstate == newoldstate &&
> + oldstate == RCU_SYSIDLE_FULL_NOTED) {
> + rcu_kick_nohz_cpu(tick_do_timer_cpu);
> + return; /* We cleared it, done! */
> + }
> + oldstate = newoldstate;
> + }
> + smp_mb(); /* Order initial oldstate fetch vs. later non-idle work. */
> +}
> +
> +/*
> * Invoked to note entry to irq or task transition from idle. Note that
> * usermode execution does -not- count as idle here! The caller must
> * have disabled interrupts.
> @@ -2463,6 +2495,235 @@ static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq)
> atomic_inc(&rdtp->dynticks_idle);
> smp_mb__after_atomic_inc();
> WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks_idle) & 0x1));
> +
> + /*
> + * If we are the timekeeping CPU, we are permitted to be non-idle
> + * during a system-idle state. This must be the case, because
> + * the timekeeping CPU has to take scheduling-clock interrupts
> + * during the time that the system is transitioning to full
> + * system-idle state. This means that the timekeeping CPU must
> + * invoke rcu_sysidle_force_exit() directly if it does anything
> + * more than take a scheduling-clock interrupt.
> + */
> + if (smp_processor_id() == tick_do_timer_cpu)
> + return;
> +
> + /* Update system-idle state: We are clearly no longer fully idle! */
> + rcu_sysidle_force_exit();
> +}
> +
> +/*
> + * Check to see if the current CPU is idle. Note that usermode execution
> + * does not count as idle. The caller must have disabled interrupts.
> + */
> +static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
> + unsigned long *maxj)
> +{
> + int cur;
> + unsigned long j;
> + struct rcu_dynticks *rdtp = rdp->dynticks;
> +
> + /*
> + * If some other CPU has already reported non-idle, if this is
> + * not the flavor of RCU that tracks sysidle state, or if this
> + * is an offline or the timekeeping CPU, nothing to do.
> + */
> + if (!*isidle || rdp->rsp != rcu_sysidle_state ||
> + cpu_is_offline(rdp->cpu) || rdp->cpu == tick_do_timer_cpu)
> + return;
> + /* WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu); */
> +
> + /* Pick up current idle and NMI-nesting counter and check. */
> + cur = atomic_read(&rdtp->dynticks_idle);
> + if (cur & 0x1) {
> + *isidle = false; /* We are not idle! */
> + return;
> + }
> + smp_mb(); /* Read counters before timestamps. */
> +
> + /* Pick up timestamps. */
> + j = ACCESS_ONCE(rdtp->dynticks_idle_jiffies);
> + /* If this CPU entered idle more recently, update maxj timestamp. */
> + if (ULONG_CMP_LT(*maxj, j))
> + *maxj = j;
> +}
> +
> +/*
> + * Is this the flavor of RCU that is handling full-system idle?
> + */
> +static bool is_sysidle_rcu_state(struct rcu_state *rsp)
> +{
> + return rsp == rcu_sysidle_state;
> +}
> +
> +/*
> + * Return a delay in jiffies based on the number of CPUs, rcu_node
> + * leaf fanout, and jiffies tick rate. The idea is to allow larger
> + * systems more time to transition to full-idle state in order to
> + * avoid the cache thrashing that otherwise occur on the state variable.
> + * Really small systems (less than a couple of tens of CPUs) should
> + * instead use a single global atomically incremented counter, and later
> + * versions of this will automatically reconfigure themselves accordingly.
> + */
> +static unsigned long rcu_sysidle_delay(void)
> +{
> + if (nr_cpu_ids <= CONFIG_NO_HZ_FULL_SYSIDLE_SMALL)
> + return 0;
> + return DIV_ROUND_UP(nr_cpu_ids * HZ, rcu_fanout_leaf * 1000);
> +}
> +
> +/*
> + * Advance the full-system-idle state. This is invoked when all of
> + * the non-timekeeping CPUs are idle.
> + */
> +static void rcu_sysidle(unsigned long j)
> +{
> + /* Check the current state. */
> + switch (ACCESS_ONCE(full_sysidle_state)) {
> + case RCU_SYSIDLE_NOT:
> +
> + /* First time all are idle, so note a short idle period. */
> + ACCESS_ONCE(full_sysidle_state) = RCU_SYSIDLE_SHORT;
> + break;
> +
> + case RCU_SYSIDLE_SHORT:
> +
> + /*
> + * Idle for a bit, time to advance to next state?
> + * cmpxchg failure means race with non-idle, let them win.
> + */
> + if (ULONG_CMP_GE(jiffies, j + rcu_sysidle_delay()))
> + (void)cmpxchg(&full_sysidle_state,
> + RCU_SYSIDLE_SHORT, RCU_SYSIDLE_LONG);
> + break;
> +
> + case RCU_SYSIDLE_LONG:
> +
> + /*
> + * Do an additional check pass before advancing to full.
> + * cmpxchg failure means race with non-idle, let them win.
> + */
> + if (ULONG_CMP_GE(jiffies, j + rcu_sysidle_delay()))
> + (void)cmpxchg(&full_sysidle_state,
> + RCU_SYSIDLE_LONG, RCU_SYSIDLE_FULL);
> + break;
> +
> + default:
> + break;
> + }
> +}
> +
> +/*
> + * Found a non-idle non-timekeeping CPU, so kick the system-idle state
> + * back to the beginning.
> + */
> +static void rcu_sysidle_cancel(void)
> +{
> + smp_mb();
> + ACCESS_ONCE(full_sysidle_state) = RCU_SYSIDLE_NOT;
> +}
> +
> +/*
> + * Update the sysidle state based on the results of a force-quiescent-state
> + * scan of the CPUs' dyntick-idle state.
> + */
> +static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
> + unsigned long maxj, bool gpkt)
> +{
> + if (rsp != rcu_sysidle_state)
> + return; /* Wrong flavor, ignore. */
> + if (isidle) {
> + if (gpkt && nr_cpu_ids > CONFIG_NO_HZ_FULL_SYSIDLE_SMALL)
> + rcu_sysidle(maxj); /* More idle! */
> + } else {
> + rcu_sysidle_cancel(); /* Idle is over. */
> + }

"gpkt" is always equal to "nr_cpu_ids > RCU_SYSIDLE_SMALL",

so we can remove the "gpkt" argument and rcu_sysidle_report_gp().

> +}
> +
> +static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
> + unsigned long maxj)
> +{
> + rcu_sysidle_report(rsp, isidle, maxj, true);
> +}
> +
> +/* Callback and function for forcing an RCU grace period. */
> +struct rcu_sysidle_head {
> + struct rcu_head rh;
> + int inuse;
> +};
> +
> +static void rcu_sysidle_cb(struct rcu_head *rhp)
> +{
> + struct rcu_sysidle_head *rshp;
> +
> + smp_mb(); /* grace period precedes setting inuse. */

Why do we need this mb()?


> + rshp = container_of(rhp, struct rcu_sysidle_head, rh);
> + ACCESS_ONCE(rshp->inuse) = 0;
> +}
> +
> +/*
> + * Check to see if the system is fully idle, other than the timekeeping CPU.
> + * The caller must have disabled interrupts.
> + */
> +bool rcu_sys_is_idle(void)
> +{
> + static struct rcu_sysidle_head rsh;
> + int rss = ACCESS_ONCE(full_sysidle_state);
> +
> + if (WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu))
> + return false;
> +
> + /* Handle small-system case by doing a full scan of CPUs. */
> + if (nr_cpu_ids <= CONFIG_NO_HZ_FULL_SYSIDLE_SMALL) {
> + int oldrss = rss - 1;
> +
> + /*
> + * One pass to advance to each state up to _FULL.
> + * Give up if any pass fails to advance the state.
> + */
> + while (rss < RCU_SYSIDLE_FULL && oldrss < rss) {
> + int cpu;
> + bool isidle = true;
> + unsigned long maxj = jiffies - ULONG_MAX / 4;
> + struct rcu_data *rdp;
> +
> + /* Scan all the CPUs looking for nonidle CPUs. */
> + for_each_possible_cpu(cpu) {
> + rdp = per_cpu_ptr(rcu_sysidle_state->rda, cpu);
> + rcu_sysidle_check_cpu(rdp, &isidle, &maxj);
> + if (!isidle)
> + break;
> + }
> + rcu_sysidle_report(rcu_sysidle_state,
> + isidle, maxj, false);
> + oldrss = rss;
> + rss = ACCESS_ONCE(full_sysidle_state);
> + }
> + }

I don't think it is a good idea to move the overhead to fqs when nr_cpu_ids > 8:
the total overhead will not be reduced, and it may even increase.

I think we can instead do the calculation on the timekeeping CPU when
nr_cpu_ids > 8 && the timekeeping CPU is idle.

> +
> + /* If this is the first observation of an idle period, record it. */
> + if (rss == RCU_SYSIDLE_FULL) {
> + rss = cmpxchg(&full_sysidle_state,
> + RCU_SYSIDLE_FULL, RCU_SYSIDLE_FULL_NOTED);
> + return rss == RCU_SYSIDLE_FULL;
> + }
> +
> + smp_mb(); /* ensure rss load happens before later caller actions. */
> +
> + /* If already fully idle, tell the caller (in case of races). */
> + if (rss == RCU_SYSIDLE_FULL_NOTED)
> + return true;
> +
> + /*
> + * If we aren't there yet, and a grace period is not in flight,
> + * initiate a grace period. Either way, tell the caller that
> + * we are not there yet.
> + */
> + if (nr_cpu_ids > CONFIG_NO_HZ_FULL_SYSIDLE_SMALL &&
> + !rcu_gp_in_progress(rcu_sysidle_state) &&
> + !rsh.inuse && xchg(&rsh.inuse, 1) == 0)
> + call_rcu(&rsh.rh, rcu_sysidle_cb);

Why do we need to use xchg()? Who will it race with?


Thanks,
Lai


> + return false;
> }
>
> /*
> @@ -2483,6 +2744,21 @@ static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq)
> {
> }
>
> +static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
> + unsigned long *maxj)
> +{
> +}
> +
> +static bool is_sysidle_rcu_state(struct rcu_state *rsp)
> +{
> + return false;
> +}
> +
> +static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
> + unsigned long maxj)
> +{
> +}
> +
> static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp)
> {
> }
> diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
> index c7d2fd6..3381f09 100644
> --- a/kernel/time/Kconfig
> +++ b/kernel/time/Kconfig
> @@ -157,6 +157,33 @@ config NO_HZ_FULL_SYSIDLE
>
> Say N if you are unsure.
>
> +config NO_HZ_FULL_SYSIDLE_SMALL
> + int "Number of CPUs above which large-system approach is used"
> + depends on NO_HZ_FULL_SYSIDLE
> + range 1 NR_CPUS
> + default 8
> + help
> + The full-system idle detection mechanism takes a lazy approach
> + on large systems, as is required to attain decent scalability.
> + However, on smaller systems, scalability is not anywhere near as
> + large a concern as is energy efficiency. The sysidle subsystem
> + therefore uses a fast but non-scalable algorithm for small
> + systems and a lazier but scalable algorithm for large systems.
> + This Kconfig parameter defines the number of CPUs in the largest
> + system that will be considered to be "small".
> +
> + The default value will be fine in most cases. Battery-powered
> + systems that (1) enable NO_HZ_FULL_SYSIDLE, (2) have larger
> + numbers of CPUs, and (3) are suffering from battery-lifetime
> + problems due to long sysidle latencies might wish to experiment
> + with larger values for this Kconfig parameter. On the other
> + hand, they might be even better served by disabling NO_HZ_FULL
> + entirely, given that NO_HZ_FULL is intended for HPC and
> + real-time workloads that at present do not tend to be run on
> + battery-powered systems.
> +
> + Take the default if you are unsure.
> +
> config NO_HZ
> bool "Old Idle dynticks config"
> depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS

2013-08-26 16:25:01

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 8/9] nohz_full: Add full-system-idle state machine

On Mon, Aug 26, 2013 at 01:45:32PM +0800, Lai Jiangshan wrote:
> On 08/20/2013 10:47 AM, Paul E. McKenney wrote:
> > From: "Paul E. McKenney" <[email protected]>
> >
> > This commit adds the state machine that takes the per-CPU idle data
> > as input and produces a full-system-idle indication as output. This
> > state machine is driven out of RCU's quiescent-state-forcing
> > mechanism, which invokes rcu_sysidle_check_cpu() to collect per-CPU
> > idle state and then rcu_sysidle_report() to drive the state machine.
> >
> > The full-system-idle state is sampled using rcu_sys_is_idle(), which
> > also drives the state machine if RCU is idle (and does so by forcing
> > RCU to become non-idle). This function returns true if all but the
> > timekeeping CPU (tick_do_timer_cpu) are idle and have been idle long
> > enough to avoid memory contention on the full_sysidle_state state
> > variable. The rcu_sysidle_force_exit() may be called externally
> > to reset the state machine back into non-idle state.
> >
> > For large systems the state machine is driven out of RCU's
> > force-quiescent-state logic, which provides good scalability at the price
> > of millisecond-scale latencies on the transition to full-system-idle
> > state. This is not so good for battery-powered systems, which are usually
> > small enough that they don't need to care about scalability, but which
> > do care deeply about energy efficiency. Small systems therefore drive
> > the state machine directly out of the idle-entry code. The number of
> > CPUs in a "small" system is defined by a new NO_HZ_FULL_SYSIDLE_SMALL
> > Kconfig parameter, which defaults to 8. Note that this is a build-time
> > definition.
> >
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > Cc: Frederic Weisbecker <[email protected]>
> > Cc: Steven Rostedt <[email protected]>
> > Cc: Lai Jiangshan <[email protected]>
> > [ paulmck: Use true and false for boolean constants per Lai Jiangshan. ]
> > Reviewed-by: Josh Triplett <[email protected]>
> > ---
> > include/linux/rcupdate.h | 18 +++
> > kernel/rcutree.c | 16 ++-
> > kernel/rcutree.h | 5 +
> > kernel/rcutree_plugin.h | 284 ++++++++++++++++++++++++++++++++++++++++++++++-
> > kernel/time/Kconfig | 27 +++++
> > 5 files changed, 343 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > index 30bea9c..f1f1bc3 100644
> > --- a/include/linux/rcupdate.h
> > +++ b/include/linux/rcupdate.h
> > @@ -1011,4 +1011,22 @@ static inline bool rcu_is_nocb_cpu(int cpu) { return false; }
> > #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
> >
> >
> > +/* Only for use by adaptive-ticks code. */
> > +#ifdef CONFIG_NO_HZ_FULL_SYSIDLE
> > +extern bool rcu_sys_is_idle(void);
> > +extern void rcu_sysidle_force_exit(void);
> > +#else /* #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
> > +
> > +static inline bool rcu_sys_is_idle(void)
> > +{
> > + return false;
> > +}
> > +
> > +static inline void rcu_sysidle_force_exit(void)
> > +{
> > +}
> > +
> > +#endif /* #else #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
> > +
> > +
> > #endif /* __LINUX_RCUPDATE_H */
> > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > index 7b5be56..eca70f44 100644
> > --- a/kernel/rcutree.c
> > +++ b/kernel/rcutree.c
> > @@ -734,6 +734,7 @@ static int dyntick_save_progress_counter(struct rcu_data *rdp,
> > bool *isidle, unsigned long *maxj)
> > {
> > rdp->dynticks_snap = atomic_add_return(0, &rdp->dynticks->dynticks);
> > + rcu_sysidle_check_cpu(rdp, isidle, maxj);
> > return (rdp->dynticks_snap & 0x1) == 0;
> > }
> >
> > @@ -1373,11 +1374,17 @@ int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
> > rsp->n_force_qs++;
> > if (fqs_state == RCU_SAVE_DYNTICK) {
> > /* Collect dyntick-idle snapshots. */
> > + if (is_sysidle_rcu_state(rsp)) {
> > + isidle = 1;
> > + maxj = jiffies - ULONG_MAX / 4;
> > + }
> > force_qs_rnp(rsp, dyntick_save_progress_counter,
> > &isidle, &maxj);
> > + rcu_sysidle_report_gp(rsp, isidle, maxj);
> > fqs_state = RCU_FORCE_QS;
> > } else {
> > /* Handle dyntick-idle and offline CPUs. */
> > + isidle = 0;
> > force_qs_rnp(rsp, rcu_implicit_dynticks_qs, &isidle, &maxj);
> > }
> > /* Clear flag to prevent immediate re-entry. */
> > @@ -2103,9 +2110,12 @@ static void force_qs_rnp(struct rcu_state *rsp,
> > cpu = rnp->grplo;
> > bit = 1;
> > for (; cpu <= rnp->grphi; cpu++, bit <<= 1) {
> > - if ((rnp->qsmask & bit) != 0 &&
> > - f(per_cpu_ptr(rsp->rda, cpu), isidle, maxj))
> > - mask |= bit;
> > + if ((rnp->qsmask & bit) != 0) {
> > + if ((rnp->qsmaskinit & bit) != 0)
> > + *isidle = 0;
> > + if (f(per_cpu_ptr(rsp->rda, cpu), isidle, maxj))
> > + mask |= bit;
> > + }
> > }
> > if (mask != 0) {
> >
> > diff --git a/kernel/rcutree.h b/kernel/rcutree.h
> > index 9dd8b17..6fd3659 100644
> > --- a/kernel/rcutree.h
> > +++ b/kernel/rcutree.h
> > @@ -555,6 +555,11 @@ static void rcu_kick_nohz_cpu(int cpu);
> > static bool init_nocb_callback_list(struct rcu_data *rdp);
> > static void rcu_sysidle_enter(struct rcu_dynticks *rdtp, int irq);
> > static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq);
> > +static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
> > + unsigned long *maxj);
> > +static bool is_sysidle_rcu_state(struct rcu_state *rsp);
> > +static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
> > + unsigned long maxj);
> > static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp);
> >
> > #endif /* #ifndef RCU_TREE_NONCORE */
> > diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> > index a7419ce..90c3fba 100644
> > --- a/kernel/rcutree_plugin.h
> > +++ b/kernel/rcutree_plugin.h
> > @@ -28,7 +28,7 @@
> > #include <linux/gfp.h>
> > #include <linux/oom.h>
> > #include <linux/smpboot.h>
> > -#include <linux/tick.h>
> > +#include "time/tick-internal.h"
> >
> > #define RCU_KTHREAD_PRIO 1
> >
> > @@ -2382,12 +2382,12 @@ static void rcu_kick_nohz_cpu(int cpu)
> > * most active flavor of RCU.
> > */
> > #ifdef CONFIG_PREEMPT_RCU
> > -static struct rcu_state __maybe_unused *rcu_sysidle_state = &rcu_preempt_state;
> > +static struct rcu_state *rcu_sysidle_state = &rcu_preempt_state;
> > #else /* #ifdef CONFIG_PREEMPT_RCU */
> > -static struct rcu_state __maybe_unused *rcu_sysidle_state = &rcu_sched_state;
> > +static struct rcu_state *rcu_sysidle_state = &rcu_sched_state;
> > #endif /* #else #ifdef CONFIG_PREEMPT_RCU */
> >
> > -static int __maybe_unused full_sysidle_state; /* Current system-idle state. */
> > +static int full_sysidle_state; /* Current system-idle state. */
> > #define RCU_SYSIDLE_NOT 0 /* Some CPU is not idle. */
> > #define RCU_SYSIDLE_SHORT 1 /* All CPUs idle for brief period. */
> > #define RCU_SYSIDLE_LONG 2 /* All CPUs idle for long enough. */
> > @@ -2431,6 +2431,38 @@ static void rcu_sysidle_enter(struct rcu_dynticks *rdtp, int irq)
> > }
> >
> > /*
> > + * Unconditionally force exit from full system-idle state. This is
> > + * invoked when a normal CPU exits idle, but must be called separately
> > + * for the timekeeping CPU (tick_do_timer_cpu). The reason for this
> > + * is that the timekeeping CPU is permitted to take scheduling-clock
> > + * interrupts while the system is in system-idle state, and of course
> > + * rcu_sysidle_exit() has no way of distinguishing a scheduling-clock
> > + * interrupt from any other type of interrupt.
> > + */
> > +void rcu_sysidle_force_exit(void)
> > +{
> > + int oldstate = ACCESS_ONCE(full_sysidle_state);
> > + int newoldstate;
> > +
> > + /*
> > + * Each pass through the following loop attempts to exit full
> > + * system-idle state. If contention proves to be a problem,
> > + * a trylock-based contention tree could be used here.
> > + */
> > + while (oldstate > RCU_SYSIDLE_SHORT) {
> > + newoldstate = cmpxchg(&full_sysidle_state,
> > + oldstate, RCU_SYSIDLE_NOT);
> > + if (oldstate == newoldstate &&
> > + oldstate == RCU_SYSIDLE_FULL_NOTED) {
> > + rcu_kick_nohz_cpu(tick_do_timer_cpu);
> > + return; /* We cleared it, done! */
> > + }
> > + oldstate = newoldstate;
> > + }
> > + smp_mb(); /* Order initial oldstate fetch vs. later non-idle work. */
> > +}
> > +
> > +/*
> > * Invoked to note entry to irq or task transition from idle. Note that
> > * usermode execution does -not- count as idle here! The caller must
> > * have disabled interrupts.
> > @@ -2463,6 +2495,235 @@ static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq)
> > atomic_inc(&rdtp->dynticks_idle);
> > smp_mb__after_atomic_inc();
> > WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks_idle) & 0x1));
> > +
> > + /*
> > + * If we are the timekeeping CPU, we are permitted to be non-idle
> > + * during a system-idle state. This must be the case, because
> > + * the timekeeping CPU has to take scheduling-clock interrupts
> > + * during the time that the system is transitioning to full
> > + * system-idle state. This means that the timekeeping CPU must
> > + * invoke rcu_sysidle_force_exit() directly if it does anything
> > + * more than take a scheduling-clock interrupt.
> > + */
> > + if (smp_processor_id() == tick_do_timer_cpu)
> > + return;
> > +
> > + /* Update system-idle state: We are clearly no longer fully idle! */
> > + rcu_sysidle_force_exit();
> > +}
> > +
> > +/*
> > + * Check to see if the current CPU is idle. Note that usermode execution
> > + * does not count as idle. The caller must have disabled interrupts.
> > + */
> > +static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
> > + unsigned long *maxj)
> > +{
> > + int cur;
> > + unsigned long j;
> > + struct rcu_dynticks *rdtp = rdp->dynticks;
> > +
> > + /*
> > + * If some other CPU has already reported non-idle, if this is
> > + * not the flavor of RCU that tracks sysidle state, or if this
> > + * is an offline or the timekeeping CPU, nothing to do.
> > + */
> > + if (!*isidle || rdp->rsp != rcu_sysidle_state ||
> > + cpu_is_offline(rdp->cpu) || rdp->cpu == tick_do_timer_cpu)
> > + return;
> > + /* WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu); */
> > +
> > + /* Pick up current idle and NMI-nesting counter and check. */
> > + cur = atomic_read(&rdtp->dynticks_idle);
> > + if (cur & 0x1) {
> > + *isidle = false; /* We are not idle! */
> > + return;
> > + }
> > + smp_mb(); /* Read counters before timestamps. */
> > +
> > + /* Pick up timestamps. */
> > + j = ACCESS_ONCE(rdtp->dynticks_idle_jiffies);
> > + /* If this CPU entered idle more recently, update maxj timestamp. */
> > + if (ULONG_CMP_LT(*maxj, j))
> > + *maxj = j;
> > +}
> > +
> > +/*
> > + * Is this the flavor of RCU that is handling full-system idle?
> > + */
> > +static bool is_sysidle_rcu_state(struct rcu_state *rsp)
> > +{
> > + return rsp == rcu_sysidle_state;
> > +}
> > +
> > +/*
> > + * Return a delay in jiffies based on the number of CPUs, rcu_node
> > + * leaf fanout, and jiffies tick rate. The idea is to allow larger
> > + * systems more time to transition to full-idle state in order to
> > + * avoid the cache thrashing that otherwise occur on the state variable.
> > + * Really small systems (less than a couple of tens of CPUs) should
> > + * instead use a single global atomically incremented counter, and later
> > + * versions of this will automatically reconfigure themselves accordingly.
> > + */
> > +static unsigned long rcu_sysidle_delay(void)
> > +{
> > + if (nr_cpu_ids <= CONFIG_NO_HZ_FULL_SYSIDLE_SMALL)
> > + return 0;
> > + return DIV_ROUND_UP(nr_cpu_ids * HZ, rcu_fanout_leaf * 1000);
> > +}
> > +
> > +/*
> > + * Advance the full-system-idle state. This is invoked when all of
> > + * the non-timekeeping CPUs are idle.
> > + */
> > +static void rcu_sysidle(unsigned long j)
> > +{
> > + /* Check the current state. */
> > + switch (ACCESS_ONCE(full_sysidle_state)) {
> > + case RCU_SYSIDLE_NOT:
> > +
> > + /* First time all are idle, so note a short idle period. */
> > + ACCESS_ONCE(full_sysidle_state) = RCU_SYSIDLE_SHORT;
> > + break;
> > +
> > + case RCU_SYSIDLE_SHORT:
> > +
> > + /*
> > + * Idle for a bit, time to advance to next state?
> > + * cmpxchg failure means race with non-idle, let them win.
> > + */
> > + if (ULONG_CMP_GE(jiffies, j + rcu_sysidle_delay()))
> > + (void)cmpxchg(&full_sysidle_state,
> > + RCU_SYSIDLE_SHORT, RCU_SYSIDLE_LONG);
> > + break;
> > +
> > + case RCU_SYSIDLE_LONG:
> > +
> > + /*
> > + * Do an additional check pass before advancing to full.
> > + * cmpxchg failure means race with non-idle, let them win.
> > + */
> > + if (ULONG_CMP_GE(jiffies, j + rcu_sysidle_delay()))
> > + (void)cmpxchg(&full_sysidle_state,
> > + RCU_SYSIDLE_LONG, RCU_SYSIDLE_FULL);
> > + break;
> > +
> > + default:
> > + break;
> > + }
> > +}
> > +
> > +/*
> > + * Found a non-idle non-timekeeping CPU, so kick the system-idle state
> > + * back to the beginning.
> > + */
> > +static void rcu_sysidle_cancel(void)
> > +{
> > + smp_mb();
> > + ACCESS_ONCE(full_sysidle_state) = RCU_SYSIDLE_NOT;
> > +}
> > +
> > +/*
> > + * Update the sysidle state based on the results of a force-quiescent-state
> > + * scan of the CPUs' dyntick-idle state.
> > + */
> > +static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
> > + unsigned long maxj, bool gpkt)
> > +{
> > + if (rsp != rcu_sysidle_state)
> > + return; /* Wrong flavor, ignore. */
> > + if (isidle) {
> > + if (gpkt && nr_cpu_ids > CONFIG_NO_HZ_FULL_SYSIDLE_SMALL)
> > + rcu_sysidle(maxj); /* More idle! */
> > + } else {
> > + rcu_sysidle_cancel(); /* Idle is over. */
> > + }
>
> "gpkt" is always equal to "nr_cpu_ids > RCU_SYSIDLE_SMALL",

Almost. When rcu_sysidle_report() is called from the grace-period
kthread, gpkt is true. So when nr_cpu_ids > RCU_SYSIDLE_SMALL, gpkt
will always be true, but when nr_cpu_ids <= RCU_SYSIDLE_SMALL, gpkt can
be either true or false.

> so we can remove the "gpkt" argument and rcu_sysidle_report_gp().

If we do that, when nr_cpu_ids <= RCU_SYSIDLE_SMALL there can be multiple
tasks manipulating the state machine concurrently. So let's not. ;-)

That said, the (gpkt && nr_cpu_ids > CONFIG_NO_HZ_FULL_SYSIDLE_SMALL)
check should be outside of the (isidle) check in order to avoid having
multiple tasks manipulate the rcu_sysidle_state variable, fixed.
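
For concreteness, a minimal sketch of that rearrangement (an illustration
of the fix described above, not necessarily the exact committed code):

static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
			       unsigned long maxj, bool gpkt)
{
	if (rsp != rcu_sysidle_state)
		return;  /* Wrong flavor, ignore. */
	if (gpkt && nr_cpu_ids <= CONFIG_NO_HZ_FULL_SYSIDLE_SMALL)
		return;  /* Small systems drive state machine from timekeeping CPU. */
	if (isidle)
		rcu_sysidle(maxj);     /* More idle! */
	else
		rcu_sysidle_cancel();  /* Idle is over. */
}

With the check hoisted out this way, a grace-period-kthread call on a
small system returns early, so only rcu_sys_is_idle() on the timekeeping
CPU manipulates full_sysidle_state on such systems.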

> > +}
> > +
> > +static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
> > + unsigned long maxj)
> > +{
> > + rcu_sysidle_report(rsp, isidle, maxj, true);
> > +}
> > +
> > +/* Callback and function for forcing an RCU grace period. */
> > +struct rcu_sysidle_head {
> > + struct rcu_head rh;
> > + int inuse;
> > +};
> > +
> > +static void rcu_sysidle_cb(struct rcu_head *rhp)
> > +{
> > + struct rcu_sysidle_head *rshp;
> > +
> > + smp_mb(); /* grace period precedes setting inuse. */
>
> Why do we need this mb()?

I put it there because we are not using a memory allocator, and thus
cannot rely on the memory barriers that would be executed as part of
the memory allocator should this callback get migrated to some other CPU.

Can you prove that I don't need it?

> > + rshp = container_of(rhp, struct rcu_sysidle_head, rh);
> > + ACCESS_ONCE(rshp->inuse) = 0;
> > +}
> > +
> > +/*
> > + * Check to see if the system is fully idle, other than the timekeeping CPU.
> > + * The caller must have disabled interrupts.
> > + */
> > +bool rcu_sys_is_idle(void)
> > +{
> > + static struct rcu_sysidle_head rsh;
> > + int rss = ACCESS_ONCE(full_sysidle_state);
> > +
> > + if (WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu))
> > + return false;
> > +
> > + /* Handle small-system case by doing a full scan of CPUs. */
> > + if (nr_cpu_ids <= CONFIG_NO_HZ_FULL_SYSIDLE_SMALL) {
> > + int oldrss = rss - 1;
> > +
> > + /*
> > + * One pass to advance to each state up to _FULL.
> > + * Give up if any pass fails to advance the state.
> > + */
> > + while (rss < RCU_SYSIDLE_FULL && oldrss < rss) {
> > + int cpu;
> > + bool isidle = true;
> > + unsigned long maxj = jiffies - ULONG_MAX / 4;
> > + struct rcu_data *rdp;
> > +
> > + /* Scan all the CPUs looking for nonidle CPUs. */
> > + for_each_possible_cpu(cpu) {
> > + rdp = per_cpu_ptr(rcu_sysidle_state->rda, cpu);
> > + rcu_sysidle_check_cpu(rdp, &isidle, &maxj);
> > + if (!isidle)
> > + break;
> > + }
> > + rcu_sysidle_report(rcu_sysidle_state,
> > + isidle, maxj, false);
> > + oldrss = rss;
> > + rss = ACCESS_ONCE(full_sysidle_state);
> > + }
> > + }
>
> I don't think it is a good idea to move the overhead to fqs when nr_cpu_ids > 8:
> the total overhead will not be reduced, and it may even increase.
>
> I think we can instead do the calculation on the timekeeping CPU when
> nr_cpu_ids > 8 && the timekeeping CPU is idle.

If we do that, then the timekeeping CPU could scan most of the CPUs
multiple times when it transitions to idle. This would probably not make
the people running systems with hundreds of CPUs very happy.

Now it might well be that CONFIG_NO_HZ_FULL_SYSIDLE_SMALL needs to be
larger, but that is one reason why it is a Kconfig variable.
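
To get a rough feel for the large-system transition latency at stake
here, take the rcu_sysidle_delay() formula from the patch and plug in a
hypothetical 256-CPU system with HZ=1000 and rcu_fanout_leaf=16 (these
numbers are purely illustrative):

	delay = DIV_ROUND_UP(nr_cpu_ids * HZ, rcu_fanout_leaf * 1000)
	      = DIV_ROUND_UP(256 * 1000, 16 * 1000)
	      = 16 jiffies, or 16 ms at HZ=1000

The state machine then requires at least one more force-quiescent-state
pass in the RCU_SYSIDLE_LONG state before advancing to RCU_SYSIDLE_FULL,
consistent with the millisecond-scale latencies mentioned in the commit log.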

> > +
> > + /* If this is the first observation of an idle period, record it. */
> > + if (rss == RCU_SYSIDLE_FULL) {
> > + rss = cmpxchg(&full_sysidle_state,
> > + RCU_SYSIDLE_FULL, RCU_SYSIDLE_FULL_NOTED);
> > + return rss == RCU_SYSIDLE_FULL;
> > + }
> > +
> > + smp_mb(); /* ensure rss load happens before later caller actions. */
> > +
> > + /* If already fully idle, tell the caller (in case of races). */
> > + if (rss == RCU_SYSIDLE_FULL_NOTED)
> > + return true;
> > +
> > + /*
> > + * If we aren't there yet, and a grace period is not in flight,
> > + * initiate a grace period. Either way, tell the caller that
> > + * we are not there yet.
> > + */
> > + if (nr_cpu_ids > CONFIG_NO_HZ_FULL_SYSIDLE_SMALL &&
> > + !rcu_gp_in_progress(rcu_sysidle_state) &&
> > + !rsh.inuse && xchg(&rsh.inuse, 1) == 0)
> > + call_rcu(&rsh.rh, rcu_sysidle_cb);
>
> Why do we need to use xchg()? Who will it race with?

The xchg() is taking the place of the memory barriers that would otherwise
be present in the memory allocator -- the RCU callback would free itself,
and we would allocate it here. This memory barrier pairs with the one
you asked about in rcu_sysidle_cb().

So the intent is that if this code sees rsh.inuse==0, then the call_rcu()
won't have to worry about rcu_do_batch() messing with the callback.

For example, suppose that we some day make rcu_do_batch NULL out each
callback's ->next pointer for debugging purposes. We would then need
both the memory barrier in rcu_sysidle_cb() and the memory barriers in
the xchg() to ensure that call_rcu()'s assignment to ->next happened
after rcu_do_batch()'s NULLing of the ->next pointer.
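
To make the pairing concrete, a sketch of the two sides (the NULLing of
->next is the hypothetical debugging change mentioned above):

	/* Callback side, at the end of the grace period: */
	rhp->next = NULL;		/* Hypothetical rcu_do_batch() debug store. */
	smp_mb();			/* Barrier A: GP work before inuse release. */
	ACCESS_ONCE(rshp->inuse) = 0;

	/* Timekeeping-CPU side, in rcu_sys_is_idle(): */
	if (!rsh.inuse && xchg(&rsh.inuse, 1) == 0)	/* Barrier B: full barrier. */
		call_rcu(&rsh.rh, rcu_sysidle_cb);	/* Stores to ->next, etc. */

If the xchg() at B observes inuse==0, barrier A pairs with B's implied
full barrier, so the NULLing of ->next is guaranteed to happen before
call_rcu() links the rcu_head back onto a callback list.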

Thanx, Paul

> Thanks,
> Lai
>
>
> > + return false;
> > }
> >
> > /*
> > @@ -2483,6 +2744,21 @@ static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq)
> > {
> > }
> >
> > +static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
> > + unsigned long *maxj)
> > +{
> > +}
> > +
> > +static bool is_sysidle_rcu_state(struct rcu_state *rsp)
> > +{
> > + return false;
> > +}
> > +
> > +static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
> > + unsigned long maxj)
> > +{
> > +}
> > +
> > static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp)
> > {
> > }
> > diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
> > index c7d2fd6..3381f09 100644
> > --- a/kernel/time/Kconfig
> > +++ b/kernel/time/Kconfig
> > @@ -157,6 +157,33 @@ config NO_HZ_FULL_SYSIDLE
> >
> > Say N if you are unsure.
> >
> > +config NO_HZ_FULL_SYSIDLE_SMALL
> > + int "Number of CPUs above which large-system approach is used"
> > + depends on NO_HZ_FULL_SYSIDLE
> > + range 1 NR_CPUS
> > + default 8
> > + help
> > + The full-system idle detection mechanism takes a lazy approach
> > + on large systems, as is required to attain decent scalability.
> > + However, on smaller systems, scalability is not anywhere near as
> > + large a concern as is energy efficiency. The sysidle subsystem
> > + therefore uses a fast but non-scalable algorithm for small
> > + systems and a lazier but scalable algorithm for large systems.
> > + This Kconfig parameter defines the number of CPUs in the largest
> > + system that will be considered to be "small".
> > +
> > + The default value will be fine in most cases. Battery-powered
> > + systems that (1) enable NO_HZ_FULL_SYSIDLE, (2) have larger
> > + numbers of CPUs, and (3) are suffering from battery-lifetime
> > + problems due to long sysidle latencies might wish to experiment
> > + with larger values for this Kconfig parameter. On the other
> > + hand, they might be even better served by disabling NO_HZ_FULL
> > + entirely, given that NO_HZ_FULL is intended for HPC and
> > + real-time workloads that at present do not tend to be run on
> > + battery-powered systems.
> > +
> > + Take the default if you are unsure.
> > +
> > config NO_HZ
> > bool "Old Idle dynticks config"
> > depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS
>

2013-08-27 03:37:20

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 8/9] nohz_full: Add full-system-idle state machine

On 08/27/2013 12:24 AM, Paul E. McKenney wrote:
> On Mon, Aug 26, 2013 at 01:45:32PM +0800, Lai Jiangshan wrote:
>> On 08/20/2013 10:47 AM, Paul E. McKenney wrote:
>>> From: "Paul E. McKenney" <[email protected]>
>>>
>>> This commit adds the state machine that takes the per-CPU idle data
>>> as input and produces a full-system-idle indication as output. This
>>> state machine is driven out of RCU's quiescent-state-forcing
>>> mechanism, which invokes rcu_sysidle_check_cpu() to collect per-CPU
>>> idle state and then rcu_sysidle_report() to drive the state machine.
>>>
>>> The full-system-idle state is sampled using rcu_sys_is_idle(), which
>>> also drives the state machine if RCU is idle (and does so by forcing
>>> RCU to become non-idle). This function returns true if all but the
>>> timekeeping CPU (tick_do_timer_cpu) are idle and have been idle long
>>> enough to avoid memory contention on the full_sysidle_state state
>>> variable. The rcu_sysidle_force_exit() may be called externally
>>> to reset the state machine back into non-idle state.
>>>
>>> For large systems the state machine is driven out of RCU's
>>> force-quiescent-state logic, which provides good scalability at the price
>>> of millisecond-scale latencies on the transition to full-system-idle
>>> state. This is not so good for battery-powered systems, which are usually
>>> small enough that they don't need to care about scalability, but which
>>> do care deeply about energy efficiency. Small systems therefore drive
>>> the state machine directly out of the idle-entry code. The number of
>>> CPUs in a "small" system is defined by a new NO_HZ_FULL_SYSIDLE_SMALL
>>> Kconfig parameter, which defaults to 8. Note that this is a build-time
>>> definition.
>>>
>>> Signed-off-by: Paul E. McKenney <[email protected]>
>>> Cc: Frederic Weisbecker <[email protected]>
>>> Cc: Steven Rostedt <[email protected]>
>>> Cc: Lai Jiangshan <[email protected]>
>>> [ paulmck: Use true and false for boolean constants per Lai Jiangshan. ]
>>> Reviewed-by: Josh Triplett <[email protected]>
>>> ---
>>> include/linux/rcupdate.h | 18 +++
>>> kernel/rcutree.c | 16 ++-
>>> kernel/rcutree.h | 5 +
>>> kernel/rcutree_plugin.h | 284 ++++++++++++++++++++++++++++++++++++++++++++++-
>>> kernel/time/Kconfig | 27 +++++
>>> 5 files changed, 343 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
>>> index 30bea9c..f1f1bc3 100644
>>> --- a/include/linux/rcupdate.h
>>> +++ b/include/linux/rcupdate.h
>>> @@ -1011,4 +1011,22 @@ static inline bool rcu_is_nocb_cpu(int cpu) { return false; }
>>> #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
>>>
>>>
>>> +/* Only for use by adaptive-ticks code. */
>>> +#ifdef CONFIG_NO_HZ_FULL_SYSIDLE
>>> +extern bool rcu_sys_is_idle(void);
>>> +extern void rcu_sysidle_force_exit(void);
>>> +#else /* #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
>>> +
>>> +static inline bool rcu_sys_is_idle(void)
>>> +{
>>> + return false;
>>> +}
>>> +
>>> +static inline void rcu_sysidle_force_exit(void)
>>> +{
>>> +}
>>> +
>>> +#endif /* #else #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
>>> +
>>> +
>>> #endif /* __LINUX_RCUPDATE_H */
>>> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
>>> index 7b5be56..eca70f44 100644
>>> --- a/kernel/rcutree.c
>>> +++ b/kernel/rcutree.c
>>> @@ -734,6 +734,7 @@ static int dyntick_save_progress_counter(struct rcu_data *rdp,
>>> bool *isidle, unsigned long *maxj)
>>> {
>>> rdp->dynticks_snap = atomic_add_return(0, &rdp->dynticks->dynticks);
>>> + rcu_sysidle_check_cpu(rdp, isidle, maxj);
>>> return (rdp->dynticks_snap & 0x1) == 0;
>>> }
>>>
>>> @@ -1373,11 +1374,17 @@ int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
>>> rsp->n_force_qs++;
>>> if (fqs_state == RCU_SAVE_DYNTICK) {
>>> /* Collect dyntick-idle snapshots. */
>>> + if (is_sysidle_rcu_state(rsp)) {
>>> + isidle = 1;
>>> + maxj = jiffies - ULONG_MAX / 4;
>>> + }
>>> force_qs_rnp(rsp, dyntick_save_progress_counter,
>>> &isidle, &maxj);
>>> + rcu_sysidle_report_gp(rsp, isidle, maxj);
>>> fqs_state = RCU_FORCE_QS;
>>> } else {
>>> /* Handle dyntick-idle and offline CPUs. */
>>> + isidle = 0;
>>> force_qs_rnp(rsp, rcu_implicit_dynticks_qs, &isidle, &maxj);
>>> }
>>> /* Clear flag to prevent immediate re-entry. */
>>> @@ -2103,9 +2110,12 @@ static void force_qs_rnp(struct rcu_state *rsp,
>>> cpu = rnp->grplo;
>>> bit = 1;
>>> for (; cpu <= rnp->grphi; cpu++, bit <<= 1) {
>>> - if ((rnp->qsmask & bit) != 0 &&
>>> - f(per_cpu_ptr(rsp->rda, cpu), isidle, maxj))
>>> - mask |= bit;
>>> + if ((rnp->qsmask & bit) != 0) {
>>> + if ((rnp->qsmaskinit & bit) != 0)
>>> + *isidle = 0;
>>> + if (f(per_cpu_ptr(rsp->rda, cpu), isidle, maxj))
>>> + mask |= bit;
>>> + }
>>> }
>>> if (mask != 0) {
>>>
>>> diff --git a/kernel/rcutree.h b/kernel/rcutree.h
>>> index 9dd8b17..6fd3659 100644
>>> --- a/kernel/rcutree.h
>>> +++ b/kernel/rcutree.h
>>> @@ -555,6 +555,11 @@ static void rcu_kick_nohz_cpu(int cpu);
>>> static bool init_nocb_callback_list(struct rcu_data *rdp);
>>> static void rcu_sysidle_enter(struct rcu_dynticks *rdtp, int irq);
>>> static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq);
>>> +static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
>>> + unsigned long *maxj);
>>> +static bool is_sysidle_rcu_state(struct rcu_state *rsp);
>>> +static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
>>> + unsigned long maxj);
>>> static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp);
>>>
>>> #endif /* #ifndef RCU_TREE_NONCORE */
>>> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
>>> index a7419ce..90c3fba 100644
>>> --- a/kernel/rcutree_plugin.h
>>> +++ b/kernel/rcutree_plugin.h
>>> @@ -28,7 +28,7 @@
>>> #include <linux/gfp.h>
>>> #include <linux/oom.h>
>>> #include <linux/smpboot.h>
>>> -#include <linux/tick.h>
>>> +#include "time/tick-internal.h"
>>>
>>> #define RCU_KTHREAD_PRIO 1
>>>
>>> @@ -2382,12 +2382,12 @@ static void rcu_kick_nohz_cpu(int cpu)
>>> * most active flavor of RCU.
>>> */
>>> #ifdef CONFIG_PREEMPT_RCU
>>> -static struct rcu_state __maybe_unused *rcu_sysidle_state = &rcu_preempt_state;
>>> +static struct rcu_state *rcu_sysidle_state = &rcu_preempt_state;
>>> #else /* #ifdef CONFIG_PREEMPT_RCU */
>>> -static struct rcu_state __maybe_unused *rcu_sysidle_state = &rcu_sched_state;
>>> +static struct rcu_state *rcu_sysidle_state = &rcu_sched_state;
>>> #endif /* #else #ifdef CONFIG_PREEMPT_RCU */
>>>
>>> -static int __maybe_unused full_sysidle_state; /* Current system-idle state. */
>>> +static int full_sysidle_state; /* Current system-idle state. */
>>> #define RCU_SYSIDLE_NOT 0 /* Some CPU is not idle. */
>>> #define RCU_SYSIDLE_SHORT 1 /* All CPUs idle for brief period. */
>>> #define RCU_SYSIDLE_LONG 2 /* All CPUs idle for long enough. */
>>> @@ -2431,6 +2431,38 @@ static void rcu_sysidle_enter(struct rcu_dynticks *rdtp, int irq)
>>> }
>>>
>>> /*
>>> + * Unconditionally force exit from full system-idle state. This is
>>> + * invoked when a normal CPU exits idle, but must be called separately
>>> + * for the timekeeping CPU (tick_do_timer_cpu). The reason for this
>>> + * is that the timekeeping CPU is permitted to take scheduling-clock
>>> + * interrupts while the system is in system-idle state, and of course
>>> + * rcu_sysidle_exit() has no way of distinguishing a scheduling-clock
>>> + * interrupt from any other type of interrupt.
>>> + */
>>> +void rcu_sysidle_force_exit(void)
>>> +{
>>> + int oldstate = ACCESS_ONCE(full_sysidle_state);
>>> + int newoldstate;
>>> +
>>> + /*
>>> + * Each pass through the following loop attempts to exit full
>>> + * system-idle state. If contention proves to be a problem,
>>> + * a trylock-based contention tree could be used here.
>>> + */
>>> + while (oldstate > RCU_SYSIDLE_SHORT) {
>>> + newoldstate = cmpxchg(&full_sysidle_state,
>>> + oldstate, RCU_SYSIDLE_NOT);
>>> + if (oldstate == newoldstate &&
>>> + oldstate == RCU_SYSIDLE_FULL_NOTED) {
>>> + rcu_kick_nohz_cpu(tick_do_timer_cpu);
>>> + return; /* We cleared it, done! */
>>> + }
>>> + oldstate = newoldstate;
>>> + }
>>> + smp_mb(); /* Order initial oldstate fetch vs. later non-idle work. */
>>> +}
>>> +
>>> +/*
>>> * Invoked to note entry to irq or task transition from idle. Note that
>>> * usermode execution does -not- count as idle here! The caller must
>>> * have disabled interrupts.
>>> @@ -2463,6 +2495,235 @@ static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq)
>>> atomic_inc(&rdtp->dynticks_idle);
>>> smp_mb__after_atomic_inc();
>>> WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks_idle) & 0x1));
>>> +
>>> + /*
>>> + * If we are the timekeeping CPU, we are permitted to be non-idle
>>> + * during a system-idle state. This must be the case, because
>>> + * the timekeeping CPU has to take scheduling-clock interrupts
>>> + * during the time that the system is transitioning to full
>>> + * system-idle state. This means that the timekeeping CPU must
>>> + * invoke rcu_sysidle_force_exit() directly if it does anything
>>> + * more than take a scheduling-clock interrupt.
>>> + */
>>> + if (smp_processor_id() == tick_do_timer_cpu)
>>> + return;
>>> +
>>> + /* Update system-idle state: We are clearly no longer fully idle! */
>>> + rcu_sysidle_force_exit();
>>> +}
>>> +
>>> +/*
>>> + * Check to see if the current CPU is idle. Note that usermode execution
>>> + * does not count as idle. The caller must have disabled interrupts.
>>> + */
>>> +static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
>>> + unsigned long *maxj)
>>> +{
>>> + int cur;
>>> + unsigned long j;
>>> + struct rcu_dynticks *rdtp = rdp->dynticks;
>>> +
>>> + /*
>>> + * If some other CPU has already reported non-idle, if this is
>>> + * not the flavor of RCU that tracks sysidle state, or if this
>>> + * is an offline or the timekeeping CPU, nothing to do.
>>> + */
>>> + if (!*isidle || rdp->rsp != rcu_sysidle_state ||
>>> + cpu_is_offline(rdp->cpu) || rdp->cpu == tick_do_timer_cpu)
>>> + return;
>>> + /* WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu); */
>>> +
>>> + /* Pick up current idle and NMI-nesting counter and check. */
>>> + cur = atomic_read(&rdtp->dynticks_idle);
>>> + if (cur & 0x1) {
>>> + *isidle = false; /* We are not idle! */
>>> + return;
>>> + }
>>> + smp_mb(); /* Read counters before timestamps. */
>>> +
>>> + /* Pick up timestamps. */
>>> + j = ACCESS_ONCE(rdtp->dynticks_idle_jiffies);
>>> + /* If this CPU entered idle more recently, update maxj timestamp. */
>>> + if (ULONG_CMP_LT(*maxj, j))
>>> + *maxj = j;
>>> +}
>>> +
>>> +/*
>>> + * Is this the flavor of RCU that is handling full-system idle?
>>> + */
>>> +static bool is_sysidle_rcu_state(struct rcu_state *rsp)
>>> +{
>>> + return rsp == rcu_sysidle_state;
>>> +}
>>> +
>>> +/*
>>> + * Return a delay in jiffies based on the number of CPUs, rcu_node
>>> + * leaf fanout, and jiffies tick rate. The idea is to allow larger
>>> + * systems more time to transition to full-idle state in order to
>>> + * avoid the cache thrashing that otherwise occur on the state variable.
>>> + * Really small systems (less than a couple of tens of CPUs) should
>>> + * instead use a single global atomically incremented counter, and later
>>> + * versions of this will automatically reconfigure themselves accordingly.
>>> + */
>>> +static unsigned long rcu_sysidle_delay(void)
>>> +{
>>> + if (nr_cpu_ids <= CONFIG_NO_HZ_FULL_SYSIDLE_SMALL)
>>> + return 0;
>>> + return DIV_ROUND_UP(nr_cpu_ids * HZ, rcu_fanout_leaf * 1000);
>>> +}
>>> +
>>> +/*
>>> + * Advance the full-system-idle state. This is invoked when all of
>>> + * the non-timekeeping CPUs are idle.
>>> + */
>>> +static void rcu_sysidle(unsigned long j)
>>> +{
>>> + /* Check the current state. */
>>> + switch (ACCESS_ONCE(full_sysidle_state)) {
>>> + case RCU_SYSIDLE_NOT:
>>> +
>>> + /* First time all are idle, so note a short idle period. */
>>> + ACCESS_ONCE(full_sysidle_state) = RCU_SYSIDLE_SHORT;
>>> + break;
>>> +
>>> + case RCU_SYSIDLE_SHORT:
>>> +
>>> + /*
>>> + * Idle for a bit, time to advance to next state?
>>> + * cmpxchg failure means race with non-idle, let them win.
>>> + */
>>> + if (ULONG_CMP_GE(jiffies, j + rcu_sysidle_delay()))
>>> + (void)cmpxchg(&full_sysidle_state,
>>> + RCU_SYSIDLE_SHORT, RCU_SYSIDLE_LONG);
>>> + break;
>>> +
>>> + case RCU_SYSIDLE_LONG:
>>> +
>>> + /*
>>> + * Do an additional check pass before advancing to full.
>>> + * cmpxchg failure means race with non-idle, let them win.
>>> + */
>>> + if (ULONG_CMP_GE(jiffies, j + rcu_sysidle_delay()))
>>> + (void)cmpxchg(&full_sysidle_state,
>>> + RCU_SYSIDLE_LONG, RCU_SYSIDLE_FULL);
>>> + break;
>>> +
>>> + default:
>>> + break;
>>> + }
>>> +}
>>> +
>>> +/*
>>> + * Found a non-idle non-timekeeping CPU, so kick the system-idle state
>>> + * back to the beginning.
>>> + */
>>> +static void rcu_sysidle_cancel(void)
>>> +{
>>> + smp_mb();
>>> + ACCESS_ONCE(full_sysidle_state) = RCU_SYSIDLE_NOT;
>>> +}
>>> +
>>> +/*
>>> + * Update the sysidle state based on the results of a force-quiescent-state
>>> + * scan of the CPUs' dyntick-idle state.
>>> + */
>>> +static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
>>> + unsigned long maxj, bool gpkt)
>>> +{
>>> + if (rsp != rcu_sysidle_state)
>>> + return; /* Wrong flavor, ignore. */
>>> + if (isidle) {
>>> + if (gpkt && nr_cpu_ids > CONFIG_NO_HZ_FULL_SYSIDLE_SMALL)
>>> + rcu_sysidle(maxj); /* More idle! */
>>> + } else {
>>> + rcu_sysidle_cancel(); /* Idle is over. */
>>> + }
>>
>> "gpkt" is always equal to "nr_cpu_ids > RCU_SYSIDLE_SMALL",


Sorry, "(gpkt && nr_cpu_ids > CONFIG_NO_HZ_FULL_SYSIDLE_SMALL)"
is always equal to "nr_cpu_ids > CONFIG_NO_HZ_FULL_SYSIDLE_SMALL".

There are two call sites of rcu_sysidle_report().
If rcu_sysidle_report() is called from rcu_sysidle_report_gp(),
gpkt is true, so my proposition is correct.

If rcu_sysidle_report() is called from rcu_sys_is_idle(),
gpkt is false, but this time nr_cpu_ids <= CONFIG_NO_HZ_FULL_SYSIDLE_SMALL,
so my proposition is still correct.

So, since "(gpkt && nr_cpu_ids > CONFIG_NO_HZ_FULL_SYSIDLE_SMALL)"
is always equal to "nr_cpu_ids > CONFIG_NO_HZ_FULL_SYSIDLE_SMALL",
we can just remove the "gpkt" argument and rcu_sysidle_report_gp().
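
In tabular form, the two call sites and their arguments:

	caller                     gpkt    reached when
	-------------------------  ------  -------------------------------------------
	rcu_sysidle_report_gp()    true    GP-kthread fqs scan, any nr_cpu_ids
	rcu_sys_is_idle()          false   only nr_cpu_ids <= CONFIG_NO_HZ_FULL_SYSIDLE_SMALL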

>
> Almost. When rcu_sysidle_report() is called from the grace-period
> kthread, gpkt is true. So when nr_cpu_ids > RCU_SYSIDLE_SMALL, gpkt
> will always be true, but when nr_cpu_ids <= RCU_SYSIDLE_SMALL, gpkt can
> be either true or false.
>
>> so we can remove the "gpkt" argument and rcu_sysidle_report_gp().
>
> If we do that, when nr_cpu_ids <= RCU_SYSIDLE_SMALL there can be multiple
> tasks manipulating the state machine concurrently. So let's not. ;-)
>
> That said, the (gpkt && nr_cpu_ids > CONFIG_NO_HZ_FULL_SYSIDLE_SMALL)
> check should be outside of the (isidle) check in order to avoid having
> multiple tasks manipulate the rcu_sysidle_state variable, fixed.
>
>>> +}
>>> +
>>> +static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
>>> + unsigned long maxj)
>>> +{
>>> + rcu_sysidle_report(rsp, isidle, maxj, true);
>>> +}
>>> +
>>> +/* Callback and function for forcing an RCU grace period. */
>>> +struct rcu_sysidle_head {
>>> + struct rcu_head rh;
>>> + int inuse;
>>> +};
>>> +
>>> +static void rcu_sysidle_cb(struct rcu_head *rhp)
>>> +{
>>> + struct rcu_sysidle_head *rshp;
>>> +
>>> + smp_mb(); /* grace period precedes setting inuse. */
>>
>> Why do we need this mb()?
>
> I put it there because we are not using a memory allocator, and thus
> cannot rely on the memory barriers that would be executed as part of
> the memory allocator should this callback get migrated to some other CPU.
>
> Can you prove that I don't need it?
>
>>> + rshp = container_of(rhp, struct rcu_sysidle_head, rh);
>>> + ACCESS_ONCE(rshp->inuse) = 0;
>>> +}
>>> +
>>> +/*
>>> + * Check to see if the system is fully idle, other than the timekeeping CPU.
>>> + * The caller must have disabled interrupts.
>>> + */
>>> +bool rcu_sys_is_idle(void)
>>> +{
>>> + static struct rcu_sysidle_head rsh;
>>> + int rss = ACCESS_ONCE(full_sysidle_state);
>>> +
>>> + if (WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu))
>>> + return false;
>>> +
>>> + /* Handle small-system case by doing a full scan of CPUs. */
>>> + if (nr_cpu_ids <= CONFIG_NO_HZ_FULL_SYSIDLE_SMALL) {
>>> + int oldrss = rss - 1;
>>> +
>>> + /*
>>> + * One pass to advance to each state up to _FULL.
>>> + * Give up if any pass fails to advance the state.
>>> + */
>>> + while (rss < RCU_SYSIDLE_FULL && oldrss < rss) {
>>> + int cpu;
>>> + bool isidle = true;
>>> + unsigned long maxj = jiffies - ULONG_MAX / 4;
>>> + struct rcu_data *rdp;
>>> +
>>> + /* Scan all the CPUs looking for nonidle CPUs. */
>>> + for_each_possible_cpu(cpu) {
>>> + rdp = per_cpu_ptr(rcu_sysidle_state->rda, cpu);
>>> + rcu_sysidle_check_cpu(rdp, &isidle, &maxj);
>>> + if (!isidle)
>>> + break;
>>> + }
>>> + rcu_sysidle_report(rcu_sysidle_state,
>>> + isidle, maxj, false);
>>> + oldrss = rss;
>>> + rss = ACCESS_ONCE(full_sysidle_state);
>>> + }
>>> + }
>>
>> I don't think it is a good idea to move the overhead to fqs when nr_cpu_ids > 8:
>> the total overhead will not be reduced, and it may even increase.
>>
>> I think we can instead do the calculation on the timekeeping CPU when
>> nr_cpu_ids > 8 && the timekeeping CPU is idle.
>
> If we do that, then the timekeeping CPU could scan most of the CPUs
> multiple times when it transitions to idle. This would probably not make
> the people running systems with hundreds of CPUs very happy.

When nr_cpu_ids > 8, this patch puts the scan in the grace-period kthread,
which also runs on the timekeeping CPU, so the scan already keeps that CPU
from transitioning to idle. And if a task is woken up on the timekeeping
CPU, the scan can't quit immediately.

If we instead put the scan (but interruptible) in the idle path, the scan
could quit immediately by testing the NEED_RESCHED bit while scanning.
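
A rough sketch of what I mean (hypothetical code, reusing the helpers
from this patch; rcu_sysidle_scan_interruptible() is a made-up name):

static bool rcu_sysidle_scan_interruptible(void)
{
	int cpu;
	bool isidle = true;
	unsigned long maxj = jiffies - ULONG_MAX / 4;
	struct rcu_data *rdp;

	/* Scan all CPUs, but yield as soon as this CPU has work to do. */
	for_each_possible_cpu(cpu) {
		if (need_resched())
			return false;	/* Timekeeping CPU is needed; quit now. */
		rdp = per_cpu_ptr(rcu_sysidle_state->rda, cpu);
		rcu_sysidle_check_cpu(rdp, &isidle, &maxj);
		if (!isidle)
			return false;	/* Found a non-idle CPU. */
	}
	return true;	/* All non-timekeeping CPUs were idle. */
}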

>
> Now it might well be that CONFIG_NO_HZ_FULL_SYSIDLE_SMALL needs to be
> larger, but that is one reason why it is a Kconfig variable.
>
>>> +
>>> + /* If this is the first observation of an idle period, record it. */
>>> + if (rss == RCU_SYSIDLE_FULL) {
>>> + rss = cmpxchg(&full_sysidle_state,
>>> + RCU_SYSIDLE_FULL, RCU_SYSIDLE_FULL_NOTED);
>>> + return rss == RCU_SYSIDLE_FULL;
>>> + }
>>> +
>>> + smp_mb(); /* ensure rss load happens before later caller actions. */
>>> +
>>> + /* If already fully idle, tell the caller (in case of races). */
>>> + if (rss == RCU_SYSIDLE_FULL_NOTED)
>>> + return true;
>>> +
>>> + /*
>>> + * If we aren't there yet, and a grace period is not in flight,
>>> + * initiate a grace period. Either way, tell the caller that
>>> + * we are not there yet.
>>> + */
>>> + if (nr_cpu_ids > CONFIG_NO_HZ_FULL_SYSIDLE_SMALL &&
>>> + !rcu_gp_in_progress(rcu_sysidle_state) &&
>>> + !rsh.inuse && xchg(&rsh.inuse, 1) == 0)
>>> + call_rcu(&rsh.rh, rcu_sysidle_cb);
>>
>> Why do we need to use xchg()? Who will it race with?
>
> The xchg() is taking the place of the memory barriers that would otherwise
> be present in the memory allocator -- the RCU callback would free itself,
> and we would allocate it here. This memory barrier pairs with the one
> you asked about in rcu_sysidle_cb().

But xchg() will not fail here. Would "smp_mb(); rsh.inuse = 1;" be enough here?
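
That is, something like the following sketch of the alternative
(assuming, as argued above, that only the timekeeping CPU ever sets
rsh.inuse, so the xchg() cannot fail once !rsh.inuse has been observed):

	if (nr_cpu_ids > CONFIG_NO_HZ_FULL_SYSIDLE_SMALL &&
	    !rcu_gp_in_progress(rcu_sysidle_state) && !rsh.inuse) {
		smp_mb();	/* Pairs with the smp_mb() in rcu_sysidle_cb(). */
		rsh.inuse = 1;
		call_rcu(&rsh.rh, rcu_sysidle_cb);
	}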

>
> So the intent is that if this code sees rsh.inuse==0, then the call_rcu()
> won't have to worry about rcu_do_batch() messing with the callback.
>
> For example, suppose that we some day make rcu_do_batch NULL out each
> callback's ->next pointer for debugging purposes. We would then need
> both the memory barrier in rcu_sysidle_cb() and the memory barriers in
> the xchg() to ensure that call_rcu()'s assignment to ->next happened
> after rcu_do_batch()'s NULLing of the ->next pointer.
>
> Thanx, Paul
>
>> Thanks,
>> Lai
>>
>>
>>> + return false;
>>> }
>>>
>>> /*
>>> @@ -2483,6 +2744,21 @@ static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq)
>>> {
>>> }
>>>
>>> +static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
>>> + unsigned long *maxj)
>>> +{
>>> +}
>>> +
>>> +static bool is_sysidle_rcu_state(struct rcu_state *rsp)
>>> +{
>>> + return false;
>>> +}
>>> +
>>> +static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
>>> + unsigned long maxj)
>>> +{
>>> +}
>>> +
>>> static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp)
>>> {
>>> }
>>> diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
>>> index c7d2fd6..3381f09 100644
>>> --- a/kernel/time/Kconfig
>>> +++ b/kernel/time/Kconfig
>>> @@ -157,6 +157,33 @@ config NO_HZ_FULL_SYSIDLE
>>>
>>> Say N if you are unsure.
>>>
>>> +config NO_HZ_FULL_SYSIDLE_SMALL
>>> + int "Number of CPUs above which large-system approach is used"
>>> + depends on NO_HZ_FULL_SYSIDLE
>>> + range 1 NR_CPUS
>>> + default 8
>>> + help
>>> + The full-system idle detection mechanism takes a lazy approach
>>> + on large systems, as is required to attain decent scalability.
>>> + However, on smaller systems, scalability is not anywhere near as
>>> + large a concern as is energy efficiency. The sysidle subsystem
>>> + therefore uses a fast but non-scalable algorithm for small
>>> + systems and a lazier but scalable algorithm for large systems.
>>> + This Kconfig parameter defines the number of CPUs in the largest
>>> + system that will be considered to be "small".
>>> +
>>> + The default value will be fine in most cases. Battery-powered
>>> + systems that (1) enable NO_HZ_FULL_SYSIDLE, (2) have larger
>>> + numbers of CPUs, and (3) are suffering from battery-lifetime
>>> + problems due to long sysidle latencies might wish to experiment
>>> + with larger values for this Kconfig parameter. On the other
>>> + hand, they might be even better served by disabling NO_HZ_FULL
>>> + entirely, given that NO_HZ_FULL is intended for HPC and
>>> + real-time workloads that at present do not tend to be run on
>>> + battery-powered systems.
>>> +
>>> + Take the default if you are unsure.
>>> +
>>> config NO_HZ
>>> bool "Old Idle dynticks config"
>>> depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS
>>
>
>

2013-08-31 21:20:36

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 8/9] nohz_full: Add full-system-idle state machine

On Tue, Aug 27, 2013 at 11:41:44AM +0800, Lai Jiangshan wrote:
> On 08/27/2013 12:24 AM, Paul E. McKenney wrote:
> > On Mon, Aug 26, 2013 at 01:45:32PM +0800, Lai Jiangshan wrote:
> >> On 08/20/2013 10:47 AM, Paul E. McKenney wrote:
> >>> From: "Paul E. McKenney" <[email protected]>
> >>>
> >>> This commit adds the state machine that takes the per-CPU idle data
> >>> as input and produces a full-system-idle indication as output. This
> >>> state machine is driven out of RCU's quiescent-state-forcing
> >>> mechanism, which invokes rcu_sysidle_check_cpu() to collect per-CPU
> >>> idle state and then rcu_sysidle_report() to drive the state machine.
> >>>
> >>> The full-system-idle state is sampled using rcu_sys_is_idle(), which
> >>> also drives the state machine if RCU is idle (and does so by forcing
> >>> RCU to become non-idle). This function returns true if all but the
> >>> timekeeping CPU (tick_do_timer_cpu) are idle and have been idle long
> >>> enough to avoid memory contention on the full_sysidle_state state
> >>> variable. The rcu_sysidle_force_exit() may be called externally
> >>> to reset the state machine back into non-idle state.
> >>>
> >>> For large systems the state machine is driven out of RCU's
> >>> force-quiescent-state logic, which provides good scalability at the price
> >>> of millisecond-scale latencies on the transition to full-system-idle
> >>> state. This is not so good for battery-powered systems, which are usually
> >>> small enough that they don't need to care about scalability, but which
> >>> do care deeply about energy efficiency. Small systems therefore drive
> >>> the state machine directly out of the idle-entry code. The number of
> >>> CPUs in a "small" system is defined by a new NO_HZ_FULL_SYSIDLE_SMALL
> >>> Kconfig parameter, which defaults to 8. Note that this is a build-time
> >>> definition.
> >>>
> >>> Signed-off-by: Paul E. McKenney <[email protected]>
> >>> Cc: Frederic Weisbecker <[email protected]>
> >>> Cc: Steven Rostedt <[email protected]>
> >>> Cc: Lai Jiangshan <[email protected]>
> >>> [ paulmck: Use true and false for boolean constants per Lai Jiangshan. ]
> >>> Reviewed-by: Josh Triplett <[email protected]>
> >>> ---
> >>> include/linux/rcupdate.h | 18 +++
> >>> kernel/rcutree.c | 16 ++-
> >>> kernel/rcutree.h | 5 +
> >>> kernel/rcutree_plugin.h | 284 ++++++++++++++++++++++++++++++++++++++++++++++-
> >>> kernel/time/Kconfig | 27 +++++
> >>> 5 files changed, 343 insertions(+), 7 deletions(-)
> >>>
> >>> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> >>> index 30bea9c..f1f1bc3 100644
> >>> --- a/include/linux/rcupdate.h
> >>> +++ b/include/linux/rcupdate.h
> >>> @@ -1011,4 +1011,22 @@ static inline bool rcu_is_nocb_cpu(int cpu) { return false; }
> >>> #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
> >>>
> >>>
> >>> +/* Only for use by adaptive-ticks code. */
> >>> +#ifdef CONFIG_NO_HZ_FULL_SYSIDLE
> >>> +extern bool rcu_sys_is_idle(void);
> >>> +extern void rcu_sysidle_force_exit(void);
> >>> +#else /* #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
> >>> +
> >>> +static inline bool rcu_sys_is_idle(void)
> >>> +{
> >>> + return false;
> >>> +}
> >>> +
> >>> +static inline void rcu_sysidle_force_exit(void)
> >>> +{
> >>> +}
> >>> +
> >>> +#endif /* #else #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
> >>> +
> >>> +
> >>> #endif /* __LINUX_RCUPDATE_H */
> >>> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> >>> index 7b5be56..eca70f44 100644
> >>> --- a/kernel/rcutree.c
> >>> +++ b/kernel/rcutree.c
> >>> @@ -734,6 +734,7 @@ static int dyntick_save_progress_counter(struct rcu_data *rdp,
> >>> bool *isidle, unsigned long *maxj)
> >>> {
> >>> rdp->dynticks_snap = atomic_add_return(0, &rdp->dynticks->dynticks);
> >>> + rcu_sysidle_check_cpu(rdp, isidle, maxj);
> >>> return (rdp->dynticks_snap & 0x1) == 0;
> >>> }
> >>>
> >>> @@ -1373,11 +1374,17 @@ int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
> >>> rsp->n_force_qs++;
> >>> if (fqs_state == RCU_SAVE_DYNTICK) {
> >>> /* Collect dyntick-idle snapshots. */
> >>> + if (is_sysidle_rcu_state(rsp)) {
> >>> + isidle = 1;
> >>> + maxj = jiffies - ULONG_MAX / 4;
> >>> + }
> >>> force_qs_rnp(rsp, dyntick_save_progress_counter,
> >>> &isidle, &maxj);
> >>> + rcu_sysidle_report_gp(rsp, isidle, maxj);
> >>> fqs_state = RCU_FORCE_QS;
> >>> } else {
> >>> /* Handle dyntick-idle and offline CPUs. */
> >>> + isidle = 0;
> >>> force_qs_rnp(rsp, rcu_implicit_dynticks_qs, &isidle, &maxj);
> >>> }
> >>> /* Clear flag to prevent immediate re-entry. */
> >>> @@ -2103,9 +2110,12 @@ static void force_qs_rnp(struct rcu_state *rsp,
> >>> cpu = rnp->grplo;
> >>> bit = 1;
> >>> for (; cpu <= rnp->grphi; cpu++, bit <<= 1) {
> >>> - if ((rnp->qsmask & bit) != 0 &&
> >>> - f(per_cpu_ptr(rsp->rda, cpu), isidle, maxj))
> >>> - mask |= bit;
> >>> + if ((rnp->qsmask & bit) != 0) {
> >>> + if ((rnp->qsmaskinit & bit) != 0)
> >>> + *isidle = 0;
> >>> + if (f(per_cpu_ptr(rsp->rda, cpu), isidle, maxj))
> >>> + mask |= bit;
> >>> + }
> >>> }
> >>> if (mask != 0) {
> >>>
> >>> diff --git a/kernel/rcutree.h b/kernel/rcutree.h
> >>> index 9dd8b17..6fd3659 100644
> >>> --- a/kernel/rcutree.h
> >>> +++ b/kernel/rcutree.h
> >>> @@ -555,6 +555,11 @@ static void rcu_kick_nohz_cpu(int cpu);
> >>> static bool init_nocb_callback_list(struct rcu_data *rdp);
> >>> static void rcu_sysidle_enter(struct rcu_dynticks *rdtp, int irq);
> >>> static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq);
> >>> +static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
> >>> + unsigned long *maxj);
> >>> +static bool is_sysidle_rcu_state(struct rcu_state *rsp);
> >>> +static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
> >>> + unsigned long maxj);
> >>> static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp);
> >>>
> >>> #endif /* #ifndef RCU_TREE_NONCORE */
> >>> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> >>> index a7419ce..90c3fba 100644
> >>> --- a/kernel/rcutree_plugin.h
> >>> +++ b/kernel/rcutree_plugin.h
> >>> @@ -28,7 +28,7 @@
> >>> #include <linux/gfp.h>
> >>> #include <linux/oom.h>
> >>> #include <linux/smpboot.h>
> >>> -#include <linux/tick.h>
> >>> +#include "time/tick-internal.h"
> >>>
> >>> #define RCU_KTHREAD_PRIO 1
> >>>
> >>> @@ -2382,12 +2382,12 @@ static void rcu_kick_nohz_cpu(int cpu)
> >>> * most active flavor of RCU.
> >>> */
> >>> #ifdef CONFIG_PREEMPT_RCU
> >>> -static struct rcu_state __maybe_unused *rcu_sysidle_state = &rcu_preempt_state;
> >>> +static struct rcu_state *rcu_sysidle_state = &rcu_preempt_state;
> >>> #else /* #ifdef CONFIG_PREEMPT_RCU */
> >>> -static struct rcu_state __maybe_unused *rcu_sysidle_state = &rcu_sched_state;
> >>> +static struct rcu_state *rcu_sysidle_state = &rcu_sched_state;
> >>> #endif /* #else #ifdef CONFIG_PREEMPT_RCU */
> >>>
> >>> -static int __maybe_unused full_sysidle_state; /* Current system-idle state. */
> >>> +static int full_sysidle_state; /* Current system-idle state. */
> >>> #define RCU_SYSIDLE_NOT 0 /* Some CPU is not idle. */
> >>> #define RCU_SYSIDLE_SHORT 1 /* All CPUs idle for brief period. */
> >>> #define RCU_SYSIDLE_LONG 2 /* All CPUs idle for long enough. */
> >>> @@ -2431,6 +2431,38 @@ static void rcu_sysidle_enter(struct rcu_dynticks *rdtp, int irq)
> >>> }
> >>>
> >>> /*
> >>> + * Unconditionally force exit from full system-idle state. This is
> >>> + * invoked when a normal CPU exits idle, but must be called separately
> >>> + * for the timekeeping CPU (tick_do_timer_cpu). The reason for this
> >>> + * is that the timekeeping CPU is permitted to take scheduling-clock
> >>> + * interrupts while the system is in system-idle state, and of course
> >>> + * rcu_sysidle_exit() has no way of distinguishing a scheduling-clock
> >>> + * interrupt from any other type of interrupt.
> >>> + */
> >>> +void rcu_sysidle_force_exit(void)
> >>> +{
> >>> + int oldstate = ACCESS_ONCE(full_sysidle_state);
> >>> + int newoldstate;
> >>> +
> >>> + /*
> >>> + * Each pass through the following loop attempts to exit full
> >>> + * system-idle state. If contention proves to be a problem,
> >>> + * a trylock-based contention tree could be used here.
> >>> + */
> >>> + while (oldstate > RCU_SYSIDLE_SHORT) {
> >>> + newoldstate = cmpxchg(&full_sysidle_state,
> >>> + oldstate, RCU_SYSIDLE_NOT);
> >>> + if (oldstate == newoldstate &&
> >>> + oldstate == RCU_SYSIDLE_FULL_NOTED) {
> >>> + rcu_kick_nohz_cpu(tick_do_timer_cpu);
> >>> + return; /* We cleared it, done! */
> >>> + }
> >>> + oldstate = newoldstate;
> >>> + }
> >>> + smp_mb(); /* Order initial oldstate fetch vs. later non-idle work. */
> >>> +}
> >>> +
> >>> +/*
> >>> * Invoked to note entry to irq or task transition from idle. Note that
> >>> * usermode execution does -not- count as idle here! The caller must
> >>> * have disabled interrupts.
> >>> @@ -2463,6 +2495,235 @@ static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq)
> >>> atomic_inc(&rdtp->dynticks_idle);
> >>> smp_mb__after_atomic_inc();
> >>> WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks_idle) & 0x1));
> >>> +
> >>> + /*
> >>> + * If we are the timekeeping CPU, we are permitted to be non-idle
> >>> + * during a system-idle state. This must be the case, because
> >>> + * the timekeeping CPU has to take scheduling-clock interrupts
> >>> + * during the time that the system is transitioning to full
> >>> + * system-idle state. This means that the timekeeping CPU must
> >>> + * invoke rcu_sysidle_force_exit() directly if it does anything
> >>> + * more than take a scheduling-clock interrupt.
> >>> + */
> >>> + if (smp_processor_id() == tick_do_timer_cpu)
> >>> + return;
> >>> +
> >>> + /* Update system-idle state: We are clearly no longer fully idle! */
> >>> + rcu_sysidle_force_exit();
> >>> +}
> >>> +
> >>> +/*
> >>> + * Check to see if the current CPU is idle. Note that usermode execution
> >>> + * does not count as idle. The caller must have disabled interrupts.
> >>> + */
> >>> +static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
> >>> + unsigned long *maxj)
> >>> +{
> >>> + int cur;
> >>> + unsigned long j;
> >>> + struct rcu_dynticks *rdtp = rdp->dynticks;
> >>> +
> >>> + /*
> >>> + * If some other CPU has already reported non-idle, if this is
> >>> + * not the flavor of RCU that tracks sysidle state, or if this
> >>> + * is an offline or the timekeeping CPU, nothing to do.
> >>> + */
> >>> + if (!*isidle || rdp->rsp != rcu_sysidle_state ||
> >>> + cpu_is_offline(rdp->cpu) || rdp->cpu == tick_do_timer_cpu)
> >>> + return;
> >>> + /* WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu); */
> >>> +
> >>> + /* Pick up current idle and NMI-nesting counter and check. */
> >>> + cur = atomic_read(&rdtp->dynticks_idle);
> >>> + if (cur & 0x1) {
> >>> + *isidle = false; /* We are not idle! */
> >>> + return;
> >>> + }
> >>> + smp_mb(); /* Read counters before timestamps. */
> >>> +
> >>> + /* Pick up timestamps. */
> >>> + j = ACCESS_ONCE(rdtp->dynticks_idle_jiffies);
> >>> + /* If this CPU entered idle more recently, update maxj timestamp. */
> >>> + if (ULONG_CMP_LT(*maxj, j))
> >>> + *maxj = j;
> >>> +}
> >>> +
> >>> +/*
> >>> + * Is this the flavor of RCU that is handling full-system idle?
> >>> + */
> >>> +static bool is_sysidle_rcu_state(struct rcu_state *rsp)
> >>> +{
> >>> + return rsp == rcu_sysidle_state;
> >>> +}
> >>> +
> >>> +/*
> >>> + * Return a delay in jiffies based on the number of CPUs, rcu_node
> >>> + * leaf fanout, and jiffies tick rate. The idea is to allow larger
> >>> + * systems more time to transition to full-idle state in order to
> >>> + * avoid the cache thrashing that would otherwise occur on the state variable.
> >>> + * Really small systems (less than a couple of tens of CPUs) should
> >>> + * instead use a single global atomically incremented counter, and later
> >>> + * versions of this will automatically reconfigure themselves accordingly.
> >>> + */
> >>> +static unsigned long rcu_sysidle_delay(void)
> >>> +{
> >>> + if (nr_cpu_ids <= CONFIG_NO_HZ_FULL_SYSIDLE_SMALL)
> >>> + return 0;
> >>> + return DIV_ROUND_UP(nr_cpu_ids * HZ, rcu_fanout_leaf * 1000);
> >>> +}
> >>> +
> >>> +/*
> >>> + * Advance the full-system-idle state. This is invoked when all of
> >>> + * the non-timekeeping CPUs are idle.
> >>> + */
> >>> +static void rcu_sysidle(unsigned long j)
> >>> +{
> >>> + /* Check the current state. */
> >>> + switch (ACCESS_ONCE(full_sysidle_state)) {
> >>> + case RCU_SYSIDLE_NOT:
> >>> +
> >>> + /* First time all are idle, so note a short idle period. */
> >>> + ACCESS_ONCE(full_sysidle_state) = RCU_SYSIDLE_SHORT;
> >>> + break;
> >>> +
> >>> + case RCU_SYSIDLE_SHORT:
> >>> +
> >>> + /*
> >>> + * Idle for a bit, time to advance to next state?
> >>> + * cmpxchg failure means race with non-idle, let them win.
> >>> + */
> >>> + if (ULONG_CMP_GE(jiffies, j + rcu_sysidle_delay()))
> >>> + (void)cmpxchg(&full_sysidle_state,
> >>> + RCU_SYSIDLE_SHORT, RCU_SYSIDLE_LONG);
> >>> + break;
> >>> +
> >>> + case RCU_SYSIDLE_LONG:
> >>> +
> >>> + /*
> >>> + * Do an additional check pass before advancing to full.
> >>> + * cmpxchg failure means race with non-idle, let them win.
> >>> + */
> >>> + if (ULONG_CMP_GE(jiffies, j + rcu_sysidle_delay()))
> >>> + (void)cmpxchg(&full_sysidle_state,
> >>> + RCU_SYSIDLE_LONG, RCU_SYSIDLE_FULL);
> >>> + break;
> >>> +
> >>> + default:
> >>> + break;
> >>> + }
> >>> +}
> >>> +
> >>> +/*
> >>> + * Found a non-idle non-timekeeping CPU, so kick the system-idle state
> >>> + * back to the beginning.
> >>> + */
> >>> +static void rcu_sysidle_cancel(void)
> >>> +{
> >>> + smp_mb();
> >>> + ACCESS_ONCE(full_sysidle_state) = RCU_SYSIDLE_NOT;
> >>> +}
> >>> +
> >>> +/*
> >>> + * Update the sysidle state based on the results of a force-quiescent-state
> >>> + * scan of the CPUs' dyntick-idle state.
> >>> + */
> >>> +static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
> >>> + unsigned long maxj, bool gpkt)
> >>> +{
> >>> + if (rsp != rcu_sysidle_state)
> >>> + return; /* Wrong flavor, ignore. */
> >>> + if (isidle) {
> >>> + if (gpkt && nr_cpu_ids > CONFIG_NO_HZ_FULL_SYSIDLE_SMALL)
> >>> + rcu_sysidle(maxj); /* More idle! */
> >>> + } else {
> >>> + rcu_sysidle_cancel(); /* Idle is over. */
> >>> + }
> >>
> >> "gpkt" is always equal to "nr_cpu_ids > RCU_SYSIDLE_SMALL",
>
>
> Sorry, "(gpkt && nr_cpu_ids > CONFIG_NO_HZ_FULL_SYSIDLE_SMALL)"
> is always equal to "nr_cpu_ids > CONFIG_NO_HZ_FULL_SYSIDLE_SMALL".
>
> There are two call sites of rcu_sysidle_report().
> If rcu_sysidle_report() is called from rcu_sysidle_report_gp(),
> gpkt is true, so my proposition is correct.
>
> If rcu_sysidle_report() is called from rcu_sys_is_idle(),
> gpkt is false, but this time nr_cpu_ids <= CONFIG_NO_HZ_FULL_SYSIDLE_SMALL,
> so my proposition is still correct.
>
> So since "(gpkt && nr_cpu_ids > CONFIG_NO_HZ_FULL_SYSIDLE_SMALL)"
> is always equal to "nr_cpu_ids > CONFIG_NO_HZ_FULL_SYSIDLE_SMALL",
> we can just remove the "gpkt" argument and rcu_sysidle_report_gp().

I really don't see how to remove the gpkt argument without having two
different tasks trying to advance the state machine concurrently in
the case where nr_cpu_ids <= CONFIG_NO_HZ_FULL_SYSIDLE_SMALL, which
would be very bad. (It is OK to have one task attempting to advance
the state machine while another pushes it back to the beginning, but
that is another story.)
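
To make the "let them win" property concrete, here is a minimal user-space
sketch of the cmpxchg-driven advance; this is an illustration only, with
GCC's __sync_val_compare_and_swap standing in for the kernel's cmpxchg()
and everything else simplified:

#include <stdio.h>

enum { SYSIDLE_NOT, SYSIDLE_SHORT, SYSIDLE_LONG, SYSIDLE_FULL };

static int state = SYSIDLE_SHORT;

static void advance(void)
{
        /* A failed cmpxchg means someone else changed the state,
         * typically a non-idle report, and we simply let them win. */
        (void)__sync_val_compare_and_swap(&state, SYSIDLE_SHORT,
                                          SYSIDLE_LONG);
}

int main(void)
{
        advance();
        printf("after advance: %d\n", state);       /* 2 == SYSIDLE_LONG */
        state = SYSIDLE_NOT;  /* a non-idle CPU reset the machine */
        advance();            /* the cmpxchg fails, state stays put */
        printf("after racing reset: %d\n", state);  /* 0 == SYSIDLE_NOT */
        return 0;
}

Only one task can win any given transition, which is exactly the property
that would be lost if two tasks were allowed to advance the state machine
concurrently.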

I did rework rcu_sysidle_report() a bit, here is the updated version:

static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
                               unsigned long maxj, bool gpkt)
{
        if (rsp != rcu_sysidle_state)
                return;  /* Wrong flavor, ignore. */
        if (gpkt && nr_cpu_ids <= CONFIG_NO_HZ_FULL_SYSIDLE_SMALL)
                return;  /* Running state machine from timekeeping CPU. */
        if (isidle)
                rcu_sysidle(maxj);  /* More idle! */
        else
                rcu_sysidle_cancel();  /* Idle is over. */
}

The idea is that if nr_cpu_ids <= CONFIG_NO_HZ_FULL_SYSIDLE_SMALL, we
want to ignore calls to this function made in the context of RCU's
grace-period kthread. In this case, the state machine is driven from
the timekeeping CPU.

On the other hand, if nr_cpu_ids > CONFIG_NO_HZ_FULL_SYSIDLE_SMALL,
the timekeeping CPU won't be invoking this function, so we need to
pay attention to the calls made in the context of RCU's grace-period
kthread.
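
In other words, the check selects who gets to drive the state machine.
A rough user-space model of just that gate, with a made-up SYSIDLE_SMALL
in place of the Kconfig symbol:

#include <stdbool.h>
#include <stdio.h>

#define SYSIDLE_SMALL 8  /* stands in for CONFIG_NO_HZ_FULL_SYSIDLE_SMALL */

/* Mirrors the early return at the top of rcu_sysidle_report(). */
static bool report_ignored(int nr_cpu_ids, bool gpkt)
{
        return gpkt && nr_cpu_ids <= SYSIDLE_SMALL;
}

int main(void)
{
        /* Small system: the grace-period kthread's reports are ignored,
         * so only the timekeeping CPU (gpkt == false) drives the machine. */
        printf("small, gp kthread:  %d\n", report_ignored(4, true));   /* 1 */
        printf("small, timekeeping: %d\n", report_ignored(4, false));  /* 0 */
        /* Large system: only the gp kthread calls in, and it is honored. */
        printf("large, gp kthread:  %d\n", report_ignored(64, true));  /* 0 */
        return 0;
}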

> > Almost. When rcu_sysidle_report() is called from the grace-period
> > kthread, gpkt is true. So when nr_cpu_ids > RCU_SYSIDLE_SMALL, gpkt
> > will always be true, but when nr_cpu_ids <= RCU_SYSIDLE_SMALL, gpkt can
> > be either true or false.
> >
> >> so we can remove "gpkt" argument and rcu_sysidle_report_gp(
> >
> > If we do that, when nr_cpu_ids <= RCU_SYSIDLE_SMALL there can be multiple
> > tasks manipulating the state machine concurrently. So let's not. ;-)
> >
> > That said, the (gpkt && nr_cpu_ids > CONFIG_NO_HZ_FULL_SYSIDLE_SMALL)
> > check should be outside of the (isidle) check in order to avoid having
> > multiple tasks manipulate the rcu_sysidle_state variable, fixed.
> >
> >>> +}
> >>> +
> >>> +static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
> >>> + unsigned long maxj)
> >>> +{
> >>> + rcu_sysidle_report(rsp, isidle, maxj, true);
> >>> +}
> >>> +
> >>> +/* Callback and function for forcing an RCU grace period. */
> >>> +struct rcu_sysidle_head {
> >>> + struct rcu_head rh;
> >>> + int inuse;
> >>> +};
> >>> +
> >>> +static void rcu_sysidle_cb(struct rcu_head *rhp)
> >>> +{
> >>> + struct rcu_sysidle_head *rshp;
> >>> +
> >>> + smp_mb(); /* grace period precedes setting inuse. */
> >>
> >> Why we need this mb()?
> >
> > I put it there because we are not using a memory allocator, and thus
> > cannot rely on the memory barriers that would be executed as part of
> > the memory allocator should this callback get migrated to some other CPU.
> >
> > Can you prove that I don't need it?
> >
> >>> + rshp = container_of(rhp, struct rcu_sysidle_head, rh);
> >>> + ACCESS_ONCE(rshp->inuse) = 0;
> >>> +}
> >>> +
> >>> +/*
> >>> + * Check to see if the system is fully idle, other than the timekeeping CPU.
> >>> + * The caller must have disabled interrupts.
> >>> + */
> >>> +bool rcu_sys_is_idle(void)
> >>> +{
> >>> + static struct rcu_sysidle_head rsh;
> >>> + int rss = ACCESS_ONCE(full_sysidle_state);
> >>> +
> >>> + if (WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu))
> >>> + return false;
> >>> +
> >>> + /* Handle small-system case by doing a full scan of CPUs. */
> >>> + if (nr_cpu_ids <= CONFIG_NO_HZ_FULL_SYSIDLE_SMALL) {
> >>> + int oldrss = rss - 1;
> >>> +
> >>> + /*
> >>> + * One pass to advance to each state up to _FULL.
> >>> + * Give up if any pass fails to advance the state.
> >>> + */
> >>> + while (rss < RCU_SYSIDLE_FULL && oldrss < rss) {
> >>> + int cpu;
> >>> + bool isidle = true;
> >>> + unsigned long maxj = jiffies - ULONG_MAX / 4;
> >>> + struct rcu_data *rdp;
> >>> +
> >>> + /* Scan all the CPUs looking for nonidle CPUs. */
> >>> + for_each_possible_cpu(cpu) {
> >>> + rdp = per_cpu_ptr(rcu_sysidle_state->rda, cpu);
> >>> + rcu_sysidle_check_cpu(rdp, &isidle, &maxj);
> >>> + if (!isidle)
> >>> + break;
> >>> + }
> >>> + rcu_sysidle_report(rcu_sysidle_state,
> >>> + isidle, maxj, false);
> >>> + oldrss = rss;
> >>> + rss = ACCESS_ONCE(full_sysidle_state);
> >>> + }
> >>> + }
> >>
> >> I don't think it is a good idea to move the overhead to fqs when nr_cpu_ids > 8
> >> the total overhead will no be reduced, and it maybe more.
> >>
> >> I think we can calculate it on the time-keeping-cpu when nr_cpu_ids > 8 && time-keeping-cpu is idle.
> >
> > If we do that, then the timekeeping CPU could scan most of the CPUs
> > times when it transitions to idle. This would probably not make the
> > people running systems with hundreds of CPUs very happy.
>
> When nr_cpu_ids > 8, this patch puts the scan in the gp kthread, which also
> runs on the timekeeping CPU; it already keeps that CPU from transitioning
> to idle. And if tasks are woken up on the timekeeping CPU, the scan can't
> quit immediately.
>
> If we instead put the scan (but interruptible) in idle, the scan can quit
> immediately by testing the NEED_RESCHED bit while scanning.

But we also have interrupts disabled during the scan, which would result
in unacceptable interrupt latency.
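
For reference, the shape of the small-system scan under discussion is
roughly the following stand-alone sketch (not the kernel code; the per-CPU
idle test is reduced to an array lookup):

#include <stdbool.h>
#include <stdio.h>

#define NCPUS 4

static bool cpu_idle[NCPUS] = { true, true, false, true };

static bool all_nontimekeeping_idle(int tk_cpu)
{
        int cpu;

        for (cpu = 0; cpu < NCPUS; cpu++) {
                if (cpu == tk_cpu)
                        continue;      /* timekeeping CPU may be non-idle */
                if (!cpu_idle[cpu])
                        return false;  /* bail at the first non-idle CPU */
        }
        return true;
}

int main(void)
{
        printf("%d\n", all_nontimekeeping_idle(0));  /* 0: CPU 2 is busy */
        cpu_idle[2] = true;
        printf("%d\n", all_nontimekeeping_idle(0));  /* 1: all others idle */
        return 0;
}

The early exit bounds the common case, but the worst case remains a full
pass over the CPUs, which is where the interrupt-latency concern comes from.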

> > Now it might well be that CONFIG_NO_HZ_FULL_SYSIDLE_SMALL need to be
> > larger, but that is one reason why it is a Kconfig variable.
> >
> >>> +
> >>> + /* If this is the first observation of an idle period, record it. */
> >>> + if (rss == RCU_SYSIDLE_FULL) {
> >>> + rss = cmpxchg(&full_sysidle_state,
> >>> + RCU_SYSIDLE_FULL, RCU_SYSIDLE_FULL_NOTED);
> >>> + return rss == RCU_SYSIDLE_FULL;
> >>> + }
> >>> +
> >>> + smp_mb(); /* ensure rss load happens before later caller actions. */
> >>> +
> >>> + /* If already fully idle, tell the caller (in case of races). */
> >>> + if (rss == RCU_SYSIDLE_FULL_NOTED)
> >>> + return true;
> >>> +
> >>> + /*
> >>> + * If we aren't there yet, and a grace period is not in flight,
> >>> + * initiate a grace period. Either way, tell the caller that
> >>> + * we are not there yet.
> >>> + */
> >>> + if (nr_cpu_ids > CONFIG_NO_HZ_FULL_SYSIDLE_SMALL &&
> >>> + !rcu_gp_in_progress(rcu_sysidle_state) &&
> >>> + !rsh.inuse && xchg(&rsh.inuse, 1) == 0)
> >>> + call_rcu(&rsh.rh, rcu_sysidle_cb);
> >>
> >> why need to use xchg()? Who will it race with?
> >
> > The xchg() is taking the place of the memory barriers that would otherwise
> > be present in the memory allocator -- the RCU callback would free itself,
> > and we would allocate it here. This memory barrier pairs with the one
> > you asked about in rcu_sysidle_cb().
>
> but xchg() will not fail here; is "smp_mb(); rsh.inuse = 1;" enough?

Frederic talks about having multiple timekeeping CPUs at some point, so
maybe the right approach is WARN_ON_ONCE(xchg(&rsh.inuse, 1) != 0).
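
A user-space model of the rsh.inuse handoff may make the pairing clearer.
This is a sketch only, with GCC atomics standing in for the kernel's
xchg() and for the smp_mb() in rcu_sysidle_cb():

#include <stdio.h>

static int inuse;

static int try_claim(void)
{
        /* Nonzero return means we won the race and may queue the callback. */
        return __atomic_exchange_n(&inuse, 1, __ATOMIC_SEQ_CST) == 0;
}

static void callback_done(void)
{
        /* The release ordering plays the role of the smp_mb() executed in
         * rcu_sysidle_cb() before clearing ->inuse. */
        __atomic_store_n(&inuse, 0, __ATOMIC_RELEASE);
}

int main(void)
{
        printf("first claim:   %d\n", try_claim());  /* 1: we got it */
        printf("second claim:  %d\n", try_claim());  /* 0: already in use */
        callback_done();
        printf("after release: %d\n", try_claim()); /* 1: available again */
        return 0;
}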

Thanx, Paul

> > So the intent is that if this code sees rsh.inuse==0, then the call_rcu()
> > won't have to worry about rcu_do_batch() messing with the callback.
> >
> > For example, suppose that we some day make rcu_do_batch NULL out each
> > callback's ->next pointer for debugging purposes. We would then need
> > both the memory barrier in rcu_sysidle_cb() and the memory barriers in
> > the xchg() to ensure that call_rcu()'s assignment to ->next happened
> > after rcu_do_batch()'s NULLing of the ->next pointer.
> >
> > Thanx, Paul
> >
> >> Thanks,
> >> Lai
> >>
> >>
> >>> + return false;
> >>> }
> >>>
> >>> /*
> >>> @@ -2483,6 +2744,21 @@ static void rcu_sysidle_exit(struct rcu_dynticks *rdtp, int irq)
> >>> {
> >>> }
> >>>
> >>> +static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
> >>> + unsigned long *maxj)
> >>> +{
> >>> +}
> >>> +
> >>> +static bool is_sysidle_rcu_state(struct rcu_state *rsp)
> >>> +{
> >>> + return false;
> >>> +}
> >>> +
> >>> +static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
> >>> + unsigned long maxj)
> >>> +{
> >>> +}
> >>> +
> >>> static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp)
> >>> {
> >>> }
> >>> diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
> >>> index c7d2fd6..3381f09 100644
> >>> --- a/kernel/time/Kconfig
> >>> +++ b/kernel/time/Kconfig
> >>> @@ -157,6 +157,33 @@ config NO_HZ_FULL_SYSIDLE
> >>>
> >>> Say N if you are unsure.
> >>>
> >>> +config NO_HZ_FULL_SYSIDLE_SMALL
> >>> + int "Number of CPUs above which large-system approach is used"
> >>> + depends on NO_HZ_FULL_SYSIDLE
> >>> + range 1 NR_CPUS
> >>> + default 8
> >>> + help
> >>> + The full-system idle detection mechanism takes a lazy approach
> >>> + on large systems, as is required to attain decent scalability.
> >>> + However, on smaller systems, scalability is not anywhere near as
> >>> + large a concern as is energy efficiency. The sysidle subsystem
> >>> + therefore uses a fast but non-scalable algorithm for small
> >>> + systems and a lazier but scalable algorithm for large systems.
> >>> + This Kconfig parameter defines the number of CPUs in the largest
> >>> + system that will be considered to be "small".
> >>> +
> >>> + The default value will be fine in most cases. Battery-powered
> >>> + systems that (1) enable NO_HZ_FULL_SYSIDLE, (2) have larger
> >>> + numbers of CPUs, and (3) are suffering from battery-lifetime
> >>> + problems due to long sysidle latencies might wish to experiment
> >>> + with larger values for this Kconfig parameter. On the other
> >>> + hand, they might be even better served by disabling NO_HZ_FULL
> >>> + entirely, given that NO_HZ_FULL is intended for HPC and
> >>> + real-time workloads that at present do not tend to be run on
> >>> + battery-powered systems.
> >>> +
> >>> + Take the default if you are unsure.
> >>> +
> >>> config NO_HZ
> >>> bool "Old Idle dynticks config"
> >>> depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS
> >>
> >
> >
>

2013-09-06 08:08:31

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 8/9] nohz_full: Add full-system-idle state machine

On Tue, Aug 20, 2013 at 4:47 AM, Paul E. McKenney
<[email protected]> wrote:
> --- a/kernel/time/Kconfig
> +++ b/kernel/time/Kconfig
> @@ -157,6 +157,33 @@ config NO_HZ_FULL_SYSIDLE
>
> Say N if you are unsure.
>
> +config NO_HZ_FULL_SYSIDLE_SMALL
> + int "Number of CPUs above which large-system approach is used"
> + depends on NO_HZ_FULL_SYSIDLE
> + range 1 NR_CPUS

This causes "kernel/time/Kconfig:162:warning: range is invalid" on m68k and
all other architectures that do not support SMP.

How to reproduce:
make ARCH=m68k defconfig

Furthermore, it seems only hexagon, metag, mips, and x86 set NR_CPUS to 1
if !SMP. On other architectures, NR_CPUS is not defined and presumed to be 0.

Hence in non-interactive configs (e.g. "make defconfig"),
NO_HZ_FULL_SYSIDLE_SMALL will end up as 0.
In interactive configs (e.g. "make oldconfig"), Kconfig suggests "0" as the
default, but refuses to accept it, as it doesn't fall within the range 1..0.

How to reproduce:
Remove the "depends on NO_HZ_FULL_SYSIDLE"
make ARCH=powerpc mpc83xx_defconfig
grep NO_HZ_FULL_SYSIDLE_SMALL .config
-> CONFIG_NO_HZ_FULL_SYSIDLE_SMALL=0
sed 's/CONFIG_NO_HZ_FULL_SYSIDLE_SMALL=0//g' -i .config
make ARCH=powerpc oldconfig
-> no value is accepted

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2013-09-06 17:30:58

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 8/9] nohz_full: Add full-system-idle state machine

On Fri, Sep 06, 2013 at 10:08:22AM +0200, Geert Uytterhoeven wrote:
> On Tue, Aug 20, 2013 at 4:47 AM, Paul E. McKenney
> <[email protected]> wrote:
> > --- a/kernel/time/Kconfig
> > +++ b/kernel/time/Kconfig
> > @@ -157,6 +157,33 @@ config NO_HZ_FULL_SYSIDLE
> >
> > Say N if you are unsure.
> >
> > +config NO_HZ_FULL_SYSIDLE_SMALL
> > + int "Number of CPUs above which large-system approach is used"
> > + depends on NO_HZ_FULL_SYSIDLE
> > + range 1 NR_CPUS
>
> This causes "kernel/time/Kconfig:162:warning: range is invalid" on m68k and
> all other architectures that do not support SMP.
>
> How to reproduce:
> make ARCH=m68k defconfig

OK, this does complain, but seems to give a reasonable .config file.
(From what I can tell.) It would clearly be good to get rid of the
complaint.

> Furthermore, it seems only hexagon, metag, mips, and x86 set NR_CPUS to 1
> if !SMP. On other architectures, NR_CPUS is not defined and presumed to be 0.

Would it make sense to require that NR_CPUS=1 for !SMP?

> Hence in non-interactive configs (e.g. "make defconfig"),
> NO_HZ_FULL_SYSIDLE_SMALL will end up as 0.
> In interactive configs (e.g. "make oldconfig"), Kconfig suggests "0" as the
> default, but refuses to accept it, as it doesn't fall within the range 1..0.
>
> How to reproduce:
> Remove the "depends on NO_HZ_FULL_SYSIDLE"
> make ARCH=powerpc mpc83xx_defconfig
> grep NO_HZ_FULL_SYSIDLE_SMALL .config
> -> CONFIG_NO_HZ_FULL_SYSIDLE_SMALL=0
> sed 's/CONFIG_NO_HZ_FULL_SYSIDLE_SMALL=0//g' -i .config
> make ARCH=powerpc oldconfig
> -> no value is accepted

If it turns out that there is some reason why NR_CPUS=1 is impossible,
there are a few things that I could do:

I could just leave the range off, which would allow people to give
nonsense values. This would be harmless in the code: for example,
a negative value would simply disable small-system handling, while
a too-large value would similarly disable large-system handling.
Might be a bit obnoxious for the guy who typoed and then wasted a
kernel build/boot/test cycle, but it is an option.
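
To see why nonsense values would be harmless, consider a stripped-down
version of the small/large decision (the values here are made up for the
sketch):

#include <stdbool.h>
#include <stdio.h>

static bool use_small_system_path(int nr_cpu_ids, int sysidle_small)
{
        /* This comparison is the only way the Kconfig value is consumed. */
        return nr_cpu_ids <= sysidle_small;
}

int main(void)
{
        /* Negative cutoff: the small-system path simply never triggers. */
        printf("%d\n", use_small_system_path(8, -1));       /* 0 */
        /* Absurdly large cutoff: the large-system path never triggers. */
        printf("%d\n", use_small_system_path(8, 1 << 30));  /* 1 */
        return 0;
}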

I could use a small fixed range (say from 1 to 64), which would
provide at least some checking. In the unlikely event that someone
really wants more than 64 CPUs handled with small-system handling,
we could revisit at that point.

I tried creating a NR_CPUS_REALLY as follows:

config NR_CPUS_REALLY
        int "Fixed version of NR_CPUS"
        default NR_CPUS if NR_CPUS
        default 1 if !NR_CPUS

But this still gave a warning on the first "default" even though it
was not in effect. I also tried using Kconfig "if":

if SMP
config NR_CPUS_REALLY
        int "Fixed version of NR_CPUS"
        default NR_CPUS
endif
if !SMP
config NR_CPUS_REALLY
        int "Fixed version of NR_CPUS"
        default 1 if !SMP
endif

However, Kconfig complained about the use of NR_CPUS even though this
was under an "if" whose condition was not set. Perhaps someone with
better Kconfig-fu than I have can come up with something.

Defining NR_CPUS=1 if !SMP is looking pretty good to me just now.
This would probably have other benefits -- I cannot be the only
person who ever wanted this. ;-)

Thanx, Paul

2013-09-06 18:50:46

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 8/9] nohz_full: Add full-system-idle state machine

On Fri, Sep 6, 2013 at 7:30 PM, Paul E. McKenney
<[email protected]> wrote:
>> Furthermore, it seems only hexagon, metag, mips, and x86 set NR_CPUS to 1
>> if !SMP. On other architectures, NR_CPUS is not defined and presumed to be 0.
>
> Would it make sense to require that NR_CPUS=1 for !SMP?

Yes, this looks reasonable to me.

> I tried creating a NR_CPUS_REALLY as follows:
>
> config NR_CPUS_REALLY
>         int "Fixed version of NR_CPUS"
>         default NR_CPUS if NR_CPUS
>         default 1 if !NR_CPUS
>
> But this still gave a warning on the first "default" even though it
> was not in effect. I also tried using Kconfig "if":

IIRC, it tries to use the first default first, so the below may work
(the "if SMP" is probably not needed):

config NR_CPUS_REALLY
        int "Fixed version of NR_CPUS"
        default 1 if !SMP
        default NR_CPUS if SMP

> Defining NR_CPUS=1 if !SMP is looking pretty good to me just now.
> This would probably have other benefits -- I cannot be the only
> person who ever wanted this. ;-)

Sure. I just didn't want to create patches for all architectures without
having a discussion first.

And it would be nice if it could be done in a central place, without
touching all architectures.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2013-09-06 19:32:17

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 8/9] nohz_full: Add full-system-idle state machine

On Fri, Sep 06, 2013 at 08:50:41PM +0200, Geert Uytterhoeven wrote:
> On Fri, Sep 6, 2013 at 7:30 PM, Paul E. McKenney
> <[email protected]> wrote:
> >> Furthermore, it seems only hexagon, metag, mips, and x86 set NR_CPUS to 1
> >> if !SMP. On other architectures, NR_CPUS is not defined and presumed to be 0.
> >
> > Would it make sense to require that NR_CPUS=1 for !SMP?
>
> Yes, this looks reasonable to me.
>
> > I tried creating a NR_CPUS_REALLY as follows:
> >
> > config NR_CPUS_REALLY
> >         int "Fixed version of NR_CPUS"
> >         default NR_CPUS if NR_CPUS
> >         default 1 if !NR_CPUS
> >
> > But this still gave a warning on the first "default" even though it
> > was not in effect. I also tried using Kconfig "if":
>
> IIRC, it tries to use the first default first, so the below may work
> (the "if SMP" is probably not needed):
>
> config NR_CPUS_REALLY
>         int "Fixed version of NR_CPUS"
>         default 1 if !SMP
>         default NR_CPUS if SMP

Seemed like a good idea, but I still get:

make O=/tmp/e ARCH=m68k defconfig
GEN /tmp/e/Makefile
*** Default configuration is based on 'multi_defconfig'
kernel/time/Kconfig:140:warning: 'NR_CPUS_REALLY': number is invalid
#
# configuration written to .config
#

Diff below in case I messed something up.

> > Defining NR_CPUS=1 if !SMP is looking pretty good to me just now.
> > This would probably have other benefits -- I cannot be the only
> > person who ever wanted this. ;-)
>
> Sure. I just didn't want to create patches for all architectures without
> having a discussion first.
>
> And it would be nice if it cuould be done in a central place, without
> touching all architectures.

Agreed, should it prove possible. ;-)

Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 3381f09..cb7a932 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -134,6 +134,11 @@ config NO_HZ_FULL_ALL
           Note the boot CPU will still be kept outside the range to
           handle the timekeeping duty.
 
+config NR_CPUS_REALLY
+        int "Fixed version of NR_CPUS"
+        default 1 if !SMP
+        default NR_CPUS if SMP
+
 config NO_HZ_FULL_SYSIDLE
         bool "Detect full-system idle state for full dynticks system"
         depends on NO_HZ_FULL
@@ -160,7 +165,7 @@ config NO_HZ_FULL_SYSIDLE
 config NO_HZ_FULL_SYSIDLE_SMALL
         int "Number of CPUs above which large-system approach is used"
         depends on NO_HZ_FULL_SYSIDLE
-        range 1 NR_CPUS
+        range 1 NR_CPUS_REALLY
         default 8
         help
           The full-system idle detection mechanism takes a lazy approach

2013-09-07 09:13:58

by Yann E. MORIN

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 8/9] nohz_full: Add full-system-idle state machine

Paul, All,

On 2013-09-06 10:30 -0700, Paul E. McKenney spake thusly:
[--SNIP--]
> I also tried using Kconfig "if":
>
> if SMP
> config NR_CPUS_REALLY
>         int "Fixed version of NR_CPUS"
>         default NR_CPUS
> endif
> if !SMP
> config NR_CPUS_REALLY
>         int "Fixed version of NR_CPUS"
>         default 1 if !SMP

The 'if !SMP' here is unneeded; you're already in an 'if !SMP' if-block.

> endif
>
> However, Kconfig complained about the use of NR_CPUS even though this
> was under an "if" whose condition was not set. Perhaps someone with
> better Kconfig-fu than I have can come up with something.

That's because the 'if' condition is added to the dependency list of the
symbol(s) that is(are) enclosed in the if.

'if' in Kconfig behaves the same way as an 'if' in C. What you expected
(I believe) was the behaviour of '#ifdef', which is not the case. From
Documentation/kbuild/kconfig-language.txt:

---8<---
if:

"if" <expr>
<if block>
"endif"

This defines an if block. The dependency expression <expr> is appended
to all enclosed menu entries.
---8<---

There's no equivalent to '#ifdef' in Kconfig.

I'll see if I can come up with a meaningful construct that fixes your
use-case. Don't hold your breath, though! ;-)

Regards,
Yann E. MORIN.

--
.-----------------.--------------------.------------------.--------------------.
| Yann E. MORIN | Real-Time Embedded | /"\ ASCII RIBBON | Erics' conspiracy: |
| +33 662 376 056 | Software Designer | \ / CAMPAIGN | ___ |
| +33 223 225 172 `------------.-------: X AGAINST | \e/ There is no |
| http://ymorin.is-a-geek.org/ | _/*\_ | / \ HTML MAIL | v conspiracy. |
'------------------------------^-------^------------------^--------------------'

2013-09-07 11:23:00

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 8/9] nohz_full: Add full-system-idle state machine

On Fri, Sep 6, 2013 at 8:50 PM, Geert Uytterhoeven <[email protected]> wrote:
> On Fri, Sep 6, 2013 at 7:30 PM, Paul E. McKenney
> <[email protected]> wrote:
>>> Furthermore, it seems only hexagon, metag, mips, and x86 set NR_CPUS to 1
>>> if !SMP. On other architectures, NR_CPUS is not defined and presumed to be 0.
>>
>> Would it make sense to require that NR_CPUS=1 for !SMP?
>
> Yes, this looks reasonable to me.

Perhaps we can invert the logic and define only NR_CPUS in arch-specific
code, and derive SMP from NR_CPUS != 1 in generic code?

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2013-09-07 18:57:31

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 8/9] nohz_full: Add full-system-idle state machine

On Sat, Sep 07, 2013 at 11:13:48AM +0200, Yann E. MORIN wrote:
> Paul, All,
>
> On 2013-09-06 10:30 -0700, Paul E. McKenney spake thusly:
> [--SNIP--]
> > I also tried using Kconfig "if":
> >
> > if SMP
> > config NR_CPUS_REALLY
> >         int "Fixed version of NR_CPUS"
> >         default NR_CPUS
> > endif
> > if !SMP
> > config NR_CPUS_REALLY
> >         int "Fixed version of NR_CPUS"
> >         default 1 if !SMP
>
> The 'if !SMP' here is unneeded; you're already in an 'if !SMP' if-block.

Agreed, though I get the same result even without the !SMP.

> > endif
> >
> > However, Kconfig complained about the use of NR_CPUS even though this
> > was under an "if" whose condition was not set. Perhaps someone with
> > better Kconfig-fu than I have can come up with something.
>
> That's because the 'if' condition is added to the dependency list of the
> symbol(s) that is(are) enclosed in the if.
>
> 'if' in Kconfig behaves the same way as an 'if' in C. What you expected
> (I believe) was the behaviour of '#ifdef', which is not the case. From
> Documentation/kbuild/kconfig-language.txt:
>
> ---8<---
> if:
>
> "if" <expr>
> <if block>
> "endif"
>
> This defines an if block. The dependency expression <expr> is appended
> to all enclosed menu entries.
> ---8<---

OK, I did read this, but misunderstood it.

> There's no equivalent to '#ifdef' in Kconfig.
>
> I'll see if I can come up with a meaningful construct that fixes your
> use-case. Don't hold your breath, though! ;-)

If not, we need to add NR_CPUS to the architectures lacking them...

Thanx, Paul

2013-09-07 19:00:03

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 8/9] nohz_full: Add full-system-idle state machine

On Sat, Sep 07, 2013 at 01:22:57PM +0200, Geert Uytterhoeven wrote:
> On Fri, Sep 6, 2013 at 8:50 PM, Geert Uytterhoeven <[email protected]> wrote:
> > On Fri, Sep 6, 2013 at 7:30 PM, Paul E. McKenney
> > <[email protected]> wrote:
> >>> Furthermore, it seems only hexagon, metag, mips, and x86 set NR_CPUS to 1
> >>> if !SMP. On other architectures, NR_CPUS is not defined and presumed to be 0.
> >>
> >> Would it make sense to require that NR_CPUS=1 for !SMP?
> >
> > Yes, this looks reasonable to me.
>
> Perhaps we can invert the logic and define only NR_CPUS in arch-specific
> code, and derive SMP from NR_CPUS != 1 in generic code?

If we always had NR_CPUS defined, that might be a good way to go.
We would of course need acks from the various arch maintainers. I am
guessing that we are OK for m68k. ;-)

Thanx, Paul

2013-09-08 10:33:00

by Yann E. MORIN

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 8/9] nohz_full: Add full-system-idle state machine

Paul, All,

On 2013-09-07 11:57 -0700, Paul E. McKenney spake thusly:
> On Sat, Sep 07, 2013 at 11:13:48AM +0200, Yann E. MORIN wrote:
[--SNIP--]
> > I'll see if I can come up with a meaningful construct that fixes your
> > use-case. Don't hold your breath, though! ;-)
>
> If not, we need to add NR_CPUS to the architectures lacking them...

Unfortunately, I was not able to come up with anything suitable.

I think your proposal to always define NR_CPUS=1 for architectures
without SMP support is a good solution.

After all, if !SMP because the architecture does not support it, I
believe it makes sense that NR_CPUS be defined to 1.

Unless, that is, NR_CPUS carries with it a hidden meaning about SMP being
possible, which would probably be wrong anyway, since we already have the
SMP symbol for that.

Regards,
Yann E. MORIN.

--
.-----------------.--------------------.------------------.--------------------.
| Yann E. MORIN | Real-Time Embedded | /"\ ASCII RIBBON | Erics' conspiracy: |
| +33 662 376 056 | Software Designer | \ / CAMPAIGN | ___ |
| +33 223 225 172 `------------.-------: X AGAINST | \e/ There is no |
| http://ymorin.is-a-geek.org/ | _/*\_ | / \ HTML MAIL | v conspiracy. |
'------------------------------^-------^------------------^--------------------'

2013-09-08 10:46:47

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH tip/core/rcu 8/9] nohz_full: Add full-system-idle state machine

On Sun, Sep 08, 2013 at 12:32:50PM +0200, Yann E. MORIN wrote:
> Paul, All,
>
> On 2013-09-07 11:57 -0700, Paul E. McKenney spake thusly:
> > On Sat, Sep 07, 2013 at 11:13:48AM +0200, Yann E. MORIN wrote:
> [--SNIP--]
> > > I'll see if I can come up with a meaningful construct that fixes your
> > > use-case. Don't hold your breath, though! ;-)
> >
> > If not, we need to add NR_CPUS to the architectures lacking them...
>
> Unfortunately, I was not able to come up with anything suitable.
>
> I think your proposal to always define NR_CPUS=1 for architectures
> without SMP support is a good solution.
>
> After all, if !SMP because the architecture does not support it, I
> believe it makes sense that NR_CPUS be defined to 1.
>
> Unless, that is, NR_CPUS carries with it a hidden meaning about SMP being
> possible, which would probably be wrong anyway, since we already have the
> SMP symbol for that.

Right, if possible, the best would be to define CONFIG_NR_CPUS in a single
central place, like arch/Kconfig:

config NR_CPUS
        default 1 if !SMP

It's a single int after all. Specific arch constraints can probably be
overridden from the arch.

>
> Regards,
> Yann E. MORIN.
>
> --
> .-----------------.--------------------.------------------.--------------------.
> | Yann E. MORIN | Real-Time Embedded | /"\ ASCII RIBBON | Erics' conspiracy: |
> | +33 662 376 056 | Software Designer | \ / CAMPAIGN | ___ |
> | +33 223 225 172 `------------.-------: X AGAINST | \e/ There is no |
> | http://ymorin.is-a-geek.org/ | _/*\_ | / \ HTML MAIL | v conspiracy. |
> '------------------------------^-------^------------------^--------------------'