2012-05-03 14:57:07

by Gilad Ben-Yossef

Subject: [PATCH v1 0/6] reduce workqueue and timer noise

Timers and work queues both provide a useful way to defer work
for a later time or a different context. However, when the
timer or work item runs, it interrupts the CPU it is running on.
This is good if it is doing useful work, but it turns out this
is not always the case.

This patch set tries to locate code paths where work queue items
and timers are scheduled on CPUs on which they have no useful work
to do, and to adapt them to be more selective.

This includes:

- Introducing helper functions to schedule work queue items on a subset
of CPUs in the system (see the short usage sketch after this list).
- Using the helper functions to schedule the work items that attempt to
drain LRUs only on CPUs that actually have LRU pages.
- Stopping the per-CPU work item that does per-CPU page reclaim and
VM statistics updates on CPUs that did not have any VM activity for
the last second (time frame configurable), and re-starting it when
VM activity is detected.
- Fixing a bug that caused the timer code to keep programming the
underlying HW timer to fire periodically even when no future timer
event exists for a CPU.
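
For reference, these are the prototypes of the two helpers added by
patches 2 and 3, plus a made-up caller to show the intended use (the
do_flush()/flush_needed() names are purely illustrative and are not
part of this set):

  extern int schedule_on_each_cpu_mask(work_func_t func,
                                       const struct cpumask *mask);
  extern int schedule_on_each_cpu_cond(work_func_t func,
                                       bool (*cond_func)(int cpu));

  /* e.g. run do_flush() only on CPUs where flush_needed() says there
   * is something to flush, instead of waking all online CPUs */
  err = schedule_on_each_cpu_cond(do_flush, flush_needed);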

Changelog:

- The vmstat_update patch was changed to use a scapegoat CPU as
suggested by Christoph Lameter when the patch was previously
discussed in response to Frederic Weisbecker's adaptive tick
patch set.

Also included is a testing-only patch, not intended for mainline,
that turns the clocksource watchdog into a config option, which
I used while testing the timer code fix.

The patch set was boot tested on 32-bit x86 in 8-way SMP and UP VMs.

For your reference, I keep a todo list for these and other noise sources
at: https://github.com/gby/linux/wiki

The branch can be fetched from the git repo at
[email protected]:gby/linux.git, on the reduce_workqueue_and_timers_noise_v1
branch.

Gilad Ben-Yossef (6):
timer: make __next_timer_interrupt explicit about no future event
workqueue: introduce schedule_on_each_cpu_mask
workqueue: introduce schedule_on_each_cpu_cond
mm: make lru_drain selective where it schedules work
mm: make vmstat_update periodic run conditional
x86: make clocksource watchdog configurable (not for mainline)

arch/x86/Kconfig | 9 +++-
include/linux/vmstat.h | 2 +-
include/linux/workqueue.h | 4 ++
kernel/time/clocksource.c | 2 +
kernel/timer.c | 31 ++++++++++-----
kernel/workqueue.c | 73 ++++++++++++++++++++++++++++++----
mm/swap.c | 25 +++++++++++-
mm/vmstat.c | 95 ++++++++++++++++++++++++++++++++++++++-------
8 files changed, 204 insertions(+), 37 deletions(-)

Signed-off-by: Gilad Ben-Yossef <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Tejun Heo <[email protected]>
CC: John Stultz <[email protected]>
CC: Andrew Morton <[email protected]>
CC: KOSAKI Motohiro <[email protected]>
CC: Mel Gorman <[email protected]>
CC: Mike Frysinger <[email protected]>
CC: David Rientjes <[email protected]>
CC: Hugh Dickins <[email protected]>
CC: Minchan Kim <[email protected]>
CC: Konstantin Khlebnikov <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Chris Metcalf <[email protected]>
CC: Hakan Akkan <[email protected]>
CC: Max Krasnyansky <[email protected]>
CC: Frederic Weisbecker <[email protected]>
CC: [email protected]
CC: [email protected]


2012-05-03 14:57:30

by Gilad Ben-Yossef

Subject: [PATCH v1 5/6] mm: make vmstat_update periodic run conditional

vmstat_update runs every second from the work queue to update statistics
and drain per cpu pages back into the global page allocator.

This is useful in most circumstances, but is wasteful if the CPU doesn't
actually generate any VM activity. This can happen when the CPU is idle
or is running a long-term CPU-bound task (e.g. under CPU isolation), in
which case the periodic vmstat_update work needlessly interrupts the CPU.

This patch makes vmstat_update schedule itself for the next round
only if there was any work for it to do in the previous run. The
assumption is that if we didn't see any VM activity for a whole second,
it is reasonable to assume that the CPU is not using the VM, because it
is idle or is running a long-term, single-CPU-bound task.

A new single unbound system work queue item is scheduled periodically
to monitor CPUs that have their vmstat_update work stopped and
re-schedule them if VM activity is detected.
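
In outline, the per-CPU work ends up looking like this (a condensed
sketch of the code added below, not the code itself):

  static void vmstat_update(struct work_struct *w)
  {
          int cpu = smp_processor_id();

          if (refresh_cpu_vm_stats(cpu))
                  /* saw VM activity: keep the per-CPU work alive */
                  schedule_delayed_work(&__get_cpu_var(vmstat_work),
                          round_jiffies_relative(sysctl_stat_interval));
          else
                  /* quiet CPU: park it; vmstat_update_monitor() will
                   * re-arm it once need_vmstat(cpu) turns true again */
                  cpumask_set_cpu(cpu, &vmstat_off_cpus);
  }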

Signed-off-by: Gilad Ben-Yossef <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Tejun Heo <[email protected]>
CC: John Stultz <[email protected]>
CC: Andrew Morton <[email protected]>
CC: KOSAKI Motohiro <[email protected]>
CC: Mel Gorman <[email protected]>
CC: Mike Frysinger <[email protected]>
CC: David Rientjes <[email protected]>
CC: Hugh Dickins <[email protected]>
CC: Minchan Kim <[email protected]>
CC: Konstantin Khlebnikov <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Chris Metcalf <[email protected]>
CC: Hakan Akkan <[email protected]>
CC: Max Krasnyansky <[email protected]>
CC: Frederic Weisbecker <[email protected]>
CC: [email protected]
CC: [email protected]
---
include/linux/vmstat.h | 2 +-
mm/vmstat.c | 95 +++++++++++++++++++++++++++++++++++++++++-------
2 files changed, 82 insertions(+), 15 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 65efb92..67bf202 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -200,7 +200,7 @@ extern void __inc_zone_state(struct zone *, enum zone_stat_item);
extern void dec_zone_state(struct zone *, enum zone_stat_item);
extern void __dec_zone_state(struct zone *, enum zone_stat_item);

-void refresh_cpu_vm_stats(int);
+bool refresh_cpu_vm_stats(int);
void refresh_zone_stat_thresholds(void);

int calculate_pressure_threshold(struct zone *zone);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7db1b9b..1676132 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -14,6 +14,7 @@
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/cpu.h>
+#include <linux/cpumask.h>
#include <linux/vmstat.h>
#include <linux/sched.h>
#include <linux/math64.h>
@@ -434,11 +435,12 @@ EXPORT_SYMBOL(dec_zone_page_state);
* with the global counters. These could cause remote node cache line
* bouncing and will have to be only done when necessary.
*/
-void refresh_cpu_vm_stats(int cpu)
+bool refresh_cpu_vm_stats(int cpu)
{
struct zone *zone;
int i;
int global_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
+ bool vm_activity = false;

for_each_populated_zone(zone) {
struct per_cpu_pageset *p;
@@ -485,14 +487,21 @@ void refresh_cpu_vm_stats(int cpu)
if (p->expire)
continue;

- if (p->pcp.count)
+ if (p->pcp.count) {
+ vm_activity = true;
drain_zone_pages(zone, &p->pcp);
+ }
#endif
}

for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
- if (global_diff[i])
+ if (global_diff[i]) {
atomic_long_add(global_diff[i], &vm_stat[i]);
+ vm_activity = true;
+ }
+
+ return vm_activity;
+
}

#endif
@@ -1141,22 +1150,72 @@ static const struct file_operations proc_vmstat_file_operations = {
#ifdef CONFIG_SMP
static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
int sysctl_stat_interval __read_mostly = HZ;
+static struct cpumask vmstat_off_cpus;
+struct delayed_work vmstat_monitor_work;

-static void vmstat_update(struct work_struct *w)
+static inline bool need_vmstat(int cpu)
{
- refresh_cpu_vm_stats(smp_processor_id());
- schedule_delayed_work(&__get_cpu_var(vmstat_work),
- round_jiffies_relative(sysctl_stat_interval));
+ struct zone *zone;
+ int i;
+
+ for_each_populated_zone(zone) {
+ struct per_cpu_pageset *p;
+
+ p = per_cpu_ptr(zone->pageset, cpu);
+
+ for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
+ if (p->vm_stat_diff[i])
+ return true;
+
+ if (zone_to_nid(zone) != numa_node_id() && p->pcp.count)
+ return true;
+ }
+
+ return false;
}

-static void __cpuinit start_cpu_timer(int cpu)
+static void vmstat_update(struct work_struct *w);
+
+static void start_cpu_timer(int cpu)
{
struct delayed_work *work = &per_cpu(vmstat_work, cpu);

- INIT_DELAYED_WORK_DEFERRABLE(work, vmstat_update);
+ cpumask_clear_cpu(cpu, &vmstat_off_cpus);
schedule_delayed_work_on(cpu, work, __round_jiffies_relative(HZ, cpu));
}

+static void __cpuinit setup_cpu_timer(int cpu)
+{
+ struct delayed_work *work = &per_cpu(vmstat_work, cpu);
+
+ INIT_DELAYED_WORK_DEFERRABLE(work, vmstat_update);
+ start_cpu_timer(cpu);
+}
+
+static void vmstat_update_monitor(struct work_struct *w)
+{
+ int cpu;
+
+ for_each_cpu_and(cpu, &vmstat_off_cpus, cpu_online_mask)
+ if (need_vmstat(cpu))
+ start_cpu_timer(cpu);
+
+ queue_delayed_work(system_unbound_wq, &vmstat_monitor_work,
+ round_jiffies_relative(sysctl_stat_interval));
+}
+
+
+static void vmstat_update(struct work_struct *w)
+{
+ int cpu = smp_processor_id();
+
+ if (likely(refresh_cpu_vm_stats(cpu)))
+ schedule_delayed_work(&__get_cpu_var(vmstat_work),
+ round_jiffies_relative(sysctl_stat_interval));
+ else
+ cpumask_set_cpu(cpu, &vmstat_off_cpus);
+}
+
/*
* Use the cpu notifier to insure that the thresholds are recalculated
* when necessary.
@@ -1171,17 +1230,19 @@ static int __cpuinit vmstat_cpuup_callback(struct notifier_block *nfb,
case CPU_ONLINE:
case CPU_ONLINE_FROZEN:
refresh_zone_stat_thresholds();
- start_cpu_timer(cpu);
+ setup_cpu_timer(cpu);
node_set_state(cpu_to_node(cpu), N_CPU);
break;
case CPU_DOWN_PREPARE:
case CPU_DOWN_PREPARE_FROZEN:
- cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
- per_cpu(vmstat_work, cpu).work.func = NULL;
+ if (!cpumask_test_cpu(cpu, &vmstat_off_cpus)) {
+ cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
+ per_cpu(vmstat_work, cpu).work.func = NULL;
+ }
break;
case CPU_DOWN_FAILED:
case CPU_DOWN_FAILED_FROZEN:
- start_cpu_timer(cpu);
+ setup_cpu_timer(cpu);
break;
case CPU_DEAD:
case CPU_DEAD_FROZEN:
@@ -1204,8 +1265,14 @@ static int __init setup_vmstat(void)

register_cpu_notifier(&vmstat_notifier);

+ INIT_DELAYED_WORK_DEFERRABLE(&vmstat_monitor_work,
+ vmstat_update_monitor);
+ queue_delayed_work(system_unbound_wq,
+ &vmstat_monitor_work,
+ round_jiffies_relative(HZ));
+
for_each_online_cpu(cpu)
- start_cpu_timer(cpu);
+ setup_cpu_timer(cpu);
#endif
#ifdef CONFIG_PROC_FS
proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);
--
1.7.0.4

2012-05-03 14:57:40

by Gilad Ben-Yossef

Subject: [PATCH v1 6/6] x86: make clocksource watchdog configurable (not for mainline)

The clock source watchdog wakes up idle cores.

Since I'm testing under KVM, where the TSC is always marked
unstable, I've added this option to allow disabling the watchdog,
to assist testing.

This is not intended for mainlining; the patch just serves
as a reference for how I tested the patch set.

Signed-off-by: Gilad Ben-Yossef <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Tejun Heo <[email protected]>
CC: John Stultz <[email protected]>
CC: Andrew Morton <[email protected]>
CC: KOSAKI Motohiro <[email protected]>
CC: Mel Gorman <[email protected]>
CC: Mike Frysinger <[email protected]>
CC: David Rientjes <[email protected]>
CC: Hugh Dickins <[email protected]>
CC: Minchan Kim <[email protected]>
CC: Konstantin Khlebnikov <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Chris Metcalf <[email protected]>
CC: Hakan Akkan <[email protected]>
CC: Max Krasnyansky <[email protected]>
CC: Frederic Weisbecker <[email protected]>
CC: [email protected]
CC: [email protected]
---
arch/x86/Kconfig | 9 ++++++---
kernel/time/clocksource.c | 2 ++
2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1d14cc6..6706c00 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -99,9 +99,6 @@ config ARCH_DEFCONFIG
config GENERIC_CMOS_UPDATE
def_bool y

-config CLOCKSOURCE_WATCHDOG
- def_bool y
-
config GENERIC_CLOCKEVENTS
def_bool y

@@ -1673,6 +1670,12 @@ config HOTPLUG_CPU
automatically on SMP systems. )
Say N if you want to disable CPU hotplug.

+config CLOCKSOURCE_WATCHDOG
+ bool "Clocksource watchdog"
+ default y
+ help
+ Enable clock source watchdog.
+
config COMPAT_VDSO
def_bool y
prompt "Compat VDSO support"
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index c958338..1eb5a4e 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -450,6 +450,8 @@ static void clocksource_enqueue_watchdog(struct clocksource *cs)
static inline void clocksource_dequeue_watchdog(struct clocksource *cs) { }
static inline void clocksource_resume_watchdog(void) { }
static inline int clocksource_watchdog_kthread(void *data) { return 0; }
+void clocksource_mark_unstable(struct clocksource *cs) { }
+

#endif /* CONFIG_CLOCKSOURCE_WATCHDOG */

--
1.7.0.4

2012-05-03 14:58:01

by Gilad Ben-Yossef

Subject: [PATCH v1 4/6] mm: make lru_drain selective where it schedules work

LRU drain work is currently done by scheduling a work item on each
CPU, whether it has LRU pages to drain or not, thus creating
interference on isolated CPUs.

This patch uses schedule_on_each_cpu_cond() to schedule the work
only on CPUs where it seems that there are LRU pages to drain.

Signed-off-by: Gilad Ben-Yossef <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Tejun Heo <[email protected]>
CC: John Stultz <[email protected]>
CC: Andrew Morton <[email protected]>
CC: KOSAKI Motohiro <[email protected]>
CC: Mel Gorman <[email protected]>
CC: Mike Frysinger <[email protected]>
CC: David Rientjes <[email protected]>
CC: Hugh Dickins <[email protected]>
CC: Minchan Kim <[email protected]>
CC: Konstantin Khlebnikov <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Chris Metcalf <[email protected]>
CC: Hakan Akkan <[email protected]>
CC: Max Krasnyansky <[email protected]>
CC: Frederic Weisbecker <[email protected]>
CC: [email protected]
CC: [email protected]
---
mm/swap.c | 25 ++++++++++++++++++++++++-
1 files changed, 24 insertions(+), 1 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 5c13f13..ab07b62 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -562,12 +562,35 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
lru_add_drain();
}

+static bool lru_drain_cpu(int cpu)
+{
+ struct pagevec *pvecs = per_cpu(lru_add_pvecs, cpu);
+ struct pagevec *pvec;
+ int lru;
+
+ for_each_lru(lru) {
+ pvec = &pvecs[lru - LRU_BASE];
+ if (pagevec_count(pvec))
+ return true;
+ }
+
+ pvec = &per_cpu(lru_rotate_pvecs, cpu);
+ if (pagevec_count(pvec))
+ return true;
+
+ pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+ if (pagevec_count(pvec))
+ return true;
+
+ return false;
+}
+
/*
* Returns 0 for success
*/
int lru_add_drain_all(void)
{
- return schedule_on_each_cpu(lru_add_drain_per_cpu);
+ return schedule_on_each_cpu_cond(lru_add_drain_per_cpu, lru_drain_cpu);
}

/*
--
1.7.0.4

2012-05-03 14:57:16

by Gilad Ben-Yossef

Subject: [PATCH v1 2/6] workqueue: introduce schedule_on_each_cpu_mask

Introduce schedule_on_each_cpu_mask function to schedule a work
item on each online CPU which is included in the mask provided.

Then re-implement schedule_on_each_cpu on top of the new function.

This function should be preferred over schedule_on_each_cpu() in
any case where some of the CPUs, especially on a big multi-core
system, might not have actual work to perform, in order to save
needless wakeups and schedules.
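
As an illustration only (the names below are made up and not part of
this patch), a caller that tracks which CPUs have something to clean
up could do:

  /* illustrative sketch: dirty_cpus and do_cleanup() are hypothetical */
  static cpumask_t dirty_cpus;

  static void do_cleanup(struct work_struct *unused)
  {
          /* clean up the local CPU's state here */
  }

  static int cleanup_dirty_cpus(void)
  {
          /* wakes only the CPUs marked in dirty_cpus, instead of
           * every online CPU as schedule_on_each_cpu() would */
          return schedule_on_each_cpu_mask(do_cleanup, &dirty_cpus);
  }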

Signed-off-by: Gilad Ben-Yossef <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Tejun Heo <[email protected]>
CC: John Stultz <[email protected]>
CC: Andrew Morton <[email protected]>
CC: KOSAKI Motohiro <[email protected]>
CC: Mel Gorman <[email protected]>
CC: Mike Frysinger <[email protected]>
CC: David Rientjes <[email protected]>
CC: Hugh Dickins <[email protected]>
CC: Minchan Kim <[email protected]>
CC: Konstantin Khlebnikov <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Chris Metcalf <[email protected]>
CC: Hakan Akkan <[email protected]>
CC: Max Krasnyansky <[email protected]>
CC: Frederic Weisbecker <[email protected]>
CC: [email protected]
CC: [email protected]
---
include/linux/workqueue.h | 2 ++
kernel/workqueue.c | 36 ++++++++++++++++++++++++++++--------
2 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index af15545..20da95a 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -383,6 +383,8 @@ extern int schedule_delayed_work(struct delayed_work *work, unsigned long delay)
extern int schedule_delayed_work_on(int cpu, struct delayed_work *work,
unsigned long delay);
extern int schedule_on_each_cpu(work_func_t func);
+extern int schedule_on_each_cpu_mask(work_func_t func,
+ const struct cpumask *mask);
extern int keventd_up(void);

int execute_in_process_context(work_func_t fn, struct execute_work *);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 5abf42f..1c9782b 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2787,43 +2787,63 @@ int schedule_delayed_work_on(int cpu,
EXPORT_SYMBOL(schedule_delayed_work_on);

/**
- * schedule_on_each_cpu - execute a function synchronously on each online CPU
+ * schedule_on_each_cpu_mask - execute a function synchronously on each
+ * online CPU which is specified in the supplied cpumask
* @func: the function to call
+ * @mask: the cpu mask
*
- * schedule_on_each_cpu() executes @func on each online CPU using the
- * system workqueue and blocks until all CPUs have completed.
- * schedule_on_each_cpu() is very slow.
+ * schedule_on_each_cpu_mask() executes @func on each online CPU which
+ * is part of the @mask using the * system workqueue and blocks until
+ * all CPUs have completed
+ * schedule_on_each_cpu_mask() is very slow.
*
* RETURNS:
* 0 on success, -errno on failure.
*/
-int schedule_on_each_cpu(work_func_t func)
+int schedule_on_each_cpu_mask(work_func_t func, const struct cpumask *mask)
{
int cpu;
struct work_struct __percpu *works;

works = alloc_percpu(struct work_struct);
- if (!works)
+ if (unlikely(!works))
return -ENOMEM;

get_online_cpus();

- for_each_online_cpu(cpu) {
+ for_each_cpu_and(cpu, mask, cpu_online_mask) {
struct work_struct *work = per_cpu_ptr(works, cpu);

INIT_WORK(work, func);
schedule_work_on(cpu, work);
}

- for_each_online_cpu(cpu)
+ for_each_cpu_and(cpu, mask, cpu_online_mask)
flush_work(per_cpu_ptr(works, cpu));

put_online_cpus();
free_percpu(works);
+
return 0;
}

/**
+ * schedule_on_each_cpu - execute a function synchronously on each online CPU
+ * @func: the function to call
+ *
+ * schedule_on_each_cpu() executes @func on each online CPU using the
+ * system workqueue and blocks until all CPUs have completed.
+ * schedule_on_each_cpu() is very slow.
+ *
+ * RETURNS:
+ * 0 on success, -errno on failure.
+ */
+int schedule_on_each_cpu(work_func_t func)
+{
+ return schedule_on_each_cpu_mask(func, cpu_online_mask);
+}
+
+/**
* flush_scheduled_work - ensure that any scheduled work has run to completion.
*
* Forces execution of the kernel-global workqueue and blocks until its
--
1.7.0.4

2012-05-03 14:58:34

by Gilad Ben-Yossef

Subject: [PATCH v1 3/6] workqueue: introduce schedule_on_each_cpu_cond

Introduce the schedule_on_each_cpu_cond() function that schedules
a work item on each online CPU for which the supplied condition
function returns true.

This function should be used instead of schedule_on_each_cpu()
when only some of the CPUs have actual work to do and a predicate
function can tell if a certain CPU does or does not have work to do,
thus saving unneeded wakeups and schedules.
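
For illustration only (the names below are made up and not part of
this patch), a typical caller would pass a cheap per-CPU predicate:

  /* illustrative sketch: pending_count, has_pending() and do_flush()
   * are hypothetical */
  static DEFINE_PER_CPU(unsigned int, pending_count);

  static bool has_pending(int cpu)
  {
          return per_cpu(pending_count, cpu) != 0;
  }

  static void do_flush(struct work_struct *unused)
  {
          /* flush the local CPU's pending state here */
  }

  static int flush_pending(void)
  {
          /* wakes only the CPUs for which has_pending() returned true */
          return schedule_on_each_cpu_cond(do_flush, has_pending);
  }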

Signed-off-by: Gilad Ben-Yossef <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Tejun Heo <[email protected]>
CC: John Stultz <[email protected]>
CC: Andrew Morton <[email protected]>
CC: KOSAKI Motohiro <[email protected]>
CC: Mel Gorman <[email protected]>
CC: Mike Frysinger <[email protected]>
CC: David Rientjes <[email protected]>
CC: Hugh Dickins <[email protected]>
CC: Minchan Kim <[email protected]>
CC: Konstantin Khlebnikov <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Chris Metcalf <[email protected]>
CC: Hakan Akkan <[email protected]>
CC: Max Krasnyansky <[email protected]>
CC: Frederic Weisbecker <[email protected]>
CC: [email protected]
CC: [email protected]
---
include/linux/workqueue.h | 2 ++
kernel/workqueue.c | 37 +++++++++++++++++++++++++++++++++++++
2 files changed, 39 insertions(+), 0 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 20da95a..d7bb104 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -385,6 +385,8 @@ extern int schedule_delayed_work_on(int cpu, struct delayed_work *work,
extern int schedule_on_each_cpu(work_func_t func);
extern int schedule_on_each_cpu_mask(work_func_t func,
const struct cpumask *mask);
+extern int schedule_on_each_cpu_cond(work_func_t func,
+ bool (*cond_func)(int cpu));
extern int keventd_up(void);

int execute_in_process_context(work_func_t fn, struct execute_work *);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 1c9782b..3322d30 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2828,6 +2828,43 @@ int schedule_on_each_cpu_mask(work_func_t func, const struct cpumask *mask)
}

/**
+ * schedule_on_each_cpu_cond - execute a function synchronously on each
+ * online CPU for which the supplied condition function returns true
+ * @func: the function to run on the selected CPUs
+ * @cond_func: the function to call to select the CPUs
+ *
+ * schedule_on_each_cpu_cond() executes @func on each online CPU for
+ * @cond_func returns true using the system workqueue and blocks until
+ * all CPUs have completed.
+ * schedule_on_each_cpu_cond() is very slow.
+ *
+ * RETURNS:
+ * 0 on success, -errno on failure.
+ */
+int schedule_on_each_cpu_cond(work_func_t func, bool (*cond_func)(int cpu))
+{
+ int cpu, ret;
+ cpumask_var_t mask;
+
+ if (unlikely(!zalloc_cpumask_var(&mask, GFP_KERNEL)))
+ return -ENOMEM;
+
+ get_online_cpus();
+
+ for_each_online_cpu(cpu)
+ if (cond_func(cpu))
+ cpumask_set_cpu(cpu, mask);
+
+ ret = schedule_on_each_cpu_mask(func, mask);
+
+ put_online_cpus();
+
+ free_cpumask_var(mask);
+
+ return ret;
+}
+
+/**
* schedule_on_each_cpu - execute a function synchronously on each online CPU
* @func: the function to call
*
--
1.7.0.4

2012-05-03 14:58:59

by Gilad Ben-Yossef

Subject: [PATCH v1 1/6] timer: make __next_timer_interrupt explicit about no future event

The current timer code fails to correctly return a value indicating
that there is no future timer event, with the result that the timer
keeps getting re-armed in one-shot mode even when we could turn it
off, generating unneeded interrupts. This patch attempts to fix this
problem.

What is happening is that when __next_timer_interrupt() wishes
to return a value that signifies "there is no future timer
event", it returns (base->timer_jiffies + NEXT_TIMER_MAX_DELTA).

However, the code in tick_nohz_stop_sched_tick(), which called
__next_timer_interrupt() via get_next_timer_interrupt(),
compares the return value to (last_jiffies + NEXT_TIMER_MAX_DELTA)
to see if the timer needs to be re-armed.

base->timer_jiffies != last_jiffies, and so
tick_nohz_stop_sched_tick() interprets the return value as an
indication that there is a distant future event about 12 days
from now and programs the timer to fire next after KTIME_MAX
nsecs instead of not arming it at all. This ends up causing
a needless interrupt once every KTIME_MAX nsecs.
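
To make the mismatch concrete, here is a small standalone illustration
(user space, with made-up jiffies values; the constant matches the
kernel's NEXT_TIMER_MAX_DELTA as far as I can tell, and the test is
only an approximation of the check done in tick_nohz_stop_sched_tick()):

  #include <stdio.h>

  #define NEXT_TIMER_MAX_DELTA ((1UL << 30) - 1)

  int main(void)
  {
          /* made-up values: base->timer_jiffies lags last_jiffies */
          unsigned long timer_jiffies = 1000;
          unsigned long last_jiffies  = 1005;

          /* what __next_timer_interrupt() returns for "no future event" */
          unsigned long ret = timer_jiffies + NEXT_TIMER_MAX_DELTA;

          if (ret - last_jiffies >= NEXT_TIMER_MAX_DELTA)
                  printf("no pending timer: tick can stay off\n");
          else
                  printf("event seen %lu jiffies away: timer re-armed\n",
                         ret - last_jiffies);
          return 0;
  }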

I've noticed a similar but slightly different fix for the
same problem in the Tilera kernel tree from Chris M. (I wrote
this before seeing that one), so some variation of this fix has
been in use on real hardware for some time now.

Signed-off-by: Gilad Ben-Yossef <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Tejun Heo <[email protected]>
CC: John Stultz <[email protected]>
CC: Andrew Morton <[email protected]>
CC: KOSAKI Motohiro <[email protected]>
CC: Mel Gorman <[email protected]>
CC: Mike Frysinger <[email protected]>
CC: David Rientjes <[email protected]>
CC: Hugh Dickins <[email protected]>
CC: Minchan Kim <[email protected]>
CC: Konstantin Khlebnikov <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Chris Metcalf <[email protected]>
CC: Hakan Akkan <[email protected]>
CC: Max Krasnyansky <[email protected]>
CC: Frederic Weisbecker <[email protected]>
CC: [email protected]
CC: [email protected]
---
kernel/timer.c | 31 +++++++++++++++++++++----------
1 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/kernel/timer.c b/kernel/timer.c
index a297ffc..32ba64a 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1187,11 +1187,13 @@ static inline void __run_timers(struct tvec_base *base)
* is used on S/390 to stop all activity when a CPU is idle.
* This function needs to be called with interrupts disabled.
*/
-static unsigned long __next_timer_interrupt(struct tvec_base *base)
+static bool __next_timer_interrupt(struct tvec_base *base,
+ unsigned long *next_timer)
{
unsigned long timer_jiffies = base->timer_jiffies;
unsigned long expires = timer_jiffies + NEXT_TIMER_MAX_DELTA;
- int index, slot, array, found = 0;
+ int index, slot, array;
+ bool found = false;
struct timer_list *nte;
struct tvec *varray[4];

@@ -1202,12 +1204,12 @@ static unsigned long __next_timer_interrupt(struct tvec_base *base)
if (tbase_get_deferrable(nte->base))
continue;

- found = 1;
+ found = true;
expires = nte->expires;
/* Look at the cascade bucket(s)? */
if (!index || slot < index)
goto cascade;
- return expires;
+ goto out;
}
slot = (slot + 1) & TVR_MASK;
} while (slot != index);
@@ -1233,7 +1235,7 @@ cascade:
if (tbase_get_deferrable(nte->base))
continue;

- found = 1;
+ found = true;
if (time_before(nte->expires, expires))
expires = nte->expires;
}
@@ -1245,7 +1247,7 @@ cascade:
/* Look at the cascade bucket(s)? */
if (!index || slot < index)
break;
- return expires;
+ goto out;
}
slot = (slot + 1) & TVN_MASK;
} while (slot != index);
@@ -1254,7 +1256,10 @@ cascade:
timer_jiffies += TVN_SIZE - index;
timer_jiffies >>= TVN_BITS;
}
- return expires;
+out:
+ if (found)
+ *next_timer = expires;
+ return found;
}

/*
@@ -1317,9 +1322,15 @@ unsigned long get_next_timer_interrupt(unsigned long now)
if (cpu_is_offline(smp_processor_id()))
return now + NEXT_TIMER_MAX_DELTA;
spin_lock(&base->lock);
- if (time_before_eq(base->next_timer, base->timer_jiffies))
- base->next_timer = __next_timer_interrupt(base);
- expires = base->next_timer;
+ if (time_before_eq(base->next_timer, base->timer_jiffies)) {
+
+ if (__next_timer_interrupt(base, &expires))
+ base->next_timer = expires;
+ else
+ expires = now + NEXT_TIMER_MAX_DELTA;
+ } else
+ expires = base->next_timer;
+
spin_unlock(&base->lock);

if (time_before_eq(expires, now))
--
1.7.0.4

2012-05-03 15:39:48

by Tejun Heo

Subject: Re: [PATCH v1 3/6] workqueue: introduce schedule_on_each_cpu_cond

Hello,

On Thu, May 03, 2012 at 05:55:59PM +0300, Gilad Ben-Yossef wrote:
> Introduce the schedule_on_each_cpu_cond() function that schedules
> a work item on each online CPU for which the supplied condition
> function returns true.
>
> This function should be used instead of schedule_on_each_cpu()
> when only some of the CPUs have actual work to do and a predicate
> function can tell if a certain CPU does or does not have work to do,
> thus saving unneeded wakeups and schedules.
>
> /**
> + * schedule_on_each_cpu_cond - execute a function synchronously on each
> + * online CPU for which the supplied condition function returns true
> + * @func: the function to run on the selected CPUs
> + * @cond_func: the function to call to select the CPUs
> + *
> + * schedule_on_each_cpu_cond() executes @func on each online CPU for
> + * @cond_func returns true using the system workqueue and blocks until
> + * all CPUs have completed.
> + * schedule_on_each_cpu_cond() is very slow.
> + *
> + * RETURNS:
> + * 0 on success, -errno on failure.
> + */
> +int schedule_on_each_cpu_cond(work_func_t func, bool (*cond_func)(int cpu))
> +{
> + int cpu, ret;
> + cpumask_var_t mask;
> +
> + if (unlikely(!zalloc_cpumask_var(&mask, GFP_KERNEL)))
> + return -ENOMEM;
> +
> + get_online_cpus();
> +
> + for_each_online_cpu(cpu)
> + if (cond_func(cpu))
> + cpumask_set_cpu(cpu, mask);
> +
> + ret = schedule_on_each_cpu_mask(func, mask);
> +
> + put_online_cpus();
> +
> + free_cpumask_var(mask);
> +
> + return ret;
> +}

I'm usually not a big fan of callback based interfaces. They tend to
be quite clunky to use. e.g. in this case, wouldn't it be better to
have helper functions which allocate the cpumask and disable cpu hotplug
and undo that afterwards? That is, if such convenience helpers are
necessary at all. Also, a callback which doesn't have a private data
argument tends to be a PITA.

Thanks.

--
tejun

2012-05-04 04:45:40

by Srivatsa S. Bhat

Subject: Re: [PATCH v1 2/6] workqueue: introduce schedule_on_each_cpu_mask

On 05/03/2012 08:25 PM, Gilad Ben-Yossef wrote:

> Introduce schedule_on_each_cpu_mask function to schedule a work
> item on each online CPU which is included in the mask provided.
>
> Then re-implement schedule_on_each_cpu on top of the new function.
>
> This function should be prefered to schedule_on_each_cpu in
> any case where some of the CPUs, especially on a big multi-core
> system, might not have actual work to perform in order to save
> needless wakeups and schedules.
>
> Signed-off-by: Gilad Ben-Yossef <[email protected]>


> /**
> - * schedule_on_each_cpu - execute a function synchronously on each online CPU
> + * schedule_on_each_cpu_mask - execute a function synchronously on each
> + * online CPU which is specified in the supplied cpumask
> * @func: the function to call
> + * @mask: the cpu mask
> *
> - * schedule_on_each_cpu() executes @func on each online CPU using the
> - * system workqueue and blocks until all CPUs have completed.
> - * schedule_on_each_cpu() is very slow.
> + * schedule_on_each_cpu_mask() executes @func on each online CPU which
> + * is part of the @mask using the * system workqueue and blocks until

^^^
stray character?

> + * all CPUs have completed
> + * schedule_on_each_cpu_mask() is very slow.
> *
> * RETURNS:
> * 0 on success, -errno on failure.
> */
> -int schedule_on_each_cpu(work_func_t func)
> +int schedule_on_each_cpu_mask(work_func_t func, const struct cpumask *mask)
> {
> int cpu;
> struct work_struct __percpu *works;
>
> works = alloc_percpu(struct work_struct);
> - if (!works)
> + if (unlikely(!works))
> return -ENOMEM;
>
> get_online_cpus();
>
> - for_each_online_cpu(cpu) {
> + for_each_cpu_and(cpu, mask, cpu_online_mask) {
> struct work_struct *work = per_cpu_ptr(works, cpu);
>
> INIT_WORK(work, func);
> schedule_work_on(cpu, work);
> }
>
> - for_each_online_cpu(cpu)
> + for_each_cpu_and(cpu, mask, cpu_online_mask)
> flush_work(per_cpu_ptr(works, cpu));
>


Given that cpu hotplug is not a frequent operation, I think mask will be
a subset of cpu_online_mask most of the time (also, one example is from
schedule_on_each_cpu_cond() introduced in 3/6, which is already under
get/put_online_cpus()). So can we optimize something (the 'and' operations
perhaps) based on that?

Maybe something like:
if (likely(cpumask_subset(mask, cpu_online_mask)))

> put_online_cpus();

> free_percpu(works);
> +
> return 0;
> }
>


Regards,
Srivatsa S. Bhat

2012-05-04 05:10:57

by Srivatsa S. Bhat

Subject: Re: [PATCH v1 3/6] workqueue: introduce schedule_on_each_cpu_cond

On 05/03/2012 08:25 PM, Gilad Ben-Yossef wrote:

> Introduce the schedule_on_each_cpu_cond() function that schedules
> a work item on each online CPU for which the supplied condition
> function returns true.
>
> This function should be used instead of schedule_on_each_cpu()
> when only some of the CPUs have actual work to do and a predicate
> function can tell if a certain CPU does or does not have work to do,
> thus saving unneeded wakeups and schedules.
>
> Signed-off-by: Gilad Ben-Yossef <[email protected]>
> ---


> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 1c9782b..3322d30 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -2828,6 +2828,43 @@ int schedule_on_each_cpu_mask(work_func_t func, const struct cpumask *mask)
> }
>
> /**
> + * schedule_on_each_cpu_cond - execute a function synchronously on each
> + * online CPU for which the supplied condition function returns true
> + * @func: the function to run on the selected CPUs
> + * @cond_func: the function to call to select the CPUs
> + *
> + * schedule_on_each_cpu_cond() executes @func on each online CPU for
> + * @cond_func returns true using the system workqueue and blocks until

^^^
(for) which

Regards,
Srivatsa S. Bhat

> + * all CPUs have completed.
> + * schedule_on_each_cpu_cond() is very slow.
> + *
> + * RETURNS:
> + * 0 on success, -errno on failure.
> + */
> +int schedule_on_each_cpu_cond(work_func_t func, bool (*cond_func)(int cpu))
> +{
> + int cpu, ret;
> + cpumask_var_t mask;
> +
> + if (unlikely(!zalloc_cpumask_var(&mask, GFP_KERNEL)))
> + return -ENOMEM;
> +
> + get_online_cpus();
> +
> + for_each_online_cpu(cpu)
> + if (cond_func(cpu))
> + cpumask_set_cpu(cpu, mask);
> +
> + ret = schedule_on_each_cpu_mask(func, mask);
> +
> + put_online_cpus();
> +
> + free_cpumask_var(mask);
> +
> + return ret;

2012-05-04 12:05:09

by Frederic Weisbecker

Subject: Re: [PATCH v1 1/6] timer: make __next_timer_interrupt explicit about no future event

On Thu, May 03, 2012 at 05:55:57PM +0300, Gilad Ben-Yossef wrote:
> Current timer code fails to correctly return a value meaning
> that there is no future timer event, with the result that
> the timer keeps getting re-armed in HZ one shot mode even
> when we could turn it off, generating unneeded interrupts.
> This patch attempts to fix this problem.
>
> What is happening is that when __next_timer_interrupt() wishes
> to return a value that signifies "there is no future timer
> event", it returns (base->timer_jiffies + NEXT_TIMER_MAX_DELTA).
>
> However, the code in tick_nohz_stop_sched_tick(), which called
> __next_timer_interrupt() via get_next_timer_interrupt(),
> compares the return value to (last_jiffies + NEXT_TIMER_MAX_DELTA)
> to see if the timer needs to be re-armed.
>
> base->timer_jiffies != last_jiffies and so
> tick_nohz_stop_sched_tick() interperts the return value as
> indication that there is a distant future event 12 days
> from now and programs the timer to fire next after KIME_MAX
> nsecs instead of avoiding to arm it. This ends up causesing
> a needless interrupt once every KTIME_MAX nsecs.

Good catch! So if I understand correctly, base->timer_jiffies can
lag behind last_jiffies. If we return
base->timer_jiffies + NEXT_TIMER_MAX_DELTA, the next_jiffies - last_jiffies
diff gives a delta that is a bit below NEXT_TIMER_MAX_DELTA.

And this can indeed happen: if we haven't had any timer list executed since
we last updated jiffies, timer_jiffies can lag behind last_jiffies.

This is harmless but causes needless timers.

I just have small comment below:

>
> I've noticed a similar but slightly different fix to the
> same problem in the Tilera kernel tree from Chris M. (I've
> wrote this before seeing that one), so some variation of this
> fix is in use on real hardware for some time now.
>
> Signed-off-by: Gilad Ben-Yossef <[email protected]>
> CC: Thomas Gleixner <[email protected]>
> CC: Tejun Heo <[email protected]>
> CC: John Stultz <[email protected]>
> CC: Andrew Morton <[email protected]>
> CC: KOSAKI Motohiro <[email protected]>
> CC: Mel Gorman <[email protected]>
> CC: Mike Frysinger <[email protected]>
> CC: David Rientjes <[email protected]>
> CC: Hugh Dickins <[email protected]>
> CC: Minchan Kim <[email protected]>
> CC: Konstantin Khlebnikov <[email protected]>
> CC: Christoph Lameter <[email protected]>
> CC: Chris Metcalf <[email protected]>
> CC: Hakan Akkan <[email protected]>
> CC: Max Krasnyansky <[email protected]>
> CC: Frederic Weisbecker <[email protected]>
> CC: [email protected]
> CC: [email protected]
> ---
> kernel/timer.c | 31 +++++++++++++++++++++----------
> 1 files changed, 21 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/timer.c b/kernel/timer.c
> index a297ffc..32ba64a 100644
> --- a/kernel/timer.c
> +++ b/kernel/timer.c
> @@ -1187,11 +1187,13 @@ static inline void __run_timers(struct tvec_base *base)
> * is used on S/390 to stop all activity when a CPU is idle.
> * This function needs to be called with interrupts disabled.
> */
> -static unsigned long __next_timer_interrupt(struct tvec_base *base)
> +static bool __next_timer_interrupt(struct tvec_base *base,
> + unsigned long *next_timer)
> {
> unsigned long timer_jiffies = base->timer_jiffies;
> unsigned long expires = timer_jiffies + NEXT_TIMER_MAX_DELTA;
> - int index, slot, array, found = 0;
> + int index, slot, array;
> + bool found = false;
> struct timer_list *nte;
> struct tvec *varray[4];
>
> @@ -1202,12 +1204,12 @@ static unsigned long __next_timer_interrupt(struct tvec_base *base)
> if (tbase_get_deferrable(nte->base))
> continue;
>
> - found = 1;
> + found = true;
> expires = nte->expires;
> /* Look at the cascade bucket(s)? */
> if (!index || slot < index)
> goto cascade;
> - return expires;
> + goto out;
> }
> slot = (slot + 1) & TVR_MASK;
> } while (slot != index);
> @@ -1233,7 +1235,7 @@ cascade:
> if (tbase_get_deferrable(nte->base))
> continue;
>
> - found = 1;
> + found = true;
> if (time_before(nte->expires, expires))
> expires = nte->expires;
> }
> @@ -1245,7 +1247,7 @@ cascade:
> /* Look at the cascade bucket(s)? */
> if (!index || slot < index)
> break;
> - return expires;
> + goto out;
> }
> slot = (slot + 1) & TVN_MASK;
> } while (slot != index);
> @@ -1254,7 +1256,10 @@ cascade:
> timer_jiffies += TVN_SIZE - index;
> timer_jiffies >>= TVN_BITS;
> }
> - return expires;
> +out:
> + if (found)
> + *next_timer = expires;
> + return found;
> }
>
> /*
> @@ -1317,9 +1322,15 @@ unsigned long get_next_timer_interrupt(unsigned long now)
> if (cpu_is_offline(smp_processor_id()))
> return now + NEXT_TIMER_MAX_DELTA;
> spin_lock(&base->lock);
> - if (time_before_eq(base->next_timer, base->timer_jiffies))
> - base->next_timer = __next_timer_interrupt(base);
> - expires = base->next_timer;
> + if (time_before_eq(base->next_timer, base->timer_jiffies)) {
> +
> + if (__next_timer_interrupt(base, &expires))
> + base->next_timer = expires;
> + else
> + expires = now + NEXT_TIMER_MAX_DELTA;

I believe you can update base->next_timer to now + NEXT_TIMER_MAX_DELTA,
so on any further idle interrupt exit that call tick_nohz_stop_sched_tick(),
we won't get again the overhead of __next_timer_interrupt().

2012-05-04 12:20:19

by Frederic Weisbecker

Subject: Re: [PATCH v1 1/6] timer: make __next_timer_interrupt explicit about no future event

2012/5/4 Frederic Weisbecker <[email protected]>:
> On Thu, May 03, 2012 at 05:55:57PM +0300, Gilad Ben-Yossef wrote:
>> @@ -1317,9 +1322,15 @@ unsigned long get_next_timer_interrupt(unsigned long now)
>>       if (cpu_is_offline(smp_processor_id()))
>>               return now + NEXT_TIMER_MAX_DELTA;
>>       spin_lock(&base->lock);
>> -     if (time_before_eq(base->next_timer, base->timer_jiffies))
>> -             base->next_timer = __next_timer_interrupt(base);
>> -     expires = base->next_timer;
>> +     if (time_before_eq(base->next_timer, base->timer_jiffies)) {
>> +
>> +             if (__next_timer_interrupt(base, &expires))
>> +                     base->next_timer = expires;
>> +             else
>> +                     expires = now + NEXT_TIMER_MAX_DELTA;
>
> I believe you can update base->next_timer to now + NEXT_TIMER_MAX_DELTA,
> so on any further idle interrupt exit that call tick_nohz_stop_sched_tick(),
> we won't get again the overhead of __next_timer_interrupt().

Ah forget that, I was confused. If we do that we actually get the useless timer
at now + NEXT_TIMER_MAX_DELTA.

So I think the patch is fine.

Acked-by: Frederic Weisbecker <[email protected]>

2012-05-06 13:15:33

by Gilad Ben-Yossef

Subject: Re: [PATCH v1 3/6] workqueue: introduce schedule_on_each_cpu_cond

On Thu, May 3, 2012 at 6:39 PM, Tejun Heo <[email protected]> wrote:
> Hello,
>
> On Thu, May 03, 2012 at 05:55:59PM +0300, Gilad Ben-Yossef wrote:
>> Introduce the schedule_on_each_cpu_cond() function that schedules
>> a work item on each online CPU for which the supplied condition
>> function returns true.
>>
>> This function should be used instead of schedule_on_each_cpu()
>> when only some of the CPUs have actual work to do and a predicate
>> function can tell if a certain CPU does or does not have work to do,
>> thus saving unneeded wakeups and schedules.
>>
>>  /**
>> + * schedule_on_each_cpu_cond - execute a function synchronously on each
>> + * online CPU for which the supplied condition function returns true
>> + * @func: the function to run on the selected CPUs
>> + * @cond_func: the function to call to select the CPUs
>> + *
>> + * schedule_on_each_cpu_cond() executes @func on each online CPU for
>> + * @cond_func returns true using the system workqueue and blocks until
>> + * all CPUs have completed.
>> + * schedule_on_each_cpu_cond() is very slow.
>> + *
>> + * RETURNS:
>> + * 0 on success, -errno on failure.
>> + */
>> +int schedule_on_each_cpu_cond(work_func_t func, bool (*cond_func)(int cpu))
>> +{
>> +     int cpu, ret;
>> +     cpumask_var_t mask;
>> +
>> +     if (unlikely(!zalloc_cpumask_var(&mask, GFP_KERNEL)))
>> +             return -ENOMEM;
>> +
>> +     get_online_cpus();
>> +
>> +     for_each_online_cpu(cpu)
>> +             if (cond_func(cpu))
>> +                     cpumask_set_cpu(cpu, mask);
>> +
>> +     ret = schedule_on_each_cpu_mask(func, mask);
>> +
>> +     put_online_cpus();
>> +
>> +     free_cpumask_var(mask);
>> +
>> +     return ret;
>> +}
>
> I'm usually not a big fan of callback based interface. They tend to
> be quite clunky to use.

My first academic computer science course used SICP as the text book.
I've managed to kick most of the bad habits I gained in that course
over the years, but two: a taste for higher order functions and a
fetish for parentheses. I can't quite tell which of the two is a
bigger barrier to my re-integration with normal society... ;-)

> e.g. in this case, wouldn't it be better to
> have helper functions which allocate cpumask and disables cpu hotplug
> and undo that afterwards? That is, if such convenience helpers are
> necessary at all.

If we'll always have only a single call site you are certainly right.
My thought was that I'm not only solving the specific problem here,
but trying to help the next person doing something similar do the
right thing.

A single helper function called schedule_on_each_cpu_cond() is very
obvious to find
to someone reading the source or documentation. On the other hand
figuring out that
the helper functions that handle cpu hotplug and cpumask allocation
are there for that
purpose is a bit more involved.

That was my thinking at least.

> Also, callback which doesn't have a private data
> argument tends to be PITA.
>

You are 100% right. The callback should have a private data parameter.

The way I see it, I can either obliterate schedule_on_each_cpu_cond()
and put its code in place in the LRU path, or fix the callback to get
an extra private data parameter.
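
For the second option, the signature could become something like this
(just a sketch of the idea, not code I've posted):

  int schedule_on_each_cpu_cond(work_func_t func,
                                bool (*cond_func)(int cpu, void *data),
                                void *data);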

It's your call - what would you have me do?

Thanks,
Gilad



> Thanks.
>
> --
> tejun



--
Gilad Ben-Yossef
Chief Coffee Drinker
[email protected]
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru

2012-05-06 13:16:39

by Gilad Ben-Yossef

Subject: Re: [PATCH v1 3/6] workqueue: introduce schedule_on_each_cpu_cond

On Fri, May 4, 2012 at 7:51 AM, Srivatsa S. Bhat
<[email protected]> wrote:
> On 05/03/2012 08:25 PM, Gilad Ben-Yossef wrote:
>

>>  /**
>> + * schedule_on_each_cpu_cond - execute a function synchronously on each
>> + * online CPU for which the supplied condition function returns true
>> + * @func: the function to run on the selected CPUs
>> + * @cond_func: the function to call to select the CPUs
>> + *
>> + * schedule_on_each_cpu_cond() executes @func on each online CPU for
>> + * @cond_func returns true using the system workqueue and blocks until
>
>    ^^^
> (for) which

Thanks!

I'll fix this and the other typo in the next iteration.

Gilad

--
Gilad Ben-Yossef
Chief Coffee Drinker
[email protected]
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru

by Christoph Lameter

Subject: Re: [PATCH v1 5/6] mm: make vmstat_update periodic run conditional

On Thu, 3 May 2012, Gilad Ben-Yossef wrote:

> vmstat_update runs every second from the work queue to update statistics
> and drain per cpu pages back into the global page allocator.

Looks good.

- vmstat_off_cpus is a bit strange. Could we have a cpumask that has a bit
set if vmstat is active? Rename to "vmstat_cpus"?

- Start out with vmstat_cpus cleared? Cpus only need vmstat if they do
something and if a cpu is idle on boot then it will not need vmstat
enabled until the cpu does something useful.

> @@ -1204,8 +1265,14 @@ static int __init setup_vmstat(void)
>
> register_cpu_notifier(&vmstat_notifier);
>
> + INIT_DELAYED_WORK_DEFERRABLE(&vmstat_monitor_work,
> + vmstat_update_monitor);
> + queue_delayed_work(system_unbound_wq,
> + &vmstat_monitor_work,
> + round_jiffies_relative(HZ));
> +
> for_each_online_cpu(cpu)
> - start_cpu_timer(cpu);
> + setup_cpu_timer(cpu);
> #endif
> #ifdef CONFIG_PROC_FS
> proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);

So the monitoring thread just bounces around the system? Hope that the
scheduler does the right thing to keep it on processors that do some other
work.

2012-05-07 17:17:32

by Tejun Heo

Subject: Re: [PATCH v1 3/6] workqueue: introduce schedule_on_each_cpu_cond

Hello,

On Sun, May 06, 2012 at 04:15:30PM +0300, Gilad Ben-Yossef wrote:
> A single helper function called schedule_on_each_cpu_cond() is very
> obvious to find to someone reading the source or documentation. On
> the other hand figuring out that the helper functions that handle
> cpu hotplug and cpumask allocation are there for that purpose is a
> bit more involved.
>
> That was my thinking at least.

Yeah, having a common mechanism is nice, but I just prefer iterators /
helpers which can be embedded in the caller over an interface which takes
a callback, unless the execution context is actually asynchronous to the
caller. We don't use nested functions / scopes in the kernel, which makes
those callbacks (higher order functions, lambdas, gammas, zetas,
whatever) painful to use and follow.

> The way i see it, I can either obliterate on_each_cpu_cond() and out
> its code in place in the LRU path, or fix the callback to get an
> extra private data parameter -

Unless we can code up something pretty, I vote for just open coding it
for now. If we grow more usages like this, maybe we'll be able to see
the pattern better and come up with better abstraction.

Thank you.

--
tejun

2012-05-07 19:34:04

by KOSAKI Motohiro

Subject: Re: [PATCH v1 5/6] mm: make vmstat_update periodic run conditional

>> @@ -1204,8 +1265,14 @@ static int __init setup_vmstat(void)
>>
>> register_cpu_notifier(&vmstat_notifier);
>>
>> + INIT_DELAYED_WORK_DEFERRABLE(&vmstat_monitor_work,
>> + vmstat_update_monitor);
>> + queue_delayed_work(system_unbound_wq,
>> + &vmstat_monitor_work,
>> + round_jiffies_relative(HZ));
>> +
>> for_each_online_cpu(cpu)
>> - start_cpu_timer(cpu);
>> + setup_cpu_timer(cpu);
>> #endif
>> #ifdef CONFIG_PROC_FS
>> proc_create("buddyinfo", S_IRUGO, NULL,&fragmentation_file_operations);
>
> So the monitoring thread just bounces around the system? Hope that the
> scheduler does the right thing to keep it on processors that do some other
> work.

Good point. Usually, all cpus have update items and the monitor worker only
makes new noise. I think this feature is only useful in some HPC cases. So I
wonder if this vmstat improvement can be integrated with Frederic's Nohz
cpusets activity, i.e. vmstat_update could be integrated with the timer
housekeeping and automatically stop when HZ housekeeping stops.

by Christoph Lameter

Subject: Re: [PATCH v1 5/6] mm: make vmstat_update periodic run conditional

On Mon, 7 May 2012, KOSAKI Motohiro wrote:

> > > @@ -1204,8 +1265,14 @@ static int __init setup_vmstat(void)
> > >
> > > register_cpu_notifier(&vmstat_notifier);
> > >
> > > + INIT_DELAYED_WORK_DEFERRABLE(&vmstat_monitor_work,
> > > + vmstat_update_monitor);
> > > + queue_delayed_work(system_unbound_wq,
> > > + &vmstat_monitor_work,
> > > + round_jiffies_relative(HZ));
> > > +
> > > for_each_online_cpu(cpu)
> > > - start_cpu_timer(cpu);
> > > + setup_cpu_timer(cpu);
> > > #endif
> > > #ifdef CONFIG_PROC_FS
> > > proc_create("buddyinfo", S_IRUGO,
> > > NULL,&fragmentation_file_operations);
> >
> > So the monitoring thread just bounces around the system? Hope that the
> > scheduler does the right thing to keep it on processors that do some other
> > work.
>
> Good point. Usually, all cpus have update items and monitor worker only makes
> new noise. I think this feature is only useful some hpc case. So I wonder if
> this vmstat improvemnt can integrate Frederic's Nohz cpusets activity. I.e.
> vmstat-update integrate timer house keeping and automatically stop when
> stopping
> hz house keeping.

Right. We could do the same processing in vmstat update and the
thread could check if it is the last vmstat update thread. If so simply
continue and do not terminate.

But this would still mean that the vmstat update thread would run on an
arbitrary cpu. If I have a sacrificial lamb processor for OS processing
then I would expect the vmstat update thread to stick to that processor
and avoid running on the other processors that I would like to be as free
from OS noise as possible.

2012-05-08 15:18:47

by Gilad Ben-Yossef

Subject: Re: [PATCH v1 5/6] mm: make vmstat_update periodic run conditional

On Mon, May 7, 2012 at 6:29 PM, Christoph Lameter <[email protected]> wrote:
> On Thu, 3 May 2012, Gilad Ben-Yossef wrote:
>
>> vmstat_update runs every second from the work queue to update statistics
>> and drain per cpu pages back into the global page allocator.
>
> Looks good.

Thanks :-)

>
> - vmstat_off_cpus is a bit strange. Could we have a cpumask that has a bit
> set if vmstat is active? Rename to "vmstat_cpus"?

Sure.

> - Start out with vmstat_cpus cleared? Cpus only need vmstat if they do
> something and if a cpu is idle on boot then it will not need vmstat
> enabled until the cpu does something useful.

Ah cool. I haven't thought of that.

>
>> @@ -1204,8 +1265,14 @@ static int __init setup_vmstat(void)
>>
>>       register_cpu_notifier(&vmstat_notifier);
>>
>> +     INIT_DELAYED_WORK_DEFERRABLE(&vmstat_monitor_work,
>> +                             vmstat_update_monitor);
>> +     queue_delayed_work(system_unbound_wq,
>> +                             &vmstat_monitor_work,
>> +                             round_jiffies_relative(HZ));
>> +
>>       for_each_online_cpu(cpu)
>> -             start_cpu_timer(cpu);
>> +             setup_cpu_timer(cpu);
>>  #endif
>>  #ifdef CONFIG_PROC_FS
>>       proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);
>
> So the monitoring thread just bounces around the system? Hope that the
> scheduler does the right thing to keep it on processors that do some other
> work.

My line of thought was that if we explicitly choose a scapegoat cpu,
we and the user need to manage this - such as worrying about what
happens if the scapegoat is offlined, and letting the user explicitly
designate the scapegoat cpu, thus creating another knob, and worrying
about what happens if the user designates such a cpu but then it goes
offline...

I figured the user needs to worry about other unbound work items
anyway if he cares about where such things run in the general case,
for example by using isolcpus.

The same should be doable with cpusets, except that right now we mark
unbound workqueue worker threads as pinned even though they aren't.
If I understood the discussion, the idea is exactly to stop users from
putting these threads in non-root cpusets. I am not 100% sure why...

Does that make sense?

Thanks!
Gilad



--
Gilad Ben-Yossef
Chief Coffee Drinker
[email protected]
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru

2012-05-08 15:22:55

by Gilad Ben-Yossef

Subject: Re: [PATCH v1 5/6] mm: make vmstat_update periodic run conditional

On Mon, May 7, 2012 at 10:33 PM, KOSAKI Motohiro
<[email protected]> wrote:
>>> @@ -1204,8 +1265,14 @@ static int __init setup_vmstat(void)
>>>
>>>        register_cpu_notifier(&vmstat_notifier);
>>>
>>> +       INIT_DELAYED_WORK_DEFERRABLE(&vmstat_monitor_work,
>>> +                               vmstat_update_monitor);
>>> +       queue_delayed_work(system_unbound_wq,
>>> +                               &vmstat_monitor_work,
>>> +                               round_jiffies_relative(HZ));
>>> +
>>>        for_each_online_cpu(cpu)
>>> -               start_cpu_timer(cpu);
>>> +               setup_cpu_timer(cpu);
>>>  #endif
>>>  #ifdef CONFIG_PROC_FS
>>>        proc_create("buddyinfo", S_IRUGO,
>>> NULL,&fragmentation_file_operations);
>>
>>
>> So the monitoring thread just bounces around the system? Hope that the
>> scheduler does the right thing to keep it on processors that do some other
>> work.
>
>
> Good point. Usually, all cpus have update items and monitor worker only
> makes
> new noise. I think this feature is only useful some hpc case. So I wonder
> if
> this vmstat improvemnt can integrate Frederic's Nohz cpusets activity. I.e.
> vmstat-update integrate timer house keeping and automatically stop when
> stopping
> hz house keeping.

I wrote this and the previous IPI patch set explicitly to use with
Frederic's Nohz stuff
for CPU isolation. It just seemed at the time to be wrong to tie them
together - I mean
people that do CPU isolation can enjoy this even if they don't want to
kill the tick (that
comes with its own overhead for doing system calls, for example).

Thanks!
Gilad


--
Gilad Ben-Yossef
Chief Coffee Drinker
[email protected]
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru

by Christoph Lameter

Subject: Re: [PATCH v1 5/6] mm: make vmstat_update periodic run conditional

On Tue, 8 May 2012, Gilad Ben-Yossef wrote:

> My line of thought was that if we explicitly choose a scapegoat cpu we
> and the user need to manage this - such as worry about what happens if
> the scapegoats is offlines and let the user explicitly designate the
> scapegoat cpu thus creating another knob, and worrying about what
> happens if the user designate such a cpu but then it goes offlines...

The scapegoat can be chosen on boot. One can f.e. create a file in
/sys/devices/system/cpu called "scapegoat" which contains the number of the
processor chosen. Then one can even write a userspace daemon to automate
moving the processing elsewhere. Could be integrated into something
horrible like irqbalance f.e.

> I figured the user needs to worry about other unbounded work items
> anyway if he cares about where such things are run in the general case,
> but using isolcpus for example.

True. So the scapegoat heuristic could be to pick the first
unisolated cpu.

> The same should be doable with cpusets, except that right now we mark
> unbounded workqueue worker threads as pinned even though they aren't. If
> I understood the discussion, the idea is exactly to stop users from
> putting these threads in non root cpusets. I am not 100% sure why..

Not sure that cpusets is a good thing to bring in here because that is an
optional feature of the kernel and tying basic functionality like this
to cpuset support does not sound right to me.

2012-05-08 15:25:12

by Gilad Ben-Yossef

Subject: Re: [PATCH v1 5/6] mm: make vmstat_update periodic run conditional

On Mon, May 7, 2012 at 10:40 PM, Christoph Lameter <[email protected]> wrote:
> On Mon, 7 May 2012, KOSAKI Motohiro wrote:
>
>> > > @@ -1204,8 +1265,14 @@ static int __init setup_vmstat(void)
>> > >
>> > >           register_cpu_notifier(&vmstat_notifier);
>> > >
>> > > + INIT_DELAYED_WORK_DEFERRABLE(&vmstat_monitor_work,
>> > > +                         vmstat_update_monitor);
>> > > + queue_delayed_work(system_unbound_wq,
>> > > +                         &vmstat_monitor_work,
>> > > +                         round_jiffies_relative(HZ));
>> > > +
>> > >           for_each_online_cpu(cpu)
>> > > -         start_cpu_timer(cpu);
>> > > +         setup_cpu_timer(cpu);
>> > >   #endif
>> > >   #ifdef CONFIG_PROC_FS
>> > >           proc_create("buddyinfo", S_IRUGO,
>> > > NULL,&fragmentation_file_operations);
>> >
>> > So the monitoring thread just bounces around the system? Hope that the
>> > scheduler does the right thing to keep it on processors that do some other
>> > work.
>>
>> Good point. Usually all cpus have update items, and the monitor worker only
>> makes new noise. I think this feature is only useful in some HPC cases. So I
>> wonder if this vmstat improvement can integrate with Frederic's Nohz cpusets
>> activity, i.e. vmstat-update integrates with the timer housekeeping and
>> automatically stops when the hz housekeeping stops.
>
> Right. We could do the same processing in vmstat update and the
> thread could check if it is the last vmstat update thread. If so simply
> continue and do not terminate.
>
> But this would still mean that the vmstat update thread would run on an
> arbitrary cpu. If I have a sacrificial lamb processor for OS processing
> then I would expect the vmstat update thread to stick to that processor
> and avoid running on the other processors that I would like to be as free
> from OS noise as possible.
>

OK, what about -

- We pick a scapegoat cpu (the first to come up gets the job).
- We add a knob to let the user designate another cpu for the job.
- If the scapegoat cpu goes offline, the cpu processing the offlining
  becomes the new scapegoat.

Does this make better sense?
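
For concreteness, here is a rough sketch of the last two points. It is
illustration only - the vmstat_scapegoat variable and the stand-alone
notifier are made up, and in practice this would fold into the cpu notifier
vmstat already has:

static int vmstat_scapegoat = -1;	/* hypothetical: cpu running the monitor work */

static int __cpuinit scapegoat_cpu_callback(struct notifier_block *nfb,
					    unsigned long action, void *hcpu)
{
	long cpu = (long)hcpu;

	switch (action) {
	case CPU_ONLINE:
		/* First cpu to come up gets the job. */
		if (vmstat_scapegoat < 0)
			vmstat_scapegoat = cpu;
		break;
	case CPU_DEAD:
		/* If the scapegoat goes away, the cpu processing the
		 * offlining inherits the job. */
		if (vmstat_scapegoat == cpu)
			vmstat_scapegoat = raw_smp_processor_id();
		break;
	}
	return NOTIFY_OK;
}

The sysfs knob would then simply write a validated cpu number into
vmstat_scapegoat.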

Gilad


--
Gilad Ben-Yossef
Chief Coffee Drinker
[email protected]
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru

by Christoph Lameter

Subject: Re: [PATCH v1 5/6] mm: make vmstat_update periodic run conditional

On Tue, 8 May 2012, Gilad Ben-Yossef wrote:

> > But this would still mean that the vmstat update thread would run on an
> > arbitrary cpu. If I have a sacrificial lamb processor for OS processing
> > then I would expect the vmstat update thread to stick to that processor
> > and avoid running on the other processors that I would like to be as free
> > from OS noise as possible.
> >
>
> OK, what about -
>
> - We pick a scapegoat cpu (the first to come up gets the job).
> - We add a knob to let the user designate another cpu for the job.
> - If the scapegoat cpu goes offline, the cpu processing the offlining
>   becomes the new scapegoat.
>
> Does this make better sense?

Sounds good. The first that comes up. If the cpu is isolated then the
first non isolated cpu is picked.

2012-05-09 14:26:14

by Gilad Ben-Yossef

[permalink] [raw]
Subject: Re: [PATCH v1 5/6] mm: make vmstat_update periodic run conditional

On Tue, May 8, 2012 at 6:34 PM, Christoph Lameter <[email protected]> wrote:
> On Tue, 8 May 2012, Gilad Ben-Yossef wrote:
>
>> > But this would still mean that the vmstat update thread would run on an
>> > arbitrary cpu. If I have a sacrificial lamb processor for OS processing
>> > then I would expect the vmstat update thread to stick to that processor
>> > and avoid running on the other processors that I would like to be as free
>> > from OS noise as possible.
>> >
>>
>> OK, what about -
>>
>> - We pick a scapegoat cpu (the first to come up gets the job).
>> - We add a knob to let the user designate another cpu for the job.
>> - If the scapegoat cpu goes offline, the cpu processing the offlining
>>   becomes the new scapegoat.
>>
>> Does this make better sense?
>
> Sounds good. The first that comes up. If the cpu is isolated then the
> first non isolated cpu is picked.
>

OK, will do.

Thanks,
Gilad


--
Gilad Ben-Yossef
Chief Coffee Drinker
[email protected]
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru

2012-05-09 14:26:50

by Gilad Ben-Yossef

[permalink] [raw]
Subject: Re: [PATCH v1 3/6] workqueue: introduce schedule_on_each_cpu_cond

On Mon, May 7, 2012 at 8:17 PM, Tejun Heo <[email protected]> wrote:
> Hello,
>
> On Sun, May 06, 2012 at 04:15:30PM +0300, Gilad Ben-Yossef wrote:
>> A single helper function called schedule_on_each_cpu_cond() is very
>> obvious to find to someone reading the source or documentation. On
>> the other hand figuring out that the helper functions that handle
>> cpu hotplug and cpumask allocation are there for that purpose is a
>> bit more involved.
>>
>> That was my thinking at least.
>
> Yeah, having a common mechanism is nice, but I just prefer iterators /
> helpers which can be embedded in the caller over an interface which takes
> a callback, unless the execution context is actually asynchronous to the
> caller. We don't use nested functions / scopes in the kernel, which makes
> those callbacks (higher order functions, lambdas, gammas, zetas,
> whatever) painful to use and follow.
>
>> The way I see it, I can either obliterate on_each_cpu_cond() and put
>> its code in place in the LRU path, or fix the callback to get an
>> extra private data parameter -
>
> Unless we can code up something pretty, I vote for just open coding it
> for now. If we grow more usages like this, maybe we'll be able to see
> the pattern better and come up with a better abstraction.
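
For reference, a rough sketch of what the open coded version in mm/swap.c
might look like, built on the schedule_on_each_cpu_mask() helper from patch
2/6 (whose exact signature is assumed here). Only the rotate pagevec is
checked to keep it short, so this is an illustration, not the actual patch:

int lru_add_drain_all(void)
{
	cpumask_var_t mask;
	int cpu;

	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
		return -ENOMEM;
	cpumask_clear(mask);

	get_online_cpus();
	for_each_online_cpu(cpu) {
		/* The real check would look at all the per cpu pagevecs;
		 * only lru_rotate_pvecs is shown here. */
		if (pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)))
			cpumask_set_cpu(cpu, mask);
	}
	/* Helper added in patch 2/6: run the work on the marked cpus only. */
	schedule_on_each_cpu_mask(lru_add_drain_per_cpu, mask);
	put_online_cpus();

	free_cpumask_var(mask);
	return 0;
}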

Got you. Will do.

Thanks.
Gilad


> --
> tejun



--
Gilad Ben-Yossef
Chief Coffee Drinker
[email protected]
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru

2012-05-25 20:48:35

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v1 1/6] timer: make __next_timer_interrupt explicit about no future event

On Thu, 3 May 2012, Gilad Ben-Yossef wrote:
> What is happening is that when __next_timer_interrupt() wishes
> to return a value that signifies "there is no future timer
> event", it returns (base->timer_jiffies + NEXT_TIMER_MAX_DELTA).
>
> However, the code in tick_nohz_stop_sched_tick(), which called
> __next_timer_interrupt() via get_next_timer_interrupt(),
> compares the return value to (last_jiffies + NEXT_TIMER_MAX_DELTA)
> to see if the timer needs to be re-armed.

Yeah, that's nonsense.

> I've noticed a similar but slightly different fix to the
> same problem in the Tilera kernel tree from Chris M. (I wrote
> this before seeing that one), so some variation of this
> fix has been in use on real hardware for some time now.

Sigh, why can't people post their fixes instead of burying them in
their private trees?

> -static unsigned long __next_timer_interrupt(struct tvec_base *base)
> +static bool __next_timer_interrupt(struct tvec_base *base,
> +				   unsigned long *next_timer)

....

> +out:
> +	if (found)
> +		*next_timer = expires;
> +	return found;

I'd really like to avoid that churn. That function is ugly enough
already. No need to make it even more so.

> @@ -1317,9 +1322,15 @@ unsigned long get_next_timer_interrupt(unsigned long now)
>  	if (cpu_is_offline(smp_processor_id()))
>  		return now + NEXT_TIMER_MAX_DELTA;
>  	spin_lock(&base->lock);
> -	if (time_before_eq(base->next_timer, base->timer_jiffies))
> -		base->next_timer = __next_timer_interrupt(base);
> -	expires = base->next_timer;
> +	if (time_before_eq(base->next_timer, base->timer_jiffies)) {
> +
> +		if (__next_timer_interrupt(base, &expires))
> +			base->next_timer = expires;
> +		else
> +			expires = now + NEXT_TIMER_MAX_DELTA;

Here you don't update base->next_timer, which means we run through the
scan function on every call. Not good.

> +	} else
> +		expires = base->next_timer;
> +

If the thing is empty or just contains deferrable timers then we
really want to avoid running through the whole cascade horror for
nothing.

Timer add and remove are protected by base->lock. So we should simply
count the non deferrable enqueued timers and skip the whole
__next_timer_interrupt() scan when there is nothing that could wake
us up.
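
Roughly along these lines - a sketch of the direction only, keeping
__next_timer_interrupt() as it is. The active_timers field and where exactly
it gets incremented and decremented are assumptions, not the actual repair
set:

/* Sketch only: base->active_timers would be a new tvec_base field,
 * incremented in internal_add_timer() for non deferrable timers and
 * decremented in detach_timer(), both already under base->lock.
 */
unsigned long get_next_timer_interrupt(unsigned long now)
{
	struct tvec_base *base = __this_cpu_read(tvec_bases);
	unsigned long expires = now + NEXT_TIMER_MAX_DELTA;

	if (cpu_is_offline(smp_processor_id()))
		return expires;

	spin_lock(&base->lock);
	if (base->active_timers) {
		/* Only scan the wheel when a real timer is pending. */
		if (time_before_eq(base->next_timer, base->timer_jiffies))
			base->next_timer = __next_timer_interrupt(base);
		expires = base->next_timer;
	}
	spin_unlock(&base->lock);

	if (time_before_eq(expires, now))
		return now;

	return cmp_next_hrtimer_event(now, expires);
}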

I had a deeper look at that and will send out a repair set soon.

Thanks,

tglx

2012-05-25 20:56:19

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH v1 1/6] timer: make __next_timer_interrupt explicit about no future event

On 5/25/2012 4:48 PM, Thomas Gleixner wrote:
>> I've noticed a similar but slightly different fix to the
>> same problem in the Tilera kernel tree from Chris M. (I wrote
>> this before seeing that one), so some variation of this
>> fix has been in use on real hardware for some time now.
> Sigh, why can't people post their fixes instead of burying them in
> their private trees?

The tree was never really ready for review. I pushed the tree just for
reference to the nohz cpusets work, and so that I have something I can
refer people to when I start participating more actively in that discussion.

It didn't seem useful to post a single patch by itself without more
motivating examples behind it (i.e. without the entirety of the tree).

--
Chris Metcalf, Tilera Corp.
http://www.tilera.com

2012-05-25 21:04:33

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v1 1/6] timer: make __next_timer_interrupt explicit about no future event

On Fri, 25 May 2012, Chris Metcalf wrote:

> On 5/25/2012 4:48 PM, Thomas Gleixner wrote:
> >> I've noticed a similar but slightly different fix to the
> >> same problem in the Tilera kernel tree from Chris M. (I wrote
> >> this before seeing that one), so some variation of this
> >> fix has been in use on real hardware for some time now.
> > Sigh, why can't people post their fixes instead of burying them in
> > their private trees?
>
> The tree was never really ready for review. I pushed the tree just for
> reference to the nohz cpusets work, and so that I have something I can
> refer people to when I start participating more actively in that discussion.
>
> It didn't seem useful to post a single patch by itself without more
> motivating examples behind it (i.e. without the entirety of the tree).

The code does the Wrong Thing. Independent of nohz cpusets or
whatever.

Aside from that, this is also relevant for power saving stuff.

Thanks,

tglx