2009-03-04 12:13:40

by Arun R Bharadwaj

Subject: [v2 PATCH 0/4] timers: framework for migration between CPU

Hi,


In an SMP system, tasks are scheduled on different CPUs by the
scheduler and interrupts are balanced by the irqbalance daemon, but
timers remain stuck on the CPUs on which they were initialised. Timers
queued by tasks get re-queued on the CPU where the task next runs, but
timers queued from IRQ context, such as those in device drivers, stay
on the CPU on which they were initialised. This framework helps move
all 'movable timers' from one CPU to any other CPU of choice using a
sysfs interface.

Original posting can be found here --> http://lkml.org/lkml/2009/2/20/121

Based on Ingo's suggestion, I have extended the scheduler power-saving
code, which already nominates an idle load balancer CPU, to also
automatically attract all movable timers to that CPU.

Also, I have removed the per-cpu sysfs interface and instead created a
single entry at /sys/devices/system/cpu/enable_timer_migration.
This lets users enable timer migration as a policy, leaving the kernel
to choose the target CPU for the timers and to decide the thresholds
for when to initiate a timer migration and when to stop.

Timers from idle cpus are migrated to the idle-load-balancer cpu.
The idle load balancer is one of the idle cpus that keeps the sched
tick running and does system management tasks and load balancing
on behalf of the other idle cpus. Attracting timers from other idle
cpus reduces wakeups for them, while increasing the probability of
overlap with the sched tick on the idle load balancer cpu.

However, this technique has a drawback: if the idle load balancer cpu
is re-nominated too often based on system behaviour, timers ping-pong
around the system. This issue can be solved by optimising the
selection of the idle load balancer cpu, as described by Gautham in
the following patch: http://lkml.org/lkml/2008/9/23/82.

With Gautham's patch included, so that the idle load balancer is
selected from a semi-idle package, we can experimentally verify
consistent selection of the idle-load-balancer cpu, and timers are
also consolidated onto it.

The following patches are included:
PATCH 1/4 - framework to identify pinned timers.
PATCH 2/4 - identifying the existing pinned hrtimers.
PATCH 3/4 - sysfs hook to enable timer migration.
PATCH 4/4 - logic to enable timer migration.

The patchset is based on the latest tip/master.


The following experiment was carried out to demonstrate the
functionality of the patch.
The machine used is a 2-socket, quad-core machine with HT enabled.

I run `make -j4` pinned to 4 CPUs.
I use a driver which continuously queues timers on a CPU.
With the timers queued, I measure the sleep state residency
for a period of 10s.
Next, I enable timer migration and measure the sleep state
residency again.
The comparison of sleep state residency values is posted below.

The difference in the local timer interrupt (LOC) counts from
/proc/interrupts is also posted below.

Timer migration is enabled with:

echo 1 > /sys/devices/system/cpu/enable_timer_migration

and disabled with:

echo 0 > /sys/devices/system/cpu/enable_timer_migration


$taskset -c 4,5,6,7 make -j4

my_driver queuing timers continuously on CPU 10.

idle load balancer currently on CPU 15


Case1: Without timer migration Case2: With timer migration

-------------------- --------------------
| Core | LOC Count | | Core | LOC Count |
| 4 | 2504 | | 4 | 2503 |
| 5 | 2502 | | 5 | 2503 |
| 6 | 2502 | | 6 | 2502 |
| 7 | 2498 | | 7 | 2500 |
| 10 | 2501 | | 10 | 35 |
| 15 | 2501 | | 15 | 2501 |
-------------------- --------------------

--------------------- --------------------
| Core | Sleep time | | Core | Sleep time |
| 4 | 0.47168 | | 4 | 0.49601 |
| 5 | 0.44301 | | 5 | 0.37153 |
| 6 | 0.38979 | | 6 | 0.51286 |
| 7 | 0.42829 | | 7 | 0.49635 |
| 10 | 9.86652 | | 10 | 10.04216 |
| 15 | 0.43048 | | 15 | 0.49056 |
--------------------- ---------------------

Here, all the timers queued by the driver on CPU10 are moved to CPU15,
which is the idle load balancer.


--arun


2009-03-04 12:16:15

by Arun R Bharadwaj

Subject: [v2 PATCH 1/4] timers: framework to identify pinned timers.

* Arun R Bharadwaj <[email protected]> [2009-03-04 17:42:49]:

This patch creates a new framework for identifying cpu-pinned timers
and hrtimers.


This framework is needed because pinned timers are expected to fire on
the same CPU on which they are queued, so they must be identified and
excluded from migration.


For regular timers, a new flag called TBASE_PINNED_FLAG is created.
Since the last three bits of the tvec_base pointer are guaranteed to
be 0, and since the last bit already indicates deferrable timers, the
second-last bit is used to indicate cpu-pinned regular timers.
The functions that manage TBASE_PINNED_FLAG mirror those which manage
TBASE_DEFERRABLE_FLAG.


For hrtimers, a new interface, hrtimer_start_pinned(), is created,
which can be used to queue a cpu-pinned hrtimer.


Signed-off-by: Arun R Bharadwaj <[email protected]>
---
include/linux/hrtimer.h | 24 ++++++++++++++++++++----
kernel/hrtimer.c | 34 ++++++++++++++++++++++++++++------
kernel/timer.c | 30 +++++++++++++++++++++++++++---
3 files changed, 75 insertions(+), 13 deletions(-)

Index: linux.trees.git/kernel/timer.c
===================================================================
--- linux.trees.git.orig/kernel/timer.c
+++ linux.trees.git/kernel/timer.c
@@ -37,6 +37,7 @@
#include <linux/delay.h>
#include <linux/tick.h>
#include <linux/kallsyms.h>
+#include <linux/timer.h>

#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -87,8 +88,12 @@ static DEFINE_PER_CPU(struct tvec_base *
* the new flag to indicate whether the timer is deferrable
*/
#define TBASE_DEFERRABLE_FLAG (0x1)
+#define TBASE_PINNED_FLAG (0x2)

-/* Functions below help us manage 'deferrable' flag */
+/*
+ * Functions below help us manage
+ * 'deferrable' flag and 'cpu-pinned-timer' flag
+ */
static inline unsigned int tbase_get_deferrable(struct tvec_base *base)
{
return ((unsigned int)(unsigned long)base & TBASE_DEFERRABLE_FLAG);
@@ -96,7 +101,8 @@ static inline unsigned int tbase_get_def

static inline struct tvec_base *tbase_get_base(struct tvec_base *base)
{
- return ((struct tvec_base *)((unsigned long)base & ~TBASE_DEFERRABLE_FLAG));
+ return (struct tvec_base *)((unsigned long)base &
+ ~(TBASE_DEFERRABLE_FLAG | TBASE_PINNED_FLAG));
}

static inline void timer_set_deferrable(struct timer_list *timer)
@@ -105,11 +111,28 @@ static inline void timer_set_deferrable(
TBASE_DEFERRABLE_FLAG));
}

+static inline unsigned long tbase_get_pinned(struct tvec_base *base)
+{
+ return (unsigned long)base & TBASE_PINNED_FLAG;
+}
+
+static inline unsigned long tbase_get_flag_bits(struct timer_list *timer)
+{
+ return tbase_get_deferrable(timer->base) |
+ tbase_get_pinned(timer->base);
+}
+
static inline void
timer_set_base(struct timer_list *timer, struct tvec_base *new_base)
{
timer->base = (struct tvec_base *)((unsigned long)(new_base) |
- tbase_get_deferrable(timer->base));
+ tbase_get_flag_bits(timer));
+}
+
+static inline void timer_set_pinned(struct timer_list *timer)
+{
+ timer->base = ((struct tvec_base *)((unsigned long)(timer->base) |
+ TBASE_PINNED_FLAG));
}

static unsigned long round_jiffies_common(unsigned long j, int cpu,
@@ -736,6 +759,7 @@ void add_timer_on(struct timer_list *tim
struct tvec_base *base = per_cpu(tvec_bases, cpu);
unsigned long flags;

+ timer_set_pinned(timer);
timer_stats_timer_set_start_info(timer);
BUG_ON(timer_pending(timer) || !timer->function);
spin_lock_irqsave(&base->lock, flags);
Index: linux.trees.git/include/linux/hrtimer.h
===================================================================
--- linux.trees.git.orig/include/linux/hrtimer.h
+++ linux.trees.git/include/linux/hrtimer.h
@@ -331,23 +331,39 @@ static inline void hrtimer_init_on_stack
static inline void destroy_hrtimer_on_stack(struct hrtimer *timer) { }
#endif

+#define HRTIMER_NOT_PINNED 0
+#define HRTIMER_PINNED 1
/* Basic timer operations: */
extern int hrtimer_start(struct hrtimer *timer, ktime_t tim,
const enum hrtimer_mode mode);
+extern int hrtimer_start_pinned(struct hrtimer *timer, ktime_t tim,
+ const enum hrtimer_mode mode);
extern int hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
- unsigned long range_ns, const enum hrtimer_mode mode);
+ unsigned long range_ns, const enum hrtimer_mode mode, int pinned);
extern int hrtimer_cancel(struct hrtimer *timer);
extern int hrtimer_try_to_cancel(struct hrtimer *timer);

-static inline int hrtimer_start_expires(struct hrtimer *timer,
- enum hrtimer_mode mode)
+static inline int __hrtimer_start_expires(struct hrtimer *timer,
+ enum hrtimer_mode mode, int pinned)
{
unsigned long delta;
ktime_t soft, hard;
soft = hrtimer_get_softexpires(timer);
hard = hrtimer_get_expires(timer);
delta = ktime_to_ns(ktime_sub(hard, soft));
- return hrtimer_start_range_ns(timer, soft, delta, mode);
+ return hrtimer_start_range_ns(timer, soft, delta, mode, pinned);
+}
+
+static inline int hrtimer_start_expires(struct hrtimer *timer,
+ enum hrtimer_mode mode)
+{
+ return __hrtimer_start_expires(timer, mode, HRTIMER_NOT_PINNED);
+}
+
+static inline int hrtimer_start_expires_pinned(struct hrtimer *timer,
+ enum hrtimer_mode mode)
+{
+ return __hrtimer_start_expires(timer, mode, HRTIMER_PINNED);
}

static inline int hrtimer_restart(struct hrtimer *timer)
Index: linux.trees.git/kernel/hrtimer.c
===================================================================
--- linux.trees.git.orig/kernel/hrtimer.c
+++ linux.trees.git/kernel/hrtimer.c
@@ -193,7 +193,8 @@ struct hrtimer_clock_base *lock_hrtimer_
* Switch the timer base to the current CPU when possible.
*/
static inline struct hrtimer_clock_base *
-switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base)
+switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
+int pinned)
{
struct hrtimer_clock_base *new_base;
struct hrtimer_cpu_base *new_cpu_base;
@@ -897,9 +898,8 @@ remove_hrtimer(struct hrtimer *timer, st
* 0 on success
* 1 when the timer was active
*/
-int
-hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim, unsigned long delta_ns,
- const enum hrtimer_mode mode)
+int hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
+ unsigned long delta_ns, const enum hrtimer_mode mode, int pinned)
{
struct hrtimer_clock_base *base, *new_base;
unsigned long flags;
@@ -911,7 +911,7 @@ hrtimer_start_range_ns(struct hrtimer *t
ret = remove_hrtimer(timer, base);

/* Switch the timer base, if necessary: */
- new_base = switch_hrtimer_base(timer, base);
+ new_base = switch_hrtimer_base(timer, base, pinned);

if (mode == HRTIMER_MODE_REL) {
tim = ktime_add_safe(tim, new_base->get_time());
@@ -948,6 +948,12 @@ hrtimer_start_range_ns(struct hrtimer *t
}
EXPORT_SYMBOL_GPL(hrtimer_start_range_ns);

+int __hrtimer_start(struct hrtimer *timer, ktime_t tim,
+const enum hrtimer_mode mode, int pinned)
+{
+ return hrtimer_start_range_ns(timer, tim, 0, mode, pinned);
+}
+
/**
* hrtimer_start - (re)start an hrtimer on the current CPU
* @timer: the timer to be added
@@ -961,10 +967,26 @@ EXPORT_SYMBOL_GPL(hrtimer_start_range_ns
int
hrtimer_start(struct hrtimer *timer, ktime_t tim, const enum hrtimer_mode mode)
{
- return hrtimer_start_range_ns(timer, tim, 0, mode);
+ return __hrtimer_start(timer, tim, mode, HRTIMER_NOT_PINNED);
}
EXPORT_SYMBOL_GPL(hrtimer_start);

+/**
+ * hrtimer_start_pinned - start a CPU-pinned hrtimer
+ * @timer: the timer to be added
+ * @tim: expiry time
+ * @mode: expiry mode: absolute (HRTIMER_ABS) or relative (HRTIMER_REL)
+ *
+ * Returns:
+ * 0 on success
+ * 1 when the timer was active
+ */
+int hrtimer_start_pinned(struct hrtimer *timer,
+ ktime_t tim, const enum hrtimer_mode mode)
+{
+ return __hrtimer_start(timer, tim, mode, HRTIMER_PINNED);
+}
+EXPORT_SYMBOL_GPL(hrtimer_start_pinned);

/**
* hrtimer_try_to_cancel - try to deactivate a timer

2009-03-04 12:17:18

by Arun R Bharadwaj

Subject: [v2 PATCH 2/4] timers: identifying the existing pinned hrtimers.

* Arun R Bharadwaj <[email protected]> [2009-03-04 17:42:49]:

The following pinned hrtimers have been identified and marked:
1) sched_rt_period_timer
2) hrtick_timer
3) tick_sched_timer
4) stack_trace_timer_fn

Signed-off-by: Arun R Bharadwaj <[email protected]>
---
kernel/sched.c | 5 +++--
kernel/time/tick-sched.c | 7 ++++---
kernel/trace/trace_sysprof.c | 3 ++-
3 files changed, 9 insertions(+), 6 deletions(-)

Index: linux.trees.git/kernel/sched.c
===================================================================
--- linux.trees.git.orig/kernel/sched.c
+++ linux.trees.git/kernel/sched.c
@@ -236,7 +236,7 @@ static void start_rt_bandwidth(struct rt

now = hrtimer_cb_get_time(&rt_b->rt_period_timer);
hrtimer_forward(&rt_b->rt_period_timer, now, rt_b->rt_period);
- hrtimer_start_expires(&rt_b->rt_period_timer,
+ hrtimer_start_expires_pinned(&rt_b->rt_period_timer,
HRTIMER_MODE_ABS);
}
spin_unlock(&rt_b->rt_runtime_lock);
@@ -1156,7 +1156,8 @@ static __init void init_hrtick(void)
*/
static void hrtick_start(struct rq *rq, u64 delay)
{
- hrtimer_start(&rq->hrtick_timer, ns_to_ktime(delay), HRTIMER_MODE_REL);
+ hrtimer_start_pinned(&rq->hrtick_timer, ns_to_ktime(delay),
+ HRTIMER_MODE_REL);
}

static inline void init_hrtick(void)
Index: linux.trees.git/kernel/time/tick-sched.c
===================================================================
--- linux.trees.git.orig/kernel/time/tick-sched.c
+++ linux.trees.git/kernel/time/tick-sched.c
@@ -348,7 +348,7 @@ void tick_nohz_stop_sched_tick(int inidl
ts->idle_expires = expires;

if (ts->nohz_mode == NOHZ_MODE_HIGHRES) {
- hrtimer_start(&ts->sched_timer, expires,
+ hrtimer_start_pinned(&ts->sched_timer, expires,
HRTIMER_MODE_ABS);
/* Check, if the timer was already in the past */
if (hrtimer_active(&ts->sched_timer))
@@ -394,7 +394,7 @@ static void tick_nohz_restart(struct tic
hrtimer_forward(&ts->sched_timer, now, tick_period);

if (ts->nohz_mode == NOHZ_MODE_HIGHRES) {
- hrtimer_start_expires(&ts->sched_timer,
+ hrtimer_start_expires_pinned(&ts->sched_timer,
HRTIMER_MODE_ABS);
/* Check, if the timer was already in the past */
if (hrtimer_active(&ts->sched_timer))
@@ -698,7 +698,8 @@ void tick_setup_sched_timer(void)

for (;;) {
hrtimer_forward(&ts->sched_timer, now, tick_period);
- hrtimer_start_expires(&ts->sched_timer, HRTIMER_MODE_ABS);
+ hrtimer_start_expires_pinned(&ts->sched_timer,
+ HRTIMER_MODE_ABS);
/* Check, if the timer was already in the past */
if (hrtimer_active(&ts->sched_timer))
break;
Index: linux.trees.git/kernel/trace/trace_sysprof.c
===================================================================
--- linux.trees.git.orig/kernel/trace/trace_sysprof.c
+++ linux.trees.git/kernel/trace/trace_sysprof.c
@@ -203,7 +203,8 @@ static void start_stack_timer(void *unus
hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
hrtimer->function = stack_trace_timer_fn;

- hrtimer_start(hrtimer, ns_to_ktime(sample_period), HRTIMER_MODE_REL);
+ hrtimer_start_pinned(hrtimer, ns_to_ktime(sample_period),
+ HRTIMER_MODE_REL);
}

static void start_stack_timers(void)

2009-03-04 12:18:40

by Arun R Bharadwaj

Subject: [v2 PATCH 3/4] timers: sysfs hook to enable timer migration

* Arun R Bharadwaj <[email protected]> [2009-03-04 17:42:49]:

This patch creates the necessary sysfs interface for timer migration.

The interface is located at
/sys/devices/system/cpu/enable_timer_migration

Timer migration is enabled with:

echo 1 > /sys/devices/system/cpu/enable_timer_migration

and disabled with:

echo 0 > /sys/devices/system/cpu/enable_timer_migration


Signed-off-by: Arun R Bharadwaj <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched.c | 28 ++++++++++++++++++++++++++++
2 files changed, 29 insertions(+)

Index: linux.trees.git/include/linux/sched.h
===================================================================
--- linux.trees.git.orig/include/linux/sched.h
+++ linux.trees.git/include/linux/sched.h
@@ -803,6 +803,7 @@ enum powersavings_balance_level {
};

extern int sched_mc_power_savings, sched_smt_power_savings;
+extern int enable_timer_migration;

static inline int sd_balance_for_mc_power(void)
{
Index: linux.trees.git/kernel/sched.c
===================================================================
--- linux.trees.git.orig/kernel/sched.c
+++ linux.trees.git/kernel/sched.c
@@ -7445,6 +7445,7 @@ static void sched_domain_node_span(int n
#endif /* CONFIG_NUMA */

int sched_smt_power_savings = 0, sched_mc_power_savings = 0;
+int enable_timer_migration;

/*
* The cpus mask in sched_group and sched_domain hangs off the end.
@@ -8317,6 +8318,29 @@ static SYSDEV_CLASS_ATTR(sched_smt_power
sched_smt_power_savings_store);
#endif

+static ssize_t timer_migration_show(struct sysdev_class *class,
+ char *page)
+{
+ return sprintf(page, "%u\n", enable_timer_migration);
+}
+static ssize_t timer_migration_store(struct sysdev_class *class,
+ const char *buf, size_t count)
+{
+ unsigned int level = 0;
+
+ if (sscanf(buf, "%u", &level) != 1)
+ return -EINVAL;
+
+ if (level > 1)
+ return -EINVAL;
+
+ enable_timer_migration = level;
+
+ return count;
+}
+static SYSDEV_CLASS_ATTR(enable_timer_migration, 0644,
+ timer_migration_show,
+ timer_migration_store);
int __init sched_create_sysfs_power_savings_entries(struct sysdev_class *cls)
{
int err = 0;
@@ -8331,6 +8355,10 @@ int __init sched_create_sysfs_power_savi
err = sysfs_create_file(&cls->kset.kobj,
&attr_sched_mc_power_savings.attr);
#endif
+ if (!err)
+ err = sysfs_create_file(&cls->kset.kobj,
+ &attr_enable_timer_migration.attr);
+
return err;
}
#endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */

2009-03-04 12:19:56

by Arun R Bharadwaj

Subject: [v2 PATCH 4/4] timers: logic to enable timer migration.

* Arun R Bharadwaj <[email protected]> [2009-03-04 17:42:49]:

This patch migrates all non-pinned timers and hrtimers from all idle
CPUs to the current idle load balancer. Timers firing on busy CPUs
are not migrated.


Signed-off-by: Arun R Bharadwaj <[email protected]>
---
include/linux/sched.h | 1 +
kernel/hrtimer.c | 12 +++++++++++-
kernel/sched.c | 5 +++++
kernel/timer.c | 13 ++++++++++++-
4 files changed, 29 insertions(+), 2 deletions(-)

Index: linux.trees.git/kernel/timer.c
===================================================================
--- linux.trees.git.orig/kernel/timer.c
+++ linux.trees.git/kernel/timer.c
@@ -38,6 +38,7 @@
#include <linux/tick.h>
#include <linux/kallsyms.h>
#include <linux/timer.h>
+#include <linux/sched.h>

#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -628,7 +629,7 @@ __mod_timer(struct timer_list *timer, un
{
struct tvec_base *base, *new_base;
unsigned long flags;
- int ret;
+ int ret, current_cpu, preferred_cpu;

ret = 0;

@@ -649,6 +650,16 @@ __mod_timer(struct timer_list *timer, un

new_base = __get_cpu_var(tvec_bases);

+ current_cpu = smp_processor_id();
+ preferred_cpu = get_nohz_load_balancer();
+ if (enable_timer_migration && !tbase_get_pinned(timer->base) &&
+ idle_cpu(current_cpu) && preferred_cpu != -1) {
+ new_base = per_cpu(tvec_bases, preferred_cpu);
+ timer_set_base(timer, new_base);
+ timer->expires = expires;
+ internal_add_timer(new_base, timer);
+ goto out_unlock;
+ }
if (base != new_base) {
/*
* We are trying to schedule the timer on the local CPU.
Index: linux.trees.git/kernel/hrtimer.c
===================================================================
--- linux.trees.git.orig/kernel/hrtimer.c
+++ linux.trees.git/kernel/hrtimer.c
@@ -43,6 +43,8 @@
#include <linux/seq_file.h>
#include <linux/err.h>
#include <linux/debugobjects.h>
+#include <linux/sched.h>
+#include <linux/timer.h>

#include <asm/uaccess.h>

@@ -198,8 +200,16 @@ int pinned)
{
struct hrtimer_clock_base *new_base;
struct hrtimer_cpu_base *new_cpu_base;
+ int current_cpu, preferred_cpu;
+
+ current_cpu = smp_processor_id();
+ preferred_cpu = get_nohz_load_balancer();
+ if (enable_timer_migration && !pinned && preferred_cpu != -1 &&
+ idle_cpu(current_cpu))
+ new_cpu_base = &per_cpu(hrtimer_bases, preferred_cpu);
+ else
+ new_cpu_base = &__get_cpu_var(hrtimer_bases);

- new_cpu_base = &__get_cpu_var(hrtimer_bases);
new_base = &new_cpu_base->clock_base[base->index];

if (base != new_base) {
Index: linux.trees.git/include/linux/sched.h
===================================================================
--- linux.trees.git.orig/include/linux/sched.h
+++ linux.trees.git/include/linux/sched.h
@@ -265,6 +265,7 @@ static inline int select_nohz_load_balan
}
#endif

+extern int get_nohz_load_balancer(void);
/*
* Only dump TASK_* tasks. (0 for all tasks)
*/
Index: linux.trees.git/kernel/sched.c
===================================================================
--- linux.trees.git.orig/kernel/sched.c
+++ linux.trees.git/kernel/sched.c
@@ -4009,6 +4009,11 @@ static struct {
.load_balancer = ATOMIC_INIT(-1),
};

+inline int get_nohz_load_balancer(void)
+{
+ return atomic_read(&nohz.load_balancer);
+}
+
/*
* This routine will try to nominate the ilb (idle load balancing)
* owner among the cpus whose ticks are stopped. ilb owner will do the idle

2009-03-04 16:33:20

by Daniel Walker

Subject: Re: [v2 PATCH 4/4] timers: logic to enable timer migration.

On Wed, 2009-03-04 at 17:49 +0530, Arun R Bharadwaj wrote:
> @@ -649,6 +650,16 @@ __mod_timer(struct timer_list *timer, un
>
> new_base = __get_cpu_var(tvec_bases);
>
> + current_cpu = smp_processor_id();
> + preferred_cpu = get_nohz_load_balancer();
> + if (enable_timer_migration && !tbase_get_pinned(timer->base)
> &&
> + idle_cpu(current_cpu) && preferred_cpu != -1)
> {
> + new_base = per_cpu(tvec_bases, preferred_cpu);
> + timer_set_base(timer, new_base);
> + timer->expires = expires;
> + internal_add_timer(new_base, timer);
> + goto out_unlock;
> + }

Are you sure this compiles w/ CONFIG_SMP=n? It looks like
enable_timer_migration wouldn't exist in that case, but the chunks
you're adding in the timer code still use it.

Daniel

2009-03-04 16:53:01

by Arun R Bharadwaj

Subject: Re: [v2 PATCH 4/4] timers: logic to enable timer migration.

* Daniel Walker <[email protected]> [2009-03-04 08:33:09]:

> On Wed, 2009-03-04 at 17:49 +0530, Arun R Bharadwaj wrote:
> > @@ -649,6 +650,16 @@ __mod_timer(struct timer_list *timer, un
> >
> > new_base = __get_cpu_var(tvec_bases);
> >
> > + current_cpu = smp_processor_id();
> > + preferred_cpu = get_nohz_load_balancer();
> > + if (enable_timer_migration && !tbase_get_pinned(timer->base)
> > &&
> > + idle_cpu(current_cpu) && preferred_cpu != -1)
> > {
> > + new_base = per_cpu(tvec_bases, preferred_cpu);
> > + timer_set_base(timer, new_base);
> > + timer->expires = expires;
> > + internal_add_timer(new_base, timer);
> > + goto out_unlock;
> > + }
>
> Are you sure this compiles w/ CONFIG_SMP=n? It looks like
> enable_timer_migration wouldn't exist in that case, but the chunks
> you're adding in the timer code still use it.
>
> Daniel
>

Hi Daniel,

Thanks for pointing this out. I'll clean this up in my next iteration.

--arun

2009-03-04 17:34:11

by Ingo Molnar

Subject: Re: [v2 PATCH 0/4] timers: framework for migration between CPU


* Arun R Bharadwaj <[email protected]> wrote:

> $taskset -c 4,5,6,7 make -j4
>
> my_driver queuing timers continuously on CPU 10.
>
> idle load balancer currently on CPU 15
>
>
> Case1: Without timer migration Case2: With timer migration
>
> -------------------- --------------------
> | Core | LOC Count | | Core | LOC Count |
> | 4 | 2504 | | 4 | 2503 |
> | 5 | 2502 | | 5 | 2503 |
> | 6 | 2502 | | 6 | 2502 |
> | 7 | 2498 | | 7 | 2500 |
> | 10 | 2501 | | 10 | 35 |
> | 15 | 2501 | | 15 | 2501 |
> -------------------- --------------------
>
> --------------------- --------------------
> | Core | Sleep time | | Core | Sleep time |
> | 4 | 0.47168 | | 4 | 0.49601 |
> | 5 | 0.44301 | | 5 | 0.37153 |
> | 6 | 0.38979 | | 6 | 0.51286 |
> | 7 | 0.42829 | | 7 | 0.49635 |
> | 10 | 9.86652 | | 10 | 10.04216 |
> | 15 | 0.43048 | | 15 | 0.49056 |
> --------------------- ---------------------
>
> Here, all the timers queued by the driver on CPU10 are moved to CPU15,
> which is the idle load balancer.

The numbers with this automatic method based on the ilb-cpu look
pretty convincing. Is this what you expected it to be?

Ingo

2009-03-04 18:05:32

by Vaidyanathan Srinivasan

Subject: Re: [v2 PATCH 0/4] timers: framework for migration between CPU

* Ingo Molnar <[email protected]> [2009-03-04 18:33:21]:

>
> * Arun R Bharadwaj <[email protected]> wrote:
>
> > $taskset -c 4,5,6,7 make -j4
> >
> > my_driver queuing timers continuously on CPU 10.
> >
> > idle load balancer currently on CPU 15
> >
> >
> > Case1: Without timer migration Case2: With timer migration
> >
> > -------------------- --------------------
> > | Core | LOC Count | | Core | LOC Count |
> > | 4 | 2504 | | 4 | 2503 |
> > | 5 | 2502 | | 5 | 2503 |
> > | 6 | 2502 | | 6 | 2502 |
> > | 7 | 2498 | | 7 | 2500 |
> > | 10 | 2501 | | 10 | 35 |
> > | 15 | 2501 | | 15 | 2501 |
> > -------------------- --------------------
> >
> > --------------------- --------------------
> > | Core | Sleep time | | Core | Sleep time |
> > | 4 | 0.47168 | | 4 | 0.49601 |
> > | 5 | 0.44301 | | 5 | 0.37153 |
> > | 6 | 0.38979 | | 6 | 0.51286 |
> > | 7 | 0.42829 | | 7 | 0.49635 |
> > | 10 | 9.86652 | | 10 | 10.04216 |
> > | 15 | 0.43048 | | 15 | 0.49056 |
> > --------------------- ---------------------
> >
> > Here, all the timers queued by the driver on CPU10 are moved to CPU15,
> > which is the idle load balancer.
>
> The numbers with this automatic method based on the ilb-cpu look
> pretty convincing. Is this what you expected it to be?

Yes Ingo, these are the expected results and they look pretty good.
However, two parameters are controlled in this experiment:

1) The system is moderately loaded with kernbench so that there are
   some busy CPUs and some idle cpus, and the nohz mask does not
   change often. This leads to stable ilb-cpu selection. If the
   system is either completely idle or so lightly loaded that the
   ilb is re-nominated frequently, then timers keep following the
   ilb cpu and it is very difficult to experimentally observe the
   benefits.

   Even if the ilb bounces, consolidating timers should increase
   overlap between timers and reduce wakeups from idle.

   Optimising the ilb selection should significantly improve the
   experimental results for this patch.

2) The timer test driver creates quite a large timer load so that the
   effect of migration is observable as a sleep-time difference on the
   expected target cpu. This kind of timer load may not be uncommon
   with lots of application stacks loaded on an enterprise system.

--Vaidy

2009-03-04 18:29:48

by Ingo Molnar

Subject: Re: [v2 PATCH 0/4] timers: framework for migration between CPU


* Vaidyanathan Srinivasan <[email protected]> wrote:

> * Ingo Molnar <[email protected]> [2009-03-04 18:33:21]:
>
> >
> > * Arun R Bharadwaj <[email protected]> wrote:
> >
> > > $taskset -c 4,5,6,7 make -j4
> > >
> > > my_driver queuing timers continuously on CPU 10.
> > >
> > > idle load balancer currently on CPU 15
> > >
> > >
> > > Case1: Without timer migration Case2: With timer migration
> > >
> > > -------------------- --------------------
> > > | Core | LOC Count | | Core | LOC Count |
> > > | 4 | 2504 | | 4 | 2503 |
> > > | 5 | 2502 | | 5 | 2503 |
> > > | 6 | 2502 | | 6 | 2502 |
> > > | 7 | 2498 | | 7 | 2500 |
> > > | 10 | 2501 | | 10 | 35 |
> > > | 15 | 2501 | | 15 | 2501 |
> > > -------------------- --------------------
> > >
> > > --------------------- --------------------
> > > | Core | Sleep time | | Core | Sleep time |
> > > | 4 | 0.47168 | | 4 | 0.49601 |
> > > | 5 | 0.44301 | | 5 | 0.37153 |
> > > | 6 | 0.38979 | | 6 | 0.51286 |
> > > | 7 | 0.42829 | | 7 | 0.49635 |
> > > | 10 | 9.86652 | | 10 | 10.04216 |
> > > | 15 | 0.43048 | | 15 | 0.49056 |
> > > --------------------- ---------------------
> > >
> > > Here, all the timers queued by the driver on CPU10 are moved to CPU15,
> > > which is the idle load balancer.
> >
> > The numbers with this automatic method based on the ilb-cpu look
> > pretty convincing. Is this what you expected it to be?
>
> Yes Ingo, this is the expected results and looks pretty good. However
> there are two parameters controlled in this experiment:
>
> 1) The system is moderately loaded with kernbench so that there are
> some busy CPUs and some idle cpus, and the nohz mask does not
> change often. This leads to stable ilb-cpu selection. If the
> system is either completely idle or loaded too little leading to
> ilb nominations, then timers keep following the ilb cpu and it is
> very difficult to experimentally observe the benefits.
>
> Even if the ilb bounces, consolidating timers should increase
> overlap between timers and reduce the wakeup from idle.
>
> Optimising the ilb selection should significantly improve
> experimental results for this patch.
>
> 2) The timer test driver creates quite large timer load so that the
> effect of migration is observable as sleep time difference on the
> expected target cpu. This kind of timer load may not be uncommon
> with lots of application stack loaded in an enterprise system

the important thing to watch out for is to not have _worse_
performance due to the ilb jumping around too much. So as long as you
can prove that the numbers don't get worse, you are golden.

Power-saving via migration will only work if there's a
concentrated workload to begin with.

So the best results will be in combination with scheduler
power-saving patches. (which too make the ilb jump less in
essence)

So by getting scheduler power saving enhancements your method
will work better too - there's good synergy and no dependency on
any user-space component.

Btw., could you please turn the runtime switch into a /proc/sys
sysctl, and only when CONFIG_SCHED_DEBUG=y? Otherwise it should
be default-enabled, with no ability to turn it off.

Ingo

2009-03-05 15:38:20

by Oleg Nesterov

Subject: Re: [v2 PATCH 4/4] timers: logic to enable timer migration.

On 03/04, Arun R Bharadwaj wrote:
>
> @@ -628,7 +629,7 @@ __mod_timer(struct timer_list *timer, un
> {
> struct tvec_base *base, *new_base;
> unsigned long flags;
> - int ret;
> + int ret, current_cpu, preferred_cpu;
>
> ret = 0;
>
> @@ -649,6 +650,16 @@ __mod_timer(struct timer_list *timer, un
>
> new_base = __get_cpu_var(tvec_bases);
>
> + current_cpu = smp_processor_id();
> + preferred_cpu = get_nohz_load_balancer();
> + if (enable_timer_migration && !tbase_get_pinned(timer->base) &&
> + idle_cpu(current_cpu) && preferred_cpu != -1) {
> + new_base = per_cpu(tvec_bases, preferred_cpu);
> + timer_set_base(timer, new_base);
> + timer->expires = expires;
> + internal_add_timer(new_base, timer);
> + goto out_unlock;

I didn't read the whole series, but this looks very wrong.

We cannot do internal_add_timer() etc. until we have locked new_base;
please look at how the current code does this under "if (base != new_base)".

I think you can do something like

- new_base = __get_cpu_var(tvec_bases);
+
+ new_cpu = smp_processor_id();
+ if (enable_timer_migration && ....)
+ new_cpu = preferred_cpu;
+
+ new_base = per_cpu(tvec_bases, new_cpu);

if (base != new_base) {

Oleg.

2009-03-05 16:27:18

by Oleg Nesterov

Subject: Re: [v2 PATCH 4/4] timers: logic to enable timer migration.

On 03/04, Arun R Bharadwaj wrote:
>
> +++ linux.trees.git/kernel/sched.c
> @@ -4009,6 +4009,11 @@ static struct {
> .load_balancer = ATOMIC_INIT(-1),
> };
>
> +inline int get_nohz_load_balancer(void)

inline?

> +{
> + return atomic_read(&nohz.load_balancer);
> +}

Shouldn't we reset .load_balancer when this CPU is CPU_DOWN'ed?
Otherwise the timer can migrate to the dead CPU.

Oleg.

2009-03-05 16:57:57

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [v2 PATCH 1/4] timers: framework to identify pinned timers.

On 03/04, Arun R Bharadwaj wrote:
>
> +static inline unsigned long tbase_get_flag_bits(struct timer_list *timer)
> +{
> + return tbase_get_deferrable(timer->base) |
> + tbase_get_pinned(timer->base);
> +}

I'd say this looks a bit strange. Hopefully the compiler can optimize this
code to return (unsigned long)base & (TBASE_DEFERRABLE_FLAG | TBASE_PINNED_FLAG).

> @@ -736,6 +759,7 @@ void add_timer_on(struct timer_list *tim
> struct tvec_base *base = per_cpu(tvec_bases, cpu);
> unsigned long flags;
>
> + timer_set_pinned(timer);

But we never clear TBASE_PINNED_FLAG?

If we use mod_timer() next time, the timer remains "pinned". I do not say
this is really wrong, but a bit strange imho.

Oleg.
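The helpers under discussion rely on storing per-timer flags in the low bits of the suitably-aligned timer->base pointer. A minimal userspace sketch of that encoding, with the flag values and helper names taken from the patch (the struct itself is a dummy stand-in, and the helpers here take the base pointer directly rather than a timer_list):

```c
/* Sketch of the low-bit flag encoding behind tbase_get_deferrable()
 * and tbase_get_pinned(): struct tvec_base is aligned enough that
 * the bottom two bits of the base pointer are free to carry flags. */
#include <assert.h>

#define TBASE_DEFERRABLE_FLAG 0x1UL
#define TBASE_PINNED_FLAG     0x2UL

struct tvec_base { int dummy; } __attribute__((aligned(4)));

static inline unsigned long tbase_get_deferrable(struct tvec_base *base)
{
    return (unsigned long)base & TBASE_DEFERRABLE_FLAG;
}

static inline unsigned long tbase_get_pinned(struct tvec_base *base)
{
    return (unsigned long)base & TBASE_PINNED_FLAG;
}

/* Oleg's point: the OR of two masked reads of the same word is just
 * one read masked with both bits, so a compiler can fold the two
 * AND+OR operations into a single AND against the combined mask. */
static inline unsigned long tbase_get_flag_bits(struct tvec_base *base)
{
    return tbase_get_deferrable(base) | tbase_get_pinned(base);
}

static inline struct tvec_base *tbase_set_flags(struct tvec_base *base,
                                                unsigned long flags)
{
    return (struct tvec_base *)((unsigned long)base | flags);
}
```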

2009-03-06 00:14:32

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [v2 PATCH 0/4] timers: framework for migration between CPU

On Wed, 4 Mar 2009 17:42:49 +0530
Arun R Bharadwaj <[email protected]> wrote:
> With the timers queued I measure the sleep state residency
> for a period of 10s.
> Next, I enable timer migration and measure the sleep state
> residency period.

Is your table in seconds? Because realistically, at least on PC-like
hardware, doing work to push sleep times beyond 100/200 msecs or so
isn't going to save you any measurable amount of power anymore...



--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

Subject: Re: [v2 PATCH 4/4] timers: logic to enable timer migration.

On Thu, Mar 05, 2009 at 05:23:29PM +0100, Oleg Nesterov wrote:
> On 03/04, Arun R Bharadwaj wrote:
> >
> > +++ linux.trees.git/kernel/sched.c
> > @@ -4009,6 +4009,11 @@ static struct {
> > .load_balancer = ATOMIC_INIT(-1),
> > };
> >
> > +inline int get_nohz_load_balancer(void)
>
> inline?
>
> > +{
> > + return atomic_read(&nohz.load_balancer);
> > +}
>
> Shouldn't we reset .load_balancer when this CPU is CPU_DOWN'ed ?
> Otherwise the timer can migrate to the dead CPU.

In the select_nohz_load_balancer() code, we check if this CPU is in the
cpu_active_map. If not, then this CPU relinquishes its role as the idle
load balancer. Also, the timer migration code in the CPU-down path would
migrate any timers queued onto this CPU, right?

>
> Oleg.

--
Thanks and Regards
gautham

2009-03-06 04:24:17

by Arun R Bharadwaj

[permalink] [raw]
Subject: Re: [v2 PATCH 0/4] timers: framework for migration between CPU

* Arjan van de Ven <[email protected]> [2009-03-05 16:14:51]:

> On Wed, 4 Mar 2009 17:42:49 +0530
> Arun R Bharadwaj <[email protected]> wrote:
> > With the timers queued I measure the sleep state residency
> > for a period of 10s.
> > Next, I enable timer migration and measure the sleep state
> > residency period.
>
> is your table in seconds? Because realistically, at least on PC like
> hardware, doing work to get sleep times over a 100 / 200 msecs or so
> isn't going to save you any amount of measurable power anymore...
>

Hi Arjan,

Yes, the table is in seconds.
Rather than looking at timer migration as a direct means of obtaining
power savings, I would like to see it as one of the mechanisms that
help prevent an almost-idle cpu from waking up unnecessarily, thereby
incurring a power penalty. With migration, there is a chance of all
the CPUs in a particular package being idle at once, letting the
entire package go into deep sleep.
So we cannot really get big power savings just by doing timer
migration alone.

--arun

>
>
> --
> Arjan van de Ven Intel Open Source Technology Centre
> For development, discussion and tips for power savings,
> visit http://www.lesswatts.org

2009-03-06 06:14:55

by Arun R Bharadwaj

[permalink] [raw]
Subject: Re: [v2 PATCH 1/4] timers: framework to identify pinned timers.

* Oleg Nesterov <[email protected]> [2009-03-05 17:53:40]:

> On 03/04, Arun R Bharadwaj wrote:
> >
> > +static inline unsigned long tbase_get_flag_bits(struct timer_list *timer)
> > +{
> > + return tbase_get_deferrable(timer->base) |
> > + tbase_get_pinned(timer->base);
> > +}
>
> I'd say this looks a bit strange. Hopefully compiler can optimize this code
> to return (unsigned long)base & (TBASE_DEFERRABLE_FLAG | TBASE_PINNED_FLAG).
>

Yes, it does.

> > @@ -736,6 +759,7 @@ void add_timer_on(struct timer_list *tim
> > struct tvec_base *base = per_cpu(tvec_bases, cpu);
> > unsigned long flags;
> >
> > + timer_set_pinned(timer);
>
> But we never clear TBASE_PINNED_FLAG?
>
> If we use mod_timer() next time, the timer remains "pinned". I do not say
> this is really wrong, but a bit strange imho.
>
> Oleg.
>

A pinned timer would expect to continue firing on the same CPU even
if it does a mod_timer() the next time, right?
That's why I have not cleared the TBASE_PINNED_FLAG.

--arun

2009-03-06 07:01:25

by Arun R Bharadwaj

[permalink] [raw]
Subject: Re: [v2 PATCH 1/4] timers: framework to identify pinned timers.

* Oleg Nesterov <[email protected]> [2009-03-05 17:53:40]:

> On 03/04, Arun R Bharadwaj wrote:
> >
> > +static inline unsigned long tbase_get_flag_bits(struct timer_list *timer)
> > +{
> > + return tbase_get_deferrable(timer->base) |
> > + tbase_get_pinned(timer->base);
> > +}
>
> I'd say this looks a bit strange. Hopefully compiler can optimize this code
> to return (unsigned long)base & (TBASE_DEFERRABLE_FLAG | TBASE_PINNED_FLAG).
>

I misread the above statement.
The code is functionally correct, but I am not sure whether the
compiler optimizes it in the way you have mentioned.

> > @@ -736,6 +759,7 @@ void add_timer_on(struct timer_list *tim
> > struct tvec_base *base = per_cpu(tvec_bases, cpu);
> > unsigned long flags;
> >
> > + timer_set_pinned(timer);
>
> But we never clear TBASE_PINNED_FLAG?
>
> If we use mod_timer() next time, the timer remains "pinned". I do not say
> this is really wrong, but a bit strange imho.
>
> Oleg.
>

2009-03-06 14:53:33

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [v2 PATCH 4/4] timers: logic to enable timer migration.

On 03/06, Gautham R Shenoy wrote:
>
> On Thu, Mar 05, 2009 at 05:23:29PM +0100, Oleg Nesterov wrote:
> > On 03/04, Arun R Bharadwaj wrote:
> > >
> > > +++ linux.trees.git/kernel/sched.c
> > > @@ -4009,6 +4009,11 @@ static struct {
> > > .load_balancer = ATOMIC_INIT(-1),
> > > };
> > >
> > > +inline int get_nohz_load_balancer(void)
> >
> > inline?
> >
> > > +{
> > > + return atomic_read(&nohz.load_balancer);
> > > +}
> >
> > Shouldn't we reset .load_balancer when this CPU is CPU_DOWN'ed ?
> > Otherwise the timer can migrate to the dead CPU.
>
> In the select_nohz_load_balancer() code, we check if this CPU is in the
> cpu_active_map. If no, then this CPU relinquishes being the idle
> load balancer.

I don't understand this code, but I am not sure select_nohz_load_balancer()
is always called on the cpu_down() path before migrate_timers()/migrate_hrtimers().

If that is true, then there is no problem.

> Also, the timer migration code in the CPU down path would
> migrate any timers queued onto this CPU, right ?

Yes, but if .load_balancer is not cleared before the timer migration,
mod_timer() can move the timer to the dead CPU after migration.

Oleg.

2009-03-06 15:07:25

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [v2 PATCH 1/4] timers: framework to identify pinned timers.

On 03/06, Arun R Bharadwaj wrote:
>
> * Oleg Nesterov <[email protected]> [2009-03-05 17:53:40]:
>
> > > @@ -736,6 +759,7 @@ void add_timer_on(struct timer_list *tim
> > > struct tvec_base *base = per_cpu(tvec_bases, cpu);
> > > unsigned long flags;
> > >
> > > + timer_set_pinned(timer);
> >
> > But we never clear TBASE_PINNED_FLAG?
> >
> > If we use mod_timer() next time, the timer remains "pinned". I do not say
> > this is really wrong, but a bit strange imho.
> >
> > Oleg.
> >
>
> The pinned timer would expect to continue firing on the same CPU
> although it does a mod_timer() the next time, right?

Why? Let's suppose we call queue_delayed_work_on(), and next time
we use queue_delayed_work() with the same dwork. The timer is still
pinned, this doesn't look consistent to me.

> Thats why I have not cleared the TBASE_PINNED_FLAG.

Personally, I don't like this flag. I think that "pinned" should
be an argument to mod_timer(), not a "property" of the timer_list.

Oleg.
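The API shape Oleg prefers can be sketched in a few lines. All names here are hypothetical userspace stand-ins, and the "timer" carries only a CPU and an expiry; the point is that pinned-ness is decided per call instead of being remembered as sticky state on the timer, so a later plain mod_timer() is free to migrate it. (Mainline later went in this direction with a pinned argument to __mod_timer() and a mod_timer_pinned() helper.)

```c
/* Sketch (hypothetical names): "pinned" as a per-call argument
 * rather than a flag that sticks to the timer after add_timer_on(). */
#include <assert.h>
#include <stdbool.h>

struct timer { int cpu; unsigned long expires; };

static int preferred_cpu = 1;   /* idle load balancer */

/* Pinned-ness is decided per call, not remembered on the timer. */
static void my_mod_timer(struct timer *t, unsigned long expires, bool pinned)
{
    t->expires = expires;
    if (!pinned)
        t->cpu = preferred_cpu;    /* eligible for migration */
}

/* An explicit-CPU queueing call is inherently pinned for this call,
 * but leaves no sticky state behind. */
static void my_add_timer_on(struct timer *t, unsigned long expires, int cpu)
{
    t->expires = expires;
    t->cpu = cpu;
}
```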

Subject: Re: [v2 PATCH 4/4] timers: logic to enable timer migration.

On Fri, Mar 06, 2009 at 03:50:02PM +0100, Oleg Nesterov wrote:
> On 03/06, Gautham R Shenoy wrote:
> >
> > On Thu, Mar 05, 2009 at 05:23:29PM +0100, Oleg Nesterov wrote:
> > > On 03/04, Arun R Bharadwaj wrote:
> > > >
> > > > +++ linux.trees.git/kernel/sched.c
> > > > @@ -4009,6 +4009,11 @@ static struct {
> > > > .load_balancer = ATOMIC_INIT(-1),
> > > > };
> > > >
> > > > +inline int get_nohz_load_balancer(void)
> > >
> > > inline?
> > >
> > > > +{
> > > > + return atomic_read(&nohz.load_balancer);
> > > > +}
> > >
> > > Shouldn't we reset .load_balancer when this CPU is CPU_DOWN'ed ?
> > > Otherwise the timer can migrate to the dead CPU.
> >
> > In the select_nohz_load_balancer() code, we check if this CPU is in the
> > cpu_active_map. If no, then this CPU relinquishes being the idle
> > load balancer.

.load_balancer is the idle load balancer CPU which performs
load balancing on behalf of all the idle cpus in the system.
It's the only idle CPU which doesn't turn off its ticks.

This CPU relinquishes its role as the idle load balancer when
a) it finds a runnable task in its runqueue, i.e. before exiting the
idle state,
OR
b) all the CPUs in the system go idle.

Since it doesn't turn off its ticks, it calls
select_nohz_load_balancer() every scheduler tick, and thus can observe
that it's no longer set in cpu_active_map within a tick or two.

Also, the cpu_down() path calls migrate_timers() in CPU_DEAD, i.e. after
the CPU has gone down. But to take the CPU down, we would have invoked
the stop_machine_run() code on that CPU, which by itself would make the
CPU relinquish its role as the idle load balancer owing to condition (a).

Thus I believe mod_timer() can never migrate a timer onto a DEAD CPU
during/_after_ we invoke migrate_timers() from the CPU_DEAD callpath.

>
> I don't understand this code, but I am not sure select_nohz_load_balancer()
> is always called on cpu_down() path before migrate_timers/migrate_hrtimers.
>
> If this is true, then there is no problem.
>
> > Also, the timer migration code in the CPU down path would
> > migrate any timers queued onto this CPU, right ?
>
> Yes, but if .load_balancer is not cleared before the timer migration,
> mod_timer() can move the timer to the dead CPU after migration.
>
> Oleg.

--
Thanks and Regards
gautham

2009-03-06 17:12:12

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [v2 PATCH 4/4] timers: logic to enable timer migration.

On 03/06, Gautham R Shenoy wrote:
>
> .load_balancer is the idle load balancer CPU which performs
> load balancing on behalf of all the idle cpus in the system.
> It's the only idle CPU which doesn't turn off it's ticks.
>
> This CPU relinquishes it's role as the idle load balancer when
> a) it finds a runnable task in it's runqueue, i.e before exiting idle
> state.
> OR
> b) all the CPUs in the system go idle.
>
> Since it doesn't turn off it's ticks, it calls
> select_nohz_load_balancer() every scheduler tick, and thus can observe
> that it's no longer set in cpu_active_map within a tick or two.
>
> Also, the cpu_down() path calls migrate_timers() in CPU_DEAD:,i.e after
> the CPU has gone down. But to take the CPU down, we would have invoked
> the stop_machine_run() code on that CPU, which by itself would make the
> CPU relinquish it's role as the idle load balancer owing to condition (a).
>
> Thus I believe, mod_timer() can never migrate a timer on to a DEAD CPU,
> during/_after_ we invoke migrate_timers() from the CPU_DEAD callpath.

OK, thanks a lot for your explanation!

Oleg.

Subject: Re: [v2 PATCH 4/4] timers: logic to enable timer migration.

On Fri, Mar 06, 2009 at 06:08:34PM +0100, Oleg Nesterov wrote:
> On 03/06, Gautham R Shenoy wrote:
> >
> > .load_balancer is the idle load balancer CPU which performs
> > load balancing on behalf of all the idle cpus in the system.
> > It's the only idle CPU which doesn't turn off it's ticks.
> >
> > This CPU relinquishes it's role as the idle load balancer when
> > a) it finds a runnable task in it's runqueue, i.e before exiting idle
> > state.
> > OR
> > b) all the CPUs in the system go idle.
> >
> > Since it doesn't turn off it's ticks, it calls
> > select_nohz_load_balancer() every scheduler tick, and thus can observe
> > that it's no longer set in cpu_active_map within a tick or two.
> >
> > Also, the cpu_down() path calls migrate_timers() in CPU_DEAD:,i.e after
> > the CPU has gone down. But to take the CPU down, we would have invoked
> > the stop_machine_run() code on that CPU, which by itself would make the
> > CPU relinquish it's role as the idle load balancer owing to condition (a).
> >
> > Thus I believe, mod_timer() can never migrate a timer on to a DEAD CPU,
> > during/_after_ we invoke migrate_timers() from the CPU_DEAD callpath.
>
> OK, thanks a lot for your explanation!

Glad it helped :-)
>
> Oleg.

--
Thanks and Regards
gautham