2005-03-08 07:12:33

by Christoph Lameter

Subject: [patch] del_timer_sync scalability patch

When a potential periodic timer is deleted through del_timer_sync(), all
cpus are scanned to determine if the timer is running on that cpu. In a
NUMA configuration this scan causes NUMA interlink traffic, which limits
the scalability of timers.

The following patch makes the timer remember where it was last
started. It is then possible to wait for the completion of the timer
on that specific cpu only.

Signed-off-by: Shai Fultheim <[email protected]>
Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.11/include/linux/timer.h
===================================================================
--- linux-2.6.11.orig/include/linux/timer.h 2005-03-07 21:42:43.539328640 -0800
+++ linux-2.6.11/include/linux/timer.h 2005-03-07 21:42:50.993195480 -0800
@@ -19,10 +19,19 @@ struct timer_list {
unsigned long data;

struct tvec_t_base_s *base;
+#ifdef CONFIG_SMP
+ struct tvec_t_base_s *last_running;
+#endif
};

#define TIMER_MAGIC 0x4b87ad6e

+#ifdef CONFIG_SMP
+#define TIMER_INIT_LASTRUNNING .last_running = NULL,
+#else
+#define TIMER_INIT_LASTRUNNING
+#endif
+
#define TIMER_INITIALIZER(_function, _expires, _data) { \
.function = (_function), \
.expires = (_expires), \
@@ -30,6 +39,7 @@ struct timer_list {
.base = NULL, \
.magic = TIMER_MAGIC, \
.lock = SPIN_LOCK_UNLOCKED, \
+ TIMER_INIT_LASTRUNNING \
}

/***
@@ -41,6 +51,9 @@ struct timer_list {
*/
static inline void init_timer(struct timer_list * timer)
{
+#ifdef CONFIG_SMP
+ timer->last_running = NULL;
+#endif
timer->base = NULL;
timer->magic = TIMER_MAGIC;
spin_lock_init(&timer->lock);
Index: linux-2.6.11/kernel/timer.c
===================================================================
--- linux-2.6.11.orig/kernel/timer.c 2005-03-07 21:42:43.539328640 -0800
+++ linux-2.6.11/kernel/timer.c 2005-03-07 22:01:27.733425160 -0800
@@ -84,6 +84,14 @@ static inline void set_running_timer(tve
#endif
}

+static inline void set_last_running(struct timer_list *timer,
+ tvec_base_t *base)
+{
+#ifdef CONFIG_SMP
+ timer->last_running = base;
+#endif
+}
+
/* Fake initialization */
static DEFINE_PER_CPU(tvec_base_t, tvec_bases) = { SPIN_LOCK_UNLOCKED };

@@ -335,33 +343,41 @@ EXPORT_SYMBOL(del_timer);
*
* The function returns whether it has deactivated a pending timer or not.
*
- * del_timer_sync() is slow and complicated because it copes with timer
- * handlers which re-arm the timer (periodic timers). If the timer handler
- * is known to not do this (a single shot timer) then use
- * del_singleshot_timer_sync() instead.
+ * del_timer_sync() copes with timer handlers which re-arm the timer (periodic
+ * timers). If the timer handler is known not to do this (a single shot
+ * timer) then use del_singleshot_timer_sync() instead.
*/
int del_timer_sync(struct timer_list *timer)
{
tvec_base_t *base;
- int i, ret = 0;
+ int ret = 0;

check_timer(timer);

del_again:
ret += del_timer(timer);

- for_each_online_cpu(i) {
- base = &per_cpu(tvec_bases, i);
- if (base->running_timer == timer) {
- while (base->running_timer == timer) {
- cpu_relax();
- preempt_check_resched();
- }
- break;
+ /* Get where the timer ran last */
+ base = timer->last_running;
+ if (base) {
+ /*
+ * If the timer is still executing then wait until the
+ * run is complete.
+ */
+ while (base->running_timer == timer) {
+ cpu_relax();
+ preempt_check_resched();
}
}
smp_rmb();
- if (timer_pending(timer))
+ /*
+ * If the timer is no longer pending and its last run
+ * was where we checked then the timer
+ * is truly off. If the timer has been started on some other
+ * cpu in the meantime (due to a race condition) then
+ * we need to repeat what we have done.
+ */
+ if (timer_pending(timer) || timer->last_running != base)
goto del_again;

return ret;
@@ -464,6 +480,7 @@ repeat:
set_running_timer(base, timer);
smp_wmb();
timer->base = NULL;
+ set_last_running(timer, base);
spin_unlock_irq(&base->lock);
{
u32 preempt_count = preempt_count();


2005-03-08 07:33:21

by Andrew Morton

Subject: Re: [patch] del_timer_sync scalability patch

Christoph Lameter <[email protected]> wrote:
>
> When a potential periodic timer is deleted through timer_del_sync, all
> cpus are scanned to determine if the timer is running on that cpu. In a
> NUMA configuration doing so will cause NUMA interlink traffic which limits
> the scalability of timers.
>
> The following patch makes the timer remember where the timer was last
> started. It is then possible to only wait for the completion of the timer
> on that specific cpu.

OK, I stared at this for a while and cannot see holes in it. There may
still be one though - this code is tricky. Probably any such holes would
be covered up by the fact that timers never migrate between CPUs anyway,
unless the handler chooses to do add_timer_on(), which I'm sure none do.

Ingo, could you take a look? Especially what happens if the timer hops
CPUs after the

/* Get where the timer ran last */
base = timer->last_running;

statement. Do we have sufficient barriers in there for this?


A few tidyings which I'd suggest:




- Remove TIMER_INIT_LASTRUNNING: struct initialisers set unmentioned fields
to zero anyway.

- Rename set_last_running() to timer_set_last_running, move it to timer.h.
Then use it in init_timer() to avoid an ifdef.

Signed-off-by: Andrew Morton <[email protected]>
---

25-akpm/include/linux/timer.h | 21 ++++++++++-----------
25-akpm/kernel/timer.c | 10 +---------
2 files changed, 11 insertions(+), 20 deletions(-)

diff -puN include/linux/timer.h~del_timer_sync-scalability-patch-tidy include/linux/timer.h
--- 25/include/linux/timer.h~del_timer_sync-scalability-patch-tidy 2005-03-07 23:28:55.000000000 -0800
+++ 25-akpm/include/linux/timer.h 2005-03-07 23:30:23.000000000 -0800
@@ -26,12 +26,6 @@ struct timer_list {

#define TIMER_MAGIC 0x4b87ad6e

-#ifdef CONFIG_SMP
-#define TIMER_INIT_LASTRUNNING .last_running = NULL,
-#else
-#define TIMER_INIT_LASTRUNNING
-#endif
-
#define TIMER_INITIALIZER(_function, _expires, _data) { \
.function = (_function), \
.expires = (_expires), \
@@ -39,9 +33,16 @@ struct timer_list {
.base = NULL, \
.magic = TIMER_MAGIC, \
.lock = SPIN_LOCK_UNLOCKED, \
- TIMER_INIT_LASTRUNNING \
}

+static inline void
+timer_set_last_running(struct timer_list *timer, struct tvec_t_base_s *base)
+{
+#ifdef CONFIG_SMP
+ timer->last_running = base;
+#endif
+}
+
/***
* init_timer - initialize a timer.
* @timer: the timer to be initialized
@@ -49,11 +50,9 @@ struct timer_list {
* init_timer() must be done to a timer prior calling *any* of the
* other timer functions.
*/
-static inline void init_timer(struct timer_list * timer)
+static inline void init_timer(struct timer_list *timer)
{
-#ifdef CONFIG_SMP
- timer->last_running = NULL;
-#endif
+ timer_set_last_running(timer, NULL);
timer->base = NULL;
timer->magic = TIMER_MAGIC;
spin_lock_init(&timer->lock);
diff -puN kernel/timer.c~del_timer_sync-scalability-patch-tidy kernel/timer.c
--- 25/kernel/timer.c~del_timer_sync-scalability-patch-tidy 2005-03-07 23:28:55.000000000 -0800
+++ 25-akpm/kernel/timer.c 2005-03-07 23:28:55.000000000 -0800
@@ -86,14 +86,6 @@ static inline void set_running_timer(tve
#endif
}

-static inline void set_last_running(struct timer_list *timer,
- tvec_base_t *base)
-{
-#ifdef CONFIG_SMP
- timer->last_running = base;
-#endif
-}
-
/* Fake initialization */
static DEFINE_PER_CPU(tvec_base_t, tvec_bases) = { SPIN_LOCK_UNLOCKED };

@@ -482,7 +474,7 @@ repeat:
set_running_timer(base, timer);
smp_wmb();
timer->base = NULL;
- set_last_running(timer, base);
+ timer_set_last_running(timer, base);
spin_unlock_irq(&base->lock);
{
u32 preempt_count = preempt_count();
_

2005-03-08 08:20:30

by Ingo Molnar

Subject: Re: [patch] del_timer_sync scalability patch


* Andrew Morton <[email protected]> wrote:

> Christoph Lameter <[email protected]> wrote:
> >
> > When a potential periodic timer is deleted through timer_del_sync, all
> > cpus are scanned to determine if the timer is running on that cpu. In a
> > NUMA configuration doing so will cause NUMA interlink traffic which limits
> > the scalability of timers.
> >
> > The following patch makes the timer remember where the timer was last
> > started. It is then possible to only wait for the completion of the timer
> > on that specific cpu.

i'm not sure about this. The patch adds one more pointer to a very
frequently used and frequently embedded data structure (struct
timer_list), for the benefit of a rarely used API variant
(del_timer_sync()).

Furthermore, timer->base itself is affine too, a timer always runs on
the CPU belonging to timer->base. So a more scalable variant of
del_timer_sync() would perhaps be possible by carefully deleting the
timer and only going into the full loop if the timer was deleted before
(in which case semantics require us to synchronize with timer
execution). Or we could skip the full loop altogether and just
synchronize with timer execution if _we_ deleted the timer. This should
work fine as far as itimers are concerned.

Ingo

2005-03-08 08:34:42

by Andrew Morton

Subject: Re: [patch] del_timer_sync scalability patch

Ingo Molnar <[email protected]> wrote:
>
>
> * Andrew Morton <[email protected]> wrote:
>
> > Christoph Lameter <[email protected]> wrote:
> > >
> > > When a potential periodic timer is deleted through timer_del_sync, all
> > > cpus are scanned to determine if the timer is running on that cpu. In a
> > > NUMA configuration doing so will cause NUMA interlink traffic which limits
> > > the scalability of timers.
> > >
> > > The following patch makes the timer remember where the timer was last
> > > started. It is then possible to only wait for the completion of the timer
> > > on that specific cpu.
>
> i'm not sure about this. The patch adds one more pointer to a very
> frequently used and frequently embedded data structure (struct
> timer_list), for the benefit of a rarely used API variant
> (timer_del_sync()).

True.

(We can delete timer.magic and check_timer() now. It has served its purpose)

> Furthermore, timer->base itself is affine too, a timer always runs on
> the CPU belonging to timer->base. So a more scalable variant of
> del_timer_sync() would perhaps be possible by carefully deleting the
> timer and only going into the full loop if the timer was deleted before.
> (and in which case semantics require us to synchronize on timer
> execution.) Or we could skip the full loop altogether and just
> synchronize with timer execution if _we_ deleted the timer. This should
> work fine as far as itimers are concerned.

If we're prepared to rule that a timer handler is not allowed to do
add_timer_on() then a recurring timer is permanently pinned to a CPU, isn't
it?

That should make things simpler?

2005-03-08 20:29:54

by Christoph Lameter

Subject: Re: [patch] del_timer_sync scalability patch

On Tue, 8 Mar 2005, Andrew Morton wrote:

> If we're prepared to rule that a timer handler is not allowed to do
> add_timer_on() then a recurring timer is permanently pinned to a CPU, isn't
> it?

The process may be rescheduled to run on a different processor. Then the
add_timer() function (called from schedule_next_timer() in
kernel/posix-timers.c) will add the timer on the new processor,
because it is called from the signal handling code. So I think that it
is possible for a periodic timer to be scheduled on different
processors.

2005-03-08 20:29:53

by Christoph Lameter

Subject: Re: [patch] del_timer_sync scalability patch

On Tue, 8 Mar 2005, Ingo Molnar wrote:

> > > The following patch makes the timer remember where the timer was last
> > > started. It is then possible to only wait for the completion of the timer
> > > on that specific cpu.
>
> i'm not sure about this. The patch adds one more pointer to a very
> frequently used and frequently embedded data structure (struct
> timer_list), for the benefit of a rarely used API variant
> (timer_del_sync()).

Without that pointer there is no information in the timer_list struct as
to where the timer is/was running, since base is set to NULL while the
timer function is executing.

> Furthermore, timer->base itself is affine too, a timer always runs on
> the CPU belonging to timer->base. So a more scalable variant of
> del_timer_sync() would perhaps be possible by carefully deleting the
> timer and only going into the full loop if the timer was deleted before.
> (and in which case semantics require us to synchronize on timer
> execution.) Or we could skip the full loop altogether and just
> synchronize with timer execution if _we_ deleted the timer. This should
> work fine as far as itimers are concerned.

There is no easy way to distinguish the case that a timer was deleted
from the case that the timer function is running (which may reschedule
the periodic timer). A scan over all the bases is required to ensure that
the timer function is not being executed right now.

If we only synced when we had deleted the timer (meaning timer->base was
!= NULL), then we would not synchronize while the timer function
is executing (and therefore timer->base == NULL). However, we need to sync
exactly in that case, because a periodic timer may set timer->base again
when scheduling the next event. That is the main stated purpose of
del_timer_sync().

2005-03-11 17:49:46

by Oleg Nesterov

Subject: Re: [patch] del_timer_sync scalability patch

Hello.

I am not sure, but I think this patch is incorrect.

> @@ -466,6 +482,7 @@ repeat:
> set_running_timer(base, timer);
> smp_wmb();
> timer->base = NULL;
------> WINDOW <------
> + set_last_running(timer, base);
> spin_unlock_irq(&base->lock);

Suppose it is the first invocation of the timer, so
timer->last_running == NULL.

What if del_timer_sync() happens in that window?

del_timer_sync:
del_timer(); // timer->base == NULL, returns

base = timer->last_running;
if (base) // no, it is still NULL
...

if (timer->base != NULL || timer->last_running != base)
goto del_again; // not taken

return;

I think it is not enough to exchange these 2 lines in
__run_timers(); we also need barriers.

Oleg.

2005-03-11 21:09:39

by Christoph Lameter

Subject: Re: [patch] del_timer_sync scalability patch

On Fri, 11 Mar 2005, Oleg Nesterov wrote:

> I think it is not enough to exchange these 2 lines in
> __run_timers, we also need barriers.

Maybe it's best to drop last_running, as Ingo suggested.

Replace the magic with a flag that can be set to stop a timer from
being scheduled again.

Then del_timer_sync() may:

1. Set the flag to forbid rescheduling
2. Delete the timer
3. Wait till the timer function is complete
4. (Eventually verify that the timer is really gone)
5. Reset the no-reschedule flag

2005-03-13 12:07:51

by Oleg Nesterov

Subject: Re: [patch] del_timer_sync scalability patch

I suspect that del_timer_sync() in its current form is racy.

CPU 0 CPU 1

__run_timers() sets timer->base = NULL

del_timer_sync() starts, calls del_timer(), it returns
because timer->base == NULL.

waits until the run is complete:
while (base->running_timer == timer)
preempt_check_resched();

calls schedule(), or long interrupt happens.

timer reschedules itself, __run_timers()
exits.

base->running_timer == NULL, end of loop.

next timer interrupt, __run_timers() picks
this timer again, sets timer->base = NULL

if (timer_pending(timer)) // no, timer->base == NULL
goto del_again; // not taken

del_timer_sync() returns

timer runs.

No?

Oleg.

2005-03-14 19:43:50

by Christoph Lameter

Subject: Re: [patch] del_timer_sync scalability patch


On Sun, 13 Mar 2005, Oleg Nesterov wrote:

> I suspect that del_timer_sync() in its current form is racy.
>
> CPU 0 CPU 1
>
> __run_timers() sets timer->base = NULL
>
> del_timer_sync() starts, calls del_timer(), it returns
> because timer->base == NULL.
>
> waits until the run is complete:
> while (base->running_timer == timer)
> preempt_check_resched();
>
> calls schedule(), or long interrupt happens.
>
> timer reschedules itself, __run_timers()
> exits.
>
> base->running_timer == NULL, end of loop.
>
> next timer interrupt, __run_timers() picks
> this timer again, sets timer->base = NULL
>
> if (timer_pending(timer)) // no, timer->base == NULL

timer->base is != NULL because the timer has rescheduled itself:
__mod_timer() sets timer->base.

> goto del_again; // not taken

Yes it is taken!

>
> del_timer_sync() returns
>
> timer runs.

Nope.

2005-03-15 08:07:17

by Oleg Nesterov

Subject: Re: [patch] del_timer_sync scalability patch

Christoph Lameter wrote:
>
> On Sun, 13 Mar 2005, Oleg Nesterov wrote:
>
> > I suspect that del_timer_sync() in its current form is racy.
> >
...snip...
> > next timer interrupt, __run_timers() picks
> > this timer again, sets timer->base = NULL
^^^^^^^^^^^^^^^^^^^^^^^
> >
> > if (timer_pending(timer)) // no, timer->base == NULL
>
> timer->base is != NULL because the timer has rescheduled itself.
> __mod_timer() sets timer->base.

Christoph, please look again. Yes, __mod_timer sets timer->base,
but it is cleared in the _next_ timer interrupt on CPU 0.

Andrew, Ingo, what do you think?

Oleg.

2005-03-15 08:07:01

by Christoph Lameter

Subject: Re: [patch] del_timer_sync scalability patch

How about this take on the problem?

When a potential periodic timer is deleted through del_timer_sync(), all cpus
are scanned to determine if the timer is running on that cpu. In a NUMA
configuration this scan causes NUMA interlink traffic, which limits the
scalability of timers.

The following patch removes the magic in the timer_list structure (Andrew
suggested that we may not need it anymore) and replaces it with two u8
variables that give us some additional state of the timer:

running -> Set if the timer function is currently executing
shutdown -> Set if del_timer_sync is running and the timer
should not be rescheduled.

Signed-off-by: Christoph Lameter <[email protected]>
Signed-off-by: Shai Fultheim <[email protected]>

Index: linux-2.6.11/include/linux/timer.h
===================================================================
--- linux-2.6.11.orig/include/linux/timer.h 2005-03-14 14:17:51.671533904 -0800
+++ linux-2.6.11/include/linux/timer.h 2005-03-14 14:41:49.640929328 -0800
@@ -13,22 +13,22 @@ struct timer_list {
unsigned long expires;

spinlock_t lock;
- unsigned long magic;

void (*function)(unsigned long);
unsigned long data;

- struct tvec_t_base_s *base;
+ struct tvec_t_base_s *base; /* ==NULL if timer not scheduled */
+ u8 running; /* function is currently executing */
+ u8 shutdown; /* do not schedule this timer */
};

-#define TIMER_MAGIC 0x4b87ad6e
-
#define TIMER_INITIALIZER(_function, _expires, _data) { \
.function = (_function), \
.expires = (_expires), \
.data = (_data), \
.base = NULL, \
- .magic = TIMER_MAGIC, \
+ .running = 0, \
+ .shutdown = 0, \
.lock = SPIN_LOCK_UNLOCKED, \
}

@@ -42,7 +42,8 @@ struct timer_list {
static inline void init_timer(struct timer_list * timer)
{
timer->base = NULL;
- timer->magic = TIMER_MAGIC;
+ timer->running = 0;
+ timer->shutdown = 0;
spin_lock_init(&timer->lock);
}

Index: linux-2.6.11/kernel/timer.c
===================================================================
--- linux-2.6.11.orig/kernel/timer.c 2005-03-14 14:17:51.671533904 -0800
+++ linux-2.6.11/kernel/timer.c 2005-03-14 14:57:24.613791848 -0800
@@ -89,31 +89,6 @@ static inline void set_running_timer(tve
/* Fake initialization */
static DEFINE_PER_CPU(tvec_base_t, tvec_bases) = { SPIN_LOCK_UNLOCKED };

-static void check_timer_failed(struct timer_list *timer)
-{
- static int whine_count;
- if (whine_count < 16) {
- whine_count++;
- printk("Uninitialised timer!\n");
- printk("This is just a warning. Your computer is OK\n");
- printk("function=0x%p, data=0x%lx\n",
- timer->function, timer->data);
- dump_stack();
- }
- /*
- * Now fix it up
- */
- spin_lock_init(&timer->lock);
- timer->magic = TIMER_MAGIC;
-}
-
-static inline void check_timer(struct timer_list *timer)
-{
- if (timer->magic != TIMER_MAGIC)
- check_timer_failed(timer);
-}
-
-
static void internal_add_timer(tvec_base_t *base, struct timer_list *timer)
{
unsigned long expires = timer->expires;
@@ -164,8 +139,6 @@ int __mod_timer(struct timer_list *timer

BUG_ON(!timer->function);

- check_timer(timer);
-
spin_lock_irqsave(&timer->lock, flags);
new_base = &__get_cpu_var(tvec_bases);
repeat:
@@ -208,8 +181,10 @@ repeat:
ret = 1;
}
timer->expires = expires;
- internal_add_timer(new_base, timer);
- timer->base = new_base;
+ if (!timer->shutdown) {
+ internal_add_timer(new_base, timer);
+ timer->base = new_base;
+ }

if (old_base && (new_base != old_base))
spin_unlock(&old_base->lock);
@@ -235,7 +210,8 @@ void add_timer_on(struct timer_list *tim

BUG_ON(timer_pending(timer) || !timer->function);

- check_timer(timer);
+ if (timer->shutdown)
+ return;

spin_lock_irqsave(&base->lock, flags);
internal_add_timer(base, timer);
@@ -267,8 +243,6 @@ int mod_timer(struct timer_list *timer,
{
BUG_ON(!timer->function);

- check_timer(timer);
-
/*
* This is a common optimization triggered by the
* networking code - if the timer is re-modified
@@ -298,8 +272,6 @@ int del_timer(struct timer_list *timer)
unsigned long flags;
tvec_base_t *base;

- check_timer(timer);
-
repeat:
base = timer->base;
if (!base)
@@ -337,35 +309,40 @@ EXPORT_SYMBOL(del_timer);
*
* The function returns whether it has deactivated a pending timer or not.
*
- * del_timer_sync() is slow and complicated because it copes with timer
- * handlers which re-arm the timer (periodic timers). If the timer handler
- * is known to not do this (a single shot timer) then use
- * del_singleshot_timer_sync() instead.
+ * del_timer_sync() copes with timer handlers which re-arm the timer (periodic
+ * timers). If the timer handler is known not to do this (a single shot
+ * timer) then use del_singleshot_timer_sync() instead.
*/
int del_timer_sync(struct timer_list *timer)
{
- tvec_base_t *base;
- int i, ret = 0;
+ int ret = 0;

- check_timer(timer);
+ /* Forbid additional scheduling of this timer */
+ timer->shutdown = 1;

del_again:
ret += del_timer(timer);

- for_each_online_cpu(i) {
- base = &per_cpu(tvec_bases, i);
- if (base->running_timer == timer) {
- while (base->running_timer == timer) {
- cpu_relax();
- preempt_check_resched();
- }
- break;
- }
+ /*
+ * If the timer is still executing then wait until the
+ * run is complete.
+ */
+ while (timer->running) {
+ cpu_relax();
+ preempt_check_resched();
}
smp_rmb();
+ /*
+ * If the timer is no longer pending then the timer
+ * is truly off.
+ * The timer may have been started on some other
+ * cpu in the meantime (due to a race condition)
+ */
if (timer_pending(timer))
goto del_again;

+ /* Reenable timer shutdown */
+ timer->shutdown = 0;
return ret;
}
EXPORT_SYMBOL(del_timer_sync);
@@ -380,7 +357,7 @@ EXPORT_SYMBOL(del_timer_sync);
*
* Synchronization rules: callers must prevent restarting of the timer,
* otherwise this function is meaningless. It must not be called from
- * interrupt contexts. The caller must not hold locks which wold prevent
+ * interrupt contexts. The caller must not hold locks which would prevent
* completion of the timer's handler. Upon exit the timer is not queued and
* the handler is not running on any CPU.
*
@@ -464,6 +441,7 @@ repeat:

list_del(&timer->entry);
set_running_timer(base, timer);
+ timer->running = 1;
smp_wmb();
timer->base = NULL;
spin_unlock_irq(&base->lock);
@@ -476,6 +454,7 @@ repeat:
}
}
spin_lock_irq(&base->lock);
+ timer->running = 0;
goto repeat;
}
}

2005-03-15 09:22:38

by Oleg Nesterov

Subject: Re: [patch] del_timer_sync scalability patch

Christoph Lameter wrote:
>
> @@ -476,6 +454,7 @@ repeat:
> }
> }
> spin_lock_irq(&base->lock);
> + timer->running = 0;
^^^^^^^^^^^^^^^^^^
> goto repeat;
> }
> }

This is imho wrong. The timer probably no longer exists when
timer_list->function returns.

I mean, timer_list->function could deallocate the timer's memory.

Oleg.

2005-03-15 09:28:56

by Ingo Molnar

Subject: Re: [patch] del_timer_sync scalability patch


* Christoph Lameter <[email protected]> wrote:

> The following patch removes the magic in the timer_list structure
> (Andrew suggested that we may not need it anymore) and replaces it
> with two u8 variables that give us some additional state of the timer

The 'remove the magic' observation is not a 'backdoor' to introduce new
fields 'for free'. So please don't mix the two things into one patch.

> + u8 running; /* function is currently executing */
> + u8 shutdown; /* do not schedule this timer */

it may well be cleaner to do the timer->base_running thing after all,
and to do this in __run_timers():

timer->base = NULL;
timer->base_running = base;

note that we cannot clear ->base_running in __run_timers() [we don't own
any access to the timer anymore, it may be kfree()d, etc.], so
del_timer_sync() has to make sure that the timer->base_running info is
not stale - it is a hint only. But it looks doable, and it should solve
the NUMA scalability problem.

Ingo

2005-03-20 23:27:36

by Chen, Kenneth W

Subject: RE: [patch] del_timer_sync scalability patch

We did exactly the same thing about 10 months back. Nice to
see that independent people came up with exactly the same
solution that we proposed back then. In fact, this patch
is line-by-line identical to the one we posted.

Hope Andrew is going to take the patch this time.

See our original posting:
http://marc.theaimsgroup.com/?l=linux-kernel&m=108422767319822&w=2

- Ken


2005-03-20 23:35:46

by Andrew Morton

Subject: Re: [patch] del_timer_sync scalability patch

"Chen, Kenneth W" <[email protected]> wrote:
>
> We did exactly the same thing about 10 months back. Nice to
> see that independent people came up with exactly the same
> solution that we proposed 10 months back.

Well the same question applies. Christoph, which code is calling
del_timer_sync() so often that you noticed?

> In fact, this patch
> is line-by-line identical to the one we post.

I assume that's not a coincidence.

> Hope Andrew is going to take the patch this time.

Hope Kenneth is going to test the alternate del_timer_sync patches in next
-mm ;)

2005-03-21 00:47:47

by Christoph Lameter

Subject: Re: [patch] del_timer_sync scalability patch

On Sun, 20 Mar 2005, Andrew Morton wrote:

> "Chen, Kenneth W" <[email protected]> wrote:
> >
> > We did exactly the same thing about 10 months back. Nice to
> > see that independent people came up with exactly the same
> > solution that we proposed 10 months back.
>
> Well the same question applies. Christoph, which code is calling
> del_timer_sync() so often that you noticed?

Ummm. I have to ask those who brought this to my attention. Are you
looking for the application under which del_timer_sync showed up in
profiling, or for the kernel subsystem?

2005-03-21 00:54:25

by Christoph Lameter

Subject: Re: [patch] del_timer_sync scalability patch

On Sun, 20 Mar 2005, Andrew Morton wrote:

> > Hope Andrew is going to take the patch this time.
>
> Hope Kenneth is going to test the alternate del_timer_sync patches in next
> -mm ;)

BTW, why are we going through this? Oleg posted a much better solution
to this issue yesterday, AFAIK.

2005-03-21 01:00:47

by Andrew Morton

Subject: Re: [patch] del_timer_sync scalability patch

Christoph Lameter <[email protected]> wrote:
>
> On Sun, 20 Mar 2005, Andrew Morton wrote:
>
> > "Chen, Kenneth W" <[email protected]> wrote:
> > >
> > > We did exactly the same thing about 10 months back. Nice to
> > > see that independent people came up with exactly the same
> > > solution that we proposed 10 months back.
> >
> > Well the same question applies. Christoph, which code is calling
> > del_timer_sync() so often that you noticed?
>
> Ummm. I have to ask those who brought this to my attention. Are you
> looking for the application under which del_timer_sync showed up in
> profiling or the kernel subsystem?

I was wondering which part of the kernel was hammering del_timer_sync() so
hard. I guess we could work that out from a description of the workload.

2005-03-21 01:17:53

by Andrew Morton

Subject: Re: [patch] del_timer_sync scalability patch

Christoph Lameter <[email protected]> wrote:
>
> On Sun, 20 Mar 2005, Andrew Morton wrote:
>
> > > Hope Andrew is going to take the patch this time.
> >
> > Hope Kenneth is going to test the alternate del_timer_sync patches in next
> > -mm ;)
>
> BTW Why are we going through this? Oleg has posted a much better solution
> to this issue yersteday AFAIK.

That is what I was referring to.

Those patches need to be reviewed, performance tested and stability tested.
It's appropriate that interested parties do that work, please.