2005-04-01 11:52:56

by Oleg Nesterov

[permalink] [raw]
Subject: [RFC][PATCH] timers fixes/improvements

This patch replaces and updates 6 timer patches which are currently
in -mm tree. This version does not play games with __TIMER_PENDING
bit, so incremental patch is not suitable. It is against 2.6.12-rc1.
Please comment. I am sending pseudo code in a separate message for
easier review.


This patch tries to solve following problems:

1. del_timer_sync() is racy. The timer can be fired again after
del_timer_sync have checked all cpus and before it will recheck
timer_pending().

2. It has scalability problems. All cpus are scanned to determine
if the timer is running on that cpu.

With this patch del_timer_sync is O(1) and no slower than plain
del_timer(pending_timer), unless it has to actually wait for
completion of the currently running timer.

The only restriction is that the recurring timer should not use
add_timer_on().

3. The timers are not serialized wrt to itself.

If CPU_0 does mod_timer(jiffies+1) while the timer is currently
running on CPU 1, it is quite possible that local interrupt on
CPU_0 will start that timer before it finished on CPU_1.

4. The timers locking is suboptimal. __mod_timer() takes 3 locks
at once and still requires wmb() in del_timer/run_timers.

The new implementation takes 2 locks sequentially and does not
need memory barriers.


Currently ->base != NULL means that the timer is pending. In
that case ->base.lock is used to lock the timer. __mod_timer
also takes timer->lock because ->base can be == NULL.

This patch uses timer->entry.next != NULL as indication that
the timer is pending. So it does __list_del(), entry->next = NULL
instead of list_del() when the timer is deleted.

__run_timers(), del_timer() do not set ->base = NULL anymore, they
only clear pending flag, so that del_timer_sync() can wait while
timer->base->running_timer == timer.

The ->base field is used for hashed locking only. init_timer()
sets ->base = per_cpu(tvec_bases) and it is changed only in
__mod_timer(), when the timer changes CPU. Now the timer always
can be locked through ->base->lock, so timer_list->lock can be
killed and we don't need memory barriers in __run_timers/del_timer.

It is pointless to use ->base directly, it can change at any moment.
This patch adds lock_timer_base() helper, which locks timer's base,
and checks it is still the same. It also checks that ->base != NULL,
see below.

__mod_timer() do not locks both bases at once. It locks timer via
lock_timer_base(), __list_del(timer), sets ->base = NULL temporally,
and unlocks old_base. Now:
1. This timer can't be seen in __run_timers() via ->entry,
it is already removed from list.
2. Nobody can lock this timer because lock_timer_base()
waits for ->base != NULL.
3. If the timer was pending - it is still pending, so
concurrent del_timer() will spin in lock_timer_base().

Then __mod_timer() locks new_base and adds this timer. I hope it
may improve scalability of timers. It also simplifies the code,
because AB-BA deadlock is not possible.

On the other hand, when __run_timers()/migrate_timers() locks
tvec_bases->lock it can safely traverse ->tvX lists, all timers
have valid ->base != NULL and locked.

One problem. TIMER_INITIALIZER can't use per_cpu(tvec_bases). So
this patch adds global
struct timer_base_s {
spinlock_t lock;
struct timer_list *running_timer;
} __init_timer_base;
which is used by TIMER_INITIALIZER. The corresponding fields in
struct tvec_t_base_s are replaced by struct timer_base_s t_base.

It is indeed ugly. But this can't have scalability problems. The
global __init_timer_base.lock is used only when __mod_timer() is
called for the first time AND the timer was compile time initialized.
After that the timer migrates to the local CPU.

This patch lessens timer.o size even without deleting now unneeded
del_singleshot_timer_sync(). In my opinion it also simplifies the
code.

Signed-off-by: Oleg Nesterov <[email protected]>

--- 2.6.12-rc1/include/linux/timer.h~ 2004-09-13 09:32:56.000000000 +0400
+++ 2.6.12-rc1/include/linux/timer.h 2005-04-01 00:33:33.000000000 +0400
@@ -6,45 +6,33 @@
#include <linux/spinlock.h>
#include <linux/stddef.h>

-struct tvec_t_base_s;
+struct timer_base_s;

struct timer_list {
struct list_head entry;
unsigned long expires;

- spinlock_t lock;
unsigned long magic;

void (*function)(unsigned long);
unsigned long data;

- struct tvec_t_base_s *base;
+ struct timer_base_s *_base;
};

#define TIMER_MAGIC 0x4b87ad6e

+extern struct timer_base_s __init_timer_base;
+
#define TIMER_INITIALIZER(_function, _expires, _data) { \
.function = (_function), \
.expires = (_expires), \
.data = (_data), \
- .base = NULL, \
+ ._base = &__init_timer_base, \
.magic = TIMER_MAGIC, \
- .lock = SPIN_LOCK_UNLOCKED, \
}

-/***
- * init_timer - initialize a timer.
- * @timer: the timer to be initialized
- *
- * init_timer() must be done to a timer prior calling *any* of the
- * other timer functions.
- */
-static inline void init_timer(struct timer_list * timer)
-{
- timer->base = NULL;
- timer->magic = TIMER_MAGIC;
- spin_lock_init(&timer->lock);
-}
+void fastcall init_timer(struct timer_list * timer);

/***
* timer_pending - is a timer pending?
@@ -58,7 +46,7 @@ static inline void init_timer(struct tim
*/
static inline int timer_pending(const struct timer_list * timer)
{
- return timer->base != NULL;
+ return timer->entry.next != NULL;
}

extern void add_timer_on(struct timer_list *timer, int cpu);
@@ -89,12 +77,12 @@ static inline void add_timer(struct time

#ifdef CONFIG_SMP
extern int del_timer_sync(struct timer_list *timer);
- extern int del_singleshot_timer_sync(struct timer_list *timer);
#else
# define del_timer_sync(t) del_timer(t)
-# define del_singleshot_timer_sync(t) del_timer(t)
#endif

+#define del_singleshot_timer_sync(t) del_timer_sync(t)
+
extern void init_timers(void);
extern void run_local_timers(void);
extern void it_real_fn(unsigned long);
--- 2.6.12-rc1/kernel/timer.c~ 2005-03-19 14:16:53.000000000 +0300
+++ 2.6.12-rc1/kernel/timer.c 2005-04-01 16:09:56.000000000 +0400
@@ -57,6 +57,11 @@ static void time_interpolator_update(lon
#define TVN_MASK (TVN_SIZE - 1)
#define TVR_MASK (TVR_SIZE - 1)

+struct timer_base_s {
+ spinlock_t lock;
+ struct timer_list *running_timer;
+};
+
typedef struct tvec_s {
struct list_head vec[TVN_SIZE];
} tvec_t;
@@ -66,9 +71,8 @@ typedef struct tvec_root_s {
} tvec_root_t;

struct tvec_t_base_s {
- spinlock_t lock;
+ struct timer_base_s t_base;
unsigned long timer_jiffies;
- struct timer_list *running_timer;
tvec_root_t tv1;
tvec_t tv2;
tvec_t tv3;
@@ -77,18 +81,16 @@ struct tvec_t_base_s {
} ____cacheline_aligned_in_smp;

typedef struct tvec_t_base_s tvec_base_t;
+static DEFINE_PER_CPU(tvec_base_t, tvec_bases);

static inline void set_running_timer(tvec_base_t *base,
struct timer_list *timer)
{
#ifdef CONFIG_SMP
- base->running_timer = timer;
+ base->t_base.running_timer = timer;
#endif
}

-/* Fake initialization */
-static DEFINE_PER_CPU(tvec_base_t, tvec_bases) = { SPIN_LOCK_UNLOCKED };
-
static void check_timer_failed(struct timer_list *timer)
{
static int whine_count;
@@ -103,7 +105,6 @@ static void check_timer_failed(struct ti
/*
* Now fix it up
*/
- spin_lock_init(&timer->lock);
timer->magic = TIMER_MAGIC;
}

@@ -156,65 +157,96 @@ static void internal_add_timer(tvec_base
list_add_tail(&timer->entry, vec);
}

+typedef struct timer_base_s timer_base_t;
+timer_base_t __init_timer_base
+ ____cacheline_aligned_in_smp = { .lock = SPIN_LOCK_UNLOCKED };
+EXPORT_SYMBOL(__init_timer_base);
+
+/***
+ * init_timer - initialize a timer.
+ * @timer: the timer to be initialized
+ *
+ * init_timer() must be done to a timer prior calling *any* of the
+ * other timer functions.
+ */
+void fastcall init_timer(struct timer_list *timer)
+{
+ timer->entry.next = NULL;
+ timer->_base = &per_cpu(tvec_bases,
+ __smp_processor_id()).t_base;
+ timer->magic = TIMER_MAGIC;
+}
+EXPORT_SYMBOL(init_timer);
+
+static inline void detach_timer(struct timer_list *timer,
+ int clear_pending)
+{
+ struct list_head *entry = &timer->entry;
+
+ __list_del(entry->prev, entry->next);
+ if (clear_pending)
+ entry->next = NULL;
+ entry->prev = LIST_POISON2;
+}
+
+static timer_base_t *lock_timer_base(struct timer_list *timer,
+ unsigned long *flags)
+{
+ timer_base_t *base;
+
+ for (;;) {
+ base = timer->_base;
+ /* Can be NULL while __mod_timer switches bases */
+ if (likely(base != NULL)) {
+ spin_lock_irqsave(&base->lock, *flags);
+ if (likely(base == timer->_base))
+ return base;
+ spin_unlock_irqrestore(&base->lock, *flags);
+ }
+ cpu_relax();
+ }
+}
+
int __mod_timer(struct timer_list *timer, unsigned long expires)
{
- tvec_base_t *old_base, *new_base;
+ timer_base_t *base;
+ tvec_base_t *new_base;
unsigned long flags;
- int ret = 0;
+ int ret = -1;

BUG_ON(!timer->function);
-
check_timer(timer);

- spin_lock_irqsave(&timer->lock, flags);
- new_base = &__get_cpu_var(tvec_bases);
-repeat:
- old_base = timer->base;
+ do {
+ base = lock_timer_base(timer, &flags);
+ new_base = &__get_cpu_var(tvec_bases);

- /*
- * Prevent deadlocks via ordering by old_base < new_base.
- */
- if (old_base && (new_base != old_base)) {
- if (old_base < new_base) {
- spin_lock(&new_base->lock);
- spin_lock(&old_base->lock);
- } else {
- spin_lock(&old_base->lock);
- spin_lock(&new_base->lock);
+ /* Ensure the timer is serialized. */
+ if (base != &new_base->t_base
+ && base->running_timer == timer)
+ goto unlock;
+
+ ret = 0;
+ if (timer_pending(timer)) {
+ detach_timer(timer, 0);
+ ret = 1;
}
- /*
- * The timer base might have been cancelled while we were
- * trying to take the lock(s):
- */
- if (timer->base != old_base) {
- spin_unlock(&new_base->lock);
- spin_unlock(&old_base->lock);
- goto repeat;
- }
- } else {
- spin_lock(&new_base->lock);
- if (timer->base != old_base) {
- spin_unlock(&new_base->lock);
- goto repeat;
+
+ if (base != &new_base->t_base) {
+ timer->_base = NULL;
+ /* Safe: the timer can't be seen via ->entry,
+ * and lock_timer_base checks ->_base != 0. */
+ spin_unlock(&base->lock);
+ base = &new_base->t_base;
+ spin_lock(&base->lock);
+ timer->_base = base;
}
- }

- /*
- * Delete the previous timeout (if there was any), and install
- * the new one:
- */
- if (old_base) {
- list_del(&timer->entry);
- ret = 1;
- }
- timer->expires = expires;
- internal_add_timer(new_base, timer);
- timer->base = new_base;
-
- if (old_base && (new_base != old_base))
- spin_unlock(&old_base->lock);
- spin_unlock(&new_base->lock);
- spin_unlock_irqrestore(&timer->lock, flags);
+ timer->expires = expires;
+ internal_add_timer(new_base, timer);
+unlock:
+ spin_unlock_irqrestore(&base->lock, flags);
+ } while (ret < 0);

return ret;
}
@@ -232,15 +264,15 @@ void add_timer_on(struct timer_list *tim
{
tvec_base_t *base = &per_cpu(tvec_bases, cpu);
unsigned long flags;
-
+
BUG_ON(timer_pending(timer) || !timer->function);

check_timer(timer);

- spin_lock_irqsave(&base->lock, flags);
+ spin_lock_irqsave(&base->t_base.lock, flags);
+ timer->_base = &base->t_base;
internal_add_timer(base, timer);
- timer->base = base;
- spin_unlock_irqrestore(&base->lock, flags);
+ spin_unlock_irqrestore(&base->t_base.lock, flags);
}


@@ -295,27 +327,22 @@ EXPORT_SYMBOL(mod_timer);
*/
int del_timer(struct timer_list *timer)
{
+ timer_base_t *base;
unsigned long flags;
- tvec_base_t *base;
+ int ret = 0;

check_timer(timer);

-repeat:
- base = timer->base;
- if (!base)
- return 0;
- spin_lock_irqsave(&base->lock, flags);
- if (base != timer->base) {
+ if (timer_pending(timer)) {
+ base = lock_timer_base(timer, &flags);
+ if (timer_pending(timer)) {
+ detach_timer(timer, 1);
+ ret = 1;
+ }
spin_unlock_irqrestore(&base->lock, flags);
- goto repeat;
}
- list_del(&timer->entry);
- /* Need to make sure that anybody who sees a NULL base also sees the list ops */
- smp_wmb();
- timer->base = NULL;
- spin_unlock_irqrestore(&base->lock, flags);

- return 1;
+ return ret;
}

EXPORT_SYMBOL(del_timer);
@@ -332,72 +359,39 @@ EXPORT_SYMBOL(del_timer);
* Synchronization rules: callers must prevent restarting of the timer,
* otherwise this function is meaningless. It must not be called from
* interrupt contexts. The caller must not hold locks which would prevent
- * completion of the timer's handler. Upon exit the timer is not queued and
- * the handler is not running on any CPU.
+ * completion of the timer's handler. The timer's handler must not call
+ * add_timer_on(). Upon exit the timer is not queued and the handler is
+ * not running on any CPU.
*
* The function returns whether it has deactivated a pending timer or not.
- *
- * del_timer_sync() is slow and complicated because it copes with timer
- * handlers which re-arm the timer (periodic timers). If the timer handler
- * is known to not do this (a single shot timer) then use
- * del_singleshot_timer_sync() instead.
*/
int del_timer_sync(struct timer_list *timer)
{
- tvec_base_t *base;
- int i, ret = 0;
+ timer_base_t *base;
+ unsigned long flags;
+ int ret = -1;

check_timer(timer);

-del_again:
- ret += del_timer(timer);
+ do {
+ base = lock_timer_base(timer, &flags);

- for_each_online_cpu(i) {
- base = &per_cpu(tvec_bases, i);
- if (base->running_timer == timer) {
- while (base->running_timer == timer) {
- cpu_relax();
- preempt_check_resched();
- }
- break;
+ if (base->running_timer == timer)
+ goto unlock;
+
+ ret = 0;
+ if (timer_pending(timer)) {
+ detach_timer(timer, 1);
+ ret = 1;
}
- }
- smp_rmb();
- if (timer_pending(timer))
- goto del_again;
+unlock:
+ spin_unlock_irqrestore(&base->lock, flags);
+ } while (ret < 0);

return ret;
}
-EXPORT_SYMBOL(del_timer_sync);
-
-/***
- * del_singleshot_timer_sync - deactivate a non-recursive timer
- * @timer: the timer to be deactivated
- *
- * This function is an optimization of del_timer_sync for the case where the
- * caller can guarantee the timer does not reschedule itself in its timer
- * function.
- *
- * Synchronization rules: callers must prevent restarting of the timer,
- * otherwise this function is meaningless. It must not be called from
- * interrupt contexts. The caller must not hold locks which wold prevent
- * completion of the timer's handler. Upon exit the timer is not queued and
- * the handler is not running on any CPU.
- *
- * The function returns whether it has deactivated a pending timer or not.
- */
-int del_singleshot_timer_sync(struct timer_list *timer)
-{
- int ret = del_timer(timer);

- if (!ret) {
- ret = del_timer_sync(timer);
- BUG_ON(ret);
- }
-
- return ret;
-}
-EXPORT_SYMBOL(del_singleshot_timer_sync);
+EXPORT_SYMBOL(del_timer_sync);
#endif

static int cascade(tvec_base_t *base, tvec_t *tv, int index)
@@ -415,7 +409,7 @@ static int cascade(tvec_base_t *base, tv
struct timer_list *tmp;

tmp = list_entry(curr, struct timer_list, entry);
- BUG_ON(tmp->base != base);
+ BUG_ON(tmp->_base != &base->t_base);
curr = curr->next;
internal_add_timer(base, tmp);
}
@@ -437,7 +431,7 @@ static inline void __run_timers(tvec_bas
{
struct timer_list *timer;

- spin_lock_irq(&base->lock);
+ spin_lock_irq(&base->t_base.lock);
while (time_after_eq(jiffies, base->timer_jiffies)) {
struct list_head work_list = LIST_HEAD_INIT(work_list);
struct list_head *head = &work_list;
@@ -462,11 +456,9 @@ repeat:
fn = timer->function;
data = timer->data;

- list_del(&timer->entry);
set_running_timer(base, timer);
- smp_wmb();
- timer->base = NULL;
- spin_unlock_irq(&base->lock);
+ detach_timer(timer, 1);
+ spin_unlock_irq(&base->t_base.lock);
{
u32 preempt_count = preempt_count();
fn(data);
@@ -475,12 +467,12 @@ repeat:
BUG();
}
}
- spin_lock_irq(&base->lock);
+ spin_lock_irq(&base->t_base.lock);
goto repeat;
}
}
set_running_timer(base, NULL);
- spin_unlock_irq(&base->lock);
+ spin_unlock_irq(&base->t_base.lock);
}

#ifdef CONFIG_NO_IDLE_HZ
@@ -1286,9 +1278,9 @@ static void __devinit init_timers_cpu(in
{
int j;
tvec_base_t *base;
-
+
base = &per_cpu(tvec_bases, cpu);
- spin_lock_init(&base->lock);
+ spin_lock_init(&base->t_base.lock);
for (j = 0; j < TVN_SIZE; j++) {
INIT_LIST_HEAD(base->tv5.vec + j);
INIT_LIST_HEAD(base->tv4.vec + j);
@@ -1302,22 +1294,16 @@ static void __devinit init_timers_cpu(in
}

#ifdef CONFIG_HOTPLUG_CPU
-static int migrate_timer_list(tvec_base_t *new_base, struct list_head *head)
+static void migrate_timer_list(tvec_base_t *new_base, struct list_head *head)
{
struct timer_list *timer;

while (!list_empty(head)) {
timer = list_entry(head->next, struct timer_list, entry);
- /* We're locking backwards from __mod_timer order here,
- beware deadlock. */
- if (!spin_trylock(&timer->lock))
- return 0;
- list_del(&timer->entry);
+ detach_timer(timer, 0);
+ timer->_base = &new_base->t_base;
internal_add_timer(new_base, timer);
- timer->base = new_base;
- spin_unlock(&timer->lock);
}
- return 1;
}

static void __devinit migrate_timers(int cpu)
@@ -1331,39 +1317,24 @@ static void __devinit migrate_timers(int
new_base = &get_cpu_var(tvec_bases);

local_irq_disable();
-again:
- /* Prevent deadlocks via ordering by old_base < new_base. */
- if (old_base < new_base) {
- spin_lock(&new_base->lock);
- spin_lock(&old_base->lock);
- } else {
- spin_lock(&old_base->lock);
- spin_lock(&new_base->lock);
- }
+ spin_lock(&new_base->t_base.lock);
+ spin_lock(&old_base->t_base.lock);

- if (old_base->running_timer)
+ if (old_base->t_base.running_timer)
BUG();
for (i = 0; i < TVR_SIZE; i++)
- if (!migrate_timer_list(new_base, old_base->tv1.vec + i))
- goto unlock_again;
- for (i = 0; i < TVN_SIZE; i++)
- if (!migrate_timer_list(new_base, old_base->tv2.vec + i)
- || !migrate_timer_list(new_base, old_base->tv3.vec + i)
- || !migrate_timer_list(new_base, old_base->tv4.vec + i)
- || !migrate_timer_list(new_base, old_base->tv5.vec + i))
- goto unlock_again;
- spin_unlock(&old_base->lock);
- spin_unlock(&new_base->lock);
+ migrate_timer_list(new_base, old_base->tv1.vec + i);
+ for (i = 0; i < TVN_SIZE; i++) {
+ migrate_timer_list(new_base, old_base->tv2.vec + i);
+ migrate_timer_list(new_base, old_base->tv3.vec + i);
+ migrate_timer_list(new_base, old_base->tv4.vec + i);
+ migrate_timer_list(new_base, old_base->tv5.vec + i);
+ }
+
+ spin_unlock(&old_base->t_base.lock);
+ spin_unlock(&new_base->t_base.lock);
local_irq_enable();
put_cpu_var(tvec_bases);
- return;
-
-unlock_again:
- /* Avoid deadlock with __mod_timer, by backing off. */
- spin_unlock(&old_base->lock);
- spin_unlock(&new_base->lock);
- cpu_relax();
- goto again;
}
#endif /* CONFIG_HOTPLUG_CPU */

_


2005-04-01 11:54:54

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC][PATCH] timers fixes/improvements

Here it is code snippets for easier review.

typedef struct timer_base_s {
spinlock_t lock;
struct timer_list *running_timer;
} timer_base_t;

struct tvec_t_base_s {
timer_base_t t_base;
...
} tvec_bases[];

struct timer_list {
...
timer_base_t *_base;
};

int timer_pending(struct timer_list * timer)
{
return timer->entry.next != NULL;
}

void init_timer(struct timer_list *timer)
{
timer->entry.next = NULL;
timer->_base = &per_cpu(tvec_bases).t_base;
}

timer_base_t *lock_timer_base(struct timer_list *timer, unsigned long *flags)
{
for (;;) {
timer_base_t *base = timer->_base;
if (base != NULL) {
spin_lock_irqsave(&base->lock, *flags);
if (base == timer->_base)
return base;
spin_unlock_irqrestore(&base->lock, *flags);
}
}
}

int __mod_timer(struct timer_list *timer, unsigned long expires)
{
timer_base_t *base;
tvec_base_t *new_base;
int ret = -1;

do {
base = lock_timer_base(timer, &flags);
new_base = &__get_cpu_var(tvec_bases);

/* Ensure the timer is serialized. */
if (base != &new_base->t_base
&& base->running_timer == timer)
goto unlock;

ret = 0;
if (timer_pending(timer)) {
__list_del(timer->entry);
ret = 1;
}

if (base != &new_base->t_base) {
timer->_base = NULL;
spin_unlock(&base->lock);
base = &new_base->t_base;
spin_lock(&base->lock);
timer->_base = base;
}

timer->expires = expires;
internal_add_timer(new_base, timer);
unlock:
spin_unlock_irqrestore(&base->lock, flags);
} while (ret < 0);

return ret;
}

void detach_timer(struct timer_list *timer)
{
__list_del(timer->entry);
entry->next = NULL;
}

int del_timer(struct timer_list *timer)
{
int ret = 0;

if (timer_pending(timer)) {
timer_base_t *base = lock_timer_base(timer);
if (timer_pending(timer)) {
detach_timer(timer);
ret = 1;
}
spin_unlock_irqrestore(&base->lock, flags);
}

return ret;
}

int del_timer_sync(struct timer_list *timer)
{
int ret = -1;

do {
timer_base_t *base = lock_timer_base(timer);

if (base->running_timer == timer)
goto unlock;

ret = 0;
if (timer_pending(timer)) {
detach_timer(timer);
ret = 1;
}
unlock:
spin_unlock_irqrestore(&base->lock, flags);
} while (ret < 0);

return ret;
}

void __run_timers(tvec_base_t *base)
{
spin_lock_irq(&base->t_base.lock);

for_each_expired_timer(timer) {
base->t_base.running_timer = timer;
detach_timer(timer);
spin_unlock(&base->t_base.lock);
timer->function(data);
spin_lock(&base->t_base.lock);
}
}

2005-04-01 13:00:36

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC][PATCH] timers fixes/improvements


* Oleg Nesterov <[email protected]> wrote:

> This patch replaces and updates 6 timer patches which are currently in
> -mm tree. This version does not play games with __TIMER_PENDING bit,
> so incremental patch is not suitable. It is against 2.6.12-rc1. Please
> comment. I am sending pseudo code in a separate message for easier
> review.

i like it. This is nice too:

2 files changed, 152 insertions(+), 193 deletions(-)

Acked-by: Ingo Molnar <[email protected]>

Ingo

2005-04-01 13:09:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC][PATCH] timers fixes/improvements


* Oleg Nesterov <[email protected]> wrote:

> struct timer_list {
> ...
> timer_base_t *_base;
> };

namespace cleanliness: i'd suggest s/_base/base.

another detail:

> int __mod_timer(struct timer_list *timer, unsigned long expires)
[...]
> /* Ensure the timer is serialized. */
> if (base != &new_base->t_base
> && base->running_timer == timer)
> goto unlock;

> unlock:
> spin_unlock_irqrestore(&base->lock, flags);
> } while (ret < 0);

so we keep looping in __mod_timer() when the timer is running? Couldnt
this be a performance hit?

Ingo

2005-04-01 13:46:10

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC][PATCH] timers fixes/improvements

Ingo Molnar wrote:
>
> * Oleg Nesterov <[email protected]> wrote:
>
> > struct timer_list {
> > ...
> > timer_base_t *_base;
> > };
>
> namespace cleanliness: i'd suggest s/_base/base.

I deliberately renamed it to '_base' because then it is much more grepable.
But I don't mind doing s/_base/base/ if you prefer.

> > int __mod_timer(struct timer_list *timer, unsigned long expires)
> [...]
> > /* Ensure the timer is serialized. */
> > if (base != &new_base->t_base
> > && base->running_timer == timer)
> > goto unlock;
>
> > unlock:
> > spin_unlock_irqrestore(&base->lock, flags);
> > } while (ret < 0);
>
> so we keep looping in __mod_timer() when the timer is running? Couldnt
> this be a performance hit?

I hope it is unlikely that __mod_timer() would hit the already running timer,
so hopefully this will not degrade the performance. And I don't see a simple
alternative to ensure the timer's serialization. At least it spins without
interrupts disabling.

Oleg.

2005-04-01 16:06:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC][PATCH] timers fixes/improvements



On Fri, 1 Apr 2005, Oleg Nesterov wrote:
>
> This patch replaces and updates 6 timer patches which are currently
> in -mm tree. This version does not play games with __TIMER_PENDING
> bit, so incremental patch is not suitable. It is against 2.6.12-rc1.
> Please comment. I am sending pseudo code in a separate message for
> easier review.

Looks ok by me. Andrew, should we let it cook in -mm, or what?

Linus

2005-04-01 18:36:46

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC][PATCH] timers fixes/improvements

Linus Torvalds <[email protected]> wrote:
>
>
>
> On Fri, 1 Apr 2005, Oleg Nesterov wrote:
> >
> > This patch replaces and updates 6 timer patches which are currently
> > in -mm tree. This version does not play games with __TIMER_PENDING
> > bit, so incremental patch is not suitable. It is against 2.6.12-rc1.
> > Please comment. I am sending pseudo code in a separate message for
> > easier review.
>
> Looks ok by me. Andrew, should we let it cook in -mm, or what?
>

Sure. Christoph and (I think) Ken have been seeing mysterious misbehaviour
which _might_ be due to Oleg's first round of timer patches. I assume C&K
will test this new patch?

2005-04-01 18:48:18

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC][PATCH] timers fixes/improvements

On Fri, 1 Apr 2005, Andrew Morton wrote:

> Sure. Christoph and (I think) Ken have been seeing mysterious misbehaviour
> which _might_ be due to Oleg's first round of timer patches. I assume C&K
> will test this new patch?

Yes will be tested. The hangs disappeared here when we removed Oleg's
patches.

2005-04-01 18:51:30

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: [RFC][PATCH] timers fixes/improvements

Linus Torvalds <[email protected]> wrote:
> On Fri, 1 Apr 2005, Oleg Nesterov wrote:
> >
> > This patch replaces and updates 6 timer patches which are currently
> > in -mm tree. This version does not play games with __TIMER_PENDING
> > bit, so incremental patch is not suitable. It is against 2.6.12-rc1.
> > Please comment. I am sending pseudo code in a separate message for
> > easier review.
>
> Looks ok by me. Andrew, should we let it cook in -mm, or what?
>

Andrew Morton wrote on Friday, April 01, 2005 10:33 AM
> Sure. Christoph and (I think) Ken have been seeing mysterious misbehaviour
> which _might_ be due to Oleg's first round of timer patches. I assume C&K
> will test this new patch?

Yes, we saw kernel hang with previous timer patches. I will give this one
a try.


2005-04-02 09:16:22

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC][PATCH] timers fixes/improvements

Oh, It never ends. Andrew, please apply this too.
Or I can send you updated previous patch if your prefer.

[PATCH] timers fixes/improvements fix

1. Fix compilation with CONFIG_NO_IDLE_HZ:
s/->lock/->t_base.lock/

2. s/_base/base as Ingo suggested.

Signed-off-by: Oleg Nesterov <[email protected]>

--- 2.6.12-rc1/include/linux/timer.h~FIX 2005-04-01 00:33:33.000000000 +0400
+++ 2.6.12-rc1/include/linux/timer.h 2005-04-02 16:08:45.000000000 +0400
@@ -17,7 +17,7 @@ struct timer_list {
void (*function)(unsigned long);
unsigned long data;

- struct timer_base_s *_base;
+ struct timer_base_s *base;
};

#define TIMER_MAGIC 0x4b87ad6e
@@ -28,7 +28,7 @@ extern struct timer_base_s __init_timer_
.function = (_function), \
.expires = (_expires), \
.data = (_data), \
- ._base = &__init_timer_base, \
+ .base = &__init_timer_base, \
.magic = TIMER_MAGIC, \
}

--- 2.6.12-rc1/kernel/timer.c~FIX 2005-04-02 15:49:13.000000000 +0400
+++ 2.6.12-rc1/kernel/timer.c 2005-04-02 16:09:11.000000000 +0400
@@ -172,7 +172,7 @@ EXPORT_SYMBOL(__init_timer_base);
void fastcall init_timer(struct timer_list *timer)
{
timer->entry.next = NULL;
- timer->_base = &per_cpu(tvec_bases,
+ timer->base = &per_cpu(tvec_bases,
__smp_processor_id()).t_base;
timer->magic = TIMER_MAGIC;
}
@@ -195,11 +195,11 @@ static timer_base_t *lock_timer_base(str
timer_base_t *base;

for (;;) {
- base = timer->_base;
+ base = timer->base;
/* Can be NULL while __mod_timer switches bases */
if (likely(base != NULL)) {
spin_lock_irqsave(&base->lock, *flags);
- if (likely(base == timer->_base))
+ if (likely(base == timer->base))
return base;
spin_unlock_irqrestore(&base->lock, *flags);
}
@@ -233,13 +233,13 @@ int __mod_timer(struct timer_list *timer
}

if (base != &new_base->t_base) {
- timer->_base = NULL;
+ timer->base = NULL;
/* Safe: the timer can't be seen via ->entry,
- * and lock_timer_base checks ->_base != 0. */
+ * and lock_timer_base checks ->base != 0. */
spin_unlock(&base->lock);
base = &new_base->t_base;
spin_lock(&base->lock);
- timer->_base = base;
+ timer->base = base;
}

timer->expires = expires;
@@ -270,7 +270,7 @@ void add_timer_on(struct timer_list *tim
check_timer(timer);

spin_lock_irqsave(&base->t_base.lock, flags);
- timer->_base = &base->t_base;
+ timer->base = &base->t_base;
internal_add_timer(base, timer);
spin_unlock_irqrestore(&base->t_base.lock, flags);
}
@@ -409,7 +409,7 @@ static int cascade(tvec_base_t *base, tv
struct timer_list *tmp;

tmp = list_entry(curr, struct timer_list, entry);
- BUG_ON(tmp->_base != &base->t_base);
+ BUG_ON(tmp->base != &base->t_base);
curr = curr->next;
internal_add_timer(base, tmp);
}
@@ -491,7 +491,7 @@ unsigned long next_timer_interrupt(void)
int i, j;

base = &__get_cpu_var(tvec_bases);
- spin_lock(&base->lock);
+ spin_lock(&base->t_base.lock);
expires = base->timer_jiffies + (LONG_MAX >> 1);
list = 0;

@@ -539,7 +539,7 @@ found:
expires = nte->expires;
}
}
- spin_unlock(&base->lock);
+ spin_unlock(&base->t_base.lock);
return expires;
}
#endif
@@ -1301,7 +1301,7 @@ static void migrate_timer_list(tvec_base
while (!list_empty(head)) {
timer = list_entry(head->next, struct timer_list, entry);
detach_timer(timer, 0);
- timer->_base = &new_base->t_base;
+ timer->base = &new_base->t_base;
internal_add_timer(new_base, timer);
}
}

2005-04-02 10:07:29

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC][PATCH] timers fixes/improvements

Oleg Nesterov <[email protected]> wrote:
>
> +void fastcall init_timer(struct timer_list *timer)
> +{
> + timer->entry.next = NULL;
> + timer->_base = &per_cpu(tvec_bases,
> + __smp_processor_id()).t_base;
> + timer->magic = TIMER_MAGIC;
> +}

__smp_processor_id() is not implemented on all architectures. I'll switch
this to _smp_processor_id().

The smp_processor_id() stuff is all rather a twisty maze (looks at Ingo).

It's a rather odd thing which you're doing there. Why does a
not-yet-scheduled timer need a ->_base?


2005-04-02 10:56:37

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC][PATCH] timers fixes/improvements

Andrew Morton wrote:
>
> Oleg Nesterov <[email protected]> wrote:
> >
> > +void fastcall init_timer(struct timer_list *timer)
> > +{
> > + timer->entry.next = NULL;
> > + timer->_base = &per_cpu(tvec_bases,
> > + __smp_processor_id()).t_base;
> > + timer->magic = TIMER_MAGIC;
> > +}
>
> __smp_processor_id() is not implemented on all architectures. I'll switch
> this to _smp_processor_id().

Wow, I did not know.

> It's a rather odd thing which you're doing there. Why does a
> not-yet-scheduled timer need a ->_base?

Because all locking goes through timer_list->base->lock now.
That is why timer_list->lock can be deleted. The timer is
always locked via loc_timer_base().

timer->base == NULL only temporally when __mod_timer() does
while switching timer's base:
base = lock_timer_base(timer);
timer->base = NULL;
unlock(base->lock);
// Nobody can use this timer, lock_timer_base()
// will spin waiting for ->base != 0
lock(new_base->lock);
timer->base = new_base;
unlock(new_base);

So ->base == NULL means that timer itself is locked, not it's
base. That is why __mod_timer() do not need to hold 2 spinlocks
at once.

Oleg.

2005-04-02 15:06:56

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC][PATCH] timers fixes/improvements


* Andrew Morton <[email protected]> wrote:

> Oleg Nesterov <[email protected]> wrote:
> >
> > +void fastcall init_timer(struct timer_list *timer)
> > +{
> > + timer->entry.next = NULL;
> > + timer->_base = &per_cpu(tvec_bases,
> > + __smp_processor_id()).t_base;
> > + timer->magic = TIMER_MAGIC;
> > +}
>
> __smp_processor_id() is not implemented on all architectures. I'll switch
> this to _smp_processor_id().
>
> The smp_processor_id() stuff is all rather a twisty maze (looks at Ingo).

_smp_processor_id() is fine in this case - we pick a CPU at init_timer()
time, but we dont need to be non-preemptible when we do so.

Ingo

2005-05-09 20:35:26

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC][PATCH] timers fixes/improvements

We have some strange race conditions as a result of the timer
scalability fixes.

ptype_all is set to 0x10:0x10 on faster systems (Xeon 3.6Ghz).
Slower systems do fine(Xeon 3.0Ghz) and do not corrupt ptype_all.

Its not clear to me how ptype_all could relate to timer operations but
if I apply these timer patches I get ptype_all corruption.

timers-fixes-improvements.patch
timers-fixes-improvements-smp_processor_id-fix.patch
timers-fixes-improvements-fix.patch
timer-deadlock-fix
(It does not matter if the last three are applied)

I put some printk's in an get the following output (2.6.12-rc4 + patches):

net_dev_init: ptype_all = c0569e20:c0569e20
Machine check exception polling timer started.
No per-cpu room for modules.
highmem bounce pool size: 64 pages
Installing knfsd (copyright (C) 1996 [email protected]).
SGI XFS with large block numbers, no debug enabled
Intel E7520/7320/7525 detected.<6>ACPI: Power Button (FF) [PWRF]
PNP: No PS/2 controller found. Probing ports directly.
serio: i8042 AUX port at 0x60,0x64 irq 12
serio: i8042 KBD port at 0x60,0x64 irq 1
Serial: 8250/16550 driver $Revision: 1.90 $ 8 ports, IRQ sharing disabled
ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered
floppy0: no floppy controllers found
loop: loaded (max 8 devices)
Intel(R) PRO/1000 Network Driver - version 5.7.6-k2
Copyright (c) 1999-2004 Intel Corporation.
ACPI: PCI Interrupt
register_netdevice eth%d: ptype_all = 00000010:00000010
ptype_all corrupted = 00000010:00000010. ptype_all fixed up
10 9e 56 c0 10 9e 56 c0 18 9e 56 c0 18 9e 56 c0 10 00 00 00 10 00 00 00 00 00 00 00 00 00 00 00
e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
ACPI: PCI Interrupt
register_netdevice eth%d: ptype_all = c0569e20:c0569e20

2005-05-09 21:42:43

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC][PATCH] timers fixes/improvements

Christoph Lameter <[email protected]> wrote:
>
> We have some strange race conditions as a result of the timer
> scalability fixes.
>
> ptype_all is set to 0x10:0x10 on faster systems (Xeon 3.6Ghz).
> Slower systems do fine(Xeon 3.0Ghz) and do not corrupt ptype_all.
>
> Its not clear to me how ptype_all could relate to timer operations

I'd do `nm -n vmlinux', see which data structures the linker placed nearby
to ptype_all and then go looking for overruns against them.

ptype_base is an array, but I cannot see any race around ptype_base. So
look to see if ptype_base is corrupted as well, keep walking back through
memory, see where the scribble starts.

> but
> if I apply these timer patches I get ptype_all corruption.
>
> timers-fixes-improvements.patch
> timers-fixes-improvements-smp_processor_id-fix.patch
> timers-fixes-improvements-fix.patch
> timer-deadlock-fix
> (It does not matter if the last three are applied)

2.6.12-rc3-mm3 has different patches:

timers-fixes-improvements.patch
timers-fixes-improvements-smp_processor_id-fix.patch
timers-fixes-improvements-fix.patch
timers-fix-__mod_timer-vs-__run_timers-deadlock.patch
timers-fix-__mod_timer-vs-__run_timers-deadlock-tidy.patch
timers-comments-update.patch
kernel-timerc-remove-a-goto-construct.patch

2005-05-09 21:51:17

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC][PATCH] timers fixes/improvements

On Mon, 9 May 2005, Andrew Morton wrote:

> ptype_base is an array, but I cannot see any race around ptype_base. So
> look to see if ptype_base is corrupted as well, keep walking back through
> memory, see where the scribble starts.

There is no corruption around ptype_all as you can see from the log. There
is a list of hex numbers which are from ptype_all -8 to ptype_all +8.
Looks okay to me.

> 2.6.12-rc3-mm3 has different patches:
>
> timers-fixes-improvements.patch
> timers-fixes-improvements-smp_processor_id-fix.patch
> timers-fixes-improvements-fix.patch
> timers-fix-__mod_timer-vs-__run_timers-deadlock.patch
> timers-fix-__mod_timer-vs-__run_timers-deadlock-tidy.patch
> timers-comments-update.patch
> kernel-timerc-remove-a-goto-construct.patch

Ok. Will try these.

2005-05-09 23:10:12

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC][PATCH] timers fixes/improvements

On Mon, 9 May 2005, Andrew Morton wrote:

> 2.6.12-rc3-mm3 has different patches:
>
> timers-fixes-improvements.patch
> timers-fixes-improvements-smp_processor_id-fix.patch
> timers-fixes-improvements-fix.patch
> timers-fix-__mod_timer-vs-__run_timers-deadlock.patch
> timers-fix-__mod_timer-vs-__run_timers-deadlock-tidy.patch
> timers-comments-update.patch
> kernel-timerc-remove-a-goto-construct.patch

Tried these. Result is the same:

register_netdevice eth%d: ptype_all = 00000010:00000010
ptype_all corrupted = 00000010:00000010. ptype_all fixed up
10 9e 56 c0 10 9e 56 c0 18 9e 56 c0 18 9e 56 c0 10 00 00 00 10 00 00 00 00
00 00 00 00 00 00 00

2005-05-10 10:15:42

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC][PATCH] timers fixes/improvements

Christoph Lameter wrote:
>
> On Mon, 9 May 2005, Andrew Morton wrote:
>
> > ptype_base is an array, but I cannot see any race around ptype_base. So
> > look to see if ptype_base is corrupted as well, keep walking back through
> > memory, see where the scribble starts.
>
> There is no corruption around ptype_all as you can see from the log. There
> is a list of hex numbers which are from ptype_all -8 to ptype_all +8.
> Looks okay to me.

Still ptype_all could be accessed (and corrupted) as ptype_base[16].

Christoph, could you please reboot with this patch?
Just to be sure.

--- 2.6.12-rc4/net/core/dev.c~ Tue May 10 12:24:11 2005
+++ 2.6.12-rc4/net/core/dev.c Tue May 10 14:19:29 2005
@@ -157,7 +157,9 @@

static DEFINE_SPINLOCK(ptype_lock);
static struct list_head ptype_base[16]; /* 16 way hashed list */
+static unsigned long ptype_before[4] = { [0 ... 3] = -1 };
static struct list_head ptype_all; /* Taps */
+static unsigned long ptype_after[4] = { [0 ... 3] = -1 };

#ifdef OFFLINE_SAMPLE
static void sample_queue(unsigned long dummy);

2005-05-10 19:15:43

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC][PATCH] timers fixes/improvements

On Tue, 10 May 2005, Oleg Nesterov wrote:

> > There is no corruption around ptype_all as you can see from the log. There
> > is a list of hex numbers which are from ptype_all -8 to ptype_all +8.
> > Looks okay to me.
>
> Still ptype_all could be accessed (and corrupted) as ptype_base[16].
>
> Christoph, could you please reboot with this patch?

Ok. I added padding before and after ptype_all.
With padding the problem no longer occurs.

However, if the padding is put before ptype_base and after ptype_all
then the problem occurs.

So it looks like this is due to writes intended for ptype_base going
out of bounds. However, there nothing in the code in net/core/dev.c
that would allow this to happen. Also why is the list head set
to 0x10:0x10?

2005-05-11 01:19:15

by George Anzinger

[permalink] [raw]
Subject: Re: [RFC][PATCH] timers fixes/improvements

I wonder if this problem might benifit from the "breakpoint on write" capability
in kgdb. If you are using the kgdb in the mm patch, look in
Documentation/i386/kgdb/gdbinit.hw at the hwwbrk macro. You will, of course,
have to source this file from gdb to load the macro. Then you can use the gdb
command: help hwwbrk to get info on how to use it.

If the location is not written to too often this should help find the offender.

George
--

Christoph Lameter wrote:
> On Tue, 10 May 2005, Oleg Nesterov wrote:
>
>
>>>There is no corruption around ptype_all as you can see from the log. There
>>>is a list of hex numbers which are from ptype_all -8 to ptype_all +8.
>>>Looks okay to me.
>>
>>Still ptype_all could be accessed (and corrupted) as ptype_base[16].
>>
>>Christoph, could you please reboot with this patch?
>
>
> Ok. I added padding before and after ptype_all.
> With padding the problem no longer occurs.
>
> However, if the padding is put before ptype_base and after ptype_all
> then the problem occurs.
>
> So it looks like this is due to writes intended for ptype_base going
> out of bounds. However, there nothing in the code in net/core/dev.c
> that would allow this to happen. Also why is the list head set
> to 0x10:0x10?
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/

2005-05-11 10:11:23

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC][PATCH] timers fixes/improvements

Christoph Lameter wrote:
>
> On Tue, 10 May 2005, Oleg Nesterov wrote:
>
> > > There is no corruption around ptype_all as you can see from the log. There
> > > is a list of hex numbers which are from ptype_all -8 to ptype_all +8.
> > > Looks okay to me.
> >
> > Still ptype_all could be accessed (and corrupted) as ptype_base[16].
>
> Ok. I added padding before and after ptype_all.
> With padding the problem no longer occurs.
>
> However, if the padding is put before ptype_base and after ptype_all
> then the problem occurs.

So. ptype_base/ptype_all is corrupted before e1000_probe()->register_netdev().

Christoph, please, could you try this patch?

Make sure you are booting with 'init=/bin/sh', kernel should oops.

My kernel oops (as expected) in arp_init()->dev_add_pack(), after 2 successful
register_netdevice() calls.

Oleg.

--- 2.6.12-rc4/arch/i386/kernel/cpu/common.c~HACK 2005-05-09 16:36:52.000000000 +0400
+++ 2.6.12-rc4/arch/i386/kernel/cpu/common.c 2005-05-11 16:51:29.000000000 +0400
@@ -542,7 +542,7 @@ void __init early_cpu_init(void)
umc_init_cpu();
early_cpu_detect();

-#ifdef CONFIG_DEBUG_PAGEALLOC
+#if 1
/* pse is not compatible with on-the-fly unmapping,
* disable it even if the cpus claim to support it.
*/
--- 2.6.12-rc4/net/core/dev.c~HACK 2005-05-09 16:37:16.000000000 +0400
+++ 2.6.12-rc4/net/core/dev.c 2005-05-11 17:51:07.000000000 +0400
@@ -156,8 +156,19 @@
*/

static DEFINE_SPINLOCK(ptype_lock);
-static struct list_head ptype_base[16]; /* 16 way hashed list */
-static struct list_head ptype_all; /* Taps */
+
+static struct {
+ char pad_start[512];
+
+ struct list_head _ptype_base[16]; /* 16 way hashed list */
+ struct list_head _ptype_all; /* Taps */
+
+ char pad_end[PAGE_SIZE - 512 - (16+1) * sizeof(struct list_head)];
+
+} PTYPE_PAGE __attribute__((__aligned__(PAGE_SIZE)));
+
+#define ptype_base (PTYPE_PAGE._ptype_base)
+#define ptype_all (PTYPE_PAGE._ptype_all)

#ifdef OFFLINE_SAMPLE
static void sample_queue(unsigned long dummy);
@@ -2727,6 +2738,8 @@ int register_netdevice(struct net_device
struct hlist_node *p;
int ret;

+ printk(KERN_CRIT "%s: ENTER\n", __FUNCTION__);
+
BUG_ON(dev_boot_phase);
ASSERT_RTNL();

@@ -2828,6 +2841,7 @@ int register_netdevice(struct net_device
ret = 0;

out:
+ printk(KERN_CRIT "%s: LEAVE=%d\n", __FUNCTION__, ret);
return ret;
out_err:
free_divert_blk(dev);
@@ -3255,6 +3269,33 @@ static int dev_cpu_callback(struct notif
*
*/

+#define CK(e) do { if (!(e)) { \
+ printk(KERN_CRIT "ERR!! %d %s(%s)\n", __LINE__, __FUNCTION__, #e); \
+ mdelay(2000); \
+}} while (0)
+
+#include <asm/tlbflush.h>
+
+static void mk_writable(int yes)
+{
+ unsigned long addr = (unsigned long)&PTYPE_PAGE;
+ pte_t *pte;
+
+ CK(!(addr & ~PAGE_MASK));
+ CK(sizeof(PTYPE_PAGE) == PAGE_SIZE);
+
+ pte = lookup_address(addr);
+
+ CK(pte); CK(!(pte_val(*pte) & _PAGE_PSE));
+
+ if (yes)
+ set_pte_atomic(pte, pte_mkwrite(*pte));
+ else
+ set_pte_atomic(pte, pte_wrprotect(*pte));
+
+ flush_tlb_all();
+}
+
/*
* This is called single threaded during boot, so no need
* to take the rtnl semaphore.
@@ -3277,6 +3318,8 @@ static int __init net_dev_init(void)
for (i = 0; i < 16; i++)
INIT_LIST_HEAD(&ptype_base[i]);

+ mk_writable(0);
+
for (i = 0; i < ARRAY_SIZE(dev_name_head); i++)
INIT_HLIST_HEAD(&dev_name_head[i]);

-

2005-05-11 14:59:19

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC][PATCH] timers fixes/improvements

Btw, i think that drivers/net/e1000/e1000_main.c:e1000_down() is buggy.

It calls del_timer_sync(&adapter->watchdog_timer), but e1000_watchdog()
calls schedule_work(e1000_watchdog_task), so the work could be queued
after del_timer_sync().

And e1000_watchdog_task() arms timers again.

Note that it's not enough to do flush_scheduled_work() here.

Oleg.

2005-05-11 15:13:30

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC][PATCH] timers fixes/improvements

On Wed, 11 May 2005, Oleg Nesterov wrote:

> > However, if the padding is put before ptype_base and after ptype_all
> > then the problem occurs.
>
> So. ptype_base/ptype_all is corrupted before e1000_probe()->register_netdev().
>
> Christoph, please, could you try this patch?

We found that this has nothing to do with the timer patches. There is a
scribble in pcie_rootport_aspm_quirk that overwrites ptype_all.

quirk_aspm_offset[GET_INDEX(pdev->device, dev->devfn)]= cap_base + 0x10;

does the evil deed. The array offset calculated by GET_INDEX is out of
bounds.

The definition of GET_INDEX is suspect:

#define GET_INDEX(a, b) (((a - PCI_DEVICE_ID_INTEL_MCH_PA) << 3) + b)

should this not be

#define GET_INDEX(a, b) ((((a) - PCI_DEVICE_ID_INTEL_MCH_PA) << 3) + \
((b) & 7))

?


2005-05-12 23:36:27

by Ganesh Venkatesan

[permalink] [raw]
Subject: Re: [RFC][PATCH] timers fixes/improvements

John Linville, who originally proposed the patch (to add
e1000_watchdog_task and related code) has since withdrawn the patch.

e1000_down may still be suffering from the del_timer_sync race
condition that Oleg identified at the start of this thread.

ganesh.

On 5/11/05, Oleg Nesterov <[email protected]> wrote:
> Btw, i think that drivers/net/e1000/e1000_main.c:e1000_down() is buggy.
>
> It calls del_timer_sync(&adapter->watchdog_timer), but e1000_watchdog()
> calls schedule_work(e1000_watchdog_task), so the work could be queued
> after del_timer_sync().
>
> And e1000_watchdog_task() arms timers again.
>
> Note that it's not enough to do flush_scheduled_work() here.
>
> Oleg.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2005-05-13 22:43:05

by Greg KH

[permalink] [raw]
Subject: Re: [RFC][PATCH] timers fixes/improvements

On Wed, May 11, 2005 at 08:12:16AM -0700, Christoph Lameter wrote:
> We found that this has nothing to do with the timer patches. There is a
> scribble in pcie_rootport_aspm_quirk that overwrites ptype_all.

Ick.

> quirk_aspm_offset[GET_INDEX(pdev->device, dev->devfn)]= cap_base + 0x10;
>
> does the evil deed. The array offset calculated by GET_INDEX is out of
> bounds.
>
> The definition of GET_INDEX is suspect:
>
> #define GET_INDEX(a, b) (((a - PCI_DEVICE_ID_INTEL_MCH_PA) << 3) + b)
>
> should this not be
>
> #define GET_INDEX(a, b) ((((a) - PCI_DEVICE_ID_INTEL_MCH_PA) << 3) + \
> ((b) & 7))
>
> ?

Dely, any thoughts about this, or know who would know about it?

thanks,

greg k-h

2005-05-17 15:40:49

by Sy, Dely L

[permalink] [raw]
Subject: RE: [RFC][PATCH] timers fixes/improvements

On Friday, May 13, 2005 3:36 PM, Greg KH wrote:
> > The definition of GET_INDEX is suspect:
> >
> > #define GET_INDEX(a, b) (((a - PCI_DEVICE_ID_INTEL_MCH_PA) << 3) +
b)
> >
> > should this not be
> >
> > #define GET_INDEX(a, b) ((((a) - PCI_DEVICE_ID_INTEL_MCH_PA) << 3) +
\
> > ((b) & 7))
> >
> > ?
>
> Dely, any thoughts about this, or know who would know about it?

Greg,

I looked at the code and talked with Steve on this. The fix is correct;
i.e. b has to be masked with 7. Would Christoph or you send out a
patch for the fix or would you like us to do so? Thanks for finding out

the problem.

Thanks,
Dely

2005-05-17 15:55:04

by Christoph Lameter

[permalink] [raw]
Subject: [PATCH] fix memory scribble in arch/i386/pci/fixup.c

On Tue, 17 May 2005, Sy, Dely L wrote:

> On Friday, May 13, 2005 3:36 PM, Greg KH wrote:
> > > The definition of GET_INDEX is suspect:
> > >
> > > #define GET_INDEX(a, b) (((a - PCI_DEVICE_ID_INTEL_MCH_PA) << 3) + b)
> > > should this not be
> > > #define GET_INDEX(a, b) ((((a) - PCI_DEVICE_ID_INTEL_MCH_PA) << 3) + \
> > > ((b) & 7))
> > Dely, any thoughts about this, or know who would know about it?
>
> I looked at the code and talked with Steve on this. The fix is correct;
> i.e. b has to be masked with 7. Would Christoph or you send out a
> patch for the fix or would you like us to do so? Thanks for finding out
> the problem.

Ok. Here is the patch:

Index: linux-2.6.12-rc4/arch/i386/pci/fixup.c
===================================================================
--- linux-2.6.12-rc4.orig/arch/i386/pci/fixup.c 2005-05-12 16:39:39.000000000 +0000
+++ linux-2.6.12-rc4/arch/i386/pci/fixup.c 2005-05-17 15:45:05.000000000 +0000
@@ -253,7 +253,7 @@
#define MAX_PCIEROOT 6
static int quirk_aspm_offset[MAX_PCIEROOT << 3];

-#define GET_INDEX(a, b) (((a - PCI_DEVICE_ID_INTEL_MCH_PA) << 3) + b)
+#define GET_INDEX(a, b) ((((a) - PCI_DEVICE_ID_INTEL_MCH_PA) << 3) + ((b) & 7))

static int quirk_pcie_aspm_read(struct pci_bus *bus, unsigned int devfn, int where, int size, u32 *value)
{

2005-05-17 16:05:13

by Greg KH

[permalink] [raw]
Subject: Re: [RFC][PATCH] timers fixes/improvements

On Tue, May 17, 2005 at 08:38:45AM -0700, Sy, Dely L wrote:
> On Friday, May 13, 2005 3:36 PM, Greg KH wrote:
> > > The definition of GET_INDEX is suspect:
> > >
> > > #define GET_INDEX(a, b) (((a - PCI_DEVICE_ID_INTEL_MCH_PA) << 3) +
> b)
> > >
> > > should this not be
> > >
> > > #define GET_INDEX(a, b) ((((a) - PCI_DEVICE_ID_INTEL_MCH_PA) << 3) +
> \
> > > ((b) & 7))
> > >
> > > ?
> >
> > Dely, any thoughts about this, or know who would know about it?
>
> Greg,
>
> I looked at the code and talked with Steve on this. The fix is correct;
> i.e. b has to be masked with 7. Would Christoph or you send out a
> patch for the fix or would you like us to do so? Thanks for finding out
> the problem.

If you would send me a patch (or someone), that I could apply, I would
appreciate it.

thanks,

greg k-h