2006-02-24 00:27:13

by Nathan Lynch

[permalink] [raw]
Subject: [PATCH] softlockup detection vs. cpu hotplug

I'm able to trigger bogus softlockup warnings during cpu hotplug
testing, due to the percpu timestamp not being re-initialized before
the cpu starts servicing timer interrupts.

Before starting a cpu's watchdog thread, touch the cpu's timestamp to
avoid false positives.

In the watchdog thread, do touch_softlockup_watchdog in a
non-preemptible section so that it won't touch another cpu's
timestamp. This can happen in the window between the watchdog thread
getting forcefully migrated during a cpu offline operation and
kthread_should_stop.

Signed-off-by: Nathan Lynch <[email protected]>

---

kernel/softlockup.c | 20 ++++++++++++++++----
1 files changed, 16 insertions(+), 4 deletions(-)

33426e0c29ad973f9107cbd872648050f8988e61
diff --git a/kernel/softlockup.c b/kernel/softlockup.c
index c67189a..bd86fe1 100644
--- a/kernel/softlockup.c
+++ b/kernel/softlockup.c
@@ -34,9 +34,14 @@ static struct notifier_block panic_block
.notifier_call = softlock_panic,
};

+static void __touch_softlockup_watchdog(int cpu)
+{
+ per_cpu(timestamp, cpu) = jiffies;
+}
+
void touch_softlockup_watchdog(void)
{
- per_cpu(timestamp, raw_smp_processor_id()) = jiffies;
+ __touch_softlockup_watchdog(raw_smp_processor_id());
}
EXPORT_SYMBOL(touch_softlockup_watchdog);

@@ -73,6 +78,7 @@ void softlockup_tick(struct pt_regs *reg
static int watchdog(void * __bind_cpu)
{
struct sched_param param = { .sched_priority = 99 };
+ unsigned long bind_cpu = (unsigned long)__bind_cpu;

sched_setscheduler(current, SCHED_FIFO, &param);
current->flags |= PF_NOFREEZE;
@@ -86,10 +92,15 @@ static int watchdog(void * __bind_cpu)
*/
while (!kthread_should_stop()) {
msleep_interruptible(1000);
- touch_softlockup_watchdog();
+ /* When our cpu is offlined the watchdog thread can
+ * get migrated before it is stopped.
+ */
+ preempt_disable();
+ if (likely(smp_processor_id() == bind_cpu))
+ touch_softlockup_watchdog();
+ preempt_enable();
+ __set_current_state(TASK_RUNNING);
}
- __set_current_state(TASK_RUNNING);
-
return 0;
}

@@ -112,6 +123,7 @@ cpu_callback(struct notifier_block *nfb,
}
per_cpu(watchdog_task, hotcpu) = p;
kthread_bind(p, hotcpu);
+ __touch_softlockup_watchdog(hotcpu);
break;
case CPU_ONLINE:

--
1.1.5


2006-02-24 01:32:45

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH] softlockup detection vs. cpu hotplug

On Fri, 2006-02-24 at 08:31 +0800, Nathan Lynch wrote:
>
> In the watchdog thread, do touch_softlockup_watchdog in a
> non-preemptible section so that it won't touch another cpu's
> timestamp. This can happen in the window between the watchdog thread
> getting forcefully migrated during a cpu offline operation and
> kthread_should_stop.
Could we stop the thread in CPU_DOWN_PREPARE case, so it will not be
migrated to other CPUs? I suppose it's better the per-cpu thread only
runs on the specific cpu.

Thanks,
Shaohua

2006-02-24 06:27:06

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] softlockup detection vs. cpu hotplug


* Nathan Lynch <[email protected]> wrote:

> @@ -86,10 +92,15 @@ static int watchdog(void * __bind_cpu)
> */
> while (!kthread_should_stop()) {
> msleep_interruptible(1000);
> - touch_softlockup_watchdog();
> + /* When our cpu is offlined the watchdog thread can
> + * get migrated before it is stopped.
> + */
> + preempt_disable();
> + if (likely(smp_processor_id() == bind_cpu))
> + touch_softlockup_watchdog();
> + preempt_enable();
> + __set_current_state(TASK_RUNNING);
> }
> - __set_current_state(TASK_RUNNING);
> -
> return 0;
> }
>

the above change is unnecessary: there is absolutely no harm from a
migrated watchdog thread touching another CPU's timestamp for a short
amount of time. [Furtermore, why doesnt the hotplug CPU code use the
kthread_should_stop() mechanism to gracefully stop per-CPU threads,
instead of migrating unsuspecting threads and putting ugly hooks into
every such thread?]

the other fix (to touch before bringing the CPU up) looks OK, but it's
simpler to initialize it via the oneliner below. Andrew, this is what
i'd suggest to do for v2.6.16. [i'll send an updated softirq-rework
patch in the next mail.]

Ingo

------

fix from Nathan Lynch: initialize the softlockup timestamp of freshly
brought up CPUs with jiffies.

Signed-off-by: Ingo Molnar <[email protected]>

--- kernel/softlockup.c.orig
+++ kernel/softlockup.c
@@ -110,6 +110,7 @@ cpu_callback(struct notifier_block *nfb,
printk("watchdog for %i failed\n", hotcpu);
return NOTIFY_BAD;
}
+ per_cpu(timestamp, hotcpu) = jiffies;
per_cpu(watchdog_task, hotcpu) = p;
kthread_bind(p, hotcpu);
break;