Hi,
I'm getting some "scheduling while atomic" messages from 2.6.14-git3 on boot
and on suspend/resume, eg.:
NET: Registered protocol family 1
powernow-k8: Found 1 AMD Athlon 64 / Opteron processors (version 1.50.4)
powernow-k8: 0 : fid 0x0 (800 MHz), vid 0x12 (1100 mV)
powernow-k8: 1 : fid 0x8 (1600 MHz), vid 0x6 (1400 mV)
powernow-k8: 2 : fid 0xa (1800 MHz), vid 0x2 (1500 mV)
cpu_init done, current fid 0xa, vid 0x2
scheduling while atomic: swapper/0x00000001/1
Call Trace:<ffffffff8035014a>{schedule+122} <ffffffff80163576>{poison_obj+70}
<ffffffff801be269>{sysfs_new_dirent+41} <ffffffff80162989>{dbg_redzone1+25}
<ffffffff801be269>{sysfs_new_dirent+41} <ffffffff80163908>{cache_alloc_debugcheck_after+280}
<ffffffff8011cf1b>{powernowk8_target+139} <ffffffff80162989>{dbg_redzone1+25}
<ffffffff802e0830>{cpufreq_stat_notifier_policy+304}
<ffffffff802dfdb4>{__cpufreq_driver_target+116} <ffffffff801be269>{sysfs_new_dirent+41}
<ffffffff802e097e>{cpufreq_governor_performance+62}
<ffffffff802dec8d>{__cpufreq_governor+173} <ffffffff802df3a8>{__cpufreq_set_policy+440}
<ffffffff802df5bf>{cpufreq_set_policy+79} <ffffffff802df946>{cpufreq_add_dev+806}
<ffffffff802df540>{handle_update+0} <ffffffff802ae21a>{sysdev_driver_register+170}
<ffffffff802df106>{cpufreq_register_driver+198} <ffffffff8010c122>{init+194}
<ffffffff8010f556>{child_rip+8} <ffffffff8010c060>{init+0}
<ffffffff8010f54e>{child_rip+0}
scheduling while atomic: swapper/0x00000001/1
Call Trace:<ffffffff8035014a>{schedule+122} <ffffffff802e2453>{cpufreq_frequency_table_target+371}
<ffffffff8011d60c>{powernowk8_target+1916} <ffffffff802dfdb4>{__cpufreq_driver_target+116}
<ffffffff801be269>{sysfs_new_dirent+41} <ffffffff802e097e>{cpufreq_governor_performance+62}
<ffffffff802dec8d>{__cpufreq_governor+173} <ffffffff802df3a8>{__cpufreq_set_policy+440}
<ffffffff802df5bf>{cpufreq_set_policy+79} <ffffffff802df946>{cpufreq_add_dev+806}
<ffffffff802df540>{handle_update+0} <ffffffff802ae21a>{sysdev_driver_register+170}
<ffffffff802df106>{cpufreq_register_driver+198} <ffffffff8010c122>{init+194}
<ffffffff8010f556>{child_rip+8} <ffffffff8010c060>{init+0}
<ffffffff8010f54e>{child_rip+0}
input: AT Translated Set 2 keyboard//class/input as input1
scheduling while atomic: swapper/0x00000001/1
Call Trace:<ffffffff8035014a>{schedule+122} <ffffffff802e2453>{cpufreq_frequency_table_target+371}
<ffffffff8011cf1b>{powernowk8_target+139} <ffffffff802dfdb4>{__cpufreq_driver_target+116}
<ffffffff801be269>{sysfs_new_dirent+41} <ffffffff802e097e>{cpufreq_governor_performance+62}
<ffffffff802dec8d>{__cpufreq_governor+173} <ffffffff802df417>{__cpufreq_set_policy+551}
<ffffffff802df5bf>{cpufreq_set_policy+79} <ffffffff802df946>{cpufreq_add_dev+806}
<ffffffff802df540>{handle_update+0} <ffffffff802ae21a>{sysdev_driver_register+170}
<ffffffff802df106>{cpufreq_register_driver+198} <ffffffff8010c122>{init+194}
<ffffffff8010f556>{child_rip+8} <ffffffff8010c060>{init+0}
<ffffffff8010f54e>{child_rip+0}
scheduling while atomic: swapper/0x00000001/1
Call Trace:<ffffffff8035014a>{schedule+122} <ffffffff802e2453>{cpufreq_frequency_table_target+371}
<ffffffff8011d60c>{powernowk8_target+1916} <ffffffff802dfdb4>{__cpufreq_driver_target+116}
<ffffffff801be269>{sysfs_new_dirent+41} <ffffffff802e097e>{cpufreq_governor_performance+62}
<ffffffff802dec8d>{__cpufreq_governor+173} <ffffffff802df417>{__cpufreq_set_policy+551}
<ffffffff802df5bf>{cpufreq_set_policy+79} <ffffffff802df946>{cpufreq_add_dev+806}
<ffffffff802df540>{handle_update+0} <ffffffff802ae21a>{sysdev_driver_register+170}
<ffffffff802df106>{cpufreq_register_driver+198} <ffffffff8010c122>{init+194}
<ffffffff8010f556>{child_rip+8} <ffffffff8010c060>{init+0}
<ffffffff8010f54e>{child_rip+0}
Additionally there are some problems with freezing processes by swsusp.
Greetings,
Rafael
"Rafael J. Wysocki" <[email protected]> wrote:
>
> Hi,
>
> I'm getting some "scheduling while atomic" messages from 2.6.14-git3 on boot
> and on suspend/resume, eg.:
>
> NET: Registered protocol family 1
> powernow-k8: Found 1 AMD Athlon 64 / Opteron processors (version 1.50.4)
> powernow-k8: 0 : fid 0x0 (800 MHz), vid 0x12 (1100 mV)
> powernow-k8: 1 : fid 0x8 (1600 MHz), vid 0x6 (1400 mV)
> powernow-k8: 2 : fid 0xa (1800 MHz), vid 0x2 (1500 mV)
> cpu_init done, current fid 0xa, vid 0x2
> scheduling while atomic: swapper/0x00000001/1
>
> Call Trace:<ffffffff8035014a>{schedule+122} <ffffffff80163576>{poison_obj+70}
> <ffffffff801be269>{sysfs_new_dirent+41} <ffffffff80162989>{dbg_redzone1+25}
> <ffffffff801be269>{sysfs_new_dirent+41} <ffffffff80163908>{cache_alloc_debugcheck_after+280}
> <ffffffff8011cf1b>{powernowk8_target+139} <ffffffff80162989>{dbg_redzone1+25}
> <ffffffff802e0830>{cpufreq_stat_notifier_policy+304}
> <ffffffff802dfdb4>{__cpufreq_driver_target+116} <ffffffff801be269>{sysfs_new_dirent+41}
> <ffffffff802e097e>{cpufreq_governor_performance+62}
> <ffffffff802dec8d>{__cpufreq_governor+173} <ffffffff802df3a8>{__cpufreq_set_policy+440}
> <ffffffff802df5bf>{cpufreq_set_policy+79} <ffffffff802df946>{cpufreq_add_dev+806}
> <ffffffff802df540>{handle_update+0} <ffffffff802ae21a>{sysdev_driver_register+170}
> <ffffffff802df106>{cpufreq_register_driver+198} <ffffffff8010c122>{init+194}
> <ffffffff8010f556>{child_rip+8} <ffffffff8010c060>{init+0}
> <ffffffff8010f54e>{child_rip+0}
> scheduling while atomic: swapper/0x00000001/1
>
> Call Trace:<ffffffff8035014a>{schedule+122} <ffffffff802e2453>{cpufreq_frequency_table_target+371}
> <ffffffff8011d60c>{powernowk8_target+1916} <ffffffff802dfdb4>{__cpufreq_driver_target+116}
> <ffffffff801be269>{sysfs_new_dirent+41} <ffffffff802e097e>{cpufreq_governor_performance+62}
> <ffffffff802dec8d>{__cpufreq_governor+173} <ffffffff802df3a8>{__cpufreq_set_policy+440}
> <ffffffff802df5bf>{cpufreq_set_policy+79} <ffffffff802df946>{cpufreq_add_dev+806}
> <ffffffff802df540>{handle_update+0} <ffffffff802ae21a>{sysdev_driver_register+170}
> <ffffffff802df106>{cpufreq_register_driver+198} <ffffffff8010c122>{init+194}
> <ffffffff8010f556>{child_rip+8} <ffffffff8010c060>{init+0}
> <ffffffff8010f54e>{child_rip+0}
> input: AT Translated Set 2 keyboard//class/input as input1
> scheduling while atomic: swapper/0x00000001/1
>
> Call Trace:<ffffffff8035014a>{schedule+122} <ffffffff802e2453>{cpufreq_frequency_table_target+371}
> <ffffffff8011cf1b>{powernowk8_target+139} <ffffffff802dfdb4>{__cpufreq_driver_target+116}
> <ffffffff801be269>{sysfs_new_dirent+41} <ffffffff802e097e>{cpufreq_governor_performance+62}
> <ffffffff802dec8d>{__cpufreq_governor+173} <ffffffff802df417>{__cpufreq_set_policy+551}
> <ffffffff802df5bf>{cpufreq_set_policy+79} <ffffffff802df946>{cpufreq_add_dev+806}
> <ffffffff802df540>{handle_update+0} <ffffffff802ae21a>{sysdev_driver_register+170}
> <ffffffff802df106>{cpufreq_register_driver+198} <ffffffff8010c122>{init+194}
> <ffffffff8010f556>{child_rip+8} <ffffffff8010c060>{init+0}
> <ffffffff8010f54e>{child_rip+0}
> scheduling while atomic: swapper/0x00000001/1
>
> Call Trace:<ffffffff8035014a>{schedule+122} <ffffffff802e2453>{cpufreq_frequency_table_target+371}
> <ffffffff8011d60c>{powernowk8_target+1916} <ffffffff802dfdb4>{__cpufreq_driver_target+116}
> <ffffffff801be269>{sysfs_new_dirent+41} <ffffffff802e097e>{cpufreq_governor_performance+62}
> <ffffffff802dec8d>{__cpufreq_governor+173} <ffffffff802df417>{__cpufreq_set_policy+551}
> <ffffffff802df5bf>{cpufreq_set_policy+79} <ffffffff802df946>{cpufreq_add_dev+806}
> <ffffffff802df540>{handle_update+0} <ffffffff802ae21a>{sysdev_driver_register+170}
> <ffffffff802df106>{cpufreq_register_driver+198} <ffffffff8010c122>{init+194}
> <ffffffff8010f556>{child_rip+8} <ffffffff8010c060>{init+0}
> <ffffffff8010f54e>{child_rip+0}
Well I can't find it. Ingo, didn't you have a debug patch which would help
us identify where this atomic section started?
> Additionally there are some problems with freezing processes by swsusp.
More details on this?
On Monday, 31 of October 2005 20:34, Andrew Morton wrote:
> "Rafael J. Wysocki" <[email protected]> wrote:
}-- snip --{
> > scheduling while atomic: swapper/0x00000001/1
> >
> > Call Trace:<ffffffff8035014a>{schedule+122} <ffffffff802e2453>{cpufreq_frequency_table_target+371}
> > <ffffffff8011d60c>{powernowk8_target+1916} <ffffffff802dfdb4>{__cpufreq_driver_target+116}
> > <ffffffff801be269>{sysfs_new_dirent+41} <ffffffff802e097e>{cpufreq_governor_performance+62}
> > <ffffffff802dec8d>{__cpufreq_governor+173} <ffffffff802df417>{__cpufreq_set_policy+551}
> > <ffffffff802df5bf>{cpufreq_set_policy+79} <ffffffff802df946>{cpufreq_add_dev+806}
> > <ffffffff802df540>{handle_update+0} <ffffffff802ae21a>{sysdev_driver_register+170}
> > <ffffffff802df106>{cpufreq_register_driver+198} <ffffffff8010c122>{init+194}
> > <ffffffff8010f556>{child_rip+8} <ffffffff8010c060>{init+0}
> > <ffffffff8010f54e>{child_rip+0}
>
> Well I can't find it. Ingo, didn't you have a debug patch which would help
> us identify where this atomic section started?
This is 100% reproducible on my box so I'll try to figure out what's up tomorrow
(unless someone else does it earlier ;-)). Now I can only say it did not happen
with 2.6.14-rc5-mm1.
> > Additionally there are some problems with freezing processes by swsusp.
On one occasion the box failed to suspend because a process refused to freeze, but I was
unable to trace the problem at that time. This does not seem to be readily
reproducible, however.
Greetings,
Rafael
On Mon, Oct 31, 2005 at 08:45:32PM +0100, Rafael J. Wysocki wrote:
> On Monday, 31 of October 2005 20:34, Andrew Morton wrote:
> > "Rafael J. Wysocki" <[email protected]> wrote:
> }-- snip --{
> > > scheduling while atomic: swapper/0x00000001/1
> > >
> > > Call Trace:<ffffffff8035014a>{schedule+122} <ffffffff802e2453>{cpufreq_frequency_table_target+371}
> > > <ffffffff8011d60c>{powernowk8_target+1916} <ffffffff802dfdb4>{__cpufreq_driver_target+116}
> > > <ffffffff801be269>{sysfs_new_dirent+41} <ffffffff802e097e>{cpufreq_governor_performance+62}
> > > <ffffffff802dec8d>{__cpufreq_governor+173} <ffffffff802df417>{__cpufreq_set_policy+551}
> > > <ffffffff802df5bf>{cpufreq_set_policy+79} <ffffffff802df946>{cpufreq_add_dev+806}
> > > <ffffffff802df540>{handle_update+0} <ffffffff802ae21a>{sysdev_driver_register+170}
> > > <ffffffff802df106>{cpufreq_register_driver+198} <ffffffff8010c122>{init+194}
> > > <ffffffff8010f556>{child_rip+8} <ffffffff8010c060>{init+0}
> > > <ffffffff8010f54e>{child_rip+0}
> >
> > Well I can't find it. Ingo, didn't you have a debug patch which would help
> > us identify where this atomic section started?
>
> This is 100% reproducible on my box so I'll try to figure out what's up tomorrow
> (unless someone else does it earlier ;-)). Now I can only say it did not happen
> with 2.6.14-rc5-mm1.
This could be because of the new patch: I added preempt_disable() instead
of taking the cpucontrol lock in __cpufreq_driver_target().
The reason is that we now enter the same code path from the cpu_up()- and
cpu_down()-generated cpu notifier callbacks, and end up trying to take the
cpucontrol lock when the call path already holds it.
It's happening because we do set_cpus_allowed() in powernowk8_target().
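In other words, the offending combination looks roughly like this (paraphrasing
the code that the patch further down in this thread removes; an illustrative
sketch, not a verbatim copy of the 2.6.14-git3 sources):

int __cpufreq_driver_target(struct cpufreq_policy *policy,
			    unsigned int target_freq,
			    unsigned int relation)
{
	int retval = -EINVAL;

	/* stand-in for lock_cpu_hotplug(), to avoid the double lock */
	preempt_disable();
	if (cpu_online(policy->cpu) && cpufreq_driver->target)
		/*
		 * powernow-k8's ->target() does set_cpus_allowed(), which
		 * can sleep -- hence "scheduling while atomic".
		 */
		retval = cpufreq_driver->target(policy, target_freq,
						relation);
	preempt_enable();

	return retval;
}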
>
> > > Additionally there are some problems with freezing processes by swsusp.
>
> On one occasion the box failed to suspend because a process refused to freeze, but I was
> unable to trace the problem at that time. This does not seem to be readily
> reproducible, however.
>
> Greetings,
> Rafael
--
Cheers,
Ashok Raj
- Open Source Technology Center
* Andrew Morton <[email protected]> wrote:
> Well I can't find it. Ingo, didn't you have a debug patch which would
> help us identify where this atomic section started?
yes. Find below the preempt-trace.patch, updated and merged to 2.6.14.
Rafael, just enable CONFIG_DEBUG_PREEMPT and repeat the reproducer.
Works fine on x86 and x64, and should not break other architectures. Has
been included in the -rt tree for quite some time. It is certainly a
useful debugging feature. Could we add this to -mm, with a focus on
getting this upstream eventually?
Ingo
-----
this implements the "non-preemptible section trace" feature, which
prints out a "critical section nesting" trace after stackdumps.
sample output on x86:
Call Trace:
[<c0103db1>] show_stack+0x7a/0x90
[<c0103f36>] show_registers+0x156/0x1ce
[<c010412e>] die+0xe8/0x172
[<c010422e>] do_trap+0x76/0xa3
[<c01044fe>] do_invalid_op+0xa3/0xad
[<c01039ef>] error_code+0x4f/0x54
[<c0120be9>] test+0x8/0xa
[<c0120c41>] sys_gettimeofday+0x56/0x74
[<c0102eeb>] sysenter_past_esp+0x54/0x75
---------------------------
| preempt count: 00000004 ]
| 4 levels deep critical section nesting:
-----------------------------------------
.. [<c0120bbe>] .... test3+0xd/0xf
.....[<c0120bc8>] .. ( <= test2+0x8/0x21)
.. [<c0120bbe>] .... test3+0xd/0xf
.....[<c0120bcd>] .. ( <= test2+0xd/0x21)
.. [<c0120bd7>] .... test2+0x17/0x21
.....[<c0120be9>] .. ( <= test+0x8/0xa)
.. [<c010407f>] .... die+0x39/0x172
.....[<c010422e>] .. ( <= do_trap+0x76/0xa3)
sample output on x64:
Call Trace:<ffffffff8012373c>{sys32_gettimeofday+60} <ffffffff80122af3>{cstar_do_call+27}
---------------------------
| preempt count: 00000002 ]
| 2 level deep critical section nesting:
----------------------------------------
.. [<ffffffff8012372d>] .... sys32_gettimeofday+0x2d/0x110
.....[<ffffffff80122af3>] .. ( <= cstar_do_call+0x1b/0x60)
.. [<ffffffff80123737>] .... sys32_gettimeofday+0x37/0x110
.....[<ffffffff80122af3>] .. ( <= cstar_do_call+0x1b/0x60)
this feature is implemented via a low-overhead mechanism by keeping
the caller and caller-parent addresses for each preempt_disable()
call site, and printing it upon crashes. Note that every other
locking API is thus traced too, such as spinlocks, rwlocks, per-cpu
variables, etc. This feature is especially useful in identifying
leaked preemption counts, as the missing count is displayed as an
extra entry in the stack.
the feature is active when DEBUG_PREEMPT is enabled.
i've also cleaned up preemption-count debugging by moving the debug
functions out of sched.c into lib/preempt.c.
also, i have added preemption-counter-imbalance checks to the hardirq
and softirq processing codepaths. The behavior of preemption-counter
checks is now uniform: a warning is printed with all info we have at
that point, and the preemption counter is then restored to the old
value.
on x86 i have changed the 4KSTACKS feature to inherit the low bits of
the preemption-count across hardirq/softirq-context switching, so that
the preemption trace entries of interrupts do not overwrite process
level preemption trace entries.
boot- and functionality-tested on x86 and x64. Should work on all
architectures, but only x86 and x64 have been updated to print the
trace-stack out at stackdump time.
This feature has been part of the -rt tree for quite some time and was
very useful in debugging preempt-counter leaks, deadlock and lockup
situations.
Signed-off-by: Ingo Molnar <[email protected]>
arch/i386/kernel/irq.c | 28 +++++++++++
arch/i386/kernel/traps.c | 1
arch/x86_64/kernel/process.c | 2
arch/x86_64/kernel/traps.c | 13 ++---
include/asm-x86_64/proto.h | 2
include/linux/sched.h | 13 +++++
kernel/exit.c | 9 ++-
kernel/irq/handle.c | 17 +++++++
kernel/sched.c | 33 --------------
kernel/softirq.c | 17 +++++++
kernel/timer.c | 27 ++++++-----
lib/Kconfig.debug | 2
lib/Makefile | 2
lib/preempt.c | 101 +++++++++++++++++++++++++++++++++++++++++++
14 files changed, 211 insertions(+), 56 deletions(-)
Index: linux/arch/i386/kernel/irq.c
===================================================================
--- linux.orig/arch/i386/kernel/irq.c
+++ linux/arch/i386/kernel/irq.c
@@ -55,6 +55,9 @@ fastcall unsigned int do_IRQ(struct pt_r
{
/* high bits used in ret_from_ code */
int irq = regs->orig_eax & 0xff;
+#ifdef CONFIG_DEBUG_PREEMPT
+ u32 count = preempt_count() & PREEMPT_MASK;
+#endif
#ifdef CONFIG_4KSTACKS
union irq_ctx *curctx, *irqctx;
u32 *isp;
@@ -95,6 +98,14 @@ fastcall unsigned int do_IRQ(struct pt_r
irqctx->tinfo.task = curctx->tinfo.task;
irqctx->tinfo.previous_esp = current_stack_pointer;
+ /*
+ * Keep the preemption-count offset, so that the
+ * process-level preemption-trace entries do not
+ * get overwritten by the hardirq context:
+ */
+#ifdef CONFIG_DEBUG_PREEMPT
+ irqctx->tinfo.preempt_count += count;
+#endif
asm volatile(
" xchgl %%ebx,%%esp \n"
" call __do_IRQ \n"
@@ -103,6 +114,9 @@ fastcall unsigned int do_IRQ(struct pt_r
: "0" (irq), "1" (regs), "2" (isp)
: "memory", "cc", "ecx"
);
+#ifdef CONFIG_DEBUG_PREEMPT
+ irqctx->tinfo.preempt_count -= count;
+#endif
} else
#endif
__do_IRQ(irq, regs);
@@ -165,6 +179,9 @@ extern asmlinkage void __do_softirq(void
asmlinkage void do_softirq(void)
{
+#ifdef CONFIG_DEBUG_PREEMPT
+ u32 count = preempt_count() & PREEMPT_MASK;
+#endif
unsigned long flags;
struct thread_info *curctx;
union irq_ctx *irqctx;
@@ -181,6 +198,14 @@ asmlinkage void do_softirq(void)
irqctx->tinfo.task = curctx->task;
irqctx->tinfo.previous_esp = current_stack_pointer;
+ /*
+ * Keep the preemption-count offset, so that the
+ * process-level preemption-trace entries do not
+ * get overwritten by the softirq context:
+ */
+#ifdef CONFIG_DEBUG_PREEMPT
+ irqctx->tinfo.preempt_count += count;
+#endif
/* build the stack frame on the softirq stack */
isp = (u32*) ((char*)irqctx + sizeof(*irqctx));
@@ -192,6 +217,9 @@ asmlinkage void do_softirq(void)
: "0"(isp)
: "memory", "cc", "edx", "ecx", "eax"
);
+#ifdef CONFIG_DEBUG_PREEMPT
+ irqctx->tinfo.preempt_count -= count;
+#endif
}
local_irq_restore(flags);
Index: linux/arch/i386/kernel/traps.c
===================================================================
--- linux.orig/arch/i386/kernel/traps.c
+++ linux/arch/i386/kernel/traps.c
@@ -163,6 +163,7 @@ void show_trace(struct task_struct *task
break;
printk(" =======================\n");
}
+ print_preempt_trace(task, task->thread_info->preempt_count);
}
void show_stack(struct task_struct *task, unsigned long *esp)
Index: linux/arch/x86_64/kernel/process.c
===================================================================
--- linux.orig/arch/x86_64/kernel/process.c
+++ linux/arch/x86_64/kernel/process.c
@@ -315,7 +315,7 @@ void show_regs(struct pt_regs *regs)
{
printk("CPU %d:", smp_processor_id());
__show_regs(regs);
- show_trace(&regs->rsp);
+ show_trace(current, &regs->rsp);
}
/*
Index: linux/arch/x86_64/kernel/traps.c
===================================================================
--- linux.orig/arch/x86_64/kernel/traps.c
+++ linux/arch/x86_64/kernel/traps.c
@@ -154,7 +154,7 @@ static unsigned long *in_exception_stack
* severe exception (double fault, nmi, stack fault, debug, mce) hardware stack
*/
-void show_trace(unsigned long *stack)
+void show_trace(struct task_struct *task, unsigned long *stack)
{
unsigned long addr;
const unsigned cpu = safe_smp_processor_id();
@@ -219,9 +219,10 @@ void show_trace(unsigned long *stack)
HANDLE_STACK (((long) stack & (THREAD_SIZE-1)) != 0);
#undef HANDLE_STACK
printk("\n");
+ print_preempt_trace(task, task->thread_info->preempt_count);
}
-void show_stack(struct task_struct *tsk, unsigned long * rsp)
+void show_stack(struct task_struct *task, unsigned long * rsp)
{
unsigned long *stack;
int i;
@@ -233,8 +234,8 @@ void show_stack(struct task_struct *tsk,
// back trace for this cpu.
if (rsp == NULL) {
- if (tsk)
- rsp = (unsigned long *)tsk->thread.rsp;
+ if (task)
+ rsp = (unsigned long *)task->thread.rsp;
else
rsp = (unsigned long *)&rsp;
}
@@ -255,7 +256,7 @@ void show_stack(struct task_struct *tsk,
printk("%016lx ", *stack++);
touch_nmi_watchdog();
}
- show_trace((unsigned long *)rsp);
+ show_trace(task, (unsigned long *)rsp);
}
/*
@@ -264,7 +265,7 @@ void show_stack(struct task_struct *tsk,
void dump_stack(void)
{
unsigned long dummy;
- show_trace(&dummy);
+ show_trace(current, &dummy);
}
EXPORT_SYMBOL(dump_stack);
Index: linux/include/asm-x86_64/proto.h
===================================================================
--- linux.orig/include/asm-x86_64/proto.h
+++ linux/include/asm-x86_64/proto.h
@@ -65,7 +65,7 @@ extern unsigned long end_pfn_map;
extern cpumask_t cpu_initialized;
-extern void show_trace(unsigned long * rsp);
+extern void show_trace(struct task_struct *task, unsigned long * rsp);
extern void show_registers(struct pt_regs *regs);
extern void exception_table_check(void);
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -592,6 +592,14 @@ extern void partition_sched_domains(cpum
cpumask_t *partition2);
#endif /* CONFIG_SMP */
+#ifdef CONFIG_DEBUG_PREEMPT
+# define MAX_PREEMPT_TRACE 25
+extern void print_preempt_trace(struct task_struct *task, u32 count);
+#else
+static inline void print_preempt_trace(struct task_struct *task, u32 count)
+{
+}
+#endif
struct io_context; /* See blkdev.h */
void exit_io_context(void);
@@ -813,6 +821,11 @@ struct task_struct {
int cpuset_mems_generation;
#endif
atomic_t fs_excl; /* holding fs exclusive resources */
+
+#ifdef CONFIG_DEBUG_PREEMPT
+ void *preempt_off_caller[MAX_PREEMPT_TRACE];
+ void *preempt_off_parent[MAX_PREEMPT_TRACE];
+#endif
};
static inline pid_t process_group(struct task_struct *tsk)
Index: linux/kernel/exit.c
===================================================================
--- linux.orig/kernel/exit.c
+++ linux/kernel/exit.c
@@ -833,10 +833,11 @@ fastcall NORET_TYPE void do_exit(long co
tsk->it_prof_expires = cputime_zero;
tsk->it_sched_expires = 0;
- if (unlikely(in_atomic()))
- printk(KERN_INFO "note: %s[%d] exited with preempt_count %d\n",
- current->comm, current->pid,
- preempt_count());
+ if (unlikely(in_atomic())) {
+ printk(KERN_ERR "BUG: %s[%d] exited with nonzero preempt_count %d!\n",
+ tsk->comm, tsk->pid, preempt_count());
+ print_preempt_trace(tsk, preempt_count());
+ }
acct_update_integrals(tsk);
update_mem_hiwater(tsk);
Index: linux/kernel/irq/handle.c
===================================================================
--- linux.orig/kernel/irq/handle.c
+++ linux/kernel/irq/handle.c
@@ -85,7 +85,24 @@ fastcall int handle_IRQ_event(unsigned i
local_irq_enable();
do {
+#ifdef CONFIG_DEBUG_PREEMPT
+ u32 in_count = preempt_count(), out_count;
+#endif
ret = action->handler(irq, action->dev_id, regs);
+#ifdef CONFIG_DEBUG_PREEMPT
+ out_count = preempt_count();
+ if (in_count != out_count) {
+ printk(KERN_ERR "BUG: irq %d [%s] preempt-count "
+ "imbalance: in=%08x, out=%08x!\n",
+ irq, action->name, in_count, out_count);
+ print_preempt_trace(current, out_count);
+ /*
+ * We already printed all the useful info,
+ * fix up the preemption count now:
+ */
+ preempt_count() = in_count;
+ }
+#endif
if (ret == IRQ_HANDLED)
status |= action->flags;
retval |= ret;
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -47,6 +47,7 @@
#include <linux/syscalls.h>
#include <linux/times.h>
#include <linux/acct.h>
+#include <linux/kallsyms.h>
#include <asm/tlb.h>
#include <asm/unistd.h>
@@ -2809,38 +2810,6 @@ static inline int dependent_sleeper(int
}
#endif
-#if defined(CONFIG_PREEMPT) && defined(CONFIG_DEBUG_PREEMPT)
-
-void fastcall add_preempt_count(int val)
-{
- /*
- * Underflow?
- */
- BUG_ON((preempt_count() < 0));
- preempt_count() += val;
- /*
- * Spinlock count overflowing soon?
- */
- BUG_ON((preempt_count() & PREEMPT_MASK) >= PREEMPT_MASK-10);
-}
-EXPORT_SYMBOL(add_preempt_count);
-
-void fastcall sub_preempt_count(int val)
-{
- /*
- * Underflow?
- */
- BUG_ON(val > preempt_count());
- /*
- * Is the spinlock portion underflowing?
- */
- BUG_ON((val < PREEMPT_MASK) && !(preempt_count() & PREEMPT_MASK));
- preempt_count() -= val;
-}
-EXPORT_SYMBOL(sub_preempt_count);
-
-#endif
-
/*
* schedule() is the main scheduler function.
*/
Index: linux/kernel/softirq.c
===================================================================
--- linux.orig/kernel/softirq.c
+++ linux/kernel/softirq.c
@@ -92,7 +92,24 @@ restart:
do {
if (pending & 1) {
+#ifdef CONFIG_DEBUG_PREEMPT
+ u32 in_count = preempt_count(), out_count;
+#endif
h->action(h);
+#ifdef CONFIG_DEBUG_PREEMPT
+ out_count = preempt_count();
+ if (in_count != out_count) {
+ printk(KERN_ERR "BUG: softirq %ld preempt-count"
+ " imbalance: in=%08x, out=%08x!\n",
+ (long)(h - softirq_vec),
+ in_count, out_count);
+ print_preempt_trace(current, out_count);
+ /*
+ * Fix up the bad preemption count:
+ */
+ preempt_count() = in_count;
+ }
+#endif
rcu_bh_qsctr_inc(cpu);
}
h++;
Index: linux/kernel/timer.c
===================================================================
--- linux.orig/kernel/timer.c
+++ linux/kernel/timer.c
@@ -33,6 +33,7 @@
#include <linux/posix-timers.h>
#include <linux/cpu.h>
#include <linux/syscalls.h>
+#include <linux/kallsyms.h>
#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -480,6 +481,7 @@ static inline void __run_timers(tvec_bas
while (!list_empty(head)) {
void (*fn)(unsigned long);
unsigned long data;
+ int in_count, out_count;
timer = list_entry(head->next,struct timer_list,entry);
fn = timer->function;
@@ -488,17 +490,20 @@ static inline void __run_timers(tvec_bas
set_running_timer(base, timer);
detach_timer(timer, 1);
spin_unlock_irq(&base->t_base.lock);
- {
- int preempt_count = preempt_count();
- fn(data);
- if (preempt_count != preempt_count()) {
- printk(KERN_WARNING "huh, entered %p "
- "with preempt_count %08x, exited"
- " with %08x?\n",
- fn, preempt_count,
- preempt_count());
- BUG();
- }
+
+ in_count = preempt_count();
+ fn(data);
+ out_count = preempt_count();
+ if (in_count != out_count) {
+ print_symbol(KERN_ERR "BUG: %s", (long)fn);
+ printk(KERN_ERR "(%p) preempt-count imbalance: "
+ "in=%08x, out=%08x!",
+ fn, in_count, out_count);
+ print_preempt_trace(current, out_count);
+ /*
+ * Fix up the bad preemption count:
+ */
+ preempt_count() = in_count;
}
spin_lock_irq(&base->t_base.lock);
}
Index: linux/lib/Kconfig.debug
===================================================================
--- linux.orig/lib/Kconfig.debug
+++ linux/lib/Kconfig.debug
@@ -88,6 +88,8 @@ config DEBUG_SLAB
config DEBUG_PREEMPT
bool "Debug preemptible kernel"
depends on DEBUG_KERNEL && PREEMPT
+ select FRAME_POINTER
+ select KALLSYMS_ALL
default y
help
If you say Y here then the kernel will use a debug variant of the
Index: linux/lib/Makefile
===================================================================
--- linux.orig/lib/Makefile
+++ linux/lib/Makefile
@@ -22,7 +22,7 @@ lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) +=
lib-$(CONFIG_SEMAPHORE_SLEEPERS) += semaphore-sleepers.o
lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
-obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
+obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o preempt.o
ifneq ($(CONFIG_HAVE_DEC_LOCK),y)
lib-y += dec_and_lock.o
Index: linux/lib/preempt.c
===================================================================
--- /dev/null
+++ linux/lib/preempt.c
@@ -0,0 +1,101 @@
+/*
+ * lib/preempt.c
+ *
+ * DEBUG_PREEMPT variant of add_preempt_count() and sub_preempt_count().
+ * Preemption tracing.
+ *
+ * (C) 2005 Ingo Molnar, Red Hat
+ */
+#include <linux/module.h>
+#include <linux/hardirq.h>
+#include <linux/kallsyms.h>
+
+/*
+ * Add a value to the preemption count, and check for overflows,
+ * underflows and maintain a small stack of callers that gets
+ * printed upon crashes.
+ */
+void fastcall add_preempt_count(int val)
+{
+ unsigned int count = preempt_count(), idx = count & PREEMPT_MASK;
+
+ /*
+ * Underflow?
+ */
+ BUG_ON(count < 0);
+
+ preempt_count() += val;
+
+ /*
+ * Spinlock count overflowing soon?
+ */
+ BUG_ON(idx >= PREEMPT_MASK-10);
+
+ /*
+ * Maintain the per-task preemption-nesting stack (which
+ * will be printed upon crashes). It's a low-overhead thing,
+ * constant overhead per preempt-disable.
+ */
+ if (idx < MAX_PREEMPT_TRACE) {
+ void *caller = __builtin_return_address(0), *parent = NULL;
+
+#ifdef CONFIG_FRAME_POINTER
+ parent = __builtin_return_address(1);
+ if (in_lock_functions((unsigned long)parent)) {
+ parent = __builtin_return_address(2);
+ if (in_lock_functions((unsigned long)parent))
+ parent = __builtin_return_address(3);
+ }
+#endif
+ current->preempt_off_caller[idx] = caller;
+ current->preempt_off_parent[idx] = parent;
+ }
+}
+EXPORT_SYMBOL(add_preempt_count);
+
+void fastcall sub_preempt_count(int val)
+{
+ unsigned int count = preempt_count();
+
+ /*
+ * Underflow?
+ */
+ BUG_ON(val > count);
+ /*
+ * Is the spinlock portion underflowing?
+ */
+ BUG_ON((val < PREEMPT_MASK) && !(count & PREEMPT_MASK));
+
+ preempt_count() -= val;
+}
+EXPORT_SYMBOL(sub_preempt_count);
+
+void print_preempt_trace(struct task_struct *task, u32 count)
+{
+ unsigned int i, idx = count & PREEMPT_MASK;
+
+ preempt_disable();
+
+ printk("---------------------------\n");
+ printk("| preempt count: %08x ]\n", count);
+ if (count) {
+ printk("| %d level deep critical section nesting:\n", idx);
+ printk("----------------------------------------\n");
+ } else
+ printk("---------------------------\n");
+ for (i = 0; i < idx; i++) {
+ printk(".. [<%p>] .... ", task->preempt_off_caller[i]);
+ print_symbol("%s\n", (long)task->preempt_off_caller[i]);
+ printk(".....[<%p>] .. ( <= ",
+ task->preempt_off_parent[i]);
+ print_symbol("%s)\n", (long)task->preempt_off_parent[i]);
+ if (i == MAX_PREEMPT_TRACE-1) {
+ printk("[rest truncated, reached MAX_PREEMPT_TRACE]\n");
+ break;
+ }
+ }
+ printk("\n");
+
+ preempt_enable();
+}
+
Hi,
On Monday, 31 of October 2005 22:42, Ingo Molnar wrote:
>
> * Andrew Morton <[email protected]> wrote:
>
> > Well I can't find it. Ingo, didn't you have a debug patch which would
> > help us identify where this atomic section started?
>
> yes. Find below the preempt-trace.patch, updated and merged to 2.6.14.
>
> Rafael, just enable CONFIG_DEBUG_PREEMPT and repeat the reproducer.
Thanks a lot, but it was not necessary in this particular case (please see
my reply to Ashok's post).
Greetings,
Rafael
Hi,
On Monday, 31 of October 2005 21:42, Ashok Raj wrote:
> On Mon, Oct 31, 2005 at 08:45:32PM +0100, Rafael J. Wysocki wrote:
> > On Monday, 31 of October 2005 20:34, Andrew Morton wrote:
> > > "Rafael J. Wysocki" <[email protected]> wrote:
> > }-- snip --{
> > > > scheduling while atomic: swapper/0x00000001/1
> > > >
> > > > Call Trace:<ffffffff8035014a>{schedule+122} <ffffffff802e2453>{cpufreq_frequency_table_target+371}
> > > > <ffffffff8011d60c>{powernowk8_target+1916} <ffffffff802dfdb4>{__cpufreq_driver_target+116}
> > > > <ffffffff801be269>{sysfs_new_dirent+41} <ffffffff802e097e>{cpufreq_governor_performance+62}
> > > > <ffffffff802dec8d>{__cpufreq_governor+173} <ffffffff802df417>{__cpufreq_set_policy+551}
> > > > <ffffffff802df5bf>{cpufreq_set_policy+79} <ffffffff802df946>{cpufreq_add_dev+806}
> > > > <ffffffff802df540>{handle_update+0} <ffffffff802ae21a>{sysdev_driver_register+170}
> > > > <ffffffff802df106>{cpufreq_register_driver+198} <ffffffff8010c122>{init+194}
> > > > <ffffffff8010f556>{child_rip+8} <ffffffff8010c060>{init+0}
> > > > <ffffffff8010f54e>{child_rip+0}
> > >
> > > Well I can't find it. Ingo, didn't you have a debug patch which would help
> > > us identify where this atomic section started?
> >
> > This is 100% reproducible on my box so I'll try to figure out what's up tomorrow
> > (unless someone else does it earlier ;-)). Now I can only say it did not happen
> > with 2.6.14-rc5-mm1.
>
> This could be because of the new patch, i added preempt_disable() instead
> of taking cpucontrol lock in __cpufreq_driver_target().
Yes, that's it.
> The reason is we now enter the same code path from the cpu_up() and cpu_down()
> generated cpu notifier callbacks and ends up trying to lock when the
> call path already has the cpucontrol lock.
>
> Its happening because we do set_cpus_allowed() in powernowk8_target().
Unfortunately, powernowk8_target() calls schedule() right after
set_cpus_allowed(), so it throws "scheduling while atomic" on every call,
because of the preempt_disable()/_enable() around it.
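The driver-side pattern in question looks roughly like this (a simplified
sketch of what is described above, not the actual powernow-k8 source):

static int powernowk8_target(struct cpufreq_policy *pol,
			     unsigned int targfreq, unsigned int relation)
{
	cpumask_t oldmask = current->cpus_allowed;

	/* migrate to the CPU whose MSRs we need to program */
	set_cpus_allowed(current, cpumask_of_cpu(pol->cpu));
	schedule();		/* sleeps -> "scheduling while atomic" */

	/* ... program fid/vid on the now-local CPU ... */

	set_cpus_allowed(current, oldmask);
	return 0;
}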
Greetings,
Rafael
On Tue, Nov 01, 2005 at 08:07:19PM +0100, Rafael J. Wysocki wrote:
> Hi,
>
> > of taking cpucontrol lock in __cpufreq_driver_target().
>
> Yes, that's it.
>
> > The reason is we now enter the same code path from the cpu_up() and cpu_down()
> > generated cpu notifier callbacks and ends up trying to lock when the
> > call path already has the cpucontrol lock.
> >
> > Its happening because we do set_cpus_allowed() in powernowk8_target().
>
> Unfortunately, powernowk8_target() calls schedule() right after
> set_cpus_allowed(), so it throws "scheduling while atomic" on every call,
> because of the preempt_disable()/_enable() around it.
>
> Greetings,
> Rafael
>
Thanks Rafael,
could you try this patch instead? I hate to keep these state variables,
and that's why I went with the preempt approach; too bad it seems it won't
work for more than just the case you mentioned.
Seems ugly, but I don't find a better-looking cure...
--
Cheers,
Ashok Raj
- Open Source Technology Center
When calling target drivers to set frequency, we used to take the cpucontrol
semaphore. When we modified the code to accommodate CPU hotplug, there was an
attempt to take a double lock, leading to a deadlock, since the code called
from under the CPU hotplug notifiers already holds the cpucontrol lock.
We attempted to circumvent this ugly state keeping by just doing a simple
preempt_disable() and preempt_enable(), but that gets in the way when cpufreq
drivers attempt to do set_cpus_allowed(), since this call will go to sleep.
Now we just keep a state variable, and each notifier path is supposed to set
it so that the cpufreq driver does not take the lock twice.
Signed-off-by: Ashok Raj <[email protected]>
-----------------------------------------------------------------
drivers/cpufreq/cpufreq.c | 35 +++++++++++++++++++----------------
drivers/cpufreq/cpufreq_stats.c | 2 ++
include/linux/cpufreq.h | 2 ++
3 files changed, 23 insertions(+), 16 deletions(-)
Index: linux-2.6.14-rc4-mm1/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-2.6.14-rc4-mm1.orig/drivers/cpufreq/cpufreq.c
+++ linux-2.6.14-rc4-mm1/drivers/cpufreq/cpufreq.c
@@ -38,6 +38,8 @@ static struct cpufreq_driver *cpufreq
static struct cpufreq_policy *cpufreq_cpu_data[NR_CPUS];
static DEFINE_SPINLOCK(cpufreq_driver_lock);
+int cpufreq_hotplug_inprogress;
+EXPORT_SYMBOL_GPL(cpufreq_hotplug_inprogress);
/* internal prototypes */
static int __cpufreq_governor(struct cpufreq_policy *policy, unsigned int event);
@@ -59,6 +61,18 @@ static DECLARE_RWSEM (cpufreq_notifier_
static LIST_HEAD(cpufreq_governor_list);
static DECLARE_MUTEX (cpufreq_governor_sem);
+static void cpufreq_lock_hotplug(void)
+{
+ if (!cpufreq_hotplug_inprogress)
+ lock_cpu_hotplug();
+}
+
+static void cpufreq_unlock_hotplug(void)
+{
+ if (!cpufreq_hotplug_inprogress)
+ unlock_cpu_hotplug();
+}
+
struct cpufreq_policy * cpufreq_cpu_get(unsigned int cpu)
{
struct cpufreq_policy *data;
@@ -1114,25 +1128,12 @@ int __cpufreq_driver_target(struct cpufr
{
int retval = -EINVAL;
- /*
- * Converted the lock_cpu_hotplug to preempt_disable()
- * and preempt enable. This is a bit kludgy and relies on
- * how cpu hotplug works. All we need is a gaurantee that cpu hotplug
- * wont make progress on any cpu. Once we do preempt_disable(), this
- * would ensure hotplug threads dont get on this cpu, thereby delaying
- * the cpu remove process.
- *
- * we removed the lock_cpu_hotplug since we need to call this function via
- * cpu hotplug callbacks, which result in locking the cpu hotplug
- * thread itself. Agree this is not very clean, cpufreq community
- * could improve this if required. - Ashok Raj <[email protected]>
- */
- preempt_disable();
+ cpufreq_lock_hotplug();
dprintk("target for CPU %u: %u kHz, relation %u\n", policy->cpu,
target_freq, relation);
if (cpu_online(policy->cpu) && cpufreq_driver->target)
retval = cpufreq_driver->target(policy, target_freq, relation);
- preempt_enable();
+ cpufreq_unlock_hotplug();
return retval;
}
EXPORT_SYMBOL_GPL(__cpufreq_driver_target);
@@ -1432,8 +1433,9 @@ static int __cpuinit cpufreq_cpu_callbac
struct sys_device *sys_dev;
sys_dev = get_cpu_sysdev(cpu);
-
+
if (sys_dev) {
+ cpufreq_hotplug_inprogress = 1;
switch (action) {
case CPU_ONLINE:
(void) cpufreq_add_dev(sys_dev);
@@ -1455,6 +1457,7 @@ static int __cpuinit cpufreq_cpu_callbac
(void) cpufreq_remove_dev(sys_dev);
break;
}
+ cpufreq_hotplug_inprogress = 0;
}
return NOTIFY_OK;
}
Index: linux-2.6.14-rc4-mm1/drivers/cpufreq/cpufreq_stats.c
===================================================================
--- linux-2.6.14-rc4-mm1.orig/drivers/cpufreq/cpufreq_stats.c
+++ linux-2.6.14-rc4-mm1/drivers/cpufreq/cpufreq_stats.c
@@ -302,6 +302,7 @@ static int __cpuinit cpufreq_stat_cpu_ca
{
unsigned int cpu = (unsigned long)hcpu;
+ cpufreq_hotplug_inprogress = 1;
switch (action) {
case CPU_ONLINE:
cpufreq_update_policy(cpu);
@@ -310,6 +311,7 @@ static int __cpuinit cpufreq_stat_cpu_ca
cpufreq_stats_free_table(cpu);
break;
}
+ cpufreq_hotplug_inprogress = 0;
return NOTIFY_OK;
}
Index: linux-2.6.14-rc4-mm1/include/linux/cpufreq.h
===================================================================
--- linux-2.6.14-rc4-mm1.orig/include/linux/cpufreq.h
+++ linux-2.6.14-rc4-mm1/include/linux/cpufreq.h
@@ -156,6 +156,8 @@ struct cpufreq_governor {
struct module *owner;
};
+extern int cpufreq_hotplug_inprogress;
+
/* pass a target to the cpufreq driver
*/
extern int cpufreq_driver_target(struct cpufreq_policy *policy,
On Tuesday, 1 of November 2005 20:14, Ashok Raj wrote:
> On Tue, Nov 01, 2005 at 08:07:19PM +0100, Rafael J. Wysocki wrote:
> > Hi,
> >
> > > of taking cpucontrol lock in __cpufreq_driver_target().
> >
> > Yes, that's it.
> >
> > > The reason is we now enter the same code path from the cpu_up() and cpu_down()
> > > generated cpu notifier callbacks and ends up trying to lock when the
> > > call path already has the cpucontrol lock.
> > >
> > > Its happening because we do set_cpus_allowed() in powernowk8_target().
> >
> > Unfortunately, powernowk8_target() calls schedule() right after
> > set_cpus_allowed(), so it throws "scheduling while atomic" on every call,
> > because of the preempt_disable()/_enable() around it.
> >
> > Greetings,
> > Rafael
> >
>
> Thanks Rafael,
>
> could you try this patch instead? I hate to keep these state variables
> and thats why i went with the preempt approach, too bad it seems it wont
> work for more than just the case you mentioned.
>
> seems ugly, but i dont find a better looking cure...
It fixes my problem.
Thanks,
Rafael
On Tue, 2005-11-01 at 11:14 -0800, Ashok Raj wrote:
> On Tue, Nov 01, 2005 at 08:07:19PM +0100, Rafael J. Wysocki wrote:
> > Hi,
> >
> > > of taking cpucontrol lock in __cpufreq_driver_target().
> >
> > Yes, that's it.
> >
> > > The reason is we now enter the same code path from the cpu_up() and cpu_down()
> > > generated cpu notifier callbacks and ends up trying to lock when the
> > > call path already has the cpucontrol lock.
> > >
> > > Its happening because we do set_cpus_allowed() in powernowk8_target().
> >
> > Unfortunately, powernowk8_target() calls schedule() right after
> > set_cpus_allowed(), so it throws "scheduling while atomic" on every call,
> > because of the preempt_disable()/_enable() around it.
> >
> > Greetings,
> > Rafael
> >
>
> Thanks Rafael,
>
> could you try this patch instead? I hate to keep these state variables
> and thats why i went with the preempt approach, too bad it seems it wont
> work for more than just the case you mentioned.
>
> seems ugly, but i dont find a better looking cure...
It can't possibly be uglier than disabling preemption to work around a
locking bug.
Lee
On Tue, Nov 01, 2005 at 08:44:10PM +0100, Rafael J. Wysocki wrote:
> On Tuesday, 1 of November 2005 20:14, Ashok Raj wrote:
> > Thanks Rafael,
> >
> > could you try this patch instead? I hate to keep these state variables
> > and thats why i went with the preempt approach, too bad it seems it wont
> > work for more than just the case you mentioned.
> >
> > seems ugly, but i dont find a better looking cure...
>
> It fixes my problem.
>
Thanks,
Andrew: please consider for -mm
--
Cheers,
Ashok Raj
- Open Source Technology Center
On Mon, Oct 31, 2005 at 11:34:13AM -0800, Andrew Morton wrote:
> > Additionally there are some problems with freezing processes by swsusp.
> More details on this?
I've seen this recently. It was kauditd refusing to freeze
for some reason. I've not had a chance to look into the code yet.
Dave
Ashok Raj <[email protected]> wrote:
>
>
> ...
>
> seems ugly, but i dont find a better looking cure...
>
Could you take another look, please? It really is pretty gross.
And the second rule of pretty-gross code is to clearly comment it - draw
attention to it, provide a detailed explanation of what it's doing and why
it's necessary, so the next person who comes along will understand the
problem.
>
> When calling target drivers to set frequency, we used to take cpucontrol
> semaphore. When we modified the code to accomodate CPU hotplug, there was an
> attempt to take a double lock leading to a deadlock. Since the code called from
> under the CPU hotplug notifiers already hold the cpucontrol lock.
>
> we attempted to circumvent this ugly state keeping by just doing a simple
> preempt_disable() and preempt_enable(), but that appears to get in the
> way when cpufreq drivers attempt to do set_cpus_allowed() since this call
> will go to sleep.
>
> Now we just keep a state variable and each notifier path is supposed to set
> this to avoid the cpufreq driver taking double locks.
The way in which we normally address such things is to do:
foo_locked()
{
...
}
foo()
{
lock(a_lock);
foo_locked();
unlock(a_lock);
}
and code which already holds a_lock calls foo_locked().
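Spelled out with real primitives, that pattern is simply (a generic sketch;
a_lock, do_stuff() and foo() are placeholders, not existing kernel symbols):

#include <asm/semaphore.h>

extern void do_stuff(void);

static DECLARE_MUTEX(a_lock);		/* 2.6-era semaphore used as a mutex */

static void foo_locked(void)		/* caller must hold a_lock */
{
	do_stuff();
}

static void foo(void)
{
	down(&a_lock);
	foo_locked();
	up(&a_lock);
}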
Or, worse,
foo(int caller_has_lock)
{
if (!caller_has_lock)
lock(a_lock);
do_stuff();
if (!caller_has_lock)
unlock(a_lock);
}
But a global flag (yes, it's protected by the global a_lock) is a bit over
the top.
> Signed-off-by: Ashok Raj <[email protected]>
> -----------------------------------------------------------------
> drivers/cpufreq/cpufreq.c | 35 +++++++++++++++++++----------------
> drivers/cpufreq/cpufreq_stats.c | 2 ++
> include/linux/cpufreq.h | 2 ++
> 3 files changed, 23 insertions(+), 16 deletions(-)
>
> Index: linux-2.6.14-rc4-mm1/drivers/cpufreq/cpufreq.c
> ===================================================================
> --- linux-2.6.14-rc4-mm1.orig/drivers/cpufreq/cpufreq.c
> +++ linux-2.6.14-rc4-mm1/drivers/cpufreq/cpufreq.c
> @@ -38,6 +38,8 @@ static struct cpufreq_driver *cpufreq
> static struct cpufreq_policy *cpufreq_cpu_data[NR_CPUS];
> static DEFINE_SPINLOCK(cpufreq_driver_lock);
>
> +int cpufreq_hotplug_inprogress;
> +EXPORT_SYMBOL_GPL(cpufreq_hotplug_inprogress);
>
> /* internal prototypes */
> static int __cpufreq_governor(struct cpufreq_policy *policy, unsigned int event);
> @@ -59,6 +61,18 @@ static DECLARE_RWSEM (cpufreq_notifier_
> static LIST_HEAD(cpufreq_governor_list);
> static DECLARE_MUTEX (cpufreq_governor_sem);
>
> +static void cpufreq_lock_hotplug(void)
> +{
> + if (!cpufreq_hotplug_inprogress)
> + lock_cpu_hotplug();
> +}
> +
> +static void cpufreq_unlock_hotplug(void)
> +{
> + if (!cpufreq_hotplug_inprogress)
> + unlock_cpu_hotplug();
> +}
> +
> struct cpufreq_policy * cpufreq_cpu_get(unsigned int cpu)
> {
> struct cpufreq_policy *data;
> @@ -1114,25 +1128,12 @@ int __cpufreq_driver_target(struct cpufr
> {
> int retval = -EINVAL;
>
> - /*
> - * Converted the lock_cpu_hotplug to preempt_disable()
> - * and preempt enable. This is a bit kludgy and relies on
> - * how cpu hotplug works. All we need is a gaurantee that cpu hotplug
> - * wont make progress on any cpu. Once we do preempt_disable(), this
> - * would ensure hotplug threads dont get on this cpu, thereby delaying
> - * the cpu remove process.
> - *
> - * we removed the lock_cpu_hotplug since we need to call this function via
> - * cpu hotplug callbacks, which result in locking the cpu hotplug
> - * thread itself. Agree this is not very clean, cpufreq community
> - * could improve this if required. - Ashok Raj <[email protected]>
> - */
> - preempt_disable();
> + cpufreq_lock_hotplug();
> dprintk("target for CPU %u: %u kHz, relation %u\n", policy->cpu,
> target_freq, relation);
> if (cpu_online(policy->cpu) && cpufreq_driver->target)
> retval = cpufreq_driver->target(policy, target_freq, relation);
> - preempt_enable();
> + cpufreq_unlock_hotplug();
> return retval;
> }
> EXPORT_SYMBOL_GPL(__cpufreq_driver_target);
> @@ -1432,8 +1433,9 @@ static int __cpuinit cpufreq_cpu_callbac
> struct sys_device *sys_dev;
>
> sys_dev = get_cpu_sysdev(cpu);
> -
> +
> if (sys_dev) {
> + cpufreq_hotplug_inprogress = 1;
> switch (action) {
> case CPU_ONLINE:
> (void) cpufreq_add_dev(sys_dev);
> @@ -1455,6 +1457,7 @@ static int __cpuinit cpufreq_cpu_callbac
> (void) cpufreq_remove_dev(sys_dev);
> break;
> }
> + cpufreq_hotplug_inprogress = 0;
> }
> return NOTIFY_OK;
> }
> Index: linux-2.6.14-rc4-mm1/drivers/cpufreq/cpufreq_stats.c
> ===================================================================
> --- linux-2.6.14-rc4-mm1.orig/drivers/cpufreq/cpufreq_stats.c
> +++ linux-2.6.14-rc4-mm1/drivers/cpufreq/cpufreq_stats.c
> @@ -302,6 +302,7 @@ static int __cpuinit cpufreq_stat_cpu_ca
> {
> unsigned int cpu = (unsigned long)hcpu;
>
> + cpufreq_hotplug_inprogress = 1;
> switch (action) {
> case CPU_ONLINE:
> cpufreq_update_policy(cpu);
> @@ -310,6 +311,7 @@ static int __cpuinit cpufreq_stat_cpu_ca
> cpufreq_stats_free_table(cpu);
> break;
> }
> + cpufreq_hotplug_inprogress = 0;
> return NOTIFY_OK;
> }
>
> Index: linux-2.6.14-rc4-mm1/include/linux/cpufreq.h
> ===================================================================
> --- linux-2.6.14-rc4-mm1.orig/include/linux/cpufreq.h
> +++ linux-2.6.14-rc4-mm1/include/linux/cpufreq.h
> @@ -156,6 +156,8 @@ struct cpufreq_governor {
> struct module *owner;
> };
>
> +extern int cpufreq_hotplug_inprogress;
> +
> /* pass a target to the cpufreq driver
> */
> extern int cpufreq_driver_target(struct cpufreq_policy *policy,
On Fri, Nov 04, 2005 at 02:30:35PM -0800, Andrew Morton wrote:
> Ashok Raj <[email protected]> wrote:
> >
> >
> > ...
> >
> > seems ugly, but i dont find a better looking cure...
> >
>
> Could you take another look, please? It really is pretty gross.
Agreed, frankly I didn't like this either, but the calls there are so
nested that I tried multiple approaches, and it appeared they would need
to be broken down quite a bit to do this cleanly.
I did put a really ugly comment in earlier, but I must have removed it in the
next respin.
I was hoping to just make the dirs appear/disappear (so I get the
behaviour I need) and let people more familiar with the code fix it properly.
>
> And the second rule of pretty-gross code is to clearly comment it - draw
>
pretty-gross.... putting it humbly, that is :-) I will attempt to look at it
again.
cpufreq folks... if you have better ideas brewing to make this work
in a pretty-clean way.... many thanks!
Cheers,
ashok
On Fri, Nov 04, 2005 at 02:30:35PM -0800, Andrew Morton wrote:
> Ashok Raj <[email protected]> wrote:
> >
> >
> > ...
> >
> > seems ugly, but i dont find a better looking cure...
> >
>
> Could you take another look, please? It really is pretty gross.
>
So here it is again... thanks to Andrew for suggesting the PF_ avenue.
Sure enough the old one was gross, and there was an even grosser patch
that met infant mortality...
Since the last patch in -git8 broke some implementations, this one is
relative to that earlier patch.
It would be nice to have this in the base tree sooner so cpufreq drivers
don't break.
--------------
When calling target drivers to set frequency, we take the cpucontrol lock.
When we modified the code to accommodate CPU hotplug, there was an
attempt to take a double lock of cpucontrol, leading to a deadlock.
Since the current thread context is already holding the cpucontrol lock,
we don't need to make another attempt to acquire it.
Now we leave a trace in current->flags indicating that the current thread
already holds the cpucontrol lock, so we don't attempt to take it a second time.
Thanks to Andrew Morton for the beating :-)
Signed-off-by: Ashok Raj <[email protected]>
-----------------------------------------------------------------
drivers/cpufreq/cpufreq.c | 24 ++++++++++--------------
include/linux/cpu.h | 1 +
include/linux/sched.h | 1 +
kernel/cpu.c | 33 +++++++++++++++++++++++++++++++++
4 files changed, 45 insertions(+), 14 deletions(-)
Index: linux-2.6.14-rc4-mm1/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-2.6.14-rc4-mm1.orig/drivers/cpufreq/cpufreq.c
+++ linux-2.6.14-rc4-mm1/drivers/cpufreq/cpufreq.c
@@ -38,7 +38,6 @@ static struct cpufreq_driver *cpufreq
static struct cpufreq_policy *cpufreq_cpu_data[NR_CPUS];
static DEFINE_SPINLOCK(cpufreq_driver_lock);
-
/* internal prototypes */
static int __cpufreq_governor(struct cpufreq_policy *policy, unsigned int event);
static void handle_update(void *data);
@@ -1115,24 +1114,21 @@ int __cpufreq_driver_target(struct cpufr
int retval = -EINVAL;
/*
- * Converted the lock_cpu_hotplug to preempt_disable()
- * and preempt enable. This is a bit kludgy and relies on
- * how cpu hotplug works. All we need is a gaurantee that cpu hotplug
- * wont make progress on any cpu. Once we do preempt_disable(), this
- * would ensure hotplug threads dont get on this cpu, thereby delaying
- * the cpu remove process.
- *
- * we removed the lock_cpu_hotplug since we need to call this function via
- * cpu hotplug callbacks, which result in locking the cpu hotplug
- * thread itself. Agree this is not very clean, cpufreq community
- * could improve this if required. - Ashok Raj <[email protected]>
+ * If we are already in context of hotplug thread, we dont need to
+ * acquire the hotplug lock. Otherwise acquire cpucontrol to prevent
+ * hotplug from removing this cpu that we are working on.
*/
- preempt_disable();
+ if (!current_in_cpu_hotplug())
+ lock_cpu_hotplug();
+
dprintk("target for CPU %u: %u kHz, relation %u\n", policy->cpu,
target_freq, relation);
if (cpu_online(policy->cpu) && cpufreq_driver->target)
retval = cpufreq_driver->target(policy, target_freq, relation);
- preempt_enable();
+
+ if (!current_in_cpu_hotplug())
+ unlock_cpu_hotplug();
+
return retval;
}
EXPORT_SYMBOL_GPL(__cpufreq_driver_target);
Index: linux-2.6.14-rc4-mm1/include/linux/cpu.h
===================================================================
--- linux-2.6.14-rc4-mm1.orig/include/linux/cpu.h
+++ linux-2.6.14-rc4-mm1/include/linux/cpu.h
@@ -33,6 +33,7 @@ struct cpu {
extern int register_cpu(struct cpu *, int, struct node *);
extern struct sys_device *get_cpu_sysdev(int cpu);
+extern int current_in_cpu_hotplug(void);
#ifdef CONFIG_HOTPLUG_CPU
extern void unregister_cpu(struct cpu *, struct node *);
#endif
Index: linux-2.6.14-rc4-mm1/include/linux/sched.h
===================================================================
--- linux-2.6.14-rc4-mm1.orig/include/linux/sched.h
+++ linux-2.6.14-rc4-mm1/include/linux/sched.h
@@ -880,6 +880,7 @@ do { if (atomic_dec_and_test(&(tsk)->usa
#define PF_SYNCWRITE 0x00200000 /* I am doing a sync write */
#define PF_BORROWED_MM 0x00400000 /* I am a kthread doing use_mm */
#define PF_RANDOMIZE 0x00800000 /* randomize virtual address space */
+#define PF_HOTPLUG_CPU 0x01000000 /* Currently performing CPU hotplug */
/*
* Only the _current_ task can read/write to tsk->flags, but other
Index: linux-2.6.14-rc4-mm1/kernel/cpu.c
===================================================================
--- linux-2.6.14-rc4-mm1.orig/kernel/cpu.c
+++ linux-2.6.14-rc4-mm1/kernel/cpu.c
@@ -20,6 +20,24 @@ DECLARE_MUTEX(cpucontrol);
static struct notifier_block *cpu_chain;
+/*
+ * Used to check by callers if they need to acquire the cpucontrol
+ * or not to protect a cpu from being removed. Its sometimes required to
+ * call these functions both for normal operations, and in response to
+ * a cpu being added/removed. If the context of the call is in the same
+ * thread context as a CPU hotplug thread, we dont need to take the lock
+ * since its already protected
+ * check drivers/cpufreq/cpufreq.c for its usage - Ashok Raj
+ */
+
+int current_in_cpu_hotplug(void)
+{
+ return (current->flags & PF_HOTPLUG_CPU);
+}
+
+EXPORT_SYMBOL_GPL(current_in_cpu_hotplug);
+
+
/* Need to know about CPUs going up/down? */
int register_cpu_notifier(struct notifier_block *nb)
{
@@ -93,6 +111,13 @@ int cpu_down(unsigned int cpu)
goto out;
}
+ /*
+ * Leave a trace in current->flags indicating we are already in
+ * process of performing CPU hotplug. Callers can check if cpucontrol
+ * is already acquired by current thread, and if so not cause
+ * a dead lock by not acquiring the lock
+ */
+ current->flags |= PF_HOTPLUG_CPU;
err = notifier_call_chain(&cpu_chain, CPU_DOWN_PREPARE,
(void *)(long)cpu);
if (err == NOTIFY_BAD) {
@@ -145,6 +170,7 @@ out_thread:
out_allowed:
set_cpus_allowed(current, old_allowed);
out:
+ current->flags &= ~PF_HOTPLUG_CPU;
unlock_cpu_hotplug();
return err;
}
@@ -162,6 +188,12 @@ int __devinit cpu_up(unsigned int cpu)
ret = -EINVAL;
goto out;
}
+
+ /*
+ * Leave a trace in current->flags indicating we are already in
+ * process of performing CPU hotplug.
+ */
+ current->flags |= PF_HOTPLUG_CPU;
ret = notifier_call_chain(&cpu_chain, CPU_UP_PREPARE, hcpu);
if (ret == NOTIFY_BAD) {
printk("%s: attempt to bring up CPU %u failed\n",
@@ -184,6 +216,7 @@ out_notify:
if (ret != 0)
notifier_call_chain(&cpu_chain, CPU_UP_CANCELED, hcpu);
out:
+ current->flags &= ~PF_HOTPLUG_CPU;
up(&cpucontrol);
return ret;
}
Ashok Raj <[email protected]> wrote:
>
> Now we leave a trace in current->flags indicating current thread already
> is under cpucontrol lock held, so we dont attempt to do this another time.
>
> ..
> +#define PF_HOTPLUG_CPU 0x01000000 /* Currently performing CPU hotplug */
>
It's still hacky - I mean, we could use this trick to avoid recursion onto
any lock in the kernel whenever we get ourselves into a mess. We'd gain an
awful lot of PF_* flags.
So we should still view this as a temporary fix.
I don't think I've seen an analysis of the actual deadlock yet. Are you
able to provide a stack trace of the offending callpath?
On Sat, Nov 05, 2005 at 03:33:04PM -0800, Andrew Morton wrote:
> Ashok Raj <[email protected]> wrote:
> >
> > Now we leave a trace in current->flags indicating current thread already
> > is under cpucontrol lock held, so we dont attempt to do this another time.
> >
> > ..
> > +#define PF_HOTPLUG_CPU 0x01000000 /* Currently performing CPU hotplug */
> >
>
> It's still hacky - I mean, we could use this trick to avoid recursion onto
> any lock in the kernel whenever we get ourselves into a mess. We'd gain an
> awful lot of PF_* flags.
>
> So we should still view this as a temporary fix.
>
> I don't think I've seen an analysis of the actual deadlock yet. Are you
> able to provide a stack trace of the offending callpath?
Hi Andrew,
We call the exact same functions in cpufreq during startup and in
response to cpu hotplug events, to create or destroy the
sysfs entries /sys/devices/system/cpu/cpuX/cpufreq/* (cpufreq_add_dev()).
The problem is that cpufreq_set_policy() eventually ends up calling
__cpufreq_driver_target() during CPU_ONLINE and CPU_DOWN_PREPARE,
and that function takes the cpucontrol lock.
Since, when we are already in the cpu notifier callbacks, cpucontrol is
held by the cpu_up() or cpu_down() that invoked them, this causes the
double lock.
I don't see a panic; it's just that when we try to do cpu_down or cpu_up,
that thread locks up waiting to acquire cpucontrol a second time.
Hope this helps.
--
Cheers,
Ashok Raj
- Open Source Technology Center
Ashok Raj <[email protected]> wrote:
>
> On Sat, Nov 05, 2005 at 03:33:04PM -0800, Andrew Morton wrote:
> > Ashok Raj <[email protected]> wrote:
> > >
> > > Now we leave a trace in current->flags indicating current thread already
> > > is under cpucontrol lock held, so we dont attempt to do this another time.
> > >
> > > ..
> > > +#define PF_HOTPLUG_CPU 0x01000000 /* Currently performing CPU hotplug */
> > >
> >
> > It's still hacky - I mean, we could use this trick to avoid recursion onto
> > any lock in the kernel whenever we get ourselves into a mess. We'd gain an
> > awful lot of PF_* flags.
> >
> > So we should still view this as a temporary fix.
> >
> > I don't think I've seen an analysis of the actual deadlock yet. Are you
> > able to provide a stack trace of the offending callpath?
>
> Hi Andrew,
>
> we call the exact same functions in cpufreq during startup and in
> response to cpu hotplug events, to create or destroy
> sysfs entries /sys/devices/system/cpu/cpuX/cpufreq/*. cpufreq_add_dev().
>
> problem is cpufreq_set_policy() eventually ends up calling
> __cpufreq_driver_target() during the CPU_ONLINE, and CPU_DOWN_PREPARE
> that takes cpucontrol lock.
>
> Since when we already in the cpu notifier callbacks, cpucontrol is already
> held by the cpu_up() or the cpu_down() that caused the double lock.
I appreciate that, but was hoping to see the call graph for the deadlock,
with sysrq-t.
See, one fix might be to take the lock_cpu_hotplug() call out of
__cpufreq_driver_target() and move it higher up the cpufreq call tree,
into the lowest-level caller which is _above_ the function which the
hotplug code calls.
Like this:

    hotplug -> down() -> ... ->
                                 __cpufreq_driver_target()
    cpufreq -> down() -> ... ->

and not like this:

    hotplug -> down() -> ... ->
                                 __cpufreq_driver_target() -> down()
    cpufreq -> ... ->
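Concretely, that would mean __cpufreq_driver_target() takes no hotplug lock
itself, and the outer cpufreq entry point does, roughly (a sketch of the
suggested shape, not a patch that has been posted in this thread):

/* outer entry point, used by cpufreq itself (governors, sysfs, ...) */
int cpufreq_driver_target(struct cpufreq_policy *policy,
			  unsigned int target_freq,
			  unsigned int relation)
{
	int ret;

	lock_cpu_hotplug();		/* takes cpucontrol */
	ret = __cpufreq_driver_target(policy, target_freq, relation);
	unlock_cpu_hotplug();

	return ret;
}

/*
 * __cpufreq_driver_target() itself then takes no hotplug lock and relies on
 * its callers: either the wrapper above, or the hotplug notifier path,
 * which already holds cpucontrol.
 */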
On Sat, 5 Nov 2005 15:54:07 -0800,
Ashok Raj <[email protected]> wrote:
>Since when we already in the cpu notifier callbacks, cpucontrol is already
>held by the cpu_up() or the cpu_down() that caused the double lock.
Add the owning cpu id to the lock. spin_trylock() first; if the lock
is already held, then check whether the current cpu owns the lock. If it
does, then continue, noting (in a local variable) that you do not need
to drop the lock on exit from this recursion level.
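As a rough illustration of that idea (hypothetical names; it glosses over the
fact that cpucontrol is actually a semaphore, and that checking the owner from
preemptible context needs more care than shown here):

#include <linux/spinlock.h>
#include <linux/smp.h>

static DEFINE_SPINLOCK(ctrl_lock);
static int ctrl_owner = -1;		/* CPU currently holding ctrl_lock */

/* Returns nonzero if we took the lock and must release it ourselves. */
static int ctrl_lock_maybe(void)
{
	if (spin_trylock(&ctrl_lock)) {
		ctrl_owner = smp_processor_id();
		return 1;
	}
	/*
	 * Lock is held.  If this CPU is the holder, we are recursing, so
	 * do not take it again.  (A real version would need to do this
	 * check with preemption disabled.)
	 */
	if (ctrl_owner == smp_processor_id())
		return 0;
	spin_lock(&ctrl_lock);		/* contended: wait our turn */
	ctrl_owner = smp_processor_id();
	return 1;
}

static void ctrl_unlock_maybe(int locked)
{
	if (locked) {
		ctrl_owner = -1;
		spin_unlock(&ctrl_lock);
	}
}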