2008-01-18 01:05:29

by Steven Rostedt

[permalink] [raw]
Subject: [PATCH] printk deadlocks if called with runqueue lock held


I thought that one could place a printk anywhere without worrying.
But it seems that it is not wise to place a printk where the runqueue
lock is held.

I just spent two hours debugging why some of my code was locking up,
to find that the lockup was caused by some debugging printk's that
I had in the scheduler. The printk's were only in rare paths so
they shouldn't be too much of a problem, but after I hit the printk
the system locked up.

Thinking that it was locking up on my code I went looking down the
wrong path. I finally found (after examining an NMI dump) that
the lockup happened because printk was trying to wakeup the klogd
daemon, which caused a deadlock when the try_to_wakeup code tries
to grab the runqueue lock.

Since printks are seldom called with interrupts disabled, we can
hold off the waking of klogd if they are. We don't have access to
the runqueue locks from printk, but those locks need interrupts
disabled in order to be held.

Calling printk with interrupts disabled should only be done for
emergencies and debugging anyway.

And with this patch, my code ran fine ;-)

Signed-off-by: Steven Rostedt <[email protected]>
---
kernel/printk.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

Index: linux-mcount.git/kernel/printk.c
===================================================================
--- linux-mcount.git.orig/kernel/printk.c 2008-01-17 09:06:23.000000000 -0500
+++ linux-mcount.git/kernel/printk.c 2008-01-17 19:56:59.000000000 -0500
@@ -978,7 +978,13 @@ void release_console_sem(void)
console_locked = 0;
up(&console_sem);
spin_unlock_irqrestore(&logbuf_lock, flags);
- if (wake_klogd)
+ /*
+ * If we try to wake up klogd while printing with the runqueue lock
+ * held, this will deadlock. We don't have access to the runqueue
+ * lock from here, but just checking for interrupts disabled
+ * should be enough.
+ */
+ if (!irqs_disabled() && wake_klogd)
wake_up_klogd();
}
EXPORT_SYMBOL(release_console_sem);


2008-01-18 01:20:49

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] printk deadlocks if called with runqueue lock held

On Thu, 17 Jan 2008 20:04:27 -0500 (EST)
Steven Rostedt <[email protected]> wrote:

>
> I thought that one could place a printk anywhere without worrying.
> But it seems that it is not wise to place a printk where the runqueue
> lock is held.
>
> I just spent two hours debugging why some of my code was locking up,
> to find that the lockup was caused by some debugging printk's that
> I had in the scheduler. The printk's were only in rare paths so
> they shouldn't be too much of a problem, but after I hit the printk
> the system locked up.
>
> Thinking that it was locking up on my code I went looking down the
> wrong path. I finally found (after examining an NMI dump) that
> the lockup happened because printk was trying to wakeup the klogd
> daemon, which caused a deadlock when the try_to_wakeup code tries
> to grab the runqueue lock.

A "well-known" problem which few know about ;)

Anyway you should be developing with all debug options enabled and that
includes NMI watchdog so there.

> Since printks are seldom called with interrupts disabled, we can
> hold off the waking of klogd if they are. We don't have access to
> the runqueue locks from printk, but those locks need interrupts
> disabled in order to be held.
>
> Calling printk with interrupts disabled should only be done for
> emergencies and debugging anyway.
>
> And with this patch, my code ran fine ;-)
>
> Signed-off-by: Steven Rostedt <[email protected]>
> ---
> kernel/printk.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> Index: linux-mcount.git/kernel/printk.c
> ===================================================================
> --- linux-mcount.git.orig/kernel/printk.c 2008-01-17 09:06:23.000000000 -0500
> +++ linux-mcount.git/kernel/printk.c 2008-01-17 19:56:59.000000000 -0500
> @@ -978,7 +978,13 @@ void release_console_sem(void)
> console_locked = 0;
> up(&console_sem);
> spin_unlock_irqrestore(&logbuf_lock, flags);
> - if (wake_klogd)
> + /*
> + * If we try to wake up klogd while printing with the runqueue lock
> + * held, this will deadlock. We don't have access to the runqueue
> + * lock from here, but just checking for interrupts disabled
> + * should be enough.
> + */
> + if (!irqs_disabled() && wake_klogd)
> wake_up_klogd();
> }
> EXPORT_SYMBOL(release_console_sem);

this looks fairly foul. Might cause problems if one CPU is stuck with
interrupts off spewing printks?

Couldn't you maintain a sched-hackers-only printk patch which adds a
sched_printk() which avoids the wakeup or something like that?

2008-01-18 01:33:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] printk deadlocks if called with runqueue lock held



On Thu, 17 Jan 2008, Steven Rostedt wrote:
>
> Calling printk with interrupts disabled should only be done for
> emergencies and debugging anyway.

Agreed, I think this is sane.

> And with this patch, my code ran fine ;-)

Patch looks fine, but I'm saddened:

> + /*
> + * If we try to wake up klogd while printing with the runqueue lock
> + * held, this will deadlock. We don't have access to the runqueue
> + * lock from here, but just checking for interrupts disabled
> + * should be enough.
> + */
> + if (!irqs_disabled() && wake_klogd)
> wake_up_klogd();

..because we actually *have* the old hardware irq enable/disable value in
in that "flags" parameter we just restored, so it's more than a bit sad to
then ask again whether they were disabled or not!

IOW, I think this should be

if (raw_irqs_disabled_flags(flags) && wake_klogd)
wake_up_klogd();

Of course, not all architectures seem to suport that thing (it's currently
only used by the CONFIG_TRACE_IRQFLAGS config option).

Damn. So close, yet so far away.

Linus

2008-01-18 01:53:55

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] printk deadlocks if called with runqueue lock held


On Thu, 17 Jan 2008, Linus Torvalds wrote:

> IOW, I think this should be
>
> if (raw_irqs_disabled_flags(flags) && wake_klogd)
> wake_up_klogd();
>
> Of course, not all architectures seem to suport that thing (it's currently
> only used by the CONFIG_TRACE_IRQFLAGS config option).
>
> Damn. So close, yet so far away.

Perhaps we can define something like:

#ifdef CONFIG_HAS_RAW_IRQS_DISABLED_FLAGS
# define test_irqs_disabled_flags(flags) raw_irqs_disabled_flags(flags)
#else
# define test_irqs_disabled_flags(flags) \
({ (void)flags; irqs_disabled();})
#endif

then we could do

if (test_irqs_disabled_flags(flags) && wake_klogd)
wake_up_klogd();

Or is this asking too much ;-)

-- Steve

2008-01-18 01:59:18

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] printk deadlocks if called with runqueue lock held


On Thu, 17 Jan 2008, Andrew Morton wrote:
>
> A "well-known" problem which few know about ;)

And now I feel less ignorant.

>
> Anyway you should be developing with all debug options enabled and that
> includes NMI watchdog so there.

hehe, I did have NMI on the whole time. It's that I also had modifications
in try_to_wakeup, so seeing that in the backtrace, I thought I was the one
screwing up. But I was still certain that it was my code locking up.


> >
> > Calling printk with interrupts disabled should only be done for
> > emergencies and debugging anyway.
> >

> > + * held, this will deadlock. We don't have access to the runqueue
> > + * lock from here, but just checking for interrupts disabled
> > + * should be enough.
> > + */
> > + if (!irqs_disabled() && wake_klogd)
> > wake_up_klogd();
> > }
> > EXPORT_SYMBOL(release_console_sem);
>
> this looks fairly foul. Might cause problems if one CPU is stuck with
> interrupts off spewing printks?

Well, we have more than one problem if that is happening.

>
> Couldn't you maintain a sched-hackers-only printk patch which adds a
> sched_printk() which avoids the wakeup or something like that?
>

Actually, the code that did trigger it was part of the latency_tracer
code doing the wakeup latency timings. Theres a printk that happens when
we trigger the max latency, and it prints from the scheduler.

So this was perfect to lock up every time.

-- Steve

2008-01-18 11:05:51

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH] printk deadlocks if called with runqueue lock held

On Thu, 17 Jan 2008, Steven Rostedt wrote:

> Thinking that it was locking up on my code I went looking down the wrong
> path. I finally found (after examining an NMI dump) that the lockup
> happened because printk was trying to wakeup the klogd daemon, which
> caused a deadlock when the try_to_wakeup code tries to grab the runqueue
> lock.

... which I have documented in the printk() comment's in commit 1492192b
:)

> Since printks are seldom called with interrupts disabled, we can
> hold off the waking of klogd if they are. We don't have access to
> the runqueue locks from printk, but those locks need interrupts
> disabled in order to be held.

If this patch is going to be merged, you should perhaps adjust the comment
introduced by the above mentioned commit, so that it reflects the new
behavior.

Thanks,

--
Jiri Kosina

2008-01-18 11:36:21

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] printk deadlocks if called with runqueue lock held


On Fri, 18 Jan 2008, Jiri Kosina wrote:
>
> If this patch is going to be merged, you should perhaps adjust the comment
> introduced by the above mentioned commit, so that it reflects the new
> behavior.

Thanks for pointing this out. Updated patch below:

-- Steve
=========

I thought that one could place a printk anywhere without worrying.
But it seems that it is not wise to place a printk where the runqueue
lock is held.

I just spent two hours debugging why some of my code was locking up,
to find that the lockup was caused by some debugging printk's that
I had in the scheduler. The printk's were only in rare paths so
they shouldn't be too much of a problem, but after I hit the printk
the system locked up.

Thinking that it was locking up on my code I went looking down the
wrong path. I finally found (after examining an NMI dump) that
the lockup happened because printk was trying to wakeup the klogd
daemon, which caused a deadlock when the try_to_wakeup code tries
to grab the runqueue lock.

Since printks are seldom called with interrupts disabled, we can
hold off the waking of klogd if they are. We don't have access to
the runqueue locks from printk, but those locks need interrupts
disabled in order to be held.

Calling printk with interrupts disabled should only be done for
emergencies and debugging anyway.

And with this patch, my code ran fine ;-)

Signed-off-by: Steven Rostedt <[email protected]>
---
kernel/printk.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)

Index: linux-mcount.git/kernel/printk.c
===================================================================
--- linux-mcount.git.orig/kernel/printk.c 2008-01-18 06:29:15.000000000 -0500
+++ linux-mcount.git/kernel/printk.c 2008-01-18 06:32:38.000000000 -0500
@@ -595,9 +595,11 @@ static int have_callable_console(void)
* @fmt: format string
*
* This is printk(). It can be called from any context. We want it to work.
- * Be aware of the fact that if oops_in_progress is not set, we might try to
- * wake klogd up which could deadlock on runqueue lock if printk() is called
- * from scheduler code.
+ *
+ * Note: if printk() is called with interrupts disabled, it will not wake
+ * up the klogd. This is to avoid a deadlock from calling printk() in schedule
+ * with the runqueue lock held and having the wake_up grab the runqueue lock
+ * as well.
*
* We try to grab the console_sem. If we succeed, it's easy - we log the output and
* call the console drivers. If we fail to get the semaphore we place the output
@@ -978,7 +980,13 @@ void release_console_sem(void)
console_locked = 0;
up(&console_sem);
spin_unlock_irqrestore(&logbuf_lock, flags);
- if (wake_klogd)
+ /*
+ * If we try to wake up klogd while printing with the runqueue lock
+ * held, this will deadlock. We don't have access to the runqueue
+ * lock from here, but just checking for interrupts disabled
+ * should be enough.
+ */
+ if (!irqs_disabled() && wake_klogd)
wake_up_klogd();
}
EXPORT_SYMBOL(release_console_sem);

2008-01-18 14:56:18

by Jan Kiszka

[permalink] [raw]
Subject: Re: [PATCH] printk deadlocks if called with runqueue lock held

Steven Rostedt wrote:
...
> @@ -978,7 +980,13 @@ void release_console_sem(void)
> console_locked = 0;
> up(&console_sem);

Hmm, just looking at this fragment: Doesn't up() include the risk of
running onto the runqueue lock as well?

> spin_unlock_irqrestore(&logbuf_lock, flags);
> - if (wake_klogd)
> + /*
> + * If we try to wake up klogd while printing with the runqueue lock
> + * held, this will deadlock. We don't have access to the runqueue
> + * lock from here, but just checking for interrupts disabled
> + * should be enough.
> + */
> + if (!irqs_disabled() && wake_klogd)
> wake_up_klogd();
> }
> EXPORT_SYMBOL(release_console_sem);

Jan

2008-01-18 16:29:34

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] printk deadlocks if called with runqueue lock held


On Fri, 18 Jan 2008, Jan Kiszka wrote:

> Steven Rostedt wrote:
> ....
> > @@ -978,7 +980,13 @@ void release_console_sem(void)
> > console_locked = 0;
> > up(&console_sem);
>
> Hmm, just looking at this fragment: Doesn't up() include the risk of
> running onto the runqueue lock as well?

Very little risk (if any). If printk fails to get the console_sem it
doesn't block. So there would be no waiters on the semaphore, and thus
try_to_wake_up would not be called. The only place I see the down
actually being called is in suspend code, and even then, we would need to
lock the rq of the task that is trying to grab the console_sem and the
deadlock would only occur if that was on the same CPU. And honestly, I'm
not sure that's even possible.

-- Steve


>
> > spin_unlock_irqrestore(&logbuf_lock, flags);
> > - if (wake_klogd)
> > + /*
> > + * If we try to wake up klogd while printing with the runqueue lock
> > + * held, this will deadlock. We don't have access to the runqueue
> > + * lock from here, but just checking for interrupts disabled
> > + * should be enough.
> > + */
> > + if (!irqs_disabled() && wake_klogd)
> > wake_up_klogd();
> > }
> > EXPORT_SYMBOL(release_console_sem);
>
> Jan
>

2008-01-18 17:32:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] printk deadlocks if called with runqueue lock held



On Fri, 18 Jan 2008, Jan Kiszka wrote:
> Steven Rostedt wrote:
> ...
> > @@ -978,7 +980,13 @@ void release_console_sem(void)
> > console_locked = 0;
> > up(&console_sem);
>
> Hmm, just looking at this fragment: Doesn't up() include the risk of
> running onto the runqueue lock as well?

In theory yes.

I suspect it would never ever be a problem in practice (the case we care
about is running with interrupts disabled, and we got it with
down_trylock()), so as this is only about essentially custon debug or oops
things anyway, it's probably not worth fixing.

That said, you're definitely right in theory.

But *IF* we want to fix the almost certainly purely theoretical problem,
it would be possible but fairly ugly.

We'd need to (a) make it a mutex rather than a semaphore (which is
definitely not the ugly part), and then - the ugly part - (b) expose a
whole new mutex interface: an enhanched version of "mutex_trylock()" that
*also* keeps the mutex spinlock locked, and then instead of using
"mutex_unlock()" we'd use a special "mutex_unlock_atomic()" that knows the
spinlock was held over the whole time.

So it would then use something like

if (mutex_trylock_atomic(..)) {

mutex_unlock_atomic(..);
}

and that would work out ok.

It's likely not that hard, and in fact this may be why "console_sem" was
never converted to a mutex: I think Ingo tried, but it didn't work right
with debugging enabled, and I can well imagine that it was all due to this
issue. But if we do those _atomic() versions, we'd probably fix that
problem.

So maybe the "ugly new interface" would actually be a cleanup in the long
run, by possibly fixing the fact that things just *happened* to work with
semaphores because they didn't do the fancy debug version..

Ingo?

Linus

2008-01-23 18:37:33

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] printk deadlocks if called with runqueue lock held



On Wed, 23 Jan 2008, Pavel Machek wrote:
>
> Could try_to_wakeup use trylock, and only avoid wakeup if lock is
> already held?
> Pavel

I could imagine a lot of missed wakeups caused by that.

-- Steve

2008-01-23 18:22:58

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] printk deadlocks if called with runqueue lock held

Hi!

> I thought that one could place a printk anywhere without worrying.
> But it seems that it is not wise to place a printk where the runqueue
> lock is held.
>
> I just spent two hours debugging why some of my code was locking up,
> to find that the lockup was caused by some debugging printk's that
> I had in the scheduler. The printk's were only in rare paths so
> they shouldn't be too much of a problem, but after I hit the printk
> the system locked up.
>
> Thinking that it was locking up on my code I went looking down the
> wrong path. I finally found (after examining an NMI dump) that
> the lockup happened because printk was trying to wakeup the klogd
> daemon, which caused a deadlock when the try_to_wakeup code tries
> to grab the runqueue lock.

Could try_to_wakeup use trylock, and only avoid wakeup if lock is
already held?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-01-24 13:36:59

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] printk deadlocks if called with runqueue lock held

On Wed 2008-01-23 13:36:12, Steven Rostedt wrote:
>
>
> On Wed, 23 Jan 2008, Pavel Machek wrote:
> >
> > Could try_to_wakeup use trylock, and only avoid wakeup if lock is
> > already held?
> > Pavel
>
> I could imagine a lot of missed wakeups caused by that.

...alternative being suggested is to miss _all_ the wakeups. So
"missing some" seems like better alternative.

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html