2011-04-02 11:47:18

by Xiaotian Feng

[permalink] [raw]
Subject: BUG: sleeping function called from invalid context at kernel/mutex.c

Hi,

I've got following warnings when I'm trying to resume from suspend, so
sparse_lock is not safe enough as a mutex for the suspend resume case, can
we simply convert sparse_lock back to spinlock?

BUG: sleeping function called from invalid context at kernel/mutex.c:278
in_atomic(): 1, irqs_disabled(): 1, pid: 2345, name: bash
INFO: lockdep is turned off.
irq event stamp: 0
hardirqs last enabled at (0): [< (null)>] (null)
hardirqs last disabled at (0): [<ffffffff81068391>]
copy_process+0x4d1/0x1070
softirqs last enabled at (0): [<ffffffff81068391>]
copy_process+0x4d1/0x1070
softirqs last disabled at (0): [< (null)>] (null)
Pid: 2345, comm: bash Tainted: G I 2.6.39-rc1+ #59
Call Trace:
[<ffffffff81052d77>] __might_sleep+0x107/0x140
[<ffffffff816a57df>] mutex_lock_nested+0x2f/0x50
[<ffffffff810e191a>] irq_reserve_irqs+0x4a/0xa0
[<ffffffff810e18a8>] ? __irq_put_desc_unlock+0x28/0x50
[<ffffffff810e4e78>] irq_set_chip+0x58/0x70
[<ffffffff810e4fc0>] ? handle_level_irq+0xe0/0xe0
[<ffffffff810e4eb6>] irq_set_chip_and_handler_name+0x26/0x50
[<ffffffff8102fbb0>] arch_setup_hpet_msi+0xa0/0xf0
[<ffffffff8103613c>] hpet_setup_msi_irq+0x1c/0x40
[<ffffffff81036ad3>] hpet_set_mode+0x1a3/0x1f0
[<ffffffff81036b34>] hpet_msi_set_mode+0x14/0x20
[<ffffffff810a03aa>] clockevents_set_mode+0x2a/0x70
[<ffffffff810a0693>] tick_resume+0x53/0xa0
[<ffffffff810a0cbd>] tick_notify+0x1cd/0x1e0
[<ffffffff810a00a8>] ? clockevents_notify+0x28/0x140
[<ffffffff816ab1f4>] notifier_call_chain+0x94/0xd0
[<ffffffff810970c6>] raw_notifier_call_chain+0x16/0x20
[<ffffffff810a00bd>] clockevents_notify+0x3d/0x140
[<ffffffff8109c000>] timekeeping_resume+0x190/0x1d0
[<ffffffff813c5fd5>] syscore_resume+0x45/0xf0
[<ffffffff810bb977>] suspend_enter+0xe7/0x120
[<ffffffff810bba7e>] suspend_devices_and_enter+0x8e/0x2c0
[<ffffffff810bbd86>] enter_state+0xd6/0x130
[<ffffffff810baf99>] state_store+0x99/0x100
[<ffffffff812c82b7>] kobj_attr_store+0x17/0x20
[<ffffffff811fc6f3>] sysfs_write_file+0xa3/0x100
[<ffffffff81187310>] vfs_write+0xd0/0x1a0
[<ffffffff811874e4>] sys_write+0x54/0xa0
[<ffffffff816af382>] system_call_fastpath+0x16/0x1b

Thanks
Xiaotian


2011-04-02 19:12:52

by Thomas Gleixner

[permalink] [raw]
Subject: Re: BUG: sleeping function called from invalid context at kernel/mutex.c

On Sat, 2 Apr 2011, Xiaotian Feng wrote:
> I've got following warnings when I'm trying to resume from suspend, so
> sparse_lock is not safe enough as a mutex for the suspend resume case, can
> we simply convert sparse_lock back to spinlock?

What the heck? We do not just convert a mutex to a spinlock because a
warning triggers. And it was a mutex already in 2.6.38 w/o any such
problems.

And it is safe enough for the resume case as well, it simply cannot go
to sleep because nothing else can hold that mutex in the early resume
code. We take the mutex during early boot as well, but there the might
sleep checks are disabled. And we should do the very same thing in the
early resume code as well. Rafael, PeterZ ???

But that's a different issue and not the real problem.

So now instead of thinking about spinlock it would have been pretty
simple to figure out that the warning is triggered due to commit
d72274e5.

But of course we dont want to revert d72274e5 either, though it's easy
to figure out from the call chain that the fundamental stupidity is in
the hpet resume code.

Why the hell does it call arch_setup_hpet_msi()? That interrupt was
completely setup _BEFORE_ we went into suspend. And I don't see any
code which tears down the interrupt on suspend. So we run through a
full setup for nothing. And when we have interrupt remapping enabled
it even leaks an irte, because it allocates a new one. Utter crap.
init_one_hpet_msi_clockevent() has the same stupid call.

So no, the mutex stays a mutex and we fix crap where the crap is.

The following completely untested patch should solve that. Actually we
should be more clever and do the affinity magic in the new function as
well, so we can get rid of this weird disable/set_affinity/enable
hack, but that's a different story.

Thanks,

tglx
---
arch/x86/include/asm/hpet.h | 2 ++
arch/x86/kernel/apic/io_apic.c | 11 +++++++++++
arch/x86/kernel/hpet.c | 5 +++--
3 files changed, 16 insertions(+), 2 deletions(-)

Index: linux-2.6/arch/x86/include/asm/hpet.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/hpet.h
+++ linux-2.6/arch/x86/include/asm/hpet.h
@@ -83,11 +83,13 @@ extern void hpet_msi_read(struct hpet_de

#ifdef CONFIG_PCI_MSI
extern int arch_setup_hpet_msi(unsigned int irq, unsigned int id);
+extern int arch_start_hpet_msi(unsigned int irq, unsigned int id);
#else
static inline int arch_setup_hpet_msi(unsigned int irq, unsigned int id)
{
return -EINVAL;
}
+static inline int arch_start_hpet_msi(unsigned int irq, unsigned int id) { }
#endif

#ifdef CONFIG_HPET_EMULATE_RTC
Index: linux-2.6/arch/x86/kernel/apic/io_apic.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/apic/io_apic.c
+++ linux-2.6/arch/x86/kernel/apic/io_apic.c
@@ -3477,6 +3477,17 @@ int arch_setup_hpet_msi(unsigned int irq
irq_set_chip_and_handler_name(irq, chip, handle_edge_irq, "edge");
return 0;
}
+
+int arch_start_hpet_msi(unsigned int irq, unsigned int id)
+{
+ struct msi_msg msg;
+ int ret = msi_compose_msg(NULL, irq, &msg, id);
+
+ if (!ret)
+ hpet_msi_write(irq_get_handler_data(irq), &msg);
+ return ret;
+}
+
#endif

#endif /* CONFIG_PCI_MSI */
Index: linux-2.6/arch/x86/kernel/hpet.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/hpet.c
+++ linux-2.6/arch/x86/kernel/hpet.c
@@ -370,7 +370,8 @@ static void hpet_set_mode(enum clock_eve
hpet_enable_legacy_int();
} else {
struct hpet_dev *hdev = EVT_TO_HPET_DEV(evt);
- hpet_setup_msi_irq(hdev->irq);
+
+ arch_start_hpet_msi(hdev->irq, hpet_blockid);
disable_irq(hdev->irq);
irq_set_affinity(hdev->irq, cpumask_of(hdev->cpu));
enable_irq(hdev->irq);
@@ -555,7 +556,7 @@ static void init_one_hpet_msi_clockevent
if (!(hdev->flags & HPET_DEV_VALID))
return;

- if (hpet_setup_msi_irq(hdev->irq))
+ if (arch_start_hpet_msi(hdev->irq, hpet_blockid))
return;

hdev->cpu = cpu;

2011-04-05 07:41:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: BUG: sleeping function called from invalid context at kernel/mutex.c

On Sat, 2011-04-02 at 21:12 +0200, Thomas Gleixner wrote:
>
> And it is safe enough for the resume case as well, it simply cannot go
> to sleep because nothing else can hold that mutex in the early resume
> code. We take the mutex during early boot as well, but there the might
> sleep checks are disabled. And we should do the very same thing in the
> early resume code as well. Rafael, PeterZ ???

the $subject text sound like it triggered might_sleep(), and that had a
system_state != SYSTEM_RUNNING bail condition, but then, I've no clue
what resume looks like.

2011-04-06 04:43:15

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: BUG: sleeping function called from invalid context at kernel/mutex.c

On Tuesday, April 05, 2011, Peter Zijlstra wrote:
> On Sat, 2011-04-02 at 21:12 +0200, Thomas Gleixner wrote:
> >
> > And it is safe enough for the resume case as well, it simply cannot go
> > to sleep because nothing else can hold that mutex in the early resume
> > code. We take the mutex during early boot as well, but there the might
> > sleep checks are disabled. And we should do the very same thing in the
> > early resume code as well. Rafael, PeterZ ???
>
> the $subject text sound like it triggered might_sleep(), and that had a
> system_state != SYSTEM_RUNNING bail condition, but then, I've no clue
> what resume looks like.

Early resume looks pretty much like the system startup, e.g. everything
called from syscore_ops should not be sleepable (although mutexes shouldn't
trigger, because that code is effectively single-threaded, unless somebody
holds the mutex in question when that code is being executed, but that would
deadlock anyway).

Thanks,
Rafael

2011-04-06 08:07:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: BUG: sleeping function called from invalid context at kernel/mutex.c

On Wed, 2011-04-06 at 06:43 +0200, Rafael J. Wysocki wrote:
> > the $subject text sound like it triggered might_sleep(), and that had a
> > system_state != SYSTEM_RUNNING bail condition, but then, I've no clue
> > what resume looks like.
>
> Early resume looks pretty much like the system startup, e.g. everything
> called from syscore_ops should not be sleepable (although mutexes shouldn't
> trigger, because that code is effectively single-threaded, unless somebody
> holds the mutex in question when that code is being executed, but that would
> deadlock anyway).

Right, so system_state != SYSTEM_RUNNING should be true for resume?

2011-04-07 05:23:18

by Xiaotian Feng

[permalink] [raw]
Subject: Re: BUG: sleeping function called from invalid context at kernel/mutex.c

On 04/03/2011 03:12 AM, Thomas Gleixner wrote:
> On Sat, 2 Apr 2011, Xiaotian Feng wrote:
>> I've got following warnings when I'm trying to resume from suspend, so
>> sparse_lock is not safe enough as a mutex for the suspend resume case, can
>> we simply convert sparse_lock back to spinlock?
>
> What the heck? We do not just convert a mutex to a spinlock because a
> warning triggers. And it was a mutex already in 2.6.38 w/o any such
> problems.
>
> And it is safe enough for the resume case as well, it simply cannot go
> to sleep because nothing else can hold that mutex in the early resume
> code. We take the mutex during early boot as well, but there the might
> sleep checks are disabled. And we should do the very same thing in the
> early resume code as well. Rafael, PeterZ ???
>
> But that's a different issue and not the real problem.
>
> So now instead of thinking about spinlock it would have been pretty
> simple to figure out that the warning is triggered due to commit
> d72274e5.
>
> But of course we dont want to revert d72274e5 either, though it's easy
> to figure out from the call chain that the fundamental stupidity is in
> the hpet resume code.
>
> Why the hell does it call arch_setup_hpet_msi()? That interrupt was
> completely setup _BEFORE_ we went into suspend. And I don't see any
> code which tears down the interrupt on suspend. So we run through a
> full setup for nothing. And when we have interrupt remapping enabled
> it even leaks an irte, because it allocates a new one. Utter crap.
> init_one_hpet_msi_clockevent() has the same stupid call.
>
> So no, the mutex stays a mutex and we fix crap where the crap is.
>
> The following completely untested patch should solve that. Actually we
> should be more clever and do the affinity magic in the new function as
> well, so we can get rid of this weird disable/set_affinity/enable
> hack, but that's a different story.
>

Yes, this patch resolves the warnings, thanks.

> Thanks,
>
> tglx
> ---
> arch/x86/include/asm/hpet.h | 2 ++
> arch/x86/kernel/apic/io_apic.c | 11 +++++++++++
> arch/x86/kernel/hpet.c | 5 +++--
> 3 files changed, 16 insertions(+), 2 deletions(-)
>
> Index: linux-2.6/arch/x86/include/asm/hpet.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/hpet.h
> +++ linux-2.6/arch/x86/include/asm/hpet.h
> @@ -83,11 +83,13 @@ extern void hpet_msi_read(struct hpet_de
>
> #ifdef CONFIG_PCI_MSI
> extern int arch_setup_hpet_msi(unsigned int irq, unsigned int id);
> +extern int arch_start_hpet_msi(unsigned int irq, unsigned int id);
> #else
> static inline int arch_setup_hpet_msi(unsigned int irq, unsigned int id)
> {
> return -EINVAL;
> }
> +static inline int arch_start_hpet_msi(unsigned int irq, unsigned int id) { }
> #endif
>
> #ifdef CONFIG_HPET_EMULATE_RTC
> Index: linux-2.6/arch/x86/kernel/apic/io_apic.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/apic/io_apic.c
> +++ linux-2.6/arch/x86/kernel/apic/io_apic.c
> @@ -3477,6 +3477,17 @@ int arch_setup_hpet_msi(unsigned int irq
> irq_set_chip_and_handler_name(irq, chip, handle_edge_irq, "edge");
> return 0;
> }
> +
> +int arch_start_hpet_msi(unsigned int irq, unsigned int id)
> +{
> + struct msi_msg msg;
> + int ret = msi_compose_msg(NULL, irq,&msg, id);
> +
> + if (!ret)
> + hpet_msi_write(irq_get_handler_data(irq),&msg);
> + return ret;
> +}
> +
> #endif
>
> #endif /* CONFIG_PCI_MSI */
> Index: linux-2.6/arch/x86/kernel/hpet.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/hpet.c
> +++ linux-2.6/arch/x86/kernel/hpet.c
> @@ -370,7 +370,8 @@ static void hpet_set_mode(enum clock_eve
> hpet_enable_legacy_int();
> } else {
> struct hpet_dev *hdev = EVT_TO_HPET_DEV(evt);
> - hpet_setup_msi_irq(hdev->irq);
> +
> + arch_start_hpet_msi(hdev->irq, hpet_blockid);
> disable_irq(hdev->irq);
> irq_set_affinity(hdev->irq, cpumask_of(hdev->cpu));
> enable_irq(hdev->irq);
> @@ -555,7 +556,7 @@ static void init_one_hpet_msi_clockevent
> if (!(hdev->flags& HPET_DEV_VALID))
> return;
>
> - if (hpet_setup_msi_irq(hdev->irq))
> + if (arch_start_hpet_msi(hdev->irq, hpet_blockid))
> return;
>
> hdev->cpu = cpu;
>
>
>

2011-04-07 05:53:31

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: BUG: sleeping function called from invalid context at kernel/mutex.c

On Wednesday, April 06, 2011, Peter Zijlstra wrote:
> On Wed, 2011-04-06 at 06:43 +0200, Rafael J. Wysocki wrote:
> > > the $subject text sound like it triggered might_sleep(), and that had a
> > > system_state != SYSTEM_RUNNING bail condition, but then, I've no clue
> > > what resume looks like.
> >
> > Early resume looks pretty much like the system startup, e.g. everything
> > called from syscore_ops should not be sleepable (although mutexes shouldn't
> > trigger, because that code is effectively single-threaded, unless somebody
> > holds the mutex in question when that code is being executed, but that would
> > deadlock anyway).
>
> Right, so system_state != SYSTEM_RUNNING should be true for resume?

It is not, although it probably should be.

First, some time ago there was opposition to adding more different possible
values of system_state and it wasn't clear which of the existing values should
be used instead of SYSTEM_RUNNING during suspend/resume. We ended up sticking
to SYSTEM_RUNNING for that reason long enough for some code to develop the
expectation of system_state == SYSTEM_RUNNING during suspend/resume (IOW,
changing that right now would probably break stuff).

Second, even if we decide to switch from SYSTEM_RUNNING to something else
during suspend (and back during resume), it's not entirely clear what's the
right place to do so.

Thanks,
Rafael