2006-09-27 12:24:08

by Rolf Eike Beer

Subject: [BUG] Oops on boot (probably ACPI related)

I get this on my machine. SMP kernel, linus git from this morning. .config and
test available on request.

Eike

BUG: unable to handle kernel paging request at virtual address f0003504
printing eip:
c102d804
*pde = 00000000
Oops: 0000 [#1]
SMP
Modules linked in:
CPU: 0
EIP: 0060:[<c102d804>] Not tainted VLI
EFLAGS: 00010086 (2.6.18 #3)
EIP is at mark_lock+0x24/0x34c
eax: f00034ec ebx: c126a674 ecx: 00000001 edx: 00000001
esi: c126a140 edi: 00000000 ebp: c1380e88 esp: c1380e78
ds: 007b es: 007b ss: 0068
Process swapper (pid: 0, ti=c1380000 task=c126a140 task.ti=c1380000)
Stack: 00000001 f00034ec 00000018 0000ffff c1380ec4 c102e5ee c11ee67a 00000000
00000000 00000018 c126a140 c126a674 00000000 c126a674 00000000 00000001
00000046 00000018 0000ffff c1380ee4 c102f03a 00000000 00000002 00000001
Call Trace:
[<c102e5ee>] __lock_acquire+0x45e/0x967
[<c102f03a>] lock_acquire+0x4b/0x6d
[<c11f15ef>] _spin_lock_irqsave+0x22/0x32
[<c11ee67a>] __down_trylock+0x12/0x48
[<c11f0e76>] __down_failed_trylock+0xa/0x10
DWARF2 unwinder stuck at __down_failed_trylock+0xa/0x10

Leftover inexact backtrace:

[<c10e9362>] acpi_os_wait_semaphore+0x38/0xd7
[<c10ff1e6>] acpi_ut_acquire_mutex+0x39/0x77
[<c10f768e>] acpi_ns_get_node+0x42/0x84
[<c10f64e1>] acpi_ns_root_initialize+0x276/0x2ad
[<c10fdd23>] acpi_initialize_subsystem+0x38/0x5d
[<c1397f64>] acpi_early_init+0x4e/0x108
[<c1384747>] start_kernel+0x376/0x383
[<00000000>] 0x0
=======================
Code: 8d 65 f8 5b 5e 5d c3 55 89 e5 57 56 53 83 ec 04 89 c6 89 d3 89 cf c7 45
f0 01 00 00 00 d3 65 f0 8b 42 08 ba 01 00 00 00 8b 4d f0 <85> 48 18 0f 85 15
03 00 00 f0 fe 0d 9c f8 26 c1 79 0d f3 90 80
EIP: [<c102d804>] mark_lock+0x24/0x34c SS:ESP 0068:c1380e78
<0>Kernel panic - not syncing: Attempted to kill the idle task!
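
One way to read the trace above, offered as a sketch: the bytes at the marked position in the Code: line, `<85> 48 18`, decode on i386 as `test %ecx,0x18(%eax)`. With eax = f00034ec, the operand address is f00034ec + 0x18 = f0003504, which is the faulting address in the first line of the oops. So mark_lock appears to have dereferenced a bogus pointer held in eax. The helper name below is made up for illustration and is not kernel code.

```c
#include <stdint.h>

/*
 * Effective address of a "disp8(%reg)" operand on 32-bit x86:
 * the base register plus the sign-extended 8-bit displacement,
 * wrapping at 32 bits. (Hypothetical helper, for illustration only.)
 */
static uint32_t effective_address(uint32_t base, int8_t disp8)
{
    return base + (uint32_t)(int32_t)disp8;
}
```

Calling effective_address(0xf00034ec, 0x18) reproduces the address from the "unable to handle kernel paging request" line.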



2006-09-27 18:05:34

by Markus Dahms

Subject: Re: [BUG] Oops on boot (probably ACPI related)

On Wed, 27 Sep 2006 14:24:47 +0200, Rolf Eike Beer wrote:

> I get this on my machine. SMP kernel, linus git from this morning. .config
> and test available on request.

I encountered a similar bug, but a lot earlier. It seems to be a locking
problem, as it is lockdep which does the BUG() for me.
It's also an SMP machine in my case, and acpi_os_wait_semaphore() is in the
call chain, too. There is no textual output (no serial connection attached, too
early for netconsole), but here is a screenshot:

http://automagically.de/images/linux-2.6.18+-acpi-lockup.jpg (154kB)

2.6.18 works for me, newer git versions explode.

Maybe it's an SMP-related problem, but it does BUG() before initialization
of the second CPU.

Markus


2006-09-27 18:40:43

by Kyle McMartin

Subject: Re: [BUG] Oops on boot (probably ACPI related)

On Wed, Sep 27, 2006 at 07:56:13PM +0200, Markus Dahms wrote:
> > I get this on my machine. SMP kernel, linus git from this morning. .config
> > and test available on request.
>

I saw this as well.

Reverting,
> i386: Remove lock section support in semaphore.h

Fixes it for me (and apparently akpm too from Message-Id:
<[email protected]>)

Linus, please revert 01215ad8d83e18321d99e9b5750a6f21cac243a2 for now...

Cheers,
Kyle McMartin

2006-09-27 19:38:12

by Andi Kleen

Subject: Re: [BUG] Oops on boot (probably ACPI related)

Rolf Eike Beer <[email protected]> writes:

> I get this on my machine. SMP kernel, linus git from this morning. .config and
> test available on request.

What gcc do you use?

Anyways, does this patch fix it? This might have been Andrew's vaio problem too.

-Andi

i386: Use early clobbers for semaphores now

The new code clobbers the result early, so make sure to tell
gcc not to put it into the same register as an input argument.

Signed-off-by: Andi Kleen <[email protected]>

Index: linux/include/asm-i386/semaphore.h
===================================================================
--- linux.orig/include/asm-i386/semaphore.h
+++ linux/include/asm-i386/semaphore.h
@@ -126,7 +126,7 @@ static inline int down_interruptible(str
"lea %1,%%eax\n\t"
"call __down_failed_interruptible\n"
"2:"
- :"=a" (result), "+m" (sem->count)
+ :"=&a" (result), "+m" (sem->count)
:
:"memory");
return result;
@@ -148,7 +148,7 @@ static inline int down_trylock(struct se
"lea %1,%%eax\n\t"
"call __down_failed_trylock\n\t"
"2:\n"
- :"=a" (result), "+m" (sem->count)
+ :"=&a" (result), "+m" (sem->count)
:
:"memory");
return result;
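
A minimal sketch of what the `&` changes, assuming GCC-style inline asm on x86. The function name is hypothetical and the asm is a simplified stand-in, not the kernel's down_trylock(). The template writes %0 before it touches the memory operand %1. With plain "=a", gcc may assume all inputs are consumed before any output is written, so it is free to form the address of %1 in %eax, and the xorl would then destroy that address before the decl uses it. The early-clobber form "=&a" forbids that overlap.

```c
/* Returns 1 if decrementing *count made it negative, else 0. */
static inline int dec_went_negative(int *count)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    int result;

    __asm__ __volatile__(
        "xorl %0,%0\n\t"    /* output register is written first ... */
        "decl %1\n\t"       /* ... the memory operand is used afterwards */
        "jns 1f\n\t"        /* still non-negative: keep result == 0 */
        "movl $1,%0\n"
        "1:"
        : "=&a" (result), "+m" (*count) /* '&' keeps %eax out of %1's address */
        :
        : "memory", "cc");
    return result;
#else
    /* Portable fallback for other compilers and architectures. */
    return --(*count) < 0;
#endif
}
```

Whether a given compiler actually picks the overlapping register without the `&` depends on its register allocation, which would explain why the bug showed up only for some people.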

2006-09-27 19:39:01

by Andi Kleen

Subject: Re: [BUG] Oops on boot (probably ACPI related)

Kyle McMartin <[email protected]> writes:

> On Wed, Sep 27, 2006 at 07:56:13PM +0200, Markus Dahms wrote:
> > > I get this on my machine. SMP kernel, linus git from this morning. .config
> > > and test available on request.
> >
>
> I saw this as well.
>
> Reverting,
> > i386: Remove lock section support in semaphore.h
>
> Fixes it for me (and apparently akpm too from Message-Id:
> <[email protected]>)
>
> Linus, please revert 01215ad8d83e18321d99e9b5750a6f21cac243a2 for now...

I expect this patch to fix it.

-Andi


i386: Use early clobbers for semaphores now

The new code clobbers the result early, so make sure to tell
gcc not to put it into the same register as an input argument.

Signed-off-by: Andi Kleen <[email protected]>

Index: linux/include/asm-i386/semaphore.h
===================================================================
--- linux.orig/include/asm-i386/semaphore.h
+++ linux/include/asm-i386/semaphore.h
@@ -126,7 +126,7 @@ static inline int down_interruptible(str
"lea %1,%%eax\n\t"
"call __down_failed_interruptible\n"
"2:"
- :"=a" (result), "+m" (sem->count)
+ :"=&a" (result), "+m" (sem->count)
:
:"memory");
return result;
@@ -148,7 +148,7 @@ static inline int down_trylock(struct se
"lea %1,%%eax\n\t"
"call __down_failed_trylock\n\t"
"2:\n"
- :"=a" (result), "+m" (sem->count)
+ :"=&a" (result), "+m" (sem->count)
:
:"memory");
return result;

2006-09-27 20:24:39

by Linus Torvalds

Subject: Re: [BUG] Oops on boot (probably ACPI related)



On Wed, 27 Sep 2006, Andi Kleen wrote:
>
> I expect this patch to fix it.

Andrew, Kyle, can you verify?

Linus

2006-09-27 20:35:56

by Linus Torvalds

Subject: Re: [BUG] Oops on boot (probably ACPI related)



On Wed, 27 Sep 2006, Linus Torvalds wrote:
>
> On Wed, 27 Sep 2006, Andi Kleen wrote:
> >
> > I expect this patch to fix it.
>
> Andrew, Kyle, can you verify?

Not that it really matters. Andi sure as hell pinpointed a real problem
with the new and broken inline asm. That's almost certainly the bug that
crept in during the recent rewrite.

HOWEVER, now that I look more closely at the rewrite, I'm really wondering
whether the rewrite was worth it at all. It generates smaller code, but at
the expense of

- the actual cache-footprint is bigger
- the branch will now be mis-predicted by default

Since the "smaller code" really only tends to matter from a cache
usage standpoint, I don't know if I'm at all convinced.

The fact that rewinders have problems is fairly immaterial. Maybe we
should just take this as a hint that all the stupid rewinding code was
wrong in the first place, and we should stop doing that? We can go back
to just printing out our stacktrace guesses, that has worked for us for a
long time, and the stack unwinding simply looks _fundamentally_ flawed.

So I have a real urge to just revert that change anyway.

Are there any _real_ advantages to this broken unwinding code that has had
more bugs than Windows XP?

Linus

2006-09-27 20:50:30

by Andi Kleen

Subject: Re: [BUG] Oops on boot (probably ACPI related)

On Wednesday 27 September 2006 22:35, Linus Torvalds wrote:
>
> On Wed, 27 Sep 2006, Linus Torvalds wrote:
> >
> > On Wed, 27 Sep 2006, Andi Kleen wrote:
> > >
> > > I expect this patch to fix it.
> >
> > Andrew, Kyle, can you verify?
>
> Not that it really matters. Andi sure as hell pinpointed a real problem
> with the new and broken inline asm. That's almost certainly the bug that
> crept in during the recent rewrite.
>
> HOWEVER, now that I look more closely at the rewrite, I'm really wondering
> whether the rewrite was worth it at all. It generates smaller code, but at
> the expense of
>
> - the actual cache-footprint is bigger
> - the branch will now be mis-predicted by default

It doesn't matter much because these days this stuff is all out of line
anyway and in a single function. And the dynamic branch predictor
in all modern CPUs will usually cache the decision (unlocked) there.

(Actually there is something dumb left: on a non-preempt kernel, the
spin_unlock call site is larger than doing it inline. But that is left
for fixing later.)
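
As a rough illustration of the fast-path/slow-path split under discussion, here is a toy sketch in userspace C. All names are hypothetical, C11 atomics stand in for the kernel's primitives, and the slow path is deliberately trivial (it only undoes the decrement). The uncontended case stays a short inline sequence, while the contended case is a call to a single out-of-line copy marked cold.

```c
#include <stdatomic.h>

#if defined(__GNUC__)
#define TOY_SLOW_PATH __attribute__((noinline, cold))
#define toy_unlikely(x) __builtin_expect(!!(x), 0)
#else
#define TOY_SLOW_PATH
#define toy_unlikely(x) (x)
#endif

struct toy_sem {
    atomic_int count;   /* > 0 means free */
};

static int toy_slow_calls;  /* counts entries into the slow path */

/* Out-of-line slow path: exactly one copy exists in the binary. */
static TOY_SLOW_PATH int toy_down_trylock_slow(struct toy_sem *sem)
{
    toy_slow_calls++;
    atomic_fetch_add(&sem->count, 1);   /* undo the failed decrement */
    return 1;                           /* not acquired */
}

/* Inline fast path: returns 0 if acquired, 1 if not. */
static inline int toy_down_trylock(struct toy_sem *sem)
{
    /* atomic_fetch_sub returns the old value; old - 1 is the new count. */
    if (toy_unlikely(atomic_fetch_sub(&sem->count, 1) - 1 < 0))
        return toy_down_trylock_slow(sem);
    return 0;
}
```

In this layout every inlined caller carries its own copy of the conditional branch, which is the case the misprediction concern applies to. Moving the whole function out of line would leave a single branch for the predictor to learn.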

> The fact that rewinders have problems is fairly immaterial. Maybe we
> should just take this as a hint that all the stupid rewinding code was
> wrong in the first place, and we should stop doing that? We can go back
> to just printing out our stacktrace guesses, that has worked for us for a
> long time, and the stack unwinding simply looks _fundamentally_ flawed.

Unfortunately Linux is a lot more complex than it was in the early days.

> So I have a real urge to just revert that change anyway.
>
> Are there any _real_ advantages to this broken unwinding code that has had
> more bugs than Windows XP?

I thought for a long time we didn't need it either, but these days, with all
the callbacks in some parts of the kernel (driver model, others), you get an
oops with 60+ entries and it is just too much trouble to figure it out manually.

I admit when I took the code I didn't realize that dwarf2 has these
problems (not supporting out of line sections is clearly a spec
bug and would even hit gcc generated code). But we don't have
that many out of line sections anyways, so it's not that big an issue.

And all the people who process a lot of oopses (e.g. Andrew, Ingo, others) tend
to use frame pointers by default anyways. They already voted with their feet.
And the unwinder certainly gives better code than frame pointers. The mispredicted
branches you're worrying about are nothing against frame pointers
(e.g. on K8 FP tends to stall the CPU on each function call slightly)

Anyway, in theory it would be possible to keep the out of line sections
and define our own dwarf2 extension that allows us to express them.
Jan might have some thoughts on it. But I didn't think it was worth
it for these cases, for the reasons above.

-Andi

2006-09-27 20:58:10

by Kyle McMartin

Subject: Re: [BUG] Oops on boot (probably ACPI related)

On Wed, Sep 27, 2006 at 01:21:17PM -0700, Linus Torvalds wrote:
> On Wed, 27 Sep 2006, Andi Kleen wrote:
> > I expect this patch to fix it.
>
> Andrew, Kyle, can you verify?
>

Yup, it works. (For reference, it's gcc 4.1.1-13 from Debian.)

Cheers,
Kyle M.

2006-09-27 21:38:49

by Linus Torvalds

Subject: Re: [BUG] Oops on boot (probably ACPI related)



On Wed, 27 Sep 2006, Andi Kleen wrote:
>
> It doesn't matter much because these days this stuff is all out of line
> anyway and in a single function. And the dynamic branch predictor
> in all modern CPUs will usually cache the decision (unlocked) there.

Ahh, good point. Once there's only one copy, the branch predictor will get
it right (and the code size won't much matter)

> > Are there any _real_ advantages to this broken unwinding code that has had
> > more bugs than Windows XP?
>
> I thought for a long time we didn't need it either, but these days, with all
> the callbacks in some parts of the kernel (driver model, others), you get an
> oops with 60+ entries and it is just too much trouble to figure it out manually.

Ok, fair enough. I'll apply your fix (which in itself is obviously
correct).

I just wanted to bring up the possibility that we should just remove the
(fragile) unwinder.

But let's leave it for another day, if it keeps being problematic.

Linus

2006-09-27 22:32:22

by Andrew Morton

Subject: Re: [BUG] Oops on boot (probably ACPI related)

On Wed, 27 Sep 2006 16:58:05 -0400
Kyle McMartin <[email protected]> wrote:

> On Wed, Sep 27, 2006 at 01:21:17PM -0700, Linus Torvalds wrote:
> > On Wed, 27 Sep 2006, Andi Kleen wrote:
> > > I expect this patch to fix it.
> >
> > Andrew, Kyle, can you verify?
> >
>
> Yup, it works.

Ditto.

2006-09-28 07:04:16

by Rolf Eike Beer

Subject: Re: [BUG] Oops on boot (probably ACPI related)

On Wednesday, 27 September 2006 21:38, Andi Kleen wrote:
> Rolf Eike Beer <[email protected]> writes:
> > I get this on my machine. SMP kernel, linus git from this morning.
> > .config and test available on request.
>
> What gcc do you use?

4.1.0 (SuSE 10.1)

> Anyways, does this patch fix it? This might have been Andrew's vaio problem
> too.

Looks good. Now it hangs because the init script seems to have problems
activating the root volume group, but that's a different story.

Eike



2006-09-28 07:49:13

by Andi Kleen

Subject: Re: [BUG] Oops on boot (probably ACPI related)

On Wednesday 27 September 2006 23:38, Linus Torvalds wrote:
>
> On Wed, 27 Sep 2006, Andi Kleen wrote:
> >
> > It doesn't matter much because these days this stuff is all out of line
> > anyway and in a single function. And the dynamic branch predictor
> > in all modern CPUs will usually cache the decision (unlocked) there.
>
> Ahh, good point. Once there's only one copy, the branch predictor will get
> it right (and the code size won't much matter)

As a postscript, I (unintentionally) bent the truth on that one yesterday.
Sorry for that. Semaphores are still inline, unlike spinlocks.

However if the spinlocks are out of line I see no reason to keep semaphores
inline either, so perhaps it would be better to just move them. Then my
argument above would actually work :)

For some reason the unwinder also still seems to get stuck on it :/

-Andi