2009-06-20 19:55:18

by Sean MacLennan

[permalink] [raw]
Subject: Badness on the Warp

[ 0.000000] ------------[ cut here ]------------
[ 0.000000] Badness at c033ebc4 [verbose debug info unavailable]
[ 0.000000] NIP: c033ebc4 LR: c033eb94 CTR: 00000000
[ 0.000000] REGS: c037fe70 TRAP: 0700 Not tainted (2.6.30-pika)
[ 0.000000] MSR: 00021000 <ME,CE> CR: 22022024 XER: 00000000
[ 0.000000] TASK = c035b440[0] 'swapper' THREAD: c037e000
[ 0.000000] <6>GPR00: 00000001 c037ff20 c035b440 00000000 0000000c 00000020 3fffffff 00000000
[ 0.000000] <6>GPR08: c036faa4 c0380000 c07b7586 00000000 22002022 00000000 0ffa7f00 007fff99
[ 0.000000] <6>GPR16: 00400450 00800000 007fff00 0ffa7a90 00000000 ffffffff 00000000 c0351d40
[ 0.000000] <6>GPR24: 00000000 00001229 c0386b1c 0000000c 00000020 00000000 3fffffff 0000000c
[ 0.000000] NIP [c033ebc4] alloc_arch_preferred_bootmem+0x48/0x74
[ 0.000000] LR [c033eb94] alloc_arch_preferred_bootmem+0x18/0x74
[ 0.000000] Call Trace:
[ 0.000000] [c037ff20] [c0350fa8] 0xc0350fa8 (unreliable)
[ 0.000000] [c037ff30] [c033ec40] ___alloc_bootmem_nopanic+0x50/0x108
[ 0.000000] [c037ff60] [c033ed10] ___alloc_bootmem+0x18/0x50
[ 0.000000] [c037ff70] [c033b3bc] uic_init_one+0x40/0x22c
[ 0.000000] [c037ff90] [c033b630] uic_init_tree+0x88/0x168
[ 0.000000] [c037ffb0] [c0337378] init_IRQ+0x28/0x40
[ 0.000000] [c037ffc0] [c0334664] start_kernel+0x15c/0x2a0
[ 0.000000] [c037fff0] [c0000200] skpinv+0x190/0x1cc
[ 0.000000] Instruction dump:
[ 0.000000] 2f830000 38600000 409e0018 80010014 83e1000c 7c0803a6 38210010 4e800020
[ 0.000000] 3d20c038 80095408 21600000 7c0b0114 <0f000000> 2f800000 409e0018 38800000


I wouldn't trust the backtrace too much since it seems to keep
changing. It used to show __start_notes as the backtrace. This one is
with the head of Linus' tree as of a few minutes ago.

So I did a git bisect, and the problem goes back to June 11. The Warp
boots and seems to run fine. So I didn't notice until looking at the
logs for a different problem. I have since added a test to my
regression tests to look for badness or oops in the logs.

The git bisect returned:

871fa90791a6f83dd8e2e489feb9534a8c02088d is first bad commit

That is it, no more info.... strange. git show 8714...088d gives:

Merge: 7702667... 79f52b7...
Author: Linus Torvalds <[email protected]>
Date: Thu Jun 11 11:27:09 2009 -0700

Merge branch 'for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6
* 'for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6: jfs:
Add missing mutex_unlock call to error path missing unlock in
jfs_quota_write()

The Warp has jfs disabled, so I don't know why it would affect us,
especially this realy in the boot! Anyway, I tried to revert the patch:

$ git revert 871fa90791a6f83dd8e2e489feb9534a8c02088d
fatal: Commit 871fa90791a6f83dd8e2e489feb9534a8c02088d is a merge but no -m option was given.

So, anybody got an idea what is going on here? Or how I can revert the
patch. I am not a git expert, so if I am doing this wrong, let me know.

Cheers,
Sean


2009-06-20 20:57:09

by Frans Pop

[permalink] [raw]
Subject: Re: Badness on the Warp

> The git bisect returned:
>
> ? 871fa90791a6f83dd8e2e489feb9534a8c02088d is first bad commit
>
> That is it, no more info.... strange. git show 8714...088d gives:
>
> Merge: 7702667... 79f52b7...
> Author: Linus Torvalds <[email protected]>
> Date: ? Thu Jun 11 11:27:09 2009 -0700
>
> The Warp has jfs disabled, so I don't know why it would affect us,
> especially this realy in the boot!

The fact that your bisect ended at a merge essentially means that it is
invalid. As a merge does not introduce any actual change (unless it
includes changes to resolve conflicts), it normally cannot be the cause
of a regression.

So either you made a mistake when marking a commit as good or bad during
the bisect, or the "badness" does not trigger reliably, resulting in you
incorrectly marking a bad commit as good.

My suggestion would be to run the bisection again from the start.

Cheers,
FJP

2009-06-20 23:10:43

by Sean MacLennan

[permalink] [raw]
Subject: Re: Badness on the Warp

On Sat, 20 Jun 2009 22:56:45 +0200
Frans Pop <[email protected]> wrote:

> The fact that your bisect ended at a merge essentially means that it
> is invalid. As a merge does not introduce any actual change (unless
> it includes changes to resolve conflicts), it normally cannot be the
> cause of a regression.

Makes sense I messed up somewhere. This is the commit that causes
problems:

Merge branch 'topic/slab/earlyboot' of git://git./linux/kernel/git/penberg/slab-2.6


* 'topic/slab/earlyboot' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:

vgacon: use slab allocator instead of the bootmem allocator

irq: use kcalloc() instead of the bootmem allocator

sched: use slab in cpupri_init()

sched: use alloc_cpumask_var() instead of alloc_bootmem_cpumask_var()

memcg: don't use bootmem allocator in setup code

irq/cpumask: make memoryless node zero happy

x86: remove some alloc_bootmem_cpumask_var calling

vt: use kzalloc() instead of the bootmem allocator

sched: use kzalloc() instead of the bootmem allocator

init: introduce mm_init()

vmalloc: use kzalloc() instead of alloc_bootmem()

slab: setup allocators earlier in the boot sequence

bootmem: fix slab fallback on numa

bootmem: use slab if bootmem is no longer available

Makes sense since it is very early in the boot I get the badness. Also, there is no mention of JFS in the above logs.

Cheers,
Sean


2009-06-20 23:26:19

by Frans Pop

[permalink] [raw]
Subject: Re: Badness on the Warp

On Sunday 21 June 2009, Sean MacLennan wrote:
> On Sat, 20 Jun 2009 22:56:45 +0200 Frans Pop <[email protected]> wrote:
> > The fact that your bisect ended at a merge essentially means that it
> > is invalid. As a merge does not introduce any actual change (unless
> > it includes changes to resolve conflicts), it normally cannot be the
> > cause of a regression.
>
> Makes sense I messed up somewhere. This is the commit that causes
> problems:
>
> Merge branch 'topic/slab/earlyboot' of
> git://git./linux/kernel/git/penberg/slab-2.6
>
> * 'topic/slab/earlyboot' of
> git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:

Hmmm. That's b640f042faa2, which is another merge...
So you're still not there.

2009-06-20 23:42:59

by Sean MacLennan

[permalink] [raw]
Subject: Re: Badness on the Warp

I found the source of the badness. The backtrace is correct:

uic_init_one
___alloc_bootmem
___alloc_bootmem_nopanic
alloc_arch_preferred_bootmem

In alloc_arch_preferred_bootmem we have:

if (WARN_ON_ONCE(slab_is_available()))
return kzalloc(size, GFP_NOWAIT);

Since the slab is available (it had better be or the call will return
NULL), we get the badness message, then a successful return from
kzalloc.

I believe the author wants something like:

if (slab_is_available())
return kzalloc(size, GFP_NOWAIT);
else
WARN_ON_ONCE(1);

Cheers,
Sean

2009-06-21 04:08:17

by Daniel Barkalow

[permalink] [raw]
Subject: Re: Badness on the Warp

On Sun, 21 Jun 2009, Frans Pop wrote:

> On Sunday 21 June 2009, Sean MacLennan wrote:
> > On Sat, 20 Jun 2009 22:56:45 +0200 Frans Pop <[email protected]> wrote:
> > > The fact that your bisect ended at a merge essentially means that it
> > > is invalid. As a merge does not introduce any actual change (unless
> > > it includes changes to resolve conflicts), it normally cannot be the
> > > cause of a regression.
> >
> > Makes sense I messed up somewhere. This is the commit that causes
> > problems:
> >
> > Merge branch 'topic/slab/earlyboot' of
> > git://git./linux/kernel/git/penberg/slab-2.6
> >
> > * 'topic/slab/earlyboot' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
>
> Hmmm. That's b640f042faa2, which is another merge...
> So you're still not there.

It's not implausible that this is, in fact, the first bad commit. The
early slab topic could easily have been correct for everything in the tree
when it was written, but cause problems with some other patch developed
simultaneously.

It's worth rechecking that both 871fa90 and b8ec757 are good and
b640f042faa2 is bad, since it's also pretty likely that the test isn't
reproducing perfectly.

-Daniel
*This .sig left intentionally blank*

2009-06-21 04:28:43

by Frans Pop

[permalink] [raw]
Subject: Re: Badness on the Warp

On Sunday 21 June 2009, Sean MacLennan wrote:
> I found the source of the badness. The backtrace is correct:
>
> uic_init_one

So that's in arch/powerpc/sysdev/uic.c.

> ___alloc_bootmem
> ___alloc_bootmem_nopanic
> alloc_arch_preferred_bootmem
>
> In alloc_arch_preferred_bootmem we have:
>
> if (WARN_ON_ONCE(slab_is_available()))
> return kzalloc(size, GFP_NOWAIT);
>
> Since the slab is available (it had better be or the call will return
> NULL), we get the badness message, then a successful return from
> kzalloc.
>
> I believe the author wants something like:
>
> if (slab_is_available())
> return kzalloc(size, GFP_NOWAIT);
> else
> WARN_ON_ONCE(1);

Well, I myself have no idea. It could also indicate a bug in the uic code.

But let's CC some people responsible for this code. Pekka recently added
this WARN that triggers in your case; David and Paul look to be the
people most involved in the uic code.

Start of the thread is at http://lkml.org/lkml/2009/6/20/165.

Cheers,
FJP

2009-06-21 07:03:22

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Badness on the Warp

Frans Pop wrote:
>
> The fact that your bisect ended at a merge essentially means that it is
> invalid. As a merge does not introduce any actual change (unless it
> includes changes to resolve conflicts), it normally cannot be the cause
> of a regression.
>

Sort of. You can have a "bum merge" where code conflicts logically,
even if it isn't visible as a conflict in git. Furthermore, of course,
you can have misresolved actual conflicts.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2009-06-21 10:21:06

by Pekka Enberg

[permalink] [raw]
Subject: Re: Badness on the Warp

On Sun, Jun 21, 2009 at 7:28 AM, Frans Pop<[email protected]> wrote:
> On Sunday 21 June 2009, Sean MacLennan wrote:
>> I found the source of the badness. The backtrace is correct:
>>
>> uic_init_one
>
> So that's in arch/powerpc/sysdev/uic.c.
>
>> ___alloc_bootmem
>> ___alloc_bootmem_nopanic
>> alloc_arch_preferred_bootmem
>>
>> In alloc_arch_preferred_bootmem we have:
>>
>> ? ? ? if (WARN_ON_ONCE(slab_is_available()))
>> ? ? ? ? ? ? ? return kzalloc(size, GFP_NOWAIT);
>>
>> Since the slab is available (it had better be or the call will return
>> NULL), we get the badness message, then a successful return from
>> kzalloc.
>>
>> I believe the author wants something like:
>>
>> ? ? ? if (slab_is_available())
>> ? ? ? ? ? ? ? return kzalloc(size, GFP_NOWAIT);
>> ? ? ? else
>> ? ? ? ? ? ? ? WARN_ON_ONCE(1);
>
> Well, I myself have no idea. It could also indicate a bug in the uic code.
>
> But let's CC some people responsible for this code. Pekka recently added
> this WARN that triggers in your case; David and Paul look to be the
> people most involved in the uic code.
>
> Start of the thread is at http://lkml.org/lkml/2009/6/20/165.

The WARN_ON() is there to let us know that someone is doing a bootmem
allocation but the slab allocator is already up. So the proper fix
here is to use kmalloc() directly in the call-site that triggers this
WARN_ON. I'm cc'ing Ben as he has been taking care of the fall-out
from my patches on ppc.

Pekka

2009-06-21 22:25:45

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Badness on the Warp

On Sun, 2009-06-21 at 13:20 +0300, Pekka Enberg wrote:
> The WARN_ON() is there to let us know that someone is doing a bootmem
> allocation but the slab allocator is already up. So the proper fix
> here is to use kmalloc() directly in the call-site that triggers this
> WARN_ON. I'm cc'ing Ben as he has been taking care of the fall-out
> from my patches on ppc.
>
Right, our interrupt controllers need those fixes, they are low
on my priority list since it's a reasonably harmless warning and I'm
still chasing some actual breakage (though maybe not directly related to
your patches).

Kumar already submitted a couple, Frans, feel free to beat me
at converting UIC (just use kmalloc directly in there instead
of alloc_bootmem).

Cheers,
Ben.

2009-06-21 22:37:09

by Sean MacLennan

[permalink] [raw]
Subject: Re: Badness on the Warp

On Mon, 22 Jun 2009 08:25:04 +1000
Benjamin Herrenschmidt <[email protected]> wrote:

> Right, our interrupt controllers need those fixes, they are low
> on my priority list since it's a reasonably harmless warning and I'm
> still chasing some actual breakage (though maybe not directly related
> to your patches).
>
> Kumar already submitted a couple, Frans, feel free to beat me
> at converting UIC (just use kmalloc directly in there instead
> of alloc_bootmem).

I can make the changes to UIC if you want. They badness is harmless (it
ends up calling kzalloc anyway), but hard to explain to our PV (Product
Verification) department that they can ignore what looks like a crash ;)

Cheers,
Sean

2009-06-21 22:57:42

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Badness on the Warp

On Sun, 2009-06-21 at 18:36 -0400, Sean MacLennan wrote:
> On Mon, 22 Jun 2009 08:25:04 +1000
> Benjamin Herrenschmidt <[email protected]> wrote:
>
> > Right, our interrupt controllers need those fixes, they are low
> > on my priority list since it's a reasonably harmless warning and I'm
> > still chasing some actual breakage (though maybe not directly related
> > to your patches).
> >
> > Kumar already submitted a couple, Frans, feel free to beat me
> > at converting UIC (just use kmalloc directly in there instead
> > of alloc_bootmem).
>
> I can make the changes to UIC if you want. They badness is harmless (it
> ends up calling kzalloc anyway), but hard to explain to our PV (Product
> Verification) department that they can ignore what looks like a crash ;)

Heh, right, but hopefully we should have that fixed before .31 is
out :-)

Cheers,
Ben.

2009-06-22 04:48:43

by Pekka Enberg

[permalink] [raw]
Subject: Re: Badness on the Warp

Hi Sean,

On Mon, Jun 22, 2009 at 1:36 AM, Sean MacLennan<[email protected]> wrote:
> On Mon, 22 Jun 2009 08:25:04 +1000
> Benjamin Herrenschmidt <[email protected]> wrote:
>
>> Right, our interrupt controllers need those fixes, they are low
>> on my priority list since it's a reasonably harmless warning and I'm
>> still chasing some actual breakage (though maybe not directly related
>> to your patches).
>>
>> Kumar already submitted a couple, Frans, feel free to beat me
>> at converting UIC (just use kmalloc directly in there instead
>> of alloc_bootmem).
>
> I can make the changes to UIC if you want. They badness is harmless (it
> ends up calling kzalloc anyway), but hard to explain to our PV (Product
> Verification) department that they can ignore what looks like a crash ;)

Right. We can also wrap the WARN_ON() in CONFIG_DEBUG_BOOTMEM or
something too but I'd prefer a proper fix here.

Pekka

2009-06-23 05:24:50

by Sean MacLennan

[permalink] [raw]
Subject: Re: Badness on the Warp

On Mon, 22 Jun 2009 08:25:04 +1000
Benjamin Herrenschmidt <[email protected]> wrote:

> Kumar already submitted a couple, Frans, feel free to beat me
> at converting UIC (just use kmalloc directly in there instead
> of alloc_bootmem).

I replace the bootmem_alloc with a kzalloc and the badness went away.
So it looks like, for my config anyway, that solves the problem.

Cheers,
Sean