LinuxLists.cc - Re: mm snapshot broken-out-2006-08-08-00-59.tar.gz uploaded

2006-08-08 20:29:06

Subject: Re: mm snapshot broken-out-2006-08-08-00-59.tar.gz uploaded

Hi Andrew,

On 08/08/06, [email protected] <[email protected]> wrote:
> The mm snapshot broken-out-2006-08-08-00-59.tar.gz has been uploaded to
>
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/mm/broken-out-2006-08-08-00-59.tar.gz
>
> It contains the following patches against 2.6.18-rc4:

It appears very early. 2.6.18-rc3-mm2 was fine.

DWARF2 unwinder stuck at error_code+0x39/0x40
Leftover inexact backtrace
[<c0104194>] show_stack_log_lvl+0x8c/0x97
[<c0104320>] show_registers+0x181/0x215
[<c0104576>] die+0x1c2/0x2dd
[<c0117419>] do_page_fault+0x410/0x4f3
[<c02f5271>] error_code+0x39/0x40
[<c0104194>] show_stack_log_lvl+0x8c/0x97
[<c0104320>] show_registers+0x181/0x215
[<c0104576>] die+0x1c2/0x2dd
[<c0117419>] do_page_fault+0x410/0x4f3
[<c02f5271>] error_code+0x39/0x40
[<c047b609>] start_kernel+0x224/0x3a2
[<c0100210>] 0xc0100210
Code: 00 39 .......
EIP:[<c01040ca>] show_trace_log_lvl+0x11b/0x159 SS:ESP 0068:c0479e74
<0> Kernel panic - not syncing: Attempted to kill idle task!

http://www.stardust.webpages.pl/files/mm/2.6.18-rc4-mm1/mm-config

Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group
(http://www.stardust.webpages.pl/ltg/wiki/)

2006-08-08 21:05:28

by Andrew Morton

Subject: Re: mm snapshot broken-out-2006-08-08-00-59.tar.gz uploaded

On Tue, 8 Aug 2006 22:29:03 +0200
"Michal Piotrowski" <[email protected]> wrote:

> Hi Andrew,
>
> On 08/08/06, [email protected] <[email protected]> wrote:
> > The mm snapshot broken-out-2006-08-08-00-59.tar.gz has been uploaded to
> >
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/mm/broken-out-2006-08-08-00-59.tar.gz
> >
> > It contains the following patches against 2.6.18-rc4:
>
> It appears very early. 2.6.18-rc3-mm2 was fine.
>
> DWARF2 unwinder stuck at error_code+0x39/0x40

The novelty of this thing has worn off. Guys, please let's not release 2.6.18 in
this state.

> Leftover inexact backtrace
> [<c0104194>] show_stack_log_lvl+0x8c/0x97
> [<c0104320>] show_registers+0x181/0x215
> [<c0104576>] die+0x1c2/0x2dd
> [<c0117419>] do_page_fault+0x410/0x4f3
> [<c02f5271>] error_code+0x39/0x40
> [<c0104194>] show_stack_log_lvl+0x8c/0x97
> [<c0104320>] show_registers+0x181/0x215
> [<c0104576>] die+0x1c2/0x2dd
> [<c0117419>] do_page_fault+0x410/0x4f3
> [<c02f5271>] error_code+0x39/0x40
> [<c047b609>] start_kernel+0x224/0x3a2
> [<c0100210>] 0xc0100210
> Code: 00 39 .......
> EIP:[<c01040ca>] show_trace_log_lvl+0x11b/0x159 SS:ESP 0068:c0479e74
> <0> Kernel panic - not syncing: Attempted to kill idle task!
>
> http://www.stardust.webpages.pl/files/mm/2.6.18-rc4-mm1/mm-config
>

So I guess the dwarf unwinder oopsed and wrecked our oops. Perhaps you'll
get better info with CONFIG_UNWIND_INFO=n, CONFIG_STACK_UNWIND=n.

Now, _perhaps_ it oopsed at "[<c047b609>] start_kernel+0x224/0x3a2". You
can look these things up in gdb or using addr2line, provided you have
CONFIG_DEBUG_INFO=y.
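
Each backtrace line has the shape "[<address>] symbol+0xoffset/0xsize": the offset is the distance from the start of the function and the size is the length of that function. The following is a minimal user-space sketch of that reading, not kernel code; the names bt_entry, parse_bt_entry and bt_function_start are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdio.h>

/* One line of an i386 oops backtrace: "[<addr>] symbol+0xoffset/0xsize". */
struct bt_entry {
	unsigned long addr;   /* address printed in the trace */
	char symbol[64];      /* function containing that address */
	unsigned long offset; /* addr minus the start of the function */
	unsigned long size;   /* length of the function in bytes */
};

/* Returns 1 if the line has the symbol+offset/size shape, else 0. */
static int parse_bt_entry(const char *line, struct bt_entry *e)
{
	assert(line != NULL && e != NULL);
	return sscanf(line, " [<%lx>] %63[^+]+0x%lx/0x%lx",
		      &e->addr, e->symbol, &e->offset, &e->size) == 4;
}

/* Start address of the enclosing function, or 0 if the line is unparsable. */
static unsigned long bt_function_start(const char *line)
{
	struct bt_entry e;

	if (!parse_bt_entry(line, &e))
		return 0;
	return e.addr - e.offset;
}
```

For the entry "[<c047b609>] start_kernel+0x224/0x3a2" as posted, the sketch yields a function start of 0xc047b3e5. An entry without a symbol, such as "[<c0100210>] 0xc0100210", does not match the pattern and yields 0.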

2006-08-08 21:19:12

by Michal Piotrowski

Subject: Re: mm snapshot broken-out-2006-08-08-00-59.tar.gz uploaded

On 08/08/06, Andrew Morton <[email protected]> wrote:
> On Tue, 8 Aug 2006 22:29:03 +0200
> "Michal Piotrowski" <[email protected]> wrote:
>
> > Hi Andrew,
> >
> > On 08/08/06, [email protected] <[email protected]> wrote:
> > > The mm snapshot broken-out-2006-08-08-00-59.tar.gz has been uploaded to
> > >
> > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/mm/broken-out-2006-08-08-00-59.tar.gz
> > >
> > > It contains the following patches against 2.6.18-rc4:
> >
> > It appears very early. 2.6.18-rc3-mm2 was fine.
> >
> > DWARF2 unwinder stuck at error_code+0x39/0x40
>
> The novelty of this thing has worn off. Guys, please let's not release 2.6.18 in
> this state.
>
> > Leftover inexact backtrace
> > [<c0104194>] show_stack_log_lvl+0x8c/0x97
> > [<c0104320>] show_registers+0x181/0x215
> > [<c0104576>] die+0x1c2/0x2dd
> > [<c0117419>] do_page_fault+0x410/0x4f3
> > [<c02f5271>] error_code+0x39/0x40
> > [<c0104194>] show_stack_log_lvl+0x8c/0x97
> > [<c0104320>] show_registers+0x181/0x215
> > [<c0104576>] die+0x1c2/0x2dd
> > [<c0117419>] do_page_fault+0x410/0x4f3
> > [<c02f5271>] error_code+0x39/0x40
> > [<c047b609>] start_kernel+0x224/0x3a2
> > [<c0100210>] 0xc0100210
> > Code: 00 39 .......
> > EIP:[<c01040ca>] show_trace_log_lvl+0x11b/0x159 SS:ESP 0068:c0479e74
> > <0> Kernel panic - not syncing: Attempted to kill idle task!
> >
> > http://www.stardust.webpages.pl/files/mm/2.6.18-rc4-mm1/mm-config
> >
>
> So I guess the dwarf unwinder oopsed and wrecked our oops. Perhaps you'll
> get better info with CONFIG_UNWIND_INFO=n, CONFIG_STACK_UNWIND=n.
>
> Now, _perhaps_ it oopsed at "[<c047b609>] start_kernel+0x224/0x3a2".

eghm... typo.
[<c047d609>]

> You
> can look these things up in gdb or using addr2line, provided you have
> CONFIG_DEBUG_INFO=y.
>
>

(gdb) list *0xc047d609
0xc047d609 is in start_kernel (/usr/src/linux-work1/init/main.c:577).
572 cpuset_init_early();
573 mem_init();
574 kmem_cache_init();
575 setup_per_cpu_pageset();
576 numa_policy_init();
577 if (late_time_init)
578 late_time_init();
579 calibrate_delay();
580 pidmap_init();
581 pgtable_cache_init();

Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group
(http://www.stardust.webpages.pl/ltg/wiki/)

2006-08-08 21:37:59

by Andrew Morton

Subject: Re: mm snapshot broken-out-2006-08-08-00-59.tar.gz uploaded

On Tue, 8 Aug 2006 23:19:09 +0200
"Michal Piotrowski" <[email protected]> wrote:

> > You
> > can look these things up in gdb or using addr2line, provided you have
> > CONFIG_DEBUG_INFO=y.
> >
> >
>
> (gdb) list *0xc047d609
> 0xc047d609 is in start_kernel (/usr/src/linux-work1/init/main.c:577).
> 572 cpuset_init_early();
> 573 mem_init();
> 574 kmem_cache_init();
> 575 setup_per_cpu_pageset();
> 576 numa_policy_init();
> 577 if (late_time_init)
> 578 late_time_init();
> 579 calibrate_delay();
> 580 pidmap_init();
> 581 pgtable_cache_init();

hm.

- Try to get the full oops record, find out what the faulting address is
("unable to handle kernel paging request at virtual address xxxx") and
see if that lines up with any symbol in .vmlinux.

- Might be something bad in numa_policy_init(). I assume you don't have
CONFIG_NUMA=y ;)

This'll be hard to diagnose without a full oops trace.

2006-08-08 22:11:40

by Michal Piotrowski

Subject: Re: mm snapshot broken-out-2006-08-08-00-59.tar.gz uploaded

On 08/08/06, Andrew Morton <[email protected]> wrote:
> On Tue, 8 Aug 2006 23:19:09 +0200
> "Michal Piotrowski" <[email protected]> wrote:
>
> > > You
> > > can look these things up in gdb or using addr2line, provided you have
> > > CONFIG_DEBUG_INFO=y.
> > >
> > >
> >
> > (gdb) list *0xc047d609
> > 0xc047d609 is in start_kernel (/usr/src/linux-work1/init/main.c:577).
> > 572 cpuset_init_early();
> > 573 mem_init();
> > 574 kmem_cache_init();
> > 575 setup_per_cpu_pageset();
> > 576 numa_policy_init();
> > 577 if (late_time_init)
> > 578 late_time_init();
> > 579 calibrate_delay();
> > 580 pidmap_init();
> > 581 pgtable_cache_init();
>
> hm.
>
> - Try to get the full oops record,

BUG: unable to handle kernel paging request at virtual address 01020304
printing eip:
c041b95c
*pde= 00000000
Oops: 0000 [#1]
4K_STACK PREEMPT SMP
last sysfs file:
Modules linked in:
CPU 0
EIP: 0060: [<c041b95c>] Not tainted VLI
EFLAGS: 00010202
EIP is at kmem_cache_init+0x389/0x3f0
[..]
Call Trace:
[<c0104063>] show_stack_log_lvl+0x8c/0x97
[<c010422b>] show_registers+0x181/0x215
[<c0104481>] die+0x1c2/0x2dd
[<c0117419>] do_page_fault+0x410/0x4f3
[<c02f40a1>] error_code+0x39/0x40
[<c040b604>] start_kernel+0x21f/0x39d
[<c0100210>] 0xc0100210
[..]
EIP: [<c041b95c>] kmem_cache_init+0x389/0x3f0 SS:ESP 0068:c0409fc4
<0> Kernel panic - not syncing: Attempted to kill idle task!

(gdb) list *0xc041b95c
0xc041b95c is in kmem_cache_init (/usr/src/linux-work1/mm/slab.c:714).
709 lockdep_set_class(&l3->list_lock, &on_slab_l3_key);
710 alc = l3->alien;
711 if (!alc)
712 continue;
713 for_each_node(r) {
714 if (alc[r])
715 lockdep_set_class(&alc[r]->lock,
716 &on_slab_alc_key);
717 }
718 }

> find out what the faulting address is
> ("unable to handle kernel paging request at virtual address xxxx") and
> see if that lines up with any symbol in .vmlinux.

Did you mean "list 01020304"?

>
> - Might be something bad in numa_policy_init(). I assume you don't have
> CONFIG_NUMA=y ;)

No, I don't.

>
>
> This'll be hard to diagnose without a full oops trace.
>

Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group
(http://www.stardust.webpages.pl/ltg/wiki/)

2006-08-08 23:24:43

by Michal Piotrowski

Subject: Re: mm snapshot broken-out-2006-08-08-00-59.tar.gz uploaded

On 09/08/06, Michal Piotrowski <[email protected]> wrote:
> On 08/08/06, Andrew Morton <[email protected]> wrote:
> > On Tue, 8 Aug 2006 23:19:09 +0200
> > "Michal Piotrowski" <[email protected]> wrote:
> >
> > > > You
> > > > can look these things up in gdb or using addr2line, provided you have
> > > > CONFIG_DEBUG_INFO=y.
> > > >
> > > >
> > >
> > > (gdb) list *0xc047d609
> > > 0xc047d609 is in start_kernel (/usr/src/linux-work1/init/main.c:577).
> > > 572 cpuset_init_early();
> > > 573 mem_init();
> > > 574 kmem_cache_init();
> > > 575 setup_per_cpu_pageset();
> > > 576 numa_policy_init();
> > > 577 if (late_time_init)
> > > 578 late_time_init();
> > > 579 calibrate_delay();
> > > 580 pidmap_init();
> > > 581 pgtable_cache_init();
> >
> > hm.
> >
> > - Try to get the full oops record,
>
> BUG: unable to handle kernel paging request at virtual address 01020304
> printing eip:
> c041b95c
> *pde= 00000000
> Oops: 0000 [#1]
> 4K_STACK PREEMPT SMP
> last sysfs file:
> Modules linked in:
> CPU 0
> EIP: 0060: [<c041b95c>] Not tainted VLI
> EFLAGS: 00010202
> EIP is at kmem_cache_init+0x389/0x3f0
> [..]
> Call Trace:
> [<c0104063>] show_stack_log_lvl+0x8c/0x97
> [<c010422b>] show_registers+0x181/0x215
> [<c0104481>] die+0x1c2/0x2dd
> [<c0117419>] do_page_fault+0x410/0x4f3
> [<c02f40a1>] error_code+0x39/0x40
> [<c040b604>] start_kernel+0x21f/0x39d
> [<c0100210>] 0xc0100210
> [..]
> EIP: [<c041b95c>] kmem_cache_init+0x389/0x3f0 SS:ESP 0068:c0409fc4
> <0> Kernel panic - not syncing: Attempted to kill idle task!
>
> (gdb) list *0xc041b95c
> 0xc041b95c is in kmem_cache_init (/usr/src/linux-work1/mm/slab.c:714).
> 709 lockdep_set_class(&l3->list_lock, &on_slab_l3_key);
> 710 alc = l3->alien;
> 711 if (!alc)
> 712 continue;
> 713 for_each_node(r) {
> 714 if (alc[r])
> 715 lockdep_set_class(&alc[r]->lock,
> 716 &on_slab_alc_key);
> 717 }
> 718 }
>

The system works well without these patches:
slab-cache-shrinker-statistics.patch
slab-fix-lockdep-warnings.patch
slab-optimize-kmalloc_node-the-same-way-as-kmalloc-fix.patch
slab-optimize-kmalloc_node-the-same-way-as-kmalloc.patch
slab-respect-architecture-and-caller-mandated-alignment.patch

Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group
(http://www.stardust.webpages.pl/ltg/wiki/)

2006-08-08 23:43:15

by Andrew Morton

Subject: Re: mm snapshot broken-out-2006-08-08-00-59.tar.gz uploaded

On Wed, 9 Aug 2006 00:11:38 +0200
"Michal Piotrowski" <[email protected]> wrote:

> On 08/08/06, Andrew Morton <[email protected]> wrote:
> > On Tue, 8 Aug 2006 23:19:09 +0200
> > "Michal Piotrowski" <[email protected]> wrote:
> >
> > > > You
> > > > can look these things up in gdb or using addr2line, provided you have
> > > > CONFIG_DEBUG_INFO=y.
> > > >
> > > >
> > >
> > > (gdb) list *0xc047d609
> > > 0xc047d609 is in start_kernel (/usr/src/linux-work1/init/main.c:577).
> > > 572 cpuset_init_early();
> > > 573 mem_init();
> > > 574 kmem_cache_init();
> > > 575 setup_per_cpu_pageset();
> > > 576 numa_policy_init();
> > > 577 if (late_time_init)
> > > 578 late_time_init();
> > > 579 calibrate_delay();
> > > 580 pidmap_init();
> > > 581 pgtable_cache_init();
> >
> > hm.
> >
> > - Try to get the full oops record,
>
> BUG: unable to handle kernel paging request at virtual address 01020304
> printing eip:
> c041b95c
> *pde= 00000000
> Oops: 0000 [#1]
> 4K_STACK PREEMPT SMP
> last sysfs file:
> Modules linked in:
> CPU 0
> EIP: 0060: [<c041b95c>] Not tainted VLI
> EFLAGS: 00010202
> EIP is at kmem_cache_init+0x389/0x3f0
> [..]
> Call Trace:
> [<c0104063>] show_stack_log_lvl+0x8c/0x97
> [<c010422b>] show_registers+0x181/0x215
> [<c0104481>] die+0x1c2/0x2dd
> [<c0117419>] do_page_fault+0x410/0x4f3
> [<c02f40a1>] error_code+0x39/0x40
> [<c040b604>] start_kernel+0x21f/0x39d
> [<c0100210>] 0xc0100210
> [..]
> EIP: [<c041b95c>] kmem_cache_init+0x389/0x3f0 SS:ESP 0068:c0409fc4
> <0> Kernel panic - not syncing: Attempted to kill idle task!
>
> (gdb) list *0xc041b95c
> 0xc041b95c is in kmem_cache_init (/usr/src/linux-work1/mm/slab.c:714).
> 709 lockdep_set_class(&l3->list_lock, &on_slab_l3_key);
> 710 alc = l3->alien;
> 711 if (!alc)
> 712 continue;
> 713 for_each_node(r) {
> 714 if (alc[r])
> 715 lockdep_set_class(&alc[r]->lock,
> 716 &on_slab_alc_key);
> 717 }
> 718 }

ah-hah, thanks. The oopsing statement was added by
slab-fix-lockdep-warnings.patch.

I guess we can fix this by whacking another #ifdef CONFIG_NUMA in there but
I don't think that's how we want to address this.

We've been moving towards making the NUMA slab code work OK in a non-NUMA
build by setting the NUMA-specific fields to NULL and simply blowing a few
cycles at runtime to avoid many tens of ifdefs (it's that bad).

Here, we should have had either l3==NULL or l3->alien==NULL, but that has
been violated, hence the crash.

Kiran, could you take a look please? The 0x01020304 is interesting...
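
The invariant described above can be sketched in plain user-space C: when the non-NUMA stub reports "no alien caches" as NULL, shared code needs only a runtime NULL test and no #ifdef. This is an illustration under stated assumptions, not the slab implementation. The struct layouts are reduced to the single field discussed in this thread, and alloc_alien_stub, count_alien_caches and non_numa_alien_count are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct array_cache { int avail; };

/* Only the field relevant to this thread; the real struct has more. */
struct kmem_list3 { struct array_cache **alien; };

/* Hypothetical non-NUMA stub: no alien caches exist, so report none. */
static struct array_cache **alloc_alien_stub(int node, int limit)
{
	(void)node;
	(void)limit;
	return NULL;
}

/*
 * Shared code path with no #ifdef: a runtime NULL test covers both a
 * missing node list and a missing alien array.
 */
static int count_alien_caches(const struct kmem_list3 *l3, int nr_nodes)
{
	int r, found = 0;

	assert(nr_nodes >= 0);
	if (!l3 || !l3->alien)
		return 0;
	for (r = 0; r < nr_nodes; r++)
		if (l3->alien[r])
			found++;
	return found;
}

/* Builds a non-NUMA style list and walks it the way shared code would. */
static int non_numa_alien_count(int nr_nodes)
{
	struct kmem_list3 l3;

	l3.alien = alloc_alien_stub(0, 0);
	return count_alien_caches(&l3, nr_nodes);
}
```

With a stub like this the loop body is never reached on a non-NUMA build. The crash in this thread is what happens when the stub hands back a non-NULL placeholder instead, so the NULL test passes and the loop dereferences it.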

2006-08-09 00:01:05

by Michal Piotrowski

Subject: Re: mm snapshot broken-out-2006-08-08-00-59.tar.gz uploaded

On 09/08/06, Andrew Morton <[email protected]> wrote:
> On Wed, 9 Aug 2006 00:11:38 +0200
> "Michal Piotrowski" <[email protected]> wrote:
>
> > On 08/08/06, Andrew Morton <[email protected]> wrote:
> > > On Tue, 8 Aug 2006 23:19:09 +0200
> > > "Michal Piotrowski" <[email protected]> wrote:
> > >
> > > > > You
> > > > > can look these things up in gdb or using addr2line, provided you have
> > > > > CONFIG_DEBUG_INFO=y.
> > > > >
> > > > >
> > > >
> > > > (gdb) list *0xc047d609
> > > > 0xc047d609 is in start_kernel (/usr/src/linux-work1/init/main.c:577).
> > > > 572 cpuset_init_early();
> > > > 573 mem_init();
> > > > 574 kmem_cache_init();
> > > > 575 setup_per_cpu_pageset();
> > > > 576 numa_policy_init();
> > > > 577 if (late_time_init)
> > > > 578 late_time_init();
> > > > 579 calibrate_delay();
> > > > 580 pidmap_init();
> > > > 581 pgtable_cache_init();
> > >
> > > hm.
> > >
> > > - Try to get the full oops record,
> >
> > BUG: unable to handle kernel paging request at virtual address 01020304
> > printing eip:
> > c041b95c
> > *pde= 00000000
> > Oops: 0000 [#1]
> > 4K_STACK PREEMPT SMP
> > last sysfs file:
> > Modules linked in:
> > CPU 0
> > EIP: 0060: [<c041b95c>] Not tainted VLI
> > EFLAGS: 00010202
> > EIP is at kmem_cache_init+0x389/0x3f0
> > [..]
> > Call Trace:
> > [<c0104063>] show_stack_log_lvl+0x8c/0x97
> > [<c010422b>] show_registers+0x181/0x215
> > [<c0104481>] die+0x1c2/0x2dd
> > [<c0117419>] do_page_fault+0x410/0x4f3
> > [<c02f40a1>] error_code+0x39/0x40
> > [<c040b604>] start_kernel+0x21f/0x39d
> > [<c0100210>] 0xc0100210
> > [..]
> > EIP: [<c041b95c>] kmem_cache_init+0x389/0x3f0 SS:ESP 0068:c0409fc4
> > <0> Kernel panic - not syncing: Attempted to kill idle task!
> >
> > (gdb) list *0xc041b95c
> > 0xc041b95c is in kmem_cache_init (/usr/src/linux-work1/mm/slab.c:714).
> > 709 lockdep_set_class(&l3->list_lock, &on_slab_l3_key);
> > 710 alc = l3->alien;
> > 711 if (!alc)
> > 712 continue;
> > 713 for_each_node(r) {
> > 714 if (alc[r])
> > 715 lockdep_set_class(&alc[r]->lock,
> > 716 &on_slab_alc_key);
> > 717 }
> > 718 }
>
> ah-hah, thanks. The oopsing statement was added by
> slab-fix-lockdep-warnings.patch.

Confirmed.

Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group
(http://www.stardust.webpages.pl/ltg/wiki/)

2006-08-09 01:42:48

by Andi Kleen

Subject: Re: mm snapshot broken-out-2006-08-08-00-59.tar.gz uploaded

On Tue, Aug 08, 2006 at 02:05:11PM -0700, Andrew Morton wrote:
> On Tue, 8 Aug 2006 22:29:03 +0200
> "Michal Piotrowski" <[email protected]> wrote:
>
> > Hi Andrew,
> >
> > On 08/08/06, [email protected] <[email protected]> wrote:
> > > The mm snapshot broken-out-2006-08-08-00-59.tar.gz has been uploaded to
> > >
> > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/mm/broken-out-2006-08-08-00-59.tar.gz
> > >
> > > It contains the following patches against 2.6.18-rc4:
> >
> > It appears very early. 2.6.18-rc3-mm2 was fine.
> >
> > DWARF2 unwinder stuck at error_code+0x39/0x40
>
> The novelty of this thing has worn off. Guys, please let's not release 2.6.18 in
> this state.

The stucks are harmless (and currently expected) as long as the
dwarf2 trace and the leftover trace give you a full picture
and the unwinder doesn't crash (the latter would be a bug that needs
to be fixed before).

I have various fixes queued for the unwinder (or rather
for annotations used by the unwinder), but expect them
to only be merged with .19.

>
> > Leftover inexact backtrace
> > [<c0104194>] show_stack_log_lvl+0x8c/0x97

This might be ok.

> > [<c0104320>] show_registers+0x181/0x215
> > [<c0104576>] die+0x1c2/0x2dd
> > [<c0117419>] do_page_fault+0x410/0x4f3
> > [<c02f5271>] error_code+0x39/0x40
> > [<c0104194>] show_stack_log_lvl+0x8c/0x97
> > [<c0104320>] show_registers+0x181/0x215
> > [<c0104576>] die+0x1c2/0x2dd
> > [<c0117419>] do_page_fault+0x410/0x4f3
> > [<c02f5271>] error_code+0x39/0x40
> > [<c047b609>] start_kernel+0x224/0x3a2
> > [<c0100210>] 0xc0100210
> > Code: 00 39 .......
> > EIP:[<c01040ca>] show_trace_log_lvl+0x11b/0x159 SS:ESP 0068:c0479e74

This might be not. Hard to tell. Can we have a complete oops please?
(using netconsole or digital camera or firescope)

-andi

2006-08-09 01:44:01

by Andi Kleen

Subject: Re: mm snapshot broken-out-2006-08-08-00-59.tar.gz uploaded

On Wed, Aug 09, 2006 at 12:11:38AM +0200, Michal Piotrowski wrote:
> On 08/08/06, Andrew Morton <[email protected]> wrote:
> >On Tue, 8 Aug 2006 23:19:09 +0200
> >"Michal Piotrowski" <[email protected]> wrote:
> >
> >> > You
> >> > can look these things up in gdb or using addr2line, provided you have
> >> > CONFIG_DEBUG_INFO=y.
> >> >
> >> >
> >>
> >> (gdb) list *0xc047d609
> >> 0xc047d609 is in start_kernel (/usr/src/linux-work1/init/main.c:577).
> >> 572 cpuset_init_early();
> >> 573 mem_init();
> >> 574 kmem_cache_init();
> >> 575 setup_per_cpu_pageset();
> >> 576 numa_policy_init();
> >> 577 if (late_time_init)
> >> 578 late_time_init();
> >> 579 calibrate_delay();
> >> 580 pidmap_init();
> >> 581 pgtable_cache_init();
> >
> >hm.
> >
> >- Try to get the full oops record,
>
> BUG: unable to handle kernel paging request at virtual address 01020304
> printing eip:
> c041b95c
> *pde= 00000000
> Oops: 0000 [#1]
> 4K_STACK PREEMPT SMP
> last sysfs file:
> Modules linked in:
> CPU 0
> EIP: 0060: [<c041b95c>] Not tainted VLI
> EFLAGS: 00010202
> EIP is at kmem_cache_init+0x389/0x3f0

Well it didn't crash in the unwinder.
> [..]

And that [..] isn't an unwinder problem, but a human operator error.
Michal, you removed the valuable part of the backtrace.

-Andi

> Call Trace:
> [<c0104063>] show_stack_log_lvl+0x8c/0x97
> [<c010422b>] show_registers+0x181/0x215
> [<c0104481>] die+0x1c2/0x2dd
> [<c0117419>] do_page_fault+0x410/0x4f3
> [<c02f40a1>] error_code+0x39/0x40
> [<c040b604>] start_kernel+0x21f/0x39d
> [<c0100210>] 0xc0100210

2006-08-09 06:20:48

by Ravikiran G Thirumalai

Subject: Re: mm snapshot broken-out-2006-08-08-00-59.tar.gz uploaded

On Tue, Aug 08, 2006 at 04:42:10PM -0700, Andrew Morton wrote:
> On Wed, 9 Aug 2006 00:11:38 +0200
> "Michal Piotrowski" <[email protected]> wrote:
>
> ah-hah, thanks. The oopsing statement was added by
> slab-fix-lockdep-warnings.patch.
>
> I guess we can fix this by whacking another #ifdef CONFIG_NUMA in there but
> I don't think that's how we want to address this.
>
> We've been moving towards making the NUMA slab code work OK in a non-NUMA
> build by setting the NUMA-specific fields to NULL and simply blowing a few
> cycles at runtime to avoid many tens of ifdefs (it's that bad).
>
> Here, we should have had either l3==NULL or l3->alien==NULL, but that has
> been violated, hence the crash.
> Kiran, could you take a look please? The 0x01020304 is interesting...

Eeesh, because on SMP, alloc_alien_cache returns 0x01020304 instead of
NULL. And it returns 0x01020304 because CPU_UP_PREPARE fails if
alloc_alien_cache returns NULL. NUMA and non-NUMA slab should be able to
work even without alien caches; currently that doesn't seem to be the case.
We are working on that. In the meantime, the following patch should
fix the oops due to the lockdep annotation.

Thanks,
Kiran

Fix oops due to alien cache lockdep annotation on non-NUMA configurations.
A plain alien != NULL check won't work, as l3->alien is initialized with
0x01020304ul.

Signed-off-by: Ravikiran Thirumalai <[email protected]>

Index: linux-2.6.18-rc3mm3/mm/slab.c
===================================================================
--- linux-2.6.18-rc3mm3.orig/mm/slab.c 2006-08-08 19:19:51.000000000 -0700
+++ linux-2.6.18-rc3mm3/mm/slab.c 2006-08-08 21:53:53.000000000 -0700
@@ -674,6 +674,8 @@ static struct kmem_cache cache_cache = {
#endif
};

+#define BAD_ALIEN_MAGIC 0x01020304ul
+
#ifdef CONFIG_LOCKDEP

/*
@@ -705,7 +707,14 @@ static inline void init_lock_keys(void)
continue;
lockdep_set_class(&l3->list_lock, &on_slab_l3_key);
alc = l3->alien;
- if (!alc)
+ /*
+ * FIXME: This check for BAD_ALIEN_MAGIC
+ * should go away when common slab code is taught to
+ * work even without alien caches.
+ * Currently, non NUMA code returns BAD_ALIEN_MAGIC
+ * for alloc_alien_cache,
+ */
+ if (!alc || (unsigned long) alc == BAD_ALIEN_MAGIC)
continue;
for_each_node(r) {
if (alc[r])
@@ -1112,7 +1121,7 @@ static inline int cache_free_alien(struc

static inline struct array_cache **alloc_alien_cache(int node, int limit)
{
- return (struct array_cache **) 0x01020304ul;
+ return (struct array_cache **) BAD_ALIEN_MAGIC;
}

static inline void free_alien_cache(struct array_cache **ac_ptr)
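
Why the original `if (!alc)` test was not enough can be shown outside the kernel. The sketch below is a user-space illustration only: BAD_ALIEN_MAGIC and the two conditions come from the patch above, while passes_null_check, alien_is_usable, alien_slot_address and alien_magic_selfcheck are hypothetical helper names. The magic value is only compared and never dereferenced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BAD_ALIEN_MAGIC 0x01020304ul

struct array_cache { int avail; };

/* The check that was in place before the patch: only NULL is rejected. */
static int passes_null_check(struct array_cache **alc)
{
	return alc != NULL;
}

/* The check from the patch: NULL and the magic placeholder are both skipped. */
static int alien_is_usable(struct array_cache **alc)
{
	return alc != NULL && (uintptr_t)alc != BAD_ALIEN_MAGIC;
}

/* Address that reading alc[r] would touch; nothing is dereferenced here. */
static uintptr_t alien_slot_address(uintptr_t alc, unsigned int r)
{
	return alc + (uintptr_t)r * sizeof(struct array_cache *);
}

/* Ties the pieces together; aborts if any expectation fails. */
static void alien_magic_selfcheck(void)
{
	struct array_cache **magic = (struct array_cache **)BAD_ALIEN_MAGIC;

	assert(passes_null_check(magic));  /* old check lets the magic through */
	assert(!alien_is_usable(magic));   /* patched check skips it */
	assert(alien_slot_address(BAD_ALIEN_MAGIC, 0) == 0x01020304ul);
}
```

The last assertion lines up with the oops: with alc equal to the magic value, the first iteration reads alc[0] at address 0x01020304, which is the faulting virtual address reported in the trace.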

2006-08-09 12:49:30

by Michal Piotrowski

Subject: Re: mm snapshot broken-out-2006-08-08-00-59.tar.gz uploaded

On 9 Aug 2006 03:43:59 +0200, Andi Kleen <[email protected]> wrote:
> On Wed, Aug 09, 2006 at 12:11:38AM +0200, Michal Piotrowski wrote:
> > BUG: unable to handle kernel paging request at virtual address 01020304
> > printing eip:
> > c041b95c
> > *pde= 00000000
> > Oops: 0000 [#1]
> > 4K_STACK PREEMPT SMP
> > last sysfs file:
> > Modules linked in:
> > CPU 0
> > EIP: 0060: [<c041b95c>] Not tainted VLI
> > EFLAGS: 00010202
> > EIP is at kmem_cache_init+0x389/0x3f0
>
> Well it didn't crash in the unwinder.
> > [..]
>
> And that [..] isn't an unwinder problem, but a human operator error.
> Michal, you removed the valuable part of the backtrace.

No, I didn't. I haven't seen that oops. I don't have a serial console,
and the system hangs very early.

>
> -Andi
>

Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group
(http://www.stardust.webpages.pl/ltg/wiki/)

2006-08-09 13:21:23

by Michal Piotrowski

Subject: Re: mm snapshot broken-out-2006-08-08-00-59.tar.gz uploaded

Hi,

On 09/08/06, Ravikiran G Thirumalai <[email protected]> wrote:
> On Tue, Aug 08, 2006 at 04:42:10PM -0700, Andrew Morton wrote:
> > On Wed, 9 Aug 2006 00:11:38 +0200
> > "Michal Piotrowski" <[email protected]> wrote:
> >
> > ah-hah, thanks. The oopsing statement was added by
> > slab-fix-lockdep-warnings.patch.
> >
> > I guess we can fix this by whacking another #ifdef CONFIG_NUMA in there but
> > I don't think that's how we want to address this.
> >
> > We've been moving towards making the NUMA slab code work OK in a non-NUMA
> > build by setting the NUMA-specific fields to NULL and simply blowing a few
> > cycles at runtime to avoid many tens of ifdefs (it's that bad).
> >
> > Here, we should have had either l3==NULL or l3->alien==NULL, but that has
> > been violated, hence the crash.
> > Kiran, could you take a look please? The 0x01020304 is interesting...
>
> Eeesh, because on SMP, alloc_alien_cache returns 0x01020304 instead of
> NULL. And it returns 0x01020304 because CPU_UP_PREPARE fails if
> alloc_alien_cache returns NULL. NUMA and non-NUMA slab should be able to
> work even without alien caches; currently that doesn't seem to be the case.
> We are working on that. In the meantime, the following patch should
> fix the oops due to the lockdep annotation.
>
> Thanks,
> Kiran
>
> Fix oops due to alien cache lockdep annotation on non-NUMA configurations.
> A plain alien != NULL check won't work, as l3->alien is initialized with
> 0x01020304ul.

Bug fixed, thanks.

Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group
(http://www.stardust.webpages.pl/ltg/wiki/)