2006-05-31 19:32:50

by Martin Bligh

Subject: 2.6.17-rc5-mm1

The x86_64 panic in LTP has changed a bit. Looks more useful now.
Possibly just unrelated new stuff. Possibly we got lucky.

M


http://test.kernel.org/abat/33807/debug/console.log

Unable to handle kernel paging request at 0000000000001034 RIP:
[<ffffffff8048c5c4>] __sched_text_start+0x51c/0x745
PGD 1d836a067 PUD 1e2704067 PMD 0
Oops: 0002 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:06.0/resource
CPU 0
Modules linked in:
Pid: 230, comm: kswapd0 Not tainted 2.6.17-rc5-mm1-autokern1 #1
RIP: 0010:[<ffffffff8048c5c4>] [<ffffffff8048c5c4>]
__sched_text_start+0x51c/0x745
RSP: 0000:ffff810100237b48 EFLAGS: 00010093
RAX: ffff8100e3c592b0 RBX: 0000000000000d84 RCX: ffffffff805f0cd8
RDX: 000000003b9a078a RSI: ffff8100e3c590d0 RDI: 000000000007989e
RBP: ffff810100237be8 R08: ffff81007bfec930 R09: ffff81007bfec8e8
R10: ffff81007bfec8e8 R11: ffff81007ff3c2d8 R12: 0000000000000000
R13: ffff8101fd8ec140 R14: ffff810001011680 R15: 000002b3bb3b33a4
FS: 0000000000000000(0000) GS:ffffffff8061c000(0000) knlGS:00000000f7db6460
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000001034 CR3: 00000001f18fa000 CR4: 00000000000006e0
Process kswapd0 (pid: 230, threadinfo ffff810100236000, task
ffff8100e3c590d0)
Stack: ffff81007bfec8e8 ffff81007bfec8e8 ffffffff805f0cd8 000000007ff3c2d8
ffff8100e3c590d0 000000000000c276 ffffffff805f0cd8 ffff8100e3c592b0
ffff81007be407b0 ffff81007be407b0
Call Trace:
[<ffffffff8024d24c>] __remove_from_page_cache+0x1c/0x65
[<ffffffff8048d16c>] cond_resched+0x4c/0x7b
[<ffffffff8025783d>] shrink_inactive_list+0x125/0x882
[<ffffffff8048c7ed>] thread_return+0x0/0xd1
[<ffffffff80258493>] shrink_zone+0xd2/0xf3
[<ffffffff80258a4b>] kswapd+0x359/0x49b
[<ffffffff8023e51c>] autoremove_wake_function+0x0/0x37
[<ffffffff802586f2>] kswapd+0x0/0x49b
[<ffffffff8023dff8>] kthread+0xd0/0xfc
[<ffffffff8020a34e>] child_rip+0x8/0x12
[<ffffffff8023df28>] kthread+0x0/0xfc
[<ffffffff8020a346>] child_rip+0x0/0x12


2006-05-31 21:04:08

by Andrew Morton

Subject: Re: 2.6.17-rc5-mm1

Martin Bligh <[email protected]> wrote:
>
> The x86_64 panic in LTP has changed a bit. Looks more useful now.
> Possibly just unrelated new stuff. Possibly we got lucky.
>
> M
>
>
> http://test.kernel.org/abat/33807/debug/console.log
>
> Unable to handle kernel paging request at 0000000000001034 RIP:
> [<ffffffff8048c5c4>] __sched_text_start+0x51c/0x745
> PGD 1d836a067 PUD 1e2704067 PMD 0
> Oops: 0002 [1] SMP
> last sysfs file: /devices/pci0000:00/0000:00:06.0/resource
> CPU 0
> Modules linked in:
> Pid: 230, comm: kswapd0 Not tainted 2.6.17-rc5-mm1-autokern1 #1
> RIP: 0010:[<ffffffff8048c5c4>] [<ffffffff8048c5c4>]
> __sched_text_start+0x51c/0x745
> RSP: 0000:ffff810100237b48 EFLAGS: 00010093
> RAX: ffff8100e3c592b0 RBX: 0000000000000d84 RCX: ffffffff805f0cd8
> RDX: 000000003b9a078a RSI: ffff8100e3c590d0 RDI: 000000000007989e
> RBP: ffff810100237be8 R08: ffff81007bfec930 R09: ffff81007bfec8e8
> R10: ffff81007bfec8e8 R11: ffff81007ff3c2d8 R12: 0000000000000000
> R13: ffff8101fd8ec140 R14: ffff810001011680 R15: 000002b3bb3b33a4
> FS: 0000000000000000(0000) GS:ffffffff8061c000(0000) knlGS:00000000f7db6460
> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 0000000000001034 CR3: 00000001f18fa000 CR4: 00000000000006e0
> Process kswapd0 (pid: 230, threadinfo ffff810100236000, task
> ffff8100e3c590d0)
> Stack: ffff81007bfec8e8 ffff81007bfec8e8 ffffffff805f0cd8 000000007ff3c2d8
> ffff8100e3c590d0 000000000000c276 ffffffff805f0cd8 ffff8100e3c592b0
> ffff81007be407b0 ffff81007be407b0
> Call Trace:
> [<ffffffff8024d24c>] __remove_from_page_cache+0x1c/0x65
> [<ffffffff8048d16c>] cond_resched+0x4c/0x7b
> [<ffffffff8025783d>] shrink_inactive_list+0x125/0x882
> [<ffffffff8048c7ed>] thread_return+0x0/0xd1
> [<ffffffff80258493>] shrink_zone+0xd2/0xf3
> [<ffffffff80258a4b>] kswapd+0x359/0x49b
> [<ffffffff8023e51c>] autoremove_wake_function+0x0/0x37
> [<ffffffff802586f2>] kswapd+0x0/0x49b
> [<ffffffff8023dff8>] kthread+0xd0/0xfc
> [<ffffffff8020a34e>] child_rip+0x8/0x12
> [<ffffffff8023df28>] kthread+0x0/0xfc
> [<ffffffff8020a346>] child_rip+0x0/0x12

What are you doing to make this happen?

2006-05-31 21:23:14

by Martin Bligh

Subject: Re: 2.6.17-rc5-mm1

Andrew Morton wrote:
> Martin Bligh <[email protected]> wrote:
>
>>The x86_64 panic in LTP has changed a bit. Looks more useful now.
>>Possibly just unrelated new stuff. Possibly we got lucky.
>
> What are you doing to make this happen?

runalltests on LTP

M.

2006-05-31 21:40:25

by Andrew Morton

Subject: Re: 2.6.17-rc5-mm1

"Martin J. Bligh" <[email protected]> wrote:
>
> Andrew Morton wrote:
> > Martin Bligh <[email protected]> wrote:
> >
> >>The x86_64 panic in LTP has changed a bit. Looks more useful now.
> >>Possibly just unrelated new stuff. Possibly we got lucky.
> >
> > What are you doing to make this happen?
>
> runalltests on LTP
>

We have to get to the bottom of this - there's a shadow over about 500
patches and we don't know which.

iirc I tried to reproduce this a couple of weeks back and failed.

Are you able to narrow it down to a particular LTP test? It was mtest01 or
something like that? Perhaps we can identify a particular command line
which triggers the fault in a standalone fashion?

And why can't I make it happen? Perhaps it's a memory initialisation
problem, and it only happens to hit in that stage of LTP because that's
when you started doing page reclaim, or something? Perhaps just
try putting a heap of memory pressure on the machine, see what
that does?

Being unable to reproduce it and not having a theory to go on leaves us
kinda stuck. Help, please?

2006-05-31 21:53:21

by Martin Bligh

Subject: Re: 2.6.17-rc5-mm1

Andrew Morton wrote:
> "Martin J. Bligh" <[email protected]> wrote:
>
>>Andrew Morton wrote:
>>
>>>Martin Bligh <[email protected]> wrote:
>>>
>>>
>>>>The x86_64 panic in LTP has changed a bit. Looks more useful now.
>>>>Possibly just unrelated new stuff. Possibly we got lucky.
>>>
>>>What are you doing to make this happen?
>>
>>runalltests on LTP
>>
>
>
> We have to get to the bottom of this - there's a shadow over about 500
> patches and we don't know which.

We did do one chop, and concluded it wasn't the x86_64 patches.

> iirc I tried to reproduce this a couple of weeks back and failed.

It looks like a different panic to me. It was a double-fault before.

> Are you able to narrow it down to a particular LTP test? It was mtest01 or
> something like that? Perhaps we can identify a particular command line
> which triggers the fault in a standalone fashion?

I can't do much from here - it's running on an IBM machine. Have to beg
Andy, or one of the other IBMers, for help.

> And why can't I make it happen? Perhaps it's a memory initialisation
> problem, and it only happens to hit in that stage of LTP because that's
> when you started doing page reclaim, or something?

It consistently happens on -mm, and not mainline, flicking back and
forth over time. So if you mean h/w mem init, I don't think so. If you
mean some patch in -mm, then yes.

> Perhaps just try putting a heap of memory pressure on the machine,
> see what that does?

Yes, the other stuff might not be swapping.

> Being unable to reproduce it and not having a theory to go on leaves us
> kinda stuck. Help, please?

Yeah, we have a sniff-testing mechanism that works well. However,
drill-down still requires significant amounts of human intervention.
The next gen of tooling should help us do more intelligent drill-down,
but we're kind of stuck with humans in the loop for now.

M.

2006-05-31 22:15:17

by Andy Whitcroft

Subject: Re: 2.6.17-rc5-mm1

Martin J. Bligh wrote:
> Andrew Morton wrote:
>
>> "Martin J. Bligh" <[email protected]> wrote:
>>
>>> Andrew Morton wrote:
>>>
>>>> Martin Bligh <[email protected]> wrote:
>>>>
>>>>
>>>>> The x86_64 panic in LTP has changed a bit. Looks more useful now.
>>>>> Possibly just unrelated new stuff. Possibly we got lucky.
>>>>
>>>>
>>>> What are you doing to make this happen?
>>>
>>>
>>> runalltests on LTP
>>>
>>
>>
>> We have to get to the bottom of this - there's a shadow over about 500
>> patches and we don't know which.
>
>
> We did do one chop, and concluded it wasn't the x86_64 patches.
>
>> iirc I tried to reproduce this a couple of weeks back and failed.
>
>
> It looks like a different panic to me. It was a double-fault before.
>
>> Are you able to narrow it down to a particular LTP test? It was
>> mtest01 or
>> something like that? Perhaps we can identify a particular command line
>> which triggers the fault in a standalone fashion?
>
>
> I can't do much from here - it's running on an IBM machine. Have to beg
> Andy, or one of the other IBMers, for help.
>
>> And why can't I make it happen? Perhaps it's a memory initialisation
>> problem, and it only happens to hit in that stage of LTP because that's
>> when you started doing page reclaim, or something?
>
>
> It consistently happens on -mm, and not mainline, flicking back and
> forth over time. So if you mean h/w mem init, I don't think so. If you
> mean some patch in -mm, then yes.
>
>> Perhaps just try putting a heap of memory pressure on the machine,
>
>> see what that does?
>
> Yes, the other stuff might not be swapping.
>
>> Being unable to reproduce it and not having a theory to go on leaves us
>> kinda stuck. Help, please?
>
>
> Yeah, we have a sniff-testing mechanism that works well. However,
> drill-down still requires significant amounts of human intervention.
> The next gen of tooling should help us do more intelligent drill-down,
> but we're kind of stuck with humans in the loop for now.

I am sure I got halfway through diagnosing this one. We were context
switching to a bad thread. I've been meaning to get back to it. It's at
the top of my list for the AM.

-apw

2006-05-31 23:06:53

by Ingo Molnar

Subject: Re: 2.6.17-rc5-mm1


* Andrew Morton <[email protected]> wrote:

> > >>The x86_64 panic in LTP has changed a bit. Looks more useful now.
> > >>Possibly just unrelated new stuff. Possibly we got lucky.
> > >
> > > What are you doing to make this happen?
> >
> > runalltests on LTP
> >
>
> We have to get to the bottom of this - there's a shadow over about 500
> patches and we don't know which.
>
> iirc I tried to reproduce this a couple of weeks back and failed.

Martin, is the box still somewhat operational after such a crash? If yes
then we could use my crash-tracer to see the kernel function call
history leading up to the crash:

http://redhat.com/~mingo/lockdep-patches/latency-tracing-lockdep.patch

just apply the patch, accept the offered Kconfig defaults and it will be
configured to do the trace-crashes thing. Reproduce the crash and save
/proc/latency_trace - it contains the execution history leading up to
the crash. (on the CPU that crashes) Should work on i386 and x86_64.

the trace is saved upon the first crash or lockdep assert that occurs on
the box. (but you'll have lockdep disabled, so it's the crash that
matters)

if the box dies after a crash then there's a possibility to print the
execution history to the serial console - but that takes around 10-15
minutes even on 115200 baud. If you want/need to do this then edit
kernel/latency.c and change "trace_print_at_crash = 0" to
"trace_print_at_crash = 1".

(btw., the tracer has another neat feature as well: if a kernel crashes
or triple faults (and reboots) early during bootup, then the tracer can
be configured to print all function calls to the serial console, via
early_printk() - right when the function calls happen. I debugged
numerous nasty boot-time bugs via this. To set it, change
"print_functions = 0" to "print_functions = 1" in kernel/latency.c.)

Ingo

2006-05-31 23:14:46

by Ingo Molnar

Subject: Re: 2.6.17-rc5-mm1


* Ingo Molnar <[email protected]> wrote:

> Martin, is the box still somewhat operational after such a crash? If
> yes then we could use my crash-tracer to see the kernel function call
> history leading up to the crash:
>
> http://redhat.com/~mingo/lockdep-patches/latency-tracing-lockdep.patch
>
> just apply the patch, accept the offered Kconfig defaults and it will
> be configured to do the trace-crashes thing. Reproduce the crash and
> save /proc/latency_trace - it contains the execution history leading
> up to the crash. (on the CPU that crashes) Should work on i386 and
> x86_64.
>
> the trace is saved upon the first crash or lockdep assert that occurs
> on the box. (but you'll have lockdep disabled, so it's the crash that
> matters)

i just provoked a NULL pointer dereference with the tracer applied, and
/proc/latency_trace contained the proper trace, leading up to the crash:

gettimeo-2333 0D... 2210us : trace_hardirqs_on (restore_nocheck)
gettimeo-2333 0.... 2210us > sys_gettimeofday (00000000 00000000 0000007b)
gettimeo-2333 0.... 2210us : sys_gettimeofday (sysenter_past_esp)
gettimeo-2333 0D... 2211us : do_page_fault (error_code)
gettimeo-2333 0D... 2211us : do_page_fault (c0123238 0 2)
gettimeo-2333 0D... 2211us : do_page_fault (10202 202 7b)
gettimeo-2333 0D... 2211us : trace_hardirqs_on (do_page_fault)
gettimeo-2333 0.... 2211us : lockdep_acquire (do_page_fault)

for best trace output you should have KALLSYMS and KALLSYMS_ALL enabled.

of course it could happen that tracing makes your crash go away ...

Ingo

2006-05-31 23:30:13

by Martin Bligh

Subject: Re: 2.6.17-rc5-mm1

> Martin, is the box still somewhat operational after such a crash? If yes
> then we could use my crash-tracer to see the kernel function call
> history leading up to the crash:
>
> http://redhat.com/~mingo/lockdep-patches/latency-tracing-lockdep.patch
>
> just apply the patch, accept the offered Kconfig defaults and it will be
> configured to do the trace-crashes thing. Reproduce the crash and save
> /proc/latency_trace - it contains the execution history leading up to
> the crash. (on the CPU that crashes) Should work on i386 and x86_64.
>
> the trace is saved upon the first crash or lockdep assert that occurs on
> the box. (but you'll have lockdep disabled, so it's the crash that
> matters)
>
> if the box dies after a crash then there's a possibility to print the
> execution history to the serial console - but that takes around 10-15
> minutes even on 115200 baud. If you want/need to do this then edit
> kernel/latency.c and change "trace_print_at_crash = 0" to
> "trace_print_at_crash = 1".
>
> (btw., the tracer has another neat feature as well: if a kernel crashes
> or triple faults (and reboots) early during bootup, then the tracer can
> be configured to print all function calls to the serial console, via
> early_printk() - right when the function calls happen. I debugged
> numerous nasty boot-time bugs via this. To set it, change
> "print_functions = 0" to "print_functions = 1" in kernel/latency.c.)

Looks cool, but we'll have to beg the IBM guys to run it - I don't have
access to that box anymore. Dumping to the serial console would be cool,
even if it's slow (the test boxes all have that logged).

M.

2006-05-31 23:38:13

by Martin Bligh

Subject: Re: 2.6.17-rc5-mm1

Andrew Morton wrote:
> "Martin J. Bligh" <[email protected]> wrote:
>
>>Andrew Morton wrote:
>>
>>>Martin Bligh <[email protected]> wrote:
>>>
>>>
>>>>The x86_64 panic in LTP has changed a bit. Looks more useful now.
>>>>Possibly just unrelated new stuff. Possibly we got lucky.
>>>
>>>What are you doing to make this happen?
>>
>>runalltests on LTP
>>
>
>
> We have to get to the bottom of this - there's a shadow over about 500
> patches and we don't know which.
>
> iirc I tried to reproduce this a couple of weeks back and failed.
>
> Are you able to narrow it down to a particular LTP test? It was mtest01 or
> something like that? Perhaps we can identify a particular command line
> which triggers the fault in a standalone fashion?

The LTP output is here:

http://test.kernel.org/abat/33803/debug/test.log.1

The last test run was memset01

Judging from a good test run (http://test.kernel.org/abat/33964/debug/test.log.1),
the test run after memset01 is a second instance of the same test.

Which is bad I suppose, in that it's likely an intermittent failure.
Perhaps you can try running memset01 in a loop? I don't have such a
box set up here right now, I'm afraid ... will see what I can do.

OTOH, it looks like this might be a different failure than the double
fault we saw in previous -mm's, which was consistently in mtest01, IIRC.

M.

2006-06-01 00:45:17

by Robert Hancock

Subject: Re: 2.6.17-rc5-mm1

Martin Bligh wrote:
>> We have to get to the bottom of this - there's a shadow over about 500
>> patches and we don't know which.
>>
>> iirc I tried to reproduce this a couple of weeks back and failed.
>>
>> Are you able to narrow it down to a particular LTP test? It was
>> mtest01 or
>> something like that? Perhaps we can identify a particular command line
>> which triggers the fault in a standalone fashion?
>
> The LTP output is here:
>
> http://test.kernel.org/abat/33803/debug/test.log.1
>
> The last test run was memset01
>
> Judging from a good test run (http://test.kernel.org/abat/33964/debug/test.log.1),
> the test run after memset01 is a second instance of the same test.
>
> Which is bad I suppose, in that it's likely an intermittent failure.
> Perhaps you can try running memset01 in a loop? I don't have such a
> box set up here right now, I'm afraid ... will see what I can do.
>
> OTOH, it looks like this might be a different failure than the double
> fault we saw in previous -mm's, which was consistently in mtest01, IIRC.

As a shot in the dark, I've seen problems on my Athlon 64 box with a
program that does memset on a huge chunk of memory repeatedly which
causes the machine to panic in various ways, lock up or reboot. Is this
what that test is doing? I suspect my problem is caused by AMD Athlon
64/Opteron CPU erratum 97, "128-Bit Streaming Stores May Cause Coherency
Failure". The Newcastle CPU I have has this bug, which can cause loss of
coherency on non-temporal stores, which the glibc memset function uses.
The BIOS is supposed to apply a workaround, but I have no way of knowing
whether mine (Asus A8N-SLI Deluxe) does.

And no, it's not a memory problem - the system passes memtest86 overnight
without error. The problem usually shows up within a minute of starting
the continuous-memset program.
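
For reference, a minimal stand-alone stress program along those lines - a
rough sketch, not my exact program and not the LTP test itself - just
memsets a large buffer forever; at sizes this big, glibc's memset is the
code path that can use the non-temporal stores mentioned above:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK (256UL * 1024 * 1024)    /* 256MB per pass; the size is arbitrary */

int main(void)
{
    unsigned long pass = 0;
    char *buf = malloc(CHUNK);

    if (!buf) {
        perror("malloc");
        return 1;
    }

    for (;;) {
        /* alternate fill bytes so every pass rewrites the whole buffer */
        memset(buf, (pass & 1) ? 0xaa : 0x55, CHUNK);
        if (++pass % 16 == 0)
            printf("pass %lu\n", pass);
    }
}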

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/