2006-11-16 15:34:52

by Lennart Sorensen

Subject: How to go about debuging a system lockup?

We have a router with a Geode SC1200 cpu, with 4 AMD 972 ethernet ports
(pcnet32) behind a PLX 6152 PCI-PCI bridge, which quite regularly locks
up completely if we try to do simultaneous traffic on all 4 ports (our
test case sends data from port 1 to port 2 and back, and from port 3 to
port 4 and back, at a rate of 8000 packets per second using 1500-byte
packets). We usually manage to run the test for about 1 minute before
the system hangs. This happens on every one of the systems we have
tried so far. If we only run 2 ports, it seems to never die, and with 3
ports we haven't seen any failures yet, although maybe we just haven't
tested long enough. If we just receive the packets but don't forward
them out again, then we never crash, so it seems to be related to
simultaneous transmit on the pcnet32s.
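
For scale, the test load can be estimated with a little arithmetic. The sketch below is only an estimate under stated assumptions: every active port both receives and transmits 8000 packets per second of 1500 bytes, Ethernet framing and DMA descriptor traffic are ignored, and the bus is taken to be plain 32-bit, 33 MHz PCI. The bus width and clock are not given in the message, and the function names are made up for illustration.

```python
# Back-of-the-envelope load estimate for the four-port forwarding test.
PACKETS_PER_SECOND = 8000   # from the test description
PACKET_BYTES = 1500         # from the test description

# Assumption: plain 32-bit PCI clocked at 33.33 MHz, 4 bytes per clock.
PCI_PEAK_BYTES_PER_SECOND = 33_333_333 * 4


def stream_bytes_per_second(pps=PACKETS_PER_SECOND, size=PACKET_BYTES):
    """Payload bytes per second for one stream in one direction."""
    return pps * size


def bus_bytes_per_second(ports):
    """Total payload bytes per second crossing the bus.

    Assumption: every active port receives one stream and transmits one,
    so each port contributes two streams.
    """
    return ports * 2 * stream_bytes_per_second()


for ports in (2, 3, 4):
    load = bus_bytes_per_second(ports)
    share = load / PCI_PEAK_BYTES_PER_SECOND
    print(f"{ports} ports: {load / 1e6:.0f} MB/s, {share:.0%} of assumed PCI peak")
```

Under these assumptions one stream is 12,000,000 bytes/s (96 Mbit/s), and four ports move about 96 MB/s of payload, roughly 72% of the assumed theoretical bus peak, before any overhead is counted.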

So far I have tried printing a message every time the pcnet32 driver
enables and disables interrupts to find out if it hangs somewhere with
interrupts disabled, but that didn't seem to indicate anything
meaningful.

I have tried this with 2.6.8, 2.6.16.22, and 2.6.18.2, with no
difference. I can't think of what kind of event could cause the
system to just hang with no further console output, kernel panic,
oops or anything. Most errors produce some kind of message.

Does anyone have any suggestions for where I go from here to find out
what is happening and where to look? I don't even know if I should
suspect the hardware or the software at this point. I want to know if
the program counter is still changing, or if the cpu is simply hung or
something, but I have no idea how to get at that.

--
Len Sorensen


2006-11-16 20:49:10

by Jesper Juhl

Subject: Re: How to go about debuging a system lockup?

On 16/11/06, Lennart Sorensen <[email protected]> wrote:
> We have a router with a Geode SC1200 cpu, with 4 AMD 972 ethernet ports
> (pcnet32) behind a PLX 6152 PCI-PCI bridge, which quite regularly locks
> up completely if we try to do simultanius traffic on all 4 ports (our
> test case sends data from port 1 to port 2, and back and from port 3 to
> port 4 and back at a rate of 8000 packets per second using 1500byte
> packets). We usually manage to run the test for about 1 minute before
> the system hangs. This happens on every one of the systems we have
> tried so far. If we only run 2 ports, it seems to never die, and with 3
> ports we haven't seen any failures yet, although maybe we just haven't
> tested long enough. If we just receive the packets but don't forward
> them out again, then we never crash, so it seems to be related to
> simultanious transmit on the pcnet32s.
>
> So far I have tried printing a message everytime the pcnet32 driver
> enables and disables interrupts to find out if it hangs somewhere with
> interrupts disabled, but that didn't seem to indicate anything
> meaningful.
>
> So far I have tried this with 2.6.8, 2.6.16.22, and 2.6.18.2 and no
> difference so far. I can't think of what kind of even could cause the
> system to just hang with no further console output or a kernel panic or
> oops or anything. Usually most errors produce some kind of message.
>
> Does anyone have any suggestions for where I go from here to find out
> what is happening and where to look? I don't even know if I should
> suspect the hardware or the software at this point. I want to know if
> the program counter is still changing, or if the cpu is simply hung or
> something, but I have no idea how to get at that.
>
Well, I have a few ideas that are hopefully useful.

- If you have not done so already, then go in to the "Kernel Hacking"
section of the kernel configuration and enable some (all?) of the
debug options and see if that produces anything that will help you
track down the problem.

- You could enable 'magic sysrq' and see if you can manage to get a
backtrace with it when it hangs (see Documentation/sysrq.txt). Oh, and
raise the console log level so you get all messages, including debug
ones.

- You could also try kdb (http://oss.sgi.com/projects/kdb/) or kgdb
(http://kgdb.linsyssoft.com/). That might help you pinpoint the
failure.
See also : http://kerneltrap.org/node/112

- If you have (or can identify) an older, working, kernel version and
you are confident that you can reproduce the problem reliably, then
doing a git bisection search starting with your newest "known good"
and oldest "known bad" kernel versions, should help you pinpoint the
commit causing the breakage.
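
The sysrq and log level suggestion above can be rehearsed before the hang with a few writes to procfs. This is a minimal sketch, assuming a kernel built with magic sysrq support and a root shell; the function names are invented for illustration, and the paths are parameters only so the functions can be tried against ordinary files.

```python
from pathlib import Path

PRINTK = Path("/proc/sys/kernel/printk")        # first field is the console log level
SYSRQ_ENABLE = Path("/proc/sys/kernel/sysrq")   # non-zero enables magic sysrq
SYSRQ_TRIGGER = Path("/proc/sysrq-trigger")     # write one key to invoke it


def raise_console_loglevel(level=8, printk=PRINTK):
    """Let messages of every priority, including debug, reach the console."""
    printk.write_text(f"{level}\n")


def enable_sysrq(sysrq=SYSRQ_ENABLE):
    """Turn on the magic sysrq key handling."""
    sysrq.write_text("1\n")


def trigger_sysrq(key="t", trigger=SYSRQ_TRIGGER):
    """Invoke a sysrq action from software; 't' dumps the task list."""
    if len(key) != 1:
        raise ValueError("sysrq key must be a single character")
    trigger.write_text(key)
```

Calling `raise_console_loglevel()`, `enable_sysrq()` and `trigger_sysrq("t")` as root while the machine is still healthy confirms that the output really reaches the console. Once the system has hung, the trigger file is no longer reachable and the key has to come from the keyboard or the serial console instead.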


Hope some of that helps :)


--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

2006-11-16 21:21:46

by Lennart Sorensen

Subject: Re: How to go about debuging a system lockup?

On Thu, Nov 16, 2006 at 09:49:06PM +0100, Jesper Juhl wrote:
> Well, I have a few ideas that are hopefully useul.
>
> - If you have not done so already, then go in to the "Kernel Hacking"
> section of the kernel configuration and enable some (all?) of the
> debug options and see if that produces anything that will help you
> track down the problem.

I enabled the things that sounded useful. I will try enabling the rest.

> - You could enable 'magic sysrq' and see if you can manage to get a
> backtrace with it when it hangs (see Documentation/sysrq.txt) (ohh and
> raise the console log level so you get all messages, including debug
> ones).

Yeah I did that. No response to sysrq (at least not on the serial
console. Maybe I should get a keyboard connector put on.) Normally we
run without VGA/keyboard/etc, and just serial console. Of course the
serial console requires working interrupts. Not sure about the keyboard
driver.

> - You could also try kdb (http://oss.sgi.com/projects/kdb/) or kgdb
> (http://kgdb.linsyssoft.com/). That might help you pinpoint the
> failure.

Can I run that remotely somehow? I never really looked at kdb or kgdb
before.

> See also : http://kerneltrap.org/node/112
>
> - If you have (or can identify) an older, working, kernel version and
> you are confident that you can reproduce the problem reliably, then
> doing a git bisection search starting with your newest "known good"
> and oldest "known bad" kernel versions, should help you pinpoint the
> commit causing the breakage.

I don't know of a good version yet. I so far don't know if there ever
was one. This could even be a bug in the PCI hardware, or the way the
BIOS on this system on a board configured the PCI controller. Maybe I
should go back and try a 2.4 kernel.

> Hope some of that helps :)

Well hopefully.

--
Len Sorensen

2006-11-16 21:30:06

by Jesper Juhl

Subject: Re: How to go about debuging a system lockup?

On 16/11/06, Lennart Sorensen <[email protected]> wrote:
> On Thu, Nov 16, 2006 at 09:49:06PM +0100, Jesper Juhl wrote:
...
> > - You could also try kdb (http://oss.sgi.com/projects/kdb/) or kgdb
> > (http://kgdb.linsyssoft.com/). That might help you pinpoint the
> > failure.
>
> Can I run that remotely somehow? I never really looked at kdb or kgdb
> before.
>
Yes you can. kgdb is run on a separate machine from the one you are debugging.

> > - If you have (or can identify) an older, working, kernel version and
> > you are confident that you can reproduce the problem reliably, then
> > doing a git bisection search starting with your newest "known good"
> > and oldest "known bad" kernel versions, should help you pinpoint the
> > commit causing the breakage.
>
> I don't know of a good version yet. I so far don't know if there ever
> was one. This could even be a bug in the PCI hardware, or the way the
> BIOS on this system on a board configured the PCI controller. Maybe I
> should go back and try a 2.4 kernel.
>
Or just try a few random older 2.6 kernels like 2.6.14, 2.6.9,
2.6.whatever (of course it needs to be a version that git knows
about).


--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

2006-11-16 22:01:44

by Protasevich, Natalie

Subject: RE: How to go about debuging a system lockup?

> I don't know of a good version yet. I so far don't know if there ever
> was one. This could even be a bug in the PCI hardware, or the way the
> BIOS on this system on a board configured the PCI controller. Maybe I
> should go back and try a 2.4 kernel.
>
> > Hope some of that helps :)
>
> Well hopefully.
>

If you can't drop into kdb and get no response to sysrq, then your
interrupts are disabled. It used to be (with older systems anyway) that
there was an NMI button on the system, so one could send an NMI and make
the handler print a trace. Newer systems might not have that, so you can
build your own PCI card to send an NMI :)
Another possibility is to use port 80 and make suspicious code print
something to it. Once we used a small self-built thing with LEDs to
catch the output to the parallel port while debugging a silent boot
failure. There are some port 80 cards that you can buy:
http://auctions.yahoo.com/i:Port%2080%20Card%20and%20power%20supply%20tester:102201489
http://www.amazon.com/gp/product/B000234U3I/ref=pd_cp_e_title/103-8875588-5330221
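
Before instrumenting a driver, it can help to check that such a card actually displays what is written to port 80. Inside the kernel the write would be an `outb()` to port 0x80; the sketch below does the equivalent from user space through `/dev/port`, assuming an x86 Linux system that provides that device and a root shell. The function name and the example code value are invented for illustration.

```python
import os

POST_PORT = 0x80  # conventional diagnostic (POST code) I/O port on PCs


def write_post_code(code, dev_path="/dev/port", port=POST_PORT):
    """Write one byte to an I/O port via /dev/port.

    In /dev/port the file offset selects the I/O port number, so seeking
    to 0x80 and writing a single byte performs one port write.
    """
    if not 0 <= code <= 0xFF:
        raise ValueError("POST code must fit in one byte")
    fd = os.open(dev_path, os.O_WRONLY)
    try:
        os.lseek(fd, port, os.SEEK_SET)
        os.write(fd, bytes([code]))
    finally:
        os.close(fd)
```

For example, `write_post_code(0x42)` should make the card show 42 if the port reaches it. Distinct values written at entry and exit of the suspect code paths then leave the last value visible after a hang.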

If your system has JTAG, then an in-target probe would be useful if you
have one (or can borrow one; those are expensive).

--Natalie

2006-11-16 22:37:28

by Lennart Sorensen

Subject: Re: How to go about debuging a system lockup?

On Thu, Nov 16, 2006 at 04:01:03PM -0600, Protasevich, Natalie wrote:
> If you can't drop in kdb, or no sysreq, then your interrupts are
> disabled. I used to be (with older systems anyway) that NMI button was
> on the system, so one could send an NMI and make the handler to print a
> trace. Newer systems might not have that, so you can built your own PCI
> card to send an NMI :)

I still haven't found a place to send an NMI on the Geode SC1200. I
really want one for exactly that reason. I have been suspecting that it
gets stuck somewhere with interrupts disabled, but I can't make sense of
where that could be. They mention something about the NMI being
implemented by SMM in their VSA. I don't like their virtual hardware
part very much.

> Another possibility is to use port 80 and make suspicious code print
> something to it. Once we used a small self-built thing with LEDs to
> catch the output to the parallel port while debugging silent boot
> failure. There are some port 80 cards that you can buy:
> http://auctions.yahoo.com/i:Port%2080%20Card%20and%20power%20supply%20tester:102201489
> http://www.amazon.com/gp/product/B000234U3I/ref=pd_cp_e_title/103-8875588-5330221

Hmm, one of those on the PCI bus might work. Or perhaps the parallel
port will. Of course if the problem is that somehow the PCI bus is
locked up, then I won't get a message anywhere since all the busses are
connected via PCI it seems. I don't know if a PCI bus can lock up, but
for now I was assuming anything was possible.

> If your system has a jtag then in target probe would be useful if you
> have one (or can borrow one, those are expensive).

I have asked the system on a board maker if it has jtag anywhere. Still
waiting on the answer to that.

--
Len Sorensen

2006-11-17 13:43:41

by Stefan Richter

Subject: Re: How to go about debuging a system lockup?

Lennart Sorensen wrote:
> On Thu, Nov 16, 2006 at 04:01:03PM -0600, Protasevich, Natalie wrote:
>> There are some port 80 cards that you can buy:
...
> Hmm, one of those on the PCI bus might work. Or perhaps the parallel
> port will. Of course if the problem is that somehow the PCI bus is
> locked up, then I won't get a message anywhere since all the busses are
> connected via PCI it seems. I don't know if a PCI bus can lock up, but
> for now I was assuming anything was possible.

If the PCI bus itself isn't brought down, you could debug from remote
using Benjamin Herrenschmidt's Firescope on the remote node and a
FireWire card in the test machine. Once the ohci1394 driver is loaded,
the FireWire controller is able to read and write to the 32bit PCI
address range (and thus to system memory) without assistance of
interrupt handlers.
--
Stefan Richter
-=====-=-==- =-== =---=
http://arcgraph.de/sr/

2006-11-17 14:29:40

by Lennart Sorensen

Subject: Re: How to go about debuging a system lockup?

On Fri, Nov 17, 2006 at 02:43:36PM +0100, Stefan Richter wrote:
> If the PCI bus itself isn't brought down, you could debug from remote
> using Benjamin Herrenschmidt's Firescope on the remote node and a
> FireWire card in the test machine. Once the ohci1394 driver was loaded,
> the FireWire controller is able to read and write to the 32bit PCI
> address range (and thus to system memory) without assistance of
> interrupt handlers.

Wow, that looks really neat. I will have to go read up on that tool.

--
Len Sorensen

2006-11-17 22:44:20

by Lennart Sorensen

Subject: Re: How to go about debuging a system lockup?

On Fri, Nov 17, 2006 at 09:29:28AM -0500, Lennart Sorensen wrote:
> Wow, that looks really neat. I will have to go read up on that tool.

OK, I have now tried connecting with firescope to just follow the dmesg
buffer across firewire. Works great, until the system hangs, then
firescope reports that it couldn't perform the read. I wonder what part
of the system has to lock up for the firewire card to no longer be able
to read memory on the system.

--
Len Sorensen

2006-11-17 23:10:05

by Stefan Richter

Subject: Re: How to go about debuging a system lockup?

Lennart Sorensen wrote:
> OK, I have now tried connecting with firescope to just follow the dmesg
> buffer across firewire. Works great, until the system hangs, then
> firescope reports that it couldn't perform the read. I wonder what part
> of the system has to lock up for the firewire card to no longer be able
> to read memory on the system.

I suppose the PCI bus is no longer accessible to the chip.
--
Stefan Richter
-=====-=-==- =-== =--=-
http://arcgraph.de/sr/

2006-11-18 01:14:24

by Krzysztof Halasa

Subject: Re: How to go about debuging a system lockup?

"Jesper Juhl" <[email protected]> writes:

> Or just try a few random older 2.6 kernels like 2.6.14, 2.6.9,
> 2.6.whatever (of course it needs to be a version that git knows
> about).

One can also "bisect" manually, which works with all kernels.
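
A manual bisection is just a binary search over an ordered list of builds. The sketch below assumes the list is ordered oldest to newest, that the first entry is known good and the last known bad, and that the breakage stays once introduced. The build labels and the `is_bad` callback are hypothetical; in practice the callback stands for booting that kernel and running the traffic test.

```python
def first_bad(versions, is_bad):
    """Return the earliest version for which is_bad() is true.

    Preconditions (assumptions of this sketch): versions is ordered from
    oldest to newest, versions[0] is known good, versions[-1] is known bad.
    """
    if len(versions) < 2:
        raise ValueError("need at least one good and one bad version")
    good, bad = 0, len(versions) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(versions[mid]):
            bad = mid      # breakage is at mid or earlier
        else:
            good = mid     # breakage is after mid
    return versions[bad]


# Hypothetical labels; each call to the callback is one boot-and-test cycle.
builds = [f"build-{n:02d}" for n in range(1, 9)]
culprit = first_bad(builds, lambda build: build >= "build-05")
print(culprit)
```

Each step halves the remaining range, so eight candidate builds need three test runs. Without a known good version there is no starting point for the search.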
--
Krzysztof Halasa

2006-11-20 15:21:11

by Lennart Sorensen

Subject: Re: How to go about debuging a system lockup?

On Sat, Nov 18, 2006 at 12:09:54AM +0100, Stefan Richter wrote:
> Lennart Sorensen wrote:
> > OK, I have now tried connecting with firescope to just follow the dmesg
> > buffer across firewire. Works great, until the system hangs, then
> > firescope reports that it couldn't perform the read. I wonder what part
> > of the system has to lock up for the firewire card to no longer be able
> > to read memory on the system.
>
> I suppose the PCI bus is no longer accessible to the chip.

Sure seems that way. Makes me wonder if somehow a PCI transfer fails,
and the PCI controller isn't aborting the transfer after a timeout
(quite likely given the timeout timer is never enabled, and whenever I
try to do so, it seems to hang the system). Time to start scoping the
lines.
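
Alongside the scope, the standard PCI configuration registers can be checked for abort and timer state. The message does not say which timeout timer is meant, so this is only a sketch of decoding generic fields: the latency timer, the status abort bits, and, for a PCI-PCI bridge, the secondary latency timer and the discard timer bits of the bridge control register. It assumes the first 64 bytes of configuration space have been read, for instance from the device's `config` file in sysfs; the function name is invented.

```python
import struct


def summarize_pci_config(cfg):
    """Decode a few standard fields from the first 64 config-space bytes."""
    if len(cfg) < 64:
        raise ValueError("need at least the 64-byte standard header")
    vendor, device, _command, status = struct.unpack_from("<HHHH", cfg, 0x00)
    header_type = cfg[0x0E] & 0x7F          # bit 7 is the multi-function flag
    info = {
        "vendor": vendor,
        "device": device,
        "latency_timer": cfg[0x0D],
        "received_target_abort": bool(status & (1 << 12)),
        "received_master_abort": bool(status & (1 << 13)),
        "is_bridge": header_type == 1,
    }
    if header_type == 1:                    # PCI-PCI bridge header layout
        (bridge_ctl,) = struct.unpack_from("<H", cfg, 0x3E)
        info["secondary_latency_timer"] = cfg[0x1B]
        # Bits 8 and 9 select the short discard timeout on each side;
        # bit 10 reports that a discard timer has expired.
        info["primary_discard_short"] = bool(bridge_ctl & (1 << 8))
        info["secondary_discard_short"] = bool(bridge_ctl & (1 << 9))
        info["discard_timer_expired"] = bool(bridge_ctl & (1 << 10))
    return info
```

Comparing the output for the bridge and the four Ethernet devices shortly before a hang, and again after a reboot, may show whether aborts or expired discard timers are being recorded. Chipset-specific timers, if that is what is meant, live in vendor registers that this sketch does not cover.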

--
Len Sorensen

2006-11-21 04:18:14

by Keith Owens

Subject: Re: How to go about debuging a system lockup?

Lennart Sorensen (on Thu, 16 Nov 2006 16:21:40 -0500) wrote:
>On Thu, Nov 16, 2006 at 09:49:06PM +0100, Jesper Juhl wrote:
>> Well, I have a few ideas that are hopefully useul.
>>
>> - If you have not done so already, then go in to the "Kernel Hacking"
>> section of the kernel configuration and enable some (all?) of the
>> debug options and see if that produces anything that will help you
>> track down the problem.
>
>I enabled the things that sounded useful. I will try enabling the rest.
>
>> - You could enable 'magic sysrq' and see if you can manage to get a
>> backtrace with it when it hangs (see Documentation/sysrq.txt) (ohh and
>> raise the console log level so you get all messages, including debug
>> ones).
>
>Yeah I did that. No response to sysrq (at least not on the serial
>console. Maybe I should get a keyboard connector put on.) Normally we
>run without VGA/keyboard/etc, and just serial console. Of course the
>serial console requires working interrupts. Not sure about the keyboard
>driver.
>
>> - You could also try kdb (http://oss.sgi.com/projects/kdb/) or kgdb
>> (http://kgdb.linsyssoft.com/). That might help you pinpoint the
>> failure.
>
>Can I run that remotely somehow? I never really looked at kdb or kgdb
>before.

kgdb can only be run remotely. kdb can be run on the local keyboard/console
or over a serial console.