2005-09-15 17:40:32

by Tomasz Kłoczko

[permalink] [raw]
Subject: [2.6.14-rc1/sparc54]: BUG: soft lockup detected on CPU#0!


I'm just catch series of kernel messages with soft lockup detected reports
(in attachment).
It occures during store big amout of data on NFS volume.

As NIC I use now Sun Swift gigabit eth (cassini driver). Probably this is
NFS related because I'm just browse yesterday logs and simillar was also
on Sun Happy Meal.

kloczek
--
-----------------------------------------------------------
*Ludzie nie maj? problem?w, tylko sobie sami je stwarzaj?*
-----------------------------------------------------------
Tomasz K?oczko, sys adm @zie.pg.gda.pl|*e-mail: [email protected]*


Attachments:
dmesg (21.57 kB)

2005-09-15 20:30:38

by David Miller

[permalink] [raw]
Subject: Re: [2.6.14-rc1/sparc54]: BUG: soft lockup detected on CPU#0!

From: Tomasz K?oczko <[email protected]>
Date: Thu, 15 Sep 2005 19:40:27 +0200 (CEST)

> I'm just catch series of kernel messages with soft lockup detected reports
> (in attachment).
> It occures during store big amout of data on NFS volume.
>
> As NIC I use now Sun Swift gigabit eth (cassini driver). Probably this is
> NFS related because I'm just browse yesterday logs and simillar was also
> on Sun Happy Meal.

Interesting. Can you reproduce this with SLAB poisioning disabled?
That debugging feature is extremely expensive, although it shouldn't
make the CPU stop scheduling processes for more than 10 seconds.

I wonder if the NFS daemon code needs to have some limits put on
how much cpu it consumes handling requests before it gives up the
cpu. Perhaps, it has such throttling already, I don't know.

I'll also try to see if there can be some kind of sparc64 specific
issue which would cause this.

Where did you get that Cassini driver btw? It's not upstream,
although if it exists it should be.

2005-09-15 20:36:15

by Jima

[permalink] [raw]
Subject: Re: [2.6.14-rc1/sparc54]: BUG: soft lockup detected on CPU#0!

On Thu, 15 Sep 2005, David S. Miller wrote:
> Where did you get that Cassini driver btw? It's not upstream,
> although if it exists it should be.

Spot rolled it into the most recent Aurora kernel RPMs. He got the
driver from me; I in turn got it from Sun's web site. (It's tagged GPL.)
Neither Spot nor I care to 'own' the driver, due to both of us lacking
the hardware it supports, which makes maintenance difficult, at best.

Jima

2005-09-16 12:43:03

by Tomasz Kłoczko

[permalink] [raw]
Subject: Re: [2.6.14-rc1/sparc54]: BUG: soft lockup detected on CPU#0!

On Thu, 15 Sep 2005, David S. Miller wrote:

> From: Tomasz K?oczko <[email protected]>
> Date: Thu, 15 Sep 2005 19:40:27 +0200 (CEST)
>
>> I'm just catch series of kernel messages with soft lockup detected reports
>> (in attachment).
>> It occures during store big amout of data on NFS volume.
>>
>> As NIC I use now Sun Swift gigabit eth (cassini driver). Probably this is
>> NFS related because I'm just browse yesterday logs and simillar was also
>> on Sun Happy Meal.
>
> Interesting. Can you reproduce this with SLAB poisioning disabled?
> That debugging feature is extremely expensive, although it shouldn't
> make the CPU stop scheduling processes for more than 10 seconds.
>
> I wonder if the NFS daemon code needs to have some limits put on
> how much cpu it consumes handling requests before it gives up the
> cpu. Perhaps, it has such throttling already, I don't know.

But this not case NFS server but NSF client. During this lookups I observe
rpciod takes 90-99% time of single processor. Load is between 10 and 20.

> I'll also try to see if there can be some kind of sparc64 specific
> issue which would cause this.
>
> Where did you get that Cassini driver btw? It's not upstream,
> although if it exists it should be.

It is avalaible on Sun pages (Copyright by Sun by it is GPLed):

http://www.sun.com/download/index.jsp?cat=Hardware%20Drivers&tab=3

Few days ago I'm talk about this driver with spot on #aurora channel and
it was included during prepare next kernel package.

But continue ..

Yesterday before testing NFS I was trying utilize NIC is using only
ftp/http/rcync/scp and all works correctly (without reporting lookups).
~23:00 CET after reported series of lookups system hangs completly. After
restart to now I see in dmesg some known messages:

[root@boss]# dmesg | sort | uniq -c | sort -n | tail -n 2
87 svc: bad direction 268435456, dropping request <==========
285 hw tcp v4 csum failed

Also I have question about second (hw tcp v4 csum failed):

[root@boss]# ifconfig eth0
eth0 Link encap:Ethernet HWaddr 00:03:BA:18:41:F9
inet addr:153.19.33.230 Bcast:153.19.33.255 Mask:255.255.255.0
inet6 addr: fe80::203:baff:fe18:41f9/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:2361199 errors:0 dropped:0 overruns:0 frame:0
TX packets:4060978 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:352428791 (336.1 Mb) TX bytes:4884521951 (4658.2 Mb)
Interrupt:192

As you see in both errors is "0". Is this correct reporting broken packets
in kernel messages instead in RX errors ?

kloczek
--
-----------------------------------------------------------
*Ludzie nie maj? problem?w, tylko sobie sami je stwarzaj?*
-----------------------------------------------------------
Tomasz K?oczko, sys adm @zie.pg.gda.pl|*e-mail: [email protected]*

2005-09-16 19:57:27

by David Miller

[permalink] [raw]
Subject: Re: [2.6.14-rc1/sparc54]: BUG: soft lockup detected on CPU#0!

From: Tomasz K?oczko <[email protected]>
Date: Fri, 16 Sep 2005 14:42:56 +0200 (CEST)

> Also I have question about second (hw tcp v4 csum failed):
...
> As you see in both errors is "0". Is this correct reporting broken packets
> in kernel messages instead in RX errors ?

It is protocol level error, therefore it is not counted
in raw device level statistics.

BTW, when you get the "hw tcp v4 csum failed", the kernel ignores
the hw checksum and rechecks it using software.

2005-09-16 19:59:25

by David Miller

[permalink] [raw]
Subject: Re: [2.6.14-rc1/sparc54]: BUG: soft lockup detected on CPU#0!

From: Tomasz K?oczko <[email protected]>
Date: Fri, 16 Sep 2005 14:42:56 +0200 (CEST)

> On Thu, 15 Sep 2005, David S. Miller wrote:
>
> > I wonder if the NFS daemon code needs to have some limits put on
> > how much cpu it consumes handling requests before it gives up the
> > cpu. Perhaps, it has such throttling already, I don't know.
>
> But this not case NFS server but NSF client. During this lookups I observe
> rpciod takes 90-99% time of single processor. Load is between 10 and 20.

After studying some code yesterday, NFS client has the same
exact problem as NFS daemon, namely that if you give it enough
work it will never give up the cpu so that other tasks can
be scheduled.

This is a serious bug, and can easily trigger those soft lockup
messages. Based upon some other reports seen on linux-kernel
and elsewhere, things like the raid1 kernel daemon have a similar
issue as well.

I think you can help things _enormusly_ by turning off SLAB
poisioning, as I said that debugging feature is _VERY_ expensive.

2005-09-17 17:49:19

by Tomasz Kłoczko

[permalink] [raw]
Subject: Re: [2.6.14-rc1/sparc54]: BUG: soft lockup detected on CPU#0!

On Sat, 17 Sep 2005, Tomasz K?oczko wrote:
[..]

Something new.
I'm just finish rewrite backup procedure from dumping to NFS volume to
using piped dump|ssh|dd.
During this happens:

PSYCHO0: Uncorrectable Error, primary error type[DMA Write]
PSYCHO0: bytemask[0000] dword_offset[7] UPA_MID[1f] was_block(1)
PSYCHO0: UE AFAR [000000005c8bc040]
PSYCHO0: UE Secondary errors [(DMA Write)]
PSYCHO0: IOMMU Error, type[Protection Error]

kloczek
--
-----------------------------------------------------------
*Ludzie nie maj? problem?w, tylko sobie sami je stwarzaj?*
-----------------------------------------------------------
Tomasz K?oczko, sys adm @zie.pg.gda.pl|*e-mail: [email protected]*

2005-09-17 15:22:48

by Tomasz Kłoczko

[permalink] [raw]
Subject: Re: [2.6.14-rc1/sparc54]: BUG: soft lockup detected on CPU#0!

On Fri, 16 Sep 2005, David S. Miller wrote:

> From: Tomasz K?oczko <[email protected]>
> Date: Fri, 16 Sep 2005 14:42:56 +0200 (CEST)
>
>> On Thu, 15 Sep 2005, David S. Miller wrote:
>>
>>> I wonder if the NFS daemon code needs to have some limits put on
>>> how much cpu it consumes handling requests before it gives up the
>>> cpu. Perhaps, it has such throttling already, I don't know.
>>
>> But this not case NFS server but NSF client. During this lookups I observe
>> rpciod takes 90-99% time of single processor. Load is between 10 and 20.
>
> After studying some code yesterday, NFS client has the same
> exact problem as NFS daemon, namely that if you give it enough
> work it will never give up the cpu so that other tasks can
> be scheduled.
>
> This is a serious bug, and can easily trigger those soft lockup
> messages. Based upon some other reports seen on linux-kernel
> and elsewhere, things like the raid1 kernel daemon have a similar
> issue as well.
>
> I think you can help things _enormusly_ by turning off SLAB
> poisioning, as I said that debugging feature is _VERY_ expensive.

I'll try.

Now I have next call trace catched during reboot system after perform full
backup on NFS volume (during this as previouse was tons of lookups). This
case was during umounting NFS volumes (on system shutdown). Below call
trace may have some typos because I read tem from screen and write on
paper:

Badness in interruptible_sleep_on_timeout at kernel/shed.c: 3403 (Not tainted)
Call trace:
[00000000021c3cf8] nfs_kill_super+0x8c/0xbc [nfs]
[00000000004962d4] deactivate_super+0x58/0x7c
[00000000004ac5cc] sys_umount+0x2f8/0x308
[0000000000410fd4] linux_sparc_syscall32_0x34/0x40
[0000000000011ebc] 0x11ebc

Above may be is some resoult corrupting something during soft lookupus but
if not probably may be usefull on finding bugs.

kloczek
--
-----------------------------------------------------------
*Ludzie nie maj? problem?w, tylko sobie sami je stwarzaj?*
-----------------------------------------------------------
Tomasz K?oczko, sys adm @zie.pg.gda.pl|*e-mail: [email protected]*

2005-09-18 06:45:44

by David Miller

[permalink] [raw]
Subject: Re: [2.6.14-rc1/sparc54]: BUG: soft lockup detected on CPU#0!

From: Tomasz K?oczko <[email protected]>
Date: Sat, 17 Sep 2005 19:49:17 +0200 (CEST)

> Something new.
> I'm just finish rewrite backup procedure from dumping to NFS volume to
> using piped dump|ssh|dd.
> During this happens:
>
> PSYCHO0: Uncorrectable Error, primary error type[DMA Write]
> PSYCHO0: bytemask[0000] dword_offset[7] UPA_MID[1f] was_block(1)
> PSYCHO0: UE AFAR [000000005c8bc040]
> PSYCHO0: UE Secondary errors [(DMA Write)]
> PSYCHO0: IOMMU Error, type[Protection Error]

A driver unmapped a DMA area while the device is still
accessing it. If you're still using the cassini card,
it's driver is the most likely suspect.