2003-01-23 23:21:29

by Daniel Khan

Subject: 2.4.20 CPU lockup - Now with OOPS message

Hello List,

I reported frequent system lockups earlier today.
After some experimenting (I don't know anything about kernel
debugging; thanks to Mark Hahn for the tips),
I found a way to reproduce the lockup and capture the OOPS.
Dan Kegel told me after my last post that this list can only support
kernels built from the kernel.org sources.
For this test I used the 2.4.20-2.25smp kernel from RawHide; I
did not build a kernel from kernel.org.
O.K., here's the deal:
The OOPS below looks like Spanish to me, but for the hackers it could
be very clear.
So if you think this is a common kernel issue, please help. Otherwise I'll
report it to RedHat immediately.

Scenario:
2.4.20-2.25smp from RawHide

Running rsync from the crashing host _to_ another host over a 1000 Mbit 3com
(TG3) NIC.
The rsynced files include some large files of about 1.5 GB.
Heartbeat is running.

The OOPS is below.
Please CC [email protected] if you want to help.

Thanks a lot

Daniel Khan

<------------------------CUT---------------------------->
NMI Watchdog detected LOCKUP on CPU0, eip c02499ac, registers:
via686a eeprom lm80 i2c-proc i2c-isa i2c-viapro i2c-core tg3 eepro100 mii
ipt_LOG ipt_limit ipt_state ipt_REJECT iptable_nat ip_cona
CPU: 0
EIP: 0060:[<c02499ac>] Not tainted
EFLAGS: 00000086

EIP is at .text.lock.tcp_ipv4 [kernel] 0x182 (2.4.20-2.25smp)
eax: 00000001 ebx: d400010a ecx: 00000000 edx: f78837d8
esi: f6f22ae0 edi: c3d3ad40 ebp: f74939f4 esp: f1335d8c
ds: 0068 es: 0068 ss: 0068
Process rsync (pid: 3151, stackpage=f1335000)
Stack: c3d3ad40 f3121f38 00000001 f1335e28 00000000 03ff0202 00000004 000003ff
00000000 00000006 c3d3ad40 f74939e0 c022d67e c3d3ad40 f1335e28 c3d5a000
00000000 00000006 00000000 00000001 00000000 c022d530 c021ce67 c3d3ad40
Call Trace: [<c022d67e>] ip_local_deliver_finish [kernel] 0x14e (0xf1335dbc))
[<c022d530>] ip_local_deliver_finish [kernel] 0x0 (0xf1335de0))
[<c021ce67>] nf_hook_slow [kernel] 0x107 (0xf1335de4))
[<c022d530>] ip_local_deliver_finish [kernel] 0x0 (0xf1335e00))
[<c022d2b3>] ip_local_deliver [kernel] 0x53 (0xf1335e1c))
[<c022d530>] ip_local_deliver_finish [kernel] 0x0 (0xf1335e34))
[<c022d8b9>] ip_rcv_finish [kernel] 0x219 (0xf1335e38))
[<c022d6a0>] ip_rcv_finish [kernel] 0x0 (0xf1335e5c))
[<c022d6a0>] ip_rcv_finish [kernel] 0x0 (0xf1335e6c))
[<c021ce67>] nf_hook_slow [kernel] 0x107 (0xf1335e70))
[<c022d6a0>] ip_rcv_finish [kernel] 0x0 (0xf1335e8c))
[<c022d480>] ip_rcv [kernel] 0x1a0 (0xf1335ea8))
[<c022d6a0>] ip_rcv_finish [kernel] 0x0 (0xf1335ec0))
[<c021566e>] netif_receive_skb [kernel] 0x14e (0xf1335ed8))
[<f89d2c7c>] tg3_rx [tg3] 0x27c (0xf1335ef8))
[<f89d2e71>] tg3_poll [tg3] 0x81 (0xf1335f38))
[<c0215917>] net_rx_action [kernel] 0xa7 (0xf1335f58))
[<c01289f9>] do_softirq [kernel] 0xd9 (0xf1335f80))
[<c010b81b>] do_IRQ [kernel] 0xfb (0xf1335f9c))
[<c010e7c8>] call_do_IRQ [kernel] 0x5 (0xf1335fc0))


Code: 7e f8 e9 68 e5 ff ff e8 2c ed eb ff e9 c3 ee ff ff e8 22 ed
console shuts up ...
NMI Watchdog detected LOCKUP on CPU1, eip f89d9f3b, registers:
<------------------------CUT---------------------------->
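
For readers who only want the code path out of a trace like the one above, here is a minimal sketch in shell. It prints the module and symbol of each "[<address>] symbol [module]" frame, in the order listed, and collapses adjacent repeats. The function name trace_symbols is made up for illustration, and the pattern assumes nothing beyond the frame layout visible in this OOPS.

```shell
#!/bin/sh
# trace_symbols: read an OOPS from stdin and print "module symbol" for every
# call-trace frame of the form "[<address>] symbol [module] offset ...".
# Lines without such a frame (registers, stack words, wrapped "(0x...))"
# remainders) are skipped. Adjacent identical frames are printed once.
trace_symbols() {
    sed -n 's/.*\[<[0-9a-f]*>] \([A-Za-z0-9_.]\{1,\}\) \[\([A-Za-z0-9_]\{1,\}\)].*/\2 \1/p' |
        uniq
}
```

Usage, assuming the OOPS text was saved to a file called oops.txt: trace_symbols < oops.txt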


2003-01-23 23:57:20

by GrandMasterLee

Subject: Re: 2.4.20 CPU lockup - Now with OOPS message

Can I ask how you reproduced this? I've got several systems with TG3's
and they only get lockups during network backups.


On Thu, 2003-01-23 at 17:30, Daniel Khan wrote:
> [..]
--
GrandMasterLee <[email protected]>

2003-01-24 00:02:57

by Daniel Khan

Subject: AW: 2.4.20 CPU lockup - Now with OOPS message

Hi,

> > I reported frequent system lockups earlier today.
> > After some experimenting (I don't know anything about kernel
> > debugging; thanks to Mark Hahn for the tips),
> > I found a way to reproduce the lockup and capture the OOPS.
[..]

> Can I ask how you reproduced this? I've got several systems with TG3's
> and they only get lockups during network backups.

I open an httpd session on the host that has the big logfiles, so that they change.
Then I start rsync to sync the logfiles and other stuff to the backup host.

Sometimes I have to retry 2-3 times, but it crashes very reliably.
It's much the same as the network backups you mention.
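
The "retry 2-3 times" step could be scripted roughly like this. It is only a sketch: run_attempts is a hypothetical helper, not something used in the actual test, and it simply runs whatever command it is given a fixed number of times, printing the attempt number first so the last line on the console shows which run triggered the lockup.

```shell
#!/bin/sh
# run_attempts MAX CMD [ARGS...]
# Run CMD with ARGS up to MAX times, announcing each attempt beforehand.
# The exit status of CMD is deliberately ignored: the goal is to keep
# generating load, not to stop on the first rsync error.
run_attempts() {
    max=$1
    shift
    i=1
    while [ "$i" -le "$max" ]; do
        echo "attempt $i"
        "$@"
        i=$((i + 1))
    done
}
```

Usage, with SRC and DEST as placeholders: run_attempts 3 rsync -a SRC DEST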

Daniel Khan

2003-01-24 01:33:03

by GrandMasterLee

Subject: Re: AW: 2.4.20 CPU lockup - Now with OOPS message

On Thu, 2003-01-23 at 18:11, Daniel Khan wrote:
> Hi,
>
> > > I reported frequent system lockups earlier today.
> > > After some experimenting (I don't know anything about kernel
> > > debugging; thanks to Mark Hahn for the tips),
> > > I found a way to reproduce the lockup and capture the OOPS.
> [..]
>
> > Can I ask how you reproduced this? I've got several systems with TG3's
> > and they only get lockups during network backups.
>
> httpd session on the host which has big logfiles to get them changed.
> Starting rsync to sync the logfiles and other stuff to the backup host.
>
> Sometimes I have to retry 2-3 times, but it crashes very reliably.
> It's much the same as the network backups you mention.

We use rsync to do our backups. I've been getting lines like this in my
backup server's kernel log and dmesg:

TCP: Treason uncloaked! Peer 10.1.1.40:37859/873 shrinks window 2430745930:2430747378. Repaired.
TCP: Treason uncloaked! Peer 10.1.1.40:37859/873 shrinks window 2430745930:2430747378. Repaired.


I was able to reproduce this error in a test setup, but not
the crashes. I'm wondering whether I should just start up a lot of
rsync instances and see what happens.

Any details on the method, file sizes, etc. you used to reproduce this
would be very helpful. TIA
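
To see which peers the "Treason uncloaked" messages above come from and how often, something like the following could be run over the kernel log. This is a sketch only: count_treason_peers is a made-up name, and it assumes each message sits on a single line in the log in the same format as the two lines quoted here.

```shell
#!/bin/sh
# count_treason_peers: read kernel log text from stdin and print
# "peer-address count" for every peer named in a
# "TCP: Treason uncloaked! Peer ADDR:PORT/PORT ..." message.
count_treason_peers() {
    grep 'Treason uncloaked! Peer ' |
        sed 's/.*Treason uncloaked! Peer \([0-9.]*\):.*/\1/' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

Usage: dmesg | count_treason_peers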

> Daniel Khan

2003-01-24 02:24:19

by Daniel Khan

Subject: AW: AW: 2.4.20 CPU lockup - Now with OOPS message

Hi,

[..]
> I was able to successfully reproduce this error in a test setup, but not
> the crashes. I'm curious if maybe I just start up too many instances of
> rsync and see what happens.
>
> Any particular method or size of files, etc, in reproducing this would
> be greatly beneficial. TIA

Here is the command:
/usr/local/bin/nice-rsync --rsync-path=/usr/local/bin/nice-rsync \
    --whole-file -auq --delete /var/log/httpd/ 10.1.0.212:/var/log/httpd

/usr/local/bin/nice-rsync :

#!/bin/sh
# Run rsync at the lowest CPU priority, passing all arguments through unchanged.
exec nice -n 19 rsync "$@"
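
A side note on wrappers of this kind: how the arguments are forwarded matters once a path contains spaces. The sketch below is illustrative only (count_args, forward_star and forward_at are hypothetical helpers, not part of the setup described here). It shows that unquoted $* re-splits arguments on whitespace, while "$@" passes them through as they were given. For the rsync command above, which has no spaces in any argument, both forms behave the same.

```shell
#!/bin/sh
# count_args prints how many arguments it received.
count_args() { echo "$#"; }

# Forward with unquoted $*: each argument is split again on whitespace.
forward_star() { count_args $*; }

# Forward with "$@": each argument is passed through exactly as received.
forward_at() { count_args "$@"; }
```

Calling forward_star "a b" c reports three arguments, while forward_at "a b" c reports two.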

Best regards

Daniel Khan

2003-01-24 06:02:45

by GrandMasterLee

Subject: Re: AW: AW: 2.4.20 CPU lockup - Now with OOPS message

On Thu, 2003-01-23 at 20:33, Daniel Khan wrote:
> Hi,
>
> [..]
> > I was able to successfully reproduce this error in a test setup, but not
> > the crashes. I'm curious if maybe I just start up too many instances of
> > rsync and see what happens.
> >
> > Any particular method or size of files, etc, in reproducing this would
> > be greatly beneficial. TIA
>
> Here is the command
> /usr/local/bin/nice-rsync --rsync-path=/usr/local/bin/nice-rsync \
>     --whole-file -auq --delete /var/log/httpd/ 10.1.0.212:/var/log/httpd
>
> /usr/local/bin/nice-rsync :
>
> #!/bin/sh
> exec nice -n 19 rsync $*
>
> Best regards
>
> Daniel Khan


Kewl. Thanks, I will try this out tomorrow and let you know.