2005-09-30 03:36:12

by Hendrik Visage

Subject: Starfire (Adaptec) kernel 2.6.13+ panics on AMD64 NFS server

Hi there,

Traced a panicking kernel to what appears to be the starfire changes in
2.6.13 up to 2.6.14_rc2.

During a relatively heavy NFS read (client is a 32-bit 2.6.13.1 P2-350) with
rsync (ripped CD archive) I get kernel panics (Aieee, interrupt handler
lost or something... okay, I also need a way to capture those errors, as
it's a hard panic and needs the reset button :()

I've isolated the problem to going from 2.6.12.5/2.6.12-gentoo-r10 (both
working) to 2.6.13/2.6.13-gentoo/2.6.14_rc2 while the NFS is served
through the Adaptec/starfire. Furthermore, the onboard forcedeth (nvidia)
is serving the data without hassles (at least on 2.6.14_rc2).

Using gcc 3.4.4

--
Hendrik Visage


2005-09-30 04:17:26

by Andrew Morton

Subject: Re: Starfire (Adaptec) kernel 2.6.13+ panics on AMD64 NFS server

Hendrik Visage <[email protected]> wrote:
>
> Traced a panicking kernel to what appears to be the starfire changes in
> 2.6.13 up to 2.6.14_rc2.
>
> During a relatively heavy NFS read (client is a 32-bit 2.6.13.1 P2-350) with
> rsync (ripped CD archive) I get kernel panics (Aieee, interrupt handler
> lost or something... okay, I also need a way to capture those errors, as
> it's a hard panic and needs the reset button :()

A serial console is useful. Often people will take a digital photo of the
screen, which works OK. But we do need that info somehow, please.

> I've isolated the problem to going from 2.6.12.5/2.6.12-gentoo-r10 (both
> working) to 2.6.13/2.6.13-gentoo/2.6.14_rc2 while the NFS is served
> through the Adaptec/starfire. Furthermore, the onboard forcedeth (nvidia)
> is serving the data without hassles (at least on 2.6.14_rc2).

The starfire changes in 2.6.12->2.6.13 look fairly innocuous. Need that
trace, please.

2005-09-30 08:14:10

by Hendrik Visage

Subject: Re: Starfire (Adaptec) kernel 2.6.13+ panics on AMD64 NFS server

On 9/30/05, Andrew Morton <[email protected]> wrote:
> A serial console is useful. Often people will take a digital photo of the
> screen, which works OK. But we do need that info somehow, please.

Busy getting that (and/or lkcd|kdb) set up.

> The starfire changes in 2.6.12->2.6.13 look fairly innocuous. Need that
> trace, please.

Will do, but perhaps check for some 64-bit uncleanness in the scatter-gather
code that got enabled in 2.6.13 because of the GPL'd Adaptec firmware, as I
recall some skb-related changes.

--
Hendrik Visage

2005-09-30 16:01:34

by Hendrik Visage

Subject: Re: Starfire (Adaptec) kernel 2.6.13+ panics on AMD64 NFS server

On 9/30/05, Andrew Morton <[email protected]> wrote:

> The starfire changes in 2.6.12->2.6.13 look fairly innocuous. Need that
> trace, please.

See attached :)

I'll also do a check without PREEMPT, as I've noticed it in the first
line of the "problem" :(

--
Hendrik Visage


Attachments:
(No filename) (269.00 B)
crash2.minicom (4.09 kB)

2005-09-30 16:46:09

by Ion Badulescu

Subject: Re: Starfire (Adaptec) kernel 2.6.13+ panics on AMD64 NFS server

Hi Hendrik,

On Fri, 30 Sep 2005, Hendrik Visage wrote:

> Will do, but perhaps check for some 64-bit uncleanness in the scatter-gather
> code that got enabled in 2.6.13 because of the GPL'd Adaptec firmware, as I
> recall some skb-related changes.

There is an easy way to disable the firmware and pretty much all the
changes that went into 2.6.13: load the starfire module with enable_hw_cksum=0.
If you can easily reproduce this problem, try doing the above and see if
you can still hit it. Maybe it's a newly introduced problem in the upper
layer's SG--your other network driver simply isn't using SG so it's
not affected.

It's very suspicious that the bug would be in skb_checksum_help(), since
the starfire driver doesn't do anything with the skb before handing it
over to skb_checksum_help(). It would mean that the upper layer handed an
invalid skb to the driver, or that we have some random memory corruption
somewhere.

Thanks,
Ion

--
It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.

2005-09-30 17:41:25

by Andrew Morton

Subject: Re: Starfire (Adaptec) kernel 2.6.13+ panics on AMD64 NFS server

Hendrik Visage <[email protected]> wrote:
>
> On 9/30/05, Andrew Morton <[email protected]> wrote:
>
> > The starfire changes in 2.6.12->2.6.13 look fairly innocuous. Need that
> > trace, please.
>
> See attached :)
>

It helps, thanks.


> ----------- [cut here ] --------- [please bite here ] ---------
> Kernel BUG at net/core/dev.c:1099
> invalid operand: 0000 [1] PREEMPT
> CPU 0
> Modules linked in: nvidia nfsd exportfs lockd sunrpc rfcomm l2cap hci_usb bluetooth starfire mii snd_ac97_bus soundcore snd_page_alloc forcedeth i2c_nforce2 dm_mirror dm_mod sbp2 ohci1394 ieee1394 ohci_hcd uhci_hcd usb_storage usbhid ehci_hcd usbcore
> Pid: 11252, comm: nfsd Tainted: P 2.6.14-rc2 #3
> RIP: 0010:[<ffffffff802cc7ed>] <ffffffff802cc7ed>{skb_checksum_help+157}
> RSP: 0000:ffff81003a0bd998 EFLAGS: 00010246
> RAX: ffff81003ff01624 RBX: ffff81003ca7f180 RCX: 00000000b7e42194
> RDX: 00000000b7e42194 RSI: ffff81003ff01624 RDI: ffff81003b026080
> RBP: ffff81003a0bd9b8 R08: 0000000000000000 R09: 0000000000000004
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> R13: 0000000000000000 R14: ffff81003ca7f180 R15: ffff81003d462218
> FS: 00002aaaaade6ae0(0000) GS:ffffffff804fe800(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00002aaaaaac2000 CR3: 000000003d5a2000 CR4: 00000000000006e0
> Process nfsd (pid: 11252, threadinfo ffff81003a0bc000, task ffff81003e0ed0c0)
> Stack: ffffffff804cd720 ffff81003d462000 ffff81003d4623e0 ffff81003ca7f180
> ffff81003a0bda08 ffffffff88104944 ffff81003d462218 000000013a2a8600
> ffff81003d462000 ffff81003d462000
> Call Trace:<ffffffff88104944>{:starfire:start_tx+164} <ffffffff802db0fc>{qdisc_restart+268}
> <ffffffff802ccad0>{dev_queue_xmit+288} <ffffffff802d29b0>{neigh_resolve_output+672}
> <ffffffff802ebb27>{ip_finish_output+455} <ffffffff802ec5ff>{ip_fragment+863}
> <ffffffff802eb960>{ip_finish_output+0} <ffffffff802eca6c>{ip_output+108}


Yep, there's something wrong with the skb which starfire fed into
skb_checksum_help().

	offset = skb->tail - skb->h.raw;
	if (offset <= 0)
		BUG();

And that's a post-2.6.12 driver change. You can probably work around
it by deleting the #define ZEROCOPY line.
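
To make the failing condition concrete, the check quoted above can be
modelled in user space. This is only a minimal sketch, not kernel code:
struct mock_skb and checksum_offset_is_sane() are hypothetical names, and
only the two pointers that the check reads are represented.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two sk_buff pointers the check reads. */
struct mock_skb {
	const unsigned char *h_raw; /* transport header start (skb->h.raw) */
	const unsigned char *tail;  /* end of the linear data (skb->tail) */
};

/*
 * Mirrors the quoted test: the transport header must start strictly
 * before the end of the linear data, otherwise the kernel hits BUG().
 * Both pointers are assumed to point into the same buffer.
 */
static bool checksum_offset_is_sane(const struct mock_skb *skb)
{
	ptrdiff_t offset = skb->tail - skb->h_raw;

	return offset > 0;
}
```

A packet whose transport header pointer sits at or beyond tail fails this
test, which corresponds to the BUG at net/core/dev.c:1099 in the trace.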

2005-09-30 20:11:01

by Hendrik Visage

Subject: Re: Starfire (Adaptec) kernel 2.6.13+ panics on AMD64 NFS server

On 9/30/05, Andrew Morton <[email protected]> wrote:
> > ----------- [cut here ] --------- [please bite here ] ---------
> > Kernel BUG at net/core/dev.c:1099
> > invalid operand: 0000 [1] PREEMPT
>
> yep, there's something wrong with the skb which starfire fed into
> skb_checksum_help().
>
<snip>
>
> And that's a post-2.6.12 driver change. You can probably work around
> it by deleting the #define ZEROCOPY line.

:)
In any case, here is a non-PREEMPT traceback. What makes this one
interesting is that in the PREEMPT case I had to push the NFS output to
get the panic, but the non-PREEMPT case (attached) sort of just happened,
i.e. when the clients just checked on the server's status :(


--
Hendrik Visage


Attachments:
(No filename) (704.00 B)
non-prempt (4.01 kB)

2005-09-30 20:55:30

by Ion Badulescu

Subject: Re: Starfire (Adaptec) kernel 2.6.13+ panics on AMD64 NFS server

On Fri, 30 Sep 2005, Hendrik Visage wrote:

> In any case, here is a non-PREEMPT traceback.

Same trace, pretty much like I expected. Still, starfire must be getting
a bad skb from the upper layers, because it gets passed __unmodified__ to
skb_checksum_help().

Either that, or skb_checksum_help() itself got broken at some point, at
least on 64-bit platforms.

I'll try to reproduce it over the weekend (assuming I can get an x86_64
box set up, with a starfire inside) and see where the problem is.

> What makes this one interesting, is that in the preempt case, I had to
> push the NFS output to get the panic, but the non-preempt case attached,
> sorta just happened, ie. when the clients just checked on the server's
> status :(

I'm actually surprised you got your panic from nfsd. skb_checksum_help()
is called only when one of the fragments has length == 1, so the easiest
way to hit it is to slowly type something into a telnet session.
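
The trigger described above can be sketched as a scan over the lengths of
a packet's buffers. This is a user-space illustration under stated
assumptions, not the driver's code: has_one_byte_fragment() is a
hypothetical helper, and the lengths are passed as a plain array rather
than read from an skb.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns true if any buffer making up the packet
 * is exactly one byte long, which is the case in which the driver is
 * said to fall back to skb_checksum_help().
 */
static bool has_one_byte_fragment(const size_t *frag_len, size_t nr_frags)
{
	for (size_t i = 0; i < nr_frags; i++)
		if (frag_len[i] == 1)
			return true;
	return false;
}
```

A single keystroke in a telnet session would plausibly yield a one-byte
payload buffer, while bulk NFS reads normally would not, which is why the
nfsd trace is surprising.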

Thanks,
Ion

--
It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.

2005-09-30 22:39:47

by Herbert Xu

Subject: Re: Starfire (Adaptec) kernel 2.6.13+ panics on AMD64 NFS server

On Fri, Sep 30, 2005 at 08:10:59PM +0000, Hendrik Visage wrote:
>
> In any case, here is a non-PREEMPT traceback. What makes this one
> interesting is that in the PREEMPT case I had to push the NFS output to
> get the panic, but the non-PREEMPT case (attached) sort of just happened,
> i.e. when the clients just checked on the server's status :(

You must never call skb_checksum_help unless the packet is meant to
be checksummed by the hardware. So starfire is the guilty party here.

This patch makes it do the check and also check for errors from
skb_checksum_help.

Signed-off-by: Herbert Xu <[email protected]>
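
The attached patch is not reproduced in this archive. As a rough
user-space sketch of the two rules stated above (call the helper only for
packets awaiting a hardware checksum, and check its return value), with
every mock_* and MOCK_* name hypothetical and the drop-on-error behaviour
an assumption rather than something taken from the patch:

```c
#include <stdbool.h>

/* Hypothetical checksum states, loosely named after skb->ip_summed values. */
enum mock_ip_summed { MOCK_CHECKSUM_NONE, MOCK_CHECKSUM_HW };

enum mock_tx_result { MOCK_TX_SENT, MOCK_TX_DROPPED };

struct mock_packet {
	enum mock_ip_summed ip_summed;
	bool has_bad_length; /* a buffer the hardware cannot checksum */
	bool helper_fails;   /* forces the mock helper to report an error */
	int helper_calls;    /* how often the helper was invoked */
};

/* Stand-in for skb_checksum_help(): 0 on success, nonzero on error. */
static int mock_checksum_help(struct mock_packet *pkt)
{
	pkt->helper_calls++;
	if (pkt->helper_fails)
		return -1;
	pkt->ip_summed = MOCK_CHECKSUM_NONE; /* checksum now done in software */
	return 0;
}

static enum mock_tx_result mock_start_tx(struct mock_packet *pkt)
{
	/* Only packets meant to be checksummed by hardware reach the helper. */
	if (pkt->ip_summed == MOCK_CHECKSUM_HW && pkt->has_bad_length) {
		/* Assumption: a helper failure means the packet is dropped. */
		if (mock_checksum_help(pkt) != 0)
			return MOCK_TX_DROPPED;
	}
	return MOCK_TX_SENT;
}
```

With this guard, a packet that was never marked for hardware checksumming
is transmitted without touching the helper at all, so the BUG() inside it
cannot be reached for such packets.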

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Attachments:
(No filename) (843.00 B)
p (654.00 B)

2005-10-01 19:21:27

by Hendrik Visage

Subject: Re: Starfire (Adaptec) kernel 2.6.13+ panics on AMD64 NFS server

On 10/1/05, Herbert Xu <[email protected]> wrote:
> You must never call skb_checksum_help unless the packet is meant to
> be checksummed by the hardware. So starfire is the guilty party here.
>
> This patch makes it do the check and also check for errors from
> skb_checksum_help.
>
> Signed-off-by: Herbert Xu <[email protected]>

Thanx Herbert,
at least on 2.6.14_rc2 the patch appears to work for my stress test :)

--
Hendrik Visage