2008-06-17 16:17:37

by Andrew Price

Subject: BUG: mac80211: Some connections hanging in 2.6.26-rc6...

In the latest Linus kernel some kinds of connections over my wireless
link have been hanging. By this I mean, for example:

$ ssh <somewhere>
[Wait a while]
Connection reset by <somewhere>

This also affects some HTTP requests such as loading Slashdot but not
others (perhaps small ones?) such as Twitter API HTTP requests. It does
not seem to affect connections to other machines on my LAN, only ones
over the internet.

I've tried tcpdumping to get more info but when I run tcpdump the
connections work again (if a tiny bit slower).

I've bisected this bug down to 608961a5eca8d3c6bd07172febc27b5559408c5d
"mac80211: Use skb_header_cloned() on TX path." and naively reverting
this change seems to make the bug go away.

I'm using the rt2500pci module and wpa_supplicant. My setup is:

Laptop -- Wireless AP -- Router -- Cable Modem -- The Internet

and my card is a "RaLink RT2500 802.11g Cardbus/mini-PCI (rev 01)".

Please let me know if you need any more info or if there's a more
focussed test case I could try.

--
Andy Price



2008-06-19 06:50:22

by Johannes Berg

Subject: Re: BUG: mac80211: Some connections hanging in 2.6.26-rc6...

On Thu, 2008-06-19 at 02:19 +0100, Andrew Price wrote:
> On 18/06/08 16:42, Johannes Berg wrote:
> > Hence, that commit is now reverted, see
> > http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=commit;h=3a5be7d4b079f3a9ce1e8ce4a93ba15ae6d00111
>
> I've updated my kernel to Linus' latest since he merged Dave's revert
> and I'm pleased to say the bug is now gone. Thanks for your help (and
> thanks to Dave too).

Well, thanks for spending the time to bisect, the bug had been reported
a number of times but we never knew what was up until that tipped off
Dave. :)

johannes



2008-06-18 15:51:40

by Johannes Berg

Subject: Re: BUG: mac80211: Some connections hanging in 2.6.26-rc6...

On Wed, 2008-06-18 at 17:42 +0200, Johannes Berg wrote:
> On Wed, 2008-06-18 at 15:55 +0100, Andrew Price wrote:
> > On 18/06/08 08:22, Johannes Berg wrote:
> > >> This also affects some HTTP requests such as loading Slashdot but not
> > >> others (perhaps small ones?) such as Twitter API HTTP requests. It does
> > >> not seem to affect connections to other machines on my LAN, only ones
> > >> over the internet.
> > >>
> > >> I've tried tcpdumping to get more info but when I run tcpdump the
> > >> connections work again (if a tiny bit slower).
> > >
> > > Can you try tcpdump with the -p flag? promisc mode might affect things.
> >
> > I tried using -p and it made the bug go away again. However, I have some
> > packet captures from running tcpdump on the router while the bug was
> > occurring and from loading them into Wireshark it seems there are some
> > (suspected) retransmissions with bad checksums. I've posted my packet
> > captures here:
> >
> > http://sucs.org/~welshbyte/lw/tcpdump_bad0.pcap
> > http://sucs.org/~welshbyte/lw/tcpdump_bad1.pcap
> > http://sucs.org/~welshbyte/lw/tcpdump_good0.pcap
> >
> > The two 'bad' ones were captured on the router while the bug was
> > occurring. The 'good' one is for comparison and was captured on the
> > router like the others but while running tcpdump on my laptop made the
> > bug go away (I hope that makes sense).
> >
> > In all cases, the test was to ssh to a remote server. The two bad cases
> > were closed before the password prompt was reached and the good case was
> > ^C'd at the password prompt.
> >
> > I hope this helps,
>
> Thanks. Unfortunately, it's very hard to tell what is causing the bug,

No, that was worded wrong, I mean, we know what is causing the bug and
we know what the bug is, but due to encryption being used it's hard to
confirm from the tcpdump you posted.

johannes



2008-06-18 14:55:45

by Andrew Price

Subject: Re: BUG: mac80211: Some connections hanging in 2.6.26-rc6...

On 18/06/08 08:22, Johannes Berg wrote:
>> This also affects some HTTP requests such as loading Slashdot but not
>> others (perhaps small ones?) such as Twitter API HTTP requests. It does
>> not seem to affect connections to other machines on my LAN, only ones
>> over the internet.
>>
>> I've tried tcpdumping to get more info but when I run tcpdump the
>> connections work again (if a tiny bit slower).
>
> Can you try tcpdump with the -p flag? promisc mode might affect things.

I tried using -p and it made the bug go away again. However, I have some
packet captures from running tcpdump on the router while the bug was
occurring and from loading them into Wireshark it seems there are some
(suspected) retransmissions with bad checksums. I've posted my packet
captures here:

http://sucs.org/~welshbyte/lw/tcpdump_bad0.pcap
http://sucs.org/~welshbyte/lw/tcpdump_bad1.pcap
http://sucs.org/~welshbyte/lw/tcpdump_good0.pcap

The two 'bad' ones were captured on the router while the bug was
occurring. The 'good' one is for comparison and was captured on the
router like the others but while running tcpdump on my laptop made the
bug go away (I hope that makes sense).

In all cases, the test was to ssh to a remote server. The two bad cases
were closed before the password prompt was reached and the good case was
^C'd at the password prompt.
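
As a rough illustration of what a "bad checksum" means here, the sketch
below computes the 16-bit ones' complement Internet checksum (RFC 1071)
that TCP uses. It is a simplified example, not a reproduction of the
captures: the function names and the sample bytes are made up for
illustration, and the TCP pseudo-header that a real TCP checksum also
covers is left out.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def checksum_ok(data: bytes, checksum: int) -> bool:
    """True if `checksum` is valid for even-length `data`."""
    return internet_checksum(data + checksum.to_bytes(2, "big")) == 0


if __name__ == "__main__":
    original = bytes.fromhex("0001f203f4f5f6f7")  # sample bytes, not real traffic
    csum = internet_checksum(original)
    mangled = bytes([original[0] ^ 0xFF]) + original[1:]
    print(checksum_ok(original, csum), checksum_ok(mangled, csum))
```

If the payload of a segment changes after its checksum was computed, the
receiver (or Wireshark) sees exactly this kind of mismatch.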

I hope this helps,

--
Andy Price


2008-06-19 01:19:26

by Andrew Price

Subject: Re: BUG: mac80211: Some connections hanging in 2.6.26-rc6...

On 18/06/08 16:42, Johannes Berg wrote:
> Hence, that commit is now reverted, see
> http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=commit;h=3a5be7d4b079f3a9ce1e8ce4a93ba15ae6d00111

I've updated my kernel to Linus' latest since he merged Dave's revert
and I'm pleased to say the bug is now gone. Thanks for your help (and
thanks to Dave too).

--
Andy Price


2008-06-18 07:23:07

by Johannes Berg

Subject: Re: BUG: mac80211: Some connections hanging in 2.6.26-rc6...


> This also affects some HTTP requests such as loading Slashdot but not
> others (perhaps small ones?) such as Twitter API HTTP requests. It does
> not seem to affect connections to other machines on my LAN, only ones
> over the internet.
>
> I've tried tcpdumping to get more info but when I run tcpdump the
> connections work again (if a tiny bit slower).

Can you try tcpdump with the -p flag? promisc mode might affect things.

> I've bisected this bug down to 608961a5eca8d3c6bd07172febc27b5559408c5d
> "mac80211: Use skb_header_cloned() on TX path." and naively reverting
> this change seems to make the bug go away.

That's odd, and I can't find an explanation for it off-hand.

johannes



2008-06-18 15:43:20

by Johannes Berg

Subject: Re: BUG: mac80211: Some connections hanging in 2.6.26-rc6...

On Wed, 2008-06-18 at 15:55 +0100, Andrew Price wrote:
> On 18/06/08 08:22, Johannes Berg wrote:
> >> This also affects some HTTP requests such as loading Slashdot but not
> >> others (perhaps small ones?) such as Twitter API HTTP requests. It does
> >> not seem to affect connections to other machines on my LAN, only ones
> >> over the internet.
> >>
> >> I've tried tcpdumping to get more info but when I run tcpdump the
> >> connections work again (if a tiny bit slower).
> >
> > Can you try tcpdump with the -p flag? promisc mode might affect things.
>
> I tried using -p and it made the bug go away again. However, I have some
> packet captures from running tcpdump on the router while the bug was
> occurring and from loading them into Wireshark it seems there are some
> (suspected) retransmissions with bad checksums. I've posted my packet
> captures here:
>
> http://sucs.org/~welshbyte/lw/tcpdump_bad0.pcap
> http://sucs.org/~welshbyte/lw/tcpdump_bad1.pcap
> http://sucs.org/~welshbyte/lw/tcpdump_good0.pcap
>
> The two 'bad' ones were captured on the router while the bug was
> occurring. The 'good' one is for comparison and was captured on the
> router like the others but while running tcpdump on my laptop made the
> bug go away (I hope that makes sense).
>
> In all cases, the test was to ssh to a remote server. The two bad cases
> were closed before the password prompt was reached and the good case was
> ^C'd at the password prompt.
>
> I hope this helps,

Thanks. Unfortunately, it's very hard to tell what is causing the bug,
but Dave Miller and I analysed the problem and found it to be caused by
a disconnect in thinking -- I assumed skb_header_cloned() would check
for the full skb data (which seems to be called 'head' in some places),
not just the header; he, on the other hand, wasn't aware of software
encryption mangling it all so we never noticed that the commit you
bisected it down to was wrong to start with.

Hence, that commit is now reverted, see
http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=commit;h=3a5be7d4b079f3a9ce1e8ce4a93ba15ae6d00111
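
To make the difference between the two checks concrete, here is a toy
Python model. It is not kernel code and only a sketch under assumptions:
SharedData, data_cloned(), header_cloned(), toy_encrypt_in_place(),
transmit() and queued_segment() are hypothetical names invented for this
illustration, the two counters only loosely mimic how references to the
data and "payload only" references are tracked, and a plain XOR stands in
for software encryption.

```python
from dataclasses import dataclass


@dataclass
class SharedData:
    """Toy stand-in for a data area shared between a buffer and its clones."""
    payload: bytearray
    total_refs: int = 1         # every holder of this data area
    payload_only_refs: int = 0  # holders that no longer care about the headers


def data_cloned(d: SharedData) -> bool:
    """Full check: does anyone else hold this data area at all?"""
    return d.total_refs != 1


def header_cloned(d: SharedData) -> bool:
    """Header-only check: does anyone else still care about the headers?"""
    return d.total_refs - d.payload_only_refs != 1


def toy_encrypt_in_place(d: SharedData) -> None:
    """Hypothetical 'software encryption': XOR every payload byte in place."""
    for i in range(len(d.payload)):
        d.payload[i] ^= 0x5A


def transmit(d: SharedData, needs_copy) -> SharedData:
    """Encrypt for transmission, copying first only if needs_copy(d) is true."""
    if needs_copy(d):
        d = SharedData(bytearray(d.payload))  # private copy, safe to modify
    toy_encrypt_in_place(d)
    return d


def queued_segment() -> SharedData:
    """Data shared by a queued copy (payload-only holder) and a clone being sent."""
    return SharedData(bytearray(b"ssh payload"), total_refs=2, payload_only_refs=1)


if __name__ == "__main__":
    for check in (header_cloned, data_cloned):
        seg = queued_segment()
        transmit(seg, check)
        print(check.__name__, "queued copy intact:", seg.payload == b"ssh payload")
```

In this model the header-only check reports "not cloned" for the shared
segment, so the payload is encrypted in place and the queued copy is
mangled as well; the full check forces a private copy first. That would
fit the corrupted retransmissions seen in the captures.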

johannes

