2008-06-17 23:04:45

by Ilpo Järvinen

Subject: Re: [Bugme-new] [Bug 10903] New: ssh connections hang with 2.6.26-rc5

On Tue, 17 Jun 2008, Didier Raboud wrote:

> On Monday 16 June 2008 15:21:25, Ilpo Järvinen wrote:
> > > > > > http://bugzilla.kernel.org/show_bug.cgi?id=10903
> > > > > >
> > > > > > Summary: ssh connections hang with 2.6.26-rc5
> > > > > > Product: Networking
> > > > > > Version: 2.5
> > > > > > KernelVersion: 2.6.26-rc5
> > > > > > Platform: All
> > > > > > OS/Version: Linux
> > > > > > Tree: Mainline
> > > > > > Status: NEW
> > > > > > Severity: normal
> > > > > > Priority: P1
> > > > > > Component: Other
> > > > > > AssignedTo: [email protected]
> > > > > > ReportedBy: [email protected]
> > > > > >
> > > > > >
> > > > > > Latest working kernel version: 2.6.25-2
> > > > > > Earliest failing kernel version: 2.6.26-rc5
> > > > > > Distribution: Debian (Lenny + Sid)
> > > > > > Hardware Environment: amd64 (Dell Latitude D630)
> > > > > > Software Environment: KDE
> > > > > > Problem Description:
> > > > > >
> > > > > > With kernel version 2.6.26-rc5, ssh connections to remote
> > > > > > servers randomly hang (no error message). No improvement
> > > > > > despite enabling "ServerAliveInterval" on both sides.
> > > > (...)
> > >
> > > The common point is my use of "iwl3945" : I have always tried the ssh
> > > connections through WiFi.
> > >
> > > > So please gather this information (at least for the relevant
> > > > connections):
> > > >
> > > > $ netstat -pn
> > > > $ cat /proc/net/tcp
>
> I used a script which logged both every 15 seconds on both sides (in "screen"
> on server side). I then triggered a hang with several "seq 1 100" in the ssh
> session. The logs are in the attached debug_ssh_hang.tar.gz . The hang
> appeared between 234327 and 234342 on the client side (so somewhere between
> 234324, 234339 and 234354 on the server side).
>
> I hope it'll help.

Thanks, a quick look didn't indicate anything similar to one other report
from Håkon... I'd probably need a tcpdump.

> > You probably run it under X, no? Please switch beforehand to some other vt
> > (a textual one) (Ctrl-Alt-Fn, where n < 6), then log in and run that
> > command there and see if you get some output on the screen there. If you
> > see something (e.g., a sudden OOPS message or some other warning printed)
> > when it locks up, the easiest thing is to take a shot with a digicam (or
> > write it down somewhere else) and send that shot (or those details) to us
> > please.
>
> I tried the following in vt1 (under the new 2.6.26-rc6 with kdm stopped):
>
> # tcpdump -i wlan0 -w /tmp/tcpdump.wlan0
>
> and I got the attached "soft lockup".

Thanks, I guess the folks at linux-wireless can continue resolving this
one with you (I'm out of time for now anyway).

> > ...Once you have a tcpdump, I can probably figure at least something out
> > (though it might still just point to the right direction rather than
> > exposing the actual cause).
>
> I can't get one... :)

...Ok.

--
i.


2008-06-18 08:05:28

by David Miller

Subject: Re: [Bugme-new] [Bug 10903] New: ssh connections hang with 2.6.26-rc5

From: Johannes Berg <johannes@sipsolutions.net>
Date: Wed, 18 Jun 2008 09:24:47 +0200

>
> > > > > > > > http://bugzilla.kernel.org/show_bug.cgi?id=10903
> > > > > > > >
> > > > > > > > Summary: ssh connections hang with 2.6.26-rc5
>
> Andrew Prince reported a similar problem and said he bisected it to
> davem's 608961a5eca8d3c6bd07172febc27b5559408c5d ("mac80211: Use
> skb_header_cloned() on TX path.") which made no sense to me so I marked
> the report as 'to investigate when I have more time'.

That's useful information. The kernel bugzilla entry is for
the iwl3945 driver, so that matches up accurately with this
too.

If we can't figure out what's going on here soon (like, in less than a
day) we should revert that changeset.

Actually, I think I see how the changeset might be wrong. I think
the encryption layer of mac80211 assumes it can write over the
data area of the SKB it's working on, not just the headers.

Once this happens, any retransmits done by SKB will fail because the
master packet data on TCP's retransmit queue is now this encrypted
garbage.

2008-06-18 11:39:44

by Michael Büsch

Subject: Re: [Bugme-new] [Bug 10903] New: ssh connections hang with 2.6.26-rc5

On Wednesday 18 June 2008 13:34:06 Didier Raboud wrote:
> On Wednesday 18 June 2008 09:24:47, Johannes Berg wrote:
> > > > > > > > > http://bugzilla.kernel.org/show_bug.cgi?id=10903
> > > > > > > > >
> > > > > > > > > Summary: ssh connections hang with 2.6.26-rc5
> >
> > Andrew Prince reported a similar problem and said he bisected it to
> > davem's 608961a5eca8d3c6bd07172febc27b5559408c5d ("mac80211: Use
> > skb_header_cloned() on TX path.") which made no sense to me so I marked
> > the report as 'to investigate when I have more time'.
> >
> > > > and I got the attached "soft lockup".
> >
> > Attached where? I can't find it in the bug either.
> >
> > johannes
>
> Hi,
>
> For unknown reasons, my mail doesn't appear in the mailing list archives and
> the big attachment has apparently been stripped.
>
> You'll find the "soft lockup" here:
>
> http://raboud.homelinux.org/~didier/kernel/tcpdump_oops.jpg

The soft lockup is most likely a follow-up oops to a previous one that
locked up the machine.
Can you try to reproduce it and capture the screen before waiting 61 seconds
for the watchdog to trigger? There should be another oops before that (you
can see the last two lines of it in this picture).

--
Greetings Michael.

2008-06-18 11:34:37

by Didier Raboud

Subject: Re: [Bugme-new] [Bug 10903] New: ssh connections hang with 2.6.26-rc5

On Wednesday 18 June 2008 09:24:47, Johannes Berg wrote:
> > > > > > > > http://bugzilla.kernel.org/show_bug.cgi?id=10903
> > > > > > > >
> > > > > > > > Summary: ssh connections hang with 2.6.26-rc5
>
> Andrew Prince reported a similar problem and said he bisected it to
> davem's 608961a5eca8d3c6bd07172febc27b5559408c5d ("mac80211: Use
> skb_header_cloned() on TX path.") which made no sense to me so I marked
> the report as 'to investigate when I have more time'.
>
> > > and I got the attached "soft lockup".
>
> Attached where? I can't find it in the bug either.
>
> johannes

Hi,

For unknown reasons, my mail doesn't appear in the mailing list archives and
the big attachment has apparently been stripped.

You'll find the "soft lockup" here:

http://raboud.homelinux.org/~didier/kernel/tcpdump_oops.jpg

Regards,

Didier
--
Didier Raboud, proud Debian user.
CH-1802 Corseaux
[email protected]


Attachments:
(No filename) (932.00 B)
signature.asc (197.00 B)
This is a digitally signed message part.

2008-06-18 07:25:30

by Johannes Berg

Subject: Re: [Bugme-new] [Bug 10903] New: ssh connections hang with 2.6.26-rc5


> > > > > > > http://bugzilla.kernel.org/show_bug.cgi?id=10903
> > > > > > >
> > > > > > > Summary: ssh connections hang with 2.6.26-rc5

Andrew Prince reported a similar problem and said he bisected it to
davem's 608961a5eca8d3c6bd07172febc27b5559408c5d ("mac80211: Use
skb_header_cloned() on TX path.") which made no sense to me so I marked
the report as 'to investigate when I have more time'.

> > > You probably run it under X, no? Please switch beforehand to some other vt
> > > (a textual one) (Ctrl-Alt-Fn, where n < 6), then log in and run that
> > > command there and see if you get some output on the screen there. If you
> > > see something (e.g., a sudden OOPS message or some other warning printed)
> > > when it locks up, the easiest thing is to take a shot with a digicam (or
> > > write it down somewhere else) and send that shot (or those details) to us
> > > please.
> >
> > I tried the following in vt1 (under the new 2.6.26-rc6 with kdm stopped):
> >
> > # tcpdump -i wlan0 -w /tmp/tcpdump.wlan0
> >
> > and I got the attached "soft lockup".

Attached where? I can't find it in the bug either.

johannes


Attachments:
signature.asc (836.00 B)
This is a digitally signed message part

2008-06-18 08:26:58

by David Miller

Subject: Re: [Bugme-new] [Bug 10903] New: ssh connections hang with 2.6.26-rc5

From: David Miller <[email protected]>
Date: Wed, 18 Jun 2008 01:05:28 -0700 (PDT)

> If we can't figure out what's going on here soon (like, in less than a
> day) we should revert that changeset.
>
> Actually, I think I see how the changeset might be wrong. I think
> the encryption layer of mac80211 assumes it can write over the
> data area of the SKB it's working on, not just the headers.
>
> Once this happens, any retransmits done by SKB will fail because the
> master packet data on TCP's retransmit queue is now this encrypted
> garbage.

After some discussion about this with Johannes on IRC, we are
absolutely convinced this is exactly the problem.

I intend to send the following revert to Linus tonight so we
can close this:

--------------------
Revert "mac80211: Use skb_header_cloned() on TX path."

This reverts commit 608961a5eca8d3c6bd07172febc27b5559408c5d.

The problem is that the mac80211 stack not only needs to be able to
muck with the link-level headers, it also might need to mangle all of
the packet data if doing sw wireless encryption.

This fixes kernel bugzilla #10903. Thanks to Didier Raboud (for the
bugzilla report), Andrew Prince (for bisecting), Johannes Berg (for
bringing this bisection analysis to my attention), and Ilpo (for
trying to analyze this purely from the TCP side).

In 2.6.27 we can take another stab at this, by using something like
skb_cow_data() when the TX path of mac80211 ends up with a non-NULL
tx->key. The ESP protocol code in the IPSEC stack can be used as a
model for implementation.

Signed-off-by: David S. Miller <[email protected]>
---
net/mac80211/tx.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
index 1d7dd54..28d8bd5 100644
--- a/net/mac80211/tx.c
+++ b/net/mac80211/tx.c
@@ -1562,13 +1562,13 @@ int ieee80211_subif_start_xmit(struct sk_buff *skb,
 	 * be cloned. This could happen, e.g., with Linux bridge code passing
 	 * us broadcast frames. */

-	if (head_need > 0 || skb_header_cloned(skb)) {
+	if (head_need > 0 || skb_cloned(skb)) {
 #if 0
 		printk(KERN_DEBUG "%s: need to reallocate buffer for %d bytes "
 		       "of headroom\n", dev->name, head_need);
 #endif

-		if (skb_header_cloned(skb))
+		if (skb_cloned(skb))
 			I802_DEBUG_INC(local->tx_expand_skb_head_cloned);
 		else
 			I802_DEBUG_INC(local->tx_expand_skb_head);
--
1.5.5.1.308.g1fbb5