2010-06-14 18:20:09

by Thomas Bächler

Subject: [2.6.35-rc3 regression] TCP connections on 'lo' interface randomly stall

With 2.6.35-rc3, I cannot use TCP over the 'lo' interface any more. This
is reproduced easily by running 'ssh localhost' and executing a few
commands inside the ssh session (if you are able to log in, 'ls -lhFR /'
does a good job) - the connection will stall completely after a very
short time. As far as I can see, all applications are affected.
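
For anyone without an ssh server at hand, the stall can probably also be probed with a plain TCP transfer over 127.0.0.1. The following is only a minimal sketch, not the test used in this report: the function name `loopback_transfer`, the 1 MiB payload and the 5 second timeout are arbitrary choices made for illustration.

```python
import socket
import threading


def loopback_transfer(total_bytes=1 << 20, timeout=5.0):
    """Send total_bytes over TCP on 127.0.0.1 and return the bytes received.

    Raises socket.timeout if no data arrives for `timeout` seconds,
    which is how a stalled connection would show up here.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    def send_all():
        # Accept the single client and push the whole payload to it.
        conn, _ = server.accept()
        with conn:
            conn.sendall(b"x" * total_bytes)

    sender = threading.Thread(target=send_all, daemon=True)
    sender.start()

    received = 0
    with socket.create_connection(("127.0.0.1", port), timeout=timeout) as client:
        while received < total_bytes:
            chunk = client.recv(65536)  # times out if the stream stalls
            if not chunk:
                break  # peer closed the connection early
            received += len(chunk)

    sender.join(timeout)
    server.close()
    return received


if __name__ == "__main__":
    print(loopback_transfer())
```

On a working kernel the function should return the full payload size; on an affected kernel one would expect it to raise `socket.timeout` instead, possibly only after several runs.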

It also seems that once a service is "stalled", I cannot open a new
connection to the same TCP port anymore. However, I can open a
connection to a different port until that one is stalled, too.

Running wireshark, I can see that TCP retransmissions are sent, but
never acknowledged.

Bisection (starting with 7908a9e as good and v2.6.35-rc3 as bad) leads
to the following commit. Please CC me on replies to this issue. Thanks
for your help.

commit 597a264b1a9c7e36d1728f677c66c5c1f7e3b837
Author: John Fastabend <[email protected]>
Date: Thu Jun 3 09:30:11 2010 +0000

net: deliver skbs on inactive slaves to exact matches

Currently, the accelerated receive path for VLANs will
drop packets if the real device is an inactive slave and
is not one of the special pkts tested for in
skb_bond_should_drop(). This behavior is different than
the non-accelerated path and for pkts over a bonded vlan.

For example,

vlanx -> bond0 -> ethx

will be dropped in the vlan path and not delivered to any
packet handlers at all. However,

bond0 -> vlanx -> ethx

and

bond0 -> ethx

will be delivered to handlers that match the exact dev,
because the VLAN path checks the real_dev which is not a
slave and netif_recv_skb() doesn't drop frames but only
delivers them to exact matches.

This patch adds a sk_buff flag which is used for tagging
skbs that would previously have been dropped and allows the
skb to continue to skb_netif_recv(). Here we add
logic to check for the deliver_no_wcard flag and, if it
is set, only deliver to handlers that match exactly. This
makes both paths above consistent and gives pkt handlers
a way to identify skbs that come from inactive slaves.
Without this patch, in some configurations skbs will be
delivered to handlers with exact matches and in others
be dropped outright in the vlan path.

I have tested the following 4 configurations in failover modes
and load balancing modes.

# bond0 -> ethx

# vlanx -> bond0 -> ethx

# bond0 -> vlanx -> ethx

# bond0 -> ethx
            |
  vlanx -> --

Signed-off-by: John Fastabend <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
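
The delivery rule described in the commit message can be pictured with a small user-space model. This is only a sketch under stated assumptions, not the kernel code: the `Handler` and `Skb` classes and the `matching_handlers` function are hypothetical names, and a handler whose `dev` is `None` stands in for a wildcard packet handler. Only the flag name `deliver_no_wcard` is taken from the commit message.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Handler:
    name: str
    dev: Optional[str]  # None models a wildcard handler (any device)


@dataclass
class Skb:
    dev: str
    deliver_no_wcard: bool = False  # tagged: came in on an inactive slave


def matching_handlers(skb: Skb, handlers: List[Handler]) -> List[str]:
    """Return the names of the handlers that receive skb in this model."""
    matched = []
    for h in handlers:
        if h.dev == skb.dev:
            # Exact device match: delivered whether or not the skb is tagged.
            matched.append(h.name)
        elif h.dev is None and not skb.deliver_no_wcard:
            # Wildcard handlers only see untagged skbs.
            matched.append(h.name)
    return matched
```

In this model an untagged skb reaches both wildcard and exact-match handlers, while a tagged skb reaches only handlers bound to its own device.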



2010-06-14 18:24:16

by Randy Dunlap

Subject: Re: [2.6.35-rc3 regression] TCP connections on 'lo' interface randomly stall

On Mon, 14 Jun 2010 20:10:27 +0200 Thomas Bächler wrote:

> With 2.6.35-rc3, I cannot use TCP over the 'lo' interface any more. This
> is reproduced easily by running 'ssh localhost' and executing a few
> commands inside the ssh session (if you are able to log in, 'ls -lhFR /'
> does a good job) - the connection will stall completely after a very
> short time. As far as I can see, all applications are affected.
>
> It also seems that once a service is "stalled", I cannot open a new
> connection to the same TCP port anymore. However, I can open a
> connection to a different port until that one is stalled, too.
>
> Running wireshark, I can see that TCP retransmissions are sent, but
> never acknowledged.
>
> Bisection (starting with 7908a9e as good and v2.6.35-rc3 as bad) leads
> to the following commit. Please CC me on replies to this issue. Thanks
> for your help.

Does this patch fix it for you?
http://lkml.org/lkml/2010/6/13/155


> commit 597a264b1a9c7e36d1728f677c66c5c1f7e3b837
> Author: John Fastabend <[email protected]>
> Date: Thu Jun 3 09:30:11 2010 +0000
>
> net: deliver skbs on inactive slaves to exact matches
>
> Currently, the accelerated receive path for VLANs will
> drop packets if the real device is an inactive slave and
> is not one of the special pkts tested for in
> skb_bond_should_drop(). This behavior is different than
> the non-accelerated path and for pkts over a bonded vlan.
>
> For example,
>
> vlanx -> bond0 -> ethx
>
> will be dropped in the vlan path and not delivered to any
> packet handlers at all. However,
>
> bond0 -> vlanx -> ethx
>
> and
>
> bond0 -> ethx
>
> will be delivered to handlers that match the exact dev,
> because the VLAN path checks the real_dev which is not a
> slave and netif_recv_skb() doesn't drop frames but only
> delivers them to exact matches.
>
> This patch adds a sk_buff flag which is used for tagging
> skbs that would previously have been dropped and allows the
> skb to continue to skb_netif_recv(). Here we add
> logic to check for the deliver_no_wcard flag and, if it
> is set, only deliver to handlers that match exactly. This
> makes both paths above consistent and gives pkt handlers
> a way to identify skbs that come from inactive slaves.
> Without this patch, in some configurations skbs will be
> delivered to handlers with exact matches and in others
> be dropped outright in the vlan path.
>
> I have tested the following 4 configurations in failover modes
> and load balancing modes.
>
> # bond0 -> ethx
>
> # vlanx -> bond0 -> ethx
>
> # bond0 -> vlanx -> ethx
>
> # bond0 -> ethx
>             |
>   vlanx -> --
>
> Signed-off-by: John Fastabend <[email protected]>
> Signed-off-by: David S. Miller <[email protected]>
>


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

2010-06-14 19:36:13

by Thomas Bächler

Subject: Re: [2.6.35-rc3 regression] TCP connections on 'lo' interface randomly stall

On 14.06.2010 20:23, Randy Dunlap wrote:
> On Mon, 14 Jun 2010 20:10:27 +0200 Thomas Bächler wrote:
>
>> With 2.6.35-rc3, I cannot use TCP over the 'lo' interface any more. This
>> is reproduced easily by running 'ssh localhost' and executing a few
>> commands inside the ssh session (if you are able to log in, 'ls -lhFR /'
>> does a good job) - the connection will stall completely after a very
>> short time. As far as I can see, all applications are affected.
>>
>> It also seems that once a service is "stalled", I cannot open a new
>> connection to the same TCP port anymore. However, I can open a
>> connection to a different port until that one is stalled, too.
>>
>> Running wireshark, I can see that TCP retransmissions are sent, but
>> never acknowledged.
>>
>> Bisection (starting with 7908a9e as good and v2.6.35-rc3 as bad) leads
>> to the following commit. Please CC me on replies to this issue. Thanks
>> for your help.
>
> Does this patch fix it for you?
> http://lkml.org/lkml/2010/6/13/155

Yes, it seems to fix it. Thanks.

