2009-04-22 17:38:30

by Kasparek Tomas

Subject: NFS client packet storm on 2.6.27.x

On Sat, Apr 18, 2009 at 07:17:39AM +0200, Kasparek Tomas wrote:
> On Tue, Mar 03, 2009 at 09:16:07AM -0500, Trond Myklebust wrote:
> > On Tue, 2009-03-03 at 13:08 +0100, Kasparek Tomas wrote:
> > > On Tue, Jan 20, 2009 at 10:32:27AM -0500, Trond Myklebust wrote:
> > > > A binary wireshark dump of the traffic between one such client and the
> > > > server would help.
> > >
> > > I was finally able to get the tcpdump. I got it from a 2.6.27.19 client, but
> > > only after several weeks without problems. I include the file and have placed
> > > it at http://merlin.fit.vutbr.cz/tmp/nfs/dump_kas2_mat.dump_small (I have over
> > > 1GB of dump, but it is the same SYN+RST packets all the time). The packet rate
> > > maxed out at 260000pps from two clients.
> > >
> > > This dump was taken from the server after a reset (the server does not respond
> > > even to the keyboard) before the clients were disconnected/rebooted. As a
> > > reminder, all clients seem to work well with
> > > e06799f958bf7f9f8fae15f0c6f519953fb0257c reverted.
> >
> > Yes. I saw that behaviour when testing at Connectathon last week. When
> > one of the servers I was testing against crashed and later came up
> > again, the patched client went into that same SYN+RST frenzy. I'm
> > planning to look at this now that I'm back at home.
>
> Hi, I got a bit more data today, as I got to the client early, before it
> became unresponsive.
>
>
> The lockup may be because I disconnected the cable from that client to stop
> the packet storm, but the backtrace may still be useful.
>
> Is there anything else I can do that would help with this problem?

Hi,

(I changed the subject to be more descriptive of the current problem.)

I got another client lockup today. It was a desktop, so I have some more
dmesg warnings about soft lockups, probably caused by unplugging the network
cable (but hopefully still showing what happens in rpciod), at

http://merlin.fit.vutbr.cz/tmp/nfs/pckas-dmesg

I could see with top that rpciod was using 100% CPU. I limited the flow
from client to server with a firewall, so I was able to save the server and
get some tcpdump -s0 data (actually RPC null with an ERR response from the
server).

As a reminder, the client is 2.6.27.21 (i386), the server is 2.6.16.62
(x86_64).

Please let me know if I can do anything more; this is really painful for
me.

--

Tomas Kasparek, PhD student E-mail: [email protected]
CVT FIT VUT Brno, L127 Web: http://www.fit.vutbr.cz/~kasparek
Bozetechova 1, 612 66 Fax: +420 54114-1270
Brno, Czech Republic Phone: +420 54114-1220

jabber: [email protected]
GPG: 2F1E 1AAF FD3B CFA3 1537 63BD DCBE 18FF A035 53BC


2009-04-29 12:15:53

by Steve Dickson

Subject: Re: NFS client packet storm on 2.6.27.x



Kasparek Tomas wrote:
> On Sat, Apr 18, 2009 at 07:17:39AM +0200, Kasparek Tomas wrote:
>> On Tue, Mar 03, 2009 at 09:16:07AM -0500, Trond Myklebust wrote:
>>> On Tue, 2009-03-03 at 13:08 +0100, Kasparek Tomas wrote:
>>>> On Tue, Jan 20, 2009 at 10:32:27AM -0500, Trond Myklebust wrote:
>>>>> A binary wireshark dump of the traffic between one such client and the
>>>>> server would help.
>>>> I was finally able to get the tcpdump. I got it from a 2.6.27.19 client, but
>>>> only after several weeks without problems. I include the file and have placed
>>>> it at http://merlin.fit.vutbr.cz/tmp/nfs/dump_kas2_mat.dump_small (I have over
>>>> 1GB of dump, but it is the same SYN+RST packets all the time). The packet rate
>>>> maxed out at 260000pps from two clients.
>>>>
>>>> This dump was taken from the server after a reset (the server does not respond
>>>> even to the keyboard) before the clients were disconnected/rebooted. As a
>>>> reminder, all clients seem to work well with
>>>> e06799f958bf7f9f8fae15f0c6f519953fb0257c reverted.
>>> Yes. I saw that behaviour when testing at Connectathon last week. When
>>> one of the servers I was testing against crashed and later came up
>>> again, the patched client went into that same SYN+RST frenzy. I'm
>>> planning to look at this now that I'm back at home.
>> Hi, I got a bit more data today, as I got to the client early, before it
>> became unresponsive.
>>
>>
>> The lockup may be because I disconnected the cable from that client to stop
>> the packet storm, but the backtrace may still be useful.
>>
>> Is there anything else I can do that would help with this problem?
>
> Hi,
>
> (I changed the subject to be more descriptive of the current problem.)
>
> I got another client lockup today. It was a desktop, so I have some more
> dmesg warnings about soft lockups, probably caused by unplugging the network
> cable (but hopefully still showing what happens in rpciod), at
>
> http://merlin.fit.vutbr.cz/tmp/nfs/pckas-dmesg
>
> I could see with top that rpciod was using 100% CPU. I limited the flow
> from client to server with a firewall, so I was able to save the server and
> get some tcpdump -s0 data (actually RPC null with an ERR response from the
> server).
>
> As a reminder, the client is 2.6.27.21 (i386), the server is 2.6.16.62
> (x86_64).
>
> Please let me know if I can do anything more; this is really painful for
> me.
Try commenting out the tcp6/udp6 entries in /etc/netconfig.
This has helped in other places.
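
For illustration, here is a minimal Python sketch of that edit. It assumes the usual netconfig layout, in which the first whitespace-separated field of each entry is the network id. The function name comment_out_ipv6 and the SAMPLE text are hypothetical and not taken from any real system; the sketch only transforms text and does not touch /etc/netconfig itself.

```python
def comment_out_ipv6(netconfig_text: str) -> str:
    """Return netconfig_text with the tcp6/udp6 entries commented out.

    Assumption: each entry starts with its network id (e.g. "tcp6").
    Lines that are already commented or belong to other transports
    are left unchanged.
    """
    out = []
    for line in netconfig_text.splitlines():
        fields = line.split()
        if fields and fields[0] in ("tcp6", "udp6"):
            out.append("#" + line)
        else:
            out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical sample content, for demonstration only.
SAMPLE = """\
udp        tpi_clts      v     inet     udp     -       -
tcp        tpi_cots_ord  v     inet     tcp     -       -
udp6       tpi_clts      v     inet6    udp     -       -
tcp6       tpi_cots_ord  v     inet6    tcp     -       -
"""

if __name__ == "__main__":
    print(comment_out_ipv6(SAMPLE), end="")
```

Running the function a second time on its own output changes nothing, so it is safe to apply repeatedly. Keep a copy of the original file before writing any result back.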

steved.

2009-04-29 14:58:25

by Kasparek Tomas

Subject: Re: NFS client packet storm on 2.6.27.x

On Wed, Apr 29, 2009 at 08:12:38AM -0400, Steve Dickson wrote:
> > I got another client lockup today. It was a desktop, so I have some more
> > dmesg warnings about soft lockups, probably caused by unplugging the network
> > cable (but hopefully still showing what happens in rpciod), at
> >
> > http://merlin.fit.vutbr.cz/tmp/nfs/pckas-dmesg
> >
> > I could see with top that rpciod was using 100% CPU. I limited the flow
> > from client to server with a firewall, so I was able to save the server and
> > get some tcpdump -s0 data (actually RPC null with an ERR response from the
> > server).
> >
> > As a reminder, the client is 2.6.27.21 (i386), the server is 2.6.16.62
> > (x86_64).
> >
> > Please let me know if I can do anything more; this is really painful for
> > me.
> Try commenting out the tcp6/udp6 entries in /etc/netconfig.
> This has helped in other places.

It's CentOS 5.3, but if you mean disabling IPv6, I cannot do that: the
server is using IPv6 (though not with these clients). And as mentioned before,
it works well without e06799f958bf7f9f8fae15f0c6f519953fb0257c, both for
the lockd and the client floods.

Bye


2009-06-25 06:10:26

by Kasparek Tomas

Subject: Re: NFS client packet storm on 2.6.27.x

On Wed, Apr 22, 2009 at 07:27:07PM +0200, Kasparek Tomas wrote:
> I got another client lockup today. It was a desktop, so I have some more
> dmesg warnings about soft lockups, probably caused by unplugging the network
> cable (but hopefully still showing what happens in rpciod), at
>
> http://merlin.fit.vutbr.cz/tmp/nfs/pckas-dmesg
>
> I could see with top that rpciod was using 100% CPU. I limited the flow
> from client to server with a firewall, so I was able to save the server and
> get some tcpdump -s0 data (actually RPC null with an ERR response from the
> server).
>
> As a reminder, the client is 2.6.27.21 (i386), the server is 2.6.16.62
> (x86_64).

Hi, I was playing with patches from

http://www.linux-nfs.org/Linux-2.6.x/2.6.27/

and found that

.../fixups_4/linux-2.6.27-001-respond_promptly_to_socket_errors.dif
.../fixups_4/linux-2.6.27-002-respond_promptly_to_socket_errors_2.dif

change the lockup behaviour from long-to-endless lockups to 1-2 second
lockups, and there seem to be fewer situations in which it locks up.

The packet storms have not recurred so far since I switched to the 2.6.27.24
(and .25) kernels, so they may also have been fixed by some other patch in .24.

Together with the tcp_linger patch, this seems to improve the situation a lot, to
the point where I can use 2.6.27.x kernels.

Trond, would it be possible to get tcp_linger and the two patches above into the
2.6.27.x stable queue so others get these fixes?

Many thanks to all for your help.


2009-07-13 11:13:17

by Kasparek Tomas

Subject: Re: NFS client packet storm on 2.6.27.x

On Thu, Jun 25, 2009 at 07:55:32AM +0200, Kasparek Tomas wrote:
> http://www.linux-nfs.org/Linux-2.6.x/2.6.27/
>
> .../fixups_4/linux-2.6.27-001-respond_promptly_to_socket_errors.dif
> .../fixups_4/linux-2.6.27-002-respond_promptly_to_socket_errors_2.dif
>
> ...
> Together with the tcp_linger patch, this seems to improve the situation a lot, to
> the point where I can use 2.6.27.x kernels.
>
> Trond, would it be possible to get tcp_linger and the two patches above into the
> 2.6.27.x stable queue so others get these fixes?

I have had no response so far. Could someone advise me what to do to get the
above patches into 2.6.27.x stable? Or is there some reason why this is not
possible at all?

Thanks for any advice or suggestions.


2009-07-13 17:23:33

by Greg KH

Subject: Re: [stable] NFS client packet storm on 2.6.27.x

On Mon, Jul 13, 2009 at 01:12:15PM +0200, Kasparek Tomas wrote:
> On Thu, Jun 25, 2009 at 07:55:32AM +0200, Kasparek Tomas wrote:
> > http://www.linux-nfs.org/Linux-2.6.x/2.6.27/
> >
> > .../fixups_4/linux-2.6.27-001-respond_promptly_to_socket_errors.dif
> > .../fixups_4/linux-2.6.27-002-respond_promptly_to_socket_errors_2.dif
> >
> > ...
> > Together with the tcp_linger patch, this seems to improve the situation a lot, to
> > the point where I can use 2.6.27.x kernels.
> >
> > Trond, would it be possible to get tcp_linger and the two patches above into the
> > 2.6.27.x stable queue so others get these fixes?
>
> I have had no response so far. Could someone advise me what to do to get the
> above patches into 2.6.27.x stable? Or is there some reason why this is not
> possible at all?

Can you backport them and send them to [email protected] with the git
commit ids of the original patches?

thanks,

greg k-h

2009-07-13 17:40:25

by Trond Myklebust

Subject: Re: [stable] NFS client packet storm on 2.6.27.x

On Mon, 2009-07-13 at 10:20 -0700, Greg KH wrote:
> On Mon, Jul 13, 2009 at 01:12:15PM +0200, Kasparek Tomas wrote:
> > On Thu, Jun 25, 2009 at 07:55:32AM +0200, Kasparek Tomas wrote:
> > > http://www.linux-nfs.org/Linux-2.6.x/2.6.27/
> > >
> > > .../fixups_4/linux-2.6.27-001-respond_promptly_to_socket_errors.dif
> > > .../fixups_4/linux-2.6.27-002-respond_promptly_to_socket_errors_2.dif
> > >
> > > ...
> > > Together with the tcp_linger patch, this seems to improve the situation a lot, to
> > > the point where I can use 2.6.27.x kernels.
> > >
> > > Trond, would it be possible to get tcp_linger and the two patches above into the
> > > 2.6.27.x stable queue so others get these fixes?
> >
> > I have had no response so far. Could someone advise me what to do to get the
> > above patches into 2.6.27.x stable? Or is there some reason why this is not
> > possible at all?
>
> Can you backport them and send them to [email protected] with the git
> commit ids of the original patches?

When testing at a customer site, we recently found that we actually need
4 patches in order to completely fix the storming problem in 2.6.27.y,
and ensure a timely socket reconnect.

commit 15f081ca8ddfe150fb639c591b18944a539da0fc (SUNRPC: Avoid an
unnecessary task reschedule on ENOTCONN)

commit 670f94573104b4a25525d3fcdcd6496c678df172 (SUNRPC: Ensure we set
XPRT_CLOSING only after we've sent a tcp FIN...)

commit 40d2549db5f515e415894def98b49db7d4c56714 (SUNRPC: Don't
disconnect if a connection is still in progress.)

and finally

commit f75e6745aa3084124ae1434fd7629853bdaf6798 (SUNRPC: Fix the problem
of EADDRNOTAVAIL syslog floods on reconnect)

The first three need to be applied to kernels 2.6.27.y to 2.6.29.y,
while the last needs to be applied to 2.6.27.y to 2.6.30.y.
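
As a rough aid for anyone tracking these backports, the list above can be written down as a small lookup table. This is only a sketch in Python: the FIXES table and the fixes_needed helper are hypothetical names, and the series ranges are simply the ones stated in this message.

```python
from typing import List, Tuple

# (commit id, subject, first 2.6.x series, last 2.6.x series), as listed above.
FIXES: List[Tuple[str, str, int, int]] = [
    ("15f081ca8ddfe150fb639c591b18944a539da0fc",
     "SUNRPC: Avoid an unnecessary task reschedule on ENOTCONN", 27, 29),
    ("670f94573104b4a25525d3fcdcd6496c678df172",
     "SUNRPC: Ensure we set XPRT_CLOSING only after we've sent a tcp FIN...",
     27, 29),
    ("40d2549db5f515e415894def98b49db7d4c56714",
     "SUNRPC: Don't disconnect if a connection is still in progress.", 27, 29),
    ("f75e6745aa3084124ae1434fd7629853bdaf6798",
     "SUNRPC: Fix the problem of EADDRNOTAVAIL syslog floods on reconnect",
     27, 30),
]


def fixes_needed(series: str) -> List[str]:
    """Return the commit ids listed for a series such as "2.6.27.y"."""
    # The third dotted component selects the 2.6.x series (27, 28, ...).
    minor = int(series.split(".")[2])
    return [commit for commit, _subject, first, last in FIXES
            if first <= minor <= last]


if __name__ == "__main__":
    for series in ("2.6.27.y", "2.6.30.y"):
        print(series, len(fixes_needed(series)))
```

Whether a given commit is already present in a particular stable release still has to be checked against that tree.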

I'll ensure the patches get sent to [email protected]...

Trond

2009-07-28 18:57:50

by Greg KH

Subject: Re: [stable] NFS client packet storm on 2.6.27.x

On Mon, Jul 13, 2009 at 01:40:17PM -0400, Trond Myklebust wrote:
> On Mon, 2009-07-13 at 10:20 -0700, Greg KH wrote:
> > On Mon, Jul 13, 2009 at 01:12:15PM +0200, Kasparek Tomas wrote:
> > > On Thu, Jun 25, 2009 at 07:55:32AM +0200, Kasparek Tomas wrote:
> > > > http://www.linux-nfs.org/Linux-2.6.x/2.6.27/
> > > >
> > > > .../fixups_4/linux-2.6.27-001-respond_promptly_to_socket_errors.dif
> > > > .../fixups_4/linux-2.6.27-002-respond_promptly_to_socket_errors_2.dif
> > > >
> > > > ...
> > > > Together with the tcp_linger patch, this seems to improve the situation a lot, to
> > > > the point where I can use 2.6.27.x kernels.
> > > >
> > > > Trond, would it be possible to get tcp_linger and the two patches above into the
> > > > 2.6.27.x stable queue so others get these fixes?
> > >
> > > I have had no response so far. Could someone advise me what to do to get the
> > > above patches into 2.6.27.x stable? Or is there some reason why this is not
> > > possible at all?
> >
> > Can you backport them and send them to [email protected] with the git
> > commit ids of the original patches?
>
> When testing at a customer site, we recently found that we actually need
> 4 patches in order to completely fix the storming problem in 2.6.27.y,
> and ensure a timely socket reconnect.
>
> commit 15f081ca8ddfe150fb639c591b18944a539da0fc (SUNRPC: Avoid an
> unnecessary task reschedule on ENOTCONN)
>
> commit 670f94573104b4a25525d3fcdcd6496c678df172 (SUNRPC: Ensure we set
> XPRT_CLOSING only after we've sent a tcp FIN...)
>
> commit 40d2549db5f515e415894def98b49db7d4c56714 (SUNRPC: Don't
> disconnect if a connection is still in progress.)

I've now applied all of these to the .27 stable tree.

> and finally
>
> commit f75e6745aa3084124ae1434fd7629853bdaf6798 (SUNRPC: Fix the problem
> of EADDRNOTAVAIL syslog floods on reconnect)
>
> The first three need to be applied to kernels 2.6.27.y to 2.6.29.y,
> while the last needs to be applied to 2.6.27.y to 2.6.30.y.

The last one is in .30, so it doesn't need to go there again :)

But this last one doesn't apply at all to the .27 stable tree. Care to
refresh it and send it to [email protected] if you think it is also
needed?

thanks,

greg k-h