2009-01-28 08:18:57

by Kasparek Tomas

Subject: Re: [PATCH 0/3] NFS regression in 2.6.26?, "task blocked for more than 120 seconds"

On Tue, Jan 20, 2009 at 10:32:27AM -0500, Trond Myklebust wrote:
> On Tue, 2009-01-20 at 16:03 +0100, Kasparek Tomas wrote:
> > On Sun, Jan 18, 2009 at 02:08:35PM +0100, Kasparek Tomas wrote:
> > > > > > The attached 2 patches have been tested using a server that was rigged
> > > > > > not to ever close the socket. They appear to work fine on my setup,
> > > > > > without the hang that you reported earlier.
> > > ...
> > > It seems that machines with this new kernel (tried on 10 other machines
> > > and the original client) may after a few days get into a state where they
> > > generate huge amounts (10000-100000 pkt/s) of packets to another server they
> > > use (Linux 2.6.26.62, but the same behaviour with other kernels I tried -
> > > 2.6.24.7, 2.6.22.19, 2.6.27.10). It seems the packets are quite small, as the
> > > flow on the server is about 5-10 MB/s. (Probably) each packet generates an answer.
> > > With this flow it is hard to get more info, and the server is a production
> > > one, so for now I only know it comes from these clients and ends on TCP port
> > > 2049 on that server. It kills just this server; communication with the
> > > previously problematic (FreeBSD) machines is fine now.
> >
> > patches. I do not have more info about what's on the network; the only new
> > thing I can add is that the client is dead, not reacting even to the keyboard or
> > anything else. Trond, would you have an idea what to try now or what other
>
> A binary wireshark dump of the traffic between one such client and the
> server would help.

I tried to get some data several times, but the client is dead and the
server is so overloaded that I'm unable to get anything reasonable. I
tried to insert another machine in front of the client as a bridge, but
the traffic overloaded it the same way as the server. I will try to figure
out how to get a traffic dump, but have no other idea for now.
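
As a rough sanity check of the figures quoted above (10000-100000 pkt/s at
about 5-10 MB/s), the implied average packet size can be estimated. This is
only a minimal sketch: the helper name avg_packet_size is made up for
illustration, and it assumes "MB" in the report means 10**6 bytes.

```python
def avg_packet_size(bytes_per_s: float, packets_per_s: float) -> float:
    """Average bytes per packet for a given throughput and packet rate."""
    return bytes_per_s / packets_per_s


MB = 1_000_000  # assumption: "MB" in the report means 10**6 bytes

# Extremes of the reported ranges: 5-10 MB/s and 10000-100000 pkt/s.
smallest = avg_packet_size(5 * MB, 100_000)   # low throughput, high packet rate
largest = avg_packet_size(10 * MB, 10_000)    # high throughput, low packet rate

print(f"average packet size between {smallest:.0f} and {largest:.0f} bytes")
```

The low end is close to the 40 bytes of an IPv4 plus TCP header without
options, which would fit the impression that the packets are small. It says
nothing about what the packets actually contain; only a capture can show that.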

Bye

--

Tomas Kasparek, PhD student E-mail: [email protected]
CVT FIT VUT Brno, L127 Web: http://www.fit.vutbr.cz/~kasparek
Bozetechova 1, 612 66 Fax: +420 54114-1270
Brno, Czech Republic Phone: +420 54114-1220

jabber: [email protected]
GPG: 2F1E 1AAF FD3B CFA3 1537 63BD DCBE 18FF A035 53BC



2009-02-10 07:55:15

by Kasparek Tomas

Subject: Re: [PATCH 0/3] NFS regression in 2.6.26?, "task blocked for more than 120 seconds"

On Fri, Feb 06, 2009 at 07:35:13AM +0100, Kasparek Tomas wrote:
> > > A binary wireshark dump of the traffic between one such client and the
> > > server would help.
> >
> > I tried to get some data several times, but the client is dead and the
> > server is so overloaded that I'm unable to get anything reasonable. I
> > tried to insert another machine in front of the client as a bridge, but
> > the traffic overloaded it the same way as the server. I will try to figure
> > out how to get a traffic dump, but have no other idea for now.

> as another try, I upgraded from 2.6.27.10 to 2.6.27.13 (and .14) and it
> looks like the problem disappeared. Right now I'm running 5 clients with .13
> or .14 and tcpdumps for 3 days without any problem. I will try to stop the
> tcpdumps, as they can potentially influence the behaviour, and will confirm the
> state next week.

After 6 days all machines except the first client used are fine and have no
problems. Based on this I would conclude that:

- your patch fixes the problem I had
- there may be something wrong in <2.6.27.13, but it's ok in .13+
- I finally have some tcpdumps from the server concerning the first
  problematic client. I will try to extract the interesting packets and send
  them here in case you or someone else can find anything helpful there. With
  .14 the client runs much better anyway, staying alive for 3 days instead of
  6-10 hours as with .10

Thank you for your support.



2009-02-06 06:35:22

by Kasparek Tomas

Subject: Re: [PATCH 0/3] NFS regression in 2.6.26?, "task blocked for more than 120 seconds"

On Wed, Jan 28, 2009 at 09:18:52AM +0100, Kasparek Tomas wrote:
> On Tue, Jan 20, 2009 at 10:32:27AM -0500, Trond Myklebust wrote:
> > On Tue, 2009-01-20 at 16:03 +0100, Kasparek Tomas wrote:
> > > On Sun, Jan 18, 2009 at 02:08:35PM +0100, Kasparek Tomas wrote:
> > > > > > > The attached 2 patches have been tested using a server that was rigged
> > > > > > > not to ever close the socket. They appear to work fine on my setup,
> > > > > > > without the hang that you reported earlier.
> > > > ...
> > > > It seems that machines with this new kernel (tried on 10 other machines
> > > > and the original client) may after a few days get into a state where they
> > > > generate huge amounts (10000-100000 pkt/s) of packets to another server they
> > > > use (Linux 2.6.26.62, but the same behaviour with other kernels I tried -
> > > > 2.6.24.7, 2.6.22.19, 2.6.27.10). It seems the packets are quite small, as the
> > > > flow on the server is about 5-10 MB/s. (Probably) each packet generates an answer.
> > > > With this flow it is hard to get more info, and the server is a production
> > > > one, so for now I only know it comes from these clients and ends on TCP port
> > > > 2049 on that server. It kills just this server; communication with the
> > > > previously problematic (FreeBSD) machines is fine now.
> > >
> > > patches. I do not have more info about what's on the network; the only new
> > > thing I can add is that the client is dead, not reacting even to the keyboard or
> > > anything else. Trond, would you have an idea what to try now or what other
> >
> > A binary wireshark dump of the traffic between one such client and the
> > server would help.
>
> I tried to get some data several times, but the client is dead and the
> server is so overloaded that I'm unable to get anything reasonable. I
> tried to insert another machine in front of the client as a bridge, but
> the traffic overloaded it the same way as the server. I will try to figure
> out how to get a traffic dump, but have no other idea for now.

Hi,

as another try, I upgraded from 2.6.27.10 to 2.6.27.13 (and .14) and it
looks like the problem disappeared. Right now I'm running 5 clients with .13
or .14 and tcpdumps for 3 days without any problem. I will try to stop the
tcpdumps, as they can potentially influence the behaviour, and will confirm the
state next week.

Thank you very much for all the work you and others from nfs-linux are
doing!
(I'm going to test pNFS shortly, so if I can be helpful in any
way, let me know.)

Bye
