2009-01-13 15:22:06

by Kasparek Tomas

Subject: Re: [PATCH 0/3] NFS regression in 2.6.26?, "task blocked for more than 120 seconds"

On Mon, Jan 12, 2009 at 08:17:26PM -0500, Trond Myklebust wrote:
> On Mon, 2009-01-12 at 12:40 -0500, Trond Myklebust wrote:
> > On Mon, 2009-01-12 at 10:04 +0100, Kasparek Tomas wrote:
> > > Ok, I found that already. With a static mount the behaviour is the same as
> > > with amd - no new connection is created and the client waits forever
> > > (tens of hours at least).
> >
> > OK. I now appear to be able to reproduce this problem. I should have a
> > fix ready soon.
>
> The attached 2 patches have been tested using a server that was rigged
> not to ever close the socket. They appear to work fine on my setup,
> without the hang that you reported earlier.

After 8 hours it seems to work both with a static mount and with amd. I will
let you know the state again after a few more days.

Thank you very much for your help.

--

Tomas Kasparek, PhD student E-mail: [email protected]
CVT FIT VUT Brno, L127 Web: http://www.fit.vutbr.cz/~kasparek
Bozetechova 1, 612 66 Fax: +420 54114-1270
Brno, Czech Republic Phone: +420 54114-1220

jabber: [email protected]
GPG: 2F1E 1AAF FD3B CFA3 1537 63BD DCBE 18FF A035 53BC



2009-01-18 13:08:40

by Kasparek Tomas

Subject: Re: [PATCH 0/3] NFS regression in 2.6.26?, "task blocked for more than 120 seconds"

On Fri, Jan 16, 2009 at 11:48:02AM +0100, Kasparek Tomas wrote:
> On Tue, Jan 13, 2009 at 04:22:01PM +0100, Kasparek Tomas wrote:
> > On Mon, Jan 12, 2009 at 08:17:26PM -0500, Trond Myklebust wrote:
> > > On Mon, 2009-01-12 at 12:40 -0500, Trond Myklebust wrote:
> > > > On Mon, 2009-01-12 at 10:04 +0100, Kasparek Tomas wrote:
> > > > > Ok, I found that already. With a static mount the behaviour is the same as
> > > > > with amd - no new connection is created and the client waits forever
> > > > > (tens of hours at least).
> > > >
> > > > OK. I now appear to be able to reproduce this problem. I should have a
> > > > fix ready soon.
> > >
> > > The attached 2 patches have been tested using a server that was rigged
> > > not to ever close the socket. They appear to work fine on my setup,
> > > without the hang that you reported earlier.
> >
> > After 8 hours it seems to work both with a static mount and with amd. I will
> > let you know the state again after a few more days.
> >
> > Thank you very much for your help.
>
> Just confirming that the last patch did help and it works well both with
> a static mount and with amd.
>
> Thank you very much for repairing this. Should I do something more, or can
> you propagate the change into vanilla and, if possible, to Greg for stable so
> that it gets into 2.6.27.x?

Hi Trond, for now please do not push your patches to mainline. I am having some
serious trouble with my machines and it is starting to look like the new kernel
may be the cause.

It seems that machines with this new kernel (tried on 10 other machines
and the original client) may, after a few days, get into a state where they
send huge numbers of packets (10000-100000 pkt/s) to another server they
use (Linux 2.6.26.62, but the same behaviour with other kernels I tried:
2.6.24.7, 2.6.22.19, 2.6.27.10). The packets seem to be quite small, as the
flow on the server is about 5-10 MB/s. Each packet probably generates an
answer. With this flow it is hard to get more information, and the server is
a production one, so for now I only know that the traffic comes from these
clients and ends at TCP port 2049 on that server. It kills just this server;
communication with the previously problematic FreeBSD machines is fine now.
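
As a rough sanity check of the packet-size estimate above, the average packet size is simply throughput divided by packet rate. The sketch below assumes 1 MB = 10^6 bytes and uses only the figures quoted in this message; the helper name `avg_packet_bytes` is made up for illustration.

```python
def avg_packet_bytes(bytes_per_second, packets_per_second):
    """Average bytes per packet for a given throughput and packet rate."""
    return bytes_per_second / packets_per_second


MB = 1_000_000  # assumption: decimal megabytes

# Combine the reported throughput range with the reported packet-rate range.
for mb_per_s in (5, 10):
    for pps in (10_000, 100_000):
        size = avg_packet_bytes(mb_per_s * MB, pps)
        print(f"{mb_per_s} MB/s at {pps} pkt/s: about {size:.0f} bytes per packet")
```

At the upper packet rate this works out to 50 to 100 bytes per packet, which is roughly the size of a TCP segment carrying no payload (14 bytes of Ethernet header plus at least 20 bytes each of IPv4 and TCP header).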

Will try to investigate more details.

Thanks.



2009-01-20 15:03:06

by Kasparek Tomas

Subject: Re: [PATCH 0/3] NFS regression in 2.6.26?, "task blocked for more than 120 seconds"

On Sun, Jan 18, 2009 at 02:08:35PM +0100, Kasparek Tomas wrote:
> > > > The attached 2 patches have been tested using a server that was rigged
> > > > not to ever close the socket. They appear to work fine on my setup,
> > > > without the hang that you reported earlier.
> ...
> It seems that machines with this new kernel (tried on 10 other machines
> and the original client) may, after a few days, get into a state where they
> send huge numbers of packets (10000-100000 pkt/s) to another server they
> use (Linux 2.6.26.62, but the same behaviour with other kernels I tried:
> 2.6.24.7, 2.6.22.19, 2.6.27.10). The packets seem to be quite small, as the
> flow on the server is about 5-10 MB/s. Each packet probably generates an
> answer. With this flow it is hard to get more information, and the server is
> a production one, so for now I only know that the traffic comes from these
> clients and ends at TCP port 2049 on that server. It kills just this server;
> communication with the previously problematic FreeBSD machines is fine now.

Hi all,

confirming that the problem is with machines running 2.6.27.10 plus Trond's
patches. I do not have more information about what is on the network; the only
new thing I can add is that the client is dead, not reacting even to the
keyboard or anything else. Trond, would you have any idea what to try now, or
what other information to collect, to get any further with this?

The clients are different machines: Intel and AMD, with different boards,
NICs etc. The server is dead too, but it may recover after some time; I
cannot afford to leave it for longer to see what happens.

Thanks so far.



2009-01-20 15:32:29

by Myklebust, Trond

Subject: Re: [PATCH 0/3] NFS regression in 2.6.26?, "task blocked for more than 120 seconds"

On Tue, 2009-01-20 at 16:03 +0100, Kasparek Tomas wrote:
> On Sun, Jan 18, 2009 at 02:08:35PM +0100, Kasparek Tomas wrote:
> > > > > The attached 2 patches have been tested using a server that was rigged
> > > > > not to ever close the socket. They appear to work fine on my setup,
> > > > > without the hang that you reported earlier.
> > ...
> > It seems that machines with this new kernel (tried on 10 other machines
> > and the original client) may, after a few days, get into a state where they
> > send huge numbers of packets (10000-100000 pkt/s) to another server they
> > use (Linux 2.6.26.62, but the same behaviour with other kernels I tried:
> > 2.6.24.7, 2.6.22.19, 2.6.27.10). The packets seem to be quite small, as the
> > flow on the server is about 5-10 MB/s. Each packet probably generates an
> > answer. With this flow it is hard to get more information, and the server is
> > a production one, so for now I only know that the traffic comes from these
> > clients and ends at TCP port 2049 on that server. It kills just this server;
> > communication with the previously problematic FreeBSD machines is fine now.
>
> Hi all,
>
> confirming that the problem is with machines running 2.6.27.10 plus Trond's
> patches. I do not have more information about what is on the network; the only
> new thing I can add is that the client is dead, not reacting even to the
> keyboard or anything else. Trond, would you have any idea what to try now, or
> what other information to collect, to get any further with this?

A binary wireshark dump of the traffic between one such client and the
server would help.

Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2009-01-16 10:48:09

by Kasparek Tomas

Subject: Re: [PATCH 0/3] NFS regression in 2.6.26?, "task blocked for more than 120 seconds"

On Tue, Jan 13, 2009 at 04:22:01PM +0100, Kasparek Tomas wrote:
> On Mon, Jan 12, 2009 at 08:17:26PM -0500, Trond Myklebust wrote:
> > On Mon, 2009-01-12 at 12:40 -0500, Trond Myklebust wrote:
> > > On Mon, 2009-01-12 at 10:04 +0100, Kasparek Tomas wrote:
> > > > Ok, I found that already. With a static mount the behaviour is the same as
> > > > with amd - no new connection is created and the client waits forever
> > > > (tens of hours at least).
> > >
> > > OK. I now appear to be able to reproduce this problem. I should have a
> > > fix ready soon.
> >
> > The attached 2 patches have been tested using a server that was rigged
> > not to ever close the socket. They appear to work fine on my setup,
> > without the hang that you reported earlier.
>
> After 8 hours it seems to work both with a static mount and with amd. I will
> let you know the state again after a few more days.
>
> Thank you very much for your help.

Just confirming that the last patch did help and it works well both with
a static mount and with amd.

Thank you very much for repairing this. Should I do something more, or can
you propagate the change into vanilla and, if possible, to Greg for stable so
that it gets into 2.6.27.x?



2009-03-03 14:16:29

by Myklebust, Trond

Subject: Re: [PATCH 0/3] NFS regression in 2.6.26?, "task blocked for more than 120 seconds"

On Tue, 2009-03-03 at 13:08 +0100, Kasparek Tomas wrote:
> On Tue, Jan 20, 2009 at 10:32:27AM -0500, Trond Myklebust wrote:
> > A binary wireshark dump of the traffic between one such client and the
> > server would help.
>
> I was finally able to get the tcpdump. I got it from a 2.6.27.19 client, but
> only after several weeks without problems. I include the file and have placed
> it at http://merlin.fit.vutbr.cz/tmp/nfs/dump_kas2_mat.dump_small (I have over
> 1 GB of dump, but it is the same SYN+RST packets all the time). The packet
> rate peaked at 260000 pps from two clients.
>
> This dump was taken on the server after a reset (the server does not respond
> even to the keyboard), before the clients were disconnected/rebooted. As a
> reminder, all clients seem to work well with commit
> e06799f958bf7f9f8fae15f0c6f519953fb0257c reverted.

Yes. I saw that behaviour when testing at Connectathon last week. When
one of the servers I was testing against crashed and later came up
again, the patched client went into that same SYN+RST frenzy. I'm
planning to look at this now that I'm back at home.

Cheers
Trond


2009-03-03 12:08:55

by Kasparek Tomas

Subject: Re: [PATCH 0/3] NFS regression in 2.6.26?, "task blocked for more than 120 seconds"

On Tue, Jan 20, 2009 at 10:32:27AM -0500, Trond Myklebust wrote:
> > > > > > The attached 2 patches have been tested using a server that was rigged
> > > > > > not to ever close the socket. They appear to work fine on my setup,
> > > > > > without the hang that you reported earlier.
> > > ...
> > > It seems that machines with this new kernel (tried on 10 other machines
> > > and the original client) may, after a few days, get into a state where they
> > > send huge numbers of packets (10000-100000 pkt/s) to another server they
> > > use (Linux 2.6.26.62, but the same behaviour with other kernels I tried:
> > > 2.6.24.7, 2.6.22.19, 2.6.27.10). The packets seem to be quite small, as the
> > > flow on the server is about 5-10 MB/s. Each packet probably generates an
> > > answer. With this flow it is hard to get more information, and the server is
> > > a production one, so for now I only know that the traffic comes from these
> > > clients and ends at TCP port 2049 on that server. It kills just this server;
> > > communication with the previously problematic FreeBSD machines is fine now.

> > confirming that the problem is with machines running 2.6.27.10 plus Trond's
> > patches. I do not have more information about what is on the network; the only
> > new thing I can add is that the client is dead, not reacting even to the
> > keyboard or anything else. Trond, would you have any idea what to try now, or
> > what other information to collect, to get any further with this?
>
> A binary wireshark dump of the traffic between one such client and the
> server would help.

I was finally able to get the tcpdump. I got it from a 2.6.27.19 client, but
only after several weeks without problems. I include the file and have placed
it at http://merlin.fit.vutbr.cz/tmp/nfs/dump_kas2_mat.dump_small (I have over
1 GB of dump, but it is the same SYN+RST packets all the time). The packet
rate peaked at 260000 pps from two clients.
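
For anyone who wants to summarise such a capture without opening it in wireshark, here is a minimal sketch that tallies TCP flag combinations for traffic on port 2049. It assumes a classic libpcap file with microsecond timestamps (not pcapng), Ethernet framing, and IPv4; the function name `count_tcp_flags` is made up for illustration and is not part of any tool mentioned in this thread.

```python
import struct
from collections import Counter

NFS_PORT = 2049
# TCP flag bits in the order they are reported.
FLAG_NAMES = ((0x02, "SYN"), (0x04, "RST"), (0x10, "ACK"), (0x01, "FIN"))


def count_tcp_flags(pcap_bytes, port=NFS_PORT):
    """Count TCP flag combinations for packets to or from `port`."""
    magic = pcap_bytes[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"  # file written on a little-endian host
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"  # file written on a big-endian host
    else:
        raise ValueError("not a classic microsecond pcap file")
    linktype = struct.unpack(endian + "I", pcap_bytes[20:24])[0]
    if linktype != 1:
        raise ValueError("only Ethernet captures are handled")

    counts = Counter()
    offset = 24  # skip the 24-byte global header
    while offset + 16 <= len(pcap_bytes):
        # Record header: ts_sec, ts_usec, incl_len, orig_len (4 bytes each).
        incl_len = struct.unpack(endian + "I", pcap_bytes[offset + 8:offset + 12])[0]
        frame = pcap_bytes[offset + 16:offset + 16 + incl_len]
        offset += 16 + incl_len
        # Need Ethernet (14) + minimal IPv4 (20) and an IPv4 ethertype.
        if len(frame) < 34 or frame[12:14] != b"\x08\x00":
            continue
        if frame[23] != 6:  # IPv4 protocol field: 6 means TCP
            continue
        ihl = (frame[14] & 0x0F) * 4  # IPv4 header length in bytes
        tcp = frame[14 + ihl:]
        if len(tcp) < 14:
            continue
        sport, dport = struct.unpack("!HH", tcp[:4])
        if port not in (sport, dport):
            continue
        flags = tcp[13]
        name = "+".join(n for bit, n in FLAG_NAMES if flags & bit) or "none"
        counts[name] += 1
    return counts
```

A possible use is `count_tcp_flags(open("dump_kas2_mat.dump_small", "rb").read())`, which reads the whole file into memory, so it only suits a small excerpt like the one linked above.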

This dump was taken on the server after a reset (the server does not respond
even to the keyboard), before the clients were disconnected/rebooted. As a
reminder, all clients seem to work well with commit
e06799f958bf7f9f8fae15f0c6f519953fb0257c reverted.

(This is just a hint; I do not mean that the patch is wrong.)

Thanks in advance
