2009-02-16 11:37:32

by Arto Jantunen

Subject: Timeout issue (similar to bugs 11061 and 11154), bisected


(I'm not subscribed, so please CC me on any replies)

I seem to have hit an NFS bug while upgrading a machine from Debian
Etch to Debian Lenny. I have an NFS server running FreeBSD 7.0 RC1 and
a bunch of clients running Linux. The ones running kernel 2.6.18 work
perfectly, as do the ones running 2.6.24. The one I upgraded to 2.6.26
fails. After 5-15 minutes of working normally the mount dies and I get
the usual "nfs: server <server> not responding, still trying" in
dmesg. The only way I have found to get the mount back is umount -f &&
mount; waiting does not bring it back.

I have tested quite a few different kernel versions, and starting
from 2.6.25 and ending at last week's git tree they all fail in the
same way. Bisecting tracks the problem to commit
e06799f958bf7f9f8fae15f0c6f519953fb0257c

I originally thought that it was the same as bug 11154, but the
patches attached to that bug do not fix this issue.

Any thoughts, patches, ideas?

--
Arto Jantunen


2009-02-16 13:04:25

by Trond Myklebust

Subject: Re: Timeout issue (similar to bugs 11061 and 11154), bisected

On Mon, 2009-02-16 at 13:11 +0200, Arto Jantunen wrote:
> (I'm not subscribed, so please CC me on any replies)
>
> I seem to have hit an NFS bug while upgrading a machine from Debian
> Etch to Debian Lenny. I have an NFS server running FreeBSD 7.0 RC1 and
> a bunch of clients running Linux. The ones running kernel 2.6.18 work
> perfectly, as do the ones running 2.6.24. The one I upgraded to 2.6.26
> fails. After 5-15 minutes of working normally the mount dies and I get
> the usual "nfs: server <server> not responding, still trying" in
> dmesg. The only way I have found to get the mount back is umount -f &&
> mount; waiting does not bring it back.
>
> I have tested quite a few different kernel versions, and starting
> from 2.6.25 and ending at last week's git tree they all fail in the
> same way. Bisecting tracks the problem to commit
> e06799f958bf7f9f8fae15f0c6f519953fb0257c
>
> I originally thought that it was the same as bug 11154, but the
> patches attached to that bug do not fix this issue.
>
> Any thoughts, patches, ideas?

That looks like the known problem with the NFS server failing to close
connections in a timely manner. There is a fix for this in

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git&a=commitdiff&h=69b6ba3712b796a66595cfaf0a5ab4dfe1cf964a

There is also a client-side patch that makes the client more robust
when it hits a buggy server by causing it to do the equivalent of a
linger2 timeout. That patch has not yet been merged into mainline, but
I've attached it below together with a follow-up patch that makes the
timeout configurable.

Cheers
Trond


Attachments:
linux-2.6.28-100-add_tcp_linger.dif (8.97 kB)
linux-2.6.28-101-add_tcp_linger_sysctl.dif (1.86 kB)
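
As background on the "linger2 timeout" mentioned above: on Linux, the
TCP_LINGER2 socket option caps how long an orphaned socket may stay in
FIN_WAIT2 when the peer never finishes closing the connection, and it
overrides the system-wide tcp_fin_timeout setting for that one socket.
The sketch below is only a user-space illustration of that concept. It
is not the attached kernel patch, and the helper names
make_socket_with_linger2 and linger2_of are made up for this example.

```python
import socket


def make_socket_with_linger2(seconds):
    """Return a TCP socket with its FIN_WAIT2 lifetime capped (Linux only).

    Hypothetical helper for illustration; not part of the attached patches.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_LINGER2 overrides /proc/sys/net/ipv4/tcp_fin_timeout for this
    # socket, so a peer that never sends its FIN cannot pin the
    # connection in FIN_WAIT2 for longer than `seconds` after close().
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_LINGER2, seconds)
    return sock


def linger2_of(sock):
    """Read back the FIN_WAIT2 lifetime, in seconds, set on the socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_LINGER2)


if __name__ == "__main__":
    s = make_socket_with_linger2(5)
    print("FIN_WAIT2 lifetime for this socket:", linger2_of(s), "seconds")
    s.close()
```

The in-kernel RPC client does not go through setsockopt(), so treat this
purely as a picture of the behaviour the client-side patch is said to
emulate.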

2009-02-17 10:39:05

by Arto Jantunen

Subject: Re: Timeout issue (similar to bugs 11061 and 11154), bisected

Trond Myklebust <[email protected]> writes:

> On Mon, 2009-02-16 at 13:11 +0200, Arto Jantunen wrote:
>> (I'm not subscribed, so please CC me on any replies)
>>
>> I seem to have hit an NFS bug while upgrading a machine from Debian
>> Etch to Debian Lenny. I have an NFS server running FreeBSD 7.0 RC1 and
>> a bunch of clients running Linux. The ones running kernel 2.6.18 work
>> perfectly, as do the ones running 2.6.24. The one I upgraded to 2.6.26
>> fails. After 5-15 minutes of working normally the mount dies and I get
>> the usual "nfs: server <server> not responding, still trying" in
>> dmesg. The only way I have found to get the mount back is umount -f &&
>> mount; waiting does not bring it back.
>>
>> I have tested quite a few different kernel versions, and starting
>> from 2.6.25 and ending at last week's git tree they all fail in the
>> same way. Bisecting tracks the problem to commit
>> e06799f958bf7f9f8fae15f0c6f519953fb0257c
>>
>> I originally thought that it was the same as bug 11154, but the
>> patches attached to that bug do not fix this issue.
>>
>> Any thoughts, patches, ideas?
>
> That looks like the known problem with the NFS server failing to close
> connections in a timely manner. There is a fix for this in
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git&a=commitdiff&h=69b6ba3712b796a66595cfaf0a5ab4dfe1cf964a
>
> There is also a client-side patch that makes the client more robust
> when it hits a buggy server by causing it to do the equivalent of a
> linger2 timeout. That patch has not yet been merged into mainline, but
> I've attached it below together with a follow-up patch that makes the
> timeout configurable.

The client-side patch you attached hides the problem on the server;
after applying it, the mount sticks around. As previously discussed,
the server is running an apparently buggy version of FreeBSD, and I'd
rather not touch it right now since it is in production.

Thanks for your fast response.

--
Arto Jantunen