2012-03-31 16:23:42

by Christoph Bartoschek

Subject: Re: nfsd hangs for more than 120 seconds

Myklebust, Trond wrote:

> On Sat, 2012-03-31 at 13:55 +0200, Christoph Bartoschek wrote:
>> Hi,
>>
>> we use Ubuntu 10.04.3 LTS and often get a traceback for NFS indicating
>> that the daemon hangs for several seconds. At the same time some client
>> machines cannot access the server and have to wait. After some minutes
>> everything goes on.
>>
>> What could cause the problem? Is there anything we should change?
>>
>> Here is the message in the kernel log:
>>
>> [330573.697121] INFO: task nfsd:1376 blocked for more than 120 seconds.
>> [330573.708375] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [330573.730773] nfsd D 0000000000000001 0 1376 2 0x00000000
>> [330573.730776] ffff88061c21bdc0 0000000000000046 0000000000015f00 0000000000015f00
>> [330573.730779] ffff88061c111ad0 ffff88061c21bfd8 0000000000015f00 ffff88061c111700
>> [330573.730781] 0000000000015f00 ffff88061c21bfd8 0000000000015f00 ffff88061c111ad0
>> [330573.730784] Call Trace:
>> [330573.730788] [<ffffffff81559e67>] __mutex_lock_slowpath+0x107/0x190
>> [330573.730796] [<ffffffffa012300f>] ? svc_authorise+0x3f/0x50 [sunrpc]
>
> At a guess, I'd say that your mountd or rpc.svcgssd is probably
> busy/hanging, causing the kernel NFS daemon to hang while it waits to
> authorise a client or user. Typically, you will see the above in the
> case of a kerberos, NIS or ldap outage.
>
> So are you using NIS or ldap-based netgroups in your /etc/exports, or
> are your clients perhaps mounting with sec=krb5?

We are still using NFS3 and NIS.

We are also sometimes seeing the following problem that might be related:

One user suddenly has no access to a directory and its subdirectories on an
NFS share. The user always gets "permission denied". The access bits and
group memberships did not change.

At the same time all other users within the same groups can access the
directory on the same client machine and on other client machines.

After about 15 minutes the problem vanishes by itself. The user no longer
gets "permission denied" and everything is normal.

This happens about twice a week for different users. We see no pattern in
which user is affected and when this happens.
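
To look for a pattern, one option is to pull the hung-task events out of the kernel log and compare their timestamps with the times users report "permission denied". The following is only a rough sketch in Python: the function name `find_hung_tasks` and the regular expression are made up for illustration and match only the message format shown in the log above.

```python
import re
import sys

# Matches lines such as:
# [330573.697121] INFO: task nfsd:1376 blocked for more than 120 seconds.
HUNG_TASK = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+INFO: task (?P<name>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(lines):
    """Return (uptime_seconds, task_name, pid, blocked_seconds) per hung-task line."""
    events = []
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            events.append(
                (
                    float(match["uptime"]),
                    match["name"],
                    int(match["pid"]),
                    int(match["secs"]),
                )
            )
    return events


if __name__ == "__main__":
    # Reads kernel log text from standard input.
    for uptime, name, pid, secs in find_hung_tasks(sys.stdin):
        print(f"{uptime:.0f}s after boot: {name} (pid {pid}) blocked > {secs}s")
```

The kernel log can be piped into the script on standard input. The bracketed value is the time since boot in seconds, so it still has to be converted to wall-clock time before it can be compared with user reports.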

Thanks
Christoph



Subject: Re: nfsd hangs for more than 120 seconds



--- On Sun, 2012/4/1, Christoph Bartoschek wrote:

> Myklebust, Trond wrote:
>
> > On Sat, 2012-03-31 at 13:55 +0200, Christoph Bartoschek wrote:
> >> Hi,
> >>
> >> we use Ubuntu 10.04.3 LTS and often get a traceback for NFS indicating
> >> that the daemon hangs for several seconds. At the same time some client
> >> machines cannot access the server and have to wait. After some minutes
> >> everything goes on.
> >>
> >> What could cause the problem? Is there anything we should change?

> > At a guess, I'd say that your mountd or rpc.svcgssd is probably
> > busy/hanging, causing the kernel NFS daemon to hang while it waits to
> > authorise a client or user. Typically, you will see the above in the
> > case of a kerberos, NIS or ldap outage.
> >
> > So are you using NIS or ldap-based netgroups in your /etc/exports, or
> > are your clients perhaps mounting with sec=krb5?

> We are still using NFS3 and NIS.
>
> We are also sometimes seeing the following problem that might be related:
>
> One user suddenly has no access to a directory and its subdirectories on an
> NFS share. The user always gets "permission denied". The access bits and
> group memberships did not change.
>
> At the same time all other users within the same groups can access the
> directory on the same client machine and on other client machines.
>
> After about 15 minutes the problem vanishes by itself. The user no longer
> gets "permission denied" and everything is normal.
>
> This happens about twice a week for different users. We see no pattern in
> which user is affected and when this happens.

That sounds a lot like an NIS lookup problem. I've been experiencing hangs (not quite 120 seconds, but over a minute at times, and really annoying) with NFS4 even with an export set this way:

/mnt/export/home *.subdomain.localnet(rw,fsid=0,insecure)
/mnt/export/home *.subdomain.localnet(rw,nohide,insecure)

But it's universal, not limited to a single user. When LDAP was sketchy, we used to get a single user or a few users who wouldn't get a complete listing of the uid:gid owners of, say, /home/*. That user would then be unable to access anything that hadn't come in the listing before it failed, until the cache cleared.

But that's been sorted. The only applications that are completely handicapped by the current mystery problem are email clients like Thunderbird and Evolution, and it seems that new requests pass through fine (like a new file browser instance or browsing in Bash works). I've yet to figure that out.
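
If you want to check whether name-service lookups are the slow part, timing them directly on the NFS server might help. Below is a minimal sketch in Python, assuming the lookups go through the normal libc/NSS path (so NIS or LDAP, whichever nsswitch.conf points at). The helper names `timed` and `check_user` are made up for this example, and a local cache such as nscd may hide a slow backend.

```python
import os
import pwd
import sys
import time


def timed(fn, *args):
    """Call fn(*args); return (result or KeyError instance, elapsed seconds)."""
    start = time.monotonic()
    try:
        result = fn(*args)
    except KeyError as exc:  # pwd.getpwnam raises KeyError for unknown users
        result = exc
    return result, time.monotonic() - start


def check_user(username):
    """Time the passwd lookup and the group-list lookup for one user."""
    entry, passwd_secs = timed(pwd.getpwnam, username)
    if isinstance(entry, KeyError):
        return {"user": username, "found": False, "passwd_secs": passwd_secs}
    gids, groups_secs = timed(os.getgrouplist, username, entry.pw_gid)
    return {
        "user": username,
        "found": True,
        "passwd_secs": passwd_secs,
        "groups_secs": groups_secs,
        "gids": gids,
    }


if __name__ == "__main__":
    # Usernames are taken from the command line.
    for name in sys.argv[1:]:
        print(check_user(name))
```

Running it in a loop for an affected user and an unaffected one, during and outside a bad period, should show whether the lookup time or the returned group list differs between the two.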