2010-11-11 02:35:22

by Simon Kirby

Subject: NFS client/sunrpc getting stuck on 2.6.36

Still seeing all sorts of boxes fall over with 2.6.35 and 2.6.36 NFS.
Unfortunately, it doesn't happen all the time...only certain load
patterns seem to start it off. Once it starts, I can't find a way to
make it recover without rebooting.

In this case, tcpdump and rpc/nfs debug logs show a repeating command,
but only 3-4 commands are being sent per second, even though all slots
are in use, there is a backlog, 300 processes in rpc_wait_bit_killable,
and the debug logs show that responses are coming back immediately.
Weird congestion/backoff issue?

Over a few seconds, the only numbers changing from "nfsstat" are "calls",
"authrefresh", and "getattr". Debug level 1 for nfs and rpc show this
repeating:

NFS: permission(0:4c/5284877), mask=0x1, res=0
RPC: 5778 reserved req ffff88012ae042d0 xid 64030e74
RPC: 5778 xprt_prepare_transmit
RPC: 5778 xprt_transmit(96)
RPC: 5778 xmit complete
RPC: 5778 xid 64030e74 complete (112 bytes received)
RPC: 5778 release request ffff88012ae042d0
NFS: nfs_update_inode(0:4c/3247737045 ct=1 info=0x7e7f)
NFS: permission(0:4c/3247737045), mask=0x24, res=0
NFS: dentry_delete(tmp/t.sh, 8)
NFS: dentry_delete(tmp/9040-16697.done, 8)
NFS: dentry_delete(tmp/9040-29371.done, 8)

The only things that change between repeats are the pid and xid.

Debug level 255 for nfs and rpc:

NFS call getattr
RPC: new task initialized, procpid 15460
RPC: allocated task ffff8800c09f7e00
RPC: 5321 __rpc_execute flags=0x80
RPC: 5321 call_start nfs3 proc GETATTR (sync)
RPC: 5321 call_reserve (status 0)
RPC: 5321 reserved req ffff88012ae042d0 xid 9e010e74
RPC: 5321 call_reserveresult (status 0)
RPC: 5321 call_refresh (status 0)
RPC: 5321 looking up UNIX cred
RPC: looking up UNIX cred
RPC: 5321 refreshing UNIX cred ffff880011ec1300
RPC: 5321 call_refreshresult (status 0)
RPC: 5321 call_allocate (status 0)
RPC: 5321 allocated buffer of size 552 at ffff880129b1d000
RPC: 5321 call_bind (status 0)
RPC: 5321 call_connect xprt ffff88012a74d000 is connected
RPC: 5321 call_transmit (status 0)
RPC: 5321 xprt_prepare_transmit
RPC: 5321 rpc_xdr_encode (status 0)
RPC: 5321 marshaling UNIX cred ffff880011ec1300
RPC: 5321 using AUTH_UNIX cred ffff880011ec1300 to wrap rpc data
RPC: 5321 xprt_transmit(96)
RPC: xs_tcp_send_request(96) = 96
RPC: 5321 xmit complete
RPC: 5321 sleep_on(queue "xprt_pending" time 4494861481)
RPC: xs_tcp_data_ready...
RPC: xs_tcp_data_recv started
RPC: reading TCP record fragment of length 112
RPC: reading XID (4 bytes)
RPC: reading reply for XID 9e010e74
RPC: reading CALL/REPLY flag (4 bytes)
RPC: read reply XID 9e010e74
RPC: 5321 added to queue ffff88012a74d3f8 "xprt_pending"
RPC: 5321 setting alarm for 900 ms
RPC: XID 9e010e74 read 104 bytes
RPC: xprt = ffff88012a74d000, tcp_copied = 112, tcp_offset = 112, tcp_reclen = 112
RPC: 5321 xid 9e010e74 complete (112 bytes received)
RPC: 5321 __rpc_wake_up_task (now 4494861481)
RPC: 5321 disabling timer
RPC: 5321 removed from queue ffff88012a74d3f8 "xprt_pending"
RPC: __rpc_wake_up_task done
RPC: xs_tcp_data_recv done
RPC: wake_up_next(ffff88012a74d2e8 "xprt_resend")
RPC: wake_up_next(ffff88012a74d1d8 "xprt_sending")
RPC: 5321 call_status (status 112)
RPC: 5321 call_decode (status 112)
RPC: 5321 validating UNIX cred ffff880011ec1300
RPC: 5321 using AUTH_UNIX cred ffff880011ec1300 to unwrap rpc data
RPC: 5321 call_decode result 0
RPC: 5321 return 0, status 0
RPC: 5321 release task
RPC: freeing buffer of size 552 at ffff880129b1d000
RPC: 5321 release request ffff88012ae042d0
RPC: wake_up_next(ffff88012a74d508 "xprt_backlog")
RPC: rpc_release_client(ffff88012016e400)
RPC: 5321 freeing task
NFS reply getattr: 0
NFS: nfs_update_inode(0:4c/3247737045 ct=1 info=0x7e7f)
NFS: (0:4c/3247737045) revalidation complete
NFS: nfs_lookup_revalidate(HWD/tmp) is valid
RPC: looking up Generic cred
NFS: permission(0:4c/3247737045), mask=0x24, res=0
NFS: open dir(HWD/tmp)
RPC: looking up Generic cred
NFS: readdir(HWD/tmp) starting at cookie 0
NFS: readdir_search_pagecache() searching for offset 0
NFS: find_dirent_page: searching page 0 for target 0
NFS: found cookie 4 at index 0
NFS: find_dirent_page: returns 0
NFS: readdir_search_pagecache: returns 0
NFS: nfs_do_filldir() filling starting @ cookie 4
NFS: dentry_delete(tmp/t.sh, 8)
NFS: dentry_delete(tmp/9040-16697.done, 8)
NFS: dentry_delete(tmp/9040-29371.done, 8)
NFS: nfs_do_filldir() filling ended @ cookie 512; returning = 0
NFS: readdir(HWD/tmp) returns 0
NFS: readdir(HWD/tmp) starting at cookie 5
NFS: readdir_search_pagecache() searching for cookie 512
NFS: find_dirent_page: searching page 0 for target 512
NFS: find_dirent: examining cookie 4
NFS: find_dirent: examining cookie 6
NFS: find_dirent: examining cookie 11
NFS: find_dirent: examining cookie 18
NFS: find_dirent: examining cookie 512
NFS: find_dirent_page: returns -523
NFS: readdir_search_pagecache: returns -523
NFS: readdir(HWD/tmp) returns 0
RPC: looking up Generic cred
NFS: permission(0:4c/5284877), mask=0x1, res=0
NFS: revalidating (0:4c/3247737045)

900ms matches the probably-silly nfs mount settings we're currently using:

rw,hard,intr,tcp,timeo=9,retrans=3,rsize=8192,wsize=8192
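
(For reference, timeo is in tenths of a second, so timeo=9 is where the
0.9s "setting alarm for 900 ms" in the debug output comes from. With the
nfs(5) default for TCP mounts, the otherwise-identical option string would
look something like:

rw,hard,intr,tcp,timeo=600,retrans=3,rsize=8192,wsize=8192

with everything else left as-is.)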

Full kernel log here: http://0x.ca/sim/ref/2.6.36_stuck_nfs/

Simon-


2010-11-21 19:54:25

by Trond Myklebust

Subject: Re: NFS client/sunrpc getting stuck on 2.6.36

On Sat, 2010-11-20 at 22:40 -0800, Simon Kirby wrote:
> On Fri, Nov 19, 2010 at 05:17:19PM -0500, Trond Myklebust wrote:
>
> > The only change in 2.6.37-rcX I can think of that might have caused an
> > issue here would be Bryan's readdir changes.
> >
> > If you can reproduce the above error condition, then could you try
> > turning off readdirplus (using the 'nordirplus' mount option) and seeing
> > if that makes a difference?
>
> So, of the three boxes now running 2.6.37-rc2-git3, I have two mounted
> with nordirplus. One, which I booted with "nordirplus", has never logged
> the error. Another, whose fstab I haven't changed, is still logging this
> fairly often (nfsstat shows it has only made readdirplus calls):
>
> [219158.456270] NFS: server 10.10.52.230 error: fileid changed
> [219158.456273] fsid 0:4e: expected fileid 0x57bcf57c, got 0x57bd5442
>
> Finally, on the last one I had remounted all of the NFS mount points with
> that option, and there have been no errors since.

I just posted 8 patches to fix up bugs and regressions in the readdir
code. Can you try with some of these?

Cheers
Trond


2010-11-11 08:49:17

by Simon Kirby

Subject: Re: NFS client/sunrpc getting stuck on 2.6.36

On Thu, Nov 11, 2010 at 01:22:47PM +0800, Trond Myklebust wrote:

> > 900ms matches the probably-silly nfs mount settings we're currently using:
> >
> > rw,hard,intr,tcp,timeo=9,retrans=3,rsize=8192,wsize=8192
> >
> > Full kernel log here: http://0x.ca/sim/ref/2.6.36_stuck_nfs/
>
> timeo=9 is a completely insane retransmit value for a tcp connection.
> Please use the default timeo=600, and all will work correctly.

So this setting is expected to produce a usually-working setup that
occasionally fails? :)

A short history of our experiences:

We first tried to use TCP mounts but found that, in some cases,
reconnection time to NFS servers after an NFS server crash and fail-over
to the backup node would take upwards of 10 minutes, even with timeouts
configured much lower than that. This was with 2.6.32-era NFS clients.
Clients with little or no activity to the affected mounts would normally
reconnect immediately, but active clients would usually hit this hanging
issue. Different client hosts would return to operation all at different
times (some took 5 minutes, some took 10 minutes, for example). I was
not able to reproduce this outside of production or I would have reported
it with steps to reproduce.

So, UDP worked around this issue for us, and with UDP I think this timeo
value is reasonable. After seeing these stuck NFS issues, we replaced "udp"
with "tcp" without changing anything else to see if the problem stopped,
but we seem to be getting the same result.

We'll try with the default timeo and see if the problem recurs. Perhaps
the TCP reconnection times have improved or maybe there are other options
we can set to improve that case.

Simon-

2010-11-21 06:43:01

by Simon Kirby

Subject: Re: NFS client/sunrpc getting stuck on 2.6.36

On Fri, Nov 19, 2010 at 06:17:25PM -0500, Trond Myklebust wrote:

> On Fri, 2010-11-19 at 14:58 -0800, Simon Kirby wrote:
> > On Fri, Nov 19, 2010 at 05:17:19PM -0500, Trond Myklebust wrote:
> > > So what were all the
> > >
> > > 'lockd: server 10.10.52.xxx not responding, still trying'
> > >
> > > messages all about? There were quite a few of them for a number of
> > > different servers in the moments leading up to the hang. Could it be a
> > > problem with the switch these clients are attached to?
> >
> > If it were a switch problem, would we see port 2049 socket backlogs with
> > netstat -tan or ss -tan? I haven't seen this at all when the problem
> > occurs. All of the sockets are idle (and usually it seems to close all of
> > them except the one to the server that all of the slots are stuck on).
> > tcpdump shows no problems, just very slow request rates that match the
> > rpc/nfs debugging.
>
> No retransmits that might indicate dropped packets at the switch? How
> fast are the tcp ACKs from the server being returned?

That tcpdump I sent included the ACKs, which all looked normal.
Unfortunately, we haven't seen the problem again yet. Is your "Fix an
infinite loop in call_refresh/call_refreshresult" patch possibly related?

> > If the rpc slots are stuck full, would that cause lockd to print those
> > timeouts?
>
> Yes. That would be the only kind of event that would trigger these
> messages.

And in this case, rpcinfo -t and -u should look normal, I would assume,
unless there is a switch/network issue?

Still waiting for it to occur again to try those commands.

Simon-

2010-11-24 15:05:59

by Trond Myklebust

Subject: Re: NFS client/sunrpc getting stuck on 2.6.36

On Tue, 2010-11-23 at 21:18 -0800, Simon Kirby wrote:
> On Sun, Nov 21, 2010 at 02:54:20PM -0500, Trond Myklebust wrote:
>
> > > [219158.456270] NFS: server 10.10.52.230 error: fileid changed
> > > [219158.456273] fsid 0:4e: expected fileid 0x57bcf57c, got 0x57bd5442
> > >
> > > Finally, on the last one I had remounted all of the NFS mount points with
> > > that option, and there have been no errors since.
> >
> > I just posted 8 patches to fix up bugs and regressions in the readdir
> > code. Can you try with some of these?
>
> 2.6.37-rc3 plus those 8 patches was running on one box for just over a
> day when it just crashed for some reason (got rebooted before I could see
> the console). Anyway, while it was running, it was still logging "fileid
> changed" lines as before:
>
> [84197.182497] NFS: server 10.10.52.228 error: fileid changed
> [84197.182499] fsid 0:47: expected fileid 0x96fe0343, got 0x4926a389
> [85226.297721] NFS: server 10.10.52.227 error: fileid changed
> [85226.297724] fsid 0:39: expected fileid 0x254d3a97, got 0x480014a6
>
> Simon-

Please remind me which NFS version you are running: is this v3 or v4?
Also, could you check which files might have inode numbers 2533229379
and/or 1227269001 on server 10.10.52.228, and 625818263 and/or 1207964838
on server 10.10.52.227?
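
(Those decimal values are just the hex fileids from the log converted; on
the servers, something along these lines should locate the files, with
/export standing in for the actual exported filesystem root:

find /export -xdev \( -inum 2533229379 -o -inum 1227269001 \)

on 10.10.52.228, and likewise with the other two numbers on 10.10.52.227.)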

Cheers
Trond


2010-11-11 05:22:54

by Trond Myklebust

Subject: Re: NFS client/sunrpc getting stuck on 2.6.36

On Wed, 2010-11-10 at 18:35 -0800, Simon Kirby wrote:
> Still seeing all sorts of boxes fall over with 2.6.35 and 2.6.36 NFS.
> Unfortunately, it doesn't happen all the time...only certain load
> patterns seem to start it off. Once it starts, I can't find a way to
> make it recover without rebooting.
>
> In this case, tcpdump and rpc/nfs debug logs show a repeating command,
> but only 3-4 commands are being sent per second, even though all slots
> are in use, there is a backlog, 300 processes in rpc_wait_bit_killable,
> and the debug logs show that responses are coming back immediately.
> Weird congestion/backoff issue?
>
> Over a few seconds, the only numbers changing from "nfsstat" are "calls",
> "authrefresh", and "getattr". Debug level 1 for nfs and rpc show this
> repeating:
>
> NFS: permission(0:4c/5284877), mask=0x1, res=0
> RPC: 5778 reserved req ffff88012ae042d0 xid 64030e74
> RPC: 5778 xprt_prepare_transmit
> RPC: 5778 xprt_transmit(96)
> RPC: 5778 xmit complete
> RPC: 5778 xid 64030e74 complete (112 bytes received)
> RPC: 5778 release request ffff88012ae042d0
> NFS: nfs_update_inode(0:4c/3247737045 ct=1 info=0x7e7f)
> NFS: permission(0:4c/3247737045), mask=0x24, res=0
> NFS: dentry_delete(tmp/t.sh, 8)
> NFS: dentry_delete(tmp/9040-16697.done, 8)
> NFS: dentry_delete(tmp/9040-29371.done, 8)
>
> The only things that change between repeats are the pid and xid.
>
> Debug level 255 for nfs and rpc:
>
> NFS call getattr
> RPC: new task initialized, procpid 15460
> RPC: allocated task ffff8800c09f7e00
> RPC: 5321 __rpc_execute flags=0x80
> RPC: 5321 call_start nfs3 proc GETATTR (sync)
> RPC: 5321 call_reserve (status 0)
> RPC: 5321 reserved req ffff88012ae042d0 xid 9e010e74
> RPC: 5321 call_reserveresult (status 0)
> RPC: 5321 call_refresh (status 0)
> RPC: 5321 looking up UNIX cred
> RPC: looking up UNIX cred
> RPC: 5321 refreshing UNIX cred ffff880011ec1300
> RPC: 5321 call_refreshresult (status 0)
> RPC: 5321 call_allocate (status 0)
> RPC: 5321 allocated buffer of size 552 at ffff880129b1d000
> RPC: 5321 call_bind (status 0)
> RPC: 5321 call_connect xprt ffff88012a74d000 is connected
> RPC: 5321 call_transmit (status 0)
> RPC: 5321 xprt_prepare_transmit
> RPC: 5321 rpc_xdr_encode (status 0)
> RPC: 5321 marshaling UNIX cred ffff880011ec1300
> RPC: 5321 using AUTH_UNIX cred ffff880011ec1300 to wrap rpc data
> RPC: 5321 xprt_transmit(96)
> RPC: xs_tcp_send_request(96) = 96
> RPC: 5321 xmit complete
> RPC: 5321 sleep_on(queue "xprt_pending" time 4494861481)
> RPC: xs_tcp_data_ready...
> RPC: xs_tcp_data_recv started
> RPC: reading TCP record fragment of length 112
> RPC: reading XID (4 bytes)
> RPC: reading reply for XID 9e010e74
> RPC: reading CALL/REPLY flag (4 bytes)
> RPC: read reply XID 9e010e74
> RPC: 5321 added to queue ffff88012a74d3f8 "xprt_pending"
> RPC: 5321 setting alarm for 900 ms
> RPC: XID 9e010e74 read 104 bytes
> RPC: xprt = ffff88012a74d000, tcp_copied = 112, tcp_offset = 112, tcp_reclen = 112
> RPC: 5321 xid 9e010e74 complete (112 bytes received)
> RPC: 5321 __rpc_wake_up_task (now 4494861481)
> RPC: 5321 disabling timer
> RPC: 5321 removed from queue ffff88012a74d3f8 "xprt_pending"
> RPC: __rpc_wake_up_task done
> RPC: xs_tcp_data_recv done
> RPC: wake_up_next(ffff88012a74d2e8 "xprt_resend")
> RPC: wake_up_next(ffff88012a74d1d8 "xprt_sending")
> RPC: 5321 call_status (status 112)
> RPC: 5321 call_decode (status 112)
> RPC: 5321 validating UNIX cred ffff880011ec1300
> RPC: 5321 using AUTH_UNIX cred ffff880011ec1300 to unwrap rpc data
> RPC: 5321 call_decode result 0
> RPC: 5321 return 0, status 0
> RPC: 5321 release task
> RPC: freeing buffer of size 552 at ffff880129b1d000
> RPC: 5321 release request ffff88012ae042d0
> RPC: wake_up_next(ffff88012a74d508 "xprt_backlog")
> RPC: rpc_release_client(ffff88012016e400)
> RPC: 5321 freeing task
> NFS reply getattr: 0
> NFS: nfs_update_inode(0:4c/3247737045 ct=1 info=0x7e7f)
> NFS: (0:4c/3247737045) revalidation complete
> NFS: nfs_lookup_revalidate(HWD/tmp) is valid
> RPC: looking up Generic cred
> NFS: permission(0:4c/3247737045), mask=0x24, res=0
> NFS: open dir(HWD/tmp)
> RPC: looking up Generic cred
> NFS: readdir(HWD/tmp) starting at cookie 0
> NFS: readdir_search_pagecache() searching for offset 0
> NFS: find_dirent_page: searching page 0 for target 0
> NFS: found cookie 4 at index 0
> NFS: find_dirent_page: returns 0
> NFS: readdir_search_pagecache: returns 0
> NFS: nfs_do_filldir() filling starting @ cookie 4
> NFS: dentry_delete(tmp/t.sh, 8)
> NFS: dentry_delete(tmp/9040-16697.done, 8)
> NFS: dentry_delete(tmp/9040-29371.done, 8)
> NFS: nfs_do_filldir() filling ended @ cookie 512; returning = 0
> NFS: readdir(HWD/tmp) returns 0
> NFS: readdir(HWD/tmp) starting at cookie 5
> NFS: readdir_search_pagecache() searching for cookie 512
> NFS: find_dirent_page: searching page 0 for target 512
> NFS: find_dirent: examining cookie 4
> NFS: find_dirent: examining cookie 6
> NFS: find_dirent: examining cookie 11
> NFS: find_dirent: examining cookie 18
> NFS: find_dirent: examining cookie 512
> NFS: find_dirent_page: returns -523
> NFS: readdir_search_pagecache: returns -523
> NFS: readdir(HWD/tmp) returns 0
> RPC: looking up Generic cred
> NFS: permission(0:4c/5284877), mask=0x1, res=0
> NFS: revalidating (0:4c/3247737045)
>
> 900ms matches the probably-silly nfs mount settings we're currently using:
>
> rw,hard,intr,tcp,timeo=9,retrans=3,rsize=8192,wsize=8192
>
> Full kernel log here: http://0x.ca/sim/ref/2.6.36_stuck_nfs/

timeo=9 is a completely insane retransmit value for a tcp connection.
Please use the default timeo=600, and all will work correctly.

Cheers
Trond


2010-11-19 20:20:08

by Simon Kirby

Subject: Re: NFS client/sunrpc getting stuck on 2.6.36

On Thu, Nov 11, 2010 at 01:22:47PM +0800, Trond Myklebust wrote:

> On Wed, 2010-11-10 at 18:35 -0800, Simon Kirby wrote:
> > Still seeing all sorts of boxes fall over with 2.6.35 and 2.6.36 NFS.
> > Unfortunately, it doesn't happen all the time...only certain load
> > patterns seem to start it off. Once it starts, I can't find a way to
> > make it recover without rebooting.
> >...
> > NFS: permission(0:4c/5284877), mask=0x1, res=0
> > NFS: revalidating (0:4c/3247737045)
> >
> > 900ms matches the probably-silly nfs mount settings we're currently using:
> >
> > rw,hard,intr,tcp,timeo=9,retrans=3,rsize=8192,wsize=8192
> >
> > Full kernel log here: http://0x.ca/sim/ref/2.6.36_stuck_nfs/
>
> timeo=9 is a completely insane retransmit value for a tcp connection.
>
> Please use the default timeo=600, and all will work correctly.

Ok, so, we were running with timeo=300 instead on a number of servers,
and we were still seeing the problem on 2.6.36. I've uploaded a new
kernel log (lsh1051) here:

http://0x.ca/sim/ref/2.6.36_stuck_nfs/

The log starts out with the hung task warnings occurring after
otherwise-normal operation. Once I noticed, I set rpc/nfs_debug to 1,
and then later set it to 255.
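
(For anyone following along, those are the usual sunrpc debug sysctls, set
roughly like this:

echo 1 > /proc/sys/sunrpc/rpc_debug
echo 1 > /proc/sys/sunrpc/nfs_debug
echo 255 > /proc/sys/sunrpc/rpc_debug     # later, full verbosity
echo 255 > /proc/sys/sunrpc/nfs_debug

or with the rpcdebug tool from nfs-utils.)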

Since several servers were stuck at the same time and we were losing
quorum, I decided to try something more drastic and booted into
2.6.37-rc2-git3. This kernel hasn't got stuck yet! However, it's
spitting out some new errors which may be worth looking into:

[ 1574.088812] NFS: server 10.10.52.222 error: fileid changed
[ 1574.088814] fsid 0:18: expected fileid 0x4c081940, got 0x4c081950
[11340.409447] NFS: server 10.10.52.228 error: fileid changed
[11340.409450] fsid 0:45: expected fileid 0x696ff82, got 0x16a98bd7
[20832.579912] NFS: server 10.10.52.225 error: fileid changed
[20832.579914] fsid 0:2a: expected fileid 0x8c67ebab, got 0x8c6811e5
[32775.957351] NFS: server 10.10.52.230 error: fileid changed
[32775.957354] fsid 0:52: expected fileid 0x919041fd, got 0x93f1962d

These are also in the same kernel log. The error code isn't new, so
something else seems to have changed to cause it.

Simon-

2010-11-19 21:24:53

by Trond Myklebust

Subject: Re: NFS client/sunrpc getting stuck on 2.6.36

On Fri, 2010-11-19 at 12:20 -0800, Simon Kirby wrote:
> On Thu, Nov 11, 2010 at 01:22:47PM +0800, Trond Myklebust wrote:
>
> > On Wed, 2010-11-10 at 18:35 -0800, Simon Kirby wrote:
> > > Still seeing all sorts of boxes fall over with 2.6.35 and 2.6.36 NFS.
> > > Unfortunately, it doesn't happen all the time...only certain load
> > > patterns seem to start it off. Once it starts, I can't find a way to
> > > make it recover without rebooting.
> > >...
> > > NFS: permission(0:4c/5284877), mask=0x1, res=0
> > > NFS: revalidating (0:4c/3247737045)
> > >
> > > 900ms matches the probably-silly nfs mount settings we're currently using:
> > >
> > > rw,hard,intr,tcp,timeo=9,retrans=3,rsize=8192,wsize=8192
> > >
> > > Full kernel log here: http://0x.ca/sim/ref/2.6.36_stuck_nfs/
> >
> > timeo=9 is a completely insane retransmit value for a tcp connection.
> >
> > Please use the default timeo=600, and all will work correctly.
>
> Ok, so, we were running with timeo=300 instead on a number of servers,
> and we were still seeing the problem on 2.6.36. I've uploaded a new
> kernel log (lsh1051) here:
>
> http://0x.ca/sim/ref/2.6.36_stuck_nfs/
>
> The log starts out with the hung task warnings occurring after
> otherwise-normal operation. Once I noticed, I set rpc/nfs_debug to 1,
> and then later set it to 255.

Were the NFS servers hung at this point? If so, then that probably
suffices to explain the hung task warnings (which would be false
positives) as being due to the page cache waiting to lock pages on which
I/O is being performed.

> Since several servers were stuck at the same time and we were losing
> quorum, I decided to try something more drastic and booted into
> 2.6.37-rc2-git3. This kernel hasn't got stuck yet! However, it's
> spitting out some new errors which may be worth looking into:
>
> [ 1574.088812] NFS: server 10.10.52.222 error: fileid changed
> [ 1574.088814] fsid 0:18: expected fileid 0x4c081940, got 0x4c081950
> [11340.409447] NFS: server 10.10.52.228 error: fileid changed
> [11340.409450] fsid 0:45: expected fileid 0x696ff82, got 0x16a98bd7
> [20832.579912] NFS: server 10.10.52.225 error: fileid changed
> [20832.579914] fsid 0:2a: expected fileid 0x8c67ebab, got 0x8c6811e5
> [32775.957351] NFS: server 10.10.52.230 error: fileid changed
> [32775.957354] fsid 0:52: expected fileid 0x919041fd, got 0x93f1962d
>
> These are also in the same kernel log. The error code isn't new, so
> something else seems to have changed to cause it.

These indicate server bugs: your failover event appears to have caused
the inode numbers to have changed on a number of files. This is
something that shouldn't happen in a normal NFS environment, and so the
client prints out the above warnings...

Trond


2010-11-19 22:58:05

by Simon Kirby

Subject: Re: NFS client/sunrpc getting stuck on 2.6.36

On Fri, Nov 19, 2010 at 05:17:19PM -0500, Trond Myklebust wrote:

> On Fri, 2010-11-19 at 14:03 -0800, Simon Kirby wrote:
> > On Fri, Nov 19, 2010 at 04:24:48PM -0500, Trond Myklebust wrote:
> >
> > > On Fri, 2010-11-19 at 12:20 -0800, Simon Kirby wrote:
> > > > On Thu, Nov 11, 2010 at 01:22:47PM +0800, Trond Myklebust wrote:
> > > >
> > > > > On Wed, 2010-11-10 at 18:35 -0800, Simon Kirby wrote:
> > > > > > Still seeing all sorts of boxes fall over with 2.6.35 and 2.6.36 NFS.
> > > > > > Unfortunately, it doesn't happen all the time...only certain load
> > > > > > patterns seem to start it off. Once it starts, I can't find a way to
> > > > > > make it recover without rebooting.
> > > > > >...
> > > > > > NFS: permission(0:4c/5284877), mask=0x1, res=0
> > > > > > NFS: revalidating (0:4c/3247737045)
> > > > > >
> > > > > > 900ms matches the probably-silly nfs mount settings we're currently using:
> > > > > >
> > > > > > rw,hard,intr,tcp,timeo=9,retrans=3,rsize=8192,wsize=8192
> > > > > >
> > > > > > Full kernel log here: http://0x.ca/sim/ref/2.6.36_stuck_nfs/
> > > > >
> > > > > timeo=9 is a completely insane retransmit value for a tcp connection.
> > > > >
> > > > > Please use the default timeo=600, and all will work correctly.
> > > >
> > > > Ok, so, we were running with timeo=300 instead on a number of servers,
> > > > and we were still seeing the problem on 2.6.36. I've uploaded a new
> > > > kernel log (lsh1051) here:
> > > >
> > > > http://0x.ca/sim/ref/2.6.36_stuck_nfs/
> > > >
> > > > The log starts out with the hung task warnings occurring after
> > > > otherwise-normal operation. Once I noticed, I set rpc/nfs_debug to 1,
> > > > and then later set it to 255.
> > >
> > > Were the NFS servers hung at this point? If so, then that probably
> > > suffices to explain the hung task warnings (which would be false
> > > positives) as being due to the page cache waiting to lock pages on which
> > > I/O is being performed.
> >
> > Nope...Many other NFS clients did not notice anything, and there were no
> > obvious problems on any NFS server. This was only affecting two clients
> > at the same time, but we had a limited LVS pool pointing at them at the
> > time to try to isolate load patterns that might be tickling the issue.
>
> So what were all the
>
> 'lockd: server 10.10.52.xxx not responding, still trying'
>
> messages all about? There were quite a few of them for a number of
> different servers in the moments leading up to the hang. Could it be a
> problem with the switch these clients are attached to?

If it were a switch problem, would we see port 2049 socket backlogs with
netstat -tan or ss -tan? I haven't seen this at all when the problem
occurs. All of the sockets are idle (and usually it seems to close all of
them except the one to the server that all of the slots are stuck on).
tcpdump shows no problems, just very slow request rates that match the
rpc/nfs debugging.

If the rpc slots are stuck full, would that cause lockd to print those
timeouts?

Actually, another one just got stuck right now:

[root@lsh1003:/root]# dmesg|tail
lockd: server 10.10.52.227 not responding, still trying
lockd: server 10.10.52.155 not responding, still trying
lockd: server 10.10.52.163 not responding, still trying
lockd: server 10.10.52.155 not responding, still trying
lockd: server 10.10.52.150 not responding, still trying
lockd: server 10.10.52.151 not responding, still trying
lockd: server 10.10.52.162 not responding, still trying
lockd: server 10.10.52.155 not responding, still trying
lockd: server 10.10.52.163 not responding, still trying
lockd: server 10.10.52.155 not responding, still trying
[root@lsh1003:/root]# netstat -tano | grep 2049
tcp 0 0 10.10.52.13:809 10.10.52.222:2049 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.10.52.13:826 10.10.52.163:2049 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.10.52.13:724 10.10.52.230:2049 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.10.52.13:784 10.10.52.228:2049 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.10.52.13:786 10.10.52.155:2049 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.10.52.13:672 10.10.52.224:2049 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.10.52.13:948 10.10.52.227:2049 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.10.52.13:925 10.10.52.150:2049 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.10.52.13:773 10.10.52.225:2049 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.10.52.13:742 10.10.52.162:2049 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.10.52.13:788 10.10.52.151:2049 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.10.52.13:733 10.10.52.160:2049 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.10.52.13:1010 10.10.52.15:2049 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.10.52.13:854 10.10.52.221:2049 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.10.52.13:892 10.10.52.229:2049 ESTABLISHED off (0.00/0/0)
[root@lsh1003:/root]# netstat -tano | grep 2049
[root@lsh1003:/root]# tcpdump -i any -n port 2049 -s 9999
tcpdump: WARNING: Promiscuous mode not supported on the "any" device
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 9999 bytes
16:37:35.391957 IP 10.10.52.13.603423014 > 10.10.52.150.2049: 100 getattr fh Unknown/01000600B1C2A7D11304420481B0C6A6D63811CC000000000000000000000000
16:37:35.392129 IP 10.10.52.150.2049 > 10.10.52.13.603423014: reply ok 116 getattr DIR 40755 ids 0/0 sz 4096
16:37:35.392157 IP 10.10.52.13.925 > 10.10.52.150.2049: . ack 185054172 win 3078 <nop,nop,timestamp 124239060 3176255375>
16:37:35.392295 IP 10.10.52.13.620200230 > 10.10.52.150.2049: 104 access fh Unknown/01000600B1C2A7D11304420481B0C6A6D63811CC0000001F000000004CE6FC2F 001f
16:37:35.392438 IP 10.10.52.150.2049 > 10.10.52.13.620200230: reply ok 124 access c 001f
16:37:35.392511 IP 10.10.52.13.636977446 > 10.10.52.150.2049: 148 lookup fh Unknown/01000600B1C2A7D11304420481B0C6A6D63811CC0000002C6E6673436865636B "nfsCheckTestFile.lsh1003.lsh.hostnameremoved"
16:37:35.392660 IP 10.10.52.150.2049 > 10.10.52.13.636977446: reply ok 240 lookup fh Unknown/01000601B1C2A7D11304420481B0C6A6D63811CC12000000142933E300000001
16:37:35.392701 IP 10.10.52.13.653754662 > 10.10.52.150.2049: 112 access fh Unknown/01000601B1C2A7D11304420481B0C6A6D63811CC12000000142933E30000002D 002d
16:37:35.392849 IP 10.10.52.150.2049 > 10.10.52.13.653754662: reply ok 124 access c 000d
16:37:35.392898 IP 10.10.52.13.670531878 > 10.10.52.150.2049: 136 setattr fh Unknown/01000601B1C2A7D11304420481B0C6A6D63811CC12000000142933E300000000
16:37:35.393067 IP 10.10.52.150.2049 > 10.10.52.13.670531878: reply ok 148 setattr
16:37:35.429871 IP 10.10.52.13.925 > 10.10.52.150.2049: . ack 637 win 3078 <nop,nop,timestamp 124239070 3176255375>
16:37:36.393758 IP 10.10.52.13.2386403 > 10.10.52.151.2049: 100 getattr fh Unknown/010006004828B5EEFFF84A13B9DD0D3DCC3F0F130000000000000000C8533CC5
16:37:36.393924 IP 10.10.52.151.2049 > 10.10.52.13.2386403: reply ok 116 getattr DIR 40751 ids 0/0 sz 4096
16:37:36.393954 IP 10.10.52.13.788 > 10.10.52.151.2049: . ack 1968150002 win 7953 <nop,nop,timestamp 124239311 3336236361>
16:37:36.394034 IP 10.10.52.13.19163619 > 10.10.52.151.2049: 104 access fh Unknown/010006004828B5EEFFF84A13B9DD0D3DCC3F0F130000001F000000004CE6FC30 001f
16:37:36.394179 IP 10.10.52.151.2049 > 10.10.52.13.19163619: reply ok 124 access c 001f
16:37:36.394238 IP 10.10.52.13.35940835 > 10.10.52.151.2049: 148 lookup fh Unknown/010006004828B5EEFFF84A13B9DD0D3DCC3F0F130000002C6E6673436865636B "nfsCheckTestFile.lsh1003.lsh.hostnameremoved"
16:37:36.394400 IP 10.10.52.151.2049 > 10.10.52.13.35940835: reply ok 120 lookup ERROR: No such file or directory
16:37:36.394465 IP 10.10.52.13.52718051 > 10.10.52.151.2049: 180 create fh Unknown/010006004828B5EEFFF84A13B9DD0D3DCC3F0F130000002C6E6673436865636B "nfsCheckTestFile.lsh1003.lsh.hostnameremoved"
16:37:36.394637 IP 10.10.52.151.2049 > 10.10.52.13.52718051: reply ok 272 create fh Unknown/010006014828B5EEFFF84A13B9DD0D3DCC3F0F130D000000498FBC9300000001
16:37:36.433868 IP 10.10.52.13.788 > 10.10.52.151.2049: . ack 517 win 7953 <nop,nop,timestamp 124239321 3336236361>
16:37:37.395911 IP 10.10.52.13.2628690491 > 10.10.52.162.2049: 100 getattr fh Unknown/01000600BEEAE54F02694FC9B8BEC5F023EBC250000000000000000000000000
16:37:37.396075 IP 10.10.52.162.2049 > 10.10.52.13.2628690491: reply ok 116 getattr DIR 40751 ids 0/0 sz 4096
16:37:37.396119 IP 10.10.52.13.742 > 10.10.52.162.2049: . ack 440239041 win 5348 <nop,nop,timestamp 124239561 1314678534>
16:37:37.396206 IP 10.10.52.13.2645467707 > 10.10.52.162.2049: 104 access fh Unknown/01000600BEEAE54F02694FC9B8BEC5F023EBC2500000001F000000004CE6FC31 001f
16:37:37.396371 IP 10.10.52.162.2049 > 10.10.52.13.2645467707: reply ok 124 access c 001f
16:37:37.396429 IP 10.10.52.13.2662244923 > 10.10.52.162.2049: 148 lookup fh Unknown/01000600BEEAE54F02694FC9B8BEC5F023EBC2500000002C6E6673436865636B "nfsCheckTestFile.lsh1003.lsh.hostnameremoved"
16:37:37.396571 IP 10.10.52.162.2049 > 10.10.52.13.2662244923: reply ok 240 lookup fh Unknown/01000601BEEAE54F02694FC9B8BEC5F023EBC25060000000231C4E0900000001
16:37:37.396617 IP 10.10.52.13.2679022139 > 10.10.52.162.2049: 112 access fh Unknown/01000601BEEAE54F02694FC9B8BEC5F023EBC25060000000231C4E090000002D 002d
16:37:37.396747 IP 10.10.52.162.2049 > 10.10.52.13.2679022139: reply ok 124 access c 000d
16:37:37.396787 IP 10.10.52.13.2695799355 > 10.10.52.162.2049: 136 setattr fh Unknown/01000601BEEAE54F02694FC9B8BEC5F023EBC25060000000231C4E0900000000
16:37:37.396926 IP 10.10.52.162.2049 > 10.10.52.13.2695799355: reply ok 148 setattr
16:37:37.433868 IP 10.10.52.13.742 > 10.10.52.162.2049: . ack 637 win 5348 <nop,nop,timestamp 124239571 131467853

Or do I need to tcpdump some other dynamic rpc port?

> > > > Since several servers were stuck at the same time and we were losing
> > > > quorum, I decided to try something more drastic and booted into
> > > > 2.6.37-rc2-git3. This kernel hasn't got stuck yet! However, it's
> > > > spitting out some new errors which may be worth looking into:
> > > >
> > > > [ 1574.088812] NFS: server 10.10.52.222 error: fileid changed
> > > > [ 1574.088814] fsid 0:18: expected fileid 0x4c081940, got 0x4c081950
> > > > [11340.409447] NFS: server 10.10.52.228 error: fileid changed
> > > > [11340.409450] fsid 0:45: expected fileid 0x696ff82, got 0x16a98bd7
> > > > [20832.579912] NFS: server 10.10.52.225 error: fileid changed
> > > > [20832.579914] fsid 0:2a: expected fileid 0x8c67ebab, got 0x8c6811e5
> > > > [32775.957351] NFS: server 10.10.52.230 error: fileid changed
> > > > [32775.957354] fsid 0:52: expected fileid 0x919041fd, got 0x93f1962d
> > > >
> > > > These are also in the same kernel log. The error code isn't new, so
> > > > something else seems to have changed to cause it.
> > >
> > > These indicate server bugs: your failover event appears to have caused
> > > the inode numbers to have changed on a number of files. This is
> > > something that shouldn't happen in a normal NFS environment, and so the
> > > client prints out the above warnings...
> >
> > There was no fail-over event on any NFS server for the last week, so
> > I'm not sure what would be causing this. The IPs listed there are
> > running 2.6.30.10 with XFS-exported fses.
> >
> > All of the other clients running 2.6.36 (another 20 or so boxes) with the
> > same NFS mounts are not logging any "fileid changed" messages. The first
> > time I've seen this message is with this 2.6.37-rc2-git3 kernel.
>
> The only change in 2.6.37-rcX I can think of that might have caused an
> issue here would be Bryan's readdir changes.
>
> If you can reproduce the above error condition, then could you try
> turning off readdirplus (using the 'nordirplus' mount option) and seeing
> if that makes a difference?

Trying now with "nordirplus" on 2.6.37-rc2-git3...

Simon-

2010-11-24 05:18:58

by Simon Kirby

Subject: Re: NFS client/sunrpc getting stuck on 2.6.36

On Sun, Nov 21, 2010 at 02:54:20PM -0500, Trond Myklebust wrote:

> > [219158.456270] NFS: server 10.10.52.230 error: fileid changed
> > [219158.456273] fsid 0:4e: expected fileid 0x57bcf57c, got 0x57bd5442
> >
> > Finally, on the last one I had remounted all of the NFS mount points with
> > that option, and there have been no errors since.
>
> I just posted 8 patches to fix up bugs and regressions in the readdir
> code. Can you try with some of these?

2.6.37-rc3 plus those 8 patches was running on one box for just over a
day when it just crashed for some reason (got rebooted before I could see
the console). Anyway, while it was running, it was still logging "fileid
changed" lines as before:

[84197.182497] NFS: server 10.10.52.228 error: fileid changed
[84197.182499] fsid 0:47: expected fileid 0x96fe0343, got 0x4926a389
[85226.297721] NFS: server 10.10.52.227 error: fileid changed
[85226.297724] fsid 0:39: expected fileid 0x254d3a97, got 0x480014a6

Simon-

2010-11-21 19:55:32

by Trond Myklebust

Subject: Re: NFS client/sunrpc getting stuck on 2.6.36

On Sat, 2010-11-20 at 22:43 -0800, Simon Kirby wrote:
> That tcpdump I sent included the ACKs, which all looked normal.
> Unfortunately, we haven't seen the problem again yet. Is your "Fix an
> infinite loop in call_refresh/call_refreshresult" patch possibly related?

It might help if the cause is an RPCSEC_GSS ticket expiration.

Cheers
Trond


2010-11-19 23:17:29

by Trond Myklebust

Subject: Re: NFS client/sunrpc getting stuck on 2.6.36

On Fri, 2010-11-19 at 14:58 -0800, Simon Kirby wrote:
> On Fri, Nov 19, 2010 at 05:17:19PM -0500, Trond Myklebust wrote:
> > So what were all the
> >
> > 'lockd: server 10.10.52.xxx not responding, still trying'
> >
> > messages all about? There were quite a few of them for a number of
> > different servers in the moments leading up to the hang. Could it be a
> > problem with the switch these clients are attached to?
>
> If it were a switch problem, would we see port 2049 socket backlogs with
> netstat -tan or ss -tan? I haven't seen this at all when the problem
> occurs. All of the sockets are idle (and usually it seems to close all of
> them except the one to the server that all of the slots are stuck on).
> tcpdump shows no problems, just very slow request rates that match the
> rpc/nfs debugging.

No retransmits that might indicate dropped packets at the switch? How
fast are the tcp ACKs from the server being returned?

> If the rpc slots are stuck full, would that cause lockd to print those
> timeouts?

Yes. That would be the only kind of event that would trigger these
messages.

> Actually, another one just got stuck right now:
>
> [root@lsh1003:/root]# dmesg|tail
> lockd: server 10.10.52.227 not responding, still trying
> lockd: server 10.10.52.155 not responding, still trying
> lockd: server 10.10.52.163 not responding, still trying
> lockd: server 10.10.52.155 not responding, still trying
> lockd: server 10.10.52.150 not responding, still trying
> lockd: server 10.10.52.151 not responding, still trying
> lockd: server 10.10.52.162 not responding, still trying
> lockd: server 10.10.52.155 not responding, still trying
> lockd: server 10.10.52.163 not responding, still trying
> lockd: server 10.10.52.155 not responding, still trying
> [root@lsh1003:/root]# netstat -tano | grep 2049

lockd requests don't get sent to port 2049. They go to whatever port the
server is advertising using the RPC portmapper.

rpcinfo -p <servername> | grep nlockmgr

should tell you on which tcp and udp ports lockd is listening. Then you
can try probing for service using

rpcinfo -t <servername> nlockmgr
rpcinfo -u <servername> nlockmgr

Trond


2010-11-19 22:03:57

by Simon Kirby

Subject: Re: NFS client/sunrpc getting stuck on 2.6.36

On Fri, Nov 19, 2010 at 04:24:48PM -0500, Trond Myklebust wrote:

> On Fri, 2010-11-19 at 12:20 -0800, Simon Kirby wrote:
> > On Thu, Nov 11, 2010 at 01:22:47PM +0800, Trond Myklebust wrote:
> >
> > > On Wed, 2010-11-10 at 18:35 -0800, Simon Kirby wrote:
> > > > Still seeing all sorts of boxes fall over with 2.6.35 and 2.6.36 NFS.
> > > > Unfortunately, it doesn't happen all the time...only certain load
> > > > patterns seem to start it off. Once it starts, I can't find a way to
> > > > make it recover without rebooting.
> > > >...
> > > > NFS: permission(0:4c/5284877), mask=0x1, res=0
> > > > NFS: revalidating (0:4c/3247737045)
> > > >
> > > > 900ms matches the probably-silly nfs mount settings we're currently using:
> > > >
> > > > rw,hard,intr,tcp,timeo=9,retrans=3,rsize=8192,wsize=8192
> > > >
> > > > Full kernel log here: http://0x.ca/sim/ref/2.6.36_stuck_nfs/
> > >
> > > timeo=9 is a completely insane retransmit value for a tcp connection.
> > >
> > > Please use the default timeo=600, and all will work correctly.
> >
> > Ok, so, we were running with timeo=300 instead on a number of servers,
> > and we were still seeing the problem on 2.6.36. I've uploaded a new
> > kernel log (lsh1051) here:
> >
> > http://0x.ca/sim/ref/2.6.36_stuck_nfs/
> >
> > The log starts out with the hung task warnings occurring after
> > otherwise-normal operation. Once I noticed, I set rpc/nfs_debug to 1,
> > and then later set it to 255.
>
> Were the NFS servers hung at this point? If so, then that probably
> suffices to explain the hung task warnings (which would be false
> positives) as being due to the page cache waiting to lock pages on which
> I/O is being performed.

Nope...Many other NFS clients did not notice anything, and there were no
obvious problems on any NFS server. This was only affecting two clients
at the same time, but we had a limited LVS pool pointing at them at the
time to try to isolate load patterns that might be tickling the issue.

> > Since several servers were stuck at the same time and we were losing
> > quorum, I decided to try something more drastic and booted into
> > 2.6.37-rc2-git3. This kernel hasn't got stuck yet! However, it's
> > spitting out some new errors which may be worth looking into:
> >
> > [ 1574.088812] NFS: server 10.10.52.222 error: fileid changed
> > [ 1574.088814] fsid 0:18: expected fileid 0x4c081940, got 0x4c081950
> > [11340.409447] NFS: server 10.10.52.228 error: fileid changed
> > [11340.409450] fsid 0:45: expected fileid 0x696ff82, got 0x16a98bd7
> > [20832.579912] NFS: server 10.10.52.225 error: fileid changed
> > [20832.579914] fsid 0:2a: expected fileid 0x8c67ebab, got 0x8c6811e5
> > [32775.957351] NFS: server 10.10.52.230 error: fileid changed
> > [32775.957354] fsid 0:52: expected fileid 0x919041fd, got 0x93f1962d
> >
> > These are also in the same kernel log. The error code isn't new, so
> > something else seems to have changed to cause it.
>
> These indicate server bugs: your failover event appears to have caused
> the inode numbers to have changed on a number of files. This is
> something that shouldn't happen in a normal NFS environment, and so the
> client prints out the above warnings...

There was no fail-over event on any NFS server for the last week, so
I'm not sure what would be causing this. The IPs listed there are
running 2.6.30.10 with XFS-exported fses.

All of the other clients running 2.6.36 (another 20 or so boxes) with the
same NFS mounts are not logging any "fileid changed" messages. The first
time I've seen this message is with this 2.6.37-rc2-git3 kernel.

Cheers,

Simon-

2010-11-21 06:40:30

by Simon Kirby

Subject: Re: NFS client/sunrpc getting stuck on 2.6.36

On Fri, Nov 19, 2010 at 05:17:19PM -0500, Trond Myklebust wrote:

> The only change in 2.6.37-rcX I can think of that might have caused an
> issue here would be Bryan's readdir changes.
>
> If you can reproduce the above error condition, then could you try
> turning off readdirplus (using the 'nordirplus' mount option) and seeing
> if that makes a difference?

So, of the three boxes now running 2.6.37-rc2-git3, I have two mounted
with nordirplus. One, which I booted with "nordirplus", has never logged
the error. Another, whose fstab I haven't changed, is still logging this
fairly often (nfsstat shows it has only made readdirplus calls):

[219158.456270] NFS: server 10.10.52.230 error: fileid changed
[219158.456273] fsid 0:4e: expected fileid 0x57bcf57c, got 0x57bd5442

Finally, on the last one I had remounted all of the NFS mount points with
that option, and there have been no errors since.

Simon-

2010-11-19 22:17:23

by Trond Myklebust

Subject: Re: NFS client/sunrpc getting stuck on 2.6.36

On Fri, 2010-11-19 at 14:03 -0800, Simon Kirby wrote:
> On Fri, Nov 19, 2010 at 04:24:48PM -0500, Trond Myklebust wrote:
>
> > On Fri, 2010-11-19 at 12:20 -0800, Simon Kirby wrote:
> > > On Thu, Nov 11, 2010 at 01:22:47PM +0800, Trond Myklebust wrote:
> > >
> > > > On Wed, 2010-11-10 at 18:35 -0800, Simon Kirby wrote:
> > > > > Still seeing all sorts of boxes fall over with 2.6.35 and 2.6.36 NFS.
> > > > > Unfortunately, it doesn't happen all the time...only certain load
> > > > > patterns seem to start it off. Once it starts, I can't find a way to
> > > > > make it recover without rebooting.
> > > > >...
> > > > > NFS: permission(0:4c/5284877), mask=0x1, res=0
> > > > > NFS: revalidating (0:4c/3247737045)
> > > > >
> > > > > 900ms matches the probably-silly nfs mount settings we're currently using:
> > > > >
> > > > > rw,hard,intr,tcp,timeo=9,retrans=3,rsize=8192,wsize=8192
> > > > >
> > > > > Full kernel log here: http://0x.ca/sim/ref/2.6.36_stuck_nfs/
> > > >
> > > > timeo=9 is a completely insane retransmit value for a tcp connection.
> > > >
> > > > Please use the default timeo=600, and all will work correctly.
> > >
> > > Ok, so, we were running with timeo=300 instead on a number of servers,
> > > and we were still seeing the problem on 2.6.36. I've uploaded a new
> > > kernel log (lsh1051) here:
> > >
> > > http://0x.ca/sim/ref/2.6.36_stuck_nfs/
> > >
> > > The log starts out with the hung task warnings occurring after
> > > otherwise-normal operation. Once I noticed, I set rpc/nfs_debug to 1,
> > > and then later set it to 255.
> >
> > Were the NFS servers hung at this point? If so, then that probably
> > suffices to explain the hung task warnings (which would be false
> > positives) as being due to the page cache waiting to lock pages on which
> > I/O is being performed.
>
> Nope...Many other NFS clients did not notice anything, and there were no
> obvious problems on any NFS server. This was only affecting two clients
> at the same time, but we had a limited LVS pool pointing at them at the
> time to try to isolate load patterns that might be tickling the issue.

So what were all the

'lockd: server 10.10.52.xxx not responding, still trying'

messages all about? There were quite a few of them for a number of
different servers in the moments leading up to the hang. Could it be a
problem with the switch these clients are attached to?

> > > Since several servers were stuck at the same time and we were losing
> > > quorum, I decided to try something more drastic and booted into
> > > 2.6.37-rc2-git3. This kernel hasn't got stuck yet! However, it's
> > > spitting out some new errors which may be worth looking into:
> > >
> > > [ 1574.088812] NFS: server 10.10.52.222 error: fileid changed
> > > [ 1574.088814] fsid 0:18: expected fileid 0x4c081940, got 0x4c081950
> > > [11340.409447] NFS: server 10.10.52.228 error: fileid changed
> > > [11340.409450] fsid 0:45: expected fileid 0x696ff82, got 0x16a98bd7
> > > [20832.579912] NFS: server 10.10.52.225 error: fileid changed
> > > [20832.579914] fsid 0:2a: expected fileid 0x8c67ebab, got 0x8c6811e5
> > > [32775.957351] NFS: server 10.10.52.230 error: fileid changed
> > > [32775.957354] fsid 0:52: expected fileid 0x919041fd, got 0x93f1962d
> > >
> > > These are also in the same kernel log. The error code isn't new, so
> > > something else seems to have changed to cause it.
> >
> > These indicate server bugs: your failover event appears to have caused
> > the inode numbers to have changed on a number of files. This is
> > something that shouldn't happen in a normal NFS environment, and so the
> > client prints out the above warnings...
>
> There was no fail-over event on any NFS server for the last week, so
> I'm not sure what would be causing this. The IPs listed there are
> running 2.6.30.10 with XFS-exported fses.
>
> All of the other clients running 2.6.36 (another 20 or so boxes) with the
> same NFS mounts are not logging any "fileid changed" messages. The first
> time I've seen this message is with this 2.6.37-rc2-git3 kernel.

The only change in 2.6.37-rcX I can think of that might have caused an
issue here would be Bryan's readdir changes.

If you can reproduce the above error condition, then could you try
turning off readdirplus (using the 'nordirplus' mount option) and seeing
if that makes a difference?
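
(i.e. just append it to the existing option list, along the lines of

server:/export  /mnt/point  nfs  rw,hard,intr,tcp,nordirplus  0  0

in fstab, with server:/export and /mnt/point standing in for whatever the
real mounts are, and the rest of your usual options kept as they were.)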

Trond