On Fri, Sep 26, 2008 at 07:17:33PM +0100, Ian Campbell wrote:
> On Fri, 2008-09-26 at 16:37 +0100, Ian Campbell wrote:
> > It's possible, the good case is naturally quite hard to establish with
> > 100% certainty. I declared v2.6.26 OK after an uptime of 4 days and 19
> > hours, compared with failure normally within 1-2 days. It's possible I
> > was premature in doing so. I'll run 2e96d2867 for at least a full week
> > before reporting back.
>
> Or it'll reproduce after just a few hours :-/
>
> Looks like a false negative on v2.6.26 -- I'll retest from there.
OK, so apologies, but this has been a long thread, and maybe we could
use a summary of the symptoms and the results so far. I think you said
2.6.24 or .25 was the last you're *positive* was good?
--b.
On Fri, 2008-09-26 at 23:54 -0400, J. Bruce Fields wrote:
> OK, so apologies, but this has been a long thread, and maybe we could
> use a summary of the symptoms and the results so far. I think you said
> 2.6.24 or .25 was the last you're *positive* was good?
2.6.24.x was good 2.6.25.y was not. The Debian packages don't include
the stable release numbers in their version so I'm unsure of x and y,
Judging from the changelog entries I think 2.6.24.3 and 2.6.25.10.
Since I have been building my own kernels I have seen repros with:
a551b98d5f6fce5897d497abd8bfb262efb33d2a Merge branch 'for-linus' of
git://git.kernel.dk/linux-2.6-block (was tip of git at the time I tested
it a while ago)
c02a79dedc7f3c3d4fdbb5eb2000cacea5df4cde v2.6.26.5
daedfbe2a67628a40076a6c75fb945c60f608a2e NFS: Ensure we zap only the
access and acl caches when setting new acls (backport of
f41f741838480aeaa3a189cff6e210503cf9c42d to 2.6.26.y stream, identified
by bisect, possibly/probably incorrectly)
f41f741838480aeaa3a189cff6e210503cf9c42d NFS: Ensure we zap only the
access and acl caches when setting new acls (original trunk commit of
above)
2e96d2867245668dbdb973729288cf69b9fafa66 NFS: Fix a warning in
nfs4_async_handle_error (commit before
f41f741838480aeaa3a189cff6e210503cf9c42d)
I did not see the problem with bce7f793daec3e65ec5c5705d2457b81fe7b5725
(v2.6.26) in 4 days 19 hours. Due to the apparent mis-bisection leading
to daedfbe2a67628a40076a6c75fb945c60f608a2e this result is in doubt, I'm
currently retesting this kernel.
The bisect was restricted to paths under fs/nfs* and net/sunrpc. I've
gone over the range v2.6.26..daedfbe2a67628a40076a6c75fb945c60f608a2e
and nothing jumps out at me. If my retest of 2.6.26 shows it to be OK
again I'll bisect over the non-NFS changesets in that range.
The symptoms are a hang of any access to an NFS mounted path, although
the rest of the system continues to work.
Other potentially interesting info cribbed from earlier postings:
While the hang is occurring rpc_debug on the server gives nothing, on
the client gives:
[144741.637997] -pid- proc flgs status -client- -prog- --rqstp- -timeout -rpcwait -action- ---ops--
[144741.637997] 3439 0004 0080 -11 f3f48200 100003 f7770000 0 xprt_sending fa0ae88e fa0bddf4
[144741.637997] 3438 0001 00a0 0 f77f2a00 100003 f77700d0 5000 xprt_pending fa0ae88e fa0bddf4
The mount points are
hopkins:/storage/music /storage/music nfs rw,nosuid,vers=3,rsize=32768,wsize=32768,namlen=255,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys,mountproto=udp,addr=192.168.1.6 0 0
hopkins:/storage/mythtv/recordings /var/lib/mythtv/recordings nfs rw,nosuid,vers=3,rsize=32768,wsize=32768,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys,mountproto=udp,addr=192.168.1.6 0 0
hopkins:/var/lib/mythvideo /var/lib/mythvideo nfs rw,nosuid,vers=3,rsize=32768,wsize=32768,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys,mountproto=udp,addr=192.168.1.6 0 0
hopkins:/storage/home/ijc /home/ijc nfs rw,vers=3,rsize=131072,wsize=131072,namlen=255,hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys,mountproto=udp,addr=192.168.1.6 0 0
all of them seem to be effected simultaneously (but I'll check to make double sure next time it occurs).
I've seen hung task backtraces from effected processes, e.g.
Aug 4 06:27:28 iranon kernel: [137969.382277] INFO: task mythbackend:3161 blocked for more than 120 seconds.
Aug 4 06:27:28 iranon kernel: [137969.382287] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 4 06:27:28 iranon kernel: [137969.382291] mythbackend D 00005dfc 0 3161 1
Aug 4 06:27:28 iranon kernel: [137969.382295] f2006c70 00000082 4b05cc7e 00005dfc f2006df0 c1c0e920 00000000 f7273390
Aug 4 06:27:28 iranon kernel: [137969.382301] 0000c8b4 00000000 00000001 f7273398 00000282 00000202 f71b0ab0 f200ddf0
Aug 4 06:27:28 iranon kernel: [137969.382306] c1c012bc fa18f0a6 fa18f0bf c02b45a7 f71b0ab0 00000000 f200de0c fa18f0a6
Aug 4 06:27:28 iranon kernel: [137969.382311] Call Trace:
Aug 4 06:27:28 iranon kernel: [137969.382347] [<fa18f0a6>] nfs_wait_schedule+0x0/0x1e [nfs]
Aug 4 06:27:28 iranon kernel: [137969.382384] [<fa18f0bf>] nfs_wait_schedule+0x19/0x1e [nfs]
Aug 4 06:27:28 iranon kernel: [137969.382399] [<c02b45a7>] __wait_on_bit_lock+0x2a/0x52
Aug 4 06:27:28 iranon kernel: [137969.382407] [<fa18f0a6>] nfs_wait_schedule+0x0/0x1e [nfs]
Aug 4 06:27:28 iranon kernel: [137969.382421] [<c02b462e>] out_of_line_wait_on_bit_lock+0x5f/0x67
Aug 4 06:27:28 iranon kernel: [137969.382429] [<c0133a1d>] wake_bit_function+0x0/0x3c
Aug 4 06:27:28 iranon kernel: [137969.382441] [<fa18f16e>] __nfs_revalidate_inode+0xaa/0x211 [nfs]
Aug 4 06:27:28 iranon kernel: [137969.382458] [<c017bac0>] do_lookup+0x53/0x145
Aug 4 06:27:28 iranon kernel: [137969.382466] [<c0187609>] mntput_no_expire+0x11/0x64
Aug 4 06:27:28 iranon kernel: [137969.382472] [<c017d971>] __link_path_walk+0xa71/0xb65
Aug 4 06:27:28 iranon kernel: [137969.382477] [<c017bac0>] do_lookup+0x53/0x145
Aug 4 06:27:28 iranon kernel: [137969.382483] [<c0187609>] mntput_no_expire+0x11/0x64
Aug 4 06:27:28 iranon kernel: [137969.382492] [<c0187609>] mntput_no_expire+0x11/0x64
Aug 4 06:27:28 iranon kernel: [137969.382496] [<c017daf5>] path_walk+0x90/0x98
Aug 4 06:27:28 iranon kernel: [137969.382502] [<fa18f935>] nfs_getattr+0x8f/0xbe [nfs]
Aug 4 06:27:28 iranon kernel: [137969.382520] [<fa18f8a6>] nfs_getattr+0x0/0xbe [nfs]
Aug 4 06:27:28 iranon kernel: [137969.382536] [<c01785ac>] vfs_getattr+0x36/0x4d
Aug 4 06:27:28 iranon kernel: [137969.382545] [<c0178684>] vfs_lstat_fd+0x27/0x39
Aug 4 06:27:28 iranon kernel: [137969.382550] [<fa18bca6>] nfs_permission+0x0/0x129 [nfs]
Aug 4 06:27:28 iranon kernel: [137969.382567] [<c0187609>] mntput_no_expire+0x11/0x64
Aug 4 06:27:28 iranon kernel: [137969.382572] [<c0174914>] sys_faccessat+0x11e/0x15d
Aug 4 06:27:28 iranon kernel: [137969.382582] [<c0178703>] sys_lstat64+0xf/0x23
Aug 4 06:27:28 iranon kernel: [137969.382588] [<c017603c>] vfs_read+0xe3/0x11e
Aug 4 06:27:28 iranon kernel: [137969.382598] [<c0174962>] sys_access+0xf/0x13
Aug 4 06:27:28 iranon kernel: [137969.382603] [<c01077e8>] sysenter_past_esp+0x6d/0xa5
Aug 4 06:27:28 iranon kernel: [137969.382617] =======================
Ian.
--
Ian Campbell
Contemptuous lights flashed across the computer's console.
-- Hitchhiker's Guide to the Galaxy