LinuxLists.cc - NFS regression? Odd delays and lockups accessing an NFS export.

2008-08-18 02:03:04

Subject: NFS regression? Odd delays and lockups accessing an NFS export.

Hi there,

I've been using NFS here for years, lately there's something odd going on
since about a month or so. Previously reported last month:
http://www.gossamer-threads.com/lists/linux/kernel/951419?page=last

Now with 2.6.27-rc3 on one of the client boxes I get a complete stall
at odd times when accessing the server's exported directory, cannot
see a pattern to it. Eventually recovers after a Ctrl-C. Nothing in
the server or client log files. Not easy to reproduce either.

The server runs 2.6.26.2 at the moment.

Server config, etc: http://bugsplatter.id.au/kernel/boxen/deltree/
Client config, etc: http://bugsplatter.id.au/kernel/boxen/pooh/

Grant.

2008-08-20 01:10:42

by Grant Coady

[permalink] [raw]

Subject: Re: NFS regression? Odd delays and lockups accessing an NFS export.

Hi Trond,

On Mon, 18 Aug 2008 15:20:58 -0400, Trond Myklebust <[email protected]> wrote:

>On Mon, 2008-08-18 at 12:02 +1000, Grant Coady wrote:
>> Hi there,
>>
>> I've been using NFS here for years, lately there's something odd going on
>> since about a month or so. Previously reported last month:
>> http://www.gossamer-threads.com/lists/linux/kernel/951419?page=last
>>
>> Now with 2.6.27-rc3 on one of the client boxes I get a complete stall
>> at odd times when accessing the server's exported directory, cannot
>> see a pattern to it. Eventually recovers after a Ctrl-C. Nothing in
>> the server or client log files. Not easy to reproduce either.
>>
>> The server runs 2.6.26.2 at the moment.
>>
>> Server config, etc: http://bugsplatter.id.au/kernel/boxen/deltree/
>> Client config, etc: http://bugsplatter.id.au/kernel/boxen/pooh/
>
>Please try to reproduce the hang, then do
>
> echo 0 >/proc/sys/sunrpc/rpc_debug
>
>and send the output from 'dmesg'...

dmesg and logs from client machine showed nothing, dmesg from server:

-pid- proc flgs status -client- -prog- --rqstp- -timeout -rpcwait -action- ---ops--

(nothing, nada, zilch after what looks like the header line)

Situation, first use of 'df' after reboot, client machine hung for many
seconds before the /home/common export showed up, this is not repeatable.
Mounting another export and doing 'df' again had no delays.

Client is running 2.6.27-rc3-git5 (x86_64) at the moment, configs, etc
at: http://bugsplatter.id.au/kernel/boxen/pooh64/

Server as above. Am I supposed to have some debugging turned on in the
.configs?

Grant.

2008-08-20 23:17:39

by Grant Coady

[permalink] [raw]

Subject: Re: NFS regression? Odd delays and lockups accessing an NFS export.

On Mon, 18 Aug 2008 15:20:58 -0400, Trond Myklebust <[email protected]> wrote:

>On Mon, 2008-08-18 at 12:02 +1000, Grant Coady wrote:
>> Hi there,
>>
>> I've been using NFS here for years, lately there's something odd going on
>> since about a month or so. Previously reported last month:
>> http://www.gossamer-threads.com/lists/linux/kernel/951419?page=last
>>
>> Now with 2.6.27-rc3 on one of the client boxes I get a complete stall
>> at odd times when accessing the server's exported directory, cannot
>> see a pattern to it. Eventually recovers after a Ctrl-C. Nothing in
>> the server or client log files. Not easy to reproduce either.
>>
>> The server runs 2.6.26.2 at the moment.
>>
>> Server config, etc: http://bugsplatter.id.au/kernel/boxen/deltree/
>> Client config, etc: http://bugsplatter.id.au/kernel/boxen/pooh/
>
>Please try to reproduce the hang, then do
>
> echo 0 >/proc/sys/sunrpc/rpc_debug
>
>and send the output from 'dmesg'...

It's not NFS, I just had WinXP box stall completely (no KB NumLock
response) on a PuTTY (ssh) terminal to a linux box running 2.6.26.3
for about five or ten seconds, when it came back to operation the
PuTYY toolbar icon was flashing. Dunno which terminal as I had four
open at the time, but they all to a server box running 2.6.26.3 with
multiple Intel pro/100 (e100) NICs.

I think this may be to do with the e100 network driver instead, as a
few weeks ago I started compiling it in rather than as a module like
I've been doing for years (in response to the firmware (microcode)
being dropped on the floor last month issue). This box is the server
(deltree) that all linux boxes mount a /home/common NFS export from.

All machines have the Intel pro/100 NIC, going to recompile the lot
back to modules, see if that fixes the problem.

Grant.

2008-08-22 10:23:43

by Ian Campbell

[permalink] [raw]

Subject: Re: NFS regression? Odd delays and lockups accessing an NFS export.

On Mon, 2008-08-18 at 15:20 -0400, Trond Myklebust wrote:
> Please try to reproduce the hang, then do
>
> echo 0 >/proc/sys/sunrpc/rpc_debug
>
> and send the output from 'dmesg'...

I've also been seeing some NFS related lockups, although I'm not sure if
they are the same as the one in this thread or not. Client is 2.6.26
(Debian's kernel) and Server is 2.6.25 (also a Debian kernel, but from
backports.org).

rpc_debug on the server gives nothing, on the client gives:
[144741.637997] -pid- proc flgs status -client- -prog- --rqstp- -timeout -rpcwait -action- ---ops--
[144741.637997] 3439 0004 0080 -11 f3f48200 100003 f7770000 0 xprt_sending fa0ae88e fa0bddf4
[144741.637997] 3438 0001 00a0 0 f77f2a00 100003 f77700d0 15000 xprt_pending fa0ae88e fa0bddf4

There are no processes running with pid 3439 3438 (I don't think it's
that sort of pid though).

mounts points are:
hopkins:/storage/music /storage/music nfs rw,nosuid,vers=3,rsize=32768,wsize=32768,namlen=255,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys,mountproto=udp,addr=192.168.1.6 0 0
hopkins:/storage/mythtv/recordings /var/lib/mythtv/recordings nfs rw,nosuid,vers=3,rsize=32768,wsize=32768,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys,mountproto=udp,addr=192.168.1.6 0 0
hopkins:/var/lib/mythvideo /var/lib/mythvideo nfs rw,nosuid,vers=3,rsize=32768,wsize=32768,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys,mountproto=udp,addr=192.168.1.6 0 0
hopkins:/storage/home/ijc /home/ijc nfs rw,vers=3,rsize=131072,wsize=131072,namlen=255,hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys,mountproto=udp,addr=192.168.1.6 0 0
and all seem to be effected.

It hasn't happened this time (yet) but usually I get a hung task
backtrace like this:
Aug 4 06:27:28 iranon kernel: [137969.382277] INFO: task mythbackend:3161 blocked for more than 120 seconds.
Aug 4 06:27:28 iranon kernel: [137969.382287] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 4 06:27:28 iranon kernel: [137969.382291] mythbackend D 00005dfc 0 3161 1
Aug 4 06:27:28 iranon kernel: [137969.382295] f2006c70 00000082 4b05cc7e 00005dfc f2006df0 c1c0e920 00000000 f7273390
Aug 4 06:27:28 iranon kernel: [137969.382301] 0000c8b4 00000000 00000001 f7273398 00000282 00000202 f71b0ab0 f200ddf0
Aug 4 06:27:28 iranon kernel: [137969.382306] c1c012bc fa18f0a6 fa18f0bf c02b45a7 f71b0ab0 00000000 f200de0c fa18f0a6
Aug 4 06:27:28 iranon kernel: [137969.382311] Call Trace:
Aug 4 06:27:28 iranon kernel: [137969.382347] [<fa18f0a6>] nfs_wait_schedule+0x0/0x1e [nfs]
Aug 4 06:27:28 iranon kernel: [137969.382384] [<fa18f0bf>] nfs_wait_schedule+0x19/0x1e [nfs]
Aug 4 06:27:28 iranon kernel: [137969.382399] [<c02b45a7>] __wait_on_bit_lock+0x2a/0x52
Aug 4 06:27:28 iranon kernel: [137969.382407] [<fa18f0a6>] nfs_wait_schedule+0x0/0x1e [nfs]
Aug 4 06:27:28 iranon kernel: [137969.382421] [<c02b462e>] out_of_line_wait_on_bit_lock+0x5f/0x67
Aug 4 06:27:28 iranon kernel: [137969.382429] [<c0133a1d>] wake_bit_function+0x0/0x3c
Aug 4 06:27:28 iranon kernel: [137969.382441] [<fa18f16e>] __nfs_revalidate_inode+0xaa/0x211 [nfs]
Aug 4 06:27:28 iranon kernel: [137969.382458] [<c017bac0>] do_lookup+0x53/0x145
Aug 4 06:27:28 iranon kernel: [137969.382466] [<c0187609>] mntput_no_expire+0x11/0x64
Aug 4 06:27:28 iranon kernel: [137969.382472] [<c017d971>] __link_path_walk+0xa71/0xb65
Aug 4 06:27:28 iranon kernel: [137969.382477] [<c017bac0>] do_lookup+0x53/0x145
Aug 4 06:27:28 iranon kernel: [137969.382483] [<c0187609>] mntput_no_expire+0x11/0x64
Aug 4 06:27:28 iranon kernel: [137969.382492] [<c0187609>] mntput_no_expire+0x11/0x64
Aug 4 06:27:28 iranon kernel: [137969.382496] [<c017daf5>] path_walk+0x90/0x98
Aug 4 06:27:28 iranon kernel: [137969.382502] [<fa18f935>] nfs_getattr+0x8f/0xbe [nfs]
Aug 4 06:27:28 iranon kernel: [137969.382520] [<fa18f8a6>] nfs_getattr+0x0/0xbe [nfs]
Aug 4 06:27:28 iranon kernel: [137969.382536] [<c01785ac>] vfs_getattr+0x36/0x4d
Aug 4 06:27:28 iranon kernel: [137969.382545] [<c0178684>] vfs_lstat_fd+0x27/0x39
Aug 4 06:27:28 iranon kernel: [137969.382550] [<fa18bca6>] nfs_permission+0x0/0x129 [nfs]
Aug 4 06:27:28 iranon kernel: [137969.382567] [<c0187609>] mntput_no_expire+0x11/0x64
Aug 4 06:27:28 iranon kernel: [137969.382572] [<c0174914>] sys_faccessat+0x11e/0x15d
Aug 4 06:27:28 iranon kernel: [137969.382582] [<c0178703>] sys_lstat64+0xf/0x23
Aug 4 06:27:28 iranon kernel: [137969.382588] [<c017603c>] vfs_read+0xe3/0x11e
Aug 4 06:27:28 iranon kernel: [137969.382598] [<c0174962>] sys_access+0xf/0x13
Aug 4 06:27:28 iranon kernel: [137969.382603] [<c01077e8>] sysenter_past_esp+0x6d/0xa5
Aug 4 06:27:28 iranon kernel: [137969.382617] =======================

Sysrq-T from dmesg (therefore truncated a bit) is attached.

Ian.

--
Ian Campbell

There are no answers, only cross-references.
-- Weiner

Attachments:

sysrq (121.87 kB)

2008-08-22 18:08:09

by Trond Myklebust

[permalink] [raw]

Subject: Re: NFS regression? Odd delays and lockups accessing an NFS export.

On Fri, 2008-08-22 at 11:23 +0100, Ian Campbell wrote:
> On Mon, 2008-08-18 at 15:20 -0400, Trond Myklebust wrote:
> > Please try to reproduce the hang, then do
> >
> > echo 0 >/proc/sys/sunrpc/rpc_debug
> >
> > and send the output from 'dmesg'...
>
> I've also been seeing some NFS related lockups, although I'm not sure if
> they are the same as the one in this thread or not. Client is 2.6.26
> (Debian's kernel) and Server is 2.6.25 (also a Debian kernel, but from
> backports.org).
>
> rpc_debug on the server gives nothing, on the client gives:
> [144741.637997] -pid- proc flgs status -client- -prog- --rqstp- -timeout -rpcwait -action- ---ops--
> [144741.637997] 3439 0004 0080 -11 f3f48200 100003 f7770000 0 xprt_sending fa0ae88e fa0bddf4
> [144741.637997] 3438 0001 00a0 0 f77f2a00 100003 f77700d0 15000 xprt_pending fa0ae88e fa0bddf4

That's probably also a networking device driver issue candidate: your
RPC task is queued up waiting to be sent.

What networking card+device driver are you using here?

> There are no processes running with pid 3439 3438 (I don't think it's
> that sort of pid though).

The 'pid' is an internal RPC cookie that just serves to identify and
track specific RPC requests.

Cheers
Trond

2008-08-22 18:13:51

by Ian Campbell

[permalink] [raw]

Subject: Re: NFS regression? Odd delays and lockups accessing an NFS export.

On Fri, 2008-08-22 at 11:08 -0700, Trond Myklebust wrote:
> On Fri, 2008-08-22 at 11:23 +0100, Ian Campbell wrote:
> > On Mon, 2008-08-18 at 15:20 -0400, Trond Myklebust wrote:
> > > Please try to reproduce the hang, then do
> > >
> > > echo 0 >/proc/sys/sunrpc/rpc_debug
> > >
> > > and send the output from 'dmesg'...
> >
> > I've also been seeing some NFS related lockups, although I'm not sure if
> > they are the same as the one in this thread or not. Client is 2.6.26
> > (Debian's kernel) and Server is 2.6.25 (also a Debian kernel, but from
> > backports.org).
> >
> > rpc_debug on the server gives nothing, on the client gives:
> > [144741.637997] -pid- proc flgs status -client- -prog- --rqstp- -timeout -rpcwait -action- ---ops--
> > [144741.637997] 3439 0004 0080 -11 f3f48200 100003 f7770000 0 xprt_sending fa0ae88e fa0bddf4
> > [144741.637997] 3438 0001 00a0 0 f77f2a00 100003 f77700d0 15000 xprt_pending fa0ae88e fa0bddf4
>
> That's probably also a networking device driver issue candidate: your
> RPC task is queued up waiting to be sent.
>
> What networking card+device driver are you using here?

# ethtool -i eth0
driver: e1000
version: 7.3.20-k2-NAPI
firmware-version: N/A
bus-info: 0000:01:0a.0

Adding CC's to peeps listed in MAINTAINERS

I have to reboot the system now (or else there will be no TV to watch
during my tea ;-). It will probably repro again quite soon though.

> > There are no processes running with pid 3439 3438 (I don't think it's
> > that sort of pid though).
>
> The 'pid' is an internal RPC cookie that just serves to identify and
> track specific RPC requests.

Right, thanks.

Ian.

--
Ian Campbell

Colors may fade.

Attachments:

signature.asc (197.00 B)
This is a digitally signed message part

2008-08-22 19:33:07

by John Ronciak

[permalink] [raw]

Subject: Re: NFS regression? Odd delays and lockups accessing an NFS export.

On Fri, Aug 22, 2008 at 11:13 AM, Ian Campbell <[email protected]> wrote:

>> That's probably also a networking device driver issue candidate: your
>> RPC task is queued up waiting to be sent.
>>
>> What networking card+device driver are you using here?
>
> # ethtool -i eth0
> driver: e1000
> version: 7.3.20-k2-NAPI
> firmware-version: N/A
> bus-info: 0000:01:0a.0
There is nothing indicating that the NIC/driver is causing any sort of
problem here, at least not with what has been presented so far. When
the NFS mount isn't working is the networking still active and
working?

--
Cheers,
John

-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/

2008-08-18 18:50:48

by Athanasius

[permalink] [raw]

Subject: Re: NFS regression? Odd delays and lockups accessing an NFS export.

On Mon, Aug 18, 2008 at 12:02:20PM +1000, Grant Coady wrote:
> I've been using NFS here for years, lately there's something odd going on
> since about a month or so. Previously reported last month:
> http://www.gossamer-threads.com/lists/linux/kernel/951419?page=last
>
> Now with 2.6.27-rc3 on one of the client boxes I get a complete stall
> at odd times when accessing the server's exported directory, cannot
> see a pattern to it. Eventually recovers after a Ctrl-C. Nothing in
> the server or client log files. Not easy to reproduce either.

I wonder if this is what I've been seeing. I've been otherwise too
busy to properly report it, thinking that *someone* else must also be
seeing it and it's being worked on, else it's a subtle configuration
problem my end.

I first started seeing this with 2.6.26 on the client end, 2.6.25.10 on
the server end. Things last worked fine with that same server version
and 2.6.25.10 on the client end. The affected export is:

/home/users 192.168.1.161(rw,no_root_squash,sync,no_subtree_check)

using the in-kernel NFS server, not the userspace one.

With the mount on the client being:

192.168.1.162:/home/users /home/users nfs defaults,rw,nfsvers=3,rsize=8192,wsize=8192,nosuid,nodev,soft,intr 0 0

It used to have a 'nolock' in it, but I took that out and saw no
difference. The one possible solid clue as to what's happening is this,
after I turned on all the lock etc checking in a 2.6.26.2 kernel:

Aug 12 16:10:28 emelia kernel: [ 361.851316] INFO: task firefox-bin:5716 blocked for more than 120 seconds.
Aug 12 16:10:28 emelia kernel: [ 361.851326] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 12 16:10:28 emelia kernel: [ 361.851332] firefox-bin D f7e0a018 5556 5716 5696
Aug 12 16:10:28 emelia kernel: [ 361.851348] e24f3ddc 00000046 c2cc2e40 f7e0a018 c0640030 c0642e40 c0642e40 c0642e40
Aug 12 16:10:28 emelia kernel: [ 361.851382] e2ec1018 e2ec1270 c2cc2e40 00000001 c2cc2e40 e24f3db8 00000046 c2c0b680
Aug 12 16:10:28 emelia kernel: [ 361.851406] e2e40da0 c2c0b680 e2ec1270 fffef1e4 00000001 00000046 c2c0b670 c2c0b670
Aug 12 16:10:28 emelia kernel: [ 361.851435] Call Trace:
Aug 12 16:10:28 emelia kernel: [ 361.851449] [<c01e9d71>] nfs_wait_bit_killable+0x2a/0x2e
Aug 12 16:10:28 emelia kernel: [ 361.851462] [<c041ee67>] __wait_on_bit+0x36/0x5d
Aug 12 16:10:28 emelia kernel: [ 361.851475] [<c01e9d47>] ? nfs_wait_bit_killable+0x0/0x2e
Aug 12 16:10:28 emelia kernel: [ 361.851586] [<c041ef3a>] out_of_line_wait_on_bit+0xac/0xb4
Aug 12 16:10:28 emelia kernel: [ 361.851599] [<c01e9d47>] ? nfs_wait_bit_killable+0x0/0x2e
Aug 12 16:10:28 emelia kernel: [ 361.851613] [<c0135286>] ? wake_bit_function+0x0/0x43
Aug 12 16:10:28 emelia kernel: [ 361.851632] [<c01e9d40>] nfs_wait_on_request+0x1f/0x26
Aug 12 16:10:28 emelia kernel: [ 361.851642] [<c01ed9cd>] nfs_sync_mapping_wait+0xde/0x27b
Aug 12 16:10:28 emelia kernel: [ 361.851653] [<c01e9c3d>] ? nfs_pageio_complete+0x8/0xa
Aug 12 16:10:28 emelia kernel: [ 361.851673] [<c01ee372>] __nfs_write_mapping+0x27/0x45
Aug 12 16:10:28 emelia kernel: [ 361.851688] [<c01ee3c9>] nfs_write_mapping+0x39/0x57
Aug 12 16:10:28 emelia kernel: [ 361.851701] [<c01ee40c>] nfs_wb_all+0x10/0x12
Aug 12 16:10:28 emelia kernel: [ 361.851713] [<c01e3169>] nfs_file_flush+0x8b/0xc8
Aug 12 16:10:28 emelia kernel: [ 361.851723] [<c0178505>] filp_close+0x31/0x5a
Aug 12 16:10:28 emelia kernel: [ 361.851735] [<c0126583>] put_files_struct+0x68/0xaa
Aug 12 16:10:28 emelia kernel: [ 361.851747] [<c01265fc>] exit_files+0x37/0x3c
Aug 12 16:10:28 emelia kernel: [ 361.851759] [<c0127988>] do_exit+0x21b/0x632
Aug 12 16:10:28 emelia kernel: [ 361.851771] [<c0127dfd>] do_group_exit+0x5e/0x85
Aug 12 16:10:28 emelia kernel: [ 361.851786] [<c0127e37>] sys_exit_group+0x13/0x15
Aug 12 16:10:28 emelia kernel: [ 361.851799] [<c0102e1d>] sysenter_past_esp+0x6a/0xb1
Aug 12 16:10:28 emelia kernel: [ 361.851816] =======================
Aug 12 16:10:28 emelia kernel: [ 361.851821] INFO: lockdep is turned off.
Aug 12 16:10:30 emelia kernel: [ 363.999492] INFO: task famd:2760 blocked for more than 120 seconds.
Aug 12 16:10:30 emelia kernel: [ 363.999501] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 12 16:10:30 emelia kernel: [ 363.999507] famd D c093dd28 5684 2760 1
Aug 12 16:10:30 emelia kernel: [ 363.999521] f570dbe8 00000046 00000000 c093dd28 c0640030 c0642e40 c0642e40 c0642e40
Aug 12 16:10:30 emelia kernel: [ 363.999697] f6dcc018 f6dcc270 c2cc2e40 00000001 c2cc2e40 f570dbc4 00000046 c2c027b0
Aug 12 16:10:30 emelia kernel: [ 363.999767] f6516da0 c2c027b0 f6dcc270 fffefe66 00000001 00000046 c2c027a0 c2c027a0
Aug 12 16:10:30 emelia kernel: [ 363.999783] Call Trace:
Aug 12 16:10:30 emelia kernel: [ 363.999796] [<c0400fbd>] rpc_wait_bit_killable+0x2a/0x2e
Aug 12 16:10:30 emelia kernel: [ 363.999808] [<c041ee67>] __wait_on_bit+0x36/0x5d
Aug 12 16:10:30 emelia kernel: [ 363.999817] [<c0400f93>] ? rpc_wait_bit_killable+0x0/0x2e
Aug 12 16:10:30 emelia kernel: [ 363.999828] [<c041ef3a>] out_of_line_wait_on_bit+0xac/0xb4
Aug 12 16:10:30 emelia kernel: [ 363.999836] [<c0400f93>] ? rpc_wait_bit_killable+0x0/0x2e
Aug 12 16:10:30 emelia kernel: [ 363.999846] [<c0135286>] ? wake_bit_function+0x0/0x43
Aug 12 16:10:30 emelia kernel: [ 363.999857] [<c040157a>] __rpc_execute+0xcb/0x1e1
Aug 12 16:10:30 emelia kernel: [ 363.999865] [<c04016ab>] rpc_execute+0x1b/0x1e
Aug 12 16:10:30 emelia kernel: [ 363.999872] [<c03fc3d3>] rpc_run_task+0x43/0x49
Aug 12 16:10:30 emelia kernel: [ 363.999881] [<c03fc4de>] rpc_call_sync+0x44/0x5f
Aug 12 16:10:30 emelia kernel: [ 363.999891] [<c01eeeb8>] nfs3_rpc_wrapper+0x17/0x4d
Aug 12 16:10:30 emelia kernel: [ 363.999900] [<c01ef309>] nfs3_proc_access+0xc9/0x121
Aug 12 16:10:30 emelia kernel: [ 363.999919] [<c01e261a>] nfs_do_access+0x129/0x27c
Aug 12 16:10:30 emelia kernel: [ 363.999931] [<c01e2848>] nfs_permission+0xdb/0x13f
Aug 12 16:10:30 emelia kernel: [ 363.999938] [<c01e276d>] ? nfs_permission+0x0/0x13f
Aug 12 16:10:30 emelia kernel: [ 363.999947] [<c0180451>] permission+0x91/0xd0
Aug 12 16:10:30 emelia kernel: [ 363.999956] [<c01804b2>] vfs_permission+0x10/0x12
Aug 12 16:10:30 emelia kernel: [ 363.999963] [<c0181d56>] __link_path_walk+0x106/0xc1d
Aug 12 16:10:30 emelia kernel: [ 363.999972] [<c011764e>] ? kernel_map_pages+0xff/0x116
Aug 12 16:10:30 emelia kernel: [ 363.999986] [<c01828b9>] path_walk+0x4c/0x9b
Aug 12 16:10:30 emelia kernel: [ 363.999994] [<c0182c27>] do_path_lookup+0x198/0x1e1
Aug 12 16:10:30 emelia kernel: [ 364.000003] [<c018355d>] __user_walk_fd+0x2f/0x43
Aug 12 16:10:30 emelia kernel: [ 364.000012] [<c017ce43>] vfs_lstat_fd+0x16/0x3d
Aug 12 16:10:30 emelia kernel: [ 364.000022] [<c0130542>] ? groups_free+0x34/0x38
Aug 12 16:10:30 emelia kernel: [ 364.000034] [<c017cea8>] vfs_lstat+0x11/0x13
Aug 12 16:10:30 emelia kernel: [ 364.000040] [<c017cebe>] sys_lstat64+0x14/0x28
Aug 12 16:10:30 emelia kernel: [ 364.000047] [<c0130542>] ? groups_free+0x34/0x38
Aug 12 16:10:30 emelia kernel: [ 364.000056] [<c0130685>] ? set_current_groups+0x13f/0x149
Aug 12 16:10:30 emelia kernel: [ 364.000067] [<c01311d6>] ? sys_setresuid+0x153/0x165
Aug 12 16:10:30 emelia kernel: [ 364.000077] [<c0102e58>] ? sysenter_past_esp+0xa5/0xb1
Aug 12 16:10:30 emelia kernel: [ 364.000086] [<c0102e1d>] sysenter_past_esp+0x6a/0xb1
Aug 12 16:10:30 emelia kernel: [ 364.000097] =======================
Aug 12 16:10:30 emelia kernel: [ 364.000100] INFO: lockdep is turned off.
Aug 12 16:10:33 emelia kernel: [ 366.146658] INFO: task bash:5640 blocked for more than 120 seconds.
Aug 12 16:10:33 emelia kernel: [ 366.146667] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 12 16:10:33 emelia kernel: [ 366.146672] bash D 00000000 6472 5640 5627
Aug 12 16:10:33 emelia kernel: [ 366.146682] e2d4bad4 00000046 00000000 00000000 c0640030 c0642e40 c0642e40 c0642e40
Aug 12 16:10:33 emelia kernel: [ 366.146699] e2cee018 e2cee270 c2c70e40 00000000 c2c70e40 28d8cec5 00000037 c2c02b70
Aug 12 16:10:33 emelia kernel: [ 366.146713] e2e43da0 c2c02b70 e2cee270 00000046 00000001 00000046 c2c02b60 c2c02b60
Aug 12 16:10:33 emelia kernel: [ 366.146728] Call Trace:
Aug 12 16:10:33 emelia kernel: [ 366.146739] [<c04208e1>] ? _spin_unlock_irqrestore+0x42/0x58
Aug 12 16:10:33 emelia kernel: [ 366.146992] [<c0400fbd>] rpc_wait_bit_killable+0x2a/0x2e
Aug 12 16:10:33 emelia kernel: [ 366.147001] [<c041ee67>] __wait_on_bit+0x36/0x5d
Aug 12 16:10:33 emelia kernel: [ 366.147010] [<c0400f93>] ? rpc_wait_bit_killable+0x0/0x2e
Aug 12 16:10:33 emelia kernel: [ 366.147020] [<c041ef3a>] out_of_line_wait_on_bit+0xac/0xb4
Aug 12 16:10:33 emelia kernel: [ 366.147028] [<c0400f93>] ? rpc_wait_bit_killable+0x0/0x2e
Aug 12 16:10:33 emelia kernel: [ 366.147038] [<c0135286>] ? wake_bit_function+0x0/0x43
Aug 12 16:10:33 emelia kernel: [ 366.147050] [<c040157a>] __rpc_execute+0xcb/0x1e1
Aug 12 16:10:33 emelia kernel: [ 366.147058] [<c04016ab>] rpc_execute+0x1b/0x1e
Aug 12 16:10:33 emelia kernel: [ 366.147065] [<c03fc3d3>] rpc_run_task+0x43/0x49
Aug 12 16:10:33 emelia kernel: [ 366.147074] [<c03fc4de>] rpc_call_sync+0x44/0x5f
Aug 12 16:10:33 emelia kernel: [ 366.147083] [<c01eeeb8>] nfs3_rpc_wrapper+0x17/0x4d
Aug 12 16:10:33 emelia kernel: [ 366.147093] [<c01ef4d4>] nfs3_proc_getattr+0x58/0x7a
Aug 12 16:10:33 emelia kernel: [ 366.147102] [<c01e4b31>] __nfs_revalidate_inode+0x122/0x26d
Aug 12 16:10:33 emelia kernel: [ 366.147116] [<c01e482b>] ? nfs_attribute_timeout+0x0/0xae
Aug 12 16:10:33 emelia kernel: [ 366.147126] [<c01e488c>] ? nfs_attribute_timeout+0x61/0xae
Aug 12 16:10:33 emelia kernel: [ 366.147137] [<c01e048b>] nfs_lookup_revalidate+0x245/0x471
Aug 12 16:10:33 emelia kernel: [ 366.147160] [<c0189b15>] ? __d_lookup+0x12a/0x146
Aug 12 16:10:33 emelia kernel: [ 366.147171] [<c01808ce>] do_lookup+0x115/0x154
Aug 12 16:10:33 emelia kernel: [ 366.147180] [<c0182466>] __link_path_walk+0x816/0xc1d
Aug 12 16:10:33 emelia kernel: [ 366.147190] [<c011764e>] ? kernel_map_pages+0xff/0x116
Aug 12 16:10:33 emelia kernel: [ 366.147203] [<c01828b9>] path_walk+0x4c/0x9b
Aug 12 16:10:33 emelia kernel: [ 366.147212] [<c0182c27>] do_path_lookup+0x198/0x1e1
Aug 12 16:10:33 emelia kernel: [ 366.147220] [<c0182ed0>] __path_lookup_intent_open+0x45/0x76
Aug 12 16:10:33 emelia kernel: [ 366.147230] [<c0182f11>] path_lookup_open+0x10/0x12
Aug 12 16:10:33 emelia kernel: [ 366.147238] [<c01838c9>] do_filp_open+0x9e/0x6cc
Aug 12 16:10:33 emelia kernel: [ 366.147252] [<c0178695>] ? get_unused_fd_flags+0xc3/0xcd
Aug 12 16:10:33 emelia kernel: [ 366.147263] [<c01786df>] do_sys_open+0x40/0xb6
Aug 12 16:10:33 emelia kernel: [ 366.147271] [<c0178797>] sys_open+0x1e/0x26
Aug 12 16:10:33 emelia kernel: [ 366.147279] [<c0102e1d>] sysenter_past_esp+0x6a/0xb1
Aug 12 16:10:33 emelia kernel: [ 366.147290] =======================
Aug 12 16:10:33 emelia kernel: [ 366.147294] INFO: lockdep is turned off.
Aug 12 16:10:41 emelia kernel: [ 374.736634] INFO: task gconfd-2:5492 blocked for more than 120 seconds.
Aug 12 16:10:41 emelia kernel: [ 374.736644] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 12 16:10:41 emelia kernel: [ 374.736648] gconfd-2 D e31a5dec 5132 5492 1
Aug 12 16:10:41 emelia kernel: [ 374.736658] e31a5e48 00000046 f62d3b28 e31a5dec c0640030 c0642e40 c0642e40 c0642e40
Aug 12 16:10:41 emelia kernel: [ 374.736674] e44e9018 e44e9270 c2c70e40 00000000 c2c70e40 e31a5e24 00000046 c2c05280
Aug 12 16:10:41 emelia kernel: [ 374.736689] f40a9da0 c2c05280 e44e9270 ffff24ad 00000001 00000046 c2c05270 c2c05270
Aug 12 16:10:41 emelia kernel: [ 374.736704] Call Trace:
Aug 12 16:10:41 emelia kernel: [ 374.736716] [<c01e9d71>] nfs_wait_bit_killable+0x2a/0x2e
Aug 12 16:10:41 emelia kernel: [ 374.736728] [<c041ee67>] __wait_on_bit+0x36/0x5d
Aug 12 16:10:41 emelia kernel: [ 374.736737] [<c01e9d47>] ? nfs_wait_bit_killable+0x0/0x2e
Aug 12 16:10:41 emelia kernel: [ 374.736747] [<c041ef3a>] out_of_line_wait_on_bit+0xac/0xb4
Aug 12 16:10:41 emelia kernel: [ 374.736755] [<c01e9d47>] ? nfs_wait_bit_killable+0x0/0x2e
Aug 12 16:10:41 emelia kernel: [ 374.736765] [<c0135286>] ? wake_bit_function+0x0/0x43
Aug 12 16:10:41 emelia kernel: [ 374.736777] [<c01e9d40>] nfs_wait_on_request+0x1f/0x26
Aug 12 16:10:41 emelia kernel: [ 374.736784] [<c01ed9cd>] nfs_sync_mapping_wait+0xde/0x27b
Aug 12 16:10:41 emelia kernel: [ 374.736791] [<c01e9c3d>] ? nfs_pageio_complete+0x8/0xa
Aug 12 16:10:41 emelia kernel: [ 374.736805] [<c01ee372>] __nfs_write_mapping+0x27/0x45
Aug 12 16:10:41 emelia kernel: [ 374.736813] [<c01ee3c9>] nfs_write_mapping+0x39/0x57
Aug 12 16:10:41 emelia kernel: [ 374.736910] [<c01ee40c>] nfs_wb_all+0x10/0x12
Aug 12 16:10:41 emelia kernel: [ 374.736917] [<c01e3169>] nfs_file_flush+0x8b/0xc8
Aug 12 16:10:41 emelia kernel: [ 374.736926] [<c0178505>] filp_close+0x31/0x5a
Aug 12 16:10:41 emelia kernel: [ 374.736934] [<c0178598>] sys_close+0x6a/0xa4
Aug 12 16:10:41 emelia kernel: [ 374.736942] [<c0102e1d>] sysenter_past_esp+0x6a/0xb1
Aug 12 16:10:41 emelia kernel: [ 374.736954] =======================
Aug 12 16:10:41 emelia kernel: [ 374.736958] INFO: lockdep is turned off.

Server and client kernel configs attached.
--
- Athanasius = Athanasius(at)miggy.org / http://www.miggy.org/
Finger athan(at)fysh.org for PGP key
"And it's me who is my enemy. Me who beats me up.
Me who makes the monsters. Me who strips my confidence." Paula Cole - ME

Attachments:

(No filename) (0.00 B)
signature.asc (189.00 B)
Digital signature Download all attachments

2008-08-18 19:19:33

by Trond Myklebust

[permalink] [raw]

Subject: Re: NFS regression? Odd delays and lockups accessing an NFS export.

On Mon, 2008-08-18 at 19:50 +0100, Athanasius wrote:
> On Mon, Aug 18, 2008 at 12:02:20PM +1000, Grant Coady wrote:
> > I've been using NFS here for years, lately there's something odd going on
> > since about a month or so. Previously reported last month:
> > http://www.gossamer-threads.com/lists/linux/kernel/951419?page=last
> >
> > Now with 2.6.27-rc3 on one of the client boxes I get a complete stall
> > at odd times when accessing the server's exported directory, cannot
> > see a pattern to it. Eventually recovers after a Ctrl-C. Nothing in
> > the server or client log files. Not easy to reproduce either.
>
> I wonder if this is what I've been seeing. I've been otherwise too
> busy to properly report it, thinking that *someone* else must also be
> seeing it and it's being worked on, else it's a subtle configuration
> problem my end.

Your lockdep trace basically shows that the rpc layer is blocking for
some reason. Could you please try to reproduce the problem, and then do

echo 0 >/proc/sys/sunrpc/rpc_debug

...and see what the output from 'dmesg' shows?

Cheers
Trond

2008-08-18 19:21:03

by Trond Myklebust

[permalink] [raw]

Subject: Re: NFS regression? Odd delays and lockups accessing an NFS export.

On Mon, 2008-08-18 at 12:02 +1000, Grant Coady wrote:
> Hi there,
>
> I've been using NFS here for years, lately there's something odd going on
> since about a month or so. Previously reported last month:
> http://www.gossamer-threads.com/lists/linux/kernel/951419?page=last
>
> Now with 2.6.27-rc3 on one of the client boxes I get a complete stall
> at odd times when accessing the server's exported directory, cannot
> see a pattern to it. Eventually recovers after a Ctrl-C. Nothing in
> the server or client log files. Not easy to reproduce either.
>
> The server runs 2.6.26.2 at the moment.
>
> Server config, etc: http://bugsplatter.id.au/kernel/boxen/deltree/
> Client config, etc: http://bugsplatter.id.au/kernel/boxen/pooh/

Please try to reproduce the hang, then do

echo 0 >/proc/sys/sunrpc/rpc_debug

and send the output from 'dmesg'...

Cheers
Trond

2008-08-18 19:37:48

by J. K. Cliburn

[permalink] [raw]

Subject: Re: NFS regression? Odd delays and lockups accessing an NFS export.

Athanasius wrote:
> On Mon, Aug 18, 2008 at 12:02:20PM +1000, Grant Coady wrote:
>> I've been using NFS here for years, lately there's something odd going on
>> since about a month or so. Previously reported last month:
>> http://www.gossamer-threads.com/lists/linux/kernel/951419?page=last
>>
>> Now with 2.6.27-rc3 on one of the client boxes I get a complete stall
>> at odd times when accessing the server's exported directory, cannot
>> see a pattern to it. Eventually recovers after a Ctrl-C. Nothing in
>> the server or client log files. Not easy to reproduce either.
>
> I wonder if this is what I've been seeing. I've been otherwise too
> busy to properly report it, thinking that *someone* else must also be
> seeing it and it's being worked on, else it's a subtle configuration
> problem my end.
>
> I first started seeing this with 2.6.26 on the client end,

I see you're running the atl1 driver on your client. I'm chasing an
intermittent bug in that driver that seems to be affected by certain
offload parameters. Try disabling TSO and see if things improve.

ethtool -K eth0 tso off

Let me know if it helps.

Jay

2008-08-18 23:13:42

by Athanasius

[permalink] [raw]

Subject: Re: NFS regression? Odd delays and lockups accessing an NFS export.

On Mon, Aug 18, 2008 at 02:37:45PM -0500, J. K. Cliburn wrote:
> Athanasius wrote:
> > I wonder if this is what I've been seeing. I've been otherwise too
> >busy to properly report it, thinking that *someone* else must also be
> >seeing it and it's being worked on, else it's a subtle configuration
> >problem my end.
> >
> > I first started seeing this with 2.6.26 on the client end,
>
> I see you're running the atl1 driver on your client. I'm chasing an
> intermittent bug in that driver that seems to be affected by certain
> offload parameters. Try disabling TSO and see if things improve.
>
> ethtool -K eth0 tso off
>
> Let me know if it helps.

Indeed it does, and indeed tso was on by default. I booted up once
after first trying this, thinking I had things set to apply it
automatically, but hadn't and ran into the exact same problem. This 2nd
time around with the setting definitely applied things appear to be
working normally. At least I can actually start Firefox up and have it
work (lots of /home/users/... access over NFS to the server).

thanks,

-Ath
--
- Athanasius = Athanasius(at)miggy.org / http://www.miggy.org/
Finger athan(at)fysh.org for PGP key
"And it's me who is my enemy. Me who beats me up.
Me who makes the monsters. Me who strips my confidence." Paula Cole - ME

Attachments:

(No filename) (1.30 kB)
signature.asc (189.00 B)
Digital signature Download all attachments

2009-05-12 20:27:58

by Frank Filz

[permalink] [raw]

Subject: Re: NFS regression? Odd delays and lockups accessing an NFS export.

On Mon, 2008-08-18 at 14:37 -0500, J. K. Cliburn wrote:
> Athanasius wrote:
> > On Mon, Aug 18, 2008 at 12:02:20PM +1000, Grant Coady wrote:
> >> I've been using NFS here for years, lately there's something odd going on
> >> since about a month or so. Previously reported last month:
> >> http://www.gossamer-threads.com/lists/linux/kernel/951419?page=last
> >>
> >> Now with 2.6.27-rc3 on one of the client boxes I get a complete stall
> >> at odd times when accessing the server's exported directory, cannot
> >> see a pattern to it. Eventually recovers after a Ctrl-C. Nothing in
> >> the server or client log files. Not easy to reproduce either.
> >
> > I wonder if this is what I've been seeing. I've been otherwise too
> > busy to properly report it, thinking that *someone* else must also be
> > seeing it and it's being worked on, else it's a subtle configuration
> > problem my end.
> >
> > I first started seeing this with 2.6.26 on the client end,
>
> I see you're running the atl1 driver on your client. I'm chasing an
> intermittent bug in that driver that seems to be affected by certain
> offload parameters. Try disabling TSO and see if things improve.

Was this ever resolved? We're seeing a similar hangup in a 2.6.27
kernel.

Thanks

Frank Filz

2009-05-13 00:05:46

by Jay Cliburn

[permalink] [raw]

Subject: Re: NFS regression? Odd delays and lockups accessing an NFS export.

On Tue, 12 May 2009 13:27:58 -0700
Frank Filz <[email protected]> wrote:

> On Mon, 2008-08-18 at 14:37 -0500, J. K. Cliburn wrote:
> > Athanasius wrote:
> > > On Mon, Aug 18, 2008 at 12:02:20PM +1000, Grant Coady wrote:
> > >> I've been using NFS here for years, lately there's something odd going on
> > >> since about a month or so. Previously reported last month:
> > >> http://www.gossamer-threads.com/lists/linux/kernel/951419?page=last
[...]
> > I see you're running the atl1 driver on your client. I'm chasing an
> > intermittent bug in that driver that seems to be affected by certain
> > offload parameters. Try disabling TSO and see if things improve.
>
> Was this ever resolved? We're seeing a similar hangup in a 2.6.27
> kernel.

The TSO problem in the atl1 driver was never resolved, and TSO was turned off by default with commit 82c26a9d117f0178b8c1b33429014b6d99c470f6 in the 2.6.27 cycle. It was also added to the -stable tree at 2.6.26.4.

http://lkml.org/lkml/2008/9/3/251

Jay