LinuxLists.cc - Client hangs in __mutex_lock_slowpath, network connection maxed out

2011-02-03 18:10:53

Subject: Client hangs in __mutex_lock_slowpath, network connection maxed out

Hello,

I'm mounting an NFS4 volume from a kernel 2.6.32-27 server in a
2.6.37-020637rc2 client.
Very often, when there is significant IO on the mounted volume, the
client hangs. The process accessing the mountpoint is stuck in D
state. Any more attempts to access the mountpoint get stuck in D state
as well. Shortly after, the kernel reports the task as hanging:

[ 3120.620125] Call Trace:
[ 3120.620138] [<c05ec44d>] ? _raw_spin_lock+0xd/0x10
[ 3120.620146] [<c05eb4b6>] __mutex_lock_slowpath+0xc6/0x120
[ 3120.620152] [<c05eb1c5>] mutex_lock+0x25/0x40
[ 3120.620158] [<c022e3b3>] do_lookup+0xc3/0x120
[ 3120.620164] [<c022fbb2>] link_path_walk+0x402/0x9b0
[ 3120.620171] [<c0167b5a>] ? queue_work+0x1a/0x20

Additional tasks hang at a slightly different point:

[ 3120.620304] [<f90b863d>] ? nfs_access_add_cache+0xcd/0x120 [nfs]
[ 3120.620326] [<f90b8709>] ? nfs_do_access+0x79/0xc0 [nfs]
[ 3120.620333] [<c05eb4b6>] __mutex_lock_slowpath+0xc6/0x120
[ 3120.620339] [<c05eb1c5>] mutex_lock+0x25/0x40
[ 3120.620344] [<c022e3b3>] do_lookup+0xc3/0x120
[ 3120.620350] [<c022fbb2>] link_path_walk+0x402/0x9b0
[ 3120.620356] [<c0230697>] path_walk+0x47/0xa0
[ 3120.620361] [<c0230819>] do_path_lookup+0x59/0xb0
[ 3120.620367] [<c02311ff>] user_path_at+0x3f/0x70
[ 3120.620374] [<c02005b8>] ? do_linear_fault+0x58/0x70
[ 3120.620380] [<c020133a>] ? handle_mm_fault+0xda/0x230
[ 3120.620387] [<c0228899>] vfs_fstatat+0x49/0x70

Finally, the client saturates the 100 MBit connection with sending and
receiving from the NFS server. I'm pretty sure this is garbage data,
even if the client would copy the entire mountpoint there wouldn't be
that much data to copy.

There are no visible symptoms on the server.

Anyone able to help?

Thanks,

-Nikolaus

--
»Time flies like an arrow, fruit flies like a Banana.«

PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6 02CF A9AD B7F8 AE4E 425C

2011-02-03 18:14:05

by J. Bruce Fields

[permalink] [raw]

Subject: Re: Client hangs in __mutex_lock_slowpath, network connection maxed out

On Thu, Feb 03, 2011 at 01:04:55PM -0500, Nikolaus Rath wrote:
> Hello,
>
> I'm mounting an NFS4 volume from a kernel 2.6.32-27 server in a
> 2.6.37-020637rc2 client.
> Very often, when there is significant IO on the mounted volume, the
> client hangs. The process accessing the mountpoint is stuck in D
> state. Any more attempts to access the mountpoint get stuck in D state
> as well. Shortly after, the kernel reports the task as hanging:
>
> [ 3120.620125] Call Trace:
> [ 3120.620138] [<c05ec44d>] ? _raw_spin_lock+0xd/0x10
> [ 3120.620146] [<c05eb4b6>] __mutex_lock_slowpath+0xc6/0x120
> [ 3120.620152] [<c05eb1c5>] mutex_lock+0x25/0x40
> [ 3120.620158] [<c022e3b3>] do_lookup+0xc3/0x120
> [ 3120.620164] [<c022fbb2>] link_path_walk+0x402/0x9b0
> [ 3120.620171] [<c0167b5a>] ? queue_work+0x1a/0x20
>
> Additional tasks hang at a slightly different point:
>
> [ 3120.620304] [<f90b863d>] ? nfs_access_add_cache+0xcd/0x120 [nfs]
> [ 3120.620326] [<f90b8709>] ? nfs_do_access+0x79/0xc0 [nfs]
> [ 3120.620333] [<c05eb4b6>] __mutex_lock_slowpath+0xc6/0x120
> [ 3120.620339] [<c05eb1c5>] mutex_lock+0x25/0x40
> [ 3120.620344] [<c022e3b3>] do_lookup+0xc3/0x120
> [ 3120.620350] [<c022fbb2>] link_path_walk+0x402/0x9b0
> [ 3120.620356] [<c0230697>] path_walk+0x47/0xa0
> [ 3120.620361] [<c0230819>] do_path_lookup+0x59/0xb0
> [ 3120.620367] [<c02311ff>] user_path_at+0x3f/0x70
> [ 3120.620374] [<c02005b8>] ? do_linear_fault+0x58/0x70
> [ 3120.620380] [<c020133a>] ? handle_mm_fault+0xda/0x230
> [ 3120.620387] [<c0228899>] vfs_fstatat+0x49/0x70
>
>
> Finally, the client saturates the 100 MBit connection with sending and
> receiving from the NFS server. I'm pretty sure this is garbage data,
> even if the client would copy the entire mountpoint there wouldn't be
> that much data to copy.

Could you take a look at that traffic with wireshark (or send the raw
packet dump data) and see what that traffic consists of?

For example it may be repeating the same few operations, in which case
there's probably some kind of infinite loop.

--b.

2011-02-03 20:10:35

by Myklebust, Trond

[permalink] [raw]

Subject: Re: Client hangs in __mutex_lock_slowpath, network connection maxed out

On Thu, 2011-02-03 at 13:04 -0500, Nikolaus Rath wrote:
> Hello,
>
> I'm mounting an NFS4 volume from a kernel 2.6.32-27 server in a
> 2.6.37-020637rc2 client.

2.6.37-rc2 is known to have a number of NFS bugs (particularly in the
readdir code) which are fixed in the 2.6.37 release. Please upgrade.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2011-02-03 20:17:16

by Myklebust, Trond

[permalink] [raw]

Subject: Re: Client hangs in __mutex_lock_slowpath, network connection maxed out

On Thu, 2011-02-03 at 13:14 -0500, J. Bruce Fields wrote:
> On Thu, Feb 03, 2011 at 01:04:55PM -0500, Nikolaus Rath wrote:
> > Hello,
> >
> > I'm mounting an NFS4 volume from a kernel 2.6.32-27 server in a
> > 2.6.37-020637rc2 client.
> > Very often, when there is significant IO on the mounted volume, the
> > client hangs. The process accessing the mountpoint is stuck in D
> > state. Any more attempts to access the mountpoint get stuck in D state
> > as well. Shortly after, the kernel reports the task as hanging:
> >
> > [ 3120.620125] Call Trace:
> > [ 3120.620138] [<c05ec44d>] ? _raw_spin_lock+0xd/0x10
> > [ 3120.620146] [<c05eb4b6>] __mutex_lock_slowpath+0xc6/0x120
> > [ 3120.620152] [<c05eb1c5>] mutex_lock+0x25/0x40
> > [ 3120.620158] [<c022e3b3>] do_lookup+0xc3/0x120
> > [ 3120.620164] [<c022fbb2>] link_path_walk+0x402/0x9b0
> > [ 3120.620171] [<c0167b5a>] ? queue_work+0x1a/0x20
> >
> > Additional tasks hang at a slightly different point:
> >
> > [ 3120.620304] [<f90b863d>] ? nfs_access_add_cache+0xcd/0x120 [nfs]
> > [ 3120.620326] [<f90b8709>] ? nfs_do_access+0x79/0xc0 [nfs]
> > [ 3120.620333] [<c05eb4b6>] __mutex_lock_slowpath+0xc6/0x120
> > [ 3120.620339] [<c05eb1c5>] mutex_lock+0x25/0x40
> > [ 3120.620344] [<c022e3b3>] do_lookup+0xc3/0x120
> > [ 3120.620350] [<c022fbb2>] link_path_walk+0x402/0x9b0
> > [ 3120.620356] [<c0230697>] path_walk+0x47/0xa0
> > [ 3120.620361] [<c0230819>] do_path_lookup+0x59/0xb0
> > [ 3120.620367] [<c02311ff>] user_path_at+0x3f/0x70
> > [ 3120.620374] [<c02005b8>] ? do_linear_fault+0x58/0x70
> > [ 3120.620380] [<c020133a>] ? handle_mm_fault+0xda/0x230
> > [ 3120.620387] [<c0228899>] vfs_fstatat+0x49/0x70
> >
> >
> > Finally, the client saturates the 100 MBit connection with sending and
> > receiving from the NFS server. I'm pretty sure this is garbage data,
> > even if the client would copy the entire mountpoint there wouldn't be
> > that much data to copy.
>
> Could you take a look at that traffic with wireshark (or send the raw
> packet dump data) and see what that traffic consists of?
>
> For example it may be repeating the same few operations, in which case
> there's probably some kind of infinite loop.

Not before upgrading the client. As I said earlier, there are a number
of known nasty bugs/hangs in 2.6.37-rc2, so there is no point in
repeating the debugging process for them...
--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com