2012-12-19 20:47:22

by Orion Poplawski

Subject: umount(,MNT_DETACH) for nfsv4 hangs when using sec=krb5 and network is down

nfs4 mounts with sec=krb5 cannot be unmounted with the network down, even with
umount -l, because umount() with MNT_DETACH set will hang, presumably somewhere
in the gss stack.
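
For reference, umount -l boils down to the lazy-detach syscall, so a trivial
reproducer along these lines (an untested sketch, not the real util-linux code)
should be enough to hit the hang:

/* lazy_umount.c - minimal illustration of the syscall "umount -l" makes.
 * Untested sketch; the real util-linux umount does much more than this. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

int main(int argc, char **argv)
{
        if (argc != 2) {
                fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
                return 1;
        }
        /* MNT_DETACH = lazy unmount: detach the mount now, clean up once it
         * is no longer busy.  With sec=krb5 and the network down, this call
         * never returns. */
        if (umount2(argv[1], MNT_DETACH) != 0) {
                fprintf(stderr, "umount2(%s): %s\n", argv[1], strerror(errno));
                return 1;
        }
        return 0;
}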

A successful umount yields the following packet trace:

1 0.000000000 10.10.20.2 -> 10.10.10.1 NFS 218 V4 Call GETATTR FH:0x3b470ee7
2 0.000236000 10.10.10.1 -> 10.10.20.2 NFS 318 V4 Reply (Call In 1) GETATTR
3 0.000282000 10.10.20.2 -> 10.10.10.1 TCP 66 943 > nfs [ACK] Seq=153
Ack=253 Win=331 Len=0 TSval=3468186 TSecr=878557922
4 0.008761000 10.10.20.2 -> 10.10.10.1 TCP 66 943 > nfs [FIN, ACK] Seq=153
Ack=253 Win=331 Len=0 TSval=3468195 TSecr=878557922
5 0.008923000 10.10.10.1 -> 10.10.20.2 TCP 66 nfs > 943 [FIN, ACK] Seq=253
Ack=154 Win=683 Len=0 TSval=878557930 TSecr=3468195
6 0.008970000 10.10.20.2 -> 10.10.10.1 TCP 66 943 > nfs [ACK] Seq=154
Ack=254 Win=331 Len=0 TSval=3468195 TSecr=878557930

So my guess is that something in the gss stack is preventing the GETATTR call
from succeeding, since unmounting succeeds without sec=krb5. However, running
rpc.gssd and rpc.idmapd with -vvvv does not appear to produce any output when
the umount hangs. A successful unmount produces:

Dec 19 13:42:44 orca rpc.gssd[18495]: destroying client
/var/lib/nfs/rpc_pipefs/nfs/clnt27
Dec 19 13:42:44 orca rpc.gssd[18495]: destroying client
/var/lib/nfs/rpc_pipefs/nfs/clnt24

However, we need some way to be able to drop mounts after the network connection
has been removed. This behavior is causing severe problems for our laptop and
VPN users.

Tested with:

3.6.11-3.fc18
nfs-utils-1.2.7-2.fc18

I've also filed https://bugzilla.redhat.com/show_bug.cgi?id=888942

--
Orion Poplawski
Technical Manager 303-415-9701 x222
NWRA, Boulder Office FAX: 303-415-9702
3380 Mitchell Lane [email protected]
Boulder, CO 80301 http://www.nwra.com



2012-12-20 20:31:15

by Orion Poplawski

Subject: Re: umount(,MNT_DETACH) for nfsv4 hangs when using sec=krb5 and network is down

On 12/19/2012 03:19 PM, Myklebust, Trond wrote:
>
> Commit eb96d5c97b0825d542e9c4ba5e0a22b519355166 (SUNRPC handle
> EKEYEXPIRED in call_refreshresult), which will be in 3.8-rc1 when Linus
> releases it, may help.
>

FWIW - I cherry-picked that into the latest Fedora rawhide kernel, but it had no
effect. Sounds like a nice patch though; the current hang-forever behavior
doesn't seem to trigger the needed "ah, need a new ticket" response.

--
Orion Poplawski
Technical Manager 303-415-9701 x222
NWRA, Boulder Office FAX: 303-415-9702
3380 Mitchell Lane [email protected]
Boulder, CO 80301 http://www.nwra.com

2012-12-19 21:09:02

by Myklebust, Trond

Subject: Re: umount(,MNT_DETACH) for nfsv4 hangs when using sec=krb5 and network is down

On Wed, 2012-12-19 at 20:47 +0000, Orion Poplawski wrote:
> nfs4 mounts with sec=krb5 cannot be unmounted with the network down, even with
> umount -l, because umount() with MNT_DETACH set will hang, presumably somewhere
> in the gss stack.
>
> A successful umount yields the following packet trace:
>
> 1 0.000000000 10.10.20.2 -> 10.10.10.1 NFS 218 V4 Call GETATTR FH:0x3b470ee7
> 2 0.000236000 10.10.10.1 -> 10.10.20.2 NFS 318 V4 Reply (Call In 1) GETATTR
> 3 0.000282000 10.10.20.2 -> 10.10.10.1 TCP 66 943 > nfs [ACK] Seq=153
> Ack=253 Win=331 Len=0 TSval=3468186 TSecr=878557922
> 4 0.008761000 10.10.20.2 -> 10.10.10.1 TCP 66 943 > nfs [FIN, ACK] Seq=153
> Ack=253 Win=331 Len=0 TSval=3468195 TSecr=878557922
> 5 0.008923000 10.10.10.1 -> 10.10.20.2 TCP 66 nfs > 943 [FIN, ACK] Seq=253
> Ack=154 Win=683 Len=0 TSval=878557930 TSecr=3468195
> 6 0.008970000 10.10.20.2 -> 10.10.10.1 TCP 66 943 > nfs [ACK] Seq=154
> Ack=254 Win=331 Len=0 TSval=3468195 TSecr=878557930
>
> So my guess is that something in the gss stack is preventing the GETATTR call
> from succeeding, since unmounting succeeds without sec=krb5. However, running
> rpc.gssd and rpc.idmapd with -vvvv does not appear to produce any output when
> the umount hangs. A successful unmount produces:
>
> Dec 19 13:42:44 orca rpc.gssd[18495]: destroying client
> /var/lib/nfs/rpc_pipefs/nfs/clnt27
> Dec 19 13:42:44 orca rpc.gssd[18495]: destroying client
> /var/lib/nfs/rpc_pipefs/nfs/clnt24
>
> However, we need some way to be able to drop mounts after the network connection
> has been removed. This behavior is causing severe problems for our laptop and
> VPN users.
>
> Tested with:
>
> 3.6.11-3.fc18
> nfs-utils-1.2.7-2.fc18
>
> I've also filed https://bugzilla.redhat.com/show_bug.cgi?id=888942

No. What you need is a way to unmount _before_ you kill the network.
Once the network is gone, you are in severe data loss territory, and you
are entirely on your own dealing with that problem...

Maybe one day we will get round to supporting offline mounts, but that's
not the case today.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2012-12-20 21:52:45

by Orion Poplawski

Subject: Re: umount(,MNT_DETACH) for nfsv4 hangs when using sec=krb5 and network is down

On 12/20/2012 01:47 PM, Myklebust, Trond wrote:
> On Thu, 2012-12-20 at 13:31 -0700, Orion Poplawski wrote:
>> On 12/19/2012 03:19 PM, Myklebust, Trond wrote:
>>>
>>> Commit eb96d5c97b0825d542e9c4ba5e0a22b519355166 (SUNRPC handle
>>> EKEYEXPIRED in call_refreshresult), which will be in 3.8-rc1 when Linus
>>> releases it, may help.
>>>
>>
>> FWIW - I cherry-picked that into the latest Fedora rawhide kernel, but it had no
>> effect. Sounds like a nice patch though; the current hang-forever behavior
>> doesn't seem to trigger the needed "ah, need a new ticket" response.
>>
>
> So does simply killing the rpc.gssd process help?
>

Yes, if automount is already stopped (these are automounted directories). If
automount is running, it still seems to hang. I think I'm going to need to
spend some time talking to Ian.

--
Orion Poplawski
Technical Manager 303-415-9701 x222
NWRA, Boulder Office FAX: 303-415-9702
3380 Mitchell Lane [email protected]
Boulder, CO 80301 http://www.nwra.com

2012-12-19 22:24:55

by Orion Poplawski

Subject: Re: umount(,MNT_DETACH) for nfsv4 hangs when using sec=krb5 and network is down

On 12/19/2012 02:08 PM, Myklebust, Trond wrote:
> On Wed, 2012-12-19 at 20:47 +0000, Orion Poplawski wrote:
>>
>> However, we need some way to be able to drop mounts after the network connection
>> has been removed. This behavior is causing severe problems for our laptop and
>> VPN users.
>>
>> Tested with:
>>
>> 3.6.11-3.fc18
>> nfs-utils-1.2.7-2.fc18
>>
>> I've also filed https://bugzilla.redhat.com/show_bug.cgi?id=888942
>
> No. What you need is a way to unmount _before_ you kill the network.
> Once the network is gone, you are in severe data loss territory, and you
> are entirely on your own dealing with that problem...
>
> Maybe one day we will get round to supporting offline mounts, but that's
> not the case today.
>

I agree (see https://bugzilla.gnome.org/show_bug.cgi?id=387832 for example).
However, it does happen (and, given the lack of support indicated by that bug,
it is hard to prevent), and it seems unfortunate to then subject the user to
hanging mounts, which will effectively lock up the desktop. Unmounting in this
situation currently works for sec=sys mounts, so I thought it would be
worthwhile making it work for sec=krb5 mounts as well. The same data loss
issues are present for both.

We have put work in the past into making umount work for offline nfs mounts
(https://bugzilla.redhat.com/show_bug.cgi?id=820707). In fact that looks
remarkably familiar :).

[ 131.832005] umount.nfs4 D f1585bc8 0 1959 1958 0x00000080
[ 131.832005] f1585c34 00000086 0000ea8a f1585bc8 c045a297 f705f110 644b6440
0000001c
[ 131.832005] f1585bd8 c0cd5080 c0cd5080 00000282 f1585c00 f7591080 f3a27110
f1585c24
[ 131.832005] 00000000 c0d2e280 00000282 00000246 f1585c00 c097a273 f1585c2c
f7ee11c5
[ 131.832005] Call Trace:
[ 131.832005] [<c045a297>] ? __internal_add_timer+0x77/0xc0
[ 131.832005] [<c097a273>] ? _raw_spin_unlock_bh+0x13/0x20
[ 131.832005] [<f7ee11c5>] ? rpc_wake_up_first+0x65/0x180 [sunrpc]
[ 131.832005] [<f7eda240>] ? rpc_show_tasks+0x1b0/0x1b0 [sunrpc]
[ 131.832005] [<c09794d3>] schedule+0x23/0x60
[ 131.832005] [<f7ee064d>] rpc_wait_bit_killable+0x2d/0x70 [sunrpc]
[ 131.832005] [<c0977fc1>] __wait_on_bit+0x51/0x70
[ 131.832005] [<f7ee0620>] ? __rpc_wait_for_completion_task+0x30/0x30 [sunrpc]
[ 131.832005] [<f7ee0620>] ? __rpc_wait_for_completion_task+0x30/0x30 [sunrpc]
[ 131.832005] [<c0978041>] out_of_line_wait_on_bit+0x61/0x70
[ 131.832005] [<c046c100>] ? autoremove_wake_function+0x50/0x50
[ 131.832005] [<f7ee198f>] __rpc_execute+0x11f/0x340 [sunrpc]
[ 131.832005] [<c0507774>] ? mempool_alloc+0x44/0x120
[ 131.832005] [<f7ed8a50>] ? call_connect+0x90/0x90 [sunrpc]
[ 131.832005] [<f7ed8a50>] ? call_connect+0x90/0x90 [sunrpc]
[ 131.832005] [<c046c0a3>] ? wake_up_bit+0x23/0x30
[ 131.832005] [<f7ee1ec8>] rpc_execute+0x48/0x80 [sunrpc]
[ 131.832005] [<f7ed9929>] rpc_run_task+0x59/0x70 [sunrpc]
[ 131.832005] [<f7ed9a3c>] rpc_call_sync+0x3c/0x60 [sunrpc]
[ 131.832005] [<f8a402fc>] _nfs4_call_sync+0x3c/0x50 [nfsv4]
[ 131.832005] [<f8a403d5>] _nfs4_proc_getattr+0x95/0xa0 [nfsv4]
[ 131.832005] [<f8a41bab>] nfs4_proc_getattr+0x3b/0x60 [nfsv4]
[ 131.832005] [<f897f891>] __nfs_revalidate_inode+0x81/0x210 [nfs]
[ 131.832005] [<f897fbd2>] nfs_revalidate_inode+0x62/0x90 [nfs]
[ 131.832005] [<f89793ef>] nfs_check_verifier+0x4f/0x80 [nfs]
[ 131.832005] [<f897b4da>] nfs_lookup_revalidate+0x2ba/0x440 [nfs]
[ 131.832005] [<c055f8cb>] ? follow_managed+0x19b/0x200
[ 131.832005] [<c0560000>] ? unlazy_walk+0xf0/0x1a0
[ 131.832005] [<f897c184>] nfs4_lookup_revalidate+0x34/0xe0 [nfs]
[ 131.832005] [<c055fedc>] complete_walk+0x8c/0xc0
[ 131.832005] [<c05611b3>] path_lookupat+0x63/0x650
[ 131.832005] [<c05617ca>] do_path_lookup+0x2a/0xb0
[ 131.832005] [<c0563df6>] user_path_at_empty+0x46/0x80
[ 131.832005] [<c097d440>] ? vmalloc_fault+0x176/0x176
[ 131.832005] [<c097d5f7>] ? do_page_fault+0x1b7/0x450
[ 131.832005] [<c0563e4f>] user_path_at+0x1f/0x30
[ 131.832005] [<c05707b1>] sys_umount+0x41/0x340
[ 131.832005] [<c04bd59c>] ? __audit_syscall_entry+0xbc/0x290
[ 131.832005] [<c04bdac6>] ? __audit_syscall_exit+0x356/0x3b0
[ 131.832005] [<c0980fdf>] sysenter_do_call+0x12/0x28

I wonder if it never did get fixed for krb5 mounts then...

Bah.

--
Orion Poplawski
Technical Manager 303-415-9701 x222
NWRA, Boulder Office FAX: 303-415-9702
3380 Mitchell Lane [email protected]
Boulder, CO 80301 http://www.nwra.com

2012-12-20 20:47:54

by Myklebust, Trond

Subject: Re: umount(,MNT_DETACH) for nfsv4 hangs when using sec=krb5 and network is down

On Thu, 2012-12-20 at 13:31 -0700, Orion Poplawski wrote:
> On 12/19/2012 03:19 PM, Myklebust, Trond wrote:
> >
> > Commit eb96d5c97b0825d542e9c4ba5e0a22b519355166 (SUNRPC handle
> > EKEYEXPIRED in call_refreshresult), which will be in 3.8-rc1 when Linus
> > releases it, may help.
> >
>
> FWIW - I cherry-picked that into the latest Fedora rawhide kernel, but it had no
> effect. Sounds like a nice patch though; the current hang-forever behavior
> doesn't seem to trigger the needed "ah, need a new ticket" response.
>

So does simply killing the rpc.gssd process help?
--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2012-12-20 22:01:57

by Myklebust, Trond

Subject: Re: umount(,MNT_DETACH) for nfsv4 hangs when using sec=krb5 and network is down

On Thu, 2012-12-20 at 14:52 -0700, Orion Poplawski wrote:
> On 12/20/2012 01:47 PM, Myklebust, Trond wrote:
> > On Thu, 2012-12-20 at 13:31 -0700, Orion Poplawski wrote:
> >> On 12/19/2012 03:19 PM, Myklebust, Trond wrote:
> >>>
> >>> Commit eb96d5c97b0825d542e9c4ba5e0a22b519355166 (SUNRPC handle
> >>> EKEYEXPIRED in call_refreshresult), which will be in 3.8-rc1 when Linus
> >>> releases it, may help.
> >>>
> >>
> >> FWIW - I cherry-picked that into the latest Fedora rawhide kernel, but it had no
> >> effect. Sounds like a nice patch though; the current hang-forever behavior
> >> doesn't seem to trigger the needed "ah, need a new ticket" response.
> >>
> >
> > So does simply killing the rpc.gssd process help?
> >
>
> Yes, if automount is already stopped (these are automounted directories). If
> automount is running, it still seems to hang. I think I'm going to need to
> spend some time talking to Ian.
>

I'd suggest also taking a long hard look at rpc.gssd and making sure
that it handles ENETUNREACH, ECONNREFUSED and friends correctly. I
suspect right now it is just bailing out of the upcall instead of
completing it by propagating the error reply to the kernel.
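
In other words, something along these lines. This is only a toy sketch with
stub names to show the intended control flow, not the actual nfs-utils code;
the real reply format is whatever the kernel's gss upcall protocol defines
(the genuine downcall code lives in gssd's sources):

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>

/* Stand-in for gssd's real error downcall.  In the real daemon this would
 * write an error reply down the rpc_pipefs pipe in the format the kernel
 * expects; here it just logs what would happen. */
static void error_downcall_stub(int pipe_fd, uid_t uid, int err)
{
        fprintf(stderr, "complete upcall on fd %d for uid %d with error %d\n",
                pipe_fd, (int)uid, err);
}

/* Errors that mean "the network/server is gone" rather than "bad ticket". */
static bool network_is_gone(int err)
{
        return err == ENETUNREACH || err == ENETDOWN ||
               err == EHOSTUNREACH || err == ECONNREFUSED ||
               err == ETIMEDOUT;
}

/* The shape of the suggested fix: no matter why context establishment
 * failed, always answer the kernel so the waiting RPC task can fail,
 * instead of bailing out and leaving it stuck forever. */
static void finish_failed_upcall(int pipe_fd, uid_t uid, int err)
{
        if (network_is_gone(err))
                fprintf(stderr, "uid %d: server unreachable (errno %d)\n",
                        (int)uid, err);
        error_downcall_stub(pipe_fd, uid, err);
}

int main(void)
{
        /* Pretend the connect() to the server failed with ENETUNREACH. */
        finish_failed_upcall(3 /* fake pipe fd */, 1000, ENETUNREACH);
        return 0;
}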

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2012-12-19 22:19:56

by Myklebust, Trond

Subject: Re: umount(,MNT_DETACH) for nfsv4 hangs when using sec=krb5 and network is down

On Wed, 2012-12-19 at 14:50 -0700, Orion Poplawski wrote:
> On 12/19/2012 02:08 PM, Myklebust, Trond wrote:
> > On Wed, 2012-12-19 at 20:47 +0000, Orion Poplawski wrote:
> >>
> >> However, we need some way to be able to drop mounts after the network connection
> >> has been removed. This behavior is causing severe problems for our laptop and
> >> VPN users.
> >>
> >> Tested with:
> >>
> >> 3.6.11-3.fc18
> >> nfs-utils-1.2.7-2.fc18
> >>
> >> I've also filed https://bugzilla.redhat.com/show_bug.cgi?id=888942
> >
> > No. What you need is a way to unmount _before_ you kill the network.
> > Once the network is gone, you are in severe data loss territory, and you
> > are entirely on your own dealing with that problem...
> >
> > Maybe one day we will get round to supporting offline mounts, but that's
> > not the case today.
> >
>
> I agree (see https://bugzilla.gnome.org/show_bug.cgi?id=387832 for example).
> However, it does happen (and, given the lack of support indicated by that bug,
> it is hard to prevent), and it seems unfortunate to then subject the user to
> hanging mounts, which will effectively lock up the desktop. Unmounting in this
> situation currently works for sec=sys mounts, so I thought it would be
> worthwhile making it work for sec=krb5 mounts as well. The same data loss
> issues are present for both.

Any application which is already hanging on a file in that filesystem
will continue to hang across a 'umount -l'. The only thing you are doing
is preventing future attempts to access the filesystem.

As I said above, this whole thing really needs to be handled as part of
the suspend scripts and/or NetworkManager...

> We have put work in the past into making umount work for offline nfs mounts
> (https://bugzilla.redhat.com/show_bug.cgi?id=820707). In fact that looks
> remarkably familiar :).
>
> [ 131.832005] umount.nfs4 D f1585bc8 0 1959 1958 0x00000080
> [ 131.832005] f1585c34 00000086 0000ea8a f1585bc8 c045a297 f705f110 644b6440
> 0000001c
> [ 131.832005] f1585bd8 c0cd5080 c0cd5080 00000282 f1585c00 f7591080 f3a27110
> f1585c24
> [ 131.832005] 00000000 c0d2e280 00000282 00000246 f1585c00 c097a273 f1585c2c
> f7ee11c5
> [ 131.832005] Call Trace:
> [ 131.832005] [<c045a297>] ? __internal_add_timer+0x77/0xc0
> [ 131.832005] [<c097a273>] ? _raw_spin_unlock_bh+0x13/0x20
> [ 131.832005] [<f7ee11c5>] ? rpc_wake_up_first+0x65/0x180 [sunrpc]
> [ 131.832005] [<f7eda240>] ? rpc_show_tasks+0x1b0/0x1b0 [sunrpc]
> [ 131.832005] [<c09794d3>] schedule+0x23/0x60
> [ 131.832005] [<f7ee064d>] rpc_wait_bit_killable+0x2d/0x70 [sunrpc]
> [ 131.832005] [<c0977fc1>] __wait_on_bit+0x51/0x70
> [ 131.832005] [<f7ee0620>] ? __rpc_wait_for_completion_task+0x30/0x30 [sunrpc]
> [ 131.832005] [<f7ee0620>] ? __rpc_wait_for_completion_task+0x30/0x30 [sunrpc]
> [ 131.832005] [<c0978041>] out_of_line_wait_on_bit+0x61/0x70
> [ 131.832005] [<c046c100>] ? autoremove_wake_function+0x50/0x50
> [ 131.832005] [<f7ee198f>] __rpc_execute+0x11f/0x340 [sunrpc]
> [ 131.832005] [<c0507774>] ? mempool_alloc+0x44/0x120
> [ 131.832005] [<f7ed8a50>] ? call_connect+0x90/0x90 [sunrpc]
> [ 131.832005] [<f7ed8a50>] ? call_connect+0x90/0x90 [sunrpc]
> [ 131.832005] [<c046c0a3>] ? wake_up_bit+0x23/0x30
> [ 131.832005] [<f7ee1ec8>] rpc_execute+0x48/0x80 [sunrpc]
> [ 131.832005] [<f7ed9929>] rpc_run_task+0x59/0x70 [sunrpc]
> [ 131.832005] [<f7ed9a3c>] rpc_call_sync+0x3c/0x60 [sunrpc]
> [ 131.832005] [<f8a402fc>] _nfs4_call_sync+0x3c/0x50 [nfsv4]
> [ 131.832005] [<f8a403d5>] _nfs4_proc_getattr+0x95/0xa0 [nfsv4]
> [ 131.832005] [<f8a41bab>] nfs4_proc_getattr+0x3b/0x60 [nfsv4]
> [ 131.832005] [<f897f891>] __nfs_revalidate_inode+0x81/0x210 [nfs]
> [ 131.832005] [<f897fbd2>] nfs_revalidate_inode+0x62/0x90 [nfs]
> [ 131.832005] [<f89793ef>] nfs_check_verifier+0x4f/0x80 [nfs]
> [ 131.832005] [<f897b4da>] nfs_lookup_revalidate+0x2ba/0x440 [nfs]
> [ 131.832005] [<c055f8cb>] ? follow_managed+0x19b/0x200
> [ 131.832005] [<c0560000>] ? unlazy_walk+0xf0/0x1a0
> [ 131.832005] [<f897c184>] nfs4_lookup_revalidate+0x34/0xe0 [nfs]
> [ 131.832005] [<c055fedc>] complete_walk+0x8c/0xc0
> [ 131.832005] [<c05611b3>] path_lookupat+0x63/0x650
> [ 131.832005] [<c05617ca>] do_path_lookup+0x2a/0xb0
> [ 131.832005] [<c0563df6>] user_path_at_empty+0x46/0x80
> [ 131.832005] [<c097d440>] ? vmalloc_fault+0x176/0x176
> [ 131.832005] [<c097d5f7>] ? do_page_fault+0x1b7/0x450
> [ 131.832005] [<c0563e4f>] user_path_at+0x1f/0x30
> [ 131.832005] [<c05707b1>] sys_umount+0x41/0x340
> [ 131.832005] [<c04bd59c>] ? __audit_syscall_entry+0xbc/0x290
> [ 131.832005] [<c04bdac6>] ? __audit_syscall_exit+0x356/0x3b0
> [ 131.832005] [<c0980fdf>] sysenter_do_call+0x12/0x28
>
> I wonder if it never did get fixed for krb5 mounts then...

Commit eb96d5c97b0825d542e9c4ba5e0a22b519355166 (SUNRPC handle
EKEYEXPIRED in call_refreshresult), which will be in 3.8-rc1 when Linus
releases it, may help.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2013-02-13 19:02:38

by Orion Poplawski

Subject: Re: umount(,MNT_DETACH) for nfsv4 hangs when using sec=krb5 and network is down

On 12/20/2012 03:01 PM, Myklebust, Trond wrote:
> On Thu, 2012-12-20 at 14:52 -0700, Orion Poplawski wrote:
>> On 12/20/2012 01:47 PM, Myklebust, Trond wrote:
>>> On Thu, 2012-12-20 at 13:31 -0700, Orion Poplawski wrote:
>>>> On 12/19/2012 03:19 PM, Myklebust, Trond wrote:
>>>>>
>>>>> Commit eb96d5c97b0825d542e9c4ba5e0a22b519355166 (SUNRPC handle
>>>>> EKEYEXPIRED in call_refreshresult), which will be in 3.8-rc1 when Linus
>>>>> releases it, may help.
>>>>>
>>>>
>>>> FWIW - I cherry-picked that into the latest Fedora rawhide kernel, but it had no
>>>> effect. Sounds like a nice patch though; the current hang-forever behavior
>>>> doesn't seem to trigger the needed "ah, need a new ticket" response.
>>>>
>>>
>>> So does simply killing the rpc.gssd process help?
>>>
>>
>> Yes, if automount is already stopped (these are automounted directories). If
>> automount is running, it still seems to hang. I think I'm going to need to
>> spend some time talking to Ian.
>>
>
> I'd suggest also taking a long hard look at rpc.gssd and making sure
> that it handles ENETUNREACH, ECONNREFUSED and friends correctly. I
> suspect right now it is just bailing out of the upcall instead of
> completing it by propagating the error reply to the kernel.
>

Actually, I take that back - I'm not sure it's directly involved, and killing
rpc.gssd doesn't seem to be helping me now. I attached to rpc.gssd with
strace, dropped the interface, and tried umount.nfs4 -l, but rpc.gssd is
still in poll and doesn't do anything. The kernel process trace shows:

[ 2788.807017] umount.nfs4 D ffff88007cc13d40 0 3001 3000 0x00000084
[ 2788.807017] ffff8800361319a8 0000000000000082 ffff880036131fd8
0000000000013d40
[ 2788.807017] ffff880036131fd8 0000000000013d40 ffff8800773add80
ffff8800773add80
[ 2788.807017] ffff88007cfe2cb8 0000000000000082 ffffffffa0009cc0
ffff880036131a20
[ 2788.807017] Call Trace:
[ 2788.807017] [<ffffffffa0009cc0>] ?
__rpc_wait_for_completion_task+0x30/0x30 [sunrpc]
[ 2788.807017] [<ffffffff81634b39>] schedule+0x29/0x70
[ 2788.807017] [<ffffffffa0009cf5>] rpc_wait_bit_killable+0x35/0x90 [sunrpc]
[ 2788.807017] [<ffffffff816335a0>] __wait_on_bit+0x60/0x90
[ 2788.807017] [<ffffffffa0001c50>] ? call_connect+0x90/0x90 [sunrpc]
[ 2788.807017] [<ffffffffa0009cc0>] ?
__rpc_wait_for_completion_task+0x30/0x30 [sunrpc]
[ 2788.807017] [<ffffffff81633707>] out_of_line_wait_on_bit+0x77/0x90
[ 2788.807017] [<ffffffff81080560>] ? autoremove_wake_function+0x40/0x40
[ 2788.807017] [<ffffffffa0001c50>] ? call_connect+0x90/0x90 [sunrpc]
[ 2788.807017] [<ffffffffa0001c50>] ? call_connect+0x90/0x90 [sunrpc]
[ 2788.807017] [<ffffffffa000ac7a>] __rpc_execute+0x13a/0x3f0 [sunrpc]
[ 2788.807017] [<ffffffffa000bd65>] rpc_execute+0x55/0x90 [sunrpc]
[ 2788.807017] [<ffffffffa0002e60>] rpc_run_task+0x70/0x90 [sunrpc]
[ 2788.807017] [<ffffffffa0002ec3>] rpc_call_sync+0x43/0xa0 [sunrpc]
[ 2788.807017] [<ffffffffa01f5653>] _nfs4_call_sync+0x13/0x20 [nfsv4]
[ 2788.807017] [<ffffffffa01f4e50>] _nfs4_proc_getattr+0xb0/0xc0 [nfsv4]
[ 2788.807017] [<ffffffffa01f9b9e>] nfs4_proc_getattr+0x4e/0x70 [nfsv4]
[ 2788.807017] [<ffffffffa01b67bc>] __nfs_revalidate_inode+0x8c/0x200 [nfs]
[ 2788.807017] [<ffffffffa01b69a3>] nfs_revalidate_inode+0x73/0xa0 [nfs]
[ 2788.807017] [<ffffffffa01afc60>] nfs_check_verifier+0x50/0x80 [nfs]
[ 2788.807017] [<ffffffffa01b255b>] nfs_lookup_revalidate+0x2fb/0x470 [nfs]
[ 2788.807017] [<ffffffffa01b2705>] nfs4_lookup_revalidate+0x35/0xe0 [nfs]
[ 2788.807017] [<ffffffff811a18fb>] complete_walk+0xbb/0x110
[ 2788.807017] [<ffffffff811a3310>] path_lookupat+0x70/0x7f0
[ 2788.807017] [<ffffffff811a216f>] ? getname_flags+0x4f/0x1a0
[ 2788.807017] [<ffffffff811a3abb>] filename_lookup+0x2b/0xc0
[ 2788.807017] [<ffffffff811a67c4>] user_path_at_empty+0x54/0x90
[ 2788.807017] [<ffffffff8117e4e6>] ? kmem_cache_free+0x46/0x1f0
[ 2788.807017] [<ffffffff8115d4e3>] ? remove_vma+0x63/0x70
[ 2788.807017] [<ffffffff811a6811>] user_path_at+0x11/0x20
[ 2788.807017] [<ffffffff811b57af>] sys_umount+0x3f/0x3a0
[ 2788.807017] [<ffffffff81639f7e>] ? do_page_fault+0xe/0x10
[ 2788.807017] [<ffffffff8163e419>] system_call_fastpath+0x16/0x1b

But every other process is just sitting in schedule.

The mount point gets "deleted":

# grep mnt /proc/mounts
earth:/export/home/orion /mnt\040(deleted) nfs4
rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=krb5,clientaddr=10.10.11.101,local_lock=none,addr=10.10.10.1
0 0

but that's it.

--
Orion Poplawski
Technical Manager 303-415-9701 x222
NWRA, Boulder Office FAX: 303-415-9702
3380 Mitchell Lane [email protected]
Boulder, CO 80301 http://www.nwra.com