2008-11-03 15:26:34

by Brian J. Murrell

[permalink] [raw]
Subject: stuck/hung nfsv4 mounts

I have a system which I migrated a couple of mounts from sys
authentication on nfs3 to gssapi on nfsv4 to overcome the 16
supplementary groups limitation inherent in the "sys" security model.
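
For reference, the limit itself is easy to see: any user reporting more than
16 groups in a quick check like the one below will have the extras silently
dropped under "sec=sys" (the username is just a placeholder):

id -G someuser | wc -w    # rough count of the user's groups; AUTH_SYS carries at most 16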

It was working for a while but now I have "stuck" (i.e. hung) NFS
mounts. It seems that a mount issued from the automounter kicked all of
this off but I cannot be sure of that.

In my process table I have a number of processes blocked in the NFSv4
path including a mount command:

/bin/mount -t nfs4 -s -o sec=krb5i,rw,soft,intr,rsize=8192,wsize=8192 linux:/home/share /autohome/share

Its stack trace is:

Nov 3 09:54:14 pc kernel: [474304.152005] mount.nfs4 D c012ad55 0 25903 25902
Nov 3 09:54:14 pc kernel: [474304.152005] dcf1bb34 00000086 dcf1badc c012ad55 00001eff 026b0b7a 00000000 dcf1badc
Nov 3 09:54:14 pc kernel: [474304.152005] 00000086 f670d7f0 ec213ed0 003649ad 00000000 c012b4ee 00000000 ec2b5340
Nov 3 09:54:14 pc kernel: [474304.152005] c050f080 f3740000 dcf1a000 ec214148 c28d1d00 00000000 dcf1bb0c c28d1d00
Nov 3 09:54:14 pc kernel: [474304.152005] Call Trace:
Nov 3 09:54:14 pc kernel: [474304.152005] [<c012ad55>] ? check_preempt_wakeup+0x175/0x1b0
Nov 3 09:54:14 pc kernel: [474304.152005] [<c012b4ee>] ? try_to_wake_up+0xde/0x290
Nov 3 09:54:14 pc kernel: [474304.152005] [<f9574c5d>] rpc_wait_bit_killable+0x1d/0x50 [sunrpc]
Nov 3 09:54:14 pc kernel: [474304.152005] [<c037d2df>] __wait_on_bit+0x4f/0x70
Nov 3 09:54:14 pc kernel: [474304.152005] [<f9574c40>] ? rpc_wait_bit_killable+0x0/0x50 [sunrpc]
Nov 3 09:54:14 pc kernel: [474304.152005] [<c037d3ad>] out_of_line_wait_on_bit+0xad/0xc0
Nov 3 09:54:14 pc kernel: [474304.152005] [<f9574c40>] ? rpc_wait_bit_killable+0x0/0x50 [sunrpc]
Nov 3 09:54:14 pc kernel: [474304.152005] [<c0147500>] ? wake_bit_function+0x0/0x60
Nov 3 09:54:14 pc kernel: [474304.152005] [<f95756ae>] __rpc_execute+0xae/0x250 [sunrpc]
Nov 3 09:54:14 pc kernel: [474304.152005] [<f95758c9>] rpc_execute+0x79/0x90 [sunrpc]
Nov 3 09:54:14 pc kernel: [474304.152005] [<f956e743>] rpc_run_task+0x33/0x70 [sunrpc]
Nov 3 09:54:14 pc kernel: [474304.152005] [<f956e8a0>] rpc_call_sync+0x40/0x70 [sunrpc]
Nov 3 09:54:14 pc kernel: [474304.152005] [<f9cd0401>] nfs4_proc_get_root+0x71/0xc0 [nfs]
Nov 3 09:54:14 pc kernel: [474304.152005] [<f9cba216>] nfs4_path_walk+0x86/0x340 [nfs]
Nov 3 09:54:14 pc kernel: [474304.152005] [<c01cb0d2>] ? mntput_no_expire+0x22/0x120
Nov 3 09:54:14 pc kernel: [474304.152005] [<c01d1ab4>] ? simple_release_fs+0x54/0x70
Nov 3 09:54:14 pc kernel: [474304.152005] [<f95818bb>] ? rpc_mkdir+0x9b/0x110 [sunrpc]
Nov 3 09:54:14 pc kernel: [474304.152005] [<c037e61d>] ? _spin_lock+0xd/0x10
Nov 3 09:54:14 pc kernel: [474304.152005] [<c01d1b14>] ? simple_pin_fs+0x44/0xb0
Nov 3 09:54:14 pc kernel: [474304.152005] [<f956ec30>] ? rpc_setup_pipedir+0x90/0xd0 [sunrpc]
Nov 3 09:54:14 pc kernel: [474304.152005] [<f9574528>] ? rpciod_up+0x8/0x30 [sunrpc]
Nov 3 09:54:14 pc kernel: [474304.152005] [<f956ed46>] ? rpc_clone_client+0xd6/0x130 [sunrpc]
Nov 3 09:54:14 pc kernel: [474304.152005] [<f9cb373b>] ? nfs_init_server_rpcclient+0x2b/0xf0 [nfs]
Nov 3 09:54:14 pc kernel: [474304.152005] [<f9cb4c0a>] ? nfs4_init_server+0xea/0x220 [nfs]
Nov 3 09:54:14 pc kernel: [474304.152005] [<f9cb4dbc>] nfs4_create_server+0x7c/0x240 [nfs]
Nov 3 09:54:14 pc kernel: [474304.152005] [<c01acc24>] ? kfree+0x14/0xd0
Nov 3 09:54:14 pc kernel: [474304.152005] [<c01ad388>] ? get_slab+0x8/0x60
Nov 3 09:54:14 pc kernel: [474304.152005] [<f9cbdcb3>] ? nfs_parse_simple_hostname+0x83/0x160 [nfs]
Nov 3 09:54:14 pc kernel: [474304.152005] [<c0191496>] ? kstrndup+0x36/0x60
Nov 3 09:54:14 pc kernel: [474304.152005] [<f9cbdcb3>] ? nfs_parse_simple_hostname+0x83/0x160 [nfs]
Nov 3 09:54:14 pc kernel: [474304.152005] [<f9cbddb3>] ? nfs_parse_devname+0x23/0x50 [nfs]
Nov 3 09:54:14 pc kernel: [474304.152005] [<f9cbfe7e>] ? nfs4_validate_mount_data+0x31e/0x370 [nfs]
Nov 3 09:54:14 pc kernel: [474304.152005] [<f9cbff65>] nfs4_get_sb+0x95/0x280 [nfs]
Nov 3 09:54:14 pc kernel: [474304.152005] [<c01b43ee>] vfs_kern_mount+0x5e/0x130
Nov 3 09:54:14 pc kernel: [474304.152005] [<c01b451e>] do_kern_mount+0x3e/0xe0
Nov 3 09:54:14 pc kernel: [474304.152005] [<c01cc9ef>] do_new_mount+0x6f/0x90
Nov 3 09:54:14 pc kernel: [474304.152005] [<c01ccf32>] do_mount+0x1d2/0x1f0
Nov 3 09:54:14 pc kernel: [474304.152005] [<c01ca67d>] ? exact_copy_from_user+0x4d/0xa0
Nov 3 09:54:14 pc kernel: [474304.152005] [<c01cac8e>] ? copy_mount_options+0x6e/0xd0
Nov 3 09:54:14 pc kernel: [474304.152005] [<c01ccfe1>] sys_mount+0x91/0xc0
Nov 3 09:54:14 pc kernel: [474304.152005] [<c0103f7b>] sysenter_do_call+0x12/0x2f

There are a number of other processes stuck in the nfs4 paths as well:

Nov 3 09:54:13 pc kernel: [474304.152005] gnome-panel D cafc1ca4 0 7266 16025
Nov 3 09:54:13 pc kernel: [474304.152005] cafc1d08 00000086 f7070000 cafc1ca4 cafc1ca4 00000246 c2e6a204 f7070000
Nov 3 09:54:13 pc kernel: [474304.152005] f7070000 cafc1cc4 ec216480 000025ad 00000000 00000246 00000001 f63ec380
Nov 3 09:54:13 pc kernel: [474304.152005] c050f080 f0d30000 cafc0000 ec2166f8 c28dcd00 00000001 f9574ec0 c28dcd00
Nov 3 09:54:13 pc kernel: [474304.152005] Call Trace:
Nov 3 09:54:13 pc kernel: [474304.152005] [<f9574ec0>] ? __rpc_sleep_on+0x230/0x270 [sunrpc]
Nov 3 09:54:13 pc kernel: [474304.152005] [<f9574c5d>] rpc_wait_bit_killable+0x1d/0x50 [sunrpc]
Nov 3 09:54:13 pc kernel: [474304.152005] [<c037d2df>] __wait_on_bit+0x4f/0x70
Nov 3 09:54:13 pc kernel: [474304.152005] [<f9574c40>] ? rpc_wait_bit_killable+0x0/0x50 [sunrpc]
Nov 3 09:54:13 pc kernel: [474304.152005] [<c037d3ad>] out_of_line_wait_on_bit+0xad/0xc0
Nov 3 09:54:13 pc kernel: [474304.152005] [<f9574c40>] ? rpc_wait_bit_killable+0x0/0x50 [sunrpc]
Nov 3 09:54:13 pc kernel: [474304.152005] [<c0147500>] ? wake_bit_function+0x0/0x60
Nov 3 09:54:13 pc kernel: [474304.152005] [<f95756ae>] __rpc_execute+0xae/0x250 [sunrpc]
Nov 3 09:54:13 pc kernel: [474304.152005] [<f95758c9>] rpc_execute+0x79/0x90 [sunrpc]
Nov 3 09:54:13 pc kernel: [474304.152005] [<f956e743>] rpc_run_task+0x33/0x70 [sunrpc]
Nov 3 09:54:13 pc kernel: [474304.152005] [<f956e8a0>] rpc_call_sync+0x40/0x70 [sunrpc]
Nov 3 09:54:13 pc kernel: [474304.152005] [<f9cd0206>] nfs4_proc_getattr+0x76/0xa0 [nfs]
Nov 3 09:54:13 pc kernel: [474304.152005] [<f9cbbc86>] __nfs_revalidate_inode+0x156/0x2b0 [nfs]
Nov 3 09:54:13 pc kernel: [474304.152005] [<c014b474>] ? hrtimer_start+0xc4/0x1c0
Nov 3 09:54:13 pc kernel: [474304.152005] [<c01cb0d2>] ? mntput_no_expire+0x22/0x120
Nov 3 09:54:13 pc kernel: [474304.152005] [<c01babc5>] ? path_put+0x25/0x30
Nov 3 09:54:13 pc kernel: [474304.152005] [<c01bca62>] ? path_walk+0xa2/0xb0
Nov 3 09:54:13 pc kernel: [474304.152005] [<c01ba9fa>] ? putname+0x2a/0x40
Nov 3 09:54:13 pc kernel: [474304.152005] [<c01ba9fa>] ? putname+0x2a/0x40
Nov 3 09:54:13 pc kernel: [474304.152005] [<c01bd7e3>] ? user_path_at+0x53/0x80
Nov 3 09:54:13 pc kernel: [474304.152005] [<f9cbcb2d>] nfs_revalidate_inode+0x3d/0x70 [nfs]
Nov 3 09:54:13 pc kernel: [474304.152005] [<f9cbcc13>] nfs_getattr+0xb3/0x100 [nfs]
Nov 3 09:54:13 pc kernel: [474304.152005] [<f9cbcb60>] ? nfs_getattr+0x0/0x100 [nfs]
Nov 3 09:54:13 pc kernel: [474304.152005] [<c01b5bda>] vfs_getattr+0x4a/0x70
Nov 3 09:54:13 pc kernel: [474304.152005] [<c01b5e39>] vfs_stat_fd+0x39/0x60
Nov 3 09:54:13 pc kernel: [474304.152005] [<c01b5f66>] vfs_stat+0x16/0x20
Nov 3 09:54:13 pc kernel: [474304.152005] [<c01b5f89>] sys_stat64+0x19/0x30
Nov 3 09:54:13 pc kernel: [474304.152005] [<c0103f7b>] sysenter_do_call+0x12/0x2f

Nov 3 09:54:14 pc kernel: [474304.152005] lsof D ece08000 0 18247 18245
Nov 3 09:54:14 pc kernel: [474304.152005] c0e0dd08 00200086 f0953a74 ece08000 c0e0dca0 f95c629b f0953a74 00000000
Nov 3 09:54:14 pc kernel: [474304.152005] c0e0dce0 f95c6a10 c2e657f0 00007541 00000000 e5f76cc0 00000000 ee177c00
Nov 3 09:54:14 pc kernel: [474304.152005] c050f080 ee066000 c0e0c000 c2e65a68 c28d1d00 00000000 c0e0dce0 c28d1d00
Nov 3 09:54:14 pc kernel: [474304.152005] Call Trace:
Nov 3 09:54:14 pc kernel: [474304.152005] [<f95c629b>] ? gss_release_msg+0x4b/0x90 [auth_rpcgss]
Nov 3 09:54:14 pc kernel: [474304.152005] [<f95c6a10>] ? gss_cred_init+0xe0/0x140 [auth_rpcgss]
Nov 3 09:54:14 pc kernel: [474304.152005] [<f9574c5d>] rpc_wait_bit_killable+0x1d/0x50 [sunrpc]
Nov 3 09:54:14 pc kernel: [474304.152005] [<c037d2df>] __wait_on_bit+0x4f/0x70
Nov 3 09:54:14 pc kernel: [474304.152005] [<f9574c40>] ? rpc_wait_bit_killable+0x0/0x50 [sunrpc]
Nov 3 09:54:14 pc kernel: [474304.152005] [<c037d3ad>] out_of_line_wait_on_bit+0xad/0xc0
Nov 3 09:54:14 pc kernel: [474304.152005] [<f9574c40>] ? rpc_wait_bit_killable+0x0/0x50 [sunrpc]
Nov 3 09:54:14 pc kernel: [474304.152005] [<c0147500>] ? wake_bit_function+0x0/0x60
Nov 3 09:54:14 pc kernel: [474304.152005] [<f95756ae>] __rpc_execute+0xae/0x250 [sunrpc]
Nov 3 09:54:14 pc kernel: [474304.152005] [<f95758c9>] rpc_execute+0x79/0x90 [sunrpc]
Nov 3 09:54:14 pc kernel: [474304.152005] [<f956e743>] rpc_run_task+0x33/0x70 [sunrpc]
Nov 3 09:54:14 pc kernel: [474304.152005] [<f956e8a0>] rpc_call_sync+0x40/0x70 [sunrpc]
Nov 3 09:54:14 pc kernel: [474304.152005] [<f9cd0206>] nfs4_proc_getattr+0x76/0xa0 [nfs]
Nov 3 09:54:14 pc kernel: [474304.152005] [<f9cbbc86>] __nfs_revalidate_inode+0x156/0x2b0 [nfs]
Nov 3 09:54:14 pc kernel: [474304.152005] [<c01bac46>] ? path_to_nameidata+0x36/0x50
Nov 3 09:54:14 pc kernel: [474304.152005] [<c01cb0d2>] ? mntput_no_expire+0x22/0x120
Nov 3 09:54:14 pc kernel: [474304.152005] [<c01babc5>] ? path_put+0x25/0x30
Nov 3 09:54:14 pc kernel: [474304.152005] [<c01bca62>] ? path_walk+0xa2/0xb0
Nov 3 09:54:14 pc kernel: [474304.152005] [<c01ba9fa>] ? putname+0x2a/0x40
Nov 3 09:54:14 pc kernel: [474304.152005] [<c01ba9fa>] ? putname+0x2a/0x40
Nov 3 09:54:14 pc kernel: [474304.152005] [<c01bd7e3>] ? user_path_at+0x53/0x80
Nov 3 09:54:14 pc kernel: [474304.152005] [<f9cbcb2d>] nfs_revalidate_inode+0x3d/0x70 [nfs]
Nov 3 09:54:14 pc kernel: [474304.152005] [<f9cbcc13>] nfs_getattr+0xb3/0x100 [nfs]
Nov 3 09:54:14 pc kernel: [474304.152005] [<f9cbcb60>] ? nfs_getattr+0x0/0x100 [nfs]
Nov 3 09:54:14 pc kernel: [474304.152005] [<c01b5bda>] vfs_getattr+0x4a/0x70
Nov 3 09:54:14 pc kernel: [474304.152005] [<c01b5e39>] vfs_stat_fd+0x39/0x60
Nov 3 09:54:14 pc kernel: [474304.152005] [<c01b5f66>] vfs_stat+0x16/0x20
Nov 3 09:54:14 pc kernel: [474304.152005] [<c01b5f89>] sys_stat64+0x19/0x30
Nov 3 09:54:14 pc kernel: [474304.152005] [<c01b26db>] ? vfs_read+0xab/0x110
Nov 3 09:54:14 pc kernel: [474304.152005] [<c01b1e40>] ? do_sync_read+0x0/0x120
Nov 3 09:54:14 pc kernel: [474304.152005] [<c01b2812>] ? sys_read+0x42/0x70
Nov 3 09:54:14 pc kernel: [474304.152005] [<c0103f7b>] sysenter_do_call+0x12/0x2f

I have used Wireshark, filtering for NFSv4 with "rpc.programversion == 4"
(since I still have lots of NFSv3 traffic on the wire), and noticed that
only a couple of packets (one each after the TCP setup and teardown) are
sent between the client and server. There were many tens of minutes between
these last night when I was debugging. For as long as I have been looking
this morning there has not been a single NFSv4 packet.
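
(The equivalent command-line capture would be roughly the following; the
interface name is an example, and newer tshark versions take -Y rather than
-R for the display filter:)

tshark -i eth0 -f "tcp port 2049" -R "rpc.programversion == 4"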

The client machine is running Ubuntu's 2.6.27-7-generic kernel and the
server is running Ubuntu's 2.6.24-16-generic kernel.

I've tried restarting NFS on the server a number of times and that has
not helped.
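
If more verbosity would help, I can also turn up client-side RPC debugging
along these lines and reproduce the hang in between:

rpcdebug -m rpc -s all    # debug output goes to the kernel log
rpcdebug -m nfs -s all
# ... reproduce ...
rpcdebug -m rpc -c all
rpcdebug -m nfs -c all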

Any ideas?

b.




2008-11-03 22:20:51

by Peter Staubach

[permalink] [raw]
Subject: Re: stuck/hung nfsv4 mounts

Chuck Lever wrote:
> On Nov 3, 2008, at 12:37 PM, Trond Myklebust wrote:
>> On Mon, 2008-11-03 at 12:25 -0500, Jim Rees wrote:
>>> Trond Myklebust wrote:
>>>
>>> BTW: NFSv4 + soft == BAD BAD BAD!
>>>
>>> Maybe this combination should be prohibited. Does it make any sense
>>> given
>>> the stateful nature of v4?
>>
>> It might make sense if we were to fix up the granularity of the recovery
>> routines so that we are able to recover all the state associated with
>> just a single open owner or lock owner. Currently we'd have to recover
>> all the state associated with that server.
>>
>> IOW: we might be able to fix things up in the future, but right now,
>> NFSv4+soft is not a good idea.
>
> Two cents worth: Until NFSv4+soft works reasonably well, I wouldn't
> have any problem with making "soft" a no-op for nfs4 mounts.

It may be the right thing to do in the short term, but it will
generate calls to support organizations. Anyone trying it will
wonder why it doesn't work "as it used to". That sort of thing.
When that happens, we will be forced to do something, and not at
a time of our own choosing.

Perhaps we should deny such mount attempts, because then at least it
would be obvious what was happening? Perhaps we should also
deny NFSv4 mounting over UDP?

ps

2008-11-03 22:33:11

by Kevin Coffman

[permalink] [raw]
Subject: Re: stuck/hung nfsv4 mounts

On Mon, Nov 3, 2008 at 4:12 PM, Brian J. Murrell <[email protected]> wrote:
> On Mon, 2008-11-03 at 15:28 -0500, Kevin Coffman wrote:
>>
>> There should definitely be corresponding debug output from rpc.gssd
>> (-vvv) if you got those librpcsecgss (-rrr) messages.
>
> Yes, but as of my last post I had debugging enabled only on the client and
> not the server, and once I enabled it on the server I couldn't trigger the
> client-side output at will.
>
>> Re: the upcall. Any new mount using Kerberos should cause the upcall
>> to create a new context with the server. Make sure the filesystem is
>> not already mounted.
>
> It was definitely not mounted the last time I tried to mount it and
> produce the upcall output but it just never happened.
>
>> After the mount, any new user accessing that
>> mounted filesystem should cause an upcall to create a context for that
>> user.
>
> I seem to have since gotten an upcall and here is the corresponding
> client and server debug:
>
> Nov 3 15:17:33 pc rpc.gssd[2937]: in authgss_create_default()
> Nov 3 15:17:33 pc rpc.gssd[2937]: in authgss_create()
> Nov 3 15:17:33 pc rpc.gssd[2937]: authgss_create: name is 0x98fd358
> Nov 3 15:17:33 pc rpc.gssd[2937]: authgss_create: gd->name is 0x98f9cc8
> Nov 3 15:17:33 pc rpc.gssd[2937]: in authgss_refresh()
> Nov 3 15:17:33 pc rpc.gssd[2937]: struct rpc_gss_sec:
> Nov 3 15:17:33 pc rpc.gssd[2937]: mechanism_OID: { 1 2 134 72 134 247 18 1 2 2 }
> Nov 3 15:17:33 pc rpc.gssd[2937]: qop: 0
> Nov 3 15:17:33 pc rpc.gssd[2937]: service: 1
> Nov 3 15:17:33 pc rpc.gssd[2937]: cred: 0x98fb568
> Nov 3 15:17:33 pc rpc.gssd[2937]: req_flags: 00000002
> Nov 3 15:17:33 pc rpc.gssd[2937]: in authgss_marshal()
> Nov 3 15:17:33 pc rpc.gssd[2937]: xdr_rpc_gss_buf: encode success ((nil):0)
> Nov 3 15:17:33 pc rpc.gssd[2937]: xdr_rpc_gss_cred: encode success (v 1, proc 1, seq 0, svc 1, ctx (nil):0)
> Nov 3 15:17:33 pc rpc.gssd[2937]: in authgss_wrap()
> Nov 3 15:17:33 pc rpc.gssd[2937]: xdr_rpc_gss_buf: encode success (0x98fbc08:459)
> Nov 3 15:17:33 pc rpc.gssd[2937]: xdr_rpc_gss_init_args: encode success (token 0x98fbc08:459)
> Nov 3 15:17:33 pc rpc.gssd[2937]: in authgss_validate()
> Nov 3 15:17:33 pc rpc.gssd[2937]: in authgss_unwrap()
> Nov 3 15:17:33 pc rpc.gssd[2937]: xdr_rpc_gss_buf: decode success (0x98fd428:4)
> Nov 3 15:17:33 pc rpc.gssd[2937]: xdr_rpc_gss_buf: decode success (0x990e428:114)
> Nov 3 15:17:33 pc rpc.gssd[2937]: xdr_rpc_gss_init_res decode success (ctx 0x98fd428:4, maj 0, min 0, win 128, token 0x990e428:114)
> Nov 3 15:17:33 pc rpc.gssd[2937]: authgss_create_default: freeing name 0x98fd358
> Nov 3 15:17:33 pc rpc.gssd[2937]: in authgss_get_private_data()
> Nov 3 15:17:33 pc rpc.gssd[2937]: in authgss_free_private_data()
> Nov 3 15:17:33 pc rpc.gssd[2937]: in authgss_destroy()
> Nov 3 15:17:33 pc rpc.gssd[2937]: in authgss_destroy_context()
> Nov 3 15:17:33 pc rpc.gssd[2937]: authgss_destroy: freeing name 0x98f9cc8

Once again, this is all from librpcsecgss. No messages originating
from rpc.gssd itself here.

> Nov 3 15:17:33 linux krb5kdc[5006]: TGS_REQ (1 etypes {1}) 10.75.22.1: ISSUE: authtime 1225740577, etypes {rep=1 tkt=1 ses=1}, brian@ILINX for nfs/linux.interlinx.bc.ca@ILINX
> Nov 3 15:17:33 linux rpc.svcgssd[13724]: leaving poll
> Nov 3 15:17:33 linux rpc.svcgssd[13724]: handling null request
> Nov 3 15:17:33 linux rpc.svcgssd[13724]: sname = brian@ILINX
> Nov 3 15:17:33 linux rpc.svcgssd[13724]: DEBUG: serialize_krb5_ctx: lucid version!
> Nov 3 15:17:33 linux rpc.svcgssd[13724]: prepare_krb5_rfc1964_buffer: serializing keys with enctype 4 and length 8
> Nov 3 15:17:33 linux rpc.svcgssd[13724]: doing downcall
> Nov 3 15:17:33 linux rpc.svcgssd[13724]: mech: krb5, hndl len: 4, ctx len 85, timeout: 2147483647, uid: 1001, gid: 1001, num aux grps: 13:
> Nov 3 15:17:33 linux rpc.svcgssd[13724]: ( 1) 1001
> Nov 3 15:17:33 linux rpc.svcgssd[13724]: ( 2) 117
> Nov 3 15:17:33 linux rpc.svcgssd[13724]: ( 3) 1010
> Nov 3 15:17:33 linux rpc.svcgssd[13724]: ( 4) 1011
> Nov 3 15:17:33 linux rpc.svcgssd[13724]: ( 5) 2000
> Nov 3 15:17:33 linux rpc.svcgssd[13724]: ( 6) 29
> Nov 3 15:17:33 linux rpc.svcgssd[13724]: ( 7) 20
> Nov 3 15:17:33 linux rpc.svcgssd[13724]: ( 8) 24
> Nov 3 15:17:33 linux rpc.svcgssd[13724]: ( 9) 25
> Nov 3 15:17:33 linux rpc.svcgssd[13724]: ( 10) 30
> Nov 3 15:17:33 linux rpc.svcgssd[13724]: ( 11) 44
> Nov 3 15:17:33 linux rpc.svcgssd[13724]: ( 12) 46
> Nov 3 15:17:33 linux rpc.svcgssd[13724]: ( 13) 111
> Nov 3 15:17:33 linux rpc.svcgssd[13724]: sending null reply
> Nov 3 15:17:33 linux rpc.svcgssd[13724]: writing message: ...
> Nov 3 15:17:33 linux rpc.svcgssd[13724]: finished handling null request
> Nov 3 15:17:33 linux rpc.svcgssd[13724]: entering poll

This looks like the server (the svcgssd/user-land part anyway) was
happy, and replied with a context. There were a couple of reports in
the past where the server returned error 524288 (GSS_S_NO_CONTEXT)
after getting through the user-land part successfully. If you can get
a packet trace between the client and server during the mount /
context negotiation, check the NULL reply from the server for this
error code.
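
Something along these lines on the client, started before retrying the
mount, should capture the exchange (the interface name is just an example;
2049 is the usual NFS port):

tcpdump -i eth0 -s 0 -w nfs4-mount.pcap host linux.interlinx.bc.ca and tcp port 2049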

K.C.

2008-11-03 22:47:45

by Chuck Lever

[permalink] [raw]
Subject: Re: stuck/hung nfsv4 mounts

On Nov 3, 2008, at 5:20 PM, Peter Staubach wrote:
> Chuck Lever wrote:
>> On Nov 3, 2008, at 12:37 PM, Trond Myklebust wrote:
>>> On Mon, 2008-11-03 at 12:25 -0500, Jim Rees wrote:
>>>> Trond Myklebust wrote:
>>>>
>>>> BTW: NFSv4 + soft == BAD BAD BAD!
>>>>
>>>> Maybe this combination should be prohibited. Does it make any
>>>> sense given
>>>> the stateful nature of v4?
>>>
>>> It might make sense if we were to fix up the granularity of the
>>> recovery
>>> routines so that we are able to recover all the state associated
>>> with
>>> just a single open owner or lock owner. Currently we'd have to
>>> recover
>>> all the state associated with that server.
>>>
>>> IOW: we might be able to fix things up in the future, but right now,
>>> NFSv4+soft is not a good idea.
>>
>> Two cents worth: Until NFSv4+soft works reasonably well, I
>> wouldn't have any problem with making "soft" a no-op for nfs4 mounts.
>
> It may be the right thing to do in the short term, but it will
> generate calls to support organizations.

As opposed to the support "call" we got today? ;-)

> Anyone trying it will
> wonder why it doesn't work "as it used to". That sort of thing.
> When that happens, we will be forced to do something, and not at
> a time of our own choosing.

I think our petard is hoist either way. Better to avoid data
corruption, by whatever means.

> Perhaps we should deny such mount attempts, because then at least it
> would be obvious what was happening? Perhaps we should also
> deny NFSv4 mounting over UDP?

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

2008-11-04 16:04:11

by Peter Staubach

[permalink] [raw]
Subject: Re: stuck/hung nfsv4 mounts

Chuck Lever wrote:
> On Nov 3, 2008, at 5:20 PM, Peter Staubach wrote:
>> Chuck Lever wrote:
>>> On Nov 3, 2008, at 12:37 PM, Trond Myklebust wrote:
>>>> On Mon, 2008-11-03 at 12:25 -0500, Jim Rees wrote:
>>>>> Trond Myklebust wrote:
>>>>>
>>>>> BTW: NFSv4 + soft == BAD BAD BAD!
>>>>>
>>>>> Maybe this combination should be prohibited. Does it make any
>>>>> sense given
>>>>> the stateful nature of v4?
>>>>
>>>> It might make sense if we were to fix up the granularity of the
>>>> recovery
>>>> routines so that we are able to recover all the state associated with
>>>> just a single open owner or lock owner. Currently we'd have to recover
>>>> all the state associated with that server.
>>>>
>>>> IOW: we might be able to fix things up in the future, but right now,
>>>> NFSv4+soft is not a good idea.
>>>
>>> Two cents worth: Until NFSv4+soft works reasonably well, I wouldn't
>>> have any problem with making "soft" a no-op for nfs4 mounts.
>>
>> It may be the right thing to do in the short term, but it will
>> generate calls to support organizations.
>
> As opposed to the support "call" we got today? ;-)
>

Well, at least, you have an answer today.

>> Anyone trying it will
>> wonder why it doesn't work "as it used to". That sort of thing.
>> When that happens, we will be forced to do something, and not at
>> a time of our own choosing.
>
> I think our petard is hoist either way. Better to avoid data
> corruption, by whatever means.
>

Then it is probably time to fix it, sooner as opposed to later.

I don't have the resources at this time to put into doing this work,
so I am just expressing opinions.

I do know that it caused quite the flood of interest when it was
discovered that the superblock sharing stuff had broken a lot of
customers who expected to be able to mount the same file system from
a server, multiple times, in different places, with different
options, and it stopped using the options beyond the first mount.
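
(That is, the pattern was simply this sort of thing, with the second mount's
options quietly ignored after the change; names are illustrative:)

mount -t nfs -o ro server:/export /mnt/a
mount -t nfs -o rw,noatime server:/export /mnt/b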

I am thinking that perhaps we can take something from that experience
and do better this time around, since we know up front that we are
making an incompatible change.

ps

>> Perhaps we should deny such mount attempts, because then at least it
>> would be obvious what was happening? Perhaps we should also
>> deny NFSv4 mounting over UDP?
>


2008-11-03 16:59:10

by Trond Myklebust

[permalink] [raw]
Subject: Re: stuck/hung nfsv4 mounts

On Mon, 2008-11-03 at 10:05 -0500, Brian J. Murrell wrote:
> I have a system which I migrated a couple of mounts from sys
> authentication on nfs3 to gssapi on nfsv4 to overcome the 16
> supplementary groups limitation inherent in the "sys" security model.
>
> It was working for a while but now I have "stuck" (i.e. hung) NFS
> mounts. It seems that a mount issued from the automounter kicked all of
> this off but I cannot be sure of that.
>
> In my process table I have a number of processes blocked in the NFSv4
> path including a mount command:
>
> /bin/mount -t nfs4 -s -o sec=krb5i,rw,soft,intr,rsize=8192,wsize=8192 linux:/home/share /autohome/share

BTW: NFSv4 + soft == BAD BAD BAD! You will see weird behaviour, as the
client and the server will have divergent opinions on what happened
if/when an RPC call fails.

Otherwise, have you checked on the state of your rpc.gssd? It looked as
if several of those traces were waiting around RPCSEC_GSS upcalls...
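
(i.e. check that gssd is actually running and that rpc_pipefs is mounted
where it expects it, normally /var/lib/nfs/rpc_pipefs:)

ps ax | grep rpc.gssd
mount | grep rpc_pipefs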

Cheers
Trond


2008-11-03 17:25:38

by Jim Rees

[permalink] [raw]
Subject: Re: stuck/hung nfsv4 mounts

Trond Myklebust wrote:

BTW: NFSv4 + soft == BAD BAD BAD!

Maybe this combination should be prohibited. Does it make any sense given
the stateful nature of v4?

2008-11-03 17:37:21

by Trond Myklebust

[permalink] [raw]
Subject: Re: stuck/hung nfsv4 mounts

On Mon, 2008-11-03 at 12:25 -0500, Jim Rees wrote:
> Trond Myklebust wrote:
>
> BTW: NFSv4 + soft == BAD BAD BAD!
>
> Maybe this combination should be prohibited. Does it make any sense given
> the stateful nature of v4?

It might make sense if we were to fix up the granularity of the recovery
routines so that we are able to recover all the state associated with
just a single open owner or lock owner. Currently we'd have to recover
all the state associated with that server.

IOW: we might be able to fix things up in the future, but right now,
NFSv4+soft is not a good idea.

Trond


2008-11-03 17:38:39

by Benny Halevy

[permalink] [raw]
Subject: Re: stuck/hung nfsv4 mounts

On Nov. 03, 2008, 19:25 +0200, Jim Rees <[email protected]> wrote:
> Trond Myklebust wrote:
>
> BTW: NFSv4 + soft == BAD BAD BAD!
>
> Maybe this combination should be prohibited. Does it make any sense given
> the stateful nature of v4?

From the usage perspective it allows the application to fail rather
than stall. If it had been implemented similarly to "intr", returning an
error to the app as if the system call had received an interrupt while,
under the covers, the RPC is still processed to completion (or cleaned
up gracefully), I think it might still make sense for nfs4.
Until then this combination should at least be warned about.

Benny


2008-11-03 17:50:35

by Brian J. Murrell

[permalink] [raw]
Subject: Re: stuck/hung nfsv4 mounts

On Mon, 2008-11-03 at 11:59 -0500, Trond Myklebust wrote:
>
> BTW: NFSv4 + soft == BAD BAD BAD!

Ahh. OK.

> You will see weird behaviour, as the
> client and the server will have divergent opinions on what happened
> if/when an RPC call fails.

I have removed the soft option from the auto map and killed the
in-progress mount.nfs4. Seems the new one is not using soft any more:

root 22731 22730 0 12:06 ? 00:00:00 /sbin/mount.nfs4 linux:/home/share /autohome/share -s -o rw,sec=krb5i,intr,rsize=8192,wsize=8192
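
The auto map entry now reads roughly like this (indirect-map syntax quoted
from memory):

share  -fstype=nfs4,rw,sec=krb5i,intr,rsize=8192,wsize=8192  linux:/home/share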

> Otherwise, have you checked on the state of your rpc.gssd? It looked as
> if several of those traces were waiting around RPCSEC_GSS upcalls...

I thought it was working. I killed it and restarted it with -vvv -rrr
and this is what it said when automount re-issued the above mount:

Nov 3 12:02:15 pc rpc.gssd[21773]: rpcsec_gss: debug level is 3
Nov 3 12:06:00 pc rpc.gssd[21774]: in authgss_create_default()
Nov 3 12:06:00 pc rpc.gssd[21774]: in authgss_create()
Nov 3 12:06:00 pc rpc.gssd[21774]: authgss_create: name is 0x8ab8d88
Nov 3 12:06:00 pc rpc.gssd[21774]: authgss_create: gd->name is 0x8ab8ed0
Nov 3 12:06:00 pc rpc.gssd[21774]: in authgss_refresh()
Nov 3 12:06:00 pc rpc.gssd[21774]: struct rpc_gss_sec:
Nov 3 12:06:00 pc rpc.gssd[21774]: mechanism_OID: { 1 2 134 72 134 247 18 1 2 2 }
Nov 3 12:06:00 pc rpc.gssd[21774]: qop: 0
Nov 3 12:06:00 pc rpc.gssd[21774]: service: 1
Nov 3 12:06:00 pc rpc.gssd[21774]: cred: 0x8abb830
Nov 3 12:06:00 pc rpc.gssd[21774]: req_flags: 00000002
Nov 3 12:06:00 pc rpc.gssd[21774]: in authgss_marshal()
Nov 3 12:06:00 pc rpc.gssd[21774]: xdr_rpc_gss_buf: encode success ((nil):0)
Nov 3 12:06:00 pc rpc.gssd[21774]: xdr_rpc_gss_cred: encode success (v 1, proc 1, seq 0, svc 1, ctx (nil):0)
Nov 3 12:06:00 pc rpc.gssd[21774]: in authgss_wrap()
Nov 3 12:06:00 pc rpc.gssd[21774]: xdr_rpc_gss_buf: encode success (0x8abbeb8:483)
Nov 3 12:06:00 pc rpc.gssd[21774]: xdr_rpc_gss_init_args: encode success (token 0x8abbeb8:483)
Nov 3 12:06:00 pc rpc.gssd[21774]: in authgss_validate()
Nov 3 12:06:00 pc rpc.gssd[21774]: in authgss_unwrap()
Nov 3 12:06:00 pc rpc.gssd[21774]: xdr_rpc_gss_buf: decode success (0x8abbe98:4)
Nov 3 12:06:00 pc rpc.gssd[21774]: xdr_rpc_gss_buf: decode success (0x8abc110:114)
Nov 3 12:06:00 pc rpc.gssd[21774]: xdr_rpc_gss_init_res decode success (ctx 0x8abbe98:4, maj 0, min 0, win 128, token 0x8abc110:114)
Nov 3 12:06:00 pc rpc.gssd[21774]: authgss_create_default: freeing name 0x8ab8d88
Nov 3 12:06:00 pc rpc.gssd[21774]: in authgss_get_private_data()
Nov 3 12:06:00 pc rpc.gssd[21774]: in authgss_free_private_data()
Nov 3 12:06:00 pc rpc.gssd[21774]: in authgss_destroy()
Nov 3 12:06:00 pc rpc.gssd[21774]: in authgss_destroy_context()
Nov 3 12:06:00 pc rpc.gssd[21774]: authgss_destroy: freeing name 0x8ab8ed0

The kdc logged:

Nov 3 12:06:00 linux krb5kdc[5006]: AS_REQ (1 etypes {1}) 10.75.22.1: ISSUE: authtime 1225731960, etypes {rep=1 tkt=16 ses=1}, nfs/pc.interlinx.bc.ca@ILINX for krbtgt/ILINX@ILINX
Nov 3 12:06:00 linux krb5kdc[5006]: TGS_REQ (1 etypes {1}) 10.75.22.1: ISSUE: authtime 1225731960, etypes {rep=1 tkt=1 ses=1}, nfs/pc.interlinx.bc.ca@ILINX for nfs/linux.interlinx.bc.ca@ILINX

in correlation to the new mount request, but the mount.nfs4 didn't
complete.

b.



2008-11-03 19:58:58

by Kevin Coffman

[permalink] [raw]
Subject: Re: stuck/hung nfsv4 mounts

On Mon, Nov 3, 2008 at 12:50 PM, Brian J. Murrell <[email protected]> wrote:
> On Mon, 2008-11-03 at 11:59 -0500, Trond Myklebust wrote:
>
>> Otherwise, have you checked on the state of your rpc.gssd? It looked as
>> if several of those traces were waiting around RPCSEC_GSS upcalls...
>
> I thought it was working. I killed it and restarted it with -vvv -rrr
> and this is what it said when automount re-issued the above mount:
>
> Nov 3 12:02:15 pc rpc.gssd[21773]: rpcsec_gss: debug level is 3
> Nov 3 12:06:00 pc rpc.gssd[21774]: in authgss_create_default()
> Nov 3 12:06:00 pc rpc.gssd[21774]: in authgss_create()
> Nov 3 12:06:00 pc rpc.gssd[21774]: authgss_create: name is 0x8ab8d88
> Nov 3 12:06:00 pc rpc.gssd[21774]: authgss_create: gd->name is 0x8ab8ed0
> Nov 3 12:06:00 pc rpc.gssd[21774]: in authgss_refresh()
> Nov 3 12:06:00 pc rpc.gssd[21774]: struct rpc_gss_sec:
> Nov 3 12:06:00 pc rpc.gssd[21774]: mechanism_OID: { 1 2 134 72 134 247 18 1 2 2 }
> Nov 3 12:06:00 pc rpc.gssd[21774]: qop: 0
> Nov 3 12:06:00 pc rpc.gssd[21774]: service: 1
> Nov 3 12:06:00 pc rpc.gssd[21774]: cred: 0x8abb830
> Nov 3 12:06:00 pc rpc.gssd[21774]: req_flags: 00000002
> Nov 3 12:06:00 pc rpc.gssd[21774]: in authgss_marshal()
> Nov 3 12:06:00 pc rpc.gssd[21774]: xdr_rpc_gss_buf: encode success ((nil):0)
> Nov 3 12:06:00 pc rpc.gssd[21774]: xdr_rpc_gss_cred: encode success (v 1, proc 1, seq 0, svc 1, ctx (nil):0)
> Nov 3 12:06:00 pc rpc.gssd[21774]: in authgss_wrap()
> Nov 3 12:06:00 pc rpc.gssd[21774]: xdr_rpc_gss_buf: encode success (0x8abbeb8:483)
> Nov 3 12:06:00 pc rpc.gssd[21774]: xdr_rpc_gss_init_args: encode success (token 0x8abbeb8:483)
> Nov 3 12:06:00 pc rpc.gssd[21774]: in authgss_validate()
> Nov 3 12:06:00 pc rpc.gssd[21774]: in authgss_unwrap()
> Nov 3 12:06:00 pc rpc.gssd[21774]: xdr_rpc_gss_buf: decode success (0x8abbe98:4)
> Nov 3 12:06:00 pc rpc.gssd[21774]: xdr_rpc_gss_buf: decode success (0x8abc110:114)
> Nov 3 12:06:00 pc rpc.gssd[21774]: xdr_rpc_gss_init_res decode success (ctx 0x8abbe98:4, maj 0, min 0, win 128, token 0x8abc110:114)
> Nov 3 12:06:00 pc rpc.gssd[21774]: authgss_create_default: freeing name 0x8ab8d88
> Nov 3 12:06:00 pc rpc.gssd[21774]: in authgss_get_private_data()
> Nov 3 12:06:00 pc rpc.gssd[21774]: in authgss_free_private_data()
> Nov 3 12:06:00 pc rpc.gssd[21774]: in authgss_destroy()
> Nov 3 12:06:00 pc rpc.gssd[21774]: in authgss_destroy_context()
> Nov 3 12:06:00 pc rpc.gssd[21774]: authgss_destroy: freeing name 0x8ab8ed0

These are all from librpcsecgss (from the -rrr). I don't see an error here.

You should have gotten output from gssd itself as well. You might
also look for any errors on the server from rpc.svcgssd.
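
On the server, something like this should show whether svcgssd logged
anything (the log file name varies by distro; syslog or daemon.log on
Ubuntu):

grep rpc.svcgssd /var/log/syslog | tail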

> The kdc logged:
>
> Nov 3 12:06:00 linux krb5kdc[5006]: AS_REQ (1 etypes {1}) 10.75.22.1: ISSUE: authtime 1225731960, etypes {rep=1 tkt=16 ses=1}, nfs/pc.interlinx.bc.ca@ILINX for krbtgt/ILINX@ILINX
> Nov 3 12:06:00 linux krb5kdc[5006]: TGS_REQ (1 etypes {1}) 10.75.22.1: ISSUE: authtime 1225731960, etypes {rep=1 tkt=1 ses=1}, nfs/pc.interlinx.bc.ca@ILINX for nfs/linux.interlinx.bc.ca@ILINX

This looks fine. (Assuming linux.interlinx.bc.ca is the NFS server
and pc.interlinx.bc.ca is the client.)


> in correlation to the new mount request, but the mount.nfs4 didn't
> complete.
>

2008-11-03 20:09:21

by Brian J. Murrell

[permalink] [raw]
Subject: Re: stuck/hung nfsv4 mounts

On Mon, 2008-11-03 at 14:58 -0500, Kevin Coffman wrote:
>
> These are all from librpcsecgss (from the -rrr). I don't see an error here.

Good.

> You should have gotten output from gssd itself as well. You might
> also look for any errors on the server from rpc.svcgssd.

Nothing.

> This looks fine. (Assuming linux.interlinx.bc.ca is the NFS server
> and pc.interlinx.bc.ca is the client.)

Indeed, they both are as you assume.

So I've restarted both the rpc.svcgssd and rpc.gssd with -rrr -vvv on
the server and client (respectively) to see what more I can see but I
can't get the upcall to trigger. I would have thought simply issuing a
mount would do that, but it seems not. What exactly causes the upcall
to be made and what might prevent it from being made?

b.



2008-11-03 20:28:12

by Kevin Coffman

[permalink] [raw]
Subject: Re: stuck/hung nfsv4 mounts

On Mon, Nov 3, 2008 at 3:09 PM, Brian J. Murrell <[email protected]> wrote:
> On Mon, 2008-11-03 at 14:58 -0500, Kevin Coffman wrote:
>>
>> These are all from librpcsecgss (from the -rrr). I don't see an error here.
>
> Good.
>
>> You should have gotten output from gssd itself as well. You might
>> also look for any errors on the server from rpc.svcgssd.
>
> Nothing.
>
>> This looks fine. (Assuming linux.interlinx.bc.ca is the NFS server
>> and pc.interlinx.bc.ca is the client.)
>
> Indeed, they both are as you assume.
>
> So I've restarted both the rpc.svcgssd and rpc.gssd with -rrr -vvv on
> the server and client (respectively) to see what more I can see but I
> can't get the upcall to trigger. I would have thought simply issuing a
> mount would do that, but it seems not. What exactly causes the upcall
> to be made and what might prevent it from being made?

There should definitely be corresponding debug output from rpc.gssd
(-vvv) if you got those librpcsecgss (-rrr) messages. I would suggest
that you leave off the -rrr for now as it appears to be extra clutter
at this point. It might help if you run rpc.gssd in the foreground
(-f) while debugging. Also, make sure this is the only rpc.gssd
running.
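
In other words, something like the following (the pkill/pgrep part is just
one way of making sure a single instance is left):

pkill rpc.gssd
pgrep -l rpc.gssd    # should print nothing
rpc.gssd -f -vvv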

Re: the upcall. Any new mount using Kerberos should cause the upcall
to create a new context with the server. Make sure the filesystem is
not already mounted. After the mount, any new user accessing that
mounted filesystem should cause an upcall to create a context for that
user.
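
A quick way to exercise both cases (paths taken from your earlier mails;
klist just confirms the accessing user has a ticket):

grep /autohome/share /proc/mounts    # make sure it is not already mounted
klist
ls /autohome/share                   # any access after the mount should trigger the per-user upcall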

K.C.

2008-11-03 21:12:48

by Brian J. Murrell

[permalink] [raw]
Subject: Re: stuck/hung nfsv4 mounts

On Mon, 2008-11-03 at 15:28 -0500, Kevin Coffman wrote:
>
> There should definitely be corresponding debug output from rpc.gssd
> (-vvv) if you got those librpcsecgss (-rrr) messages.

Yes, but as of my last post I had debugging enabled only on the client and
not the server, and once I enabled it on the server I couldn't trigger the
client-side output at will.

> Re: the upcall. Any new mount using Kerberos should cause the upcall
> to create a new context with the server. Make sure the filesystem is
> not already mounted.

It was definitely not mounted the last time I tried to mount it and
produce the upcall output but it just never happened.

> After the mount, any new user accessing that
> mounted filesystem should cause an upcall to create a context for that
> user.

I seem to have since gotten an upcall and here is the corresponding
client and server debug:

Nov 3 15:17:33 pc rpc.gssd[2937]: in authgss_create_default()
Nov 3 15:17:33 pc rpc.gssd[2937]: in authgss_create()
Nov 3 15:17:33 pc rpc.gssd[2937]: authgss_create: name is 0x98fd358
Nov 3 15:17:33 pc rpc.gssd[2937]: authgss_create: gd->name is 0x98f9cc8
Nov 3 15:17:33 pc rpc.gssd[2937]: in authgss_refresh()
Nov 3 15:17:33 pc rpc.gssd[2937]: struct rpc_gss_sec:
Nov 3 15:17:33 pc rpc.gssd[2937]: mechanism_OID: { 1 2 134 72 134 247 18 1 2 2 }
Nov 3 15:17:33 pc rpc.gssd[2937]: qop: 0
Nov 3 15:17:33 pc rpc.gssd[2937]: service: 1
Nov 3 15:17:33 pc rpc.gssd[2937]: cred: 0x98fb568
Nov 3 15:17:33 pc rpc.gssd[2937]: req_flags: 00000002
Nov 3 15:17:33 pc rpc.gssd[2937]: in authgss_marshal()
Nov 3 15:17:33 pc rpc.gssd[2937]: xdr_rpc_gss_buf: encode success ((nil):0)
Nov 3 15:17:33 pc rpc.gssd[2937]: xdr_rpc_gss_cred: encode success (v 1, proc 1, seq 0, svc 1, ctx (nil):0)
Nov 3 15:17:33 pc rpc.gssd[2937]: in authgss_wrap()
Nov 3 15:17:33 pc rpc.gssd[2937]: xdr_rpc_gss_buf: encode success (0x98fbc08:459)
Nov 3 15:17:33 pc rpc.gssd[2937]: xdr_rpc_gss_init_args: encode success (token 0x98fbc08:459)
Nov 3 15:17:33 pc rpc.gssd[2937]: in authgss_validate()
Nov 3 15:17:33 pc rpc.gssd[2937]: in authgss_unwrap()
Nov 3 15:17:33 pc rpc.gssd[2937]: xdr_rpc_gss_buf: decode success (0x98fd428:4)
Nov 3 15:17:33 pc rpc.gssd[2937]: xdr_rpc_gss_buf: decode success (0x990e428:114)
Nov 3 15:17:33 pc rpc.gssd[2937]: xdr_rpc_gss_init_res decode success (ctx 0x98fd428:4, maj 0, min 0, win 128, token 0x990e428:114)
Nov 3 15:17:33 pc rpc.gssd[2937]: authgss_create_default: freeing name 0x98fd358
Nov 3 15:17:33 pc rpc.gssd[2937]: in authgss_get_private_data()
Nov 3 15:17:33 pc rpc.gssd[2937]: in authgss_free_private_data()
Nov 3 15:17:33 pc rpc.gssd[2937]: in authgss_destroy()
Nov 3 15:17:33 pc rpc.gssd[2937]: in authgss_destroy_context()
Nov 3 15:17:33 pc rpc.gssd[2937]: authgss_destroy: freeing name 0x98f9cc8


Nov 3 15:17:33 linux krb5kdc[5006]: TGS_REQ (1 etypes {1}) 10.75.22.1: ISSUE: authtime 1225740577, etypes {rep=1 tkt=1 ses=1}, brian@ILINX for nfs/linux.interlinx.bc.ca@ILINX
Nov 3 15:17:33 linux rpc.svcgssd[13724]: leaving poll
Nov 3 15:17:33 linux rpc.svcgssd[13724]: handling null request
Nov 3 15:17:33 linux rpc.svcgssd[13724]: sname = brian@ILINX
Nov 3 15:17:33 linux rpc.svcgssd[13724]: DEBUG: serialize_krb5_ctx: lucid version!
Nov 3 15:17:33 linux rpc.svcgssd[13724]: prepare_krb5_rfc1964_buffer: serializing keys with enctype 4 and length 8
Nov 3 15:17:33 linux rpc.svcgssd[13724]: doing downcall
Nov 3 15:17:33 linux rpc.svcgssd[13724]: mech: krb5, hndl len: 4, ctx len 85, timeout: 2147483647, uid: 1001, gid: 1001, num aux grps: 13:
Nov 3 15:17:33 linux rpc.svcgssd[13724]: ( 1) 1001
Nov 3 15:17:33 linux rpc.svcgssd[13724]: ( 2) 117
Nov 3 15:17:33 linux rpc.svcgssd[13724]: ( 3) 1010
Nov 3 15:17:33 linux rpc.svcgssd[13724]: ( 4) 1011
Nov 3 15:17:33 linux rpc.svcgssd[13724]: ( 5) 2000
Nov 3 15:17:33 linux rpc.svcgssd[13724]: ( 6) 29
Nov 3 15:17:33 linux rpc.svcgssd[13724]: ( 7) 20
Nov 3 15:17:33 linux rpc.svcgssd[13724]: ( 8) 24
Nov 3 15:17:33 linux rpc.svcgssd[13724]: ( 9) 25
Nov 3 15:17:33 linux rpc.svcgssd[13724]: ( 10) 30
Nov 3 15:17:33 linux rpc.svcgssd[13724]: ( 11) 44
Nov 3 15:17:33 linux rpc.svcgssd[13724]: ( 12) 46
Nov 3 15:17:33 linux rpc.svcgssd[13724]: ( 13) 111
Nov 3 15:17:33 linux rpc.svcgssd[13724]: sending null reply
Nov 3 15:17:33 linux rpc.svcgssd[13724]: writing message: ...
Nov 3 15:17:33 linux rpc.svcgssd[13724]: finished handling null request
Nov 3 15:17:33 linux rpc.svcgssd[13724]: entering poll

And I still have an uncompleted mount.nfs4 command in my process table.

b.



2008-11-03 21:41:05

by Chuck Lever

[permalink] [raw]
Subject: Re: stuck/hung nfsv4 mounts

On Nov 3, 2008, at 12:37 PM, Trond Myklebust wrote:
> On Mon, 2008-11-03 at 12:25 -0500, Jim Rees wrote:
>> Trond Myklebust wrote:
>>
>> BTW: NFSv4 + soft == BAD BAD BAD!
>>
>> Maybe this combination should be prohibited. Does it make any
>> sense given
>> the stateful nature of v4?
>
> It might make sense if we were to fix up the granularity of the
> recovery
> routines so that we are able to recover all the state associated with
> just a single open owner or lock owner. Currently we'd have to recover
> all the state associated with that server.
>
> IOW: we might be able to fix things up in the future, but right now,
> NFSv4+soft is not a good idea.

Two cents worth: Until NFSv4+soft works reasonably well, I wouldn't
have any problem with making "soft" a no-op for nfs4 mounts.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com