2010-09-30 17:41:46

by Cláudio Martins

[permalink] [raw]
Subject: rpc.gssd / kernel infinite loop when a kerberos ticket expires


Greetings,

During linux NFS4 server and client testing, I think I have found a
bug which causes an infinite loop on a client machine. I am not sure if
this should be considered a kernel or a gssd specific bug (or both).

Steps to reproduce:

NFS4 server running kernel 2.6.36-rc5-00073-g57aebd7 (from linus'
git), with an export configured for krb5p security. The client is
running 2.6.36-rc6-00006-g4193d91 (a bit newer than the server).
nfs-utils is version 1.2.3:

(server)
#exportfs -v
/srv/ts2 193.136.0.0/16(rw,wdelay,root_squash,no_subtree_check,fsid=0,sec=krb5p,rw,root_squash,no_all_squash)

Both client and server have keytabs including encryption type aes256-cts:

(client)
# klist -ek /etc/krb5.keytab
Keytab name: WRFILE:/etc/krb5.keytab
KVNO Principal
---- --------------------------------------------------------------------------
4 nfs/[email protected] (AES-256 CTS mode with 96-bit SHA-1 HMAC)

(server)
## klist -ek /etc/krb5.keytab
Keytab name: WRFILE:/etc/krb5.keytab
KVNO Principal
---- --------------------------------------------------------------------------
6 nfs/[email protected] (AES-256 CTS mode with 96-bit SHA-1 HMAC)
6 nfs/[email protected] (Triple DES cbc mode with HMAC/sha1)


On the client, the remote filesystem is mounted with:

# mount -t nfs4 stor-test.ciist.ist.utl.pt:/ /mnt -o sec=krb5p

So we should be making use of the gss AES encryption support recently
commited to the kernel and nfs-utils.

Now after mounting, everything looks reasonable both on the client and
server. I can su to different local usernames on the client, and use
kinit to obtain tickets for several valid principal (the server knows
all those principals), and do regular read/write file operations.

Up to this point, so far so good. Now let it sit idle for a few hours -
enough to expire the user tickets:

(client)
#klist
Ticket cache: FILE:/tmp/krb5cc_33
Default principal: [email protected]

Valid starting Expires Service principal
09/30/10 04:24:28 09/30/10 14:24:28 krbtgt/[email protected]
renew until 10/01/10 04:24:24
09/30/10 04:25:21 09/30/10 14:24:28 nfs/[email protected]
renew until 10/01/10 04:24:24

So this ticket expired at 14:24:28, but now it's about 15:37, and I do
an ls on a remote directory where this principal has "rwxcty" perms
(one that definitely worked a few hours ago). This results in a stuck
"ls" command (SIGKILL doesn't work) and rpc.gssd is using up ~99% CPU
time. I had rpc.gssd started with -vvvv, and this is what I get:

Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: handling gssd upcall (/var/lib/nfs/rpc_pipefs/nfs/clnt23)
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: handle_gssd_upcall: 'mech=krb5 uid=146506 enctypes=18,17,16,23,3,1,2 '
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: handling krb5 upcall (/var/lib/nfs/rpc_pipefs/nfs/clnt23)
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: process_krb5_upcall: service is '<null>'
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: getting credentials for client with uid 146506 for server stor-test.ciist.ist.utl.pt
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: CC file '/tmp/krb5cc_machine_IST.UTL.PT' being considered, with preferred realm 'IST.UTL.PT'
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: CC file '/tmp/krb5cc_machine_IST.UTL.PT' owned by 0, not 146506
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: CC file '/tmp/krb5cc_146506' being considered, with preferred realm 'IST.UTL.PT'
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: CC file '/tmp/krb5cc_146506' is expired or corrupt
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: CC file '/tmp/krb5cc_90001' being considered, with preferred realm 'IST.UTL.PT'
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: CC file '/tmp/krb5cc_90001' owned by 90001, not 146506
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: CC file '/tmp/krb5cc_33' being considered, with preferred realm 'IST.UTL.PT'
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: CC file '/tmp/krb5cc_33' owned by 33, not 146506
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: CC file '/tmp/krb5cc_8500' being considered, with preferred realm 'IST.UTL.PT'
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: CC file '/tmp/krb5cc_8500' owned by 8500, not 146506
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: WARNING: Failed to create krb5 context for user with uid 146506 for server stor-test.ciist.ist.utl.pt
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: doing error downcall
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: handling gssd upcall (/var/lib/nfs/rpc_pipefs/nfs/clnt23)
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: handle_gssd_upcall: 'mech=krb5 uid=33 enctypes=18,17,16,23,3,1,2 '
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: handling krb5 upcall (/var/lib/nfs/rpc_pipefs/nfs/clnt23)
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: process_krb5_upcall: service is '<null>'
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: getting credentials for client with uid 33 for server stor-test.ciist.ist.utl.pt
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: CC file '/tmp/krb5cc_machine_IST.UTL.PT' being considered, with preferred realm 'IST.UTL.PT'
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: CC file '/tmp/krb5cc_machine_IST.UTL.PT' owned by 0, not 33
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: CC file '/tmp/krb5cc_146506' being considered, with preferred realm 'IST.UTL.PT'
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: CC file '/tmp/krb5cc_146506' owned by 146506, not 33
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: CC file '/tmp/krb5cc_90001' being considered, with preferred realm 'IST.UTL.PT'
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: CC file '/tmp/krb5cc_90001' owned by 90001, not 33
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: CC file '/tmp/krb5cc_33' being considered, with preferred realm 'IST.UTL.PT'
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: CC file '/tmp/krb5cc_33' is expired or corrupt
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: CC file '/tmp/krb5cc_8500' being considered, with preferred realm 'IST.UTL.PT'
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: CC file '/tmp/krb5cc_8500' owned by 8500, not 33
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: WARNING: Failed to create krb5 context for user with uid 33 for server stor-test.ciist.ist.utl.pt
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: doing error downcall
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: handling gssd upcall (/var/lib/nfs/rpc_pipefs/nfs/clnt23)
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: handle_gssd_upcall: 'mech=krb5 uid=146506 enctypes=18,17,16,23,3,1,2 '
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: handling krb5 upcall (/var/lib/nfs/rpc_pipefs/nfs/clnt23)
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: process_krb5_upcall: service is '<null>'
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: getting credentials for client with uid 146506 for server stor-test.ciist.ist.utl.pt
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: CC file '/tmp/krb5cc_machine_IST.UTL.PT' being considered, with preferred realm 'IST.UTL.PT'
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: CC file '/tmp/krb5cc_machine_IST.UTL.PT' owned by 0, not 146506
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: CC file '/tmp/krb5cc_146506' being considered, with preferred realm 'IST.UTL.PT'
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: CC file '/tmp/krb5cc_146506' is expired or corrupt
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: CC file '/tmp/krb5cc_90001' being considered, with preferred realm 'IST.UTL.PT'
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: CC file '/tmp/krb5cc_90001' owned by 90001, not 146506
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: CC file '/tmp/krb5cc_33' being considered, with preferred realm 'IST.UTL.PT'
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: CC file '/tmp/krb5cc_33' owned by 33, not 146506
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: CC file '/tmp/krb5cc_8500' being considered, with preferred realm 'IST.UTL.PT'
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: CC file '/tmp/krb5cc_8500' owned by 8500, not 146506
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: WARNING: Failed to create krb5 context for user with uid 146506 for server stor-test.ciist.ist.utl.pt
Sep 30 15:37:42 stor-test2 rpc.gssd[2927]: doing error downcall


...and so on. The machine gets quite unresponsive and I have to reboot
to get rid of the hung processes.

On the log transcribed above, note that gssd seems to be iterating
over all the cred. cache files on /tmp. The "ls" is being exec'ed as
uid=33, so IMO it should be looking at /tmp/krb5cc_33, deciding that
the creds have expired, and return an error, and the kernel client
should return an error to the "ls" process.
Instead, what we're seeing is the above downcall/upcall infinite loop,
which seems to hint that the kernel client is not rate-limiting any
kind of requests to gssd, and does not seem to do anything about the
error downcalls from gssd.

I couldn't get any interesting client kernel messages to corelate to
the gssd log (apart from OOM being invoked after a while to kill
rsyslogd which got overwhelmed by gssd debug info).

While this was happening, I verified that the client does not get to
the point of establishing a new connection to the server, so this would
seem to be a client related problem.

Please let me know if you would like me to provide further information
about this test setup or if you need any trace info. I will also gladly
run any further diagnostics you may like to debug this.

Thanks in advance for any help with this.

Best regards

Cláudio