2011-11-16 18:13:51

by John Hughes

Subject: [PATCH] Don't hang user processes if Kerberos ticket for nfs4 mount expires

With recent kernels if the Kerberos ticket for a nfs4 mount expires any
user process trying to access the mount hangs until a new ticket is
obtained. Simultaneously a (luckily rate-limited, but still seemingly
endless) stream of "Error: state manager encountered RPCSEC_GSS session
expired against NFSv4 server" messages is written to the kernel log.

In a common setup with user home directories nfs4 mounted on
workstations one of the processes that is likely to hang is the
screen-unlock function which would normally (via pam_krb5 or similar)
get the new ticket.

In older kernels the EKEYEXPIRED error would be passed to userland,
which would usually just give up.

This patch restores the old behavior, which makes nfs4 mounted home
directories usable for me.




Attachments:
nfs4-ekeyexpired.patch (1.25 kB)

2011-11-17 01:30:31

by Jeff Layton

Subject: Re: [PATCH] Don't hang user processes if Kerberos ticket for nfs4 mount expires

On Wed, 16 Nov 2011 18:44:34 -0500
Jim Rees <[email protected]> wrote:

> Jeff Layton wrote:
>
> Uhhh, no...EKEYEXPIRED was never passed to userland. The patchset that
> added EKEYEXPIRED returns in this codepath also added the code to make
> it hang.
>
> This is not a bug, or at least it's intentional behavior. When a krb5
> ticket expires, we *want* the process to hang. Otherwise, people with
> long running jobs will often find that their jobs error out
> inexplicably when their ticket expires.
>
> Who decided that? This seems completely wrong to me. If my credentials
> expire, I want to get permission denied, not a client hang. In 20 years of
> using authenticated file systems I never once wished my process had hung
> when my ticket expired.
>

I proposed it, we discussed it on the list, and Trond and Steve
committed the patches necessary to make it happen. This was back in
late 2009/early 2010 though, so my memory is a bit fuzzy...

> Why should this be any different from any other failure condition? If you
> try to open a file that doesn't exist, do you want your process to hang
> instead of getting ENOENT, just in case the file magically appears at some
> point in the future?
>

That's different. Not renewing your credentials is often a temporary
situation. Kerberos is different than other authentication methods in
that you get a ticket only for a period of time, so expired credentials
are not a situation that's common with other authentication methods.

> This seems a recipe for disaster. Suppose I have a cron job that fires once
> a minute, and all those jobs hang waiting for a ticket. I come to work in
> the morning and discover I've got 10,000 hung processes. Or not, because my
> computer has crashed from resource exhaustion.

The previous situation was also a recipe for disaster, and was often
cited as a primary reason why people didn't want to deploy kerberized
NFS. Having everything fall down and go boom when your ticket expires
is not desirable either.

I suppose we'll have to agree to disagree on this point. That said, I'm
open to sane suggestions however that don't regress the behavior for
those users who need to be able to cope with expired tickets.

--
Jeff Layton <[email protected]>

2011-11-16 19:46:29

by Jeff Layton

Subject: Re: [PATCH] Don't hang user processes if Kerberos ticket for nfs4 mount expires

On Wed, 16 Nov 2011 19:14:35 +0100
John Hughes <[email protected]> wrote:

> With recent kernels if the Kerberos ticket for a nfs4 mount expires any
> user process trying to access the mount hangs until a new ticket is
> obtained. Simultaneously a (luckily rate-limited, but still seemingly
> endless) stream of "Error: state manager encountered RPCSEC_GSS session
> expired against NFSv4 server" messages is written to the kernel log.
>
> In a common setup with user home directories nfs4 mounted on
> workstations one of the processes that is likely to hang is the
> screen-unlock function which would normally (via pam_krb5 or similar)
> get the new ticket.
>
> In older kernels the EKEYEXPIRED error would be passed to userland,
> which would usually just give up.
>
> This patch restores the old behavior, which makes nfs4 mounted home
> directories usable for me.
>

Uhhh, no...EKEYEXPIRED was never passed to userland. The patchset that
added EKEYEXPIRED returns in this codepath also added the code to make
it hang.

This is not a bug, or at least it's intentional behavior. When a krb5
ticket expires, we *want* the process to hang. Otherwise, people with
long running jobs will often find that their jobs error out
inexplicably when their ticket expires.

The patches that introduced this behavior went into 2.6.34. See the
commits around 2c64348 (and some preceding ones in the rpc layer).

If you want to fix this use case, you'll need to come up with a scheme
that doesn't regress this behavior. I think that you'll really need to
ensure that whatever process you expect to re-fetch your TGT is not
dependent on accessing kerberized nfs mounts. That really seems like an
untenable chicken and egg situation.

--
Jeff Layton <[email protected]>

2011-11-17 13:13:50

by John Hughes

Subject: Re: [PATCH] Don't hang user processes if Kerberos ticket for nfs4 mount expires

On 17/11/11 12:05, John Hughes wrote:
> On 17/11/11 02:38, Jeff Layton wrote:
>> Note too that the gssd code distinguishes between an expired TGT and a
>> non-existent credcache. The latter will give you the error you desire
>> here. So one possibility is just to remove the credcache from /tmp in
>> this situation.
>
> Something to scan /tmp for expired credentials and zap em? rpc.gssd
> would communicate that to the kernel?
>
> Whadaya know, that works.

Here's a dumb perl script that could be run from, for example, .xsession
to automatically destroy expired ticket caches.

Would need a bit of trickery to make it go away on end of session and
something in /etc/pm/sleep.d to send it a SIGALRM when the system wakes
from suspend or hibernate.

It has a potential race between destroying an expired ticket and a new
ticket being granted.

I guess now I'll look at a hack to rpc.gssd for a neater way of doing this.
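
The attached script is not reproduced in this thread. As a rough illustration of the approach described above, here is a minimal Python sketch. It is not the attached perl script. It assumes MIT Kerberos tools, where `klist -s <cache>` exits non-zero for an expired or unreadable cache and `kdestroy -q -c <cache>` removes it, and it assumes file-based caches named /tmp/krb5cc_*. The function names `cache_is_expired`, `destroy_cache` and `sweep` are hypothetical.

```python
import glob
import os
import subprocess


def cache_is_expired(path):
    """Return True if klist reports the cache as expired or unreadable."""
    # Assumption: MIT klist, where -s is silent and exit status 1 means
    # the cache cannot be read or has expired.
    return subprocess.run(["klist", "-s", path]).returncode != 0


def destroy_cache(path):
    """Destroy one credential cache (assumes MIT kdestroy's -q and -c)."""
    subprocess.run(["kdestroy", "-q", "-c", path], check=False)


def sweep(pattern="/tmp/krb5cc_*", is_expired=cache_is_expired,
          destroy=destroy_cache, uid=None):
    """Destroy expired caches owned by `uid`; return the paths destroyed."""
    uid = os.getuid() if uid is None else uid
    destroyed = []
    for path in sorted(glob.glob(pattern)):
        try:
            if os.stat(path).st_uid != uid:
                continue  # leave other users' caches alone
        except FileNotFoundError:
            continue  # cache vanished between glob and stat
        # The race mentioned above remains: a fresh ticket could be
        # written between this check and the destroy call.
        if is_expired(path):
            destroy(path)
            destroyed.append(path)
    return destroyed
```

Calling `sweep()` periodically (or on wake from suspend) would approximate the behaviour described; the check and destroy hooks are parameters only so the logic can be exercised without real tickets.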





Attachments:
monitor-tickets (1.28 kB)

2011-11-17 09:37:13

by John Hughes

Subject: Re: [PATCH] Don't hang user processes if Kerberos ticket for nfs4 mount expires

On 16/11/11 20:47, Jeff Layton wrote:
> On Wed, 16 Nov 2011 19:14:35 +0100
> John Hughes<[email protected]> wrote:
>
>> With recent kernels if the Kerberos ticket for a nfs4 mount expires any
>> user process trying to access the mount hangs until a new ticket is
>> obtained. Simultaneously a (luckily rate-limited, but still seemingly
>> endless) stream of "Error: state manager encountered RPCSEC_GSS session
>> expired against NFSv4 server" messages is written to the kernel log.
[...]
>> This patch restores the old behavior, which makes nfs4 mounted home
>> directories usable for me.
>>
> Uhhh, no...EKEYEXPIRED was never passed to userland. The patchset that
> added EKEYEXPIRED returns in this codepath also added the code to make
> it hang.

You are, of course, right. userland used to get EPERM.

> This is not a bug, or at least it's intentional behavior. When a krb5
> ticket expires, we *want* the process to hang. Otherwise, people with
> long running jobs will often find that their jobs error out
> inexplicably when their ticket expires.

I thought that was what kstart/krenew were for.

> The patches that introduced this behavior went into 2.6.34. See the
> commits around 2c64348 (and some preceding ones in the rpc layer).

Ah, I'm a Debian user - 2.6.32 for the moment, soon to be 3.?

> If you want to fix this use case, you'll need to come up with a scheme
> that doesn't regress this behavior. I think that you'll really need to
> ensure that whatever process you expect to re-fetch your TGT is not
> dependent on accessing kerberized nfs mounts. That really seems like an
> untenable chicken and egg situation.

Ow. "Fixing" (at least) Gnome-3 and Gnome-2 screen-lock/screensavers.

How about a mount option to choose between the two behaviours?


2011-11-18 02:03:01

by Jeff Layton

Subject: Re: [PATCH] Don't hang user processes if Kerberos ticket for nfs4 mount expires

On Thu, 17 Nov 2011 20:51:16 -0500
Jim Rees <[email protected]> wrote:

> I would argue that if you don't want your applications to stop working when
> your ticket expires, you shouldn't let the ticket expire. If you don't want
> to have to renew your ticket, you should use an infinite ticket lifetime.
>

That's the ideal situation, but shit happens, and losing a long-running
job can often be an expensive proposition.

> It sounds like you've made up your mind, but I would urge you to make this
> a mount option, analogous to the hard/soft mount option.

I've not made up my mind about anything, and in any case it's not my
decision to make. I think you need to convince Trond here... :)

I'm quite open to sane proposals as long as we can accommodate those who
are dependent on the current behavior. As I said before, when I
originally did the patches a couple of years ago, I sort of figured the
current behavior was a first approximation.

A mount option will be harder to implement than a rpc.gssd command-line
option, but it sounds reasonable. Still, it would be better not to have
to make this an either/or decision somehow.

--
Jeff Layton <[email protected]>

2011-11-18 01:51:27

by Jim Rees

Subject: Re: [PATCH] Don't hang user processes if Kerberos ticket for nfs4 mount expires

I would argue that if you don't want your applications to stop working when
your ticket expires, you shouldn't let the ticket expire. If you don't want
to have to renew your ticket, you should use an infinite ticket lifetime.

It sounds like you've made up your mind, but I would urge you to make this
a mount option, analogous to the hard/soft mount option.

2011-11-17 01:37:21

by Jeff Layton

Subject: Re: [PATCH] Don't hang user processes if Kerberos ticket for nfs4 mount expires

On Wed, 16 Nov 2011 20:31:19 -0500
Jeff Layton <[email protected]> wrote:

> On Wed, 16 Nov 2011 18:44:34 -0500
> Jim Rees <[email protected]> wrote:
>
> > Jeff Layton wrote:
> >
> > Uhhh, no...EKEYEXPIRED was never passed to userland. The patchset that
> > added EKEYEXPIRED returns in this codepath also added the code to make
> > it hang.
> >
> > This is not a bug, or at least it's intentional behavior. When a krb5
> > ticket expires, we *want* the process to hang. Otherwise, people with
> > long running jobs will often find that their jobs error out
> > inexplicably when their ticket expires.
> >
> > Who decided that? This seems completely wrong to me. If my credentials
> > expire, I want to get permission denied, not a client hang. In 20 years of
> > using authenticated file systems I never once wished my process had hung
> > when my ticket expired.
> >
>
> I proposed it, we discussed it on the list, and Trond and Steve
> committed the patches necessary to make it happen. This was back in
> late 2009/early 2010 though, so my memory is a bit fuzzy...
>
> > Why should this be any different from any other failure condition? If you
> > try to open a file that doesn't exist, do you want your process to hang
> > instead of getting ENOENT, just in case the file magically appears at some
> > point in the future?
> >
>
> That's different. Not renewing your credentials is often a temporary
> situation. Kerberos is different than other authentication methods in
> that you get a ticket only for a period of time, so expired credentials
> are not a situation that's common with other authentication methods.
>
> > This seems a recipe for disaster. Suppose I have a cron job that fires once
> > a minute, and all those jobs hang waiting for a ticket. I come to work in
> > the morning and discover I've got 10,000 hung processes. Or not, because my
> > computer has crashed from resource exhaustion.
>
> The previous situation was also a recipe for disaster, and was often
> cited as a primary reason why people didn't want to deploy kerberized
> NFS. Having everything fall down and go boom when your ticket expires
> is not desirable either.
>
> I suppose we'll have to agree to disagree on this point. That said, I'm
> open to sane suggestions however that don't regress the behavior for
> those users who need to be able to cope with expired tickets.
>

Note too that the gssd code distinguishes between an expired TGT and a
non-existent credcache. The latter will give you the error you desire
here. So one possibility is just to remove the credcache from /tmp in
this situation.

Another possibility might be a new option to rpc.gssd that allows the
user to select the error that it passes back to the kernel on an
expired ticket.

--
Jeff Layton <[email protected]>

2011-11-17 01:46:30

by Matt W. Benjamin

Subject: Re: [PATCH] Don't hang user processes if Kerberos ticket for nfs4 mount expires

Hi,

While I'm not expert in this area, my impression had been that the established
practice was that used with AFS, i.e., run jobs under a process capable of renewing
kerberos tickets, e.g., kstart (http://www.eyrie.org/~eagle/software/kstart/).

Matt

----- "Jeff Layton" <[email protected]> wrote:

> The previous situation was also a recipe for disaster, and was often
> cited as a primary reason why people didn't want to deploy kerberized
> NFS. Having everything fall down and go boom when your ticket expires
> is not desirable either.
>
> I suppose we'll have to agree to disagree on this point. That said, I'm
> open to sane suggestions however that don't regress the behavior for
> those users who need to be able to cope with expired tickets.
>
> --
> Jeff Layton <[email protected]>

--

Matt Benjamin

The Linux Box
206 South Fifth Ave. Suite 150
Ann Arbor, MI 48104

http://linuxbox.com

tel. 734-761-4689
fax. 734-769-8938
cel. 734-216-5309

2011-11-17 21:47:01

by Jeff Layton

Subject: Re: [PATCH] Don't hang user processes if Kerberos ticket for nfs4 mount expires

On Thu, 17 Nov 2011 14:13:38 +0100
John Hughes <[email protected]> wrote:

> On 17/11/11 12:05, John Hughes wrote:
> > On 17/11/11 02:38, Jeff Layton wrote:
> >> Note too that the gssd code distinguishes between an expired TGT and a
> >> non-existent credcache. The latter will give you the error you desire
> >> here. So one possibility is just to remove the credcache from /tmp in
> >> this situation.
> >
> > Something to scan /tmp for expired credentials and zap em? rpc.gssd
> > would communicate that to the kernel?
> >
> > Whadaya know, that works.
> Here's a dumb perl script that could be run from, for example, .xsession
> to automatically destroy expired ticket caches.
>
> Would need a bit of trickery to make it go away on end of session and
> something in /etc/pm/sleep.d to send it a SIGALRM when the system wakes
> from suspend or hibernate.
>
> It has a potential race between destroying an expired ticket and a new
> ticket being granted.
>
> I guess now I'll look at a hack to rpc.gssd for a neater way of doing this.
>

Ok, I can remember a bit more about the genesis of this scheme...

At the time the argument went something like this:

No one expects that when their krb5 ticket expires that their
applications will fail. A case in point is something like a krb5 ssh
session. If I had a valid ticket when I initiated the session, then
we would consider it a bug if it were to suddenly die when the ticket
expired.

Contrast that however with applications running on a kerberized NFS
mount. As soon as the ticket expires they start failing with
non-transient errors. This is probably the case as well with the screen
locker you're using, but it's apparently able to recover enough to
allow the TGT to be renewed. I expect though, that you may have other
less visible programs that are dying in this situation or are getting
unexpected errors.

The current behavior was really intended as a first approximation. I
fully expected that it would need some refinement, but AFAIK, no one
has complained loudly about the current behavior until now, so I
haven't seen a need to mess with it.

I'm not that familiar with kstart, but I assume that it gets a
renewable TGT and just renews it as needed? I have to wonder if that
sort of tool might be verboten in security conscious sites (the very
sort that want kerberized nfs).

If we decide that making this behavior switchable is the right thing to
do, then what you'll probably want to do is add a new command-line
option to rpc.gssd, and make it conditionally return -EKEYEXPIRED or
-EACCES in the downcall based on it. It should be a fairly simple
patch. See process_krb5_upcall() in rpc.gssd...
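
The decision such an option would control can be modelled in a few lines. The following Python sketch is only an illustration of the logic, not rpc.gssd code (which is C). The function name `downcall_error` and the `fail_on_expiry` switch are hypothetical, and it assumes that a missing credcache maps to -EACCES, as the discussion in this thread suggests.

```python
import errno

# EKEYEXPIRED is Linux-specific; fall back to its Linux value (127)
# on platforms whose errno module lacks it.
EKEYEXPIRED = getattr(errno, "EKEYEXPIRED", 127)


def downcall_error(cache_exists, ticket_expired, fail_on_expiry=False):
    """Return the negative errno a gssd-like daemon might hand back.

    0 means the credentials are usable. `fail_on_expiry` stands in for
    the proposed (hypothetical) command-line option.
    """
    if not cache_exists:
        # No credcache at all: treated as a hard failure (assumption).
        return -errno.EACCES
    if ticket_expired:
        # Current behaviour: -EKEYEXPIRED, which makes the caller wait.
        # With the option set: fail immediately with -EACCES instead.
        return -errno.EACCES if fail_on_expiry else -EKEYEXPIRED
    return 0
```

With the switch off, an expired ticket yields -EKEYEXPIRED (the process waits); with it on, the same condition yields -EACCES and the error reaches the application right away.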

Long term, we probably need to consider this use-case in the GSSAPI
proxy initiative that Simo has been scoping out. It would be nice to
have a solution that would work for both home directory configurations
and long-running jobs without needing these sorts of hacks.

--
Jeff Layton <[email protected]>

2011-11-16 23:44:47

by Jim Rees

Subject: Re: [PATCH] Don't hang user processes if Kerberos ticket for nfs4 mount expires

Jeff Layton wrote:

  Uhhh, no...EKEYEXPIRED was never passed to userland. The patchset that
  added EKEYEXPIRED returns in this codepath also added the code to make
  it hang.

  This is not a bug, or at least it's intentional behavior. When a krb5
  ticket expires, we *want* the process to hang. Otherwise, people with
  long running jobs will often find that their jobs error out
  inexplicably when their ticket expires.

Who decided that? This seems completely wrong to me. If my credentials
expire, I want to get permission denied, not a client hang. In 20 years of
using authenticated file systems I never once wished my process had hung
when my ticket expired.

Why should this be any different from any other failure condition? If you
try to open a file that doesn't exist, do you want your process to hang
instead of getting ENOENT, just in case the file magically appears at some
point in the future?

This seems a recipe for disaster. Suppose I have a cron job that fires once
a minute, and all those jobs hang waiting for a ticket. I come to work in
the morning and discover I've got 10,000 hung processes. Or not, because my
computer has crashed from resource exhaustion.

2011-11-17 11:06:33

by John Hughes

Subject: Re: [PATCH] Don't hang user processes if Kerberos ticket for nfs4 mount expires

On 17/11/11 02:38, Jeff Layton wrote:
> Note too that the gssd code distinguishes between an expired TGT and a
> non-existent credcache. The latter will give you the error you desire
> here. So one possibility is just to remove the credcache from /tmp in
> this situation.

Something to scan /tmp for expired credentials and zap em? rpc.gssd
would communicate that to the kernel?

Whadaya know, that works.

With the 3.1-rc10 kernel I let my ticket expire, did a ls - it hangs.

Now, from another terminal I do a kdestroy on my ticket cache, and (a
second or so later) the ls gets an EPERM.

So this behaviour can be changed from userland with no changes to the
kernel, rpc.gssd or anything else.

Some fun racing possibilities.