2010-10-28 00:30:33

by Brian De Wolf

Subject: rpc.gssd still spammed in 2.6.35

Greetings,

I recently started testing a build of 2.6.35 to hopefully relieve some
issues we have on our login boxes. Specifically, I was after this
commit:
http://git.kernel.org/?p=linux/kernel/git/next/linux-next.git;a=commit;h=126e216a8730532dfb685205309275f87e3d133e

The issue we've run into is that a user loses their credentials while
one of their processes keeps looping on a read/write in their Kerberized
NFSv4 home directory without checking the return value. Not only does
this spam the logs, it also prevents rpc.gssd from handling anyone
else's logins, effectively taking down the service for anyone not
already connected.

I was hoping this commit would protect rpc.gssd from any potential
flooding of requests, but it all depends on how the user loses their
credentials. If their credentials have expired or their caches become
corrupt, rpc.gssd returns EKEYEXPIRED and the kernel rate limits the
requests to rpc.gssd via negative caching.

If the user's credential cache gets destroyed, however, rpc.gssd
returns EACCES, which is not negatively cached, so the user process can
cause the kernel to hammer rpc.gssd. The kicker here is that pam_krb5
destroys credentials on logout by default, so if someone's using screen
or long-running background processes in their home directory, it's a
ticking time bomb waiting to take down rpc.gssd.
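
To make the difference between the two error paths concrete, here is a
minimal userspace sketch. It is not kernel code: struct model_cred and
the model_* functions are hypothetical names, and the model assumes
every attempt falls inside a single negative-cache window.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#ifndef EKEYEXPIRED
#define EKEYEXPIRED 127 /* Linux value; fallback for other platforms */
#endif

/* Hypothetical stand-in for the kernel's per-credential state. */
struct model_cred {
    bool negative;         /* models the RPCAUTH_CRED_NEGATIVE flag */
    unsigned long upcalls; /* requests that actually reached the daemon */
};

/* Unpatched behaviour as described above: only -EKEYEXPIRED is remembered. */
static void model_handle_result(struct model_cred *c, int err)
{
    assert(c != NULL);
    switch (err) {
    case 0:
        c->negative = false;
        break;
    case -EKEYEXPIRED:
        c->negative = true;
        break;
    /* There is no case for -EACCES, so that failure leaves no trace. */
    }
}

/* One I/O attempt; daemon_err is the answer the daemon would give. */
static int model_attempt(struct model_cred *c, int daemon_err)
{
    assert(c != NULL);
    if (c->negative)
        return -EKEYEXPIRED; /* answered from the negative cache */
    c->upcalls++;
    model_handle_result(c, daemon_err);
    return daemon_err;
}

/* A process that retries `iterations` times without checking errors. */
static unsigned long model_loop(int daemon_err, unsigned long iterations)
{
    struct model_cred c = { false, 0 };
    unsigned long i;

    for (i = 0; i < iterations; i++)
        model_attempt(&c, daemon_err);
    return c.upcalls;
}
```

In this model, model_loop(-EKEYEXPIRED, n) reaches the daemon once,
while model_loop(-EACCES, n) reaches it n times.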

That's assuming a benign user, as well. A malicious user could easily
kdestroy, wait for their credentials to expire from the cache in the
kernel, and start tying up rpc.gssd with failed requests.


With this in mind, I initially patched the kernel to negatively cache
credentials that fail with EACCES, in addition to EKEYEXPIRED. But the
more I thought about it, the more it seemed appropriate to subject all
possible errors to negative caching. The underlying question is: is
there any error from rpc.gssd for which it would be appropriate to let
a process trigger another request to rpc.gssd immediately? If there
isn't, negatively caching all errors seems reasonable.

Here's a simple proof-of-concept patch that negatively caches every
failed request. With it applied, I have yet to produce a scenario where
rpc.gssd becomes unresponsive.

Let me know what you think. I'd love to see a fix for this behavior
enter the kernel at some point, as it's been rather disruptive on our
login boxes lately.


diff --git a/net/sunrpc/auth_gss/auth_gss.c b/net/sunrpc/auth_gss/auth_gss.c
index 3835ce3..38bdf90 100644
--- a/net/sunrpc/auth_gss/auth_gss.c
+++ b/net/sunrpc/auth_gss/auth_gss.c
@@ -362,7 +362,7 @@ gss_handle_downcall_result(struct gss_cred *gss_cred, struct gss_upcall_msg *gss
 		clear_bit(RPCAUTH_CRED_NEGATIVE, &gss_cred->gc_base.cr_flags);
 		gss_cred_set_ctx(&gss_cred->gc_base, gss_msg->ctx);
 		break;
-	case -EKEYEXPIRED:
+	default:
 		set_bit(RPCAUTH_CRED_NEGATIVE, &gss_cred->gc_base.cr_flags);
 	}
 	gss_cred->gc_upcall_timestamp = jiffies;
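
For readers who want to see the effect of the change outside the
kernel, the following userspace sketch mirrors the patched switch and
pairs it with a time check. Everything named sketch_* is hypothetical,
and the 5-tick retry delay is an assumed value standing in for the
kernel's negative cache timeout.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed retry delay in abstract ticks; not taken from the kernel. */
#define SKETCH_RETRY_TICKS 5UL

/* Hypothetical stand-in for the fields the patch touches. */
struct sketch_cred {
    bool negative;       /* models RPCAUTH_CRED_NEGATIVE */
    unsigned long stamp; /* models gc_upcall_timestamp */
};

/* Mirrors the patched switch: any non-zero result marks the cred. */
static void sketch_handle_result(struct sketch_cred *c, int err,
                                 unsigned long now)
{
    assert(c != NULL);
    switch (err) {
    case 0:
        c->negative = false;
        break;
    default:
        c->negative = true;
    }
    c->stamp = now;
}

/* True when a new request may be sent to the daemon. */
static bool sketch_may_upcall(const struct sketch_cred *c,
                              unsigned long now)
{
    assert(c != NULL);
    if (!c->negative)
        return true;
    /* Unsigned subtraction stays correct if the tick counter wraps. */
    return now - c->stamp >= SKETCH_RETRY_TICKS;
}
```

After a failure recorded at tick 100, sketch_may_upcall() refuses new
requests until tick 105, whatever the error code was.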


2010-10-28 23:15:38

by Brian De Wolf

Subject: Re: rpc.gssd still spammed in 2.6.35

On Thu, 28 Oct 2010 07:00:19 -0700
Trond Myklebust wrote:

> What about the rpc_pipefs errors, EAGAIN, EPIPE and ETIMEDOUT? Why
> should they result in the cred being marked as negative?
>

I have a limited grasp of the exact mechanics going on, but my general
reasoning is this:

If a given credential request causes an error to be returned, be it
from rpc_pipefs or rpc.gssd, there are two possible reasons for the
failure:

1) rpc.gssd is missing or unresponsive. If this is the case, it doesn't
matter whether you retry immediately or wait 5 seconds; it's still
going to fail.

2) Something about the request has caused either rpc_pipefs or rpc.gssd
to produce an error, while other requests still process normally. If
this is the case, we should prioritize the requests that will succeed
by penalizing the requests that don't via negative caching of their
failures. Otherwise those failing requests can flood rpc.gssd and
prevent those that can succeed from ever being attempted (and this is
what has been happening in my environment).
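
The starvation described in case 2 can be illustrated with a toy queue
model. This is only a sketch: it assumes the daemon serves a fixed
number of requests per tick in FIFO order, which is a simplification,
and sim_backlog and its parameters are hypothetical.

```c
#include <assert.h>

/*
 * Returns how many failing requests are still queued ahead of any new
 * (valid) request after `ticks` ticks.
 *
 * flood_per_tick:  requests the looping process issues per tick
 * served_per_tick: requests the daemon can answer per tick
 * retry_ticks:     negative-cache delay; 0 means no negative caching
 */
static unsigned long sim_backlog(unsigned long ticks,
                                 unsigned long flood_per_tick,
                                 unsigned long served_per_tick,
                                 unsigned long retry_ticks)
{
    unsigned long backlog = 0;
    unsigned long t;

    assert(served_per_tick > 0);
    for (t = 0; t < ticks; t++) {
        unsigned long arrivals;

        if (retry_ticks == 0)
            arrivals = flood_per_tick;               /* every attempt upcalls */
        else
            arrivals = (t % retry_ticks == 0) ? 1 : 0; /* one per window */

        backlog += arrivals;
        backlog -= (backlog < served_per_tick) ? backlog : served_per_tick;
    }
    return backlog;
}
```

With made-up rates of 100 failing requests and 10 served requests per
tick, the backlog grows by 90 per tick without negative caching, and
stays empty with a 5-tick delay.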


The only problem I can see with it is that, if a request fails and the
keys become available within 5 seconds, the user just has to wait it
out. I don't think I can usually "kinit" with my password in 5 seconds,
but I could see an automated system being interfered with. I haven't
experimented with it, but I suspect a sub-second negative cache timeout
would still protect rpc.gssd from flooding while not causing extra
disruption to users.
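
As a rough back-of-the-envelope check of the sub-second idea, the
sketch below bounds how many requests a single credential could send to
the daemon in a given window. The function name and the example figures
(a 60 second window, 10,000 attempts per second) are assumptions chosen
for illustration, not measurements.

```c
#include <assert.h>

/*
 * Upper bound on daemon requests from one failing credential.
 *
 * window_ms:        length of the observation window (whole seconds
 *                   are used for the uncached rate)
 * delay_ms:         negative-cache delay; 0 means no negative caching
 * attempts_per_sec: how fast the looping process retries
 */
static unsigned long max_upcalls(unsigned long window_ms,
                                 unsigned long delay_ms,
                                 unsigned long attempts_per_sec)
{
    unsigned long uncached;
    unsigned long limited;

    assert(attempts_per_sec > 0);
    uncached = window_ms / 1000UL * attempts_per_sec;
    if (delay_ms == 0)
        return uncached;

    /* At most one request per delay window, rounded up. */
    limited = (window_ms + delay_ms - 1) / delay_ms;
    return limited < uncached ? limited : uncached;
}
```

Under those assumed figures, a 5000 ms delay allows at most 12 requests
per minute and a 250 ms delay allows at most 240, compared with 600,000
without any delay. The cost to a user who regains credentials right
after a failure is at most one delay period.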

I'd really just like to see some sort of rate-limiting on the failures
heading into rpc.gssd so that it can continue processing valid requests.

> rpc.gssd itself will only pass down 3 errors: 0, EKEYEXPIRED and EACCES.
>

Is this set in stone? My fear is that, if rpc.gssd is ever improved to
return even more error codes or can somehow be coerced to return some
other unexpected error code, rpc.gssd can be taken out of service by
flooding it with requests that subvert the negative caching.


Sorry if I'm out of touch with the internals or what's best for the
kernel. I'm just a sysadmin dabbling in the kernel, trying to fix some
problems I've been running into...

2010-10-28 14:00:38

by Myklebust, Trond

Subject: Re: rpc.gssd still spammed in 2.6.35

On Wed, 2010-10-27 at 17:24 -0700, Brian De Wolf wrote:
> diff --git a/net/sunrpc/auth_gss/auth_gss.c b/net/sunrpc/auth_gss/auth_gss.c
> index 3835ce3..38bdf90 100644
> --- a/net/sunrpc/auth_gss/auth_gss.c
> +++ b/net/sunrpc/auth_gss/auth_gss.c
> @@ -362,7 +362,7 @@ gss_handle_downcall_result(struct gss_cred *gss_cred, struct gss_upcall_msg *gss
>  		clear_bit(RPCAUTH_CRED_NEGATIVE, &gss_cred->gc_base.cr_flags);
>  		gss_cred_set_ctx(&gss_cred->gc_base, gss_msg->ctx);
>  		break;
> -	case -EKEYEXPIRED:
> +	default:
>  		set_bit(RPCAUTH_CRED_NEGATIVE, &gss_cred->gc_base.cr_flags);
>  	}
>  	gss_cred->gc_upcall_timestamp = jiffies;

What about the rpc_pipefs errors, EAGAIN, EPIPE and ETIMEDOUT? Why
should they result in the cred being marked as negative?

rpc.gssd itself will only pass down 3 errors: 0, EKEYEXPIRED and EACCES.
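
One way to read the two points above is as a split between verdicts
from the daemon and transport errors from rpc_pipefs. The sketch below
shows what such a classification could look like in userspace C. The
helper name is hypothetical, and this is an illustration of the
distinction rather than a proposed kernel change.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#ifndef EKEYEXPIRED
#define EKEYEXPIRED 127 /* Linux value; fallback for other platforms */
#endif

/* Hypothetical helper, not an existing kernel function. */
static bool verdict_should_negative_cache(int err)
{
    assert(err <= 0); /* kernel style: 0 or a negative errno */
    switch (err) {
    case -EKEYEXPIRED: /* daemon verdict */
    case -EACCES:      /* daemon verdict */
        return true;
    case -EAGAIN:      /* rpc_pipefs error */
    case -EPIPE:       /* rpc_pipefs error */
    case -ETIMEDOUT:   /* rpc_pipefs error */
    default:
        return false;
    }
}
```

Under this split, success and any error code not listed are left
uncached.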

Trond