LinuxLists.cc - NFS v4 + kerberos: 4 minute window of slowness

2020-05-05 17:37:02

Subject: NFS v4 + kerberos: 4 minute window of slowness

Hi,

Currently the last 4 minutes before kernel gss context expires, all writes
to NFSv4 are synchronous and all dirty pages associated with the file being
written to are being destaged.
This will continue for the 4 minutes until the context expires, at which
point it gets refreshed and everything gets back to normal.

The rpc.gssd by default sets the timeout to match the Kerberos service
ticket, but this can be lowered by using -t option.
In fact many sites set it to lower value, like for example 30 minutes.
This means that every 30 minutes, the last 4 minutes results in severely
slower writes (assuming these are buffered - no O_DSYNC, etc.).

In extreme case, when one sets the timeout to 5 minutes, during the 4
minutes out of the minutes, there will be the slowness observed.

I understand the idea behind this mechanism - I guess it tries to avoid
situation when a gss context can't be refreshed (due to error or account
disabled, etc.),
and it expires suddenly nfs client wouldn't be able to destage all the
buffered writes. The way it is currently implemented though is rather crude.
In my opinion, instead of making everything slow for the whole 4 minutes, it
should first try to refresh the context immediately and if successful things
go back to normal,
if it can't refresh the context then it should continue with the previous
one and revert to the current behaviour. I implemented a na?ve quick fix
which does exactly that
(attached at the end of this email).

How to re-produce.

$ uname -r
5.7.0-rc4+

$ grep -- -t /etc/sysconfig/nfs
RPCGSSDARGS="-t 300"

I'm setting it to 5 minutes so I can quickly see the behaviour without
having to wait for too long.

Now, let's generate a small write every 10s to a file on nfsv4,sec=krb5
filesystem and record how long each write takes.
Since these are buffered writes it should be very quick most of the time.

$ while [ 1 ]; do strace -qq -tT -v -e trace=write /bin/echo aa >f1; rm f1;
sleep 10; done

15:22:41 write(1, "aa\n", 3) = 3 <0.000108>
15:22:51 write(1, "aa\n", 3) = 3 <0.000113>
15:23:01 write(1, "aa\n", 3) = 3 <0.000111>
15:23:11 write(1, "aa\n", 3) = 3 <0.000112>
15:23:21 write(1, "aa\n", 3) = 3 <0.001510> <<<<<< becomes
slow
15:23:31 write(1, "aa\n", 3) = 3 <0.001622>
15:23:41 write(1, "aa\n", 3) = 3 <0.001553>
15:23:51 write(1, "aa\n", 3) = 3 <0.001495>
...
15:27:01 write(1, "aa\n", 3) = 3 <0.001528>
15:27:12 write(1, "aa\n", 3) = 3 <0.001553>
15:27:22 write(1, "aa\n", 3) = 3 <0.000104> <<<<<< becomes
fast again
15:27:32 write(1, "aa\n", 3) = 3 <0.000125>
15:27:42 write(1, "aa\n", 3) = 3 <0.000129>
15:27:52 write(1, "aa\n", 3) = 3 <0.000113>
15:28:02 write(1, "aa\n", 3) = 3 <0.000112>
15:28:12 write(1, "aa\n", 3) = 3 <0.000112>
15:28:22 write(1, "aa\n", 3) = 3 <0.001510> <<<<<< slow
...
15:32:02 write(1, "aa\n", 3) = 3 <0.001501>
15:32:12 write(1, "aa\n", 3) = 3 <0.001440>
15:32:22 write(1, "aa\n", 3) = 3 <0.000136> <<<<<< fast
15:32:32 write(1, "aa\n", 3) = 3 <0.000109>
15:32:42 write(1, "aa\n", 3) = 3 <0.000110>
15:32:52 write(1, "aa\n", 3) = 3 <0.000112>
15:33:02 write(1, "aa\n", 3) = 3 <0.000103>
15:33:12 write(1, "aa\n", 3) = 3 <0.000112>
15:33:22 write(1, "aa\n", 3) = 3 <0.001405> <<<<<< slow
15:33:32 write(1, "aa\n", 3) = 3 <0.001393>
15:33:42 write(1, "aa\n", 3) = 3 <0.001479>
...

So we have 4 minute long windows of slowness followed by 1 minute window
when writes are fast.

15:23:21 - 15:27:22 slow
15:27:22 - 15:28:22 fast
15:28:22 - 15:32:22 slow
15:32:22 - 15:33:22 fast

After some tracing with systemtap and looking at the source code, I found
where the issue is coming from.
The nfs_file_write() function ends up calling nfs_ctx_key_to_expire() on
each write, which in turn calls gss_key_timeout() which
has hard-coded value of 240s (GSS_KEY_EXPIRE_TIMEO/gss_key_expire_timeo).

nfs_file_write()
...
result = nfs_key_timeout_notify(file, inode);
if (result)
return result;
...
if (nfs_need_check_write(file, inode)) {
int err = nfs_wb_all(inode);
...

/*
* Avoid buffered writes when a open context credential's key would
* expire soon.
*
* Returns -EACCES if the key will expire within RPC_KEY_EXPIRE_FAIL.
*
* Return 0 and set a credential flag which triggers the inode to flush
* and performs NFS_FILE_SYNC writes if the key will expired within
* RPC_KEY_EXPIRE_TIMEO.
*/
int
nfs_key_timeout_notify(struct file *filp, struct inode *inode)
{
struct nfs_open_context *ctx = nfs_file_open_context(filp);

if (nfs_ctx_key_to_expire(ctx, inode) &&
!ctx->ll_cred)
/* Already expired! */
return -EACCES;
return 0;
}

nfs_need_check_write()
...
if (nfs_ctx_key_to_expire(ctx, inode))
return 1;
return 0;

nfs_write_end()
...
if (nfs_ctx_key_to_expire(ctx, mapping->host)) {
status = nfs_wb_all(mapping->host);
...

/*
* Test if the open context credential key is marked to expire soon.
*/
bool nfs_ctx_key_to_expire(struct nfs_open_context *ctx, struct inode
*inode)
{
struct rpc_auth *auth = NFS_SERVER(inode)->client->cl_auth;
struct rpc_cred *cred = ctx->ll_cred;
struct auth_cred acred = {
.cred = ctx->cred,
};

if (cred && !cred->cr_ops->crmatch(&acred, cred, 0)) {
put_rpccred(cred);
ctx->ll_cred = NULL;
cred = NULL;
}
if (!cred)
cred = auth->au_ops->lookup_cred(auth, &acred, 0);
if (!cred || IS_ERR(cred))
return true;
ctx->ll_cred = cred;
return !!(cred->cr_ops->crkey_timeout &&
cred->cr_ops->crkey_timeout(cred));
}

net/sunrpc/auth_gss/auth_gss.c: .crkey_timeout = gss_key_timeout,

/*
* Returns -EACCES if GSS context is NULL or will expire within the
* timeout (miliseconds)
*/
static int
gss_key_timeout(struct rpc_cred *rc)
{
struct gss_cred *gss_cred = container_of(rc, struct gss_cred,
gc_base);
struct gss_cl_ctx *ctx;
unsigned long timeout = jiffies + (gss_key_expire_timeo * HZ);
int ret = 0;

rcu_read_lock();
ctx = rcu_dereference(gss_cred->gc_ctx);
if (!ctx || time_after(timeout, ctx->gc_expiry))
ret = -EACCES;
rcu_read_unlock();

return ret;
}

#define GSS_KEY_EXPIRE_TIMEO 240
static unsigned int gss_key_expire_timeo = GSS_KEY_EXPIRE_TIMEO;

A na?ve attempt at a fix:

diff --git a/net/sunrpc/auth_gss/auth_gss.c b/net/sunrpc/auth_gss/auth_gss.c
index 25fbd8d9de74..864661bdfdf3 100644
--- a/net/sunrpc/auth_gss/auth_gss.c
+++ b/net/sunrpc/auth_gss/auth_gss.c
@@ -1477,6 +1477,8 @@ gss_key_timeout(struct rpc_cred *rc)

rcu_read_lock();
ctx = rcu_dereference(gss_cred->gc_ctx);
+ if (ctx && time_after(timeout + (60 * HZ), ctx->gc_expiry))
+ clear_bit(RPCAUTH_CRED_UPTODATE, &rc->cr_flags);
if (!ctx || time_after(timeout, ctx->gc_expiry))
ret = -EACCES;
rcu_read_unlock();

With the above patch, if there is a write within 300s before a context is to
expire (use RPCGSSDARGS="-t 400" or any value larger than 300 to test),
it will now try to refresh the context and if successful then writes will be
fast again (assuming -t option is >300s and/or krb ticket is valid for more
than 300s).

What I haven't tested nor analysed code is what would happen if it now fails
to refresh the context, but after a quick glance at gss_refresh() it does
seem it would continue
using the previous cred...

Is this the correct approach to fix the issue, or can you suggest some other
approach?

--
Robert Milkowski

2020-05-14 20:39:54

by Robert Milkowski

[permalink] [raw]

Subject: RE: NFS v4 + kerberos: 4 minute window of slowness

+Trond, Chuck, Anna

Ping...

> -----Original Message-----
> From: Robert Milkowski <[email protected]>
> Sent: 05 May 2020 18:35
> To: [email protected]
> Subject: NFS v4 + kerberos: 4 minute window of slowness
>
> Hi,
>
> Currently the last 4 minutes before kernel gss context expires, all
> writes to NFSv4 are synchronous and all dirty pages associated with the
> file being written to are being destaged.
> This will continue for the 4 minutes until the context expires, at which
> point it gets refreshed and everything gets back to normal.
>
> The rpc.gssd by default sets the timeout to match the Kerberos service
> ticket, but this can be lowered by using -t option.
> In fact many sites set it to lower value, like for example 30 minutes.
> This means that every 30 minutes, the last 4 minutes results in severely
> slower writes (assuming these are buffered - no O_DSYNC, etc.).
>
> In extreme case, when one sets the timeout to 5 minutes, during the 4
> minutes out of the minutes, there will be the slowness observed.
>
>
> I understand the idea behind this mechanism - I guess it tries to avoid
> situation when a gss context can't be refreshed (due to error or account
> disabled, etc.), and it expires suddenly nfs client wouldn't be able to
> destage all the buffered writes. The way it is currently implemented
> though is rather crude.
> In my opinion, instead of making everything slow for the whole 4
> minutes, it should first try to refresh the context immediately and if
> successful things go back to normal, if it can't refresh the context
> then it should continue with the previous one and revert to the current
> behaviour. I implemented a na?ve quick fix which does exactly that
> (attached at the end of this email).
>
>
> How to re-produce.
>
>
> $ uname -r
> 5.7.0-rc4+
>
> $ grep -- -t /etc/sysconfig/nfs
> RPCGSSDARGS="-t 300"
>
> I'm setting it to 5 minutes so I can quickly see the behaviour without
> having to wait for too long.
>
>
> Now, let's generate a small write every 10s to a file on nfsv4,sec=krb5
> filesystem and record how long each write takes.
> Since these are buffered writes it should be very quick most of the
> time.
>
> $ while [ 1 ]; do strace -qq -tT -v -e trace=write /bin/echo aa >f1; rm
> f1; sleep 10; done
>
> 15:22:41 write(1, "aa\n", 3) = 3 <0.000108>
> 15:22:51 write(1, "aa\n", 3) = 3 <0.000113>
> 15:23:01 write(1, "aa\n", 3) = 3 <0.000111>
> 15:23:11 write(1, "aa\n", 3) = 3 <0.000112>
> 15:23:21 write(1, "aa\n", 3) = 3 <0.001510> <<<<<<
> becomes
> slow
> 15:23:31 write(1, "aa\n", 3) = 3 <0.001622>
> 15:23:41 write(1, "aa\n", 3) = 3 <0.001553>
> 15:23:51 write(1, "aa\n", 3) = 3 <0.001495>
> ...
> 15:27:01 write(1, "aa\n", 3) = 3 <0.001528>
> 15:27:12 write(1, "aa\n", 3) = 3 <0.001553>
> 15:27:22 write(1, "aa\n", 3) = 3 <0.000104> <<<<<<
> becomes
> fast again
> 15:27:32 write(1, "aa\n", 3) = 3 <0.000125>
> 15:27:42 write(1, "aa\n", 3) = 3 <0.000129>
> 15:27:52 write(1, "aa\n", 3) = 3 <0.000113>
> 15:28:02 write(1, "aa\n", 3) = 3 <0.000112>
> 15:28:12 write(1, "aa\n", 3) = 3 <0.000112>
> 15:28:22 write(1, "aa\n", 3) = 3 <0.001510> <<<<<< slow
> ...
> 15:32:02 write(1, "aa\n", 3) = 3 <0.001501>
> 15:32:12 write(1, "aa\n", 3) = 3 <0.001440>
> 15:32:22 write(1, "aa\n", 3) = 3 <0.000136> <<<<<< fast
> 15:32:32 write(1, "aa\n", 3) = 3 <0.000109>
> 15:32:42 write(1, "aa\n", 3) = 3 <0.000110>
> 15:32:52 write(1, "aa\n", 3) = 3 <0.000112>
> 15:33:02 write(1, "aa\n", 3) = 3 <0.000103>
> 15:33:12 write(1, "aa\n", 3) = 3 <0.000112>
> 15:33:22 write(1, "aa\n", 3) = 3 <0.001405> <<<<<< slow
> 15:33:32 write(1, "aa\n", 3) = 3 <0.001393>
> 15:33:42 write(1, "aa\n", 3) = 3 <0.001479>
> ...
>
>
>
> So we have 4 minute long windows of slowness followed by 1 minute window
> when writes are fast.
>
> 15:23:21 - 15:27:22 slow
> 15:27:22 - 15:28:22 fast
> 15:28:22 - 15:32:22 slow
> 15:32:22 - 15:33:22 fast
>
>
>
> After some tracing with systemtap and looking at the source code, I
> found where the issue is coming from.
> The nfs_file_write() function ends up calling nfs_ctx_key_to_expire() on
> each write, which in turn calls gss_key_timeout() which has hard-coded
> value of 240s (GSS_KEY_EXPIRE_TIMEO/gss_key_expire_timeo).
>
>
> nfs_file_write()
> ...
> result = nfs_key_timeout_notify(file, inode);
> if (result)
> return result;
> ...
> if (nfs_need_check_write(file, inode)) {
> int err = nfs_wb_all(inode); ...
>
>
> /*
> * Avoid buffered writes when a open context credential's key would
> * expire soon.
> *
> * Returns -EACCES if the key will expire within RPC_KEY_EXPIRE_FAIL.
> *
> * Return 0 and set a credential flag which triggers the inode to flush
> * and performs NFS_FILE_SYNC writes if the key will expired within
> * RPC_KEY_EXPIRE_TIMEO.
> */
> int
> nfs_key_timeout_notify(struct file *filp, struct inode *inode) {
> struct nfs_open_context *ctx = nfs_file_open_context(filp);
>
> if (nfs_ctx_key_to_expire(ctx, inode) &&
> !ctx->ll_cred)
> /* Already expired! */
> return -EACCES;
> return 0;
> }
>
>
> nfs_need_check_write()
> ...
> if (nfs_ctx_key_to_expire(ctx, inode))
> return 1;
> return 0;
>
>
>
> nfs_write_end()
> ...
> if (nfs_ctx_key_to_expire(ctx, mapping->host)) {
> status = nfs_wb_all(mapping->host); ...
>
>
>
> /*
> * Test if the open context credential key is marked to expire soon.
> */
> bool nfs_ctx_key_to_expire(struct nfs_open_context *ctx, struct inode
> *inode)
> {
> struct rpc_auth *auth = NFS_SERVER(inode)->client->cl_auth;
> struct rpc_cred *cred = ctx->ll_cred;
> struct auth_cred acred = {
> .cred = ctx->cred,
> };
>
> if (cred && !cred->cr_ops->crmatch(&acred, cred, 0)) {
> put_rpccred(cred);
> ctx->ll_cred = NULL;
> cred = NULL;
> }
> if (!cred)
> cred = auth->au_ops->lookup_cred(auth, &acred, 0);
> if (!cred || IS_ERR(cred))
> return true;
> ctx->ll_cred = cred;
> return !!(cred->cr_ops->crkey_timeout &&
> cred->cr_ops->crkey_timeout(cred));
> }
>
>
>
> net/sunrpc/auth_gss/auth_gss.c: .crkey_timeout =
> gss_key_timeout,
>
>
> /*
> * Returns -EACCES if GSS context is NULL or will expire within the
> * timeout (miliseconds)
> */
> static int
> gss_key_timeout(struct rpc_cred *rc)
> {
> struct gss_cred *gss_cred = container_of(rc, struct gss_cred,
> gc_base);
> struct gss_cl_ctx *ctx;
> unsigned long timeout = jiffies + (gss_key_expire_timeo * HZ);
> int ret = 0;
>
> rcu_read_lock();
> ctx = rcu_dereference(gss_cred->gc_ctx);
> if (!ctx || time_after(timeout, ctx->gc_expiry))
> ret = -EACCES;
> rcu_read_unlock();
>
> return ret;
> }
>
>
>
>
> #define GSS_KEY_EXPIRE_TIMEO 240
> static unsigned int gss_key_expire_timeo = GSS_KEY_EXPIRE_TIMEO;
>
>
>
>
>
> A na?ve attempt at a fix:
>
>
> diff --git a/net/sunrpc/auth_gss/auth_gss.c
> b/net/sunrpc/auth_gss/auth_gss.c index 25fbd8d9de74..864661bdfdf3 100644
> --- a/net/sunrpc/auth_gss/auth_gss.c
> +++ b/net/sunrpc/auth_gss/auth_gss.c
> @@ -1477,6 +1477,8 @@ gss_key_timeout(struct rpc_cred *rc)
>
> rcu_read_lock();
> ctx = rcu_dereference(gss_cred->gc_ctx);
> + if (ctx && time_after(timeout + (60 * HZ), ctx->gc_expiry))
> + clear_bit(RPCAUTH_CRED_UPTODATE, &rc->cr_flags);
> if (!ctx || time_after(timeout, ctx->gc_expiry))
> ret = -EACCES;
> rcu_read_unlock();
>
>
>
>
> With the above patch, if there is a write within 300s before a context
> is to expire (use RPCGSSDARGS="-t 400" or any value larger than 300 to
> test), it will now try to refresh the context and if successful then
> writes will be fast again (assuming -t option is >300s and/or krb ticket
> is valid for more than 300s).
>
> What I haven't tested nor analysed code is what would happen if it now
> fails to refresh the context, but after a quick glance at gss_refresh()
> it does seem it would continue using the previous cred...
>
> Is this the correct approach to fix the issue, or can you suggest some
> other approach?
>
>
>
> --
> Robert Milkowski

2020-05-20 10:49:07

by Robert Milkowski

[permalink] [raw]

Subject: RE: NFS v4 + kerberos: 4 minute window of slowness

Polite ping...

> -----Original Message-----
> From: Robert Milkowski <[email protected]>
> Sent: 14 May 2020 21:39
> To: [email protected]; [email protected]; 'Chuck
> Lever' <[email protected]>; 'Schumaker, Anna'
> <[email protected]>
> Subject: RE: NFS v4 + kerberos: 4 minute window of slowness
>
> +Trond, Chuck, Anna
>
>
> Ping...
>
> > -----Original Message-----
> > From: Robert Milkowski <[email protected]>
> > Sent: 05 May 2020 18:35
> > To: [email protected]
> > Subject: NFS v4 + kerberos: 4 minute window of slowness
> >
> > Hi,
> >
> > Currently the last 4 minutes before kernel gss context expires, all
> > writes to NFSv4 are synchronous and all dirty pages associated with
> > the file being written to are being destaged.
> > This will continue for the 4 minutes until the context expires, at
> > which point it gets refreshed and everything gets back to normal.
> >
> > The rpc.gssd by default sets the timeout to match the Kerberos service
> > ticket, but this can be lowered by using -t option.
> > In fact many sites set it to lower value, like for example 30 minutes.
> > This means that every 30 minutes, the last 4 minutes results in
> > severely slower writes (assuming these are buffered - no O_DSYNC,
> etc.).
> >
> > In extreme case, when one sets the timeout to 5 minutes, during the 4
> > minutes out of the minutes, there will be the slowness observed.
> >
> >
> > I understand the idea behind this mechanism - I guess it tries to
> > avoid situation when a gss context can't be refreshed (due to error or
> > account disabled, etc.), and it expires suddenly nfs client wouldn't
> > be able to destage all the buffered writes. The way it is currently
> > implemented though is rather crude.
> > In my opinion, instead of making everything slow for the whole 4
> > minutes, it should first try to refresh the context immediately and if
> > successful things go back to normal, if it can't refresh the context
> > then it should continue with the previous one and revert to the
> > current behaviour. I implemented a na?ve quick fix which does exactly
> > that (attached at the end of this email).
> >
> >
> > How to re-produce.
> >
> >
> > $ uname -r
> > 5.7.0-rc4+
> >
> > $ grep -- -t /etc/sysconfig/nfs
> > RPCGSSDARGS="-t 300"
> >
> > I'm setting it to 5 minutes so I can quickly see the behaviour without
> > having to wait for too long.
> >
> >
> > Now, let's generate a small write every 10s to a file on
> > nfsv4,sec=krb5 filesystem and record how long each write takes.
> > Since these are buffered writes it should be very quick most of the
> > time.
> >
> > $ while [ 1 ]; do strace -qq -tT -v -e trace=write /bin/echo aa >f1;
> > rm f1; sleep 10; done
> >
> > 15:22:41 write(1, "aa\n", 3) = 3 <0.000108>
> > 15:22:51 write(1, "aa\n", 3) = 3 <0.000113>
> > 15:23:01 write(1, "aa\n", 3) = 3 <0.000111>
> > 15:23:11 write(1, "aa\n", 3) = 3 <0.000112>
> > 15:23:21 write(1, "aa\n", 3) = 3 <0.001510> <<<<<<
> > becomes
> > slow
> > 15:23:31 write(1, "aa\n", 3) = 3 <0.001622>
> > 15:23:41 write(1, "aa\n", 3) = 3 <0.001553>
> > 15:23:51 write(1, "aa\n", 3) = 3 <0.001495>
> > ...
> > 15:27:01 write(1, "aa\n", 3) = 3 <0.001528>
> > 15:27:12 write(1, "aa\n", 3) = 3 <0.001553>
> > 15:27:22 write(1, "aa\n", 3) = 3 <0.000104> <<<<<<
> > becomes
> > fast again
> > 15:27:32 write(1, "aa\n", 3) = 3 <0.000125>
> > 15:27:42 write(1, "aa\n", 3) = 3 <0.000129>
> > 15:27:52 write(1, "aa\n", 3) = 3 <0.000113>
> > 15:28:02 write(1, "aa\n", 3) = 3 <0.000112>
> > 15:28:12 write(1, "aa\n", 3) = 3 <0.000112>
> > 15:28:22 write(1, "aa\n", 3) = 3 <0.001510> <<<<<< slow
> > ...
> > 15:32:02 write(1, "aa\n", 3) = 3 <0.001501>
> > 15:32:12 write(1, "aa\n", 3) = 3 <0.001440>
> > 15:32:22 write(1, "aa\n", 3) = 3 <0.000136> <<<<<< fast
> > 15:32:32 write(1, "aa\n", 3) = 3 <0.000109>
> > 15:32:42 write(1, "aa\n", 3) = 3 <0.000110>
> > 15:32:52 write(1, "aa\n", 3) = 3 <0.000112>
> > 15:33:02 write(1, "aa\n", 3) = 3 <0.000103>
> > 15:33:12 write(1, "aa\n", 3) = 3 <0.000112>
> > 15:33:22 write(1, "aa\n", 3) = 3 <0.001405> <<<<<< slow
> > 15:33:32 write(1, "aa\n", 3) = 3 <0.001393>
> > 15:33:42 write(1, "aa\n", 3) = 3 <0.001479>
> > ...
> >
> >
> >
> > So we have 4 minute long windows of slowness followed by 1 minute
> > window when writes are fast.
> >
> > 15:23:21 - 15:27:22 slow
> > 15:27:22 - 15:28:22 fast
> > 15:28:22 - 15:32:22 slow
> > 15:32:22 - 15:33:22 fast
> >
> >
> >
> > After some tracing with systemtap and looking at the source code, I
> > found where the issue is coming from.
> > The nfs_file_write() function ends up calling nfs_ctx_key_to_expire()
> > on each write, which in turn calls gss_key_timeout() which has
> > hard-coded value of 240s (GSS_KEY_EXPIRE_TIMEO/gss_key_expire_timeo).
> >
> >
> > nfs_file_write()
> > ...
> > result = nfs_key_timeout_notify(file, inode);
> > if (result)
> > return result;
> > ...
> > if (nfs_need_check_write(file, inode)) {
> > int err = nfs_wb_all(inode); ...
> >
> >
> > /*
> > * Avoid buffered writes when a open context credential's key would
> > * expire soon.
> > *
> > * Returns -EACCES if the key will expire within RPC_KEY_EXPIRE_FAIL.
> > *
> > * Return 0 and set a credential flag which triggers the inode to
> > flush
> > * and performs NFS_FILE_SYNC writes if the key will expired within
> > * RPC_KEY_EXPIRE_TIMEO.
> > */
> > int
> > nfs_key_timeout_notify(struct file *filp, struct inode *inode) {
> > struct nfs_open_context *ctx = nfs_file_open_context(filp);
> >
> > if (nfs_ctx_key_to_expire(ctx, inode) &&
> > !ctx->ll_cred)
> > /* Already expired! */
> > return -EACCES;
> > return 0;
> > }
> >
> >
> > nfs_need_check_write()
> > ...
> > if (nfs_ctx_key_to_expire(ctx, inode))
> > return 1;
> > return 0;
> >
> >
> >
> > nfs_write_end()
> > ...
> > if (nfs_ctx_key_to_expire(ctx, mapping->host)) {
> > status = nfs_wb_all(mapping->host); ...
> >
> >
> >
> > /*
> > * Test if the open context credential key is marked to expire soon.
> > */
> > bool nfs_ctx_key_to_expire(struct nfs_open_context *ctx, struct inode
> > *inode)
> > {
> > struct rpc_auth *auth = NFS_SERVER(inode)->client->cl_auth;
> > struct rpc_cred *cred = ctx->ll_cred;
> > struct auth_cred acred = {
> > .cred = ctx->cred,
> > };
> >
> > if (cred && !cred->cr_ops->crmatch(&acred, cred, 0)) {
> > put_rpccred(cred);
> > ctx->ll_cred = NULL;
> > cred = NULL;
> > }
> > if (!cred)
> > cred = auth->au_ops->lookup_cred(auth, &acred, 0);
> > if (!cred || IS_ERR(cred))
> > return true;
> > ctx->ll_cred = cred;
> > return !!(cred->cr_ops->crkey_timeout &&
> > cred->cr_ops->crkey_timeout(cred));
> > }
> >
> >
> >
> > net/sunrpc/auth_gss/auth_gss.c: .crkey_timeout =
> > gss_key_timeout,
> >
> >
> > /*
> > * Returns -EACCES if GSS context is NULL or will expire within the
> > * timeout (miliseconds)
> > */
> > static int
> > gss_key_timeout(struct rpc_cred *rc)
> > {
> > struct gss_cred *gss_cred = container_of(rc, struct gss_cred,
> > gc_base);
> > struct gss_cl_ctx *ctx;
> > unsigned long timeout = jiffies + (gss_key_expire_timeo * HZ);
> > int ret = 0;
> >
> > rcu_read_lock();
> > ctx = rcu_dereference(gss_cred->gc_ctx);
> > if (!ctx || time_after(timeout, ctx->gc_expiry))
> > ret = -EACCES;
> > rcu_read_unlock();
> >
> > return ret;
> > }
> >
> >
> >
> >
> > #define GSS_KEY_EXPIRE_TIMEO 240
> > static unsigned int gss_key_expire_timeo = GSS_KEY_EXPIRE_TIMEO;
> >
> >
> >
> >
> >
> > A na?ve attempt at a fix:
> >
> >
> > diff --git a/net/sunrpc/auth_gss/auth_gss.c
> > b/net/sunrpc/auth_gss/auth_gss.c index 25fbd8d9de74..864661bdfdf3
> > 100644
> > --- a/net/sunrpc/auth_gss/auth_gss.c
> > +++ b/net/sunrpc/auth_gss/auth_gss.c
> > @@ -1477,6 +1477,8 @@ gss_key_timeout(struct rpc_cred *rc)
> >
> > rcu_read_lock();
> > ctx = rcu_dereference(gss_cred->gc_ctx);
> > + if (ctx && time_after(timeout + (60 * HZ), ctx->gc_expiry))
> > + clear_bit(RPCAUTH_CRED_UPTODATE, &rc->cr_flags);
> > if (!ctx || time_after(timeout, ctx->gc_expiry))
> > ret = -EACCES;
> > rcu_read_unlock();
> >
> >
> >
> >
> > With the above patch, if there is a write within 300s before a context
> > is to expire (use RPCGSSDARGS="-t 400" or any value larger than 300 to
> > test), it will now try to refresh the context and if successful then
> > writes will be fast again (assuming -t option is >300s and/or krb
> > ticket is valid for more than 300s).
> >
> > What I haven't tested nor analysed code is what would happen if it now
> > fails to refresh the context, but after a quick glance at
> > gss_refresh() it does seem it would continue using the previous
> cred...
> >
> > Is this the correct approach to fix the issue, or can you suggest some
> > other approach?
> >
> >
> >
> > --
> > Robert Milkowski
>

2021-10-22 18:24:43

by Dan Mezhiborsky

[permalink] [raw]

Subject: Re: NFS v4 + kerberos: 4 minute window of slowness

On Wed, 2020-05-20 at 11:48 +0100, Robert Milkowski wrote:
> Polite ping...
>
> > -----Original Message-----
> > From: Robert Milkowski <[email protected]>
> > Sent: 14 May 2020 21:39
> > To: [email protected]; [email protected];
> > 'Chuck
> > Lever' <[email protected]>; 'Schumaker, Anna'
> > <[email protected]>
> > Subject: RE: NFS v4 + kerberos: 4 minute window of slowness
> >
> > +Trond, Chuck, Anna
> >
> >
> > Ping...
> >
> > > -----Original Message-----
> > > From: Robert Milkowski <[email protected]>
> > > Sent: 05 May 2020 18:35
> > > To: [email protected]
> > > Subject: NFS v4 + kerberos: 4 minute window of slowness
> > >
> > > Hi,
> > >
> > > Currently the last 4 minutes before kernel gss context expires,
> > > all
> > > writes to NFSv4 are synchronous and all dirty pages associated
> > > with
> > > the file being written to are being destaged.
> > > This will continue for the 4 minutes until the context expires,
> > > at
> > > which point it gets refreshed and everything gets back to normal.
> > >
> > > The rpc.gssd by default sets the timeout to match the Kerberos
> > > service
> > > ticket, but this can be lowered by using -t option.
> > > In fact many sites set it to lower value, like for example 30
> > > minutes.
> > > This means that every 30 minutes, the last 4 minutes results in
> > > severely slower writes (assuming these are buffered - no O_DSYNC,
> > etc.).
> > > In extreme case, when one sets the timeout to 5 minutes, during
> > > the 4
> > > minutes out of the minutes, there will be the slowness observed.
> > >
> > >
> > > I understand the idea behind this mechanism - I guess it tries to
> > > avoid situation when a gss context can't be refreshed (due to
> > > error or
> > > account disabled, etc.), and it expires suddenly nfs client
> > > wouldn't
> > > be able to destage all the buffered writes. The way it is
> > > currently
> > > implemented though is rather crude.
> > > In my opinion, instead of making everything slow for the whole 4
> > > minutes, it should first try to refresh the context immediately
> > > and if
> > > successful things go back to normal, if it can't refresh the
> > > context
> > > then it should continue with the previous one and revert to the
> > > current behaviour. I implemented a naïve quick fix which does
> > > exactly
> > > that (attached at the end of this email).
> > >
> > >
> > > How to re-produce.
> > >
> > >
> > > $ uname -r
> > > 5.7.0-rc4+
> > >
> > > $ grep -- -t /etc/sysconfig/nfs
> > > RPCGSSDARGS="-t 300"
> > >
> > > I'm setting it to 5 minutes so I can quickly see the behaviour
> > > without
> > > having to wait for too long.
> > >
> > >
> > > Now, let's generate a small write every 10s to a file on
> > > nfsv4,sec=krb5 filesystem and record how long each write takes.
> > > Since these are buffered writes it should be very quick most of
> > > the
> > > time.
> > >
> > > $ while [ 1 ]; do strace -qq -tT -v -e trace=write /bin/echo aa
> > > >f1;
> > > rm f1; sleep 10; done
> > >
> > > 15:22:41 write(1, "aa\n", 3) = 3 <0.000108>
> > > 15:22:51 write(1, "aa\n", 3) = 3 <0.000113>
> > > 15:23:01 write(1, "aa\n", 3) = 3 <0.000111>
> > > 15:23:11 write(1, "aa\n", 3) = 3 <0.000112>
> > > 15:23:21 write(1, "aa\n", 3) = 3 <0.001510> <<<<<<
> > > becomes
> > > slow
> > > 15:23:31 write(1, "aa\n", 3) = 3 <0.001622>
> > > 15:23:41 write(1, "aa\n", 3) = 3 <0.001553>
> > > 15:23:51 write(1, "aa\n", 3) = 3 <0.001495>
> > > ...
> > > 15:27:01 write(1, "aa\n", 3) = 3 <0.001528>
> > > 15:27:12 write(1, "aa\n", 3) = 3 <0.001553>
> > > 15:27:22 write(1, "aa\n", 3) = 3 <0.000104> <<<<<<
> > > becomes
> > > fast again
> > > 15:27:32 write(1, "aa\n", 3) = 3 <0.000125>
> > > 15:27:42 write(1, "aa\n", 3) = 3 <0.000129>
> > > 15:27:52 write(1, "aa\n", 3) = 3 <0.000113>
> > > 15:28:02 write(1, "aa\n", 3) = 3 <0.000112>
> > > 15:28:12 write(1, "aa\n", 3) = 3 <0.000112>
> > > 15:28:22 write(1, "aa\n", 3) = 3 <0.001510> <<<<<<
> > > slow
> > > ...
> > > 15:32:02 write(1, "aa\n", 3) = 3 <0.001501>
> > > 15:32:12 write(1, "aa\n", 3) = 3 <0.001440>
> > > 15:32:22 write(1, "aa\n", 3) = 3 <0.000136> <<<<<<
> > > fast
> > > 15:32:32 write(1, "aa\n", 3) = 3 <0.000109>
> > > 15:32:42 write(1, "aa\n", 3) = 3 <0.000110>
> > > 15:32:52 write(1, "aa\n", 3) = 3 <0.000112>
> > > 15:33:02 write(1, "aa\n", 3) = 3 <0.000103>
> > > 15:33:12 write(1, "aa\n", 3) = 3 <0.000112>
> > > 15:33:22 write(1, "aa\n", 3) = 3 <0.001405> <<<<<<
> > > slow
> > > 15:33:32 write(1, "aa\n", 3) = 3 <0.001393>
> > > 15:33:42 write(1, "aa\n", 3) = 3 <0.001479>
> > > ...
> > >
> > >
> > >
> > > So we have 4 minute long windows of slowness followed by 1 minute
> > > window when writes are fast.
> > >
> > > 15:23:21 - 15:27:22 slow
> > > 15:27:22 - 15:28:22 fast
> > > 15:28:22 - 15:32:22 slow
> > > 15:32:22 - 15:33:22 fast
> > >
> > >
> > >
> > > After some tracing with systemtap and looking at the source code,
> > > I
> > > found where the issue is coming from.
> > > The nfs_file_write() function ends up calling
> > > nfs_ctx_key_to_expire()
> > > on each write, which in turn calls gss_key_timeout() which has
> > > hard-coded value of 240s
> > > (GSS_KEY_EXPIRE_TIMEO/gss_key_expire_timeo).
> > >
> > >
> > > nfs_file_write()
> > > ...
> > > result = nfs_key_timeout_notify(file, inode);
> > > if (result)
> > > return result;
> > > ...
> > > if (nfs_need_check_write(file, inode)) {
> > > int err = nfs_wb_all(inode); ...
> > >
> > >
> > > /*
> > > * Avoid buffered writes when a open context credential's key
> > > would
> > > * expire soon.
> > > *
> > > * Returns -EACCES if the key will expire within
> > > RPC_KEY_EXPIRE_FAIL.
> > > *
> > > * Return 0 and set a credential flag which triggers the inode to
> > > flush
> > > * and performs NFS_FILE_SYNC writes if the key will expired
> > > within
> > > * RPC_KEY_EXPIRE_TIMEO.
> > > */
> > > int
> > > nfs_key_timeout_notify(struct file *filp, struct inode *inode) {
> > > struct nfs_open_context *ctx =
> > > nfs_file_open_context(filp);
> > >
> > > if (nfs_ctx_key_to_expire(ctx, inode) &&
> > > !ctx->ll_cred)
> > > /* Already expired! */
> > > return -EACCES;
> > > return 0;
> > > }
> > >
> > >
> > > nfs_need_check_write()
> > > ...
> > > if (nfs_ctx_key_to_expire(ctx, inode))
> > > return 1;
> > > return 0;
> > >
> > >
> > >
> > > nfs_write_end()
> > > ...
> > > if (nfs_ctx_key_to_expire(ctx, mapping->host)) {
> > > status = nfs_wb_all(mapping->host); ...
> > >
> > >
> > >
> > > /*
> > > * Test if the open context credential key is marked to expire
> > > soon.
> > > */
> > > bool nfs_ctx_key_to_expire(struct nfs_open_context *ctx, struct
> > > inode
> > > *inode)
> > > {
> > > struct rpc_auth *auth = NFS_SERVER(inode)->client-
> > > >cl_auth;
> > > struct rpc_cred *cred = ctx->ll_cred;
> > > struct auth_cred acred = {
> > > .cred = ctx->cred,
> > > };
> > >
> > > if (cred && !cred->cr_ops->crmatch(&acred, cred, 0)) {
> > > put_rpccred(cred);
> > > ctx->ll_cred = NULL;
> > > cred = NULL;
> > > }
> > > if (!cred)
> > > cred = auth->au_ops->lookup_cred(auth, &acred,
> > > 0);
> > > if (!cred || IS_ERR(cred))
> > > return true;
> > > ctx->ll_cred = cred;
> > > return !!(cred->cr_ops->crkey_timeout &&
> > > cred->cr_ops->crkey_timeout(cred));
> > > }
> > >
> > >
> > >
> > > net/sunrpc/auth_gss/auth_gss.c: .crkey_timeout =
> > > gss_key_timeout,
> > >
> > >
> > > /*
> > > * Returns -EACCES if GSS context is NULL or will expire within
> > > the
> > > * timeout (miliseconds)
> > > */
> > > static int
> > > gss_key_timeout(struct rpc_cred *rc)
> > > {
> > > struct gss_cred *gss_cred = container_of(rc, struct
> > > gss_cred,
> > > gc_base);
> > > struct gss_cl_ctx *ctx;
> > > unsigned long timeout = jiffies + (gss_key_expire_timeo *
> > > HZ);
> > > int ret = 0;
> > >
> > > rcu_read_lock();
> > > ctx = rcu_dereference(gss_cred->gc_ctx);
> > > if (!ctx || time_after(timeout, ctx->gc_expiry))
> > > ret = -EACCES;
> > > rcu_read_unlock();
> > >
> > > return ret;
> > > }
> > >
> > >
> > >
> > >
> > > #define GSS_KEY_EXPIRE_TIMEO 240
> > > static unsigned int gss_key_expire_timeo = GSS_KEY_EXPIRE_TIMEO;
> > >
> > >
> > >
> > >
> > >
> > > A naïve attempt at a fix:
> > >
> > >
> > > diff --git a/net/sunrpc/auth_gss/auth_gss.c
> > > b/net/sunrpc/auth_gss/auth_gss.c index 25fbd8d9de74..864661bdfdf3
> > > 100644
> > > --- a/net/sunrpc/auth_gss/auth_gss.c
> > > +++ b/net/sunrpc/auth_gss/auth_gss.c
> > > @@ -1477,6 +1477,8 @@ gss_key_timeout(struct rpc_cred *rc)
> > >
> > > rcu_read_lock();
> > > ctx = rcu_dereference(gss_cred->gc_ctx);
> > > + if (ctx && time_after(timeout + (60 * HZ), ctx-
> > > >gc_expiry))
> > > + clear_bit(RPCAUTH_CRED_UPTODATE, &rc->cr_flags);
> > > if (!ctx || time_after(timeout, ctx->gc_expiry))
> > > ret = -EACCES;
> > > rcu_read_unlock();
> > >
> > >
> > >
> > >
> > > With the above patch, if there is a write within 300s before a
> > > context
> > > is to expire (use RPCGSSDARGS="-t 400" or any value larger than
> > > 300 to
> > > test), it will now try to refresh the context and if successful
> > > then
> > > writes will be fast again (assuming -t option is >300s and/or krb
> > > ticket is valid for more than 300s).
> > >
> > > What I haven't tested nor analysed code is what would happen if
> > > it now
> > > fails to refresh the context, but after a quick glance at
> > > gss_refresh() it does seem it would continue using the previous
> > cred...
> > > Is this the correct approach to fix the issue, or can you suggest
> > > some
> > > other approach?
> > >
> > >
> > >
> > > --
> > > Robert Milkowski
>
>

Hello all, is this behavior something that can be looked into again? In
some applications, it's preferable to have a short timeout for frequent
auth refreshes.

Thank you!
Dan Mezhiborsky