Received: by 2002:a25:23cc:0:0:0:0:0 with SMTP id j195csp972093ybj; Tue, 5 May 2020 10:37:02 -0700 (PDT) X-Google-Smtp-Source: APiQypKxe0QvFvDoKiYOonSdZbEDB2MFKMJwoQJsvnBOuBXtmmCkl86r82pPAZMPgWjjB2ND9E55 X-Received: by 2002:aa7:cf16:: with SMTP id a22mr3615051edy.77.1588700222212; Tue, 05 May 2020 10:37:02 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1588700222; cv=none; d=google.com; s=arc-20160816; b=e/qS9Fhh3tTCNuU9wOYFjbErQABBL1T1grDK0y6m7CgpNyPTqcQs/zvv/9uwFYZUHe clFHcaUjw3VvaxIeEiX/GRWcmedywVR30VOlsrpfuRmn1fK21nPZlvJOk5jnsDsz/KMw Ua4Xl48PV63fr3ZjWjxKUXJ8mvC+Z6rUJ0hWLq4SZXn6qJY5gPFVZe3WpfXXk8ffIplI /tEObBPu22jK5Tv5b7O+jSfOg7GwRv63A6XExC7UfzeFDy1bHwBstFL1mWuH9INYIiRM tL+AmlaRnJIJHU4T0wCAMpVrQpKjk4Y8vXD/Kc6SjlBjyWNgsQhPxmXak7Hh8FujWhED DoiQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-language:thread-index :content-transfer-encoding:mime-version:message-id:date:subject:to :from:dkim-signature; bh=5R6Xw32uPFyvknn2MSRki2vW9MFgEWb60lMIPOnYFJ0=; b=awa/jX6jNHydpnJpuhcPznibv8vvVGiYNVoCSxeKdTmPnVQeAUdiSMnPqJyrtxjV+D E/jDwFUTuGxtk1Xry1g9E0vBwxA9OIZJhRNbb8DcAGwKVsqD/j4wlJom7FQRIM/s7DNo Ec0QAxJo40JMg4/0hYPYjxWH6HHEXSEfJD0IKFx8H5Wcz2SXULfQmL8qrX0NVj33/ryx YAXkpAk7nYjx1T4sBU9yZyE2i2sV/y+ROU4fVCZGAMOEE5X1Gv/0x3NMWFVDw0WiaLEn fhQ3FbTlHaFIDWdZYfc20jeLQuUU7+zg4X6pumDfUdxEMOAmiPhmNDJ49dfAR49W3lgA 68UA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=qyMMvBqC; spf=pass (google.com: domain of linux-nfs-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-nfs-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id d10si1634126edx.13.2020.05.05.10.36.27; Tue, 05 May 2020 10:37:02 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-nfs-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=qyMMvBqC; spf=pass (google.com: domain of linux-nfs-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-nfs-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730408AbgEEReg (ORCPT + 99 others); Tue, 5 May 2020 13:34:36 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52210 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-FAIL-OK-FAIL) by vger.kernel.org with ESMTP id S1729663AbgEERef (ORCPT ); Tue, 5 May 2020 13:34:35 -0400 Received: from mail-wr1-x431.google.com (mail-wr1-x431.google.com [IPv6:2a00:1450:4864:20::431]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 792A7C061A0F for ; Tue, 5 May 2020 10:34:35 -0700 (PDT) Received: by mail-wr1-x431.google.com with SMTP id e16so3740163wra.7 for ; Tue, 05 May 2020 10:34:35 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:subject:date:message-id:mime-version :content-transfer-encoding:thread-index:content-language; bh=5R6Xw32uPFyvknn2MSRki2vW9MFgEWb60lMIPOnYFJ0=; b=qyMMvBqCoRipvrhrGmseQus8A5tjxXdKlAQGIahKEI1U09tvXQdWxfDwACq6jIYxgb 3ECr2BjCXKXi3LeKBaThumkQRpIh3bF6cC3BI+CVWn4RBkmHUz77mvWABRapiZp+Fsfh p8/UMGaOQCyg8vqiF3UVo+vge/dfocCR4b0G0mlwxqH11kgjlxj/YPlO13rNGdzpNAqA SWh3DiblZJUB3oEf1lYOlV5W3paJVFxVbd3myIqkxk5vCHtaFo9+3FPzrRhOu9t7mqvk mitVBqbKcT4srZX3gnjz4msIPgrg98HHa+qzkPGEwR5QvRBrKA8tCugy8WaE1ra0MjFn kFPg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:subject:date:message-id:mime-version :content-transfer-encoding:thread-index:content-language; bh=5R6Xw32uPFyvknn2MSRki2vW9MFgEWb60lMIPOnYFJ0=; b=KKeD72aunu/vW2Qsd/S0XmfpMQ77Dlm89ca40pf8Xf5D6s8T/wR/qGIWVF4NNu/Gje 4tyo3isFX0/O5BMJO5OanzakGP5EdIiTO3pPqr2OXVVckj8cN05lXIlxoM951IGTJk83 QhvWdwgPyw9273/ZDsW/HoT63TYo3s+33pVQay0P4mpXlEH3+ElcqdwBwZ/KaMO+dWC8 1fmomK8kQi9EmLKQosJODTmnYkqbjPbB9QfB4CTyure66Uiv2Mu2RNeHOe7FLNFtUDkL PeDYzrBMOj4VsEWUvRcpZcxZhpoUrWmU3H0rabH3tAD6oNSc1KCpnWVjQrUT1St8gbLg mzhQ== X-Gm-Message-State: AGi0PubAhH965LoXyYMGYlcwtTeps9krjkBQ2yvALn+UorcFhq4TYBTy i+afMx9WilE6426x88ydcqLReCdk1QY= X-Received: by 2002:adf:e84f:: with SMTP id d15mr5227889wrn.296.1588700073577; Tue, 05 May 2020 10:34:33 -0700 (PDT) Received: from WINDOWSSS5SP16 (cpc96954-walt26-2-0-cust383.13-2.cable.virginm.net. [82.31.89.128]) by smtp.gmail.com with ESMTPSA id e21sm4330363wrc.1.2020.05.05.10.34.32 for (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Tue, 05 May 2020 10:34:32 -0700 (PDT) From: "Robert Milkowski" To: Subject: NFS v4 + kerberos: 4 minute window of slowness Date: Tue, 5 May 2020 18:34:32 +0100 Message-ID: <20ce01d62303$70571e80$51055b80$@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable X-Mailer: Microsoft Outlook 16.0 Thread-Index: AdYjA0EsG4HeF4PiRI6X+nNMAJWhhQ== Content-Language: en-gb Sender: linux-nfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-nfs@vger.kernel.org Hi, Currently the last 4 minutes before kernel gss context expires, all = writes to NFSv4 are synchronous and all dirty pages associated with the file = being written to are being destaged. This will continue for the 4 minutes until the context expires, at which point it gets refreshed and everything gets back to normal. The rpc.gssd by default sets the timeout to match the Kerberos service ticket, but this can be lowered by using -t option. In fact many sites set it to lower value, like for example 30 minutes. This means that every 30 minutes, the last 4 minutes results in severely slower writes (assuming these are buffered - no O_DSYNC, etc.). In extreme case, when one sets the timeout to 5 minutes, during the 4 minutes out of the minutes, there will be the slowness observed. I understand the idea behind this mechanism - I guess it tries to avoid situation when a gss context can't be refreshed (due to error or account disabled, etc.), and it expires suddenly nfs client wouldn't be able to destage all the buffered writes. The way it is currently implemented though is rather = crude. In my opinion, instead of making everything slow for the whole 4 = minutes, it should first try to refresh the context immediately and if successful = things go back to normal, if it can't refresh the context then it should continue with the = previous one and revert to the current behaviour. I implemented a na=EFve quick = fix which does exactly that (attached at the end of this email). How to re-produce. $ uname -r 5.7.0-rc4+ $ grep -- -t /etc/sysconfig/nfs RPCGSSDARGS=3D"-t 300" I'm setting it to 5 minutes so I can quickly see the behaviour without having to wait for too long. Now, let's generate a small write every 10s to a file on = nfsv4,sec=3Dkrb5 filesystem and record how long each write takes. Since these are buffered writes it should be very quick most of the = time. $ while [ 1 ]; do strace -qq -tT -v -e trace=3Dwrite /bin/echo aa >f1; = rm f1; sleep 10; done 15:22:41 write(1, "aa\n", 3) =3D 3 <0.000108> 15:22:51 write(1, "aa\n", 3) =3D 3 <0.000113> 15:23:01 write(1, "aa\n", 3) =3D 3 <0.000111> 15:23:11 write(1, "aa\n", 3) =3D 3 <0.000112> 15:23:21 write(1, "aa\n", 3) =3D 3 <0.001510> <<<<<< = becomes slow 15:23:31 write(1, "aa\n", 3) =3D 3 <0.001622> 15:23:41 write(1, "aa\n", 3) =3D 3 <0.001553> 15:23:51 write(1, "aa\n", 3) =3D 3 <0.001495> ... 15:27:01 write(1, "aa\n", 3) =3D 3 <0.001528> 15:27:12 write(1, "aa\n", 3) =3D 3 <0.001553> 15:27:22 write(1, "aa\n", 3) =3D 3 <0.000104> <<<<<< = becomes fast again 15:27:32 write(1, "aa\n", 3) =3D 3 <0.000125> 15:27:42 write(1, "aa\n", 3) =3D 3 <0.000129> 15:27:52 write(1, "aa\n", 3) =3D 3 <0.000113> 15:28:02 write(1, "aa\n", 3) =3D 3 <0.000112> 15:28:12 write(1, "aa\n", 3) =3D 3 <0.000112> 15:28:22 write(1, "aa\n", 3) =3D 3 <0.001510> <<<<<< slow ... 15:32:02 write(1, "aa\n", 3) =3D 3 <0.001501> 15:32:12 write(1, "aa\n", 3) =3D 3 <0.001440> 15:32:22 write(1, "aa\n", 3) =3D 3 <0.000136> <<<<<< fast 15:32:32 write(1, "aa\n", 3) =3D 3 <0.000109> 15:32:42 write(1, "aa\n", 3) =3D 3 <0.000110> 15:32:52 write(1, "aa\n", 3) =3D 3 <0.000112> 15:33:02 write(1, "aa\n", 3) =3D 3 <0.000103> 15:33:12 write(1, "aa\n", 3) =3D 3 <0.000112> 15:33:22 write(1, "aa\n", 3) =3D 3 <0.001405> <<<<<< slow 15:33:32 write(1, "aa\n", 3) =3D 3 <0.001393> 15:33:42 write(1, "aa\n", 3) =3D 3 <0.001479> ... So we have 4 minute long windows of slowness followed by 1 minute window when writes are fast. 15:23:21 - 15:27:22 slow 15:27:22 - 15:28:22 fast 15:28:22 - 15:32:22 slow 15:32:22 - 15:33:22 fast After some tracing with systemtap and looking at the source code, I = found where the issue is coming from. The nfs_file_write() function ends up calling nfs_ctx_key_to_expire() on each write, which in turn calls gss_key_timeout() which has hard-coded value of 240s = (GSS_KEY_EXPIRE_TIMEO/gss_key_expire_timeo). nfs_file_write() ... result =3D nfs_key_timeout_notify(file, inode); if (result) return result; ... if (nfs_need_check_write(file, inode)) { int err =3D nfs_wb_all(inode); ... /* * Avoid buffered writes when a open context credential's key would * expire soon. * * Returns -EACCES if the key will expire within RPC_KEY_EXPIRE_FAIL. * * Return 0 and set a credential flag which triggers the inode to flush * and performs NFS_FILE_SYNC writes if the key will expired within * RPC_KEY_EXPIRE_TIMEO. */ int nfs_key_timeout_notify(struct file *filp, struct inode *inode) { struct nfs_open_context *ctx =3D nfs_file_open_context(filp); if (nfs_ctx_key_to_expire(ctx, inode) && !ctx->ll_cred) /* Already expired! */ return -EACCES; return 0; } nfs_need_check_write() ... if (nfs_ctx_key_to_expire(ctx, inode)) return 1; return 0; nfs_write_end() ... if (nfs_ctx_key_to_expire(ctx, mapping->host)) { status =3D nfs_wb_all(mapping->host); ... /* * Test if the open context credential key is marked to expire soon. */ bool nfs_ctx_key_to_expire(struct nfs_open_context *ctx, struct inode *inode) { struct rpc_auth *auth =3D NFS_SERVER(inode)->client->cl_auth; struct rpc_cred *cred =3D ctx->ll_cred; struct auth_cred acred =3D { .cred =3D ctx->cred, }; if (cred && !cred->cr_ops->crmatch(&acred, cred, 0)) { put_rpccred(cred); ctx->ll_cred =3D NULL; cred =3D NULL; } if (!cred) cred =3D auth->au_ops->lookup_cred(auth, &acred, 0); if (!cred || IS_ERR(cred)) return true; ctx->ll_cred =3D cred; return !!(cred->cr_ops->crkey_timeout && cred->cr_ops->crkey_timeout(cred)); } net/sunrpc/auth_gss/auth_gss.c: .crkey_timeout =3D = gss_key_timeout, /* * Returns -EACCES if GSS context is NULL or will expire within the * timeout (miliseconds) */ static int gss_key_timeout(struct rpc_cred *rc) { struct gss_cred *gss_cred =3D container_of(rc, struct gss_cred, gc_base); struct gss_cl_ctx *ctx; unsigned long timeout =3D jiffies + (gss_key_expire_timeo * HZ); int ret =3D 0; rcu_read_lock(); ctx =3D rcu_dereference(gss_cred->gc_ctx); if (!ctx || time_after(timeout, ctx->gc_expiry)) ret =3D -EACCES; rcu_read_unlock(); return ret; } #define GSS_KEY_EXPIRE_TIMEO 240 static unsigned int gss_key_expire_timeo =3D GSS_KEY_EXPIRE_TIMEO; A na=EFve attempt at a fix: diff --git a/net/sunrpc/auth_gss/auth_gss.c = b/net/sunrpc/auth_gss/auth_gss.c index 25fbd8d9de74..864661bdfdf3 100644 --- a/net/sunrpc/auth_gss/auth_gss.c +++ b/net/sunrpc/auth_gss/auth_gss.c @@ -1477,6 +1477,8 @@ gss_key_timeout(struct rpc_cred *rc) rcu_read_lock(); ctx =3D rcu_dereference(gss_cred->gc_ctx); + if (ctx && time_after(timeout + (60 * HZ), ctx->gc_expiry)) + clear_bit(RPCAUTH_CRED_UPTODATE, &rc->cr_flags); if (!ctx || time_after(timeout, ctx->gc_expiry)) ret =3D -EACCES; rcu_read_unlock(); With the above patch, if there is a write within 300s before a context = is to expire (use RPCGSSDARGS=3D"-t 400" or any value larger than 300 to = test), it will now try to refresh the context and if successful then writes = will be fast again (assuming -t option is >300s and/or krb ticket is valid for = more than 300s). What I haven't tested nor analysed code is what would happen if it now = fails to refresh the context, but after a quick glance at gss_refresh() it = does seem it would continue using the previous cred... Is this the correct approach to fix the issue, or can you suggest some = other approach? --=20 Robert Milkowski