LinuxLists.cc - [Patch 0/3 v2] Bug fixes for NFSD callback

2023-12-15 21:49:41

Subject: [Patch 0/3 v2] Bug fixes for NFSD callback

Here are some fixes for NFSD callback including one that addresses
the problem of server hang on reboot with 6.6.3 reported recently
by Wolfgang Walter <[email protected]>:

0001: SUNRPC: remove printk when back channel request not found
0002: NFSD: restore delegation's sc_count if nfsd4_run_cb fails
0003: NFSD: Fix server reboot hang problem when callback workqueue

v2:
. patch 2/3:
add reason to hold the flc_lock in commit message
add Fixes tag
---

fs/nfsd/nfs4state.c | 23 ++++++++++++++++++-----
fs/nfsd/state.h | 2 ++
net/sunrpc/svcsock.c | 8 +-------
3 files changed, 21 insertions(+), 12 deletions(-)

2023-12-15 21:49:43

by Dai Ngo

[permalink] [raw]

Subject: [PATCH 2/3 v2] NFSD: restore delegation's sc_count if nfsd4_run_cb fails

Under some load conditions the callback work request can not be queued
and nfsd4_run_cb returns 0 to caller. When this happens, the sc_count
of the delegation state was left with an extra reference count preventing
the state to be freed later.

We need to hold the flc_lock to prevent the lease to be removed which
allows the delegation state to be released. We need to do this since
we just do the refcount_dec if nfsd4_run_cb fails, instead of doing
nfs4_put_stid to free the state if this is the last refcount.

Fixes: 6c41d9a9bd02 ("NFSD: handle GETATTR conflict with write delegation")
Signed-off-by: Dai Ngo <[email protected]>
---
fs/nfsd/nfs4state.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 40415929e2ae..175f3e9f5822 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -2947,8 +2947,14 @@ void nfs4_cb_getattr(struct nfs4_cb_fattr *ncf)

if (test_and_set_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags))
return;
+
refcount_inc(&dp->dl_stid.sc_count);
- nfsd4_run_cb(&ncf->ncf_getattr);
+ if (!nfsd4_run_cb(&ncf->ncf_getattr)) {
+ refcount_dec(&dp->dl_stid.sc_count);
+ clear_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags);
+ wake_up_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY);
+ WARN_ON_ONCE(1);
+ }
}

static struct nfs4_client *create_client(struct xdr_netobj name,
@@ -4967,7 +4973,10 @@ static void nfsd_break_one_deleg(struct nfs4_delegation *dp)
* we know it's safe to take a reference.
*/
refcount_inc(&dp->dl_stid.sc_count);
- WARN_ON_ONCE(!nfsd4_run_cb(&dp->dl_recall));
+ if (!nfsd4_run_cb(&dp->dl_recall)) {
+ refcount_dec(&dp->dl_stid.sc_count);
+ WARN_ON_ONCE(1);
+ }
}

/* Called from break_lease() with flc_lock held. */
@@ -8543,12 +8552,12 @@ nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct inode *inode,
return 0;
}
break_lease:
- spin_unlock(&ctx->flc_lock);
nfsd_stats_wdeleg_getattr_inc();
-
dp = fl->fl_owner;
ncf = &dp->dl_cb_fattr;
nfs4_cb_getattr(&dp->dl_cb_fattr);
+ spin_unlock(&ctx->flc_lock);
+
wait_on_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, TASK_INTERRUPTIBLE);
if (ncf->ncf_cb_status) {
status = nfserrno(nfsd_open_break_lease(inode, NFSD_MAY_READ));
--
2.39.3

2023-12-15 21:49:48

by Dai Ngo

[permalink] [raw]

Subject: [PATCH 1/3 v2] SUNRPC: remove printk when back channel request not found

If the client interface is down, or there is a network partition between
the client and server, that prevents the callback request to reach the
client TCP on the server will keep re-transmitting the callback for about
~9 minutes before giving up and closes the connection.

If the connection between the client and the server is re-established
before the connection is closed and after the callback timed out (9 secs)
then the re-transmitted callback request will arrive at the client. When
the server receives the reply of the callback, receive_cb_reply prints the
"Got unrecognized reply..." message in the system log since the callback
request was already removed from the server xprt's recv_queue.

Even though this scenario has no effect on the server operation, a
malicious client can take advantage of this behavior and send thousand
of callback replies with random XIDs to fill up the server's system log.

Signed-off-by: Dai Ngo <[email protected]>
---
net/sunrpc/svcsock.c | 8 +-------
1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 998687421fa6..3e89dc0afbef 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1060,7 +1060,7 @@ static int receive_cb_reply(struct svc_sock *svsk, struct svc_rqst *rqstp)
spin_lock(&bc_xprt->queue_lock);
req = xprt_lookup_rqst(bc_xprt, xid);
if (!req)
- goto unlock_notfound;
+ goto unlock_eagain;

memcpy(&req->rq_private_buf, &req->rq_rcv_buf, sizeof(struct xdr_buf));
/*
@@ -1077,12 +1077,6 @@ static int receive_cb_reply(struct svc_sock *svsk, struct svc_rqst *rqstp)
rqstp->rq_arg.len = 0;
spin_unlock(&bc_xprt->queue_lock);
return 0;
-unlock_notfound:
- printk(KERN_NOTICE
- "%s: Got unrecognized reply: "
- "calldir 0x%x xpt_bc_xprt %p xid %08x\n",
- __func__, ntohl(calldir),
- bc_xprt, ntohl(xid));
unlock_eagain:
spin_unlock(&bc_xprt->queue_lock);
return -EAGAIN;
--
2.39.3

2023-12-15 21:51:21

by Dai Ngo

[permalink] [raw]

Subject: [PATCH 3/3 v2] NFSD: Fix server reboot hang problem when callback workqueue is stuck

If the callback workqueue is stuck, nfsd4_deleg_getattr_conflict will
also stuck waiting for the callback request to be executed. This causes
the client to hang waiting for the reply of the GETATTR and also causes
the reboot of the NFS server to hang due to the pending NFS request.

Fix by replacing wait_on_bit with wait_on_bit_timeout with 20 seconds
time out.

Reported-by: Wolfgang Walter <[email protected]>
Fixes: 6c41d9a9bd02 ("NFSD: handle GETATTR conflict with write delegation")
Signed-off-by: Dai Ngo <[email protected]>
---
fs/nfsd/nfs4state.c | 6 +++++-
fs/nfsd/state.h | 2 ++
2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 175f3e9f5822..0cc7d4953807 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -2948,6 +2948,9 @@ void nfs4_cb_getattr(struct nfs4_cb_fattr *ncf)
if (test_and_set_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags))
return;

+ /* set to proper status when nfsd4_cb_getattr_done runs */
+ ncf->ncf_cb_status = NFS4ERR_IO;
+
refcount_inc(&dp->dl_stid.sc_count);
if (!nfsd4_run_cb(&ncf->ncf_getattr)) {
refcount_dec(&dp->dl_stid.sc_count);
@@ -8558,7 +8561,8 @@ nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct inode *inode,
nfs4_cb_getattr(&dp->dl_cb_fattr);
spin_unlock(&ctx->flc_lock);

- wait_on_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, TASK_INTERRUPTIBLE);
+ wait_on_bit_timeout(&ncf->ncf_cb_flags, CB_GETATTR_BUSY,
+ TASK_INTERRUPTIBLE, NFSD_CB_GETATTR_TIMEOUT);
if (ncf->ncf_cb_status) {
status = nfserrno(nfsd_open_break_lease(inode, NFSD_MAY_READ));
if (status != nfserr_jukebox ||
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index f96eaa8e9413..94563a6813a6 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -135,6 +135,8 @@ struct nfs4_cb_fattr {
/* bits for ncf_cb_flags */
#define CB_GETATTR_BUSY 0

+#define NFSD_CB_GETATTR_TIMEOUT msecs_to_jiffies(20000) /* 20 secs */
+
/*
* Represents a delegation stateid. The nfs4_client holds references to these
* and they are put when it is being destroyed or when the delegation is
--
2.39.3

2023-12-16 09:53:28

by kernel test robot

[permalink] [raw]

Subject: Re: [PATCH 1/3 v2] SUNRPC: remove printk when back channel request not found

Hi Dai,

kernel test robot noticed the following build warnings:

[auto build test WARNING on linus/master]
[also build test WARNING on v6.7-rc5 next-20231215]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Dai-Ngo/SUNRPC-remove-printk-when-back-channel-request-not-found/20231216-055046
base: linus/master
patch link: https://lore.kernel.org/r/1702676837-31320-2-git-send-email-dai.ngo%40oracle.com
patch subject: [PATCH 1/3 v2] SUNRPC: remove printk when back channel request not found
config: arc-defconfig (https://download.01.org/0day-ci/archive/20231216/[email protected]/config)
compiler: arc-elf-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231216/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All warnings (new ones prefixed by >>):

net/sunrpc/svcsock.c: In function 'receive_cb_reply':
>> net/sunrpc/svcsock.c:1053:16: warning: variable 'calldir' set but not used [-Wunused-but-set-variable]
1053 | __be32 calldir;
| ^~~~~~~

vim +/calldir +1053 net/sunrpc/svcsock.c

8f55f3c0a013c4 Alexandros Batsakis 2009-08-20 1045
586c52cc61b5b8 Trond Myklebust 2009-05-18 1046 static int receive_cb_reply(struct svc_sock *svsk, struct svc_rqst *rqstp)
4cfc7e6019caa3 Rahul Iyer 2009-09-10 1047 {
586c52cc61b5b8 Trond Myklebust 2009-05-18 1048 struct rpc_xprt *bc_xprt = svsk->sk_xprt.xpt_bc_xprt;
4cfc7e6019caa3 Rahul Iyer 2009-09-10 1049 struct rpc_rqst *req = NULL;
586c52cc61b5b8 Trond Myklebust 2009-05-18 1050 struct kvec *src, *dst;
586c52cc61b5b8 Trond Myklebust 2009-05-18 1051 __be32 *p = (__be32 *)rqstp->rq_arg.head[0].iov_base;
48e6555c7b3bf0 J. Bruce Fields 2011-02-14 1052 __be32 xid;
48e6555c7b3bf0 J. Bruce Fields 2011-02-14 @1053 __be32 calldir;
4cfc7e6019caa3 Rahul Iyer 2009-09-10 1054
4cfc7e6019caa3 Rahul Iyer 2009-09-10 1055 xid = *p++;
4cfc7e6019caa3 Rahul Iyer 2009-09-10 1056 calldir = *p;
4cfc7e6019caa3 Rahul Iyer 2009-09-10 1057
093a1468b6edb0 Trond Myklebust 2014-11-12 1058 if (!bc_xprt)
586c52cc61b5b8 Trond Myklebust 2009-05-18 1059 return -EAGAIN;
75c84151a9dc7a Trond Myklebust 2018-08-31 1060 spin_lock(&bc_xprt->queue_lock);
093a1468b6edb0 Trond Myklebust 2014-11-12 1061 req = xprt_lookup_rqst(bc_xprt, xid);
093a1468b6edb0 Trond Myklebust 2014-11-12 1062 if (!req)
75b63ccc0be260 Dai Ngo 2023-12-15 1063 goto unlock_eagain;
4cfc7e6019caa3 Rahul Iyer 2009-09-10 1064
586c52cc61b5b8 Trond Myklebust 2009-05-18 1065 memcpy(&req->rq_private_buf, &req->rq_rcv_buf, sizeof(struct xdr_buf));
586c52cc61b5b8 Trond Myklebust 2009-05-18 1066 /*
586c52cc61b5b8 Trond Myklebust 2009-05-18 1067 * XXX!: cheating for now! Only copying HEAD.
586c52cc61b5b8 Trond Myklebust 2009-05-18 1068 * But we know this is good enough for now (in fact, for any
586c52cc61b5b8 Trond Myklebust 2009-05-18 1069 * callback reply in the forseeable future).
586c52cc61b5b8 Trond Myklebust 2009-05-18 1070 */
586c52cc61b5b8 Trond Myklebust 2009-05-18 1071 dst = &req->rq_private_buf.head[0];
586c52cc61b5b8 Trond Myklebust 2009-05-18 1072 src = &rqstp->rq_arg.head[0];
586c52cc61b5b8 Trond Myklebust 2009-05-18 1073 if (dst->iov_len < src->iov_len)
093a1468b6edb0 Trond Myklebust 2014-11-12 1074 goto unlock_eagain; /* whatever; just giving up. */
586c52cc61b5b8 Trond Myklebust 2009-05-18 1075 memcpy(dst->iov_base, src->iov_base, src->iov_len);
cc248d4b1ddf05 J. Bruce Fields 2012-12-03 1076 xprt_complete_rqst(req->rq_task, rqstp->rq_arg.len);
586c52cc61b5b8 Trond Myklebust 2009-05-18 1077 rqstp->rq_arg.len = 0;
75c84151a9dc7a Trond Myklebust 2018-08-31 1078 spin_unlock(&bc_xprt->queue_lock);
586c52cc61b5b8 Trond Myklebust 2009-05-18 1079 return 0;
093a1468b6edb0 Trond Myklebust 2014-11-12 1080 unlock_eagain:
75c84151a9dc7a Trond Myklebust 2018-08-31 1081 spin_unlock(&bc_xprt->queue_lock);
093a1468b6edb0 Trond Myklebust 2014-11-12 1082 return -EAGAIN;
4cfc7e6019caa3 Rahul Iyer 2009-09-10 1083 }
586c52cc61b5b8 Trond Myklebust 2009-05-18 1084

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2023-12-16 11:12:49

by Benjamin Coddington

[permalink] [raw]

Subject: Re: [PATCH 1/3 v2] SUNRPC: remove printk when back channel request not found

On 15 Dec 2023, at 16:47, Dai Ngo wrote:

> If the client interface is down, or there is a network partition between
> the client and server, that prevents the callback request to reach the
> client TCP on the server will keep re-transmitting the callback for about
> ~9 minutes before giving up and closes the connection.
>
> If the connection between the client and the server is re-established
> before the connection is closed and after the callback timed out (9 secs)
> then the re-transmitted callback request will arrive at the client. When
> the server receives the reply of the callback, receive_cb_reply prints the
> "Got unrecognized reply..." message in the system log since the callback
> request was already removed from the server xprt's recv_queue.
>
> Even though this scenario has no effect on the server operation, a
> malicious client can take advantage of this behavior and send thousand
> of callback replies with random XIDs to fill up the server's system log.

I don't think this is a serious risk. There's plenty of things a malicious
client can do besides try to fill up a system log.

This particular printk has been an excellent indicator of transport or
client issues over the years. Seeing it in the log on a customer systems
shaves a lot of time off an initial triage of an issue. Seeing it in my
testing environment immediately notifies me of what might be an otherwise
hard-to-notice problem.

Ben

2023-12-16 16:41:28

by Chuck Lever III

[permalink] [raw]

Subject: Re: [PATCH 1/3 v2] SUNRPC: remove printk when back channel request not found

> On Dec 16, 2023, at 6:12 AM, Benjamin Coddington <[email protected]> wrote:
>
> On 15 Dec 2023, at 16:47, Dai Ngo wrote:
>
>> If the client interface is down, or there is a network partition between
>> the client and server, that prevents the callback request to reach the
>> client TCP on the server will keep re-transmitting the callback for about
>> ~9 minutes before giving up and closes the connection.
>>
>> If the connection between the client and the server is re-established
>> before the connection is closed and after the callback timed out (9 secs)
>> then the re-transmitted callback request will arrive at the client. When
>> the server receives the reply of the callback, receive_cb_reply prints the
>> "Got unrecognized reply..." message in the system log since the callback
>> request was already removed from the server xprt's recv_queue.
>>
>> Even though this scenario has no effect on the server operation, a
>> malicious client can take advantage of this behavior and send thousand
>> of callback replies with random XIDs to fill up the server's system log.
>
> I don't think this is a serious risk. There's plenty of things a malicious
> client can do besides try to fill up a system log.

It's a general policy to remove printk's that can be triggered via
the network. On the client side (xprt_lookup_rqst) we have a dprintk
and a trace point. There's no unconditional log message there.

> This particular printk has been an excellent indicator of transport or
> client issues over the years.

The problem is it can also be triggered by malicious behavior as well
as unrelated issues (like a network partition). It's got a pretty low
signal-to-noise ratio IMO; it's somewhat alarming and non-actionable
for server administrators.

> Seeing it in the log on a customer systems
> shaves a lot of time off an initial triage of an issue. Seeing it in my
> testing environment immediately notifies me of what might be an otherwise
> hard-to-notice problem.

Can you give some details?

It would be OK with me to replace the unconditional warning with a
trace point.

--
Chuck Lever

2023-12-18 13:11:46

by Benjamin Coddington

[permalink] [raw]

Subject: Re: [PATCH 1/3 v2] SUNRPC: remove printk when back channel request not found

On 16 Dec 2023, at 11:39, Chuck Lever III wrote:

>> On Dec 16, 2023, at 6:12 AM, Benjamin Coddington <[email protected]> wrote:
>>
>> On 15 Dec 2023, at 16:47, Dai Ngo wrote:
>>
>>> If the client interface is down, or there is a network partition between
>>> the client and server, that prevents the callback request to reach the
>>> client TCP on the server will keep re-transmitting the callback for about
>>> ~9 minutes before giving up and closes the connection.
>>>
>>> If the connection between the client and the server is re-established
>>> before the connection is closed and after the callback timed out (9 secs)
>>> then the re-transmitted callback request will arrive at the client. When
>>> the server receives the reply of the callback, receive_cb_reply prints the
>>> "Got unrecognized reply..." message in the system log since the callback
>>> request was already removed from the server xprt's recv_queue.
>>>
>>> Even though this scenario has no effect on the server operation, a
>>> malicious client can take advantage of this behavior and send thousand
>>> of callback replies with random XIDs to fill up the server's system log.
>>
>> I don't think this is a serious risk. There's plenty of things a malicious
>> client can do besides try to fill up a system log.
>
> It's a general policy to remove printk's that can be triggered via
> the network. On the client side (xprt_lookup_rqst) we have a dprintk
> and a trace point. There's no unconditional log message there.

Ok, fair.

>> This particular printk has been an excellent indicator of transport or
>> client issues over the years.
>
> The problem is it can also be triggered by malicious behavior as well
> as unrelated issues (like a network partition). It's got a pretty low
> signal-to-noise ratio IMO; it's somewhat alarming and non-actionable
> for server administrators.

I have never seen a case with a malicious NFS client, and I'm also sure if I
were to run a malicious client I couldn't think of a better way of raising
my hand to say so.

I've seen a lot of misbehaving middle-boxes, or incorrectly setup split
routing, or migrated-across-the-network clients..

>> Seeing it in the log on a customer systems
>> shaves a lot of time off an initial triage of an issue. Seeing it in my
>> testing environment immediately notifies me of what might be an otherwise
>> hard-to-notice problem.
>
> Can you give some details?

I don't immediately have more details at hand beyond what I've already said
- when a customer says they're seeing this message and NFS is slow or
failing in some way, its a big short cut to finding the problem.

> It would be OK with me to replace the unconditional warning with a
> trace point.

Of course, but that tracepoint would have to be enabled in order to see that
a significant disruption in the client-server conversation was occuring.

That's all I have for this patch -- just giving some feedback.

Ben

2023-12-18 15:08:36

by Chuck Lever III

[permalink] [raw]

Subject: Re: [PATCH 1/3 v2] SUNRPC: remove printk when back channel request not found

> On Dec 18, 2023, at 7:48 AM, Benjamin Coddington <[email protected]> wrote:
>
> On 16 Dec 2023, at 11:39, Chuck Lever III wrote:
>
>>> On Dec 16, 2023, at 6:12 AM, Benjamin Coddington <[email protected]> wrote:
>>>
>>> On 15 Dec 2023, at 16:47, Dai Ngo wrote:
>>>
>>>> If the client interface is down, or there is a network partition between
>>>> the client and server, that prevents the callback request to reach the
>>>> client TCP on the server will keep re-transmitting the callback for about
>>>> ~9 minutes before giving up and closes the connection.
>>>>
>>>> If the connection between the client and the server is re-established
>>>> before the connection is closed and after the callback timed out (9 secs)
>>>> then the re-transmitted callback request will arrive at the client. When
>>>> the server receives the reply of the callback, receive_cb_reply prints the
>>>> "Got unrecognized reply..." message in the system log since the callback
>>>> request was already removed from the server xprt's recv_queue.
>>>>
>>>> Even though this scenario has no effect on the server operation, a
>>>> malicious client can take advantage of this behavior and send thousand
>>>> of callback replies with random XIDs to fill up the server's system log.
>>>
>>> I don't think this is a serious risk. There's plenty of things a malicious
>>> client can do besides try to fill up a system log.
>>
>> It's a general policy to remove printk's that can be triggered via
>> the network. On the client side (xprt_lookup_rqst) we have a dprintk
>> and a trace point. There's no unconditional log message there.
>
> Ok, fair.

Fwiw, that policy is the particular reason I favor applying
this one.

>>> This particular printk has been an excellent indicator of transport or
>>> client issues over the years.
>>
>> The problem is it can also be triggered by malicious behavior as well
>> as unrelated issues (like a network partition). It's got a pretty low
>> signal-to-noise ratio IMO; it's somewhat alarming and non-actionable
>> for server administrators.
>
> I have never seen a case with a malicious NFS client, and I'm also sure if I
> were to run a malicious client I couldn't think of a better way of raising
> my hand to say so.

Right. Maybe the patch description should say "malfunctioning
or malicious client" or something more generic.

> I've seen a lot of misbehaving middle-boxes, or incorrectly setup split
> routing, or migrated-across-the-network clients..

OK, then, there might be some value to this warning outside
of code development. But the current warning message might
be better at directing administrators to look at network
issues rather than "hey I don't recognize that XID".

>>> Seeing it in the log on a customer systems
>>> shaves a lot of time off an initial triage of an issue. Seeing it in my
>>> testing environment immediately notifies me of what might be an otherwise
>>> hard-to-notice problem.
>>
>> Can you give some details?
>
> I don't immediately have more details at hand beyond what I've already said
> - when a customer says they're seeing this message and NFS is slow or
> failing in some way, its a big short cut to finding the problem.

Well I'm mostly interested in understanding why and what kind
of problems you find this way... it could mean we could replace
this warning with something as good, or we could find and fix
a class of problems if you see a common element among issues
reported in this way.

I wasn't really going for "put up or shut up", more like "how
can we improve backchannel observability without blathering
all over the kernel log?"

>> It would be OK with me to replace the unconditional warning with a
>> trace point.
>
> Of course, but that tracepoint would have to be enabled in order to see that
> a significant disruption in the client-server conversation was occurring.

A counter might work -- that's always on. I can't see that the
particular XIDs reported by this warning are useful.

If you have one or two bugs that are public I could look at,
I would be really interested in what you find with this failure
mode.

> That's all I have for this patch -- just giving some feedback.

Thanks, appreciated!

--
Chuck Lever

2023-12-20 13:04:53

by Benjamin Coddington

[permalink] [raw]

Subject: Re: [PATCH 1/3 v2] SUNRPC: remove printk when back channel request not found

On 18 Dec 2023, at 10:05, Chuck Lever III wrote:

> If you have one or two bugs that are public I could look at,
> I would be really interested in what you find with this failure
> mode.

After a full-text search of bugzilla, I found only three bugs with this
printk -- and all three were cases of kernel memory corruption or a
crashed-then-restarted server.

It's probably safe to ignore my feelings on this one. :)

Ben

2023-12-20 13:57:44

by Chuck Lever III

[permalink] [raw]

Subject: Re: [PATCH 1/3 v2] SUNRPC: remove printk when back channel request not found

> On Dec 20, 2023, at 8:04 AM, Benjamin Coddington <[email protected]> wrote:
>
> On 18 Dec 2023, at 10:05, Chuck Lever III wrote:
>
>> If you have one or two bugs that are public I could look at,
>> I would be really interested in what you find with this failure
>> mode.
>
> After a full-text search of bugzilla, I found only three bugs with this
> printk -- and all three were cases of kernel memory corruption or a
> crashed-then-restarted server.
>
> It's probably safe to ignore my feelings on this one. :)

Thank you for checking!

--
Chuck Lever

2023-12-20 16:48:44

by Dai Ngo

[permalink] [raw]

Subject: Re: [PATCH 1/3 v2] SUNRPC: remove printk when back channel request not found

Thank you Ben and Chuck,

I will add a counter in svc_stat as replacement for the printk.

-Dai

On 12/20/23 5:54 AM, Chuck Lever III wrote:
>
>> On Dec 20, 2023, at 8:04 AM, Benjamin Coddington <[email protected]> wrote:
>>
>> On 18 Dec 2023, at 10:05, Chuck Lever III wrote:
>>
>>> If you have one or two bugs that are public I could look at,
>>> I would be really interested in what you find with this failure
>>> mode.
>> After a full-text search of bugzilla, I found only three bugs with this
>> printk -- and all three were cases of kernel memory corruption or a
>> crashed-then-restarted server.
>>
>> It's probably safe to ignore my feelings on this one. :)
> Thank you for checking!
>
>
> --
> Chuck Lever
>
>

2023-12-20 19:08:29

by Chuck Lever III

[permalink] [raw]

Subject: Re: [PATCH 1/3 v2] SUNRPC: remove printk when back channel request not found

> On Dec 20, 2023, at 11:45 AM, Dai Ngo <[email protected]> wrote:
>
> Thank you Ben and Chuck,
>
> I will add a counter in svc_stat as replacement for the printk.

No need. I've applied this one to nfsd-next. I'm still on the fence
about whether to add a counter or a trace point, but that's not
urgent.

> -Dai
>
> On 12/20/23 5:54 AM, Chuck Lever III wrote:
>>
>>> On Dec 20, 2023, at 8:04 AM, Benjamin Coddington <[email protected]> wrote:
>>>
>>> On 18 Dec 2023, at 10:05, Chuck Lever III wrote:
>>>
>>>> If you have one or two bugs that are public I could look at,
>>>> I would be really interested in what you find with this failure
>>>> mode.
>>> After a full-text search of bugzilla, I found only three bugs with this
>>> printk -- and all three were cases of kernel memory corruption or a
>>> crashed-then-restarted server.
>>>
>>> It's probably safe to ignore my feelings on this one. :)
>> Thank you for checking!
>>
>>
>> --
>> Chuck Lever
>>
>>

--
Chuck Lever