2021-05-16 21:50:40

by Timo Rothenpieler

Subject: Spurious instability with NFSoRDMA under moderate load

This has happened 3 times so far over the last couple months, and I do
not have a clear way to reproduce it.
It happens under moderate load, when lots of nodes read from and write
to the server, though not in any super intense way: just normal program
execution, writing of light logs, and other standard tasks.

The issues on the clients manifest in a multitude of ways. Most of the
time, random IO operations just fail; more rarely, they hang indefinitely
and make the process unkillable.
Another example would be: "Failed to remove
'.../.nfs00000000007b03af00000001': Device or resource busy"

Once a client is in that state, the only way to get it back into order
is a reboot.

On the server side, a single error CQE was dumped each time this problem
happened. So far, I have always rebooted the server as well, to make sure
everything is back in order. Not sure if that is strictly necessary.

> [561889.198889] infiniband mlx5_0: dump_cqe:272:(pid 709): dump error cqe
> [561889.198945] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [561889.198984] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [561889.199023] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [561889.199061] 00000030: 00 00 00 00 00 00 88 13 08 00 01 13 07 47 67 d2

> [985074.602880] infiniband mlx5_0: dump_cqe:272:(pid 599): dump error cqe
> [985074.602921] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [985074.602946] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [985074.602970] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [985074.602994] 00000030: 00 00 00 00 00 00 88 13 08 00 01 46 f2 93 0b d3

> [1648894.168819] infiniband ibp1s0: dump_cqe:272:(pid 696): dump error cqe
> [1648894.168853] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [1648894.168878] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [1648894.168903] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [1648894.168928] 00000030: 00 00 00 00 00 00 88 13 08 00 01 08 6b d2 b9 d3

These all happened under different versions of the 5.10 kernel. The last
one was under 5.10.32, today.

Switching all clients to TCP seems to make NFS work perfectly reliably.

I'm not sure how to read those error dumps, so help there would be
appreciated.

Could this be similar to the spurious issues you get with UDP, where
dropped packets cause havoc? Though I would not expect heavy load on IB
to cause an error CQE to be logged.




Thanks,
Timo



2021-05-17 16:29:29

by Chuck Lever III

Subject: Re: Spurious instability with NFSoRDMA under moderate load

Hello Timo-

> On May 16, 2021, at 1:29 PM, Timo Rothenpieler <[email protected]> wrote:
>
> This has happened 3 times so far over the last couple months, and I do not have a clear way to reproduce it.
> It happens under moderate load, when lots of nodes read and write from the server. Though not in any super intense way. Just normal program execution, writing of light logs, and other standard tasks.
>
> The issues on the clients manifest in a multitude of ways. Most of the time, random IO operations just fail, rarely hang indefinitely and make the process unkillable.
> Another example would be: "Failed to remove '.../.nfs00000000007b03af00000001': Device or resource busy"
>
> Once a client is in that state, the only way to get it back into order is a reboot.
>
> On the server side, a single error cqe is dumped each time this problem happened. So far, I always rebooted the server as well, to make sure everything is back in order. Not sure if that is strictly necessary.
>
>> [561889.198889] infiniband mlx5_0: dump_cqe:272:(pid 709): dump error cqe
>> [561889.198945] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> [561889.198984] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> [561889.199023] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> [561889.199061] 00000030: 00 00 00 00 00 00 88 13 08 00 01 13 07 47 67 d2
>
>> [985074.602880] infiniband mlx5_0: dump_cqe:272:(pid 599): dump error cqe
>> [985074.602921] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> [985074.602946] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> [985074.602970] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> [985074.602994] 00000030: 00 00 00 00 00 00 88 13 08 00 01 46 f2 93 0b d3
>
>> [1648894.168819] infiniband ibp1s0: dump_cqe:272:(pid 696): dump error cqe
>> [1648894.168853] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> [1648894.168878] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> [1648894.168903] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> [1648894.168928] 00000030: 00 00 00 00 00 00 88 13 08 00 01 08 6b d2 b9 d3

I'm hoping Leon can get out his magic decoder ring and tell us if
these CQE dumps contain a useful WC status code.

Meanwhile, you could try 5.11 or 5.12 on your NFS server to see if
the problem persists.


> These all happened under different Versions of the 5.10 Kernel. The last one under 5.10.32 today.
>
> Switching all clients to TCP seems to make NFS works perfectly reliable.
>
> I'm not sure how to read those error dumps, so help there would be appreciated.
>
> Could this be similar to spurious issues you get with UDP, where dropped packages cause havoc? Though I would not expect heavy load on IB to cause an error cqe to be logged.

The underlying RDMA connection for RPC/RDMA is reliable (lossless),
so I don't see an immediate relation to how RPC/UDP might behave.


--
Chuck Lever




2021-05-18 20:22:41

by Timo Rothenpieler

Subject: Re: Spurious instability with NFSoRDMA under moderate load

On 17.05.2021 18:27, Chuck Lever III wrote:
> Meanwhile, you could try 5.11 or 5.12 on your NFS server to see if
> the problem persists.

I updated the NFS server to 5.12.4 now and will observe and maybe try to
cause some mixed load.



2021-05-19 20:12:27

by Leon Romanovsky

Subject: Re: Spurious instability with NFSoRDMA under moderate load

On Mon, May 17, 2021 at 04:27:29PM +0000, Chuck Lever III wrote:
> Hello Timo-
>
> > On May 16, 2021, at 1:29 PM, Timo Rothenpieler <[email protected]> wrote:
> >
> > This has happened 3 times so far over the last couple months, and I do not have a clear way to reproduce it.
> > It happens under moderate load, when lots of nodes read and write from the server. Though not in any super intense way. Just normal program execution, writing of light logs, and other standard tasks.
> >
> > The issues on the clients manifest in a multitude of ways. Most of the time, random IO operations just fail, rarely hang indefinitely and make the process unkillable.
> > Another example would be: "Failed to remove '.../.nfs00000000007b03af00000001': Device or resource busy"
> >
> > Once a client is in that state, the only way to get it back into order is a reboot.
> >
> > On the server side, a single error cqe is dumped each time this problem happened. So far, I always rebooted the server as well, to make sure everything is back in order. Not sure if that is strictly necessary.
> >
> >> [561889.198889] infiniband mlx5_0: dump_cqe:272:(pid 709): dump error cqe
> >> [561889.198945] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> [561889.198984] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> [561889.199023] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> [561889.199061] 00000030: 00 00 00 00 00 00 88 13 08 00 01 13 07 47 67 d2
> >
> >> [985074.602880] infiniband mlx5_0: dump_cqe:272:(pid 599): dump error cqe
> >> [985074.602921] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> [985074.602946] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> [985074.602970] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> [985074.602994] 00000030: 00 00 00 00 00 00 88 13 08 00 01 46 f2 93 0b d3
> >
> >> [1648894.168819] infiniband ibp1s0: dump_cqe:272:(pid 696): dump error cqe
> >> [1648894.168853] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> [1648894.168878] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> [1648894.168903] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> [1648894.168928] 00000030: 00 00 00 00 00 00 88 13 08 00 01 08 6b d2 b9 d3
>
> I'm hoping Leon can get out his magic decoder ring and tell us if
> these CQE dumps contain a useful WC status code.

Unfortunately, no. I failed to parse it; if I read the dump correctly,
it is not marked as an error and the opcode is 0.

Thanks

2021-06-21 16:08:55

by Timo Rothenpieler

Subject: Re: Spurious instability with NFSoRDMA under moderate load

On 17.05.2021 19:37, Timo Rothenpieler wrote:
> On 17.05.2021 18:27, Chuck Lever III wrote:
>> Meanwhile, you could try 5.11 or 5.12 on your NFS server to see if
>> the problem persists.
>
> I updated the NFS server to 5.12.4 now and will observe and maybe try to
> cause some mixed load.
>

Had this running on 5.12 for a month now, and haven't observed any kind
of instability so far.
So I'd carefully assume that something after 5.10 that made it into
5.11 or 5.12 fixed or greatly improved the situation.

Can't really sensibly bisect this sadly, given the extremely long and
uncertain turnaround times.



2021-06-21 16:31:27

by Chuck Lever III

Subject: Re: Spurious instability with NFSoRDMA under moderate load



> On Jun 21, 2021, at 12:06 PM, Timo Rothenpieler <[email protected]> wrote:
>
> On 17.05.2021 19:37, Timo Rothenpieler wrote:
>> On 17.05.2021 18:27, Chuck Lever III wrote:
>>> Meanwhile, you could try 5.11 or 5.12 on your NFS server to see if
>>> the problem persists.
>> I updated the NFS server to 5.12.4 now and will observe and maybe try to cause some mixed load.
>
> Had this running on 5.12 for a month now, and haven't observed any kind of instability so far.
> So I'd carefully assume that something in after 5.10, that made it into 5.11 or 5.12 fixed or greatly improved the situation.
>
> Can't really sensibly bisect this sadly, given the extremely long and uncertain turnaround times.

Thank you for the update!


--
Chuck Lever



2021-08-10 14:17:55

by Timo Rothenpieler

Subject: Re: Spurious instability with NFSoRDMA under moderate load

On 21.06.2021 18:06, Timo Rothenpieler wrote:
> On 17.05.2021 19:37, Timo Rothenpieler wrote:
>> On 17.05.2021 18:27, Chuck Lever III wrote:
>>> Meanwhile, you could try 5.11 or 5.12 on your NFS server to see if
>>> the problem persists.
>>
>> I updated the NFS server to 5.12.4 now and will observe and maybe try
>> to cause some mixed load.
>>
>
> Had this running on 5.12 for a month now, and haven't observed any kind
> of instability so far.
> So I'd carefully assume that something in after 5.10, that made it into
> 5.11 or 5.12 fixed or greatly improved the situation.
>
> Can't really sensibly bisect this sadly, given the extremely long and
> uncertain turnaround times.
>

Ok, so this just happened again for the first time since upgrading to 5.12.
Exact same thing, except this time no error CQE was dumped
simultaneously (it still appeared in dmesg, but over a week before the
issue showed up), so I'll assume that dump is unrelated to this issue.

I had no issues while running 5.12.12 and below. Recently (14 days ago
or so) updated to 5.12.19, and now it's happening again.
Unfortunately, with how rarely this issue happens, this can either be a
regression between those two versions, or it was still present all along
and just never triggered for several months.

Makes me wonder if this is somehow related to the problem described in
"NFS server regression in kernel 5.13 (tested w/ 5.13.9)". But the
pattern of the issue does not look all that similar, given that for me,
the hangs never recover, and I have RDMA in the mix.



2021-08-10 17:18:14

by Chuck Lever III

Subject: Re: Spurious instability with NFSoRDMA under moderate load



> On Aug 10, 2021, at 10:06 AM, Timo Rothenpieler <[email protected]> wrote:
>
> On 10.08.2021 14:49, Timo Rothenpieler wrote:
>> Ok, so this just happened again for the first time since upgrading to 5.12.
>> Exact same thing, except this time no error cqe was dumped simultaneously (It still appeared in dmesg, but over a week before the issue showed up). So I'll assume it's unrelated to this issue.
>> I had no issues while running 5.12.12 and below. Recently (14 days ago or so) updated to 5.12.19, and now it's happening again.
>> Unfortunately, with how rarely this issue happens, this can either be a regression between those two versions, or it was still present all along and just never triggered for several months.
>> Makes me wonder if this is somehow related to the problem described in "NFS server regression in kernel 5.13 (tested w/ 5.13.9)". But the pattern of the issue does not look all that similar, given that for me, the hangs never recover, and I have RDMA in the mix.
>
> Here's some trace and debug data I collected while it was in a state where everything using copy_range would get stuck.
> The file(s) in question is named "tmp-copy-*", with a bunch of numbers at the end.
>
> This also seems to be entirely a client side issue, since other NFS clients on the same server work just fine the whole time, and a reboot of that one affected client makes it work normally again, without touching the server.
>
> In this instance right now, I was even unable to cleanly reboot the client machine, since it got completely stuck, seemingly unable to unmount the NFS shares, getting stuck the same way other operations would.
>
> What confuses me the most is that while it is in this messed up state, it generally still works fine, and it's only a few very specific operations (only one I know of for sure is copy_range) get stuck.
> The processes also aren't unkillable. a SIGTERM or SIGKILL will end them, and also flushes NFS. Captured that in the dmesg_output2.

What I see in this data is that the server is reporting

SEQ4_STATUS_CB_PATH_DOWN

and the client is attempting to recover (repeatedly) using
BIND_CONN_TO_SESSION. But apparently the recovery didn't
actually work, because the server continues to report a
callback path problem.

[1712389.125641] nfs41_handle_sequence_flag_errors: "10.110.10.200" (client ID 6765f8600a675814) flags=0x00000001
[1712389.129264] nfs4_bind_conn_to_session: bind_conn_to_session was successful for server 10.110.10.200!

[1712389.171953] nfs41_handle_sequence_flag_errors: "10.110.10.200" (client ID 6765f8600a675814) flags=0x00000001
[1712389.178361] nfs4_bind_conn_to_session: bind_conn_to_session was successful for server 10.110.10.200!

[1712389.195606] nfs41_handle_sequence_flag_errors: "10.110.10.200" (client ID 6765f8600a675814) flags=0x00000001
[1712389.203891] nfs4_bind_conn_to_session: bind_conn_to_session was successful for server 10.110.10.200!

I guess it's time to switch to tracing on the server side
to see if you can nail down why the server's callback
requests are failing. On your NFS server, run:

# trace-cmd record -e nfsd -e sunrpc -e rpcgss -e rpcrdma

at roughly the same point during your test that you captured
the previous client-side trace.

--
Chuck Lever



2021-08-10 21:41:16

by Timo Rothenpieler

Subject: Re: Spurious instability with NFSoRDMA under moderate load

On 10.08.2021 19:17, Chuck Lever III wrote:
>
> What I see in this data is that the server is reporting
>
> SEQ4_STATUS_CB_PATH_DOWN
>
> and the client is attempting to recover (repeatedly) using
> BIND_CONN_TO_SESSION. But apparently the recovery didn't
> actually work, because the server continues to report a
> callback path problem.
>
> [1712389.125641] nfs41_handle_sequence_flag_errors: "10.110.10.200" (client ID 6765f8600a675814) flags=0x00000001
> [1712389.129264] nfs4_bind_conn_to_session: bind_conn_to_session was successful for server 10.110.10.200!
>
> [1712389.171953] nfs41_handle_sequence_flag_errors: "10.110.10.200" (client ID 6765f8600a675814) flags=0x00000001
> [1712389.178361] nfs4_bind_conn_to_session: bind_conn_to_session was successful for server 10.110.10.200!
>
> [1712389.195606] nfs41_handle_sequence_flag_errors: "10.110.10.200" (client ID 6765f8600a675814) flags=0x00000001
> [1712389.203891] nfs4_bind_conn_to_session: bind_conn_to_session was successful for server 10.110.10.200!
>
> I guess it's time to switch to tracing on the server side
> to see if you can nail down why the server's callback
> requests are failing. On your NFS server, run:
>
> # trace-cmd record -e nfsd -e sunrpc -e rpcgss -e rpcrdma
>
> at roughly the same point during your test that you captured
> the previous client-side trace.

I wonder if reverting 6820bf77864d on the server, to have an easier way
to reproduce this state, would be worth it.

Cause it seems like the actual underlying issue is the inability of the
NFS server (and/or client) to reestablish the backchannel if it gets
disconnected for whatever reason?

Right now I already rebooted the client, and everything is working
again, so I'll have to wait a potentially long time for this to happen
again otherwise.



2021-08-11 00:22:10

by Chuck Lever III

Subject: Re: Spurious instability with NFSoRDMA under moderate load


> On Aug 10, 2021, at 8:07 PM, Timo Rothenpieler <[email protected]> wrote:
>
> On 10.08.2021 23:40, Timo Rothenpieler wrote:
>>> On 10.08.2021 19:17, Chuck Lever III wrote:
>>>
>>> What I see in this data is that the server is reporting
>>>
>>> SEQ4_STATUS_CB_PATH_DOWN
>>>
>>> and the client is attempting to recover (repeatedly) using
>>> BIND_CONN_TO_SESSION. But apparently the recovery didn't
>>> actually work, because the server continues to report a
>>> callback path problem.
>>>
>>> [1712389.125641] nfs41_handle_sequence_flag_errors: "10.110.10.200" (client ID 6765f8600a675814) flags=0x00000001
>>> [1712389.129264] nfs4_bind_conn_to_session: bind_conn_to_session was successful for server 10.110.10.200!
>>>
>>> [1712389.171953] nfs41_handle_sequence_flag_errors: "10.110.10.200" (client ID 6765f8600a675814) flags=0x00000001
>>> [1712389.178361] nfs4_bind_conn_to_session: bind_conn_to_session was successful for server 10.110.10.200!
>>>
>>> [1712389.195606] nfs41_handle_sequence_flag_errors: "10.110.10.200" (client ID 6765f8600a675814) flags=0x00000001
>>> [1712389.203891] nfs4_bind_conn_to_session: bind_conn_to_session was successful for server 10.110.10.200!
>>>
>>> I guess it's time to switch to tracing on the server side
>>> to see if you can nail down why the server's callback
>>> requests are failing. On your NFS server, run:
>>>
>>> # trace-cmd record -e nfsd -e sunrpc -e rpcgss -e rpcrdma
>>>
>>> at roughly the same point during your test that you captured
>>> the previous client-side trace.
>> I wonder if reverting 6820bf77864d on the server, to have an easier way to reproduce this state, would be worth it.
>> Cause it seems like the actual underlying issue is the inability of the NFS server (and/or client) to reestablish the backchannel if it gets disconnected for whatever reason?
>> Right now I already rebooted the client, and everything is working again, so I'll have to wait a potentially long time for this to happen again otherwise.
>
> I actually ended up doing that, and sure enough, on 5.12.19, with 6820bf77864d reverted, it ends up getting stuck very quickly.
> Using xfs_io with copy_range this time, to have a more isolated test case.
>
> Though the trace logs I'm getting from that look different, see attached traces and rpcdebug enabled dmesg from both client and server.
> There is no appearance of "nfs41_handle_sequence_flag_errors" or "nfs4_bind_conn_to_session" whatsoever, and it does not seem like it's trying to recover the timed out callback channel at all.

Thanks, I’ll take a look at the new information tomorrow.


> So I'm not sure if the other issue that spuriously happens is related.
> But there definitely is an issue with it not re-establishing the backchannel, so fixing one might fix the other as well.
> <rpcdebug.tar.xz>

2021-08-11 17:35:29

by Chuck Lever III

Subject: Re: Spurious instability with NFSoRDMA under moderate load



> On Aug 11, 2021, at 12:20 PM, Timo Rothenpieler <[email protected]> wrote:
>
> resulting dmesg and trace logs of both client and server are attached.
>
> Test procedure:
>
> - start tracing on client and server
> - mount NFS on client
> - immediately run 'xfs_io -fc "copy_range testfile" testfile.copy' (which succeeds)
> - wait 10~15 minutes for the backchannel to time out (still running 5.12.19 with the fix for that reverted)
> - run xfs_io command again, getting stuck now
> - let it sit there stuck for a minute, then cancel it
> - run the command again
> - while it's still stuck, finished recording the logs and traces

The server tries to send CB_OFFLOAD when the offloaded copy
completes, but finds the backchannel transport is not connected.

The server can't report the problem until the client sends a
SEQUENCE operation, but there's really no other traffic going
on, so it just waits.

The client eventually sends a singleton SEQUENCE to renew its
lease. The server replies with the SEQ4_STATUS_BACKCHANNEL_FAULT
flag set at that point. Client's recovery is to destroy that
session and create a new one. That appears to be successful.

But the server doesn't send another CB_OFFLOAD to let the client
know the copy is complete, so the client hangs.

This seems to be peculiar to COPY_OFFLOAD, but I wonder if the
other CB operations suffer from the same "failed to retransmit
after the CB path is restored" issue. It might not matter for
some of them, but for others like CB_RECALL, that could be
important.


--
Chuck Lever



2021-08-11 18:39:40

by Olga Kornievskaia

Subject: Re: Spurious instability with NFSoRDMA under moderate load

On Wed, Aug 11, 2021 at 1:30 PM Chuck Lever III <[email protected]> wrote:
>
>
>
> > On Aug 11, 2021, at 12:20 PM, Timo Rothenpieler <[email protected]> wrote:
> >
> > resulting dmesg and trace logs of both client and server are attached.
> >
> > Test procedure:
> >
> > - start tracing on client and server
> > - mount NFS on client
> > - immediately run 'xfs_io -fc "copy_range testfile" testfile.copy' (which succeeds)
> > - wait 10~15 minutes for the backchannel to time out (still running 5.12.19 with the fix for that reverted)
> > - run xfs_io command again, getting stuck now
> > - let it sit there stuck for a minute, then cancel it
> > - run the command again
> > - while it's still stuck, finished recording the logs and traces
>
> The server tries to send CB_OFFLOAD when the offloaded copy
> completes, but finds the backchannel transport is not connected.
>
> The server can't report the problem until the client sends a
> SEQUENCE operation, but there's really no other traffic going
> on, so it just waits.
>
> The client eventually sends a singleton SEQUENCE to renew its
> lease. The server replies with the SEQ4_STATUS_BACKCHANNEL_FAULT
> flag set at that point. Client's recovery is to destroy that
> session and create a new one. That appears to be successful.
>
> But the server doesn't send another CB_OFFLOAD to let the client
> know the copy is complete, so the client hangs.
>
> This seems to be peculiar to COPY_OFFLOAD, but I wonder if the
> other CB operations suffer from the same "failed to retransmit
> after the CB path is restored" issue. It might not matter for
> some of them, but for others like CB_RECALL, that could be
> important.

Thank you for the analysis Chuck (btw I haven't seen any attachments
with Timo's posts so I'm assuming some offline communication must have
happened).

I'm looking at the code: wouldn't the mentioned flags be set on the
CB_SEQUENCE operation? nfsd4_cb_done() has code to mark the channel
and retry (or, another way of saying this, this code should generically
handle retrying whatever operation it is, be it CB_OFFLOAD or
CB_RECALL)? Is that not working (not sure if this is a question or a
statement)... I would think that would be the place to handle this
problem.

>
>
> --
> Chuck Lever
>
>
>

2021-08-11 18:56:57

by Chuck Lever III

Subject: Re: Spurious instability with NFSoRDMA under moderate load



> On Aug 11, 2021, at 2:38 PM, Olga Kornievskaia <[email protected]> wrote:
>
> On Wed, Aug 11, 2021 at 1:30 PM Chuck Lever III <[email protected]> wrote:
>>
>>
>>
>>> On Aug 11, 2021, at 12:20 PM, Timo Rothenpieler <[email protected]> wrote:
>>>
>>> resulting dmesg and trace logs of both client and server are attached.
>>>
>>> Test procedure:
>>>
>>> - start tracing on client and server
>>> - mount NFS on client
>>> - immediately run 'xfs_io -fc "copy_range testfile" testfile.copy' (which succeeds)
>>> - wait 10~15 minutes for the backchannel to time out (still running 5.12.19 with the fix for that reverted)
>>> - run xfs_io command again, getting stuck now
>>> - let it sit there stuck for a minute, then cancel it
>>> - run the command again
>>> - while it's still stuck, finished recording the logs and traces
>>
>> The server tries to send CB_OFFLOAD when the offloaded copy
>> completes, but finds the backchannel transport is not connected.
>>
>> The server can't report the problem until the client sends a
>> SEQUENCE operation, but there's really no other traffic going
>> on, so it just waits.
>>
>> The client eventually sends a singleton SEQUENCE to renew its
>> lease. The server replies with the SEQ4_STATUS_BACKCHANNEL_FAULT
>> flag set at that point. Client's recovery is to destroy that
>> session and create a new one. That appears to be successful.
>>
>> But the server doesn't send another CB_OFFLOAD to let the client
>> know the copy is complete, so the client hangs.
>>
>> This seems to be peculiar to COPY_OFFLOAD, but I wonder if the
>> other CB operations suffer from the same "failed to retransmit
>> after the CB path is restored" issue. It might not matter for
>> some of them, but for others like CB_RECALL, that could be
>> important.
>
> Thank you for the analysis Chuck (btw I haven't seen any attachments
> with Timo's posts so I'm assuming some offline communication must have
> happened).
>
> I'm looking at the code and wouldn't the mentioned flags be set on the
> CB_SEQUENCE operation?

CB_SEQUENCE is sent from server to client, and that can't work if
the callback channel is down.

So the server waits for the client to send a SEQUENCE and it sets
the SEQ4_STATUS_BACKCHANNEL_FAULT in its reply.


> nfsd4_cb_done() has code to mark the channel
> and retry (or another way of saying this, this code should generically
> handle retrying whatever operation it is be it CB_OFFLOAD or
> CB_RECALL)?

cb_done() marks the callback fault, but as far as I can tell the
RPC is terminated at that point and there is no subsequent retry.
The RPC_TASK flags on the CB_OFFLOAD operation cause that RPC to
fail immediately if there's no connection.
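
(For reference, that dispatch is nfsd4_run_cb_work() in
fs/nfsd/nfs4callback.c. Paraphrasing the 5.12-era code from memory,
with credential setup, the null-probe case, and error handling
elided, the tail of it is roughly:

static void nfsd4_run_cb_work(struct work_struct *work)
{
	struct nfsd4_callback *cb =
		container_of(work, struct nfsd4_callback, cb_work);
	struct rpc_clnt *clnt = cb->cb_clp->cl_cb_client;

	if (!clnt)
		return;	/* no backchannel client; real code also releases cb */

	/*
	 * SOFTCONN: fail right away if no connection is available;
	 * SOFT: don't let the RPC layer retry the call forever.
	 */
	rpc_call_async(clnt, &cb->cb_msg,
		       RPC_TASK_SOFT | RPC_TASK_SOFTCONN,
		       &nfsd4_cb_ops, cb);
}

so a CB_OFFLOAD sent while the backchannel is down completes with an
error quickly instead of waiting for a reconnect.)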

And in the BACKCHANNEL_FAULT case, the bc_xprt is destroyed as
part of recovery. I think that would kill all pending RPC tasks.


> Is that not working (not sure if this is a question or a
> statement).... I would think that would be the place to handle this
> problem.

IMHO the OFFLOAD code needs to note that the CB_OFFLOAD RPC
failed and then try the call again once the new backchannel is
available.


--
Chuck Lever



2021-08-11 19:46:58

by Olga Kornievskaia

Subject: Re: Spurious instability with NFSoRDMA under moderate load

On Wed, Aug 11, 2021 at 2:52 PM Chuck Lever III <[email protected]> wrote:
>
>
>
> > On Aug 11, 2021, at 2:38 PM, Olga Kornievskaia <[email protected]> wrote:
> >
> > On Wed, Aug 11, 2021 at 1:30 PM Chuck Lever III <[email protected]> wrote:
> >>
> >>
> >>
> >>> On Aug 11, 2021, at 12:20 PM, Timo Rothenpieler <[email protected]> wrote:
> >>>
> >>> resulting dmesg and trace logs of both client and server are attached.
> >>>
> >>> Test procedure:
> >>>
> >>> - start tracing on client and server
> >>> - mount NFS on client
> >>> - immediately run 'xfs_io -fc "copy_range testfile" testfile.copy' (which succeeds)
> >>> - wait 10~15 minutes for the backchannel to time out (still running 5.12.19 with the fix for that reverted)
> >>> - run xfs_io command again, getting stuck now
> >>> - let it sit there stuck for a minute, then cancel it
> >>> - run the command again
> >>> - while it's still stuck, finished recording the logs and traces
> >>
> >> The server tries to send CB_OFFLOAD when the offloaded copy
> >> completes, but finds the backchannel transport is not connected.
> >>
> >> The server can't report the problem until the client sends a
> >> SEQUENCE operation, but there's really no other traffic going
> >> on, so it just waits.
> >>
> >> The client eventually sends a singleton SEQUENCE to renew its
> >> lease. The server replies with the SEQ4_STATUS_BACKCHANNEL_FAULT
> >> flag set at that point. Client's recovery is to destroy that
> >> session and create a new one. That appears to be successful.
> >>
> >> But the server doesn't send another CB_OFFLOAD to let the client
> >> know the copy is complete, so the client hangs.
> >>
> >> This seems to be peculiar to COPY_OFFLOAD, but I wonder if the
> >> other CB operations suffer from the same "failed to retransmit
> >> after the CB path is restored" issue. It might not matter for
> >> some of them, but for others like CB_RECALL, that could be
> >> important.
> >
> > Thank you for the analysis Chuck (btw I haven't seen any attachments
> > with Timo's posts so I'm assuming some offline communication must have
> > happened).
> >
> > I'm looking at the code and wouldn't the mentioned flags be set on the
> > CB_SEQUENCE operation?
>
> CB_SEQUENCE is sent from server to client, and that can't work if
> the callback channel is down.
>
> So the server waits for the client to send a SEQUENCE and it sets
> the SEQ4_STATUS_BACKCHANNEL_FAULT in its reply.

yes scratch that, this is for when CB_SEQUENCE has it in its reply.

> > nfsd4_cb_done() has code to mark the channel
> > and retry (or another way of saying this, this code should generically
> > handle retrying whatever operation it is be it CB_OFFLOAD or
> > CB_RECALL)?
>
> cb_done() marks the callback fault, but as far as I can tell the
> RPC is terminated at that point and there is no subsequent retry.
> The RPC_TASK flags on the CB_OFFLOAD operation cause that RPC to
> fail immediately if there's no connection.
>
> And in the BACKCHANNEL_FAULT case, the bc_xprt is destroyed as
> part of recovery. I think that would kill all pending RPC tasks.
>
>
> > Is that not working (not sure if this is a question or a
> > statement).... I would think that would be the place to handle this
> > problem.
>
> IMHO the OFFLOAD code needs to note that the CB_OFFLOAD RPC
> failed and then try the call again once the new backchannel is
> available.

I still argue that this needs to be done generically, not per operation,
as CB_RECALL has the same problem. If a CB_RECALL call is never
answered, the RPC would fail with ETIMEDOUT. nfsd4_cb_recall_done() returns
1 back to nfsd4_cb_done(), which then, based on the RPC task status,
would set the callback path down but do nothing else.
nfsd4_cb_offload_one() just always returns 1.

Now, given that you say that in the backchannel fault case we'll have
to destroy that backchannel and create a new one: are we losing all
those RPCs that flagged it as faulty? At this point, nothing obvious
comes to mind for how to engineer the recovery, but it should be done.

>
>
> --
> Chuck Lever
>
>
>

2021-08-11 20:02:19

by Chuck Lever III

Subject: Re: Spurious instability with NFSoRDMA under moderate load



> On Aug 11, 2021, at 3:46 PM, Olga Kornievskaia <[email protected]> wrote:
>
> On Wed, Aug 11, 2021 at 2:52 PM Chuck Lever III <[email protected]> wrote:
>>
>>
>>
>>> On Aug 11, 2021, at 2:38 PM, Olga Kornievskaia <[email protected]> wrote:
>>>
>>> On Wed, Aug 11, 2021 at 1:30 PM Chuck Lever III <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>>> On Aug 11, 2021, at 12:20 PM, Timo Rothenpieler <[email protected]> wrote:
>>>>>
>>>>> resulting dmesg and trace logs of both client and server are attached.
>>>>>
>>>>> Test procedure:
>>>>>
>>>>> - start tracing on client and server
>>>>> - mount NFS on client
>>>>> - immediately run 'xfs_io -fc "copy_range testfile" testfile.copy' (which succeeds)
>>>>> - wait 10~15 minutes for the backchannel to time out (still running 5.12.19 with the fix for that reverted)
>>>>> - run xfs_io command again, getting stuck now
>>>>> - let it sit there stuck for a minute, then cancel it
>>>>> - run the command again
>>>>> - while it's still stuck, finished recording the logs and traces
>>>>
>>>> The server tries to send CB_OFFLOAD when the offloaded copy
>>>> completes, but finds the backchannel transport is not connected.
>>>>
>>>> The server can't report the problem until the client sends a
>>>> SEQUENCE operation, but there's really no other traffic going
>>>> on, so it just waits.
>>>>
>>>> The client eventually sends a singleton SEQUENCE to renew its
>>>> lease. The server replies with the SEQ4_STATUS_BACKCHANNEL_FAULT
>>>> flag set at that point. Client's recovery is to destroy that
>>>> session and create a new one. That appears to be successful.
>>>>
>>>> But the server doesn't send another CB_OFFLOAD to let the client
>>>> know the copy is complete, so the client hangs.
>>>>
>>>> This seems to be peculiar to COPY_OFFLOAD, but I wonder if the
>>>> other CB operations suffer from the same "failed to retransmit
>>>> after the CB path is restored" issue. It might not matter for
>>>> some of them, but for others like CB_RECALL, that could be
>>>> important.
>>>
>>> Thank you for the analysis Chuck (btw I haven't seen any attachments
>>> with Timo's posts so I'm assuming some offline communication must have
>>> happened).
>>>
>>> I'm looking at the code and wouldn't the mentioned flags be set on the
>>> CB_SEQUENCE operation?
>>
>> CB_SEQUENCE is sent from server to client, and that can't work if
>> the callback channel is down.
>>
>> So the server waits for the client to send a SEQUENCE and it sets
>> the SEQ4_STATUS_BACKCHANNEL_FAULT in its reply.
>
> yes scratch that, this is for when CB_SEQUENCE has it in its reply.
>
>>> nfsd4_cb_done() has code to mark the channel
>>> and retry (or another way of saying this, this code should generically
>>> handle retrying whatever operation it is be it CB_OFFLOAD or
>>> CB_RECALL)?
>>
>> cb_done() marks the callback fault, but as far as I can tell the
>> RPC is terminated at that point and there is no subsequent retry.
>> The RPC_TASK flags on the CB_OFFLOAD operation cause that RPC to
>> fail immediately if there's no connection.
>>
>> And in the BACKCHANNEL_FAULT case, the bc_xprt is destroyed as
>> part of recovery. I think that would kill all pending RPC tasks.
>>
>>
>>> Is that not working (not sure if this is a question or a
>>> statement).... I would think that would be the place to handle this
>>> problem.
>>
>> IMHO the OFFLOAD code needs to note that the CB_OFFLOAD RPC
>> failed and then try the call again once the new backchannel is
>> available.
>
> I still argue that this needs to be done generically not per operation
> as CB_RECALL has the same problem.

Probably not just CB_RECALL, but agreed, there doesn't seem to
be any mechanism that can re-drive callback operations when the
backchannel is replaced.


> If CB_RECALL call is never
> answered, rpc would fail with ETIMEOUT.

Well we have a mechanism already to deal with that case, which is
delegation revocation. That's not optimal, but at least it won't
leave the client hanging forever.


> nfsd4_cb_recall_done() returns
> 1 back to the nfsd4_cb_done() which then based on the rpc task status
> would set the callback path down but will do nothing else.
> nfsd4_cb_offload_one() just always returns 1.
>
> Now given that you say during the backchannel fault case, we'll have
> to destroy that backchannel and create a new one.

BACKCHANNEL_FAULT is the particular case that Timo's reproducer
exercises. I haven't looked closely at the CB_PATH_DOWN and
CB_PATH_DOWN_SESSION cases, which use BIND_CONN_TO_SESSION,
to see if those also destroy the backchannel transport.

However I suspect they are just as broken. There isn't much
regression testing around these scenarios.


> Are we losing all those RPCs that flagged it as faulty?

I believe the RPC tasks are already gone at that point. I'm
just saying that /hypothetically/ if the RPCs were not set up
to exit immediately, that wouldn't matter because the transport
and client are destroyed as part of restoring the backchannel.

Perhaps instead of destroying the rpc_clnt, the server should
retain it and simply recycle the underlying transport data
structures.


> At this point, nothing obvious
> comes to mind how to engineer the recovery but it should be done.

Well I think we need to:

1. Review the specs to see if there's any indication of how to
recover from this situation

2. Perhaps construct a way to move CB RPCs to a workqueue so
they can be retried indefinitely (or figure out when they should
stop being retried: COPY_OFFLOAD_CANCEL comes to mind); a rough
sketch of that idea follows below.

3. Keep an eye peeled for cases where a malicious or broken
client could cause CB operations to tie up all the server's
nfsd threads or other resources.
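
To sketch point 2 (purely hypothetical: none of the "redrive" names
below exist in nfsd today, and the backchannel-availability check is a
placeholder), the idea would be to park failed callbacks on a list and
let a delayed work item hand them back to the existing nfsd4_run_cb()
once a connection is usable again:

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>
/* struct nfsd4_callback and nfsd4_run_cb() come from fs/nfsd/state.h */

struct nfsd4_cb_redrive {
	spinlock_t		rd_lock;
	struct list_head	rd_parked;	/* failed nfsd4_callbacks */
	struct delayed_work	rd_work;
};

static void nfsd4_cb_redrive_work(struct work_struct *work)
{
	struct nfsd4_cb_redrive *rd =
		container_of(work, struct nfsd4_cb_redrive, rd_work.work);
	struct nfsd4_callback *cb, *tmp;
	LIST_HEAD(ready);

	if (!rd_backchannel_is_up(rd)) {
		/* placeholder check; backchannel still down, try later */
		queue_delayed_work(system_long_wq, &rd->rd_work, 5 * HZ);
		return;
	}

	spin_lock(&rd->rd_lock);
	list_splice_init(&rd->rd_parked, &ready);
	spin_unlock(&rd->rd_lock);

	/* cb_redrive_entry would be a new list_head in nfsd4_callback */
	list_for_each_entry_safe(cb, tmp, &ready, cb_redrive_entry) {
		list_del_init(&cb->cb_redrive_entry);
		nfsd4_run_cb(cb);	/* existing helper: queues cb->cb_work */
	}
}

The open questions from points 2 (when to give up, e.g. after a
COPY_OFFLOAD_CANCEL) and 3 (bounding the amount of parked work) would
live around this.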

--
Chuck Lever



2021-08-11 20:15:46

by J. Bruce Fields

Subject: Re: Spurious instability with NFSoRDMA under moderate load

On Wed, Aug 11, 2021 at 08:01:30PM +0000, Chuck Lever III wrote:
> Probably not just CB_RECALL, but agreed, there doesn't seem to
> be any mechanism that can re-drive callback operations when the
> backchannel is replaced.

The nfsd4_queue_cb() in nfsd4_cb_release() should queue a work item
to run nfsd4_run_cb_work, which should set up another callback client if
necessary.
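
Roughly (paraphrasing nfsd4_cb_release(), not quoting it verbatim),
the requeue only happens when cb_need_restart got set while handling
the previous attempt:

static void nfsd4_cb_release(void *calldata)
{
	struct nfsd4_callback *cb = calldata;

	if (cb->cb_need_restart)
		nfsd4_queue_cb(cb);	/* requeues cb_work -> nfsd4_run_cb_work */
	else
		cb->cb_ops->release(cb);
}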

But I doubt this is well tested.

--b.

2021-08-11 20:43:23

by Olga Kornievskaia

Subject: Re: Spurious instability with NFSoRDMA under moderate load

On Wed, Aug 11, 2021 at 4:14 PM J. Bruce Fields <[email protected]> wrote:
>
> On Wed, Aug 11, 2021 at 08:01:30PM +0000, Chuck Lever III wrote:
> > Probably not just CB_RECALL, but agreed, there doesn't seem to
> > be any mechanism that can re-drive callback operations when the
> > backchannel is replaced.
>
> The nfsd4_queue_cb() in nfsd4_cb_release() should queue a work item
> to run nfsd4_run_cb_work, which should set up another callback client if
> necessary.

diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
index 7325592b456e..ed0e76f7185c 100644
--- a/fs/nfsd/nfs4callback.c
+++ b/fs/nfsd/nfs4callback.c
@@ -1191,6 +1191,7 @@ static void nfsd4_cb_done(struct rpc_task *task, void *calldata)
 		case -ETIMEDOUT:
 		case -EACCES:
 			nfsd4_mark_cb_down(clp, task->tk_status);
+			cb->cb_need_restart = true;
 		}
 		break;
 	default:

Something like this should requeue and retry the callback?

>
> But I doubt this is well tested.
>
> --b.

2021-08-11 20:51:51

by J. Bruce Fields

Subject: Re: Spurious instability with NFSoRDMA under moderate load

On Wed, Aug 11, 2021 at 08:01:30PM +0000, Chuck Lever III wrote:
> Perhaps instead of destroying the rpc_clnt, the server should
> retain it and simply recycle the underlying transport data
> structures.

I think there was more trunking support added to the rpc client since
the callback code was written, so something like that might make sense
now.

--b.

2021-08-11 20:52:26

by Olga Kornievskaia

Subject: Re: Spurious instability with NFSoRDMA under moderate load

On Wed, Aug 11, 2021 at 4:01 PM Chuck Lever III <[email protected]> wrote:
>
>
>
> > On Aug 11, 2021, at 3:46 PM, Olga Kornievskaia <[email protected]> wrote:
> >
> > On Wed, Aug 11, 2021 at 2:52 PM Chuck Lever III <[email protected]> wrote:
> >>
> >>
> >>
> >>> On Aug 11, 2021, at 2:38 PM, Olga Kornievskaia <[email protected]> wrote:
> >>>
> >>> On Wed, Aug 11, 2021 at 1:30 PM Chuck Lever III <[email protected]> wrote:
> >>>>
> >>>>
> >>>>
> >>>>> On Aug 11, 2021, at 12:20 PM, Timo Rothenpieler <[email protected]> wrote:
> >>>>>
> >>>>> resulting dmesg and trace logs of both client and server are attached.
> >>>>>
> >>>>> Test procedure:
> >>>>>
> >>>>> - start tracing on client and server
> >>>>> - mount NFS on client
> >>>>> - immediately run 'xfs_io -fc "copy_range testfile" testfile.copy' (which succeeds)
> >>>>> - wait 10~15 minutes for the backchannel to time out (still running 5.12.19 with the fix for that reverted)
> >>>>> - run xfs_io command again, getting stuck now
> >>>>> - let it sit there stuck for a minute, then cancel it
> >>>>> - run the command again
> >>>>> - while it's still stuck, finished recording the logs and traces
> >>>>
> >>>> The server tries to send CB_OFFLOAD when the offloaded copy
> >>>> completes, but finds the backchannel transport is not connected.
> >>>>
> >>>> The server can't report the problem until the client sends a
> >>>> SEQUENCE operation, but there's really no other traffic going
> >>>> on, so it just waits.
> >>>>
> >>>> The client eventually sends a singleton SEQUENCE to renew its
> >>>> lease. The server replies with the SEQ4_STATUS_BACKCHANNEL_FAULT
> >>>> flag set at that point. Client's recovery is to destroy that
> >>>> session and create a new one. That appears to be successful.
> >>>>
> >>>> But the server doesn't send another CB_OFFLOAD to let the client
> >>>> know the copy is complete, so the client hangs.
> >>>>
> >>>> This seems to be peculiar to COPY_OFFLOAD, but I wonder if the
> >>>> other CB operations suffer from the same "failed to retransmit
> >>>> after the CB path is restored" issue. It might not matter for
> >>>> some of them, but for others like CB_RECALL, that could be
> >>>> important.
> >>>
> >>> Thank you for the analysis Chuck (btw I haven't seen any attachments
> >>> with Timo's posts so I'm assuming some offline communication must have
> >>> happened).
> >>>
> >>> I'm looking at the code and wouldn't the mentioned flags be set on the
> >>> CB_SEQUENCE operation?
> >>
> >> CB_SEQUENCE is sent from server to client, and that can't work if
> >> the callback channel is down.
> >>
> >> So the server waits for the client to send a SEQUENCE and it sets
> >> the SEQ4_STATUS_BACKCHANNEL_FAULT in its reply.
> >
> > yes scratch that, this is for when CB_SEQUENCE has it in its reply.
> >
> >>> nfsd4_cb_done() has code to mark the channel
> >>> and retry (or another way of saying this, this code should generically
> >>> handle retrying whatever operation it is be it CB_OFFLOAD or
> >>> CB_RECALL)?
> >>
> >> cb_done() marks the callback fault, but as far as I can tell the
> >> RPC is terminated at that point and there is no subsequent retry.
> >> The RPC_TASK flags on the CB_OFFLOAD operation cause that RPC to
> >> fail immediately if there's no connection.
> >>
> >> And in the BACKCHANNEL_FAULT case, the bc_xprt is destroyed as
> >> part of recovery. I think that would kill all pending RPC tasks.
> >>
> >>
> >>> Is that not working (not sure if this is a question or a
> >>> statement).... I would think that would be the place to handle this
> >>> problem.
> >>
> >> IMHO the OFFLOAD code needs to note that the CB_OFFLOAD RPC
> >> failed and then try the call again once the new backchannel is
> >> available.
> >
> > I still argue that this needs to be done generically not per operation
> > as CB_RECALL has the same problem.
>
> Probably not just CB_RECALL, but agreed, there doesn't seem to
> be any mechanism that can re-drive callback operations when the
> backchannel is replaced.
>
>
> > If CB_RECALL call is never
> > answered, rpc would fail with ETIMEOUT.
>
> Well we have a mechanism already to deal with that case, which is
> delegation revocation. That's not optimal, but at least it won't
> leave the client hanging forever.
>
>
> > nfsd4_cb_recall_done() returns
> > 1 back to the nfsd4_cb_done() which then based on the rpc task status
> > would set the callback path down but will do nothing else.
> > nfsd4_cb_offload_one() just always returns 1.
> >
> > Now given that you say during the backchannel fault case, we'll have
> > to destroy that backchannel and create a new one.
>
> BACKCHANNEL_FAULT is the particular case that Timo's reproducer
> exercises. I haven't looked closely at the CB_PATH_DOWN and
> CB_PATH_DOWN_SESSION cases, which use BIND_CONN_TO_SESSION,
> to see if those also destroy the backchannel transport.
>
> However I suspect they are just as broken. There isn't much
> regression testing around these scenarios.
>
>
> > Are we losing all those RPCs that flagged it as faulty?
>
> I believe the RPC tasks are already gone at that point. I'm
> just saying that /hypothetically/ if the RPCs were not set up
> to exit immediately, that wouldn't matter because the transport
> and client are destroyed as part of restoring the backchannel.
>
> Perhaps instead of destroying the rpc_clnt, the server should
> retain it and simply recycle the underlying transport data
> structures.
>
>
> > At this point, nothing obvious
> > comes to mind how to engineer the recovery but it should be done.
>
> Well I think we need to:
>
> 1. Review the specs to see if there's any indication of how to
> recover from this situation

I guess one fix we can consider on the client is to make waiting for
the CB_OFFLOAD time-based and then use the OFFLOAD_STATUS operation to
check on the copy status.

> 2. Perhaps construct a way to move CB RPCs to a workqueue so
> they can be retried indefinitely (or figure out when they should
> stop being retried: COPY_OFFLOAD_CANCEL comes to mind).
>
> 3. Keep an eye peeled for cases where a malicious or broken
> client could cause CB operations to tie up all the server's
> nfsd threads or other resources.
>
> --
> Chuck Lever
>
>
>

2021-08-12 15:42:17

by J. Bruce Fields

Subject: Re: Spurious instability with NFSoRDMA under moderate load

On Wed, Aug 11, 2021 at 04:40:04PM -0400, Olga Kornievskaia wrote:
> On Wed, Aug 11, 2021 at 4:14 PM J. Bruce Fields <[email protected]> wrote:
> >
> > On Wed, Aug 11, 2021 at 08:01:30PM +0000, Chuck Lever III wrote:
> > > Probably not just CB_RECALL, but agreed, there doesn't seem to
> > > be any mechanism that can re-drive callback operations when the
> > > backchannel is replaced.
> >
> > The nfsd4_queue_cb() in nfsd4_cb_release() should queue a work item
> > to run nfsd4_run_cb_work, which should set up another callback client if
> > necessary.

But I think the result is it'll look to see if there's another
connection available for callbacks, and give up immediately if not.

There's no logic to wait for the client to fix the problem.

> diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
> index 7325592b456e..ed0e76f7185c 100644
> --- a/fs/nfsd/nfs4callback.c
> +++ b/fs/nfsd/nfs4callback.c
> @@ -1191,6 +1191,7 @@ static void nfsd4_cb_done(struct rpc_task *task,
> void *calldata)
> case -ETIMEDOUT:
> case -EACCES:
> nfsd4_mark_cb_down(clp, task->tk_status);
> + cb->cb_need_restart = true;
> }
> break;
> default:
>
> Something like this should requeue and retry the callback?

I think we'd need more than just that.

--b.

2021-08-12 19:05:53

by Timo Rothenpieler

Subject: Re: Spurious instability with NFSoRDMA under moderate load

On 11.08.2021 19:30, Chuck Lever III wrote:
>
>
>> On Aug 11, 2021, at 12:20 PM, Timo Rothenpieler <[email protected]> wrote:
>>
>> resulting dmesg and trace logs of both client and server are attached.
>>
>> Test procedure:
>>
>> - start tracing on client and server
>> - mount NFS on client
>> - immediately run 'xfs_io -fc "copy_range testfile" testfile.copy' (which succeeds)
>> - wait 10~15 minutes for the backchannel to time out (still running 5.12.19 with the fix for that reverted)
>> - run xfs_io command again, getting stuck now
>> - let it sit there stuck for a minute, then cancel it
>> - run the command again
>> - while it's still stuck, finished recording the logs and traces
>
> The server tries to send CB_OFFLOAD when the offloaded copy
> completes, but finds the backchannel transport is not connected.
>
> The server can't report the problem until the client sends a
> SEQUENCE operation, but there's really no other traffic going
> on, so it just waits.
>
> The client eventually sends a singleton SEQUENCE to renew its
> lease. The server replies with the SEQ4_STATUS_BACKCHANNEL_FAULT
> flag set at that point. Client's recovery is to destroy that
> session and create a new one. That appears to be successful.

If it re-created the session and the backchannel, shouldn't that mean
that after I cancel the first stuck xfs_io command and try it again
immediately (before the backchannel had a chance to time out again), it
should work?
Because that's explicitly not the case: once the backchannel initially
times out, all subsequent commands get stuck, even if the system is
seeing other work on the NFS mount being done in parallel, no matter
how often I retry and how long I wait in between or with it stuck.

> But the server doesn't send another CB_OFFLOAD to let the client
> know the copy is complete, so the client hangs.
>
> This seems to be peculiar to COPY_OFFLOAD, but I wonder if the
> other CB operations suffer from the same "failed to retransmit
> after the CB path is restored" issue. It might not matter for
> some of them, but for others like CB_RECALL, that could be
> important.
>
>
> --
> Chuck Lever
>
>
>



2021-08-16 13:32:31

by Chuck Lever III

Subject: Re: Spurious instability with NFSoRDMA under moderate load



> On Aug 12, 2021, at 2:13 PM, Timo Rothenpieler <[email protected]> wrote:
>
> On 11.08.2021 19:30, Chuck Lever III wrote:
>>> On Aug 11, 2021, at 12:20 PM, Timo Rothenpieler <[email protected]> wrote:
>>>
>>> resulting dmesg and trace logs of both client and server are attached.
>>>
>>> Test procedure:
>>>
>>> - start tracing on client and server
>>> - mount NFS on client
>>> - immediately run 'xfs_io -fc "copy_range testfile" testfile.copy' (which succeeds)
>>> - wait 10~15 minutes for the backchannel to time out (still running 5.12.19 with the fix for that reverted)
>>> - run xfs_io command again, getting stuck now
>>> - let it sit there stuck for a minute, then cancel it
>>> - run the command again
>>> - while it's still stuck, finished recording the logs and traces
>> The server tries to send CB_OFFLOAD when the offloaded copy
>> completes, but finds the backchannel transport is not connected.
>> The server can't report the problem until the client sends a
>> SEQUENCE operation, but there's really no other traffic going
>> on, so it just waits.
>> The client eventually sends a singleton SEQUENCE to renew its
>> lease. The server replies with the SEQ4_STATUS_BACKCHANNEL_FAULT
>> flag set at that point. Client's recovery is to destroy that
>> session and create a new one. That appears to be successful.
>
> If it re-created the session and the backchannel, shouldn't that mean that after I cancel the first stuck xfs_io command, and try it again immediately (before the backchannel had a chance to timeout again) it should work?

I would guess that yes, subsequent COPY_OFFLOAD requests
should work unless the backchannel has already timed out
again.

I was about to use your reproducer myself, but a storm
came through on Thursday and knocked out my power and
internet. I'm still waiting for restoration.

Once power is restored I can chase this a little more
efficiently in my lab.


> Cause that's explicitly not the case, once the backchannel initially times out, all subsequent commands get stuck, even if the system is seeing other work on the NFS mount being done in parallel, and no matter how often I re-try and how long I wait in between or with it stuck.
>
>> But the server doesn't send another CB_OFFLOAD to let the client
>> know the copy is complete, so the client hangs.
>> This seems to be peculiar to COPY_OFFLOAD, but I wonder if the
>> other CB operations suffer from the same "failed to retransmit
>> after the CB path is restored" issue. It might not matter for
>> some of them, but for others like CB_RECALL, that could be
>> important.
>> --
>> Chuck Lever

--
Chuck Lever



2021-08-17 21:10:32

by Chuck Lever III

Subject: Re: Spurious instability with NFSoRDMA under moderate load



> On Aug 12, 2021, at 2:13 PM, Timo Rothenpieler <[email protected]> wrote:
>
> On 11.08.2021 19:30, Chuck Lever III wrote:
>>> On Aug 11, 2021, at 12:20 PM, Timo Rothenpieler <[email protected]> wrote:
>>>
>>> resulting dmesg and trace logs of both client and server are attached.
>>>
>>> Test procedure:
>>>
>>> - start tracing on client and server
>>> - mount NFS on client
>>> - immediately run 'xfs_io -fc "copy_range testfile" testfile.copy' (which succeeds)
>>> - wait 10~15 minutes for the backchannel to time out (still running 5.12.19 with the fix for that reverted)
>>> - run xfs_io command again, getting stuck now
>>> - let it sit there stuck for a minute, then cancel it
>>> - run the command again
>>> - while it's still stuck, finished recording the logs and traces
>> The server tries to send CB_OFFLOAD when the offloaded copy
>> completes, but finds the backchannel transport is not connected.
>> The server can't report the problem until the client sends a
>> SEQUENCE operation, but there's really no other traffic going
>> on, so it just waits.
>> The client eventually sends a singleton SEQUENCE to renew its
>> lease. The server replies with the SEQ4_STATUS_BACKCHANNEL_FAULT
>> flag set at that point. Client's recovery is to destroy that
>> session and create a new one. That appears to be successful.
>
> If it re-created the session and the backchannel, shouldn't that mean that after I cancel the first stuck xfs_io command, and try it again immediately (before the backchannel had a chance to timeout again) it should work?
> Cause that's explicitly not the case, once the backchannel initially times out, all subsequent commands get stuck, even if the system is seeing other work on the NFS mount being done in parallel, and no matter how often I re-try and how long I wait in between or with it stuck.
>
>> But the server doesn't send another CB_OFFLOAD to let the client
>> know the copy is complete, so the client hangs.
>> This seems to be peculiar to COPY_OFFLOAD, but I wonder if the
>> other CB operations suffer from the same "failed to retransmit
>> after the CB path is restored" issue. It might not matter for
>> some of them, but for others like CB_RECALL, that could be
>> important.

I tried reproducing this using your 'xfs_io -fc "copy_range
testfile" testfile.copy' reproducer, but couldn't.

A network capture shows that the client tries CLONE first. The
server reports that's not supported, so the client tries COPY.
The COPY works, and the reply shows that the COPY was synchro-
nous. Thus there's no need for a callback, and I'm not tripping
over backchannel misbehavior.

The export I'm using is an xfs filesystem. Did you already
report the filesystem type you're testing against? I can't
find it in the thread.

If there's a way to force an offload-style COPY, let me know.

Oh. Also I looked at what might have been pulled into the
linux-5.12.y kernel between .12 and .19, and I don't see
anything that's especially relevant to either COPY_OFFLOAD
or backchannel.

--
Chuck Lever



2021-08-17 21:52:24

by Timo Rothenpieler

Subject: Re: Spurious instability with NFSoRDMA under moderate load

On 17.08.2021 23:08, Chuck Lever III wrote:
> I tried reproducing this using your 'xfs_io -fc "copy_range
> testfile" testfile.copy' reproducer, but couldn't.
>
> A network capture shows that the client tries CLONE first. The
> server reports that's not supported, so the client tries COPY.
> The COPY works, and the reply shows that the COPY was synchro-
> nous. Thus there's no need for a callback, and I'm not tripping
> over backchannel misbehavior.

Make sure the testfile is of sufficient size. I'm not sure what the
threshold is, but if it's too small, it'll just do a synchronous copy
for me as well.
I'm using a 50MB file in my tests.

> The export I'm using is an xfs filesystem. Did you already
> report the filesystem type you're testing against? I can't
> find it in the thread.
>
> If there's a way to force an offload-style COPY, let me know.
>
> Oh. Also I looked at what might have been pulled into the
> linux-5.12.y kernel between .12 and .19, and I don't see
> anything that's especially relevant to either COPY_OFFLOAD
> or backchannel.

I'm observing this with both ext4 and zfs filesystems.
I can easily test xfs as well if desired.

Are you testing this on a normal network, or with RDMA? With normal TCP
I also can't observe this issue (it doesn't time out the backchannel in
the first place); it only happens in RDMA mode.
I'm using Mellanox ConnectX-4 cards in IB mode for my tests.


Attachments:
smime.p7s (4.39 kB)
S/MIME Cryptographic Signature

2021-08-17 22:57:26

by Dai Ngo

[permalink] [raw]
Subject: Re: Spurious instability with NFSoRDMA under moderate load


On 8/17/21 2:51 PM, Timo Rothenpieler wrote:
> On 17.08.2021 23:08, Chuck Lever III wrote:
>> I tried reproducing this using your 'xfs_io -fc "copy_range
>> testfile" testfile.copy' reproducer, but couldn't.
>>
>> A network capture shows that the client tries CLONE first. The
>> server reports that's not supported, so the client tries COPY.
>> The COPY works, and the reply shows that the COPY was synchro-
>> nous. Thus there's no need for a callback, and I'm not tripping
>> over backchannel misbehavior.
>
> Make sure the testfile is of sufficient size. I'm not sure what the
> threshold is, but if it's too small, it'll just do a synchronous copy
> for me as well.
> I'm using a 50MB file in my tests.

The threshold for intra-server copy is (2*rsize) of the source server.
Any copy smaller than this is done synchronously.
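
In other words, as a rough sketch (hypothetical names, not the actual
client or nfsd code):

/* Sketch only -- hypothetical names, not the in-kernel logic. An
 * intra-server copy shorter than twice the rsize is performed
 * synchronously, so no CB_OFFLOAD callback is involved. */
#include <stdbool.h>
#include <stdio.h>

bool intra_server_copy_is_sync(size_t count, size_t rsize)
{
    return count < 2 * rsize;
}

int main(void)
{
    size_t rsize = 1 << 20;          /* e.g. a 1 MiB rsize  */
    size_t testfile = 50u << 20;     /* the ~50 MB testfile */

    /* The cut-off here is 2 MiB, so the testfile goes asynchronous. */
    printf("sync copy? %s\n",
           intra_server_copy_is_sync(testfile, rsize) ? "yes" : "no");
    return 0;
}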

-Dai

>
>> The export I'm using is an xfs filesystem. Did you already
>> report the filesystem type you're testing against? I can't
>> find it in the thread.
>>
>> If there's a way to force an offload-style COPY, let me know.
>>
>> Oh. Also I looked at what might have been pulled into the
>> linux-5.12.y kernel between .12 and .19, and I don't see
>> anything that's especially relevant to either COPY_OFFLOAD
>> or backchannel.
>
> I'm observing this with both an ext4 and zfs filesystem.
> Can easily test xfs as well if desired.
>
> Are you testing this on a normal network, or with RDMA? With normal
> tcp, I also can't observe this issue(it doesn't time out the
> backchannel in the first place), it only happens in RDMA mode.
> I'm using Mellanox ConnectX-4 cards in IB mode for my tests.
>

2021-08-17 23:06:40

by Dai Ngo

[permalink] [raw]
Subject: Re: Spurious instability with NFSoRDMA under moderate load


On 8/17/21 3:55 PM, [email protected] wrote:
>
> On 8/17/21 2:51 PM, Timo Rothenpieler wrote:
>> On 17.08.2021 23:08, Chuck Lever III wrote:
>>> I tried reproducing this using your 'xfs_io -fc "copy_range
>>> testfile" testfile.copy' reproducer, but couldn't.
>>>
>>> A network capture shows that the client tries CLONE first. The
>>> server reports that's not supported, so the client tries COPY.
>>> The COPY works, and the reply shows that the COPY was synchro-
>>> nous. Thus there's no need for a callback, and I'm not tripping
>>> over backchannel misbehavior.
>>
>> Make sure the testfile is of sufficient size. I'm not sure what the
>> threshold is, but if it's too small, it'll just do a synchronous copy
>> for me as well.
>> I'm using a 50MB file in my tests.
>
> The threshold for intra-server copy is (2*rsize) of the source server.

The threshold for inter-server copy is (14*rsize). This restriction is
to be removed in the next merge.
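
Both thresholds together, again only as a rough sketch with made-up
names:

/* Sketch only -- made-up names. Per the numbers in this thread:
 * intra-server copies go asynchronous at 2*rsize, inter-server copies
 * (until the restriction is lifted) at 14*rsize. */
#include <stdbool.h>
#include <stdio.h>

bool copy_is_sync(size_t count, size_t rsize, bool inter_server)
{
    size_t threshold = inter_server ? 14 * rsize : 2 * rsize;

    return count < threshold;
}

int main(void)
{
    size_t rsize = 1 << 20;    /* e.g. 1 MiB */

    /* 10 MiB is below 14*rsize, so an inter-server copy stays sync. */
    printf("10 MiB inter-server copy sync? %s\n",
           copy_is_sync(10u << 20, rsize, true) ? "yes" : "no");
    return 0;
}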

-Dai

> Any copy smaller than this is done synchronously.
>
> -Dai
>
>>
>>> The export I'm using is an xfs filesystem. Did you already
>>> report the filesystem type you're testing against? I can't
>>> find it in the thread.
>>>
>>> If there's a way to force an offload-style COPY, let me know.
>>>
>>> Oh. Also I looked at what might have been pulled into the
>>> linux-5.12.y kernel between .12 and .19, and I don't see
>>> anything that's especially relevant to either COPY_OFFLOAD
>>> or backchannel.
>>
>> I'm observing this with both an ext4 and zfs filesystem.
>> Can easily test xfs as well if desired.
>>
>> Are you testing this on a normal network, or with RDMA? With normal
>> tcp, I also can't observe this issue(it doesn't time out the
>> backchannel in the first place), it only happens in RDMA mode.
>> I'm using Mellanox ConnectX-4 cards in IB mode for my tests.
>>

2021-08-18 00:04:30

by Timo Rothenpieler

[permalink] [raw]
Subject: Re: Spurious instability with NFSoRDMA under moderate load

On 17.08.2021 23:51, Timo Rothenpieler wrote:
> On 17.08.2021 23:08, Chuck Lever III wrote:
>> I tried reproducing this using your 'xfs_io -fc "copy_range
>> testfile" testfile.copy' reproducer, but couldn't.
>>
>> A network capture shows that the client tries CLONE first. The
>> server reports that's not supported, so the client tries COPY.
>> The COPY works, and the reply shows that the COPY was synchro-
>> nous. Thus there's no need for a callback, and I'm not tripping
>> over backchannel misbehavior.
>
> Make sure the testfile is of sufficient size. I'm not sure what the
> threshold is, but if it's too small, it'll just do a synchronous copy
> for me as well.
> I'm using a 50MB file in my tests.
>
>> The export I'm using is an xfs filesystem. Did you already
>> report the filesystem type you're testing against? I can't
>> find it in the thread.
>>
>> If there's a way to force an offload-style COPY, let me know.
>>
>> Oh. Also I looked at what might have been pulled into the
>> linux-5.12.y kernel between .12 and .19, and I don't see
>> anything that's especially relevant to either COPY_OFFLOAD
>> or backchannel.
>
> I'm observing this with both an ext4 and zfs filesystem.
> Can easily test xfs as well if desired.

Re-ran the test with xfs instead of ext4, and indeed, the issue does not
manifest in that case.
I guess xfs is lacking some capability needed for server-side copy to
work properly.

> Are you testing this on a normal network, or with RDMA? With normal tcp,
> I also can't observe this issue(it doesn't time out the backchannel in
> the first place), it only happens in RDMA mode.
> I'm using Mellanox ConnectX-4 cards in IB mode for my tests.
>


Attachments:
smime.p7s (4.39 kB)
S/MIME Cryptographic Signature

2021-08-18 16:56:17

by Chuck Lever III

[permalink] [raw]
Subject: Re: Spurious instability with NFSoRDMA under moderate load



> On Aug 17, 2021, at 7:05 PM, Dai Ngo <[email protected]> wrote:
>
>
> On 8/17/21 3:55 PM, [email protected] wrote:
>>
>> On 8/17/21 2:51 PM, Timo Rothenpieler wrote:
>>> On 17.08.2021 23:08, Chuck Lever III wrote:
>>>> I tried reproducing this using your 'xfs_io -fc "copy_range
>>>> testfile" testfile.copy' reproducer, but couldn't.
>>>>
>>>> A network capture shows that the client tries CLONE first. The
>>>> server reports that's not supported, so the client tries COPY.
>>>> The COPY works, and the reply shows that the COPY was synchro-
>>>> nous. Thus there's no need for a callback, and I'm not tripping
>>>> over backchannel misbehavior.
>>>
>>> Make sure the testfile is of sufficient size. I'm not sure what the threshold is, but if it's too small, it'll just do a synchronous copy for me as well.
>>> I'm using a 50MB file in my tests.
>>
>> The threshold for intra-server copy is (2*rsize) of the source server.
>
> The threshold for inter-server copy is (14*rsize). This restriction is to be removed in the next merge.
>
> e-Dai
>
>> Any copy smaller than this is done synchronously.

Thank you, Dai, that helped. I was able to reproduce with a 200MB
source file on an xfs-backed export. Looking at the issue now.


--
Chuck Lever



2021-08-20 15:13:13

by Chuck Lever III

[permalink] [raw]
Subject: Re: Spurious instability with NFSoRDMA under moderate load


> On Aug 16, 2021, at 9:26 AM, Chuck Lever III <[email protected]> wrote:
>
>> On Aug 12, 2021, at 2:13 PM, Timo Rothenpieler <[email protected]> wrote:
>>
>> On 11.08.2021 19:30, Chuck Lever III wrote:
>>>> On Aug 11, 2021, at 12:20 PM, Timo Rothenpieler <[email protected]> wrote:
>>>>
>>>> resulting dmesg and trace logs of both client and server are attached.
>>>>
>>>> Test procedure:
>>>>
>>>> - start tracing on client and server
>>>> - mount NFS on client
>>>> - immediately run 'xfs_io -fc "copy_range testfile" testfile.copy' (which succeeds)
>>>> - wait 10~15 minutes for the backchannel to time out (still running 5.12.19 with the fix for that reverted)
>>>> - run xfs_io command again, getting stuck now
>>>> - let it sit there stuck for a minute, then cancel it
>>>> - run the command again
>>>> - while it's still stuck, finished recording the logs and traces
>>> The server tries to send CB_OFFLOAD when the offloaded copy
>>> completes, but finds the backchannel transport is not connected.
>>> The server can't report the problem until the client sends a
>>> SEQUENCE operation, but there's really no other traffic going
>>> on, so it just waits.
>>> The client eventually sends a singleton SEQUENCE to renew its
>>> lease. The server replies with the SEQ4_STATUS_BACKCHANNEL_FAULT
>>> flag set at that point. Client's recovery is to destroy that
>>> session and create a new one. That appears to be successful.
>>
>> If it re-created the session and the backchannel, shouldn't that mean that after I cancel the first stuck xfs_io command, and try it again immediately (before the backchannel had a chance to timeout again) it should work?
>
> I would guess that yes, subsequent COPY_OFFLOAD requests
> should work unless the backchannel has already timed out
> again.
>
> I was about to use your reproducer myself, but a storm
> came through on Thursday and knocked out my power and
> internet. I'm still waiting for restoration.
>
> Once power is restored I can chase this a little more
> efficiently in my lab.

OK, I think the issue with this reproducer was resolved
completely with 6820bf77864d.

I went back and reviewed the traces from when the client got
stuck after a long uptime. This looks very different from
what we're seeing with 6820bf77864d. It involves CB_PATH_DOWN
and BIND_CONN_TO_SESSION, which is a different scenario. Long
story short, I don't think we're getting any more value by
leaving 6820bf77864d reverted.

Can you re-apply that commit on your server, and then when
the client hangs again, please capture with:

# trace-cmd record -e nfsd -e sunrpc -e rpcrdma

I'd like to see why the client's BIND_CONN_TO_SESSION fails
to repair the backchannel session.

--
Chuck Lever



2021-08-20 16:22:26

by Timo Rothenpieler

[permalink] [raw]
Subject: Re: Spurious instability with NFSoRDMA under moderate load

On 20.08.2021 17:12, Chuck Lever III wrote:
> OK, I think the issue with this reproducer was resolved
> completely with 6820bf77864d.
>
> I went back and reviewed the traces from when the client got
> stuck after a long uptime. This looks very different from
> what we're seeing with 6820bf77864d. It involves CB_PATH_DOWN
> and BIND_CONN_TO_SESSION, which is a different scenario. Long
> story short, I don't think we're getting any more value by
> leaving 6820bf77864d reverted.
>
> Can you re-apply that commit on your server, and then when
> the client hangs again, please capture with:
>
> # trace-cmd record -e nfsd -e sunrpc -e rpcrdma
>
> I'd like to see why the client's BIND_CONN_TO_SESSION fails
> to repair the backchannel session.
>
> --
> Chuck Lever

The main system is still running a regular 5.12.19, but so far the issue
has not come up again.
Given how long it takes, I'll report whenever it happens next. Sadly, I
have not yet found a way to provoke the hang.


Attachments:
smime.p7s (4.39 kB)
S/MIME Cryptographic Signature

2021-10-29 15:16:06

by Chuck Lever III

[permalink] [raw]
Subject: Re: Spurious instability with NFSoRDMA under moderate load

Hi Timo-

> On Oct 29, 2021, at 7:47 AM, Timo Rothenpieler <[email protected]> wrote:
>
> On 20/08/2021 17:12, Chuck Lever III wrote:
>> OK, I think the issue with this reproducer was resolved
>> completely with 6820bf77864d.
>> I went back and reviewed the traces from when the client got
>> stuck after a long uptime. This looks very different from
>> what we're seeing with 6820bf77864d. It involves CB_PATH_DOWN
>> and BIND_CONN_TO_SESSION, which is a different scenario. Long
>> story short, I don't think we're getting any more value by
>> leaving 6820bf77864d reverted.
>> Can you re-apply that commit on your server, and then when
>> the client hangs again, please capture with:
>> # trace-cmd record -e nfsd -e sunrpc -e rpcrdma
>> I'd like to see why the client's BIND_CONN_TO_SESSION fails
>> to repair the backchannel session.
>
> Happened again today, after a long time of no issues.
> Still on 5.12.19, since the system did not have a chance for a bigger maintenance window yet.
>
> Attached are traces from both client and server, while the client is trying to do the usual xfs_io copy_range.
> The system also has a bunch of other users and nodes working on it at this time, so there's a good chance for unrelated noise in the traces.
>
> The affected client is 10.110.10.251.
> Other clients are working just fine, it's only this one client that's affected.
>
> There was also quite a bit of heavy IO work going on on the Cluster, which I think coincides with the last couple times this happened as well.<nfstrace.tar.xz>

Thanks for the report. We believe this issue has been addressed in v5.15-rc:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=02579b2ff8b0becfb51d85a975908ac4ab15fba8


--
Chuck Lever



2021-10-29 18:18:22

by Timo Rothenpieler

[permalink] [raw]
Subject: Re: Spurious instability with NFSoRDMA under moderate load

On 29.10.2021 17:14, Chuck Lever III wrote:
> Hi Timo-
>
>> On Oct 29, 2021, at 7:47 AM, Timo Rothenpieler <[email protected]> wrote:
>>
>> On 20/08/2021 17:12, Chuck Lever III wrote:
>>> OK, I think the issue with this reproducer was resolved
>>> completely with 6820bf77864d.
>>> I went back and reviewed the traces from when the client got
>>> stuck after a long uptime. This looks very different from
>>> what we're seeing with 6820bf77864d. It involves CB_PATH_DOWN
>>> and BIND_CONN_TO_SESSION, which is a different scenario. Long
>>> story short, I don't think we're getting any more value by
>>> leaving 6820bf77864d reverted.
>>> Can you re-apply that commit on your server, and then when
>>> the client hangs again, please capture with:
>>> # trace-cmd record -e nfsd -e sunrpc -e rpcrdma
>>> I'd like to see why the client's BIND_CONN_TO_SESSION fails
>>> to repair the backchannel session.
>>
>> Happened again today, after a long time of no issues.
>> Still on 5.12.19, since the system did not have a chance for a bigger maintenance window yet.
>>
>> Attached are traces from both client and server, while the client is trying to do the usual xfs_io copy_range.
>> The system also has a bunch of other users and nodes working on it at this time, so there's a good chance for unrelated noise in the traces.
>>
>> The affected client is 10.110.10.251.
>> Other clients are working just fine, it's only this one client that's affected.
>>
>> There was also quite a bit of heavy IO work going on on the Cluster, which I think coincides with the last couple times this happened as well.<nfstrace.tar.xz>
>
> Thanks for the report. We believe this issue has been addressed in v5.15-rc:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=02579b2ff8b0becfb51d85a975908ac4ab15fba8

5.15 is a little too bleeding edge for my comfort to roll out on a
production system.
But the patch applies cleanly on top of 5.12.19. So I pulled it and am
now running the resulting kernel on all clients and the server(s).

Hopefully won't see this happen again from now on, thanks!


Timo


Attachments:
smime.p7s (4.39 kB)
S/MIME Cryptographic Signature

2021-10-29 19:07:35

by Chuck Lever III

[permalink] [raw]
Subject: Re: Spurious instability with NFSoRDMA under moderate load



> On Oct 29, 2021, at 2:17 PM, Timo Rothenpieler <[email protected]> wrote:
>
> On 29.10.2021 17:14, Chuck Lever III wrote:
>> Hi Timo-
>>> On Oct 29, 2021, at 7:47 AM, Timo Rothenpieler <[email protected]> wrote:
>>>
>>> On 20/08/2021 17:12, Chuck Lever III wrote:
>>>> OK, I think the issue with this reproducer was resolved
>>>> completely with 6820bf77864d.
>>>> I went back and reviewed the traces from when the client got
>>>> stuck after a long uptime. This looks very different from
>>>> what we're seeing with 6820bf77864d. It involves CB_PATH_DOWN
>>>> and BIND_CONN_TO_SESSION, which is a different scenario. Long
>>>> story short, I don't think we're getting any more value by
>>>> leaving 6820bf77864d reverted.
>>>> Can you re-apply that commit on your server, and then when
>>>> the client hangs again, please capture with:
>>>> # trace-cmd record -e nfsd -e sunrpc -e rpcrdma
>>>> I'd like to see why the client's BIND_CONN_TO_SESSION fails
>>>> to repair the backchannel session.
>>>
>>> Happened again today, after a long time of no issues.
>>> Still on 5.12.19, since the system did not have a chance for a bigger maintenance window yet.
>>>
>>> Attached are traces from both client and server, while the client is trying to do the usual xfs_io copy_range.
>>> The system also has a bunch of other users and nodes working on it at this time, so there's a good chance for unrelated noise in the traces.
>>>
>>> The affected client is 10.110.10.251.
>>> Other clients are working just fine, it's only this one client that's affected.
>>>
>>> There was also quite a bit of heavy IO work going on on the Cluster, which I think coincides with the last couple times this happened as well.<nfstrace.tar.xz>
>> Thanks for the report. We believe this issue has been addressed in v5.15-rc:
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=02579b2ff8b0becfb51d85a975908ac4ab15fba8
>
> 5.15 is a little too bleeding edge for my comfort to roll out on a production system.
> But the patch applies cleanly on top of 5.12.19. So I pulled it and am now running the resulting kernel on all clients and the server(s).

Yup, that's the best we can do for now. Thanks for testing!


> Hopefully won't see this happen again from now on, thanks!
>
>
> Timo

--
Chuck Lever