2021-02-14 03:49:33

by Timo Rothenpieler

Subject: copy_file_range() infinitely hangs on NFSv4.2 over RDMA

On our file server, running a few-weeks-old 5.10, we are running into a
weird issue with NFS 4.2 Server-Side Copy and RDMA (and ZFS, though I'm
not sure how relevant that is to the issue).
The servers are connected via InfiniBand, on a Mellanox ConnectX-4 card,
using the mlx5 driver.

Anything using the copy_file_range() syscall to copy stuff just hangs.
In strace, the syscall never returns.

Simple way to reproduce on the client:
> xfs_io -fc "pwrite 0 1M" testfile
> xfs_io -fc "copy_range testfile" testfile.copy

The second call just never exits. It sits in S+ state, with no CPU
usage, and can easily be killed via Ctrl+C.
I let it sit for a couple of hours as well; it never seems to complete.

Some more observations about it:

If I do a fresh reboot of the client, the operation works fine for a
short while (like, 10~15 minutes). No load is on the system during that
time, it's effectively idle.

The operation actually does successfully copy all data. The size and
checksum of the target file are as expected. It just never returns.

This only happens when mounting via RDMA. When mounting the same NFS
share via plain TCP, the operation works reliably.

I had this issue with kernel 5.4 already, and had hoped that 5.10 might
have fixed it, but unfortunately it didn't.

I tried two servers and 30 different client machines, and they all exhibit
the exact same behaviour. So I'd tentatively rule out a hardware issue.


Any pointers on how to debug or maybe even fix this?



Thanks,
Timo


2021-02-16 20:14:49

by Olga Kornievskaia

Subject: Re: copy_file_range() infinitely hangs on NFSv4.2 over RDMA

Hi Timo,

Can you get a network trace? Also, you say that copy_file_range()
(after what looks like a successful copy) never returns and the
application hangs. Can you get a sysrq dump of the process's stack?
(Run "echo t > /proc/sysrq-trigger", see what gets dumped into
/var/log/messages, locate your application, and report what the stack
says.)
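For reference, a minimal sketch of capturing that dump on the client (requires root; the task name and the sed end marker are just examples for this reproducer):

```shell
# Dump the stacks of all tasks into the kernel ring buffer,
# then extract the stuck process's stack from the log.
echo t > /proc/sysrq-trigger
dmesg | sed -n '/task:xfs_io/,/entry_SYSCALL/p'
```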

On Sat, Feb 13, 2021 at 10:41 PM Timo Rothenpieler
<[email protected]> wrote:
>
> On our Fileserver, running a few weeks old 5.10, we are running into a
> weird issue with NFS 4.2 Server-Side Copy and RDMA (and ZFS, though I'm
> not sure how relevant that is to the issue).
> The servers are connected via InfiniBand, on a Mellanox ConnectX-4 card,
> using the mlx5 driver.
>
> Anything using the copy_file_range() syscall to copy stuff just hangs.
> In strace, the syscall never returns.
>
> Simple way to reproduce on the client:
> > xfs_io -fc "pwrite 0 1M" testfile
> > xfs_io -fc "copy_range testfile" testfile.copy
>
> The second call just never exits. It sits in S+ state, with no CPU
> usage, and can easily be killed via Ctrl+C.
> I let it sit for a couple hours as well, it does not seem to ever complete.
>
> Some more observations about it:
>
> If I do a fresh reboot of the client, the operation works fine for a
> short while (like, 10~15 minutes). No load is on the system during that
> time, it's effectively idle.
>
> The operation actually does successfully copy all data. The size and
> checksum of the target file is as expected. It just never returns.
>
> This only happens when mounting via RDMA. Mounting the same NFS share
> via plain TCP has the operation work reliably.
>
> Had this issue with Kernel 5.4 already, and had hoped that 5.10 might
> have fixed it, but unfortunately it didn't.
>
> I tried two server and 30 different client machines, they all exhibit
> the exact same behaviour. So I'd carefully rule out a hardware issue.
>
>
> Any pointers on how to debug or maybe even fix this?
>
>
>
> Thanks,
> Timo

2021-02-16 20:40:25

by Timo Rothenpieler

Subject: Re: copy_file_range() infinitely hangs on NFSv4.2 over RDMA

I can't get a network trace (I assume just TCP/20049 is fine, and not
also some RDMA trace?) right now, but I will once a user has finished
their work on the machine.

The stack of the stuck process looks as follows:

> task:xfs_io state:S stack: 0 pid:841684 ppid:841677 flags:0x00004001
> Call Trace:
> __schedule+0x3e9/0x660
> ? rpc_task_release_transport+0x42/0x60
> schedule+0x46/0xb0
> schedule_timeout+0x20e/0x2a0
> ? nfs4_call_sync_custom+0x23/0x30
> wait_for_completion_interruptible+0x80/0x120
> nfs42_proc_copy+0x505/0xb00
> ? find_get_pages_range_tag+0x211/0x270
> ? enqueue_task_fair+0xb5/0x500
> ? __filemap_fdatawait_range+0x66/0xf0
> nfs4_copy_file_range+0x198/0x240
> vfs_copy_file_range+0x39a/0x470
> ? ptrace_do_notify+0x82/0xb0
> __x64_sys_copy_file_range+0xd6/0x210
> do_syscall_64+0x33/0x40
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x7f8eead1e259
> RSP: 002b:00007ffcb1bb5778 EFLAGS: 00000206 ORIG_RAX: 0000000000000146
> RAX: ffffffffffffffda RBX: 00007ffcb1bb5790 RCX: 00007f8eead1e259
> RDX: 0000000000000003 RSI: 00007ffcb1bb5790 RDI: 0000000000000004
> RBP: 0000000020000000 R08: 0000000020000000 R09: 0000000000000000
> R10: 00007ffcb1bb5798 R11: 0000000000000206 R12: 00007ffcb1bb5798
> R13: 0000000000000004 R14: 0000000000000001 R15: 0000000000000000






On 16.02.2021 21:12, Olga Kornievskaia wrote:
> Hi Timo,
>
> Can you get a network trace? Also, you say that the copy_file_range()
> (after what looks like a successful copy) never returns (and
> application hangs), can you get a sysrq output of what the process's
> stack (echo t > /proc/sysrq-trigger and see what gets dumped into the
> var log messages and locate your application and report what the stack
> says)?
>
> On Sat, Feb 13, 2021 at 10:41 PM Timo Rothenpieler
> <[email protected]> wrote:
>>
>> On our Fileserver, running a few weeks old 5.10, we are running into a
>> weird issue with NFS 4.2 Server-Side Copy and RDMA (and ZFS, though I'm
>> not sure how relevant that is to the issue).
>> The servers are connected via InfiniBand, on a Mellanox ConnectX-4 card,
>> using the mlx5 driver.
>>
>> Anything using the copy_file_range() syscall to copy stuff just hangs.
>> In strace, the syscall never returns.
>>
>> Simple way to reproduce on the client:
>> > xfs_io -fc "pwrite 0 1M" testfile
>> > xfs_io -fc "copy_range testfile" testfile.copy
>>
>> The second call just never exits. It sits in S+ state, with no CPU
>> usage, and can easily be killed via Ctrl+C.
>> I let it sit for a couple hours as well, it does not seem to ever complete.
>>
>> Some more observations about it:
>>
>> If I do a fresh reboot of the client, the operation works fine for a
>> short while (like, 10~15 minutes). No load is on the system during that
>> time, it's effectively idle.
>>
>> The operation actually does successfully copy all data. The size and
>> checksum of the target file is as expected. It just never returns.
>>
>> This only happens when mounting via RDMA. Mounting the same NFS share
>> via plain TCP has the operation work reliably.
>>
>> Had this issue with Kernel 5.4 already, and had hoped that 5.10 might
>> have fixed it, but unfortunately it didn't.
>>
>> I tried two server and 30 different client machines, they all exhibit
>> the exact same behaviour. So I'd carefully rule out a hardware issue.
>>
>>
>> Any pointers on how to debug or maybe even fix this?
>>
>>
>>
>> Thanks,
>> Timo

2021-02-17 22:42:52

by Olga Kornievskaia

Subject: Re: copy_file_range() infinitely hangs on NFSv4.2 over RDMA

On Tue, Feb 16, 2021 at 5:27 PM Timo Rothenpieler <[email protected]> wrote:
>
> On 16.02.2021 21:37, Timo Rothenpieler wrote:
> > I can't get a network (I assume just TCP/20049 is fine, and not also
> > some RDMA trace?) right now, but I will once a user has finished their
> > work on the machine.
>
> There wasn't any TCP traffic to dump on the NFSoRDMA Port, probably
> because everything is handled via RDMA/IB.

Yeah, I'm not sure if tcpdump can snoop on IB traffic. I know that
upstream tcpdump can snoop on Mellanox RDMA cards (but I only know
about RoCE mode).

> But I recorded a trace log of rpcrdma and sunrpc observing the situation.
>
> To me it looks like the COPY task (task:[email protected]) completes successfully?
> The compressed trace.dat is attached.

I'm having a hard time reproducing the problem. But I only tried
"xfs", "btrfs", and "ext4" (the first two send a CLONE, since the file
system supports it; the last one exercises a copy). In all my tries your
xfs_io commands succeed. The differences between our environments are
(1) ZFS vs. (xfs, etc.) and (2) IB vs. RoCE. The question is: does any
copy_file_range() work over RDMA/IB? One thing to try is a synchronous
copy: create a small file (10 bytes) and copy it. In case this is the
copy and the callback racing, also do a really large copy: create a
>=1GB file and copy it. That will be an async copy, but without the
racy condition. Can you try those two examples for me?
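In the style of the reproducer earlier in the thread, the two suggested experiments might look like this (file names are arbitrary; run on the NFS-over-RDMA mount):

```shell
# Small file: below two RPC payloads, so the copy should be synchronous.
xfs_io -fc "pwrite 0 10" small && xfs_io -fc "copy_range small" small.copy

# Large file: well past the threshold, so the copy should be asynchronous.
xfs_io -fc "pwrite 0 1g" large && xfs_io -fc "copy_range large" large.copy
```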

I'm not sure how useful tracepoints are here. The result of the COPY
isn't interesting, as this is an async copy. The server should have
sent a CB_COMPOUND with the copy's results. The process stack tells me
that COPY is waiting for the results (waiting for the callback). So
the question is: is there a problem sending a callback over RDMA/IB?
Or did the client receive it and somehow miss it? We really do need
some better tracepoints in the copy (but we don't have them
currently).

Would you be willing to install the upstream libpcap/tcpdump to see if
it can capture RDMA/IB traffic? Or perhaps Chuck knows for sure that it
doesn't work?

Thank you.

2021-02-18 01:25:25

by Timo Rothenpieler

Subject: Re: copy_file_range() infinitely hangs on NFSv4.2 over RDMA

On 17.02.2021 23:37, Olga Kornievskaia wrote:
> On Tue, Feb 16, 2021 at 5:27 PM Timo Rothenpieler <[email protected]> wrote:
>>
>> On 16.02.2021 21:37, Timo Rothenpieler wrote:
>>> I can't get a network (I assume just TCP/20049 is fine, and not also
>>> some RDMA trace?) right now, but I will once a user has finished their
>>> work on the machine.
>>
>> There wasn't any TCP traffic to dump on the NFSoRDMA Port, probably
>> because everything is handled via RDMA/IB.
>
> Yeah, I'm not sure if tcpdump can snoop on the IB traffic. I know that
> upstream tcpdump can snoop on RDMA mellanox card (but I only know
> about the Roce mode).

I managed to get https://github.com/Mellanox/ibdump working. Attached is
what it records when I run the xfs_io copy_range command that gets
stuck (sniffer.pcap).
Additionally, I rebooted the client machine and captured the traffic
when it does a then-successful copy during the first few minutes of
uptime (sniffer2.pcap).

Both those commands were run on the same 500M file.
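(For reference, such a capture is roughly the following; the device name and port number are assumptions for this ConnectX-4 setup, check with ibstat first:)

```shell
# List the HCAs present, then sniff port 1 of the mlx5 device while
# reproducing the hang, writing a wireshark-readable capture.
ibstat
ibdump -d mlx5_0 -i 1 -w sniffer.pcap
```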

>> But I recorded a trace log of rpcrdma and sunrpc observing the situation.
>>
>> To me it looks like the COPY task (task:[email protected]) completes successfully?
>> The compressed trace.dat is attached.
>
> I'm having a hard time reproducing the problem. But I only tried
> "xfs", "btrfs", "ext4" (first two send a CLONE since the file system
> supports it), the last one exercises a copy. In all my tries your

I can also reproduce this on a test NFS share from an ext4 filesystem.
Have not tested xfs yet.

> xfs_io commands succeed. The differences between our environments are
> (1) ZFS vs (xfs, etc) and (2) IB vs RoCE. Question is: does any
> copy_file_range work over RDMA/IB. One thing to try a synchronous

It works, on any size of file, when the client machine is freshly booted
(within its first 10~30 minutes of uptime).

> copy: create a small file 10bytes and do a copy. Is this the case
> where we have copy and the callback racing, so instead do a really
> large copy: create a >=1GB file and do a copy. that will be an async
> copy but will not have a racy condition. Can you try those 2 examples
> for me?

I have observed in the past that the xfs_io copy is more likely to
succeed the smaller the file is, though I could not make out a definite
pattern.

I did some bisecting on the number of bytes and came up with the following:
a 2097153-byte file gets stuck, while a 2097152 (=2^21) byte one
still works.

It's been stable at that cutoff point for a while now, so I think that's
actually the point where it starts happening, and the different behaviour
I saw in the past was an issue in my testing.

> Not sure how useful tracepoints here are. The results of the COPY
> isn't interesting as this is an async copy. The server should have
> sent a CB_COMPOUND with the copy's results. The process stack tells me
> that COPY is waiting for the results (waiting for the callback). So
> the question is there a problem of sending a callback over RDMA/IB? Or
> did the client receive it and missed it somehow? We really do need
> some better tracepoints in the copy (but we don't have them
> currently).
>
> Would you be willing to install the upstream libpcap/tcpdump to see if
> it can capture RDMA/IB traffic or perhaps Chunk knows that it doesn't
> work for sure?

Managed to get ibdump working, as stated above.


Attachments:
sniffer.pcap (18.17 kB)
sniffer2.pcap (21.30 kB)

2021-02-18 03:56:13

by Olga Kornievskaia

Subject: Re: copy_file_range() infinitely hangs on NFSv4.2 over RDMA

On Wed, Feb 17, 2021 at 8:12 PM Timo Rothenpieler <[email protected]> wrote:
>
> On 17.02.2021 23:37, Olga Kornievskaia wrote:
> > On Tue, Feb 16, 2021 at 5:27 PM Timo Rothenpieler <[email protected]> wrote:
> >>
> >> On 16.02.2021 21:37, Timo Rothenpieler wrote:
> >>> I can't get a network (I assume just TCP/20049 is fine, and not also
> >>> some RDMA trace?) right now, but I will once a user has finished their
> >>> work on the machine.
> >>
> >> There wasn't any TCP traffic to dump on the NFSoRDMA Port, probably
> >> because everything is handled via RDMA/IB.
> >
> > Yeah, I'm not sure if tcpdump can snoop on the IB traffic. I know that
> > upstream tcpdump can snoop on RDMA mellanox card (but I only know
> > about the Roce mode).
>
> I managed to get https://github.com/Mellanox/ibdump working. Attached is
> what it records when I run the xfs_io copy_range command that gets
> stuck(sniffer.pcap).
> Additionally, I rebooted the client machine, and captured the traffic
> when it does a then successful copy during the first few minutes of
> uptime(sniffer2.pcap).
>
> Both those commands were run on a the same 500M file.
>
> >> But I recorded a trace log of rpcrdma and sunrpc observing the situation.
> >>
> >> To me it looks like the COPY task (task:[email protected]) completes successfully?
> >> The compressed trace.dat is attached.
> >
> > I'm having a hard time reproducing the problem. But I only tried
> > "xfs", "btrfs", "ext4" (first two send a CLONE since the file system
> > supports it), the last one exercises a copy. In all my tries your
>
> I can also reproduce this on a test NFS share from an ext4 filesystem.
> Have not tested xfs yet.
>
> > xfs_io commands succeed. The differences between our environments are
> > (1) ZFS vs (xfs, etc) and (2) IB vs RoCE. Question is: does any
> > copy_file_range work over RDMA/IB. One thing to try a synchronous
>
> It works, on any size of file, when the client machine is freshly booted
> (within its first 10~30 minutes of uptime).

Reboot of the client or the server machine?

> > copy: create a small file 10bytes and do a copy. Is this the case
> > where we have copy and the callback racing, so instead do a really
> > large copy: create a >=1GB file and do a copy. that will be an async
> > copy but will not have a racy condition. Can you try those 2 examples
> > for me?
>
> I have observed in the past, that the xfs_io copy is more likely to
> succeed the smaller the file is, though I did not make out a definite
> pattern.

That's because small files are done with a synchronous copy. Your
network captures show (while not fully decodable by Wireshark, because
the trace lacks the connection establishment Wireshark needs to parse
the RDMA replies) that no callback is sent from the server, and thus
the client is stuck waiting for it. So the focus should be on the
server side: why is it not sending the callback? There are some error
conditions on the server under which it will not be able to send a
callback. What springs to mind looking through the code is that somehow
the server thinks there is no callback channel. Can you turn on the
nfsd tracepoints, please? I wonder if we can see something of interest
there.

The logic for determining whether the copy is done sync or async
depends on the server's rsize: if the file is smaller than 2 RPC
payloads, it's sent synchronously.
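Assuming an rsize of 1 MiB for this mount (an assumption; the actual value can be checked with "nfsstat -m" on the client), that threshold works out to exactly the boundary reported in the bisection:

```shell
# Two RPC payloads at rsize=1048576 (1 MiB, assumed) give the sync/async
# cutoff; files strictly larger than this go through the async copy path.
rsize=1048576
echo $((2 * rsize))   # prints 2097152, matching the observed boundary
```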

> I did some bisecting on the number of bytes, and came up with the following:
> A 2097153 byte sized file gets stuck, while a 2097152(=2^21) sized one
> still works.
>
> It's been stable at that cutoff point for a while now, so I think that's
> actually the point where it starts happening, and different behaviour I
> saw in the past was an issue in my testing.
>
> > Not sure how useful tracepoints here are. The results of the COPY
> > isn't interesting as this is an async copy. The server should have
> > sent a CB_COMPOUND with the copy's results. The process stack tells me
> > that COPY is waiting for the results (waiting for the callback). So
> > the question is there a problem of sending a callback over RDMA/IB? Or
> > did the client receive it and missed it somehow? We really do need
> > some better tracepoints in the copy (but we don't have them
> > currently).
> >
> > Would you be willing to install the upstream libpcap/tcpdump to see if
> > it can capture RDMA/IB traffic or perhaps Chunk knows that it doesn't
> > work for sure?
>
> Managed to get ibdump working, as stated above.

2021-02-18 17:05:23

by Timo Rothenpieler

Subject: Re: copy_file_range() infinitely hangs on NFSv4.2 over RDMA

On 18.02.2021 04:52, Olga Kornievskaia wrote:
> On Wed, Feb 17, 2021 at 8:12 PM Timo Rothenpieler <[email protected]> wrote:
>>
>> On 17.02.2021 23:37, Olga Kornievskaia wrote:
>>> On Tue, Feb 16, 2021 at 5:27 PM Timo Rothenpieler <[email protected]> wrote:
>>>>
>>>> On 16.02.2021 21:37, Timo Rothenpieler wrote:
>>>>> I can't get a network (I assume just TCP/20049 is fine, and not also
>>>>> some RDMA trace?) right now, but I will once a user has finished their
>>>>> work on the machine.
>>>>
>>>> There wasn't any TCP traffic to dump on the NFSoRDMA Port, probably
>>>> because everything is handled via RDMA/IB.
>>>
>>> Yeah, I'm not sure if tcpdump can snoop on the IB traffic. I know that
>>> upstream tcpdump can snoop on RDMA mellanox card (but I only know
>>> about the Roce mode).
>>
>> I managed to get https://github.com/Mellanox/ibdump working. Attached is
>> what it records when I run the xfs_io copy_range command that gets
>> stuck(sniffer.pcap).
>> Additionally, I rebooted the client machine, and captured the traffic
>> when it does a then successful copy during the first few minutes of
>> uptime(sniffer2.pcap).
>>
>> Both those commands were run on a the same 500M file.
>>
>>>> But I recorded a trace log of rpcrdma and sunrpc observing the situation.
>>>>
>>>> To me it looks like the COPY task (task:[email protected]) completes successfully?
>>>> The compressed trace.dat is attached.
>>>
>>> I'm having a hard time reproducing the problem. But I only tried
>>> "xfs", "btrfs", "ext4" (first two send a CLONE since the file system
>>> supports it), the last one exercises a copy. In all my tries your
>>
>> I can also reproduce this on a test NFS share from an ext4 filesystem.
>> Have not tested xfs yet.
>>
>>> xfs_io commands succeed. The differences between our environments are
>>> (1) ZFS vs (xfs, etc) and (2) IB vs RoCE. Question is: does any
>>> copy_file_range work over RDMA/IB. One thing to try a synchronous
>>
>> It works, on any size of file, when the client machine is freshly booted
>> (within its first 10~30 minutes of uptime).
>
> Reboot of the client or the server machine?

Just the client. The server is in production use, so I can't freely
reboot it without organizing a maintenance window a couple of days ahead.

>>> copy: create a small file 10bytes and do a copy. Is this the case
>>> where we have copy and the callback racing, so instead do a really
>>> large copy: create a >=1GB file and do a copy. that will be an async
>>> copy but will not have a racy condition. Can you try those 2 examples
>>> for me?
>>
>> I have observed in the past, that the xfs_io copy is more likely to
>> succeed the smaller the file is, though I did not make out a definite
>> pattern.
>
> That's because small files are done with a synchronous copy. Your
> network captures (while not fully decodable by wireshark because the
> trace lacks the connection establishment needed for the wireshark to
> parse the RDMA replies) the fact that no callback is sent from the
> server and thus the client is stuck waiting for it. So the focus
> should be on the server piece as to why it's not sending it. There are
> some error conditions on the server that if that happens, it will not
> be able to send a callback. What springs to mind looking thru the code
> is that somehow the server thinks there is no callback channel. Can
> you turn on nfsd tracepoints please? I wonder if we can see something
> of interest there.

On the server, I guess?
I'll have to figure out a way to do that while it's not in active use.
Otherwise the trace log will be enormous and contain a lot of noise from
general system use.

I'll report back once I've got a trace log.

> The logic for determining whether the copy is sent sync or async
> depends on server's rsize, if a file smaller than 2 RPC payloads, it's
> sent synchronously.




2021-02-18 19:01:53

by Olga Kornievskaia

Subject: Re: copy_file_range() infinitely hangs on NFSv4.2 over RDMA

On Thu, Feb 18, 2021 at 10:55 AM Timo Rothenpieler
<[email protected]> wrote:
>
> On 18.02.2021 14:28, Timo Rothenpieler wrote:
> > I'll report back once I got a trace log.
>
> Find the trace log attached.
> It also grabbed quite a bit of unrelated noise, but I think it's still
> helpful:
>
> > nfsd-7226 [025] 2228027.306471: nfsd_compound: xid=0xbca1d6e9 opcnt=5
> > nfsd-7226 [025] 2228027.306472: nfsd_compound_status: op=1/5 OP_SEQUENCE status=0
> > nfsd-7226 [025] 2228027.306479: nfsd_compound_status: op=2/5 OP_PUTFH status=0
> > nfsd-7226 [025] 2228027.306480: nfsd_compound_status: op=3/5 OP_SAVEFH status=0
> > nfsd-7226 [025] 2228027.306483: nfsd_compound_status: op=4/5 OP_PUTFH status=0
> > nfsd-7226 [025] 2228027.306590: nfsd_compound_status: op=5/5 OP_COPY status=0
> > nfsd-7226 [025] 2228027.306702: nfsd_compound: xid=0xbda1d6e9 opcnt=2
> > nfsd-7226 [025] 2228027.306703: nfsd_compound_status: op=1/2 OP_SEQUENCE status=0
> > nfsd-7226 [025] 2228027.306703: nfsd_compound_status: op=2/2 OP_TEST_STATEID status=0
> > nfsd-7226 [025] 2228027.306741: nfsd_compound: xid=0xbea1d6e9 opcnt=2
> > nfsd-7226 [025] 2228027.306742: nfsd_compound_status: op=1/2 OP_SEQUENCE status=0
> > nfsd-7226 [025] 2228027.306747: nfsd_compound_status: op=2/2 OP_TEST_STATEID status=0
> > nfsd-7226 [025] 2228027.306791: nfsd_compound: xid=0xbfa1d6e9 opcnt=2
> > nfsd-7226 [025] 2228027.306792: nfsd_compound_status: op=1/2 OP_SEQUENCE status=0
> > nfsd-7226 [025] 2228027.306793: nfsd_compound_status: op=2/2 OP_TEST_STATEID status=0
> > nfsd-7226 [025] 2228027.306829: nfsd_compound: xid=0xc0a1d6e9 opcnt=2
> > nfsd-7226 [025] 2228027.306830: nfsd_compound_status: op=1/2 OP_SEQUENCE status=0
> > nfsd-7226 [025] 2228027.306831: nfsd_compound_status: op=2/2 OP_TEST_STATEID status=0
> > nfsd-7226 [025] 2228027.306865: nfsd_compound: xid=0xc1a1d6e9 opcnt=2
> > nfsd-7226 [025] 2228027.306866: nfsd_compound_status: op=1/2 OP_SEQUENCE status=0
> > nfsd-7226 [025] 2228027.306866: nfsd_compound_status: op=2/2 OP_TEST_STATEID status=0
> > <...>-2019374 [012] 2228027.307694: nfsd_file_put: hash=0x275 inode=0x0xffffa0c8c35ab490 ref=4 flags=HASHED|REFERENCED may=READ file=0xffffa0e819758800
> > <...>-2019374 [012] 2228027.307694: nfsd_file_put: hash=0x365 inode=0x0xffffa0d70dd5dec0 ref=5 flags=HASHED|REFERENCED may=READ|WRITE file=0xffffa0e819759000
> > <...>-2019374 [012] 2228027.307701: nfsd_file_put: hash=0x365 inode=0x0xffffa0d70dd5dec0 ref=4 flags=HASHED|REFERENCED may=READ|WRITE file=0xffffa0e819759000
> > <...>-2019374 [012] 2228027.307701: nfsd_file_put: hash=0x275 inode=0x0xffffa0c8c35ab490 ref=3 flags=HASHED|REFERENCED may=READ file=0xffffa0e819758800
> > <...>-1885588 [029] 2228027.307725: nfsd_cb_work: addr=10.110.10.252:0 client 600c8efc:868a6681 procedure=CB_OFFLOAD
> > <...>-1885588 [029] 2228027.307746: nfsd_cb_done: addr=10.110.10.252:0 client 600c8efc:868a6681 status=-107
> > <...>-1885588 [029] 2228027.307747: nfsd_cb_state: addr=10.110.10.252:0 client 600c8efc:868a6681 state=FAULT

Thank you for getting tracepoints from a busy server, but can you get
more? As suspected, the server is having issues sending the callback.
I'm not sure why. Any chance you could turn on the server's sunrpc
tracepoints, probably both the sunrpc and rpcrdma tracepoints? I wonder
if we can get any more info about why it's failing.
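(A sketch of enabling those via tracefs on the server, as root; on older setups the mount point may be /sys/kernel/debug/tracing instead:)

```shell
# Enable all sunrpc and rpcrdma tracepoints, then stream events to a
# file until interrupted with Ctrl+C.
trace=/sys/kernel/tracing
echo 1 > $trace/events/sunrpc/enable
echo 1 > $trace/events/rpcrdma/enable
cat $trace/trace_pipe > /tmp/nfsd-callback-trace.log
```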

2021-02-19 17:42:09

by Olga Kornievskaia

Subject: Re: copy_file_range() infinitely hangs on NFSv4.2 over RDMA

On Thu, Feb 18, 2021 at 3:22 PM Timo Rothenpieler <[email protected]> wrote:
>
> On 18.02.2021 19:30, Olga Kornievskaia wrote:
> > Thank you for getting tracepoints from a busy server but can you get
> > more? As suspected, the server is having issues sending the callback.
> > I'm not sure why. Any chance to turn on the server's sunrpc
> > tracespoints, probably both sunrpc and rdmas tracepoints, I wonder if
> > we can any more info about why it's failing?
>
> I isolated out two of the machines on that cluster now, one acting as
> NFS server from an ext4 mount, the other is the same client as before.
> That way I managed to capture a trace and ibdump of an entire cycle:
> mount + successful copy + 5 minutes later a copy that got stuck
>
> Next to no noise happened during those traces, you can find them attached.
>
> Another observation made due to this: unmount and re-mounting the NFS
> share also gets it back into working condition for a while, no reboot
> necessary.
> During this trace, I got "lucky", and after just 5 minutes of waiting,
> it got stuck.
>
> Before that, I had a run of mount + trying to copy every 5 minutes where
> it ran for 45 minutes without getting stuck. At which point I decided to
> remount once more.

Timo, thank you for gathering the debug info.

Chuck, I need your help. Why would the server lose a callback channel?

<...>-1461944 [001] 2076465.200151: rpc_request:
task:[email protected] nfs4_cbv1 CB_OFFLOAD (async)
<...>-1461944 [001] 2076465.200151: rpc_task_run_action:
task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
runstate=RUNNING|ACTIVE status=0 action=call_reserve
<...>-1461944 [001] 2076465.200154: xprt_reserve:
task:[email protected] xid=0x00a0aaf9
<...>-1461944 [001] 2076465.200155: rpc_task_run_action:
task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
runstate=RUNNING|ACTIVE status=0 action=call_reserveresult
<...>-1461944 [001] 2076465.200156: rpc_task_run_action:
task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
runstate=RUNNING|ACTIVE status=0 action=call_refresh
<...>-1461944 [001] 2076465.200163: rpc_task_run_action:
task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
runstate=RUNNING|ACTIVE status=0 action=call_refreshresult
<...>-1461944 [001] 2076465.200163: rpc_task_run_action:
task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
runstate=RUNNING|ACTIVE status=0 action=call_allocate
<...>-1461944 [001] 2076465.200168: rpc_buf_alloc:
task:[email protected] callsize=548 recvsize=104 status=0
<...>-1461944 [001] 2076465.200168: rpc_task_run_action:
task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
runstate=RUNNING|ACTIVE status=0 action=call_encode
<...>-1461944 [001] 2076465.200173: rpc_task_run_action:
task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=0
action=call_connect
<...>-1461944 [001] 2076465.200174: rpc_call_rpcerror:
task:[email protected] tk_status=-107 rpc_status=-107
<...>-1461944 [001] 2076465.200174: rpc_task_run_action:
task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=-107
action=rpc_exit_task

It's reporting ENOTCONN. I'm not really sure if this is related to copy
offload; it seems more likely to be about doing callbacks over RDMA/IB.

2021-02-19 17:52:17

by Chuck Lever

Subject: Re: copy_file_range() infinitely hangs on NFSv4.2 over RDMA



> On Feb 19, 2021, at 12:38 PM, Olga Kornievskaia <[email protected]> wrote:
>
> On Thu, Feb 18, 2021 at 3:22 PM Timo Rothenpieler <[email protected]> wrote:
>>
>> On 18.02.2021 19:30, Olga Kornievskaia wrote:
>>> Thank you for getting tracepoints from a busy server but can you get
>>> more? As suspected, the server is having issues sending the callback.
>>> I'm not sure why. Any chance to turn on the server's sunrpc
>>> tracespoints, probably both sunrpc and rdmas tracepoints, I wonder if
>>> we can any more info about why it's failing?
>>
>> I isolated out two of the machines on that cluster now, one acting as
>> NFS server from an ext4 mount, the other is the same client as before.
>> That way I managed to capture a trace and ibdump of an entire cycle:
>> mount + successful copy + 5 minutes later a copy that got stuck
>>
>> Next to no noise happened during those traces, you can find them attached.
>>
>> Another observation made due to this: unmount and re-mounting the NFS
>> share also gets it back into working condition for a while, no reboot
>> necessary.
>> During this trace, I got "lucky", and after just 5 minutes of waiting,
>> it got stuck.
>>
>> Before that, I had a run of mount + trying to copy every 5 minutes where
>> it ran for 45 minutes without getting stuck. At which point I decided to
>> remount once more.
>
> Timo, thank you for gathering the debug info.
>
> Chuck, I need your help. Why would the server lose a callback channel?
>
> <...>-1461944 [001] 2076465.200151: rpc_request:
> task:[email protected] nfs4_cbv1 CB_OFFLOAD (async)
> <...>-1461944 [001] 2076465.200151: rpc_task_run_action:
> task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
> runstate=RUNNING|ACTIVE status=0 action=call_reserve
> <...>-1461944 [001] 2076465.200154: xprt_reserve:
> task:[email protected] xid=0x00a0aaf9
> <...>-1461944 [001] 2076465.200155: rpc_task_run_action:
> task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
> runstate=RUNNING|ACTIVE status=0 action=call_reserveresult
> <...>-1461944 [001] 2076465.200156: rpc_task_run_action:
> task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
> runstate=RUNNING|ACTIVE status=0 action=call_refresh
> <...>-1461944 [001] 2076465.200163: rpc_task_run_action:
> task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
> runstate=RUNNING|ACTIVE status=0 action=call_refreshresult
> <...>-1461944 [001] 2076465.200163: rpc_task_run_action:
> task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
> runstate=RUNNING|ACTIVE status=0 action=call_allocate
> <...>-1461944 [001] 2076465.200168: rpc_buf_alloc:
> task:[email protected] callsize=548 recvsize=104 status=0
> <...>-1461944 [001] 2076465.200168: rpc_task_run_action:
> task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
> runstate=RUNNING|ACTIVE status=0 action=call_encode
> <...>-1461944 [001] 2076465.200173: rpc_task_run_action:
> task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
> runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=0
> action=call_connect
> <...>-1461944 [001] 2076465.200174: rpc_call_rpcerror:
> task:[email protected] tk_status=-107 rpc_status=-107
> <...>-1461944 [001] 2076465.200174: rpc_task_run_action:
> task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
> runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=-107
> action=rpc_exit_task
>
> It's reporting ENOTCONN. I'm not really sure if this is related to copy
> offload; it seems more likely to be about doing callbacks over RDMA/IB.

If the client is awaiting a COPY notification callback, I can see why
a lost CB channel would cause the client to wait indefinitely.

The copy mechanism has to deal with this contingency... Perhaps the
server could force a disconnect so that the client and server have an
opportunity to re-establish the callback channel. <shrug>

In any event, the trace log above shows nothing more than "hey, it's
not working." Are there any rpcrdma trace events we can look at to
determine why the backchannel is getting lost?


--
Chuck Lever



2021-02-19 18:51:54

by Chuck Lever

[permalink] [raw]
Subject: Re: copy_file_range() infinitely hangs on NFSv4.2 over RDMA



> On Feb 19, 2021, at 1:01 PM, Timo Rothenpieler <[email protected]> wrote:
>
> On 19.02.2021 18:48, Chuck Lever wrote:
>>> On Feb 19, 2021, at 12:38 PM, Olga Kornievskaia <[email protected]> wrote:
>>>
>>> On Thu, Feb 18, 2021 at 3:22 PM Timo Rothenpieler <[email protected]> wrote:
>>>>
>>>> On 18.02.2021 19:30, Olga Kornievskaia wrote:
>>>>> Thank you for getting tracepoints from a busy server, but can you get
>>>>> more? As suspected, the server is having issues sending the callback.
>>>>> I'm not sure why. Any chance to turn on the server's sunrpc
>>>>> tracepoints, probably both the sunrpc and rdma tracepoints? I wonder if
>>>>> we can get any more info about why it's failing.
>>>>
>>>> I isolated out two of the machines on that cluster now, one acting as
>>>> NFS server from an ext4 mount, the other is the same client as before.
>>>> That way I managed to capture a trace and ibdump of an entire cycle:
>>>> mount + successful copy + 5 minutes later a copy that got stuck
>>>>
>>>> Next to no noise happened during those traces, you can find them attached.
>>>>
>>>> Another observation made due to this: unmount and re-mounting the NFS
>>>> share also gets it back into working condition for a while, no reboot
>>>> necessary.
>>>> During this trace, I got "lucky", and after just 5 minutes of waiting,
>>>> it got stuck.
>>>>
>>>> Before that, I had a run of mount + trying to copy every 5 minutes where
>>>> it ran for 45 minutes without getting stuck. At which point I decided to
>>>> remount once more.
>>>
>>> Timo, thank you for gathering the debug info.
>>>
>>> Chuck, I need your help. Why would the server lose a callback channel?
>>>
>>> [...]
>>>
>>> It's reporting ENOTCONN. I'm not really sure if this is related to copy
>>> offload; it seems more likely to be about doing callbacks over RDMA/IB.
>> If the client is awaiting a COPY notification callback, I can see why
>> a lost CB channel would cause the client to wait indefinitely.
>> The copy mechanism has to deal with this contingency... Perhaps the
>> server could force a disconnect so that the client and server have an
>> opportunity to re-establish the callback channel. <shrug>
>> In any event, the trace log above shows nothing more than "hey, it's
>> not working." Are there any rpcrdma trace events we can look at to
>> determine why the backchannel is getting lost?
>
> The trace log attached to my previous mail has it enabled, along with nfsd and sunrpc.
> I'm attaching the two files again here.

Thanks, Timo.

The first CB_OFFLOAD callback succeeds:

2076155.216687: nfsd_cb_work: addr=10.110.10.252:0 client 602eb645:daa037ae procedure=CB_OFFLOAD
2076155.216704: rpc_request: task:[email protected] nfs4_cbv1 CB_OFFLOAD (async)

2076155.216975: rpc_stats_latency: task:[email protected] xid=0xff9faaf9 nfs4_cbv1 CB_OFFLOAD backlog=33 rtt=195 execute=282
2076155.216977: nfsd_cb_done: addr=10.110.10.252:0 client 602eb645:daa037ae status=0


About 305 seconds later, the autodisconnect timer fires. I'm not sure if this is the backchannel transport, but it looks suspicious:

2076460.314954: xprt_disconnect_auto: peer=[10.110.10.252]:0 state=LOCKED|CONNECTED|BOUND
2076460.314957: xprt_disconnect_done: peer=[10.110.10.252]:0 state=LOCKED|CONNECTED|BOUND


The next CB_OFFLOAD callback fails because the xprt has been marked "disconnected" and the request's NOCONNECT flag is set.

2076465.200136: nfsd_cb_work: addr=10.110.10.252:0 client 602eb645:daa037ae procedure=CB_OFFLOAD
2076465.200151: rpc_request: task:[email protected] nfs4_cbv1 CB_OFFLOAD (async)

2076465.200168: rpc_task_run_action: task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE status=0 action=call_encode
2076465.200173: rpc_task_run_action: task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=0 action=call_connect
2076465.200174: rpc_call_rpcerror: task:[email protected] tk_status=-107 rpc_status=-107
2076465.200174: rpc_task_run_action: task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=-107 action=rpc_exit_task
2076465.200179: nfsd_cb_done: addr=10.110.10.252:0 client 602eb645:daa037ae status=-107
2076465.200180: nfsd_cb_state: addr=10.110.10.252:0 client 602eb645:daa037ae state=FAULT


Perhaps RPC clients for NFSv4.1+ callback should be created with
the RPC_CLNT_CREATE_NO_IDLE_TIMEOUT flag.
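
To visualize the suggestion: the server's callback RPC client is created in
setup_callback_client() in fs/nfsd/nfs4callback.c, so the change would look
roughly like this. This is an untested sketch against my reading of the
5.10-era code, with most rpc_create_args fields abbreviated; only the added
flag is the proposal, everything else is context:

```c
/* fs/nfsd/nfs4callback.c -- sketch, not a tested patch */
static int setup_callback_client(struct nfs4_client *clp,
				 struct nfs4_cb_conn *conn,
				 struct nfsd4_session *ses)
{
	struct rpc_create_args args = {
		.net		= clp->net,
		.address	= (struct sockaddr *)&conn->cb_addr,
		.addrsize	= conn->cb_addrlen,
		/* ... timeout, program, version, etc. as before ... */
		.flags		= RPC_CLNT_CREATE_NOPING |
				  RPC_CLNT_CREATE_QUIET |
				  /* proposed: keep the idle autodisconnect
				   * timer from tearing down the backchannel
				   * transport between callbacks */
				  RPC_CLNT_CREATE_NO_IDLE_TIMEOUT,
	};
	/* ... rpc_create(&args) and error handling as before ... */
}
```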


--
Chuck Lever



2021-02-19 20:41:16

by Timo Rothenpieler

[permalink] [raw]
Subject: Re: copy_file_range() infinitely hangs on NFSv4.2 over RDMA

On 19.02.2021 19:48, Chuck Lever wrote:
>
>
>> On Feb 19, 2021, at 1:01 PM, Timo Rothenpieler <[email protected]> wrote:
>>
>> On 19.02.2021 18:48, Chuck Lever wrote:
>>>> On Feb 19, 2021, at 12:38 PM, Olga Kornievskaia <[email protected]> wrote:
>>>>
>>>> On Thu, Feb 18, 2021 at 3:22 PM Timo Rothenpieler <[email protected]> wrote:
>>>>>
>>>>> On 18.02.2021 19:30, Olga Kornievskaia wrote:
>>>>>> Thank you for getting tracepoints from a busy server, but can you get
>>>>>> more? As suspected, the server is having issues sending the callback.
>>>>>> I'm not sure why. Any chance to turn on the server's sunrpc
>>>>>> tracepoints, probably both the sunrpc and rdma tracepoints? I wonder if
>>>>>> we can get any more info about why it's failing.
>>>>>
>>>>> I isolated out two of the machines on that cluster now, one acting as
>>>>> NFS server from an ext4 mount, the other is the same client as before.
>>>>> That way I managed to capture a trace and ibdump of an entire cycle:
>>>>> mount + successful copy + 5 minutes later a copy that got stuck
>>>>>
>>>>> Next to no noise happened during those traces, you can find them attached.
>>>>>
>>>>> Another observation made due to this: unmount and re-mounting the NFS
>>>>> share also gets it back into working condition for a while, no reboot
>>>>> necessary.
>>>>> During this trace, I got "lucky", and after just 5 minutes of waiting,
>>>>> it got stuck.
>>>>>
>>>>> Before that, I had a run of mount + trying to copy every 5 minutes where
>>>>> it ran for 45 minutes without getting stuck. At which point I decided to
>>>>> remount once more.
>>>>
>>>> Timo, thank you for gathering the debug info.
>>>>
>>>> Chuck, I need your help. Why would the server lose a callback channel?
>>>>
>>>> [...]
>>>>
>>>> It's reporting ENOTCONN. I'm not really sure if this is related to copy
>>>> offload; it seems more likely to be about doing callbacks over RDMA/IB.
>>> If the client is awaiting a COPY notification callback, I can see why
>>> a lost CB channel would cause the client to wait indefinitely.
>>> The copy mechanism has to deal with this contingency... Perhaps the
>>> server could force a disconnect so that the client and server have an
>>> opportunity to re-establish the callback channel. <shrug>
>>> In any event, the trace log above shows nothing more than "hey, it's
>>> not working." Are there any rpcrdma trace events we can look at to
>>> determine why the backchannel is getting lost?
>>
>> The trace log attached to my previous mail has it enabled, along with nfsd and sunrpc.
>> I'm attaching the two files again here.
>
> Thanks, Timo.
>
> The first CB_OFFLOAD callback succeeds:
>
> [...]
>
>
> About 305 seconds later, the autodisconnect timer fires. I'm not sure if this is the backchannel transport, but it looks suspicious:

This would also align with what I observed in my first test session.
In that session I mounted the share and then tested every minute whether
a copy got stuck.
I let it run for almost an hour, and nothing ever got stuck.
Then I restarted the test run, this time testing every 5 minutes, and the
first test after 5 minutes got stuck.


Attachments:
smime.p7s (4.39 kB)
S/MIME Cryptographic Signature

2021-02-19 20:44:41

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: copy_file_range() infinitely hangs on NFSv4.2 over RDMA

On Fri, Feb 19, 2021 at 1:48 PM Chuck Lever <[email protected]> wrote:
>
>
>
> > On Feb 19, 2021, at 1:01 PM, Timo Rothenpieler <[email protected]> wrote:
> >
> > On 19.02.2021 18:48, Chuck Lever wrote:
> >>> On Feb 19, 2021, at 12:38 PM, Olga Kornievskaia <[email protected]> wrote:
> >>>
> >>> On Thu, Feb 18, 2021 at 3:22 PM Timo Rothenpieler <[email protected]> wrote:
> >>>>
> >>>> On 18.02.2021 19:30, Olga Kornievskaia wrote:
> >>>>> Thank you for getting tracepoints from a busy server, but can you get
> >>>>> more? As suspected, the server is having issues sending the callback.
> >>>>> I'm not sure why. Any chance to turn on the server's sunrpc
> >>>>> tracepoints, probably both the sunrpc and rdma tracepoints? I wonder if
> >>>>> we can get any more info about why it's failing.
> >>>>
> >>>> I isolated out two of the machines on that cluster now, one acting as
> >>>> NFS server from an ext4 mount, the other is the same client as before.
> >>>> That way I managed to capture a trace and ibdump of an entire cycle:
> >>>> mount + successful copy + 5 minutes later a copy that got stuck
> >>>>
> >>>> Next to no noise happened during those traces, you can find them attached.
> >>>>
> >>>> Another observation made due to this: unmount and re-mounting the NFS
> >>>> share also gets it back into working condition for a while, no reboot
> >>>> necessary.
> >>>> During this trace, I got "lucky", and after just 5 minutes of waiting,
> >>>> it got stuck.
> >>>>
> >>>> Before that, I had a run of mount + trying to copy every 5 minutes where
> >>>> it ran for 45 minutes without getting stuck. At which point I decided to
> >>>> remount once more.
> >>>
> >>> Timo, thank you for gathering the debug info.
> >>>
> >>> Chuck, I need your help. Why would the server lose a callback channel?
> >>>
> >>> [...]
> >>>
> >>> It's reporting ENOTCONN. I'm not really sure if this is related to copy
> >>> offload; it seems more likely to be about doing callbacks over RDMA/IB.
> >> If the client is awaiting a COPY notification callback, I can see why
> >> a lost CB channel would cause the client to wait indefinitely.
> >> The copy mechanism has to deal with this contingency... Perhaps the
> >> server could force a disconnect so that the client and server have an
> >> opportunity to re-establish the callback channel. <shrug>
> >> In any event, the trace log above shows nothing more than "hey, it's
> >> not working." Are there any rpcrdma trace events we can look at to
> >> determine why the backchannel is getting lost?
> >
> > The trace log attached to my previous mail has it enabled, along with nfsd and sunrpc.
> > I'm attaching the two files again here.
>
> Thanks, Timo.
>
> The first CB_OFFLOAD callback succeeds:
>
> [...]
>
>
> About 305 seconds later, the autodisconnect timer fires. I'm not sure if this is the backchannel transport, but it looks suspicious:
>
> [...]
>
>
> The next CB_OFFLOAD callback fails because the xprt has been marked "disconnected" and the request's NOCONNECT flag is set.
>
> [...]
>
>
> Perhaps RPC clients for NFSv4.1+ callback should be created with
> the RPC_CLNT_CREATE_NO_IDLE_TIMEOUT flag.

Do you know if this callback behavior is new? The same problem would
exist with delegation recalls.

>
>
> --
> Chuck Lever
>
>
>

2021-02-19 20:57:48

by Chuck Lever

[permalink] [raw]
Subject: Re: copy_file_range() infinitely hangs on NFSv4.2 over RDMA



> On Feb 19, 2021, at 3:43 PM, Olga Kornievskaia <[email protected]> wrote:
>
> On Fri, Feb 19, 2021 at 1:48 PM Chuck Lever <[email protected]> wrote:
>>
>>
>>
>>> On Feb 19, 2021, at 1:01 PM, Timo Rothenpieler <[email protected]> wrote:
>>>
>>> On 19.02.2021 18:48, Chuck Lever wrote:
>>>>> On Feb 19, 2021, at 12:38 PM, Olga Kornievskaia <[email protected]> wrote:
>>>>>
>>>>> On Thu, Feb 18, 2021 at 3:22 PM Timo Rothenpieler <[email protected]> wrote:
>>>>>>
>>>>>> On 18.02.2021 19:30, Olga Kornievskaia wrote:
>>>>>>> Thank you for getting tracepoints from a busy server, but can you get
>>>>>>> more? As suspected, the server is having issues sending the callback.
>>>>>>> I'm not sure why. Any chance to turn on the server's sunrpc
>>>>>>> tracepoints, probably both the sunrpc and rdma tracepoints? I wonder if
>>>>>>> we can get any more info about why it's failing.
>>>>>>
>>>>>> I isolated out two of the machines on that cluster now, one acting as
>>>>>> NFS server from an ext4 mount, the other is the same client as before.
>>>>>> That way I managed to capture a trace and ibdump of an entire cycle:
>>>>>> mount + successful copy + 5 minutes later a copy that got stuck
>>>>>>
>>>>>> Next to no noise happened during those traces, you can find them attached.
>>>>>>
>>>>>> Another observation made due to this: unmount and re-mounting the NFS
>>>>>> share also gets it back into working condition for a while, no reboot
>>>>>> necessary.
>>>>>> During this trace, I got "lucky", and after just 5 minutes of waiting,
>>>>>> it got stuck.
>>>>>>
>>>>>> Before that, I had a run of mount + trying to copy every 5 minutes where
>>>>>> it ran for 45 minutes without getting stuck. At which point I decided to
>>>>>> remount once more.
>>>>>
>>>>> Timo, thank you for gathering the debug info.
>>>>>
>>>>> Chuck, I need your help. Why would the server lose a callback channel?
>>>>>
>>>>> [...]
>>>>>
>>>>> It's reporting ENOTCONN. I'm not really sure if this is related to copy
>>>>> offload; it seems more likely to be about doing callbacks over RDMA/IB.
>>>> If the client is awaiting a COPY notification callback, I can see why
>>>> a lost CB channel would cause the client to wait indefinitely.
>>>> The copy mechanism has to deal with this contingency... Perhaps the
>>>> server could force a disconnect so that the client and server have an
>>>> opportunity to re-establish the callback channel. <shrug>
>>>> In any event, the trace log above shows nothing more than "hey, it's
>>>> not working." Are there any rpcrdma trace events we can look at to
>>>> determine why the backchannel is getting lost?
>>>
>>> The trace log attached to my previous mail has it enabled, along with nfsd and sunrpc.
>>> I'm attaching the two files again here.
>>
>> Thanks, Timo.
>>
>> The first CB_OFFLOAD callback succeeds:
>>
>> [...]
>>
>>
>> About 305 seconds later, the autodisconnect timer fires. I'm not sure if this is the backchannel transport, but it looks suspicious:
>>
>> [...]
>>
>>
>> The next CB_OFFLOAD callback fails because the xprt has been marked "disconnected" and the request's NOCONNECT flag is set.
>>
>> [...]
>>
>>
>> Perhaps RPC clients for NFSv4.1+ callback should be created with
>> the RPC_CLNT_CREATE_NO_IDLE_TIMEOUT flag.
>
> Do you know if this callback behavior is new? The same problem would
> exist with delegation recalls.

Delegation recall would be affected, but that wouldn't cause an
indefinite hang. The server should eventually revoke the delegation
and life would continue.

Also, the server should be setting the CALLBACK_DOWN sequence status
flag in its replies. If the client isn't sending any other traffic
while waiting for the copy to complete, it probably wouldn't notice
the problem.

Is the client renewing its lease while the copy is ongoing? That
would be an opportunity for the server to signal the loss of
backchannel connectivity.

Again, in the long term, the copy mechanism needs to be prepared for
the contingency that the callback channel is lost while copy
operations are pending.

There's plenty to investigate and confirm here.


--
Chuck Lever



2021-02-20 21:07:36

by Timo Rothenpieler

[permalink] [raw]
Subject: Re: copy_file_range() infinitely hangs on NFSv4.2 over RDMA

On 19.02.2021 19:48, Chuck Lever wrote:
>
>
>> On Feb 19, 2021, at 1:01 PM, Timo Rothenpieler <[email protected]> wrote:
>>
>> On 19.02.2021 18:48, Chuck Lever wrote:
>>>> On Feb 19, 2021, at 12:38 PM, Olga Kornievskaia <[email protected]> wrote:
>>>>
>>>> On Thu, Feb 18, 2021 at 3:22 PM Timo Rothenpieler <[email protected]> wrote:
>>>>>
>>>>> On 18.02.2021 19:30, Olga Kornievskaia wrote:
>>>>>> Thank you for getting tracepoints from a busy server, but can you get
>>>>>> more? As suspected, the server is having issues sending the callback.
>>>>>> I'm not sure why. Any chance to turn on the server's sunrpc
>>>>>> tracepoints, probably both the sunrpc and rdma tracepoints? I wonder if
>>>>>> we can get any more info about why it's failing.
>>>>>
>>>>> I isolated out two of the machines on that cluster now, one acting as
>>>>> NFS server from an ext4 mount, the other is the same client as before.
>>>>> That way I managed to capture a trace and ibdump of an entire cycle:
>>>>> mount + successful copy + 5 minutes later a copy that got stuck
>>>>>
>>>>> Next to no noise happened during those traces, you can find them attached.
>>>>>
>>>>> Another observation made due to this: unmount and re-mounting the NFS
>>>>> share also gets it back into working condition for a while, no reboot
>>>>> necessary.
>>>>> During this trace, I got "lucky", and after just 5 minutes of waiting,
>>>>> it got stuck.
>>>>>
>>>>> Before that, I had a run of mount + trying to copy every 5 minutes where
>>>>> it ran for 45 minutes without getting stuck. At which point I decided to
>>>>> remount once more.
>>>>
>>>> Timo, thank you for gathering the debug info.
>>>>
>>>> Chuck, I need your help. Why would the server lose a callback channel?
>>>>
>>>> [...]
>>>>
>>>> It's reporting ENOTCONN. I'm not really sure if this is related to copy
>>>> offload; it seems more likely to be about doing callbacks over RDMA/IB.
>>> If the client is awaiting a COPY notification callback, I can see why
>>> a lost CB channel would cause the client to wait indefinitely.
>>> The copy mechanism has to deal with this contingency... Perhaps the
>>> server could force a disconnect so that the client and server have an
>>> opportunity to re-establish the callback channel. <shrug>
>>> In any event, the trace log above shows nothing more than "hey, it's
>>> not working." Are there any rpcrdma trace events we can look at to
>>> determine why the backchannel is getting lost?
>>
>> The trace log attached to my previous mail has it enabled, along with nfsd and sunrpc.
>> I'm attaching the two files again here.
>
> Thanks, Timo.
>
> The first CB_OFFLOAD callback succeeds:
>
> 2076155.216687: nfsd_cb_work: addr=10.110.10.252:0 client 602eb645:daa037ae procedure=CB_OFFLOAD
> 2076155.216704: rpc_request: task:[email protected] nfs4_cbv1 CB_OFFLOAD (async)
>
> 2076155.216975: rpc_stats_latency: task:[email protected] xid=0xff9faaf9 nfs4_cbv1 CB_OFFLOAD backlog=33 rtt=195 execute=282
> 2076155.216977: nfsd_cb_done: addr=10.110.10.252:0 client 602eb645:daa037ae status=0
>
>
> About 305 seconds later, the autodisconnect timer fires. I'm not sure if this is the backchannel transport, but it looks suspicious:
>
> 2076460.314954: xprt_disconnect_auto: peer=[10.110.10.252]:0 state=LOCKED|CONNECTED|BOUND
> 2076460.314957: xprt_disconnect_done: peer=[10.110.10.252]:0 state=LOCKED|CONNECTED|BOUND
>
>
> The next CB_OFFLOAD callback fails because the xprt has been marked "disconnected" and the request's NOCONNECT flag is set.
>
> 2076465.200136: nfsd_cb_work: addr=10.110.10.252:0 client 602eb645:daa037ae procedure=CB_OFFLOAD
> 2076465.200151: rpc_request: task:[email protected] nfs4_cbv1 CB_OFFLOAD (async)
>
> 2076465.200168: rpc_task_run_action: task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE status=0 action=call_encode
> 2076465.200173: rpc_task_run_action: task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=0 action=call_connect
> 2076465.200174: rpc_call_rpcerror: task:[email protected] tk_status=-107 rpc_status=-107
> 2076465.200174: rpc_task_run_action: task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=-107 action=rpc_exit_task
> 2076465.200179: nfsd_cb_done: addr=10.110.10.252:0 client 602eb645:daa037ae status=-107
> 2076465.200180: nfsd_cb_state: addr=10.110.10.252:0 client 602eb645:daa037ae state=FAULT
>
>
> Perhaps RPC clients for NFSv4.1+ callback should be created with
> the RPC_CLNT_CREATE_NO_IDLE_TIMEOUT flag.

I added that flag to the callback client creation flags, and so far
after a few hours of uptime, copying still works.
Can't say anything about that being the appropriate fix for the issue at
hand, as I lack the necessary insight into the NFS code, but I'll leave
it running like that for now and observe.
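For context, the change being tested is a one-flag addition where nfsd creates the RPC client for the backchannel. A rough sketch of what this looks like in setup_callback_client() in fs/nfsd/nfs4callback.c follows; the surrounding field names are from memory and vary between kernel versions, so treat this as illustrative, not a reviewed patch:

```
/* Sketch only: exact rpc_create_args fields differ across kernel
 * versions.  The relevant change is adding
 * RPC_CLNT_CREATE_NO_IDLE_TIMEOUT so the backchannel transport is not
 * torn down by the autodisconnect timer while a CB_OFFLOAD callback
 * may still be needed. */
struct rpc_create_args args = {
	.net		= clp->net,
	.address	= (struct sockaddr *)&conn->cb_addr,
	.addrsize	= conn->cb_addrlen,
	.timeout	= &timeparms,
	.program	= &cb_program,
	.version	= 1,
	.flags		= (RPC_CLNT_CREATE_NOPING |
			   RPC_CLNT_CREATE_QUIET |
			   RPC_CLNT_CREATE_NO_IDLE_TIMEOUT),
};
```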

Can also gladly test any other patches.


Thanks,
Timo


Attachments:
smime.p7s (4.39 kB)
S/MIME Cryptographic Signature

2021-02-21 17:48:50

by Chuck Lever

[permalink] [raw]
Subject: Re: copy_file_range() infinitely hangs on NFSv4.2 over RDMA



> On Feb 20, 2021, at 4:03 PM, Timo Rothenpieler <[email protected]> wrote:
>
> On 19.02.2021 19:48, Chuck Lever wrote:
>>> On Feb 19, 2021, at 1:01 PM, Timo Rothenpieler <[email protected]> wrote:
>>>
>>> On 19.02.2021 18:48, Chuck Lever wrote:
>>>>> On Feb 19, 2021, at 12:38 PM, Olga Kornievskaia <[email protected]> wrote:
>>>>>
>>>>> On Thu, Feb 18, 2021 at 3:22 PM Timo Rothenpieler <[email protected]> wrote:
>>>>>>
>>>>>> On 18.02.2021 19:30, Olga Kornievskaia wrote:
>>>>>>> Thank you for getting tracepoints from a busy server but can you get
>>>>>>> more? As suspected, the server is having issues sending the callback.
>>>>>>> I'm not sure why. Any chance to turn on the server's sunrpc
>>>>>>> tracespoints, probably both sunrpc and rdmas tracepoints, I wonder if
>>>>>>> we can any more info about why it's failing?
>>>>>>
>>>>>> I isolated out two of the machines on that cluster now, one acting as
>>>>>> NFS server from an ext4 mount, the other is the same client as before.
>>>>>> That way I managed to capture a trace and ibdump of an entire cycle:
>>>>>> mount + successful copy + 5 minutes later a copy that got stuck
>>>>>>
>>>>>> Next to no noise happened during those traces, you can find them attached.
>>>>>>
>>>>>> Another observation made due to this: unmount and re-mounting the NFS
>>>>>> share also gets it back into working condition for a while, no reboot
>>>>>> necessary.
>>>>>> During this trace, I got "lucky", and after just 5 minutes of waiting,
>>>>>> it got stuck.
>>>>>>
>>>>>> Before that, I had a run of mount + trying to copy every 5 minutes where
>>>>>> it ran for 45 minutes without getting stuck. At which point I decided to
>>>>>> remount once more.
>>>>>
>>>>> Timo, thank you for gathering the debug info.
>>>>>
>>>>> Chuck, I need your help. Why would the server lose a callback channel?
>>>>>
>>>>> <...>-1461944 [001] 2076465.200151: rpc_request:
>>>>> task:[email protected] nfs4_cbv1 CB_OFFLOAD (async)
>>>>> <...>-1461944 [001] 2076465.200151: rpc_task_run_action:
>>>>> task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
>>>>> runstate=RUNNING|ACTIVE status=0 action=call_reserve
>>>>> <...>-1461944 [001] 2076465.200154: xprt_reserve:
>>>>> task:[email protected] xid=0x00a0aaf9
>>>>> <...>-1461944 [001] 2076465.200155: rpc_task_run_action:
>>>>> task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
>>>>> runstate=RUNNING|ACTIVE status=0 action=call_reserveresult
>>>>> <...>-1461944 [001] 2076465.200156: rpc_task_run_action:
>>>>> task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
>>>>> runstate=RUNNING|ACTIVE status=0 action=call_refresh
>>>>> <...>-1461944 [001] 2076465.200163: rpc_task_run_action:
>>>>> task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
>>>>> runstate=RUNNING|ACTIVE status=0 action=call_refreshresult
>>>>> <...>-1461944 [001] 2076465.200163: rpc_task_run_action:
>>>>> task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
>>>>> runstate=RUNNING|ACTIVE status=0 action=call_allocate
>>>>> <...>-1461944 [001] 2076465.200168: rpc_buf_alloc:
>>>>> task:[email protected] callsize=548 recvsize=104 status=0
>>>>> <...>-1461944 [001] 2076465.200168: rpc_task_run_action:
>>>>> task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
>>>>> runstate=RUNNING|ACTIVE status=0 action=call_encode
>>>>> <...>-1461944 [001] 2076465.200173: rpc_task_run_action:
>>>>> task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
>>>>> runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=0
>>>>> action=call_connect
>>>>> <...>-1461944 [001] 2076465.200174: rpc_call_rpcerror:
>>>>> task:[email protected] tk_status=-107 rpc_status=-107
>>>>> <...>-1461944 [001] 2076465.200174: rpc_task_run_action:
>>>>> task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
>>>>> runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=-107
>>>>> action=rpc_exit_task
>>>>>
>>>>> It's reporting ENOTCONN. I'm not really sure if this is related to copy
>>>>> offload but more perhaps doing callbacks over RDMA/IB.
>>>> If the client is awaiting a COPY notification callback, I can see why
>>>> a lost CB channel would cause the client to wait indefinitely.
>>>> The copy mechanism has to deal with this contingency... Perhaps the
>>>> server could force a disconnect so that the client and server have an
>>>> opportunity to re-establish the callback channel. <shrug>
>>>> In any event, the trace log above shows nothing more than "hey, it's
>>>> not working." Are there any rpcrdma trace events we can look at to
>>>> determine why the backchannel is getting lost?
>>>
>>> The trace log attached to my previous mail has it enabled, along with nfsd and sunrpc.
>>> I'm attaching the two files again here.
>> Thanks, Timo.
>> The first CB_OFFLOAD callback succeeds:
>> 2076155.216687: nfsd_cb_work: addr=10.110.10.252:0 client 602eb645:daa037ae procedure=CB_OFFLOAD
>> 2076155.216704: rpc_request: task:[email protected] nfs4_cbv1 CB_OFFLOAD (async)
>> 2076155.216975: rpc_stats_latency: task:[email protected] xid=0xff9faaf9 nfs4_cbv1 CB_OFFLOAD backlog=33 rtt=195 execute=282
>> 2076155.216977: nfsd_cb_done: addr=10.110.10.252:0 client 602eb645:daa037ae status=0
>> About 305 seconds later, the autodisconnect timer fires. I'm not sure if this is the backchannel transport, but it looks suspicious:
>> 2076460.314954: xprt_disconnect_auto: peer=[10.110.10.252]:0 state=LOCKED|CONNECTED|BOUND
>> 2076460.314957: xprt_disconnect_done: peer=[10.110.10.252]:0 state=LOCKED|CONNECTED|BOUND
>> The next CB_OFFLOAD callback fails because the xprt has been marked "disconnected" and the request's NOCONNECT flag is set.
>> 2076465.200136: nfsd_cb_work: addr=10.110.10.252:0 client 602eb645:daa037ae procedure=CB_OFFLOAD
>> 2076465.200151: rpc_request: task:[email protected] nfs4_cbv1 CB_OFFLOAD (async)
>> 2076465.200168: rpc_task_run_action: task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE status=0 action=call_encode
>> 2076465.200173: rpc_task_run_action: task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=0 action=call_connect
>> 2076465.200174: rpc_call_rpcerror: task:[email protected] tk_status=-107 rpc_status=-107
>> 2076465.200174: rpc_task_run_action: task:[email protected] flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=-107 action=rpc_exit_task
>> 2076465.200179: nfsd_cb_done: addr=10.110.10.252:0 client 602eb645:daa037ae status=-107
>> 2076465.200180: nfsd_cb_state: addr=10.110.10.252:0 client 602eb645:daa037ae state=FAULT
>> Perhaps RPC clients for NFSv4.1+ callback should be created with
>> the RPC_CLNT_CREATE_NO_IDLE_TIMEOUT flag.
>
> I added that flag to the callback client creation flags, and so far after a few hours of uptime, copying still works.

It seems like a good thing to do in any event (barring any other observations).
Can you send a patch for review?


> Can't say anything about that being the appropriate fix for the issue at hand, as I lack the necessary insight into the NFS code, but I'll leave it running like that for now and observe.
>
> Can also gladly test any other patches.
>
>
> Thanks,
> Timo

--
Chuck Lever