2016-06-23 15:42:30

by Steve Wise

Subject: Interrupted IO causing async errors

Hey Chuck, we observe with 4.7-rc4 (and older kernels too) that interrupting a
dbench test on an nfsrdma/cxgb4 mount while it is doing heavy I/O can result in
cxgb4 logging an "invalid stag" error on an ingress RDMA WRITE message. Is
this expected? I'm wondering if this is a normal side effect of interrupting
the I/O on the mount, perhaps due to the mount options or NFS version. This
error could happen if the NFSRDMA client invalidated MRs that were advertised to
the server for I/O while that I/O was still in flight. Is this expected, or should we
dig in further? Thoughts? Thanks...

Here are the details of the test.

Steps:

-> Load iw_cxgb4,rdma_ucm on both nodes.
-> Assign ip to chelsio interfaces on both nodes.

Server Side [gayabari]:

-> mknod /dev/ram0 b 1 0
-> modprobe brd rd_nr=1 rd_size=1048576
-> mkdir /nfsrdma
-> mkfs.ext3 /dev/ram0
-> mount /dev/ram0 /nfsrdma
-> vim /etc/exports
/nfsrdma *(sync,insecure,rw,no_root_squash,no_subtree_check)

-> modprobe xprtrdma
-> modprobe svcrdma
-> service nfsserver restart
-> echo rdma 20049 > /proc/fs/nfsd/portlist
-> exportfs -rav

Client Side [sonada]:

-> modprobe xprtrdma
-> modprobe svcrdma

-> mount -o rdma,port=20049,vers=3,wsize=65536,rsize=65536 \
        102.1.1.186:/nfsrdma/ /mnt/

-> Then run the command below on the client [sonada]:
sonada:~ # dbench -t100 -D /root/share1/ 10


-> The issue is seen only when the dbench test is killed partway through; otherwise it runs fine.

Error seen on the nfsrdma client:

[ 1593.398351] cxgb4 0000:01:00.4: AE qpid 1028 opcode 0 status 0x1 type 0 len
0x18e6009c wrid.hi 0x2cce2dc wrid.lo 0x2
[ 1593.398374] RPC: rpcrdma_qp_async_error_upcall: QP request error on
device cxgb4_0 ep ffff88022f3567e8



2016-06-24 00:15:37

by Chuck Lever

Subject: Re: Interrupted IO causing async errors

Hi Steve-

> On Jun 23, 2016, at 11:42 AM, Steve Wise <[email protected]> wrote:
>
> Hey Chuck, we observe with 4.7-rc4 (and older kernels too) that interrupting a
> dbench test on an nfsrdma/cxgb4 mount while it is doing heavy I/O can result in
> cxgb4 logging an "invalid stag" error on an ingress RDMA WRITE message. Is
> this expected? I'm wondering if this is a normal side effect of interrupting
> the I/O on the mount, perhaps due to the mount options or NFS version. This
> error could happen if the NFSRDMA client invalidated MRs that were advertised to
> the server for I/O while that I/O was still in flight. Is this expected, or should we
> dig in further? Thoughts? Thanks...

When an application is signaled, outstanding RPCs are terminated.
When an RPC completes, whether because a reply was received,
or because the local application has died, any memory that was
registered on behalf of that RPC is invalidated before it can be
used for something else. The data in that memory remains at rest
until invalidation and DMA unmapping is complete.
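
As a rough illustration (not the actual xprtrdma path; the function
name and parameters are assumed for the sketch), that invalidation
step looks something like this with kernel verbs for an
FRWR-registered MR:

#include <rdma/ib_verbs.h>

/*
 * Illustrative sketch only -- not the actual xprtrdma code. It
 * invalidates an FRWR-registered MR so the peer can no longer touch
 * the memory, then DMA-unmaps it. The qp, mr, device, and
 * scatterlist are assumed to come from the transport's setup code.
 */
static int invalidate_and_unmap(struct ib_qp *qp, struct ib_mr *mr,
                                struct ib_device *dev,
                                struct scatterlist *sg, int nents)
{
        struct ib_send_wr inv_wr = {}, *bad_wr;
        int rc;

        inv_wr.opcode = IB_WR_LOCAL_INV;
        inv_wr.send_flags = IB_SEND_SIGNALED;   /* ask for a completion */
        inv_wr.ex.invalidate_rkey = mr->rkey;

        rc = ib_post_send(qp, &inv_wr, &bad_wr);
        if (rc)
                return rc;

        /* Real code waits for the LOCAL_INV completion here; only then
         * is it safe to unmap and re-use the memory. */
        ib_dma_unmap_sg(dev, sg, nents, DMA_BIDIRECTIONAL);
        return 0;
}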

It appears that your server is attempting to read an argument or
write a result for an RPC that is no longer pending. I think both
sides should report a transport error, and the connection should
terminate. No other problems, though: other operations should
continue normally after the client re-establishes a fresh connection.
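
The "QP request error" upcall in your log is the QP's async event
handler firing. A minimal sketch of such a handler (illustrative
only, not the actual rpcrdma implementation):

#include <rdma/ib_verbs.h>

/*
 * Sketch of a QP async error upcall. A handler like this is wired in
 * through ib_qp_init_attr.event_handler when the QP is created;
 * 'context' is whatever was stashed in qp_context.
 */
static void example_qp_async_error_upcall(struct ib_event *event,
                                          void *context)
{
        pr_err("RPC: %s: %s on device %s\n", __func__,
               ib_event_msg(event->event), event->device->name);
        /* A real transport would mark the connection dead here and
         * let the client set up a fresh one. */
}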

If this doesn't match your observations, let me know.

> Here are the details of the test.
>
> Steps:
>
> -> Load iw_cxgb4,rdma_ucm on both nodes.
> -> Assign ip to chelsio interfaces on both nodes.
>
> Server Side [gayabari]:
>
> -> mknod /dev/ram0 b 1 0
> -> modprobe brd rd_nr=1 rd_size=1048576
> -> mkdir /nfsrdma
> -> mkfs.ext3 /dev/ram0
> -> mount /dev/ram0 /nfsrdma
> -> vim /etc/exports
> /nfsrdma *(sync,insecure,rw,no_root_squash,no_subtree_check)
>
> -> modprobe xprtrdma
> -> modprobe svcrdma
> -> service nfsserver restart
> -> echo rdma 20049 > /proc/fs/nfsd/portlist
> -> exportfs -rav
>
> Client Side [sonada]:
>
> -> modprobe xprtrdma
> -> modprobe svcrdma
>
> -> mount -o rdma,port=20049,vers=3,wsize=65536,rsize=65536 \
>         102.1.1.186:/nfsrdma/ /mnt/
>
> -> Then run the command below on the client [sonada]:
> sonada:~ # dbench -t100 -D /root/share1/ 10
>
>
> -> The issue is seen only when the dbench test is killed partway through; otherwise it runs fine.
>
> Error seen on the nfsrdma client:
>
> [ 1593.398351] cxgb4 0000:01:00.4: AE qpid 1028 opcode 0 status 0x1 type 0 len
> 0x18e6009c wrid.hi 0x2cce2dc wrid.lo 0x2
> [ 1593.398374] RPC: rpcrdma_qp_async_error_upcall: QP request error on
> device cxgb4_0 ep ffff88022f3567e8
>


2016-06-24 13:36:56

by Steve Wise

Subject: RE: Interrupted IO causing async errors



> -----Original Message-----
> From: Chuck Lever [mailto:[email protected]]
> Sent: Thursday, June 23, 2016 7:15 PM
> To: Steve Wise
> Cc: Raju Rangoju; [email protected]; [email protected]
> Subject: Re: Interrupted IO causing async errors
>
> Hi Steve-
>
> > On Jun 23, 2016, at 11:42 AM, Steve Wise <[email protected]> wrote:
> >
> > Hey Chuck, we observe with 4.7-rc4 (and older kernels too) that interrupting a
> > dbench test on an nfsrdma/cxgb4 mount while it is doing heavy I/O can result in
> > cxgb4 logging an "invalid stag" error on an ingress RDMA WRITE message. Is
> > this expected? I'm wondering if this is a normal side effect of interrupting
> > the I/O on the mount, perhaps due to the mount options or NFS version. This
> > error could happen if the NFSRDMA client invalidated MRs that were advertised to
> > the server for I/O while that I/O was still in flight. Is this expected, or should we
> > dig in further? Thoughts? Thanks...
>
> When an application is signaled, outstanding RPCs are terminated.
> When an RPC completes, whether because a reply was received,
> or because the local application has died, any memory that was
> registered on behalf of that RPC is invalidated before it can be
> used for something else. The data in that memory remains at rest
> until invalidation and DMA unmapping is complete.
>
> It appears that your server is attempting to read an argument or
> write a result for an RPC that is no longer pending. I think both
> sides should report a transport error, and the connection should
> terminate. No other problems, though: other operations should
> continue normally after the client re-establishes a fresh connection.
>
> If this doesn't match your observations, let me know.
>

This is exactly what we see. Thanks!

Steve.


2016-06-24 19:26:58

by Jason Gunthorpe

Subject: Re: Interrupted IO causing async errors

On Thu, Jun 23, 2016 at 08:15:29PM -0400, Chuck Lever wrote:
> When an application is signaled, outstanding RPCs are terminated.
> When an RPC completes, whether because a reply was received,
> or because the local application has died, any memory that was
> registered on behalf of that RPC is invalidated before it can be
> used for something else. The data in that memory remains at rest
> until invalidation and DMA unmapping is complete.

> It appears that your server is attempting to read an argument or
> write a result for an RPC that is no longer pending. I think both
> sides should report a transport error, and the connection should
> terminate. No other problems, though: other operations should
> continue normally after the client re-establishes a fresh connection.

Yuk! A transport tear down and restart on a user space CTRL-C/etc.?
Isn't that a little too common and a little too expensive to be a
permanent solution?

Jason

2016-06-24 22:16:56

by Chuck Lever

Subject: Re: Interrupted IO causing async errors


> On Jun 24, 2016, at 3:26 PM, Jason Gunthorpe <[email protected]> wrote:
>
> On Thu, Jun 23, 2016 at 08:15:29PM -0400, Chuck Lever wrote:
>> When an application is signaled, outstanding RPCs are terminated.
>> When an RPC completes, whether because a reply was received,
>> or because the local application has died, any memory that was
>> registered on behalf of that RPC is invalidated before it can be
>> used for something else. The data in that memory remains at rest
>> until invalidation and DMA unmapping is complete.
>
>> It appears that your server is attempting to read an argument or
>> write a result for an RPC that is no longer pending. I think both
>> sides should report a transport error, and the connection should
>> terminate. No other problems, though: other operations should
>> continue normally after the client re-establishes a fresh connection.
>
> Yuk! A transport tear down and restart on a user space CTRL-C/etc.?
> Isn't that a little too common and a little too expensive to be a
> permanent solution?

I agree, it's not an optimal solution, but:

- The server must not be allowed to update memory
that may have already been re-used after an RPC has
terminated.

- Typically reconnecting is quite fast.

- Not every ^C or signal will trigger connection loss: only
if there happens to be an outstanding RPC involving the
signaled application.

- Often a ^C or signal is the result of impatience while
waiting for storage target restart. In that case, a reconnect
is required anyway and the client has to redrive RPCs it is
still interested in.

- The problem seems most acute for Long Replies, where a
large buffer has been registered for the server to write
the reply into (as opposed to using a simple RDMA Send).
As we work to increase the inline thresholds, there will be
many fewer Long Replies and that narrows the exposure
significantly.

- Signals during workloads where NFS/RDMA is beneficial
seem to be fairly infrequent. That of course is likely to change
over time.

So this is a compromise solution that is safe and should
not be too onerous for performance. However, I'm open to
better (less noisy) approaches.

A protocol mechanism could be added so that a client can
notify a server of a canceled RPC. This helps some, but is
racy (impossible to be 100% effective), and protocol
changes take time.

Another option: wait a moment after a signal occurs during a
synchronous RPC, to give server replies a chance to drain and
handle invalidation for us. Again, this helps some, but it is not
100% guaranteed to prevent a Remote Protection Error, and care
must be taken to wait without deadlocking the RPC client.
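
A rough sketch of that bounded wait, with invented names
(example_req, rl_waitq, rl_reply_seen) standing in for whatever the
real RPC structures would use:

#include <linux/compiler.h>
#include <linux/jiffies.h>
#include <linux/wait.h>

/* All names here are hypothetical, for illustration only. */
struct example_req {
        wait_queue_head_t rl_waitq;
        bool rl_reply_seen;
};

/*
 * After a signal, give the server up to 100ms to reply (and so
 * invalidate through the normal reply path) before falling back to
 * local invalidation. Returns true if the reply drained in time.
 */
static bool reply_drained(struct example_req *req)
{
        long left = wait_event_timeout(req->rl_waitq,
                                       READ_ONCE(req->rl_reply_seen),
                                       msecs_to_jiffies(100));
        return left > 0;
}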

Memory resources for a terminated RPC could be parked
until the server replies and the regions can be invalidated
normally in the reply handler.
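
One way to picture that parking idea (again, all names invented for
illustration):

#include <linux/list.h>
#include <linux/spinlock.h>
#include <rdma/ib_verbs.h>

/*
 * Hypothetical sketch: instead of recycling the MR when the RPC is
 * killed, move it to a 'parked' list. It returns to the free pool
 * only after the (late) server reply arrives and the region has been
 * invalidated through the normal reply path.
 */
struct parked_mr {
        struct list_head list;
        struct ib_mr *mr;
};

static LIST_HEAD(parked_mrs);
static DEFINE_SPINLOCK(parked_lock);

static void park_mr(struct parked_mr *pmr)
{
        spin_lock(&parked_lock);
        list_add_tail(&pmr->list, &parked_mrs);
        spin_unlock(&parked_lock);
}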

If the advertised buffer is part of a direct I/O, and the
application has terminated, that memory would have to
be put aside somehow until it was safe to re-use.

A page in the page cache would have to be removed or
locked during the waiting period, and that has some
undesirable consequences.