2018-05-03 20:40:19

by scar

Subject: RDMA connection lost and not re-opened

We are using NFS over RDMA on our cluster, which runs CentOS 6.9 with
kernel 2.6.32-696.1.1.el6.x86_64. Two of the ten clients had to be
rebooted recently, apparently because an NFS connection was closed but
never reopened. For example, we commonly see messages like these:

May 2 14:46:08 n006 kernel: rpcrdma: connection to 10.10.11.249:2050
closed (-103)
May 2 15:42:39 n006 kernel: rpcrdma: connection to 10.10.11.249:2050 on
mlx4_0, memreg 5 slots 32 ird 16
May 2 15:42:44 n006 kernel: rpcrdma: connection to 10.10.11.249:2050 on
mlx4_0, memreg 5 slots 32 ird 16
May 2 16:04:02 n006 kernel: rpcrdma: connection to 10.10.11.249:2050
closed (-103)
May 2 16:04:02 n006 kernel: rpcrdma: connection to 10.10.11.249:2050 on
mlx4_0, memreg 5 slots 32 ird 16
May 2 18:46:00 n006 kernel: rpcrdma: connection to 10.10.11.249:2050
closed (-103)
May 2 19:16:09 n006 kernel: rpcrdma: connection to 10.10.11.249:2050 on
mlx4_0, memreg 5 slots 32 ird 16
May 2 19:28:49 n006 kernel: rpcrdma: connection to 10.10.11.249:2050
closed (-103)
May 2 21:14:42 n006 kernel: rpcrdma: connection to 10.10.11.10:20049
closed (-103)
May 3 11:51:13 n006 kernel: rpcrdma: connection to 10.10.11.249:2050 on
mlx4_0, memreg 5 slots 32 ird 16
May 3 11:56:13 n006 kernel: rpcrdma: connection to 10.10.11.249:2050
closed (-103)
May 3 13:14:34 n006 kernel: rpcrdma: connection to 10.10.11.249:2050 on
mlx4_0, memreg 5 slots 32 ird 16


I asked about these messages previously and was told they are just
normal operation. You can see the connection is usually reopened
immediately if the resource is still needed, but the "closed" message
at 21:14:42 was not accompanied by a re-opening message, and that is
about the time the client hung and became unresponsive. I noticed
similar messages on the other client that had to be rebooted:

May 2 15:46:52 n001 kernel: rpcrdma: connection to 10.10.11.249:2050
closed (-103)
May 2 16:08:39 n001 kernel: rpcrdma: connection to 10.10.11.249:2050 on
mlx4_0, memreg 5 slots 32 ird 16
May 2 19:14:23 n001 kernel: rpcrdma: connection to 10.10.11.249:2050
closed (-103)
May 2 21:14:38 n001 kernel: rpcrdma: connection to 10.10.11.10:20049
closed (-103)
May 3 11:54:58 n001 kernel: rpcrdma: connection to 10.10.11.249:2050 on
mlx4_0, memreg 5 slots 32 ird 16
May 3 11:59:59 n001 kernel: rpcrdma: connection to 10.10.11.249:2050
closed (-103)
May 3 12:50:57 n001 kernel: rpcrdma: connection to 10.10.11.249:2050 on
mlx4_0, memreg 5 slots 32 ird 16
May 3 12:55:58 n001 kernel: rpcrdma: connection to 10.10.11.249:2050
closed (-103)


You can see on each machine that the connection to 10.10.11.249:2050
was re-opened when I tried to log in today on May 3, but the connection
to 10.10.11.10:20049 was not. Meanwhile, our other clients still have
their connection to 10.10.11.10:20049, and the server at 10.10.11.10 is
working fine.
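
For what it's worth, this is roughly how I have been checking which NFS
mounts on a node are supposed to be using the RDMA transport (only a
sketch: it just parses /proc/mounts and assumes our mounts carry the
proto=rdma option):

#!/usr/bin/env python
# Sketch: list NFS mounts that use the RDMA transport by reading
# /proc/mounts.  Assumes the mounts were made with -o proto=rdma;
# adjust the option check if yours differ.

def rdma_nfs_mounts(path="/proc/mounts"):
    mounts = []
    with open(path) as f:
        for line in f:
            device, mountpoint, fstype, options = line.split()[:4]
            if fstype.startswith("nfs") and "proto=rdma" in options.split(","):
                mounts.append((device, mountpoint, options))
    return mounts

if __name__ == "__main__":
    for device, mountpoint, options in rdma_nfs_mounts():
        print("%s on %s (%s)" % (device, mountpoint, options))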

Any idea why this happened, and how it could be resolved without
having to reboot the client node and lose work?

Thanks



2018-05-03 23:03:08

by scar

Subject: Re: RDMA connection lost and not re-opened

I did also notice these errors on the NFS server 10.10.11.10:

May 2 21:27:59 pac kernel: svcrdma: failed to send reply chunks, rc=-5
May 2 21:27:59 pac kernel: nfsd: peername failed (err 107)!
May 2 21:27:59 pac kernel: nfsd: peername failed (err 107)!
May 2 21:27:59 pac kernel: nfsd: peername failed (err 107)!
May 2 21:27:59 pac kernel: nfsd: peername failed (err 107)!
May 2 21:27:59 pac kernel: nfsd: peername failed (err 107)!
May 2 21:27:59 pac kernel: nfsd: peername failed (err 107)!
May 2 21:27:59 pac kernel: nfsd: peername failed (err 107)!
May 2 21:27:59 pac kernel: nfsd: peername failed (err 107)!
May 2 21:27:59 pac kernel: nfsd: peername failed (err 107)!
May 2 21:27:59 pac kernel: nfsd: peername failed (err 107)!



2018-05-04 16:58:37

by Chuck Lever

Subject: Re: RDMA connection lost and not re-opened



> On May 3, 2018, at 7:02 PM, scar <[email protected]> wrote:
>
> I did also notice these errors on the NFS server 10.10.11.10:
>
> May 2 21:27:59 pac kernel: svcrdma: failed to send reply chunks, rc=-5
> May 2 21:27:59 pac kernel: nfsd: peername failed (err 107)!
> May 2 21:27:59 pac kernel: nfsd: peername failed (err 107)!
> May 2 21:27:59 pac kernel: nfsd: peername failed (err 107)!
> May 2 21:27:59 pac kernel: nfsd: peername failed (err 107)!
> May 2 21:27:59 pac kernel: nfsd: peername failed (err 107)!
> May 2 21:27:59 pac kernel: nfsd: peername failed (err 107)!
> May 2 21:27:59 pac kernel: nfsd: peername failed (err 107)!
> May 2 21:27:59 pac kernel: nfsd: peername failed (err 107)!
> May 2 21:27:59 pac kernel: nfsd: peername failed (err 107)!
> May 2 21:27:59 pac kernel: nfsd: peername failed (err 107)!

Thanks for checking on the server side.

These timestamps don't line up with the client messages you posted
yesterday, the unmatched "closed (-103)" message being at 21:14:42.

"peername failed" is from the NFSD TCP accept path. I don't immediately
see how that is related to an NFS/RDMA mount. However, it might indicate
there was an HCA or fabric issue around that time that affected both
NFS/RDMA and IP-over-IB.
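
If you want to rule out a fabric or HCA event, the per-port error
counters under sysfs are one place to look. Here is a rough sketch
(the device and counter layout are assumptions about an mlx4 HCA;
adjust for your hardware):

#!/usr/bin/env python
# Rough sketch: dump a few IB port error counters that tend to move
# when the fabric or HCA has problems.  Device/port layout is an
# assumption (e.g. mlx4_0 port 1); adjust for your setup.
import glob, os

COUNTERS = ("link_downed", "link_error_recovery", "symbol_error",
            "port_rcv_errors")

for port_dir in sorted(glob.glob("/sys/class/infiniband/*/ports/*")):
    print(port_dir)
    for name in COUNTERS:
        path = os.path.join(port_dir, "counters", name)
        if os.path.exists(path):
            with open(path) as f:
                print("  %-20s %s" % (name, f.read().strip()))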

"failed to send reply chunks" would be expected if the RDMA connection
is lost before the server can send an RPC reply. It doesn't explain why
the connection is lost, however.
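
Incidentally, the numbers in those messages appear to be ordinary
(negated) errno values; a quick way to decode them on a Linux box:

#!/usr/bin/env python
# Decode the error numbers quoted in the log messages above; they are
# standard Linux errno values (shown negated in some of the messages).
import errno, os

for code in (103, 107, 5):   # closed (-103), peername err 107, rc=-5
    name = errno.errorcode.get(code, "?")
    print("%3d  %-14s %s" % (code, name, os.strerror(code)))

# 103  ECONNABORTED   Software caused connection abort
# 107  ENOTCONN       Transport endpoint is not connected
#   5  EIO            Input/output error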

There don't seem to be any other probative messages on either system;
I'm looking for reports of flushed Send or Receive WRs, QP errors, or
DMAR faults. And of course any BUG output.
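
Something along these lines is the kind of scan I have in mind (only a
sketch; the log path and the patterns are assumptions about where your
syslog lands and how those events are usually worded):

#!/usr/bin/env python
# Rough sketch: pull out syslog lines that often accompany HCA or
# fabric trouble (flushed WRs, QP/CQ errors, DMAR faults, kernel BUGs).
import re

PATTERN = re.compile(r"flushed|QP|CQ error|DMAR|BUG|rpcrdma|svcrdma|mlx4",
                     re.IGNORECASE)

def scan(path="/var/log/messages"):
    with open(path) as log:
        for line in log:
            if PATTERN.search(line):
                print(line.rstrip())

if __name__ == "__main__":
    scan()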

I assume you are running CentOS 6.9 on both the client and server
systems. That's a fairly old NFS/RDMA implementation, and one that I'm
not familiar with. RHEL 6 forked from upstream at 2.6.32, but parts of
more recent upstream were later backported to it, so it is now only
loosely related to what is currently upstream, and all of that work was
done before I was deeply involved in NFS/RDMA upstream. Because of this
divergence we typically recommend that such reports be addressed first
to the Linux distributor responsible for the content of that kernel.

In any event, I don't think there's much you can do about a stuck
mount in this configuration; you will have to reboot (perhaps even
power-on reset) your client to recover.

Since two clients saw the same symptom with the same server at nearly
the same time, my first guess would be a server problem (bug).


--
Chuck Lever
[email protected]