2018-06-29 15:04:46

by Chuck Lever

Subject: Re: RDMA connection closed and not re-opened

Hi Chandler-


> On Jun 28, 2018, at 8:23 PM, [email protected] wrote:
> 
> Dear Chuck et al.,
> 
> Sorry for my late reply. I have since lost the previous messages in my
> news client and gmane isn't very reliable anymore. I am replying to the
> message-id A9E63254-22F5-48A7-85C2-8016D85CD192 [1] which was in
> reference to my original posts [2][3] (links in footer).
> 
> We keep having this problem and having to reset servers and losing
> work. The latest incident involved 7 out of 9 of our NFS clients. I've
> attached the latest messages from these clients (n001.txt through
> n007.txt) as well as the messages from the server.
> 
> Here is a short summary in chronological order: I first notice a
> message on our server at Jun 27 19:09:03 in reference to Ganglia not
> being able to reach one of the data sources. Not sure if it is related,
> but the message seems to only appear when there are these problems with
> the NFS... the next message doesn't happen until Jun 27 20:01:55.
> 
> On the clients, the first errors happen on n005,
> Jun 27 20:04:07 n005 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff88204ea3b840 (stale): WR flushed
> 
> there are similar messages on n007 and n003 which happen at 20:04:09
> and 20:04:17. However I don't see these "WR flushed" messages on the
> other nodes. These are accompanied by the INFO messages that our
> application (daligner) is being blocked, as well as the "rpcrdma:
> connection to 10.10.11.10:20049 closed (-103)" error. After that the
> nodes become unresponsive to SSH, although Ganglia seems to still be
> able to collect some information from them, as I can see the load
> graphs continually increasing.

These are informational messages that are typical of network
problems, or of a server that has failed or is overloaded. I'm
especially inclined to think this is not a client issue because it
happens on multiple clients at around the same time.

These appear to be typical of all the clients:

Jun 27 20:07:07 n005 kernel: nfs: server 10.10.11.10 not responding, still trying
Jun 27 20:08:34 n005 kernel: rpcrdma: connection to 10.10.11.10:20049 on mlx4_0, memreg 5 slots 32 ird 16
Jun 27 20:08:35 n005 kernel: nfs: server 10.10.11.10 OK
Jun 27 20:08:35 n005 kernel: nfs: server 10.10.11.10 not responding, still trying
Jun 27 20:08:35 n005 kernel: nfs: server 10.10.11.10 OK
Jun 27 20:13:59 n005 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff88204f86b380 (stale): WR flushed
Jun 27 20:13:59 n005 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff88204eea9180 (stale): WR flushed
Jun 27 20:13:59 n005 kernel: RPC: rpcrdma_sendcq_process_wc: frmr ffff88204e743f80 (stale): WR flushed
Jun 27 20:15:43 n005 kernel: rpcrdma: connection to 10.10.11.10:20049 on mlx4_0, memreg 5 slots 32 ird 16
Jun 27 20:32:08 n005 kernel: rpcrdma: connection to 10.10.11.10:20049 closed (-103)

The "closed" message appears only in some client logs.

On the server:

Jun 27 20:08:34 pac kernel: svcrdma: failed to send reply chunks, rc=-5
Jun 27 20:08:34 pac kernel: nfsd: peername failed (err 107)!
Jun 27 20:08:34 pac kernel: nfsd: peername failed (err 107)!
Jun 27 20:08:35 pac kernel: svcrdma: failed to send reply chunks, rc=-5

This is suspicious. I don't have access to the CentOS 6.9 source
code, but it could mean that the server logic that transmits reply
chunks is broken, and the client is requesting an operation that
has to use reply chunks. That would cause a deadlock on that
connection because the client's recourse is to send that operation
again and again, but the server would repeatedly fail to reply.
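
If this happens again and you want to confirm the retransmit loop from
the client side, something like this would show it (just a sketch;
rpcdebug ships with nfs-utils, and this makes logging very verbose, so
switch it off again promptly):

  rpcdebug -m rpc -s all     # log RPC activity, including retransmits
  # ...reproduce the hang, watch /var/log/messages or dmesg...
  rpcdebug -m rpc -c all     # turn the debug logging back off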


> We haven't had this problem until recently. I upgraded our cluster to
> add two additional nodes (n008 and n009, which have problems too and
> have to be rebooted) and we also added more storage to the server. The
> jobs are submitted to the cluster via Sun Grid Engine, and in total
> there are about 61 jobs (daligner) that may start at once and open
> connections to the NFS server... is it too much work for NFS to handle?
> 
> Yes, both clients and servers have CentOS 6.9. Is there a way to
> report this to Red Hat? Otherwise I'm not sure of a way to report this
> to the "Linux distributor".

I don't know how to contact CentOS support, but that would be the
first step here: work through the basic troubleshooting with people
who are familiar with that code base and with the tools that are
available in that distribution.

Perhaps a RH staffer on this list could provide some guidance?


> The machines are not completely updated and there appears to be a new
> kernel (2.6.32-696.30.1.el6) available as well as new nfs-utils
> (1:1.2.3-75.el6_9). So not sure if updating those may help...

If there are no other constraints on your NFS server's kernel /
distribution, I recommend upgrading it to a recent update of CentOS
7 (not simply a newer CentOS 6 release).

IMO nfs-utils is not involved in these issues.


> If you do not see any solution to this old implementation then would
> you perhaps suggest I manually install the latest stable version of NFS
> on the clients and server? In that case please let me know of any
> relevant configure flags I might need to use if you can think of any
> off the top of your head.

The NFS implementation is integrated into the Linux kernel, so it's
not a simple matter of "installing the latest stable version of NFS".


> Many Thanks,
> Chandler / Systems Administrator
> Arizona Genomics Institute
> http://www.genome.arizona.edu
>
> --
> 1. https://marc.info/?l=linux-nfs&m=152545311928035&w=2
> 2. https://marc.info/?l=linux-nfs&m=152538002122612&w=2
> 3. https://marc.info/?l=linux-nfs&m=152538859227047&w=2
>
>
> <n001.txt><n002.txt><n003.txt><n004.txt><n005.txt><n006.txt><n007.txt><server.txt>

--
Chuck Lever
[email protected]





2018-07-02 23:22:34

by Chandler

Subject: Re: RDMA connection closed and not re-opened

Thanks Chuck for your input; let me address it below as is normal for
mailing lists. I'm confused as to why my message hasn't shown up on the
mailing list, even though I'm subscribed with this address... I've
written to [email protected] regarding this discrepancy and
it was rejected as spam, so now I'm waiting to hear from
[email protected]. I guess I'll need to continue to CC you in
the meantime, since your responses at least show up on the mailing
list...


Chuck Lever wrote on 06/29/2018 08:04 AM:
> These are informational messages that are typical of network
> problems, or of a server that has failed or is overloaded. I'm
> especially inclined to think this is not a client issue because it
> happens on multiple clients at around the same time.

Yes, it makes sense that it's a server problem, however I would think
our server is more than capable of handling this. Although it is an
older server, it still has 2x 6-core Intel Xeon E5-2620 v2 @ 2.10GHz
with 128GB of RAM and normally sits at maybe 10% utilization. I have
not watched the server when we start these daligner jobs, so that is
something I could look for to see if I notice any bottlenecks... what
is a typical bottleneck for NFS/RDMA?
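
Something like this is what I'd plan to watch on the server while the
jobs start (just a sketch using standard tools, nothing NFS/RDMA
specific):

  nfsstat -s              # per-operation NFS server counters
  cat /proc/net/rpc/nfsd  # nfsd thread and RPC statistics
  iostat -x 5             # per-device utilization of the exported disks
  vmstat 5                # memory pressure and run queue length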


> If there are no other constraints on your NFS server's kernel /
> distribution, I recommend upgrading it to a recent update of CentOS
> 7 (not simply a newer CentOS 6 release).

Unfortunately CentOS doesn't support upgrading from 6 to 7 and this
machine is too critical to take down for a fresh
installation/reconfiguration, so I have a feeling we'll need to figure
out how to get the 6.9 kernel working. I will try updating to the
latest kernel on all of the nodes to see if it helps.



2018-07-03 02:44:49

by Chuck Lever III

Subject: Re: RDMA connection closed and not re-opened


> On Jul 2, 2018, at 7:22 PM, [email protected] wrote:
>
> Thanks Chuck for your input; let me address it below as is normal for
> mailing lists. I'm confused as to why my message hasn't shown up on the
> mailing list, even though I'm subscribed with this address... I've
> written to [email protected] regarding this discrepancy and
> it was rejected as spam, so now I'm waiting to hear from
> [email protected]. I guess I'll need to continue to CC you in
> the meantime, since your responses at least show up on the mailing
> list...
>
>
> Chuck Lever wrote on 06/29/2018 08:04 AM:
> > These are informational messages that are typical of network
> > problems, or of a server that has failed or is overloaded. I'm
> > especially inclined to think this is not a client issue because it
> > happens on multiple clients at around the same time.
>
> Yes, it makes sense that it's a server problem, however I would think
> our server is more than capable of handling this. Although it is an
> older server, it still has 2x 6-core Intel Xeon E5-2620 v2 @ 2.10GHz
> with 128GB of RAM and normally sits at maybe 10% utilization. I have
> not watched the server when we start these daligner jobs, so that is
> something I could look for to see if I notice any bottlenecks... what
> is a typical bottleneck for NFS/RDMA?

Please review all of my last email. I concluded the likely culprit is a
software bug, not server overload.


> > If there are no other constraints on your NFS server's kernel /
> > distribution, I recommend upgrading it to a recent update of CentOS
> > 7 (not simply a newer CentOS 6 release).
>
> Unfortunately CentOS doesn't support upgrading from 6 to 7 and this
> machine is too critical to take down for a fresh
> installation/reconfiguration, so I have a feeling we'll need to figure
> out how to get the 6.9 kernel working. I will try updating to the
> latest kernel on all of the nodes to see if it helps.

If CentOS 6 is required, CentOS / Red Hat really does need to be
involved as you troubleshoot. Any code changes will necessitate a new
kernel build that only they can provide.


2018-07-03 23:41:16

by Chandler

Subject: Re: RDMA connection closed and not re-opened

Chuck Lever wrote on 07/02/2018 07:44 PM:
> Please review all of my last email. I concluded the likely culprit is a software bug, not server overload.
> If CentOS 6 is required, CentOS / Red Hat really does need to be involved as you troubleshoot. Any code changes will necessitate a new kernel build that only they can provide.

Thanks, we will see how it goes with the latest kernel, and if there are
still problems I'll look into filing a bug report with CentOS or something.

2018-07-12 22:55:54

by Chandler

Subject: Re: RDMA connection closed and not re-opened

> Thanks, we will see how it goes with the latest kernel, and if there are
> still problems I'll look into filing a bug report with CentOS or something.

So, the latest CentOS kernel, 2.6.32-696.30.1, has not helped yet. In
the meantime we have reverted to using NFS/TCP over the gigabit
ethernet link, which creates a bottleneck for the full processing of our
cluster, but at least it hasn't crashed yet.

I did notice that the hangups have all been after 8 PM in each
occurrence. Each night at 8 PM, the NFS server acts as an NFS client and
runs a couple of rsnapshot jobs which back up to a different NFS server.
Even with NFS/TCP the NFS server became unresponsive after 8 PM when the
rsnapshot jobs were running. I can see in the system messages the same
sort of Ganglia errors we were seeing before, as well as rsyslog dropping
messages related to the ganglia process, and nfsd "peername failed
(err 107)" errors. For example,

Jul 11 20:07:31 pac /usr/sbin/gmetad[3582]: poll() timeout from source 0
for [Pac] data source after 0 bytes read
<repeated 13 times>
Jul 11 20:21:31 pac /usr/sbin/gmetad[3582]: RRD_update
(/var/lib/ganglia/rrds/Pac/n003.genome.arizona.edu/load_one.rrd):
/var/lib/ganglia/rrds/Pac/n003.genome.arizona.edu/load_one.rrd: illegal
attempt to update using time 1531365691 when last update time is
1531365691 (minimum one second step)
<many messages like this from all the nodes n001-n009>
Jul 11 20:21:31 pac rsyslogd-2177: imuxsock begins to drop messages from
pid 3582 due to rate-limiting
Jul 11 20:22:25 pac rsyslogd-2177: imuxsock lost 116 messages from pid
3582 due to rate-limiting
Jul 11 20:22:25 pac /usr/sbin/gmetad[3582]: poll() timeout from source 0
for [Pac] data source after 0 bytes read
<bunch more of these and RRD_update errors>
Jul 11 20:41:54 pac rsyslogd-2177: imuxsock begins to drop messages from
pid 3582 due to rate-limiting
Jul 11 20:42:34 pac rsyslogd-2177: imuxsock lost 116 messages from pid
3582 due to rate-limiting
Jul 11 21:09:56 pac kernel: nfsd: peername failed (err 107)!
<repeated 9 more times>
Jul 11 21:09:59 pac /usr/sbin/gmetad[3582]: poll() timeout from source 0
for [Pac] data source after 0 bytes read
<repeated ~50 more times>
Jul 11 21:48:30 pac /usr/sbin/gmetad[3582]: poll() timeout from source 0
for [Pac] data source after 0 bytes read
Jul 11 21:48:43 pac kernel: nfsd: peername failed (err 107)!
<repeated 3 more times>
Jul 11 21:53:59 pac /usr/sbin/gmetad[3582]: poll() timeout from source 0
for [Pac] data source after 0 bytes read
Jul 11 22:39:05 pac rsnapshot[24727]: /usr/bin/rsnapshot -V -c
/etc/rsnapshotData.conf daily: completed successfully
Jul 11 23:16:24 pac /usr/sbin/gmetad[3582]: poll() timeout from source 0
for [Pac] data source after 0 bytes read
<EOF>


The difference is that this time the server was able to recover once the
rsnapshot jobs had completed; our other cluster jobs (daligner) are still
running and the servers are responsive.

We are going to let this large job finish with NFS/TCP before I file
a bug report with CentOS... but I thought this extra info might be
helpful in troubleshooting. I found the CentOS bug report page and
there are several options for the "Category", including "rdma" and
"kernel"... which do you think I should file it under?

Thanks,

--
Chandler
Arizona Genomics Institute

2018-07-13 14:51:26

by Chuck Lever III

Subject: Re: RDMA connection closed and not re-opened



> On Jul 12, 2018, at 6:44 PM, [email protected] wrote:
>
>> Thanks, we will see how it goes with the latest kernel, and if there
>> are still problems I'll look into filing a bug report with CentOS or
>> something.
>
> So, the latest CentOS kernel, 2.6.32-696.30.1, has not helped yet. In
> the meantime we have reverted to using NFS/TCP over the gigabit
> ethernet link, which creates a bottleneck for the full processing of
> our cluster, but at least it hasn't crashed yet.

You should be able to mount using "proto=tcp" with your mlx4 cards.
That avoids the use of NFS/RDMA but would enable the use of the
higher bandwidth network fabric.
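
For example, something like this on each client (just a sketch that
assumes NFSv3; the server address 10.10.11.10 and the /working export
are taken from your logs, so substitute your actual exports and mount
points):

  mount -t nfs -o vers=3,proto=tcp 10.10.11.10:/working /working

The same options can go in /etc/fstab in place of the rdma options, if
that is how the nodes mount.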


> I did notice that the hangups have all been after 8 PM in each
> occurrence. Each night at 8 PM, the NFS server acts as an NFS client
> and runs a couple of rsnapshot jobs which back up to a different NFS
> server.

Can you diagram your full configuration during the backup? Does the
NFS client mount the NFS server on this same host? Does it use
NFS/RDMA or can it use ssh instead of NFS?


> Even with NFS/TCP the NFS server became unresponsive after 8 PM when
> the rsnapshot jobs were running. I can see in the system messages the
> same sort of Ganglia errors we were seeing before, as well as rsyslog
> dropping messages related to the ganglia process, and nfsd "peername
> failed (err 107)" errors. For example,
>
> Jul 11 20:07:31 pac /usr/sbin/gmetad[3582]: poll() timeout from source 0 for [Pac] data source after 0 bytes read
> <repeated 13 times>
> Jul 11 20:21:31 pac /usr/sbin/gmetad[3582]: RRD_update (/var/lib/ganglia/rrds/Pac/n003.genome.arizona.edu/load_one.rrd): /var/lib/ganglia/rrds/Pac/n003.genome.arizona.edu/load_one.rrd: illegal attempt to update using time 1531365691 when last update time is 1531365691 (minimum one second step)
> <many messages like this from all the nodes n001-n009>
> Jul 11 20:21:31 pac rsyslogd-2177: imuxsock begins to drop messages from pid 3582 due to rate-limiting
> Jul 11 20:22:25 pac rsyslogd-2177: imuxsock lost 116 messages from pid 3582 due to rate-limiting
> Jul 11 20:22:25 pac /usr/sbin/gmetad[3582]: poll() timeout from source 0 for [Pac] data source after 0 bytes read
> <bunch more of these and RRD_update errors>
> Jul 11 20:41:54 pac rsyslogd-2177: imuxsock begins to drop messages from pid 3582 due to rate-limiting
> Jul 11 20:42:34 pac rsyslogd-2177: imuxsock lost 116 messages from pid 3582 due to rate-limiting
> Jul 11 21:09:56 pac kernel: nfsd: peername failed (err 107)!
> <repeated 9 more times>
> Jul 11 21:09:59 pac /usr/sbin/gmetad[3582]: poll() timeout from source 0 for [Pac] data source after 0 bytes read
> <repeated ~50 more times>
> Jul 11 21:48:30 pac /usr/sbin/gmetad[3582]: poll() timeout from source 0 for [Pac] data source after 0 bytes read
> Jul 11 21:48:43 pac kernel: nfsd: peername failed (err 107)!
> <repeated 3 more times>
> Jul 11 21:53:59 pac /usr/sbin/gmetad[3582]: poll() timeout from source 0 for [Pac] data source after 0 bytes read
> Jul 11 22:39:05 pac rsnapshot[24727]: /usr/bin/rsnapshot -V -c /etc/rsnapshotData.conf daily: completed successfully
> Jul 11 23:16:24 pac /usr/sbin/gmetad[3582]: poll() timeout from source 0 for [Pac] data source after 0 bytes read
> <EOF>
>
>
> The difference is that this time the server was able to recover once
> the rsnapshot jobs had completed; our other cluster jobs (daligner)
> are still running and the servers are responsive.

That does describe a possible server overload. Using only GbE could
slow things down enough to avoid catastrophic deadlock.


> We are going to let this large job finish with NFS/TCP before I file
> a bug report with CentOS... but I thought this extra info might be
> helpful in troubleshooting. I found the CentOS bug report page and
> there are several options for the "Category", including "rdma" and
> "kernel"... which do you think I should file it under?

I'm not familiar with the CentOS bug database. If there's an "NFS"
category, I would go with that.

Before filing, you should search that database to see if there are
similar bugs. Simply Googling "peername failed!" brings up several
NFSD related entries right at the top of the list that appear
similar to your circumstance (and there is no mention of NFS/RDMA).


--
Chuck Lever




2018-07-13 22:49:14

by Chandler

Subject: Re: RDMA connection closed and not re-opened

Chuck Lever wrote on 07/13/2018 07:36 AM:
> You should be able to mount using "proto=tcp" with your mlx4 cards.
> That avoids the use of NFS/RDMA but would enable the use of the
> higher bandwidth network fabric.
Thanks, I could definitely try that. IPoIB has its own set of issues,
though; I can cross that bridge when I get to it....

> Can you diagram your full configuration during the backup?
The main server in relation to this issue, which is named "pac" in the
log files, has several local storage devices which are exported over the
Ethernet and Infiniband interfaces. In addition, it has several other
mounts over Ethernet to some of our other NFS servers. The
rsnapshot/backup job uses rsync to read from the local storage and write
to NFS mounts of another server over standard 1Gb Ethernet and TCP.
So the answer to your second question,
> Does the
> NFS client mount the NFS server on this same host?
I believe is "yes"

> Does it use
> NFS/RDMA or can it use ssh instead of NFS?
Currently it just uses NFS/TCP over the 1Gb Ethernet link. rsnapshot
does have the ability to use SSH.

> I'm not familiar with the CentOS bug database. If there's an "NFS"
> category, I would go with that.
There is no "NFS" category, only nfs-utils, nfs-utils-lib, and
nfs4-acl-tools. So I'm guessing if we want to report against NFS then
"kernel" would be the category?

> Before filing, you should search that database to see if there are
> similar bugs. Simply Googling "peername failed!" brings up several
> NFSD related entries right at the top of the list that appear
> similar to your circumstance (and there is no mention of NFS/RDMA).
Thanks, I will be checking that out.

2018-07-14 14:56:23

by Chuck Lever III

Subject: Re: RDMA connection closed and not re-opened



> On Jul 13, 2018, at 6:32 PM, [email protected] wrote:
>
> Chuck Lever wrote on 07/13/2018 07:36 AM:
>> You should be able to mount using "proto=tcp" with your mlx4 cards.
>> That avoids the use of NFS/RDMA but would enable the use of the
>> higher bandwidth network fabric.
> Thanks, I could definitely try that. IPoIB has its own set of issues,
> though; I can cross that bridge when I get to it....

Stick with connected mode and keep rsize and wsize smaller
than the IPoIB MTU, which can be set as high as 65KB.


>> Can you diagram your full configuration during the backup?
> The main server in relation to this issue, which is named "pac" in the
> log files, has several local storage devices which are exported over
> the Ethernet and Infiniband interfaces. In addition, it has several
> other mounts over Ethernet to some of our other NFS servers. The
> rsnapshot/backup job uses rsync to read from the local storage and
> write to NFS mounts of another server over standard 1Gb Ethernet and
> TCP. So the answer to your second question,
>> Does the
>> NFS client mount the NFS server on this same host?
> I believe is "yes"

I wasn't entirely clear: Does pac mount itself?

I don't know what the workload is like on this "self mount", but we
recommend against this kind of configuration because it is prone to
deadlock under a significant workload.


>> Does it use
>> NFS/RDMA or can it use ssh instead of NFS?
> Currently it just uses NFS/TCP over the 1Gb Ethernet link. rsnapshot
> does have the ability to use SSH.

I was thinking that it might be better to use ssh and avoid NFS
for the backup workload, in order to avoid pac mounting itself.


>> I'm not familiar with the CentOS bug database. If there's an "NFS"
>> category, I would go with that.
> There is no "NFS" category, only nfs-utils, nfs-utils-lib, and
> nfs4-acl-tools. So I'm guessing if we want to report against NFS then
> "kernel" would be the category?

In the "kernel" category, there might be an "NFS or NFSD"
subcomponent.


>> Before filing, you should search that database to see if there are
>> similar bugs. Simply Googling "peername failed!" brings up several
>> NFSD related entries right at the top of the list that appear
>> similar to your circumstance (and there is no mention of NFS/RDMA).
> Thanks, I will be checking that out.

--
Chuck Lever




2018-07-18 01:03:28

by Chandler

Subject: Re: RDMA connection closed and not re-opened

Chuck Lever wrote on 07/14/2018 07:37 AM:
> I wasn't entirely clear: Does pac mount itself?
No, why would we do that? Do people do that? Here is a listing of
relevant mounts on our server pac:

/dev/sdc1 on /data type xfs (rw)
/dev/sdb1 on /projects type xfs (rw)
/dev/sde1 on /working type xfs (rw,nobarrier)
nfsd on /proc/fs/nfsd type nfsd (rw)
/dev/drbd0 on /newwing type xfs (rw)
150.x.x.116:/wing on /wing type nfs (rw,addr=150.x.x.116)
150.x.x.116:/archive on /archive type nfs (rw,addr=150.x.x.116)
150.x.x.116:/backups on /backups type nfs (rw,addr=150.x.x.116)

The backup jobs read from the mounted local disks /data and /projects
and write to the remote NFS server at /backups and /archive. I have
noticed, in the log files of our other servers which mount the pac
exports, "nfs: server pac not responding, timed out" messages which all
show up after 8 PM when the backup jobs are running.

And here is listing of our pac server exports:

/data 10.10.10.0/24(rw,no_root_squash,async)
/data 10.10.11.0/24(rw,no_root_squash,async)
/data 150.x.x.192/27(rw,no_root_squash,async)
/data 150.x.x.64/26(rw,no_root_squash,async)
/home 10.10.10.0/24(rw,no_root_squash,async)
/home 10.10.11.0/24(rw,no_root_squash,async)
/opt 10.10.10.0/24(rw,no_root_squash,async)
/opt 10.10.11.0/24(rw,no_root_squash,async)
/projects 10.10.10.0/24(rw,no_root_squash,async)
/projects 10.10.11.0/24(rw,no_root_squash,async)
/projects 150.x.x.192/27(rw,no_root_squash,async)
/projects 150.x.x.64/26(rw,no_root_squash,async)
/tools 10.10.10.0/24(rw,no_root_squash,async)
/tools 10.10.11.0/24(rw,no_root_squash,async)
/usr/share/gridengine 10.10.10.10/24(rw,no_root_squash,async)
/usr/share/gridengine 10.10.11.10/24(rw,no_root_squash,async)
/usr/local 10.10.10.10/24(rw,no_root_squash,async)
/usr/local 10.10.11.10/24(rw,no_root_squash,async)
/working 10.10.10.0/24(rw,no_root_squash,async)
/working 10.10.11.0/24(rw,no_root_squash,async)
/working 150.x.x.192/27(rw,no_root_squash,async)
/working 150.x.x.64/26(rw,no_root_squash,async)
/newwing 10.10.10.0/24(rw,no_root_squash,async)
/newwing 10.10.11.0/24(rw,no_root_squash,async)
/newwing 150.x.x.192/27(rw,no_root_squash,async)
/newwing 150.x.x.64/26(rw,no_root_squash,async)

The 10.10.10.0/24 network is 1GbE and the 10.10.11.0/24 network is the
Infiniband. The other networks are also 1GbE. Our cluster nodes will
normally mount all of these over the Infiniband with RDMA, and the
computation jobs will normally be using /working, which sees the most
reading/writing, but /newwing, /projects, and /data are also used.

It does continue to seem like a bug in NFS, and it somehow seems to be
triggered when the NFS server runs the backup job. I just tried it now,
and about 20 minutes into the backup job the server stopped responding
to some things; for example, iotop froze. top remained active and I
could see the load on the server going up, but only to about 22/24, and
still about 95% idle CPU time. I also noticed the "nfs: server pac not
responding, timed out" messages on our other servers. After about 10
minutes the server became responsive again and the load dropped down to
3/24 while the backup job continued.

Perhaps it could be mitigated if I change the backup job to use SSH
instead of NFS. I'll try that and see if it helps; then once our job
has completed I can try going back to RDMA to see if it still happens....
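
If I go that route, I'd expect the rsnapshot config to end up looking
something like this on the backup host, pulling from pac over ssh (a
rough sketch; snapshot_root and the source paths are just taken from the
mounts listed above, and rsnapshot.conf fields must be tab-separated):

  # /etc/rsnapshot.conf on the backup host
  snapshot_root   /backups/pac/
  cmd_ssh         /usr/bin/ssh

  # pull pac's local filesystems over ssh instead of having pac
  # write to an NFS mount of the backup host
  backup          root@pac:/data/        data/
  backup          root@pac:/projects/    projects/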



2018-08-08 21:15:58

by Chandler

Subject: Re: RDMA connection closed and not re-opened

Chuck Lever wrote on 07/14/2018 07:37 AM:
>> On Jul 13, 2018, at 6:32 PM, [email protected] wrote:
>> Chuck Lever wrote on 07/13/2018 07:36 AM:
>>> You should be able to mount using "proto=tcp" with your mlx4 cards.
>>> That avoids the use of NFS/RDMA but would enable the use of the
>>> higher bandwidth network fabric.
>> Thanks, I could definitely try that. IPoIB has its own set of issues, though; I can cross that bridge when I get to it....
> Stick with connected mode and keep rsize and wsize smaller
> than the IPoIB MTU, which can be set as high as 65KB.
We are running in this setup, and so far so good... however the
rsize/wsize were much greater than the IPoIB MTU, and that is probably
causing these "page allocation failures", which fortunately have not
been fatal; our computation is still running. In the ifcfg file for the
IPoIB interface, the MTU is set to 65520, which was the recommended
maximum from the Red Hat manual. So should rsize/wsize be set to 65519?
Or is it better to pick another value that is a multiple of 1024 or
something? Thanks

2018-08-08 21:23:01

by Chuck Lever III

Subject: Re: RDMA connection closed and not re-opened



> On Aug 8, 2018, at 2:54 PM, [email protected] wrote:
>
> Chuck Lever wrote on 07/14/2018 07:37 AM:
>>> On Jul 13, 2018, at 6:32 PM, [email protected] wrote:
>>> Chuck Lever wrote on 07/13/2018 07:36 AM:
>>>>> You should be able to mount using "proto=tcp" with your mlx4 cards.
>>>>> That avoids the use of NFS/RDMA but would enable the use of the
>>>>> higher bandwidth network fabric.
>>> Thanks, I could definitely try that. IPoIB has its own set of
>>> issues, though; I can cross that bridge when I get to it....
>> Stick with connected mode and keep rsize and wsize smaller
>> than the IPoIB MTU, which can be set as high as 65KB.
> We are running in this setup, and so far so good... however the
> rsize/wsize were much greater than the IPoIB MTU, and that is probably
> causing these "page allocation failures", which fortunately have not
> been fatal; our computation is still running. In the ifcfg file for the
> IPoIB interface, the MTU is set to 65520, which was the recommended
> maximum from the Red Hat manual. So should rsize/wsize be set to 65519?
> Or is it better to pick another value that is a multiple of 1024 or
> something?

The r/wsize settings have to be a power of two. The next power of
two smaller than 65520 is 32768. Try "rsize=32768,wsize=32768".


--
Chuck Lever




2018-08-08 21:32:54

by Chandler

Subject: Re: RDMA connection closed and not re-opened

Chuck Lever wrote on 08/08/2018 12:01 PM:
>> On Aug 8, 2018, at 2:54 PM, [email protected] wrote:
>> Chuck Lever wrote on 07/14/2018 07:37 AM:
>>>> On Jul 13, 2018, at 6:32 PM, [email protected] wrote:
>>>> Chuck Lever wrote on 07/13/2018 07:36 AM:
>>>>> You should be able to mount using "proto=tcp" with your mlx4 cards.
>>>>> That avoids the use of NFS/RDMA but would enable the use of the
>>>>> higher bandwidth network fabric.
>>>> Thanks, I could definitely try that. IPoIB has its own set of issues, though; I can cross that bridge when I get to it....
>>> Stick with connected mode and keep rsize and wsize smaller
>>> than the IPoIB MTU, which can be set as high as 65KB.
>> We are running in this setup, and so far so good... however the rsize/wsize were much greater than the IPoIB MTU, and that is probably causing these "page allocation failures", which fortunately have not been fatal; our computation is still running. In the ifcfg file for the IPoIB interface, the MTU is set to 65520, which was the recommended maximum from the Red Hat manual. So should rsize/wsize be set to 65519? Or is it better to pick another value that is a multiple of 1024 or something?
>
> The r/wsize settings have to be a power of two. The next power of
> two smaller than 65520 is 32768. Try "rsize=32768,wsize=32768".

Thanks, but what is the reason for that? After googling around for a
while for rsize/wsize settings, I finally found in the nfs manual page
(of all places!) that "If a specified value is within the supported
range but not a multiple of 1024, it is rounded down to the nearest
multiple of 1024." So it sounds like we could use 63KiB, or 64512.

2018-08-08 22:09:57

by Chuck Lever III

Subject: Re: RDMA connection closed and not re-opened



> On Aug 8, 2018, at 3:11 PM, [email protected] wrote:
>
> Chuck Lever wrote on 08/08/2018 12:01 PM:
>>> On Aug 8, 2018, at 2:54 PM, [email protected] wrote:
>>> Chuck Lever wrote on 07/14/2018 07:37 AM:
>>>>> On Jul 13, 2018, at 6:32 PM, [email protected] wrote:
>>>>> Chuck Lever wrote on 07/13/2018 07:36 AM:
>>>>>> You should be able to mount using "proto=tcp" with your mlx4 cards.
>>>>>> That avoids the use of NFS/RDMA but would enable the use of the
>>>>>> higher bandwidth network fabric.
>>>>> Thanks, I could definitely try that. IPoIB has its own set of
>>>>> issues, though; I can cross that bridge when I get to it....
>>>> Stick with connected mode and keep rsize and wsize smaller
>>>> than the IPoIB MTU, which can be set as high as 65KB.
>>> We are running in this setup, and so far so good... however the
>>> rsize/wsize were much greater than the IPoIB MTU, and that is
>>> probably causing these "page allocation failures", which fortunately
>>> have not been fatal; our computation is still running. In the ifcfg
>>> file for the IPoIB interface, the MTU is set to 65520, which was the
>>> recommended maximum from the Red Hat manual. So should rsize/wsize be
>>> set to 65519? Or is it better to pick another value that is a
>>> multiple of 1024 or something?
>> The r/wsize settings have to be a power of two. The next power of
>> two smaller than 65520 is 32768. Try "rsize=32768,wsize=32768".
>
> Thanks, but what is the reason for that? After googling around for a
> while for rsize/wsize settings, I finally found in the nfs manual page
> (of all places!) that "If a specified value is within the supported
> range but not a multiple of 1024, it is rounded down to the nearest
> multiple of 1024." So it sounds like we could use 63KiB, or 64512.

I just tried this:

[root@manet ~]# mount -o vers=3,rsize=65520,wsize=65520 klimt:/export/tmp/ /mnt
[root@manet ~]# grep klimt /proc/mounts
klimt:/export/tmp/ /mnt nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.1.55,mountvers=3,mountport=20048,mountproto=udp,local_lock=none,addr=192.168.1.55 0 0

Looks like the man page is wrong.

--
Chuck Lever




2018-08-09 01:33:38

by Chandler

Subject: RDMA connection closed and not re-opened

Chuck Lever wrote on 08/08/2018 12:18 PM:
> Looks like the man page is wrong.

Right you are!