2023-11-29 14:52:25

by Charles Hedrick

[permalink] [raw]
Subject: what data can I collect to help with nfs 4.2 hang?

I have a departmental server on ubuntu 22.04 (kernel 5.15), where peridically logins hang, due to the home directory hanging. Reboot of the client doesn't fix it. Backing down to nfs 3 does. The only error I see that might be relevant is in kern.log on the client:

kern.log-20231129.gz:Nov 29 00:00:41 c207-01.cs.rutgers.edu kernel: [28373970.471104] NFS: state manager: check lease failed on NFSv4 server communis.lcsr.rutgers.edu with error 13

echo expire > ctl on the server hangs. Sometimes I get a backtrace in kern.log. It's a bit long for this message, but I can send it.

The problem takes months to show up, but once it does, we get about one system per week.?

We don't want to abandon nfs 4.2, because we'd like to use default ACLs. But we can't until NFS works.

The problem also occurred on ubuntu 22, with kernel 5.4.0

We have several file servers. It only happens on one, with our home directories. One peculiarity about this server is that we have a very high rate of locking. We have up to 10,000 lock/unlock operations per second. This is because of how Gnome and the browsers work. I should note that we have disabled delegations because of issues in 5.4.0. I'll turn them on next time we reboot, which will be in January. (Because of the nature of this server, we try to keep it up for at least 6 months. There are 2 maintenance windows per year.)