I'm working on a compute cluster with approximately 300 NFS
client machines running Linux 4.19.28[1]. These machines accept
user-submitted jobs which access filesystems exported by
approximately a dozen NFS servers, mostly running Linux 4.4.0 but
a couple running 4.19.28. In all cases we mount with nfsvers=4.
From time to time one of these user-submitted jobs hangs in
uninterruptible sleep (D state) while performing a write to one or
more of these NFS servers, and never completes. Once this happens,
calls to sync will themselves hang in uninterruptible sleep.
Eventually the same thing happens to automount/mount.nfs, and by
that point the host is completely irrecoverable.
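For reference, the way we spot these hung jobs is roughly the
following (nothing cluster-specific, just procps); the stack traces
below come from /proc/<pid>/stack for the PIDs this turns up:

$ ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'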
The problem is more common on our NFS clients when they’re
communicating with an NFS server running 4.19.28, but is not
unique to that NFS server kernel version.
Below is a representative sample of stack traces from hung
user-submitted processes (jobs). The first is quite a lot more
common than the latter two:
$ sudo cat /proc/197520/stack
[<0>] io_schedule+0x12/0x40
[<0>] nfs_lock_and_join_requests+0x309/0x4c0 [nfs]
[<0>] nfs_updatepage+0x2a2/0x8b0 [nfs]
[<0>] nfs_write_end+0x63/0x4c0 [nfs]
[<0>] generic_perform_write+0x138/0x1b0
[<0>] nfs_file_write+0xdc/0x200 [nfs]
[<0>] new_sync_write+0xfb/0x160
[<0>] vfs_write+0xa5/0x1a0
[<0>] ksys_write+0x4f/0xb0
[<0>] do_syscall_64+0x53/0x100
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff
$ sudo cat /proc/120744/stack
[<0>] rpc_wait_bit_killable+0x1e/0x90 [sunrpc]
[<0>] nfs4_do_close+0x27d/0x300 [nfsv4]
[<0>] __put_nfs_open_context+0xb9/0x110 [nfs]
[<0>] nfs_file_release+0x39/0x40 [nfs]
[<0>] __fput+0xb4/0x220
[<0>] task_work_run+0x8a/0xb0
[<0>] do_exit+0x2d7/0xb90
[<0>] do_group_exit+0x3a/0xa0
[<0>] get_signal+0x12e/0x720
[<0>] do_signal+0x36/0x6d0
[<0>] exit_to_usermode_loop+0x89/0xf0
[<0>] do_syscall_64+0xfd/0x100
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff
$ sudo cat /proc/231018/stack
[<0>] io_schedule+0x12/0x40
[<0>] wait_on_page_bit_common+0xfd/0x180
[<0>] __filemap_fdatawait_range+0xe1/0x130
[<0>] filemap_write_and_wait_range+0x45/0x80
[<0>] nfs_file_fsync+0x44/0x1e0 [nfs]
[<0>] filp_close+0x31/0x60
[<0>] __x64_sys_close+0x1e/0x50
[<0>] do_syscall_64+0x53/0x100
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff
All of these processes are user-submitted jobs that are expected to
be doing NFS writes and would under normal circumstances succeed
rather than hang. In each case, despite the variation in the
stack traces, the problem described in this message occurred,
leading to the host being closed, drained, and rebooted.
Once a process is hung in D state as above, sync will also hang in
uninterruptible sleep (D) with the following stack:
$ sudo cat /proc/164581/stack
[<0>] io_schedule+0x12/0x40
[<0>] wait_on_page_bit_common+0xfd/0x180
[<0>] __filemap_fdatawait_range+0xd5/0x130
[<0>] filemap_fdatawait_keep_errors+0x1a/0x40
[<0>] sync_inodes_sb+0x21d/0x2b0
[<0>] iterate_supers+0x98/0x100
[<0>] ksys_sync+0x40/0xb0
[<0>] __ia32_sys_sync+0xa/0x10
[<0>] do_syscall_64+0x55/0x110
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff
Further sync processes will hang waiting for this sync to complete.
For one instance of this issue, I used sysrq to dump all tasks in
uninterruptible sleep. That output is attached, as it shows more
detailed stack traces than are available from /proc/<pid>/stack.
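For completeness, the collection step was essentially the following
(sysrq 'w' dumps blocked, i.e. D state, tasks to the kernel log, and
needs sysrq enabled on the host):

$ echo w | sudo tee /proc/sysrq-trigger
$ sudo dmesg          # or journalctl -k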
We found a promising discussion in the following threads:
https://marc.info/?l=linux-nfs&m=154835352315685&w=2
https://marc.info/?l=linux-nfs&m=154836054918954&w=2
But after building a 4.20.9 kernel and applying deaa5c9 (SUNRPC:
Address Kerberos performance/behavior regression), we still get
hangs on write. Do any of you have anything to suggest, or see
something promising in these stack traces?
-A
1: In testing, I have observed this problem on our NFS clients
running Linux 3.13.0, Linux 4.17.17, Linux 4.18.10, Linux 4.19.28,
and Linux 4.20.9.
--
Alan Post | Xen VPS hosting for the technically adept
PO Box 61688 | Sunnyvale, CA 94088-1681 | https://prgmr.com/
email: [email protected]
On Mon, 2019-05-20 at 16:33 -0600, Alan Post wrote:
> I'm working on a compute cluster with approximately 300 NFS
> client machines running Linux 4.19.28[1]. These machines accept
> user-submitted jobs which access filesystems exported by
> approximately a dozen NFS servers, mostly running Linux 4.4.0 but
> a couple running 4.19.28. In all cases we mount with nfsvers=4.
>
> From time to time one of these user-submitted jobs hangs in
> uninterruptible sleep (D state) while performing a write to one or
> more of these NFS servers, and never completes. Once this happens,
> calls to sync will themselves hang in uninterruptible sleep.
> Eventually the same thing happens to automount/mount.nfs, and by
> that point the host is completely irrecoverable.
>
> The problem is more common on our NFS clients when they’re
> communicating with an NFS server running 4.19.28, but is not
> unique to that NFS server kernel version.
>
> Below is a representative sample of stack traces from hung
> user-submitted processes (jobs). The first is quite a lot more
> common than the latter two:
>
> $ sudo cat /proc/197520/stack
> [<0>] io_schedule+0x12/0x40
> [<0>] nfs_lock_and_join_requests+0x309/0x4c0 [nfs]
> [<0>] nfs_updatepage+0x2a2/0x8b0 [nfs]
> [<0>] nfs_write_end+0x63/0x4c0 [nfs]
> [<0>] generic_perform_write+0x138/0x1b0
> [<0>] nfs_file_write+0xdc/0x200 [nfs]
> [<0>] new_sync_write+0xfb/0x160
> [<0>] vfs_write+0xa5/0x1a0
> [<0>] ksys_write+0x4f/0xb0
> [<0>] do_syscall_64+0x53/0x100
> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [<0>] 0xffffffffffffffff
>
Have you tried upgrading to 4.19.44? There is a fix that went in not
too long ago that deals with a request leak that can cause stack traces
like the above that wait forever.
By the way, the above stack trace with "nfs_lock_and_join_requests"
usually means that you are using a very small rsize or wsize (less than
4k). Is that the case? If so, you might want to look into just
increasing the I/O size.
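For reference, that's just the rsize=/wsize= mount options; something
along these lines, with server:/export as a placeholder and subject
to whatever the server will actually negotiate:

$ sudo mount -t nfs -o nfsvers=4,rsize=1048576,wsize=1048576 server:/export /mnt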
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Tue, May 21, 2019 at 03:46:03PM +0000, Trond Myklebust wrote:
> > Below is a representative sample of stack traces from hung
> > user-submitted processes (jobs). The first is quite a lot more
> > common than the latter two:
> >
> > $ sudo cat /proc/197520/stack
> > [<0>] io_schedule+0x12/0x40
> > [<0>] nfs_lock_and_join_requests+0x309/0x4c0 [nfs]
> > [<0>] nfs_updatepage+0x2a2/0x8b0 [nfs]
> > [<0>] nfs_write_end+0x63/0x4c0 [nfs]
> > [<0>] generic_perform_write+0x138/0x1b0
> > [<0>] nfs_file_write+0xdc/0x200 [nfs]
> > [<0>] new_sync_write+0xfb/0x160
> > [<0>] vfs_write+0xa5/0x1a0
> > [<0>] ksys_write+0x4f/0xb0
> > [<0>] do_syscall_64+0x53/0x100
> > [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [<0>] 0xffffffffffffffff
> >
>
> Have you tried upgrading to 4.19.44? There is a fix that went in not
> too long ago that deals with a request leak that can cause stack traces
> like the above that wait forever.
>
That I haven't tried. I gather you're talking about either or both
of:
63b0ee126f7e
be74fddc976e
I see those went in after 4.19.28 (which I've tried) but didn't
make it into 4.20.9 (which I've also tried). Let me see about
trying the 4.19.44 kernel.
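Here's roughly how I plan to double-check which stable releases
carry those two fixes (the mainline SHAs won't match the stable
backports, so I'll search by subject; this assumes a linux-stable
checkout):

$ git log --oneline -i --grep='request leak' v4.19.28..v4.19.44 -- fs/nfs/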
> By the way, the above stack trace with "nfs_lock_and_join_requests"
> usually means that you are using a very small rsize or wsize (less than
> 4k). Is that the case? If so, you might want to look into just
> increasing the I/O size.
>
These exports have rsize and wsize set to 1048576. That decision was
before my time, and I'll guess this value was picked to match
NFSSVC_MAXBLKSIZE.
Thank you for your help,
-A
--
Alan Post | Xen VPS hosting for the technically adept
PO Box 61688 | Sunnyvale, CA 94088-1681 | https://prgmr.com/
email: [email protected]
On Tue, May 21, 2019 at 03:46:03PM +0000, Trond Myklebust wrote:
> Have you tried upgrading to 4.19.44? There is a fix that went in not
> too long ago that deals with a request leak that can cause stack traces
> like the above that wait forever.
>
Following up on this. I have set aside a rack of machines and put
Linux 4.19.44 on them. They ran jobs overnight and will do the
same over the long weekend (Memorial Day in the US). Given the
error rate (both over time and over submitted jobs) we see across
the cluster, this will be enough time to draw a conclusion as to
whether 4.19.44 exhibits this hang.
Other than stack traces, what kind of information could I collect
that would be helpful for debugging or describing more precisely
what is happening to these hosts? I'd like to move on from trying
different kernels (as you no doubt saw in my initial message, I've
done a lot of that) and get to debugging or reproducing the problem.
I'll report back early next week and appreciate your feedback,
-A
--
Alan Post | Xen VPS hosting for the technically adept
PO Box 61688 | Sunnyvale, CA 94088-1681 | https://prgmr.com/
email: [email protected]
On Fri, 24 May 2019 at 13:32, Alan Post <[email protected]> wrote:
>
> On Tue, May 21, 2019 at 03:46:03PM +0000, Trond Myklebust wrote:
> > Have you tried upgrading to 4.19.44? There is a fix that went in not
> > too long ago that deals with a request leak that can cause stack traces
> > like the above that wait forever.
> >
>
> Following up on this. I have set aside a rack of machines and put
> Linux 4.19.44 on them. They ran jobs overnight and will do the
> same over the long weekend (Memorial Day in the US). Given the
> error rate (both over time and over submitted jobs) we see across
> the cluster, this will be enough time to draw a conclusion as to
> whether 4.19.44 exhibits this hang.
>
> Other than stack traces, what kind of information could I collect
> that would be helpful for debugging or describing more precisely
> what is happening to these hosts? I'd like to move on from trying
> different kernels (as you no doubt saw in my initial message, I've
> done a lot of that) and get to debugging or reproducing the problem.
>
> I'll report back early next week and appreciate your feedback,
>
Perhaps the output from 'cat /sys/kernel/debug/rpc_clnt/*/tasks'?
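Roughly, and assuming debugfs is mounted at the usual location,
something like:

$ sudo mount -t debugfs none /sys/kernel/debug    # if not already mounted
$ sudo sh -c 'for f in /sys/kernel/debug/rpc_clnt/*/tasks; do echo "== $f"; cat "$f"; done'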
Thanks
Trond
On Fri, May 24, 2019 at 11:31:55AM -0600, Alan Post wrote:
> On Tue, May 21, 2019 at 03:46:03PM +0000, Trond Myklebust wrote:
> > Have you tried upgrading to 4.19.44? There is a fix that went in not
> > too long ago that deals with a request leak that can cause stack traces
> > like the above that wait forever.
> >
>
> Following up on this. I have set aside a rack of machines and put
> Linux 4.19.44 on them. They ran jobs overnight and will do the
> same over the long weekend (Memorial Day in the US). Given the
> error rate (both over time and over submitted jobs) we see across
> the cluster, this will be enough time to draw a conclusion as to
> whether 4.19.44 exhibits this hang.
>
In the six days I've run Linux 4.19.44 on a single rack, I've seen
no occurrences of this hang. Given the incident rate for this
issue across the cluster over the same period of time, I would have
expected to see one or two incidents on the rack running 4.19.44.
This is promising--I'm going to deploy 4.19.44 to another rack
by the end of the day Friday May 31st and hope for more of the
same.
I wondered upthread whether the following commits were what you
had in mind when you asked about 4.19.44:
63b0ee126f7e: NFS: Fix an I/O request leakage in nfs_do_recoalesce
be74fddc976e: NFS: Fix I/O request leakages
Confirming that it is these patches and no others has become
relevant for me: my upstream is now providing a 4.19.37 build,
and I note these two patches have been included since 4.19.31,
so they are presumably in that build.
If I could trouble you to confirm whether or not this is the
complete set of patches you had in mind for the 4.19 branch
after 4.19.28 when you recommended I try 4.19.44, I would
appreciate it.
Lurking on the list for the past week or two and watching
everyone's work has been inspiring. Thank you again. I'll report
back no later than next week.
-A
--
Alan Post | Xen VPS hosting for the technically adept
PO Box 61688 | Sunnyvale, CA 94088-1681 | https://prgmr.com/
email: [email protected]
On Wed, 2019-05-29 at 18:41 -0600, Alan Post wrote:
> On Fri, May 24, 2019 at 11:31:55AM -0600, Alan Post wrote:
> > On Tue, May 21, 2019 at 03:46:03PM +0000, Trond Myklebust wrote:
> > > Have you tried upgrading to 4.19.44? There is a fix that went in
> > > not
> > > too long ago that deals with a request leak that can cause stack
> > > traces
> > > like the above that wait forever.
> > >
> >
> > Following up on this. I have set aside a rack of machines and put
> > Linux 4.19.44 on them. They ran jobs overnight and will do the
> > same over the long weekend (Memorial Day in the US). Given the
> > error rate (both over time and over submitted jobs) we see across
> > the cluster, this will be enough time to draw a conclusion as to
> > whether 4.19.44 exhibits this hang.
> >
>
> In the six days I've run Linux 4.19.44 on a single rack, I've seen
> no occurrences of this hang. Given the incident rate for this
> issue across the cluster over the same period of time, I would have
> expected to see one or two incidents on the rack running 4.19.44.
>
> This is promising--I'm going to deploy 4.19.44 to another rack
> by the end of the day Friday May 31st and hope for more of the
> same.
>
> I wondered upthread whether the following commits were what you
> had in mind when you asked about 4.19.44:
>
> 63b0ee126f7e: NFS: Fix an I/O request leakage in
> nfs_do_recoalesce
> be74fddc976e: NFS: Fix I/O request leakages
>
> Confirming that it is these patches and no others has become
> relevant for me: my upstream is now providing a 4.19.37 build,
> and I note these two patches have been included since 4.19.31,
> so they are presumably in that build.
>
> If I could trouble you to confirm whether or not this is the
> complete set of patches you had in mind for the 4.19 branch
> after 4.19.28 when you recommended I try 4.19.44, I would
> appreciate it.
>
Yes, those two patches are the ones I was specifically considering,
given the problem that you reported.
Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Tue, May 21, 2019 at 01:22:54PM -0600, Alan Post wrote:
> On Tue, May 21, 2019 at 03:46:03PM +0000, Trond Myklebust wrote:
> > > Below is a representative sample of stack traces from hung
> > > user-submitted processes (jobs). The first is quite a lot more
> > > common than the latter two:
> > >
> > > $ sudo cat /proc/197520/stack
> > > [<0>] io_schedule+0x12/0x40
> > > [<0>] nfs_lock_and_join_requests+0x309/0x4c0 [nfs]
> > > [<0>] nfs_updatepage+0x2a2/0x8b0 [nfs]
> > > [<0>] nfs_write_end+0x63/0x4c0 [nfs]
> > > [<0>] generic_perform_write+0x138/0x1b0
> > > [<0>] nfs_file_write+0xdc/0x200 [nfs]
> > > [<0>] new_sync_write+0xfb/0x160
> > > [<0>] vfs_write+0xa5/0x1a0
> > > [<0>] ksys_write+0x4f/0xb0
> > > [<0>] do_syscall_64+0x53/0x100
> > > [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > [<0>] 0xffffffffffffffff
> > >
> >
> > Have you tried upgrading to 4.19.44? There is a fix that went in not
> > too long ago that deals with a request leak that can cause stack traces
> > like the above that wait forever.
> >
>
> That I haven't tried. I gather you're talking about either or both
> of:
>
> 63b0ee126f7e
> be74fddc976e
>
> I see those went in after 4.19.28 (which I've tried) but didn't
> make it into 4.20.9 (which I've also tried). Let me see about
> trying the 4.19.44 kernel.
>
> > By the way, the above stack trace with "nfs_lock_and_join_requests"
> > usually means that you are using a very small rsize or wsize (less than
> > 4k). Is that the case? If so, you might want to look into just
> > increasing the I/O size.
> >
>
> These exports have rsize and wsize set to 1048576.
Are you getting that from the mount command line? It could be negotiated
down during mount. I think you can get the negotiated values from the
rsize= and wsize= values on the opts: line in /proc/self/mountstats.
See also /proc/fs/nfsd/max_block_size.
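Something like this should show the negotiated values per NFS mount
(a rough sketch over the mountstats format):

$ awk '/^device /{dev=$2} /opts:/{print dev, $0}' /proc/self/mountstats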
--b.
> That decision was
> before my time, and I'll guess this value was picked to match
> NFSSVC_MAXBLKSIZE.
>
> Thank you for your help,
>
> -A
> --
> Alan Post | Xen VPS hosting for the technically adept
> PO Box 61688 | Sunnyvale, CA 94088-1681 | https://prgmr.com/
> email: [email protected]
On Thu, May 30, 2019 at 02:39:58PM -0400, J. Bruce Fields wrote:
> > > By the way, the above stack trace with "nfs_lock_and_join_requests"
> > > usually means that you are using a very small rsize or wsize (less than
> > > 4k). Is that the case? If so, you might want to look into just
> > > increasing the I/O size.
> > >
> >
> > These exports have rsize and wsize set to 1048576.
>
> Are you getting that from the mount command line? It could be negotiated
> down during mount. I think you can get the negotiated values from the
> rsize= and wsize= values on the opts: line in /proc/self/mountstats.
> See also /proc/fs/nfsd/max_block_size.
>
Great catch. I was reporting configuration from the mount command-line.
I've spot-checked /proc/self/mountstats and it reports the same values,
rsize and wsize of 1048576. I do have different values here for NFS
servers that are administratively outside of this cluster, where it is
65536, but in those cases we're not setting that option on the mount
command-line and I am not experiencing the hang I report here against
those servers.
-A
--
Alan Post | Xen VPS hosting for the technically adept
PO Box 61688 | Sunnyvale, CA 94088-1681 | https://prgmr.com/
email: [email protected]
On Wed, May 29, 2019 at 06:41:46PM -0600, Alan Post wrote:
> On Fri, May 24, 2019 at 11:31:55AM -0600, Alan Post wrote:
> > On Tue, May 21, 2019 at 03:46:03PM +0000, Trond Myklebust wrote:
> > > Have you tried upgrading to 4.19.44? There is a fix that went in not
> > > too long ago that deals with a request leak that can cause stack traces
> > > like the above that wait forever.
> > >
> >
> > Following up on this. I have set aside a rack of machines and put
> > Linux 4.19.44 on them. They ran jobs overnight and will do the
> > same over the long weekend (Memorial Day in the US). Given the
> > error rate (both over time and over submitted jobs) we see across
> > the cluster, this will be enough time to draw a conclusion as to
> > whether 4.19.44 exhibits this hang.
> >
>
> In the six days I've run Linux 4.19.44 on a single rack, I've seen
> no occurrences of this hang. Given the incident rate for this
> issue across the cluster over the same period of time, I would have
> expected to see one or two incidents on the rack running 4.19.44.
>
> This is promising--I'm going to deploy 4.19.44 to another rack
> by the end of the day Friday May 31st and hope for more of the
> same.
>
[snip]
>
> I'll report back no later than next week.
>
As far as I'm concerned, the problem I've reported here is resolved.
I have seen no evidence of this issue on any Linux 4.19.44 kernel,
either on the rack I originally set aside or on the second rack the
same kernel was deployed to.
In addition, we began rolling out the upstream Linux 4.19.37 I
mentioned. The total incident rate across the cluster has trended
down in near lockstep with that deployment, and none of those systems
have shown any evidence of this hang either.
It even happened in a tremendously satisfying way: late last
week we went through a multi-day period of zero occurrences
of this issue anywhere in the cluster, including on kernel
versions where it should have been happening. That news was *too*
good--everything I understand about the issue suggested it should
have been occurring less frequently, but still occurring.
Therefore, expecting a regression to the mean, I calculated what
our incident rate should be given our balance of kernel versions,
socialized those numbers around here, and waited for the sampling
period to close. (We have significant day-over-day load variance
and, by comparison, little week-over-week load variance.)
On Monday, when I revisited the problem, not only had the incident
rate regressed to the rate I expected, it had done so 'perfectly':
the actual incident count matched my 'best guess' within the
range of possible values for the entire sampling period, including
the anomalous no-incident period.
We've still got operations work to do to put this issue behind us,
but as best I can tell the work that remains is a 'simple matter
of effort.'
Thank you Trond. If I can be of any help, please reach out.
-A
--
Alan Post | Xen VPS hosting for the technically adept
PO Box 61688 | Sunnyvale, CA 94088-1681 | https://prgmr.com/
email: [email protected]