On May 20th I reported "User process NFS write hang followed
by automount hang requiring reboot" to this list. There I
had a process that would hang on NFS write, followed by sync
hanging, eventually leading to my need to reboot the host.
On June 4th, after upgrading to Linux 4.19.44, I reported
the issue resolved. Since then, as I've rolled out
Linux 4.19.44 more widely, the issue has come back--sort of.
I have begun once again getting sync hangs following a
hung NFS write. The hung write has a different stack trace
than any I previously reported:
[<0>] wait_on_commit+0x60/0x90 [nfs]
[<0>] __nfs_commit_inode+0x146/0x1a0 [nfs]
[<0>] nfs_file_fsync+0xa7/0x1d0 [nfs]
[<0>] filp_close+0x25/0x70
[<0>] put_files_struct+0x66/0xb0
[<0>] do_exit+0x2af/0xbb0
[<0>] do_group_exit+0x35/0xa0
[<0>] __x64_sys_exit_group+0xf/0x10
[<0>] do_syscall_64+0x45/0x100
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff
And there is an attendant kworker thread:
[<0>] wait_on_commit+0x60/0x90 [nfs]
[<0>] __nfs_commit_inode+0x146/0x1a0 [nfs]
[<0>] nfs_write_inode+0x5c/0x90 [nfs]
[<0>] nfs4_write_inode+0xd/0x30 [nfsv4]
[<0>] __writeback_single_inode+0x27a/0x320
[<0>] writeback_sb_inodes+0x19a/0x460
[<0>] wb_writeback+0x102/0x2f0
[<0>] wb_workfn+0xa3/0x400
[<0>] process_one_work+0x1e3/0x3d0
[<0>] worker_thread+0x28/0x3c0
[<0>] kthread+0x10e/0x130
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff
Oddly enough, I can clear the problem without rebooting the host.
I arrange to block all traffic between the NFS server and NFS
client using iptables for long enough that any open TCP
connections time out, after which the connection apparently
reestablishes and the hung process unblocks.
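For concreteness, the workaround amounts to something like the following
rough Python sketch; the server address and the wait time here are
placeholders I've made up for illustration, not the exact values I use:

import subprocess
import time

# Placeholder address for the NFS server, and an assumed wait that is long
# enough for the client's TCP connection to give up.
NFS_SERVER = "192.0.2.10"
WAIT_SECONDS = 20 * 60

rules = [
    ["INPUT", "-s", NFS_SERVER, "-j", "DROP"],
    ["OUTPUT", "-d", NFS_SERVER, "-j", "DROP"],
]

# Insert DROP rules in both directions, wait, then remove them again so a
# fresh connection can be established.
for rule in rules:
    subprocess.run(["iptables", "-I"] + rule, check=True)
try:
    time.sleep(WAIT_SECONDS)
finally:
    for rule in rules:
        subprocess.run(["iptables", "-D"] + rule, check=True)
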
I can't explain what's keeping the connection alive but apparently
stalled--requiring my manual intervention. Do any of you have
ideas or speculation? I'm happy to poke around in a packet capture
if the information provided isn't sufficient.
-A
--
Alan Post | Xen VPS hosting for the technically adept
PO Box 61688 | Sunnyvale, CA 94088-1681 | https://prgmr.com/
email: [email protected]
Hi Alan,
I think that your transport or NFS server is dropping the response to an
RPC. The NFS client will not retransmit on an established connection.
What server are you using? Any middle boxes on the network that could be
transparently dropping transmissions (less likely, but I have seen them)?
Ben
On 17 Jun 2019, at 20:06, Alan Post wrote:
> On May 20th I reported "User process NFS write hang followed
> by automount hang requiring reboot" to this list. There I
> had a process that would hang on NFS write, followed by sync
> hanging, eventually leading to my need to reboot the host.
>
> On June 4th, after upgrading to Linux 4.19.44, I reported
> the issue resolved. Since then, as I've rolled out
> Linux 4.19.44 more widely, the issue has come back--sort of.
>
> I have begun once again getting sync hangs following a
> hung NFS write. The hung write has a different stack trace
> than any I previously reported:
>
> [<0>] wait_on_commit+0x60/0x90 [nfs]
> [<0>] __nfs_commit_inode+0x146/0x1a0 [nfs]
> [<0>] nfs_file_fsync+0xa7/0x1d0 [nfs]
> [<0>] filp_close+0x25/0x70
> [<0>] put_files_struct+0x66/0xb0
> [<0>] do_exit+0x2af/0xbb0
> [<0>] do_group_exit+0x35/0xa0
> [<0>] __x64_sys_exit_group+0xf/0x10
> [<0>] do_syscall_64+0x45/0x100
> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [<0>] 0xffffffffffffffff
>
> And there is an attendant kworker thread:
>
> [<0>] wait_on_commit+0x60/0x90 [nfs]
> [<0>] __nfs_commit_inode+0x146/0x1a0 [nfs]
> [<0>] nfs_write_inode+0x5c/0x90 [nfs]
> [<0>] nfs4_write_inode+0xd/0x30 [nfsv4]
> [<0>] __writeback_single_inode+0x27a/0x320
> [<0>] writeback_sb_inodes+0x19a/0x460
> [<0>] wb_writeback+0x102/0x2f0
> [<0>] wb_workfn+0xa3/0x400
> [<0>] process_one_work+0x1e3/0x3d0
> [<0>] worker_thread+0x28/0x3c0
> [<0>] kthread+0x10e/0x130
> [<0>] ret_from_fork+0x35/0x40
> [<0>] 0xffffffffffffffff
>
> Oddly enough, I can clear the problem without rebooting the host.
> I arrange to block all traffic between the NFS server and NFS
> client using iptables for long enough that any open TCP
> connections time out, after which the connection apparently
> reestablishes and the hung process unblocks.
>
> I can't explain what's keeping the connection alive but apparently
> stalled--requiring my manual intervention. Do any of you have
> ideas or speculation? I'm happy to poke around in a packet capture
> if the information provided isn't sufficient.
>
> -A
> --
> Alan Post | Xen VPS hosting for the technically adept
> PO Box 61688 | Sunnyvale, CA 94088-1681 | https://prgmr.com/
> email: [email protected]
On Tue, Jun 18, 2019 at 11:29:16AM -0400, Benjamin Coddington wrote:
> I think that your transport or NFS server is dropping the response to an
> RPC. The NFS client will not retransmit on an established connection.
>
> What server are you using? Any middle boxes on the network that could be
> transparently dropping transmissions (less likely, but I have seen them)?
>
I've found 8 separate NFS client hangs of the sort I reported here,
and in all cases the same NFS server was involved: an Ubuntu Trusty
system running 4.4.0. I've been upgrading all of these NFS servers
but haven't done this one yet--the volume of NFS hangs I've been
seeing has slowed me down.
Of the 8 NFS clients with a hang to this server, about half are in
the same computer room where packets only transit rack switches, with
the other half also going through a computer room router.
I see positive dropped and overrun packet counts on the NFS server
interface, along with a similar magnitude of pause counts on the
switch port for the NFS server. Given where this issue has occurred,
only this rack switch and a redundant pair of top-of-rack switches in
the rack with the NFS server are common to all 8 NFS clients with
write hangs.
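For reference, the interface counters I mean are the per-interface
statistics the kernel exposes under sysfs. A minimal Python sketch that
dumps the relevant ones (the interface name is a placeholder):

# Dump the drop/overrun counters the kernel keeps per interface.
IFACE = "eth0"  # placeholder; substitute the NFS server's interface

for counter in ["rx_dropped", "tx_dropped", "rx_over_errors", "rx_fifo_errors"]:
    with open(f"/sys/class/net/{IFACE}/statistics/{counter}") as f:
        print(counter, f.read().strip())
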
-A
--
Alan Post | Xen VPS hosting for the technically adept
PO Box 61688 | Sunnyvale, CA 94088-1681 | https://prgmr.com/
email: [email protected]
On 18 Jun 2019, at 20:07, Alan Post wrote:
> On Tue, Jun 18, 2019 at 11:29:16AM -0400, Benjamin Coddington wrote:
>> I think that your transport or NFS server is dropping the response to an
>> RPC. The NFS client will not retransmit on an established connection.
>>
>> What server are you using? Any middle boxes on the network that could be
>> transparently dropping transmissions (less likely, but I have seen them)?
>>
>
> I've found 8 separate NFS client hangs of the sort I reported here,
> and in all cases the same NFS server was involved: an Ubuntu Trusty
> system running 4.4.0. I've been upgrading all of these NFS servers
> but haven't done this one yet--the volume of NFS hangs I've been
> seeing has slowed me down.
>
> Of the 8 NFS clients with a hang to this server, about half are in
> the same computer room where packets only transit rack switches, with
> the other half also going through a computer room router.
>
> I see positive dropped and overrun packet counts on the NFS server
> interface, along with a similar magnitude of pause counts on the
> switch port for the NFS server. Given where this issue has occurred,
> only this rack switch and a redundant pair of top-of-rack switches in
> the rack with the NFS server are common to all 8 NFS clients with
> write hangs.
TCP drops or overruns should not be a problem since the TCP layer will
retransmit packets that are not acked. The issue would be if the NFS
server is perhaps silently dropping a response to an IO RPC, or if an
intelligent middle-box keeping its own stateful transparent TCP handling
sat between client and server (you clearly don't have that here).
I recall some knfsd issues in that era of kernel versions where replies
were dropped when the GSS sequence number fell outside the window. Are
you using sec=krb5* on these mounts, or is it all sec=sys? Perhaps that's
the problem you are seeing. Again, just some guessing.
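A quick way to double-check which flavor each mount actually ended up
with is to read the options back out of /proc/self/mounts--a rough Python
sketch, assuming only that the sec= option shows up there:

# Print the sec= option for every NFS mount recorded in /proc/self/mounts.
with open("/proc/self/mounts") as mounts:
    for line in mounts:
        device, mountpoint, fstype, options = line.split()[:4]
        if fstype.startswith("nfs"):
            sec = [opt for opt in options.split(",") if opt.startswith("sec=")]
            print(mountpoint, sec[0] if sec else "(no sec= option listed)")
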
Verifying that this is the problem could be done by setting up some
rolling network captures, but sometimes it can be hard to keep the
capture from filling up with ongoing traffic from other processes.
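Something along these lines is what I have in mind--a ring of capture
files restricted to NFS traffic so it doesn't grow without bound. The
interface, server address, and sizes below are placeholders, just a sketch:

# Rolling capture of NFS traffic: tcpdump rotates through FILE_COUNT files
# of roughly FILE_SIZE_MB each, so old traffic ages out instead of filling
# the disk.
import subprocess

IFACE = "eth0"
NFS_SERVER = "192.0.2.10"
FILE_SIZE_MB = 100
FILE_COUNT = 20

subprocess.run([
    "tcpdump", "-i", IFACE,
    "-s", "0",                        # capture full frames
    "-w", "/var/tmp/nfs-ring.pcap",   # base name for the rotated files
    "-C", str(FILE_SIZE_MB),
    "-W", str(FILE_COUNT),
    f"host {NFS_SERVER} and port 2049",
], check=True)
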
Ben
On Wed, Jun 19, 2019 at 08:38:02AM -0400, Benjamin Coddington wrote:
> TCP drops or overruns should not be a problem since the TCP layer will
> retransmit packets that are not acked. The issue would be if the NFS
> server is perhaps silently dropping a response to an IO RPC, or if an
> intelligent middle-box keeping its own stateful transparent TCP handling
> sat between client and server (you clearly don't have that here).
>
My conclusion as well. As part of debugging a collection
of reliability issues with the cluster, we've found that
some workloads are more likely to lead to an NFS client hang.
We've migrated the exports used by those workloads to dedicated
NFS servers, one of which is the server under discussion here.
> I recall some knfsd issues in that era of kernel versions where replies
> were dropped when the GSS sequence number fell outside the window. Are
> you using sec=krb5* on these mounts, or is it all sec=sys? Perhaps that's
> the problem you are seeing. Again, just some guessing.
>
We're using sec=sys for the NFS clients that hung on
wait_on_commit, but we have used Kerberos in the past. I'm still
chasing down at least one intermittent, lingering issue where an
open(2) will return EIO while on the wire those procedures
are returning NFS4ERR_EXPIRED. What appears to be happening,
though I'm not certain yet, is that a RENEW of the clientid is
done (or attempted) with Kerberos when it previously was not, and
it succeeds, but only in this degraded manner.
I cannot then rule out something of the sort you're describing.
Thank you for bringing it to my attention.
> Verifying that this is the problem could be done by setting up some
> rolling network captures, but sometimes it can be hard to keep the
> capture from filling up with ongoing traffic from other processes.
>
I did go ahead and set up a rolling capture between this NFS
server and one rack of clients--I hope I can catch the event as
it happens. Time will tell.
Regards,
-A
--
Alan Post | Xen VPS hosting for the technically adept
PO Box 61688 | Sunnyvale, CA 94088-1681 | https://prgmr.com/
email: [email protected]
On Fri, Jun 21, 2019 at 02:47:23PM -0600, Alan Post wrote:
> > Verifying that this is the problem could be done by setting up some
> > rolling network captures, but sometimes it can be hard to keep the
> > capture from filling up with ongoing traffic from other processes.
> >
>
> I did go ahead and set up a rolling capture between this NFS
> server and one rack of clients--I hope I can catch the event as
> it happens. Time will tell.
>
I've run this rolling capture and did catch four candidate events.
I haven't confirmed any of them are real--I don't really know
what it is I'm looking for, so I've been approaching the problem
by incrementally/recursively throwing stuff out and manually
working through what's left.
As far as I understand it, for a particular xid, there should be a
call and a reply. The approach I took, then, was to pull those
fields out of my capture and ignore any RPC call whose xid also
appears in a reply. That seems to be simplistic, as the number of RPC
calls I see without an attendant reply isn't lining up with my
incident window.
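Concretely, the matching I've been doing looks roughly like the sketch
below, run against tshark's field output. The capture filename is a
placeholder, and I'm assuming one RPC message per frame:

# List RPC calls in a capture that never got a matching reply.
# rpc.msgtyp is 0 for a call and 1 for a reply.
import subprocess

out = subprocess.run(
    ["tshark", "-r", "nfs-ring.pcap", "-Y", "rpc",
     "-T", "fields", "-e", "rpc.xid", "-e", "rpc.msgtyp"],
    capture_output=True, text=True, check=True,
).stdout

calls, replies = set(), set()
for line in out.splitlines():
    fields = line.split("\t")
    if len(fields) < 2 or not fields[0]:
        continue
    xid, msgtyp = fields[0], fields[1]
    (calls if msgtyp == "0" else replies).add(xid)

for xid in sorted(calls - replies):
    print("call without reply, xid", xid)
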
In one example, I have a series of READ calls which cease
generating RPC reply messages as the offset for the file continues
to increase. After a few dozen messages, the RPC replies
resume as before. Is there a normal or routine explanation
for this?
RFC 5531 and the NetworkTracing page on wiki.linux-nfs.org have
been quite helpful bringing me up to speed. If any of you have
advice or guidance or can clarify my understanding of how the
call/reply RPC mechanism works I appreciate it.
-A
--
Alan Post | Xen VPS hosting for the technically adept
PO Box 61688 | Sunnyvale, CA 94088-1681 | https://prgmr.com/
email: [email protected]
On 28 Jun 2019, at 14:33, Alan Post wrote:
> On Fri, Jun 21, 2019 at 02:47:23PM -0600, Alan Post wrote:
>>> Verifying that this is the problem could be done by setting up some
>>> rolling network captures, but sometimes it can be hard to keep the
>>> capture from filling up with ongoing traffic from other processes.
>>>
>>
>> I did go ahead and set up a rolling capture between this NFS
>> server and one rack of clients--I hope I can catch the event as
>> it happens. Time will tell.
>>
>
> I've run this rolling capture and did catch four candidate events.
> I haven't confirmed any of them are real--I don't really know
> what it is I'm looking for, so I've been approaching the problem
> by incrementally/recursively throwing stuff out and manually
> working through what's left.
>
> As far as I understand it, for a particular xid, there should be a
> call and a reply. The approach I took, then, was to pull those
> fields out of my capture and ignore any RPC call whose xid also
> appears in a reply. That seems to be simplistic, as the number of RPC
> calls I see without an attendant reply isn't lining up with my
> incident window.
Does your capture report dropped packets? If so, maybe you need to increase
the capture buffer.
There are also the sunrpc:xprt_transmit and sunrpc:xprt_complete_rqst
tracepoints, which should show the xids.
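For example, something like this on the client will stream those
tracepoints as they fire--a sketch that assumes tracefs is mounted at the
usual /sys/kernel/debug/tracing:

# Enable the sunrpc xprt tracepoints and stream their output.
TRACING = "/sys/kernel/debug/tracing"

for event in ["xprt_transmit", "xprt_complete_rqst"]:
    with open(f"{TRACING}/events/sunrpc/{event}/enable", "w") as enable:
        enable.write("1")

# trace_pipe blocks and yields events as they arrive, one line per event.
with open(f"{TRACING}/trace_pipe") as pipe:
    for line in pipe:
        print(line, end="")
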
> In one example, I have a series of READ calls which cease
> generating RPC reply messages as the offset for the file continues
> to increase. After a few dozen messages, the RPC replies
> resume as before. Is there a normal or routine explanation
> for this?
>
> RFC 5531 and the NetworkTracing page on wiki.linux-nfs.org have
> been quite helpful bringing me up to speed. If any of you have
> advice or guidance or can clarify my understanding of how the
> call/reply RPC mechanism works I appreciate it.
Seems like you understand it. Do you have specific questions?
Ben
On Tue, Jul 02, 2019 at 05:55:10AM -0400, Benjamin Coddington wrote:
> > As far as I understand it, for a particular xid, there should be a
> > call and a reply. The approach I took, then, was to pull those
> > fields out of my capture and ignore any RPC call whose xid also
> > appears in a reply. That seems to be simplistic, as the number of RPC
> > calls I see without an attendant reply isn't lining up with my
> > incident window.
>
> Does your capture report dropped packets? If so, maybe you need to increase
> the capture buffer.
>
I'm not certain, but I do have a capture on both the NFS server and
the NFS client--comparing them would show me, under most circumstances,
whether either capture was dropping packets. Good catch.
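Roughly what I have in mind for that comparison, with placeholder
filenames and the same tshark field extraction as before:

# Compare the RPC xids present in the client-side and server-side captures.
# Anything seen on one side only was lost on the wire or by a capture.
import subprocess

def xids(capture):
    out = subprocess.run(
        ["tshark", "-r", capture, "-Y", "rpc", "-T", "fields", "-e", "rpc.xid"],
        capture_output=True, text=True, check=True,
    ).stdout
    return {line.strip() for line in out.splitlines() if line.strip()}

client = xids("client.pcap")
server = xids("server.pcap")
print("seen by client only:", sorted(client - server))
print("seen by server only:", sorted(server - client))
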
> > In one example, I have a series of READ calls which cease
> > generating RPC reply messages as the offset for the file continues
> > to increase. After a few dozen messages, the RPC replies
> > resume as before. Is there a normal or routine explanation
> > for this?
> >
> > RFC 5531 and the NetworkTracing page on wiki.linux-nfs.org have
> > been quite helpful bringing me up to speed. If any of you have
> > advice or guidance or can clarify my understanding of how the
> > call/reply RPC mechanism works I appreciate it.
>
> Seems like you understand it. Do you have specific questions?
>
Is it true that for each RPC call there is an RPC reply with the
same xid? Is it a priori an error if an otherwise correct RPC
call is not eventually paired with an RPC reply?
Thank you,
-A
--
Alan Post | Xen VPS hosting for the technically adept
PO Box 61688 | Sunnyvale, CA 94088-1681 | https://prgmr.com/
email: [email protected]
On 7/3/2019 5:32 PM, Alan Post wrote:
> On Tue, Jul 02, 2019 at 05:55:10AM -0400, Benjamin Coddington wrote:
>>> As far as I understand it, for a particular xid, there should be a
>>> call and a reply. The approach I took, then, was to pull those
>>> fields out of my capture and ignore any RPC call whose xid also
>>> appears in a reply. That seems to be simplistic, as the number of RPC
>>> calls I see without an attendant reply isn't lining up with my
>>> incident window.
>>
>> Does your capture report dropped packets? If so, maybe you need to increase
>> the capture buffer.
>>
>
> I'm not certain, but I do have a capture on both the NFS server and
> the NFS client--comparing them would show me, under most circumstances,
> whether either capture was dropping packets. Good catch.
>
>>> In one example, I have a series of READ calls which cease
>>> generating RPC reply messages as the offset for the file continues
>>> to increase. After a few dozen messages, the RPC replies
>>> resume as before. Is there a normal or routine explanation
>>> for this?
>>>
>>> RFC 5531 and the NetworkTracing page on wiki.linux-nfs.org have
>>> been quite helpful bringing me up to speed. If any of you have
>>> advice or guidance or can clarify my understanding of how the
>>> call/reply RPC mechanism works I appreciate it.
>>
>> Seems like you understand it. Do you have specific questions?
>>
>
> Is it true that for each RPC call there is an RPC reply with the
> same xid? Is it a priori an error if an otherwise correct RPC
> call is not eventually paired with an RPC reply?
Absolutely yes. Not replying would be like a local procedure never
returning.
But remember XIDs are not globally unique. They are only unique within
some limited span of time for the connection they were issued on. This
is typically only a problem on very high IOPS workloads, or over long
spans of time.
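So if you're diffing calls against replies over a long capture, it's
safer to pair them up in order per xid rather than treating the xid as a
unique key. A rough sketch, assuming you already have the messages as a
list of (xid, msgtyp) pairs in capture order:

# Pair calls and replies per xid in capture order, so that a reused xid
# which does get a reply cannot hide an earlier call that did not.
from collections import defaultdict

def unanswered_calls(messages):
    pending = defaultdict(list)          # xid -> queue of outstanding calls
    for index, (xid, msgtyp) in enumerate(messages):
        if msgtyp == "0":                # a call joins the queue for its xid
            pending[xid].append(index)
        elif msgtyp == "1" and pending[xid]:
            pending[xid].pop(0)          # a reply retires the oldest call
    return [(xid, index) for xid, queue in pending.items() for index in queue]
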
Tom.