I have been experiencing nfs file access hangs with multiple release
versions of the 5.13.x linux kernel. In each case, all file transfers
freeze for 5-10 seconds and then resume. This seems worse when reading
through many files sequentially.
My server:
- Archlinux w/ a distribution provided kernel package
- filesystems exported with "rw,sync,no_subtree_check,insecure" options
Client:
- Archlinux w/ latest distribution provided kernel (5.13.9-arch1-1 at writing)
- nfs mounted via /net autofs with "soft,nodev,nosuid" options
(ver=4.2 is indicated in mount)
I have tried the 5.13.x kernel several times since the first arch
release (most recently with 5.13.9-arch1-1), all with similar results.
Each time, I am forced to downgrade the linux package to a 5.12.x
kernel (5.12.15-arch1 as of writing) to clear up the transfer issues
and stabilize performance. No other changes are made between tests. I
have confirmed the freezing behavior using both ext4 and btrfs
filesystems exported from this server.
At this point I would appreciate some guidance in what to provide in
order to diagnose and resolve this issue. I don't have a lot of kernel
debugging experience, so instruction would be helpful.
- mike
> On Aug 8, 2021, at 6:37 PM, Mike Javorski <[email protected]> wrote:
>
> I have been experiencing nfs file access hangs with multiple release
> versions of the 5.13.x linux kernel. In each case, all file transfers
> freeze for 5-10 seconds and then resume. This seems worse when reading
> through many files sequentially.
>
> My server:
> - Archlinux w/ a distribution provided kernel package
> - filesystems exported with "rw,sync,no_subtree_check,insecure" options
>
> Client:
> - Archlinux w/ latest distribution provided kernel (5.13.9-arch1-1 at writing)
> - nfs mounted via /net autofs with "soft,nodev,nosuid" options
> (ver=4.2 is indicated in mount)
>
> I have tried the 5.13.x kernel several times since the first arch
> release (most recently with 5.13.9-arch1-1), all with similar results.
> Each time, I am forced to downgrade the linux package to a 5.12.x
> kernel (5.12.15-arch1 as of writing) to clear up the transfer issues
> and stabilize performance. No other changes are made between tests. I
> have confirmed the freezing behavior using both ext4 and btrfs
> filesystems exported from this server.
>
> At this point I would appreciate some guidance in what to provide in
> order to diagnose and resolve this issue. I don't have a lot of kernel
> debugging experience, so instruction would be helpful.
Hi Mike-
Thanks for the report.
Since you are using a distribution kernel and don't have much
kernel debugging experience, we typically ask you to work with
your distributor first. linux-nfs@ is a developer's mailing
list, we're not really prepared to provide user support.
When you report the problem to Arch, you might want to have
a description of your workload (especially if you can
reproduce the problem often with a particular application
or user activity), and maybe be prepared to capture a network
trace of the failure using tcpdump.
--
Chuck Lever
I appreciate that Chuck. It was not my intention to seek basic "user"
support, but as I don't know the intricacies of how NFS works in the
kernel, I don't know what information to gather and share to help with
the diagnostic effort. I have read through the troubleshooting
recommendations on the linux-nfs wiki, but the issue is beyond what is
covered there. My setup has been running without any NFS issues for
years at this point, and I can demonstrate the error merely by changing
the booted kernel, so I feel confident it's not a simple configuration
error.
I have raised an issue on the ArchLinux bug tracker as well
(https://bugs.archlinux.org/task/71775). Given that the ArchLinux
kernel carries only a minimal delta from mainline (I've checked, and no
fs/rpc/nfs files are changed in the 7-file delta vs mainline), my
previous experience is that they will refer me to the kernel subsystem
mailing list, but I will wait to hear back from them and see what they
say.
- mike
On Sun, Aug 8, 2021 at 3:47 PM Chuck Lever III <[email protected]> wrote:
>
>
>
> > On Aug 8, 2021, at 6:37 PM, Mike Javorski <[email protected]> wrote:
> >
> > I have been experiencing nfs file access hangs with multiple release
> > versions of the 5.13.x linux kernel. In each case, all file transfers
> > freeze for 5-10 seconds and then resume. This seems worse when reading
> > through many files sequentially.
> >
> > My server:
> > - Archlinux w/ a distribution provided kernel package
> > - filesystems exported with "rw,sync,no_subtree_check,insecure" options
> >
> > Client:
> > - Archlinux w/ latest distribution provided kernel (5.13.9-arch1-1 at writing)
> > - nfs mounted via /net autofs with "soft,nodev,nosuid" options
> > (ver=4.2 is indicated in mount)
> >
> > I have tried the 5.13.x kernel several times since the first arch
> > release (most recently with 5.13.9-arch1-1), all with similar results.
> > Each time, I am forced to downgrade the linux package to a 5.12.x
> > kernel (5.12.15-arch1 as of writing) to clear up the transfer issues
> > and stabilize performance. No other changes are made between tests. I
> > have confirmed the freezing behavior using both ext4 and btrfs
> > filesystems exported from this server.
> >
> > At this point I would appreciate some guidance in what to provide in
> > order to diagnose and resolve this issue. I don't have a lot of kernel
> > debugging experience, so instruction would be helpful.
>
> Hi Mike-
>
> Thanks for the report.
>
> Since you are using a distribution kernel and don't have much
> kernel debugging experience, we typically ask you to work with
> your distributor first. linux-nfs@ is a developer's mailing
> list, we're not really prepared to provide user support.
>
> When you report the problem to Arch, you might want to have
> a description of your workload (especially if you can
> reproduce the problem often with a particular application
> or user activity), and maybe be prepared to capture a network
> trace of the failure using tcpdump.
>
>
> --
> Chuck Lever
>
>
>
On Mon, 09 Aug 2021, Mike Javorski wrote:
> I have been experiencing nfs file access hangs with multiple release
> versions of the 5.13.x linux kernel. In each case, all file transfers
> freeze for 5-10 seconds and then resume. This seems worse when reading
> through many files sequentially.
A particularly useful debugging tool for NFS freezes is to run
rpcdebug -m rpc -c all
while the system appears frozen. As you only have a 5-10 second window
this might be tricky.
Setting or clearing debug flags in the rpc module (whether they are
already set or not) has a side effect of listing all RPC "tasks" which are
waiting for a reply. Seeing that task list can often be useful.
The task list appears in "dmesg" output. If there are no tasks
waiting, nothing will be written, which might lead you to think it didn't
work.
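If catching that 5-10 second window by hand is awkward, a small polling
loop along these lines should work (just a sketch; the 2-second interval
is arbitrary and it needs to run as root):

  # each pass dumps any RPC tasks still waiting for a reply into dmesg
  while true; do
      rpcdebug -m rpc -c all > /dev/null
      sleep 2
  done

Then trawl "dmesg" afterwards for the timestamped task lists.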
As Chuck hinted, tcpdump is invaluable for this sort of problem.
tcpdump -s 0 -w /tmp/somefile.pcap port 2049
will capture NFS traffic. If this can start before a hang, and finish
after, it may contain useful information. Doing that in a way that
doesn't create an enormous file might be a challenge. It would help if
you found a way to trigger the problem. Take note of the circumstances
when it seems to happen the most. If you can only produce a large file,
we can probably still work with it.
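One way to keep the capture from growing without bound is tcpdump's
ring-buffer mode (a sketch; the file size and count are arbitrary):

  # rotate through at most 10 files of ~100 MB each, overwriting the oldest
  tcpdump -s 0 -C 100 -W 10 -w /tmp/nfs-ring.pcap port 2049

Stop it shortly after a freeze and keep whichever file(s) cover that
moment.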
tshark -r /tmp/somefile.pcap
will report the capture one line per packet. You can look for the
appropriate timestamp, note the frame numbers, and use "editcap"
to extract a suitable range of packets.
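For example (the frame numbers here are made up, substitute the ones
around the freeze):

  # keep only frames 11500-13500 from the big capture
  editcap -r /tmp/somefile.pcap /tmp/freeze-slice.pcap 11500-13500

The slice should then compress down to something easy to mail.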
NeilBrown
>
> My server:
> - Archlinux w/ a distribution provided kernel package
> - filesystems exported with "rw,sync,no_subtree_check,insecure" options
>
> Client:
> - Archlinux w/ latest distribution provided kernel (5.13.9-arch1-1 at writing)
> - nfs mounted via /net autofs with "soft,nodev,nosuid" options
> (ver=4.2 is indicated in mount)
>
> I have tried the 5.13.x kernel several times since the first arch
> release (most recently with 5.13.9-arch1-1), all with similar results.
> Each time, I am forced to downgrade the linux package to a 5.12.x
> kernel (5.12.15-arch1 as of writing) to clear up the transfer issues
> and stabilize performance. No other changes are made between tests. I
> have confirmed the freezing behavior using both ext4 and btrfs
> filesystems exported from this server.
>
> At this point I would appreciate some guidance in what to provide in
> order to diagnose and resolve this issue. I don't have a lot of kernel
> debugging experience, so instruction would be helpful.
>
> - mike
>
>
Appreciate the insights Neil. I will make an attempt to
trigger/capture the fault if I can.
- mike
On Sun, Aug 8, 2021 at 5:01 PM NeilBrown <[email protected]> wrote:
>
> On Mon, 09 Aug 2021, Mike Javorski wrote:
> > I have been experiencing nfs file access hangs with multiple release
> > versions of the 5.13.x linux kernel. In each case, all file transfers
> > freeze for 5-10 seconds and then resume. This seems worse when reading
> > through many files sequentially.
>
> A particularly useful debugging tool for NFS freezes is to run
>
> rpcdebug -m rpc -c all
>
> while the system appears frozen. As you only have a 5-10 second window
> this might be tricky.
> Setting or clearing debug flags in the rpc module (whether they are
> already set or not) has a side effect of listing all RPC "tasks" which are
> waiting for a reply. Seeing that task list can often be useful.
>
> The task list appears in "dmesg" output. If there are no tasks
> waiting, nothing will be written, which might lead you to think it didn't
> work.
>
> As Chuck hinted, tcpdump is invaluable for this sort of problem.
> tcpdump -s 0 -w /tmp/somefile.pcap port 2049
>
> will capture NFS traffic. If this can start before a hang, and finish
> after, it may contain useful information. Doing that in a way that
> doesn't create an enormous file might be a challenge. It would help if
> you found a way to trigger the problem. Take note of the circumstances
> when it seems to happen the most. If you can only produce a large file,
> we can probably still work with it.
> tshark -r /tmp/somefile.pcap
> will report the capture one line per packet. You can look for the
> appropriate timestamp, note the frame numbers, and use "editcap"
> to extract a suitable range of packets.
>
> NeilBrown
>
>
> >
> > My server:
> > - Archlinux w/ a distribution provided kernel package
> > - filesystems exported with "rw,sync,no_subtree_check,insecure" options
> >
> > Client:
> > - Archlinux w/ latest distribution provided kernel (5.13.9-arch1-1 at writing)
> > - nfs mounted via /net autofs with "soft,nodev,nosuid" options
> > (ver=4.2 is indicated in mount)
> >
> > I have tried the 5.13.x kernel several times since the first arch
> > release (most recently with 5.13.9-arch1-1), all with similar results.
> > Each time, I am forced to downgrade the linux package to a 5.12.x
> > kernel (5.12.15-arch1 as of writing) to clear up the transfer issues
> > and stabilize performance. No other changes are made between tests. I
> > have confirmed the freezing behavior using both ext4 and btrfs
> > filesystems exported from this server.
> >
> > At this point I would appreciate some guidance in what to provide in
> > order to diagnose and resolve this issue. I don't have a lot of kernel
> > debugging experience, so instruction would be helpful.
> >
> > - mike
> >
> >
Managed to get the rpcdebug info for some freezes today. I had a loop
running to call the rpcdebug command that Neil provided and then sleep
for 2 seconds. Each triggering event was the opening of a new video
file (I was shuffling through files with mpv). I haven't managed to
get a tcp capture yet while trying to get one that isn't too massive;
the timing is tricky, and this doesn't happen every time.
-----------------------------
[31229.965368] -pid- flgs status -client- --rqstp- -timeout ---ops--
[31229.965370] 44224 c801 0 8c478d70 2473aa3b 15395 8531859
nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[31229.965411] 44225 c801 0 8c478d70 a7d3b9b8 15395 8531859
nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[31231.966982] -pid- flgs status -client- --rqstp- -timeout ---ops--
[31231.966986] 44224 c801 0 8c478d70 2473aa3b 14795 8531859
nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[31231.967027] 44225 c801 0 8c478d70 a7d3b9b8 14795 8531859
nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[31233.968741] -pid- flgs status -client- --rqstp- -timeout ---ops--
[31233.968745] 44224 c801 0 8c478d70 2473aa3b 14194 8531859
nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[31233.968787] 44225 c801 0 8c478d70 a7d3b9b8 14194 8531859
nfsv4 READ a:call_status [sunrpc] q:xprt_pending
-----------------------------
[31292.534529] -pid- flgs status -client- --rqstp- -timeout ---ops--
[31292.534530] 45744 c801 0 8c478d70 2473aa3b 15340 8531859
nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[31292.534554] 45745 c801 0 8c478d70 50cbef7b 15342 8531859
nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[31294.536203] -pid- flgs status -client- --rqstp- -timeout ---ops--
[31294.536206] 45744 c801 0 8c478d70 2473aa3b 14740 8531859
nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[31294.536232] 45745 c801 0 8c478d70 50cbef7b 14742 8531859
nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[31296.537826] -pid- flgs status -client- --rqstp- -timeout ---ops--
[31296.537830] 45744 c801 0 8c478d70 2473aa3b 14139 8531859
nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[31296.537871] 45745 c801 0 8c478d70 50cbef7b 14141 8531859
nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[31298.539474] -pid- flgs status -client- --rqstp- -timeout ---ops--
[31298.539478] 45744 c801 0 8c478d70 2473aa3b 13539 8531859
nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[31298.539522] 45745 c801 0 8c478d70 50cbef7b 13541 8531859
nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[31300.541040] -pid- flgs status -client- --rqstp- -timeout ---ops--
[31300.541044] 46048 c801 0 8c478d70 50cbef7b 17965 8531859
nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[31300.541108] 46049 c801 0 8c478d70 2473aa3b 17965 8531859
nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[31302.542632] -pid- flgs status -client- --rqstp- -timeout ---ops--
[31302.542635] 46922 c801 0 8c478d70 50cbef7b 18000 8531859
nfsv4 READ a:call_status [sunrpc] q:xprt_pending
---------------------------
[31375.046020] -pid- flgs status -client- --rqstp- -timeout ---ops--
[31375.046022] 47336 c801 0 8c478d70 cebaa9a6 15964 8531859
nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[31375.046048] 47337 c801 0 8c478d70 f7145d4d 15964 8531859
nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[31377.047361] -pid- flgs status -client- --rqstp- -timeout ---ops--
[31377.047365] 47336 c801 0 8c478d70 cebaa9a6 15364 8531859
nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[31377.047406] 47337 c801 0 8c478d70 f7145d4d 15364 8531859
nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[31379.048801] -pid- flgs status -client- --rqstp- -timeout ---ops--
[31379.048805] 47336 c801 0 8c478d70 cebaa9a6 14764 8531859
nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[31379.048845] 47337 c801 0 8c478d70 f7145d4d 14764 8531859
nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[31381.050376] -pid- flgs status -client- --rqstp- -timeout ---ops--
[31381.050380] 47336 c801 0 8c478d70 cebaa9a6 14163 8531859
nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[31381.050421] 47337 c801 0 8c478d70 f7145d4d 14163 8531859
nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[31383.051746] -pid- flgs status -client- --rqstp- -timeout ---ops--
[31383.051749] 47336 c801 0 8c478d70 cebaa9a6 13563 8531859
nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[31383.051775] 47337 c801 0 8c478d70 f7145d4d 13563 8531859
nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[31385.053182] -pid- flgs status -client- --rqstp- -timeout ---ops--
[31385.053185] 47342 c801 0 8c478d70 f7145d4d 17980 8531859
nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[31387.054422] -pid- flgs status -client- --rqstp- -timeout ---ops--
[31387.054426] 47818 c801 0 8c478d70 f7145d4d 17761 8531859
nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[31387.054467] 47819 c801 0 8c478d70 cebaa9a6 17762 8531859
nfsv4 READ a:call_status [sunrpc] q:xprt_pending
- mike
On Sun, Aug 8, 2021 at 5:28 PM Mike Javorski <[email protected]> wrote:
>
> Appreciate the insights Neil. I will make an attempt to
> trigger/capture the fault if I can.
>
> - mike
>
> On Sun, Aug 8, 2021 at 5:01 PM NeilBrown <[email protected]> wrote:
> >
> > On Mon, 09 Aug 2021, Mike Javorski wrote:
> > > I have been experiencing nfs file access hangs with multiple release
> > > versions of the 5.13.x linux kernel. In each case, all file transfers
> > > freeze for 5-10 seconds and then resume. This seems worse when reading
> > > through many files sequentially.
> >
> > A particularly useful debugging tool for NFS freezes is to run
> >
> > rpcdebug -m rpc -c all
> >
> > while the system appears frozen. As you only have a 5-10 second window
> > this might be tricky.
> > Setting or clearing debug flags in the rpc module (whether they are
> > already set or not) has a side effect of listing all RPC "tasks" which are
> > waiting for a reply. Seeing that task list can often be useful.
> >
> > The task list appears in "dmesg" output. If there are no tasks
> > waiting, nothing will be written, which might lead you to think it didn't
> > work.
> >
> > As Chuck hinted, tcpdump is invaluable for this sort of problem.
> > tcpdump -s 0 -w /tmp/somefile.pcap port 2049
> >
> > will capture NFS traffic. If this can start before a hang, and finish
> > after, it may contain useful information. Doing that in a way that
> > doesn't create an enormous file might be a challenge. It would help if
> > you found a way to trigger the problem. Take note of the circumstances
> > when it seems to happen the most. If you can only produce a large file,
> > we can probably still work with it.
> > tshark -r /tmp/somefile.pcap
> > will report the capture one line per packet. You can look for the
> > appropriate timestamp, note the frame numbers, and use "editcap"
> > to extract a suitable range of packets.
> >
> > NeilBrown
> >
> >
> > >
> > > My server:
> > > - Archlinux w/ a distribution provided kernel package
> > > - filesystems exported with "rw,sync,no_subtree_check,insecure" options
> > >
> > > Client:
> > > - Archlinux w/ latest distribution provided kernel (5.13.9-arch1-1 at writing)
> > > - nfs mounted via /net autofs with "soft,nodev,nosuid" options
> > > (ver=4.2 is indicated in mount)
> > >
> > > I have tried the 5.13.x kernel several times since the first arch
> > > release (most recently with 5.13.9-arch1-1), all with similar results.
> > > Each time, I am forced to downgrade the linux package to a 5.12.x
> > > kernel (5.12.15-arch1 as of writing) to clear up the transfer issues
> > > and stabilize performance. No other changes are made between tests. I
> > > have confirmed the freezing behavior using both ext4 and btrfs
> > > filesystems exported from this server.
> > >
> > > At this point I would appreciate some guidance in what to provide in
> > > order to diagnose and resolve this issue. I don't have a lot of kernel
> > > debugging experience, so instruction would be helpful.
> > >
> > > - mike
> > >
> > >
On Tue, 10 Aug 2021, Mike Javorski wrote:
> Managed to get the rpcdebug info for some freezes today. I had a loop
> running to call the rpcdebug command that Neil provided and then sleep
> for 2 seconds. Each triggering event was the opening of a new video
> file (I was shuffling through files with mpv). I haven't managed to
> get a tcp capture yet while trying to get one that isn't too massive;
> the timing is tricky, and this doesn't happen every time.
>
Thanks for the logs.
They show consistent "a:call_status" and "q:xprt_pending". They also
show a timeout in the range 10000-20000. This number is in 'jiffies',
which is normally millisecs, so 10 to 20 seconds.
That means the RPC layer (thinks it has) sent a request (made a call)
and is waiting for a reply to arrive over the network transport (i.e.
the TCP connection). However that would typically involve a much longer
timeout. For NFSv3 and 4.0, 60 seconds is the norm. For NFSv4.1 I
*think* there is no timeout as NFS never resends requests unless the TCP
connection is broken. So I don't know what to make of that.
Given the fact that it recovers in 5-10 seconds, it seems very likely
that it actually has sent the request (presumably the timeout doesn't
expire). Most likely the server is being delayed in replying. A
tcpdump trace would confirm this.
I don't think you've said anything about what NFS server you are using.
Is it Linux, or something else? If Linux, what kernel do you run there?
One thing that might cause a delay when accessing a file is if some
other client has a 'delegation' and takes a while to return it.
e.g.
client2 creates a file for writing
server gives it a 'write delegation'
client2 queues up lots of write requests
client1 asks to read the file.
server says to client2 "please return the delegation"
client2 starts flushing its data, but either takes too long or
has some other hiccup. Eventually it is returned, or the server
revokes it
server tells client1 it can read now.
This is leading to me asking: do you have other NFS clients that might
be accessing the same file, or might have accessed it recently?
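If you want a quick hint as to whether delegations are in play at all,
something like this on the client is probably enough (a rough check;
the counter layout varies a little between nfsstat versions):

  # look at the delegreturn counter in the "Client nfs v4" section;
  # if it keeps climbing, the server is recalling delegations
  nfsstat -c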
NeilBrown
> -----------------------------
>
> [31229.965368] -pid- flgs status -client- --rqstp- -timeout ---ops--
> [31229.965370] 44224 c801 0 8c478d70 2473aa3b 15395 8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31229.965411] 44225 c801 0 8c478d70 a7d3b9b8 15395 8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31231.966982] -pid- flgs status -client- --rqstp- -timeout ---ops--
> [31231.966986] 44224 c801 0 8c478d70 2473aa3b 14795 8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31231.967027] 44225 c801 0 8c478d70 a7d3b9b8 14795 8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31233.968741] -pid- flgs status -client- --rqstp- -timeout ---ops--
> [31233.968745] 44224 c801 0 8c478d70 2473aa3b 14194 8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31233.968787] 44225 c801 0 8c478d70 a7d3b9b8 14194 8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
>
> -----------------------------
>
> [31292.534529] -pid- flgs status -client- --rqstp- -timeout ---ops--
> [31292.534530] 45744 c801 0 8c478d70 2473aa3b 15340 8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31292.534554] 45745 c801 0 8c478d70 50cbef7b 15342 8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31294.536203] -pid- flgs status -client- --rqstp- -timeout ---ops--
> [31294.536206] 45744 c801 0 8c478d70 2473aa3b 14740 8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31294.536232] 45745 c801 0 8c478d70 50cbef7b 14742 8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31296.537826] -pid- flgs status -client- --rqstp- -timeout ---ops--
> [31296.537830] 45744 c801 0 8c478d70 2473aa3b 14139 8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31296.537871] 45745 c801 0 8c478d70 50cbef7b 14141 8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31298.539474] -pid- flgs status -client- --rqstp- -timeout ---ops--
> [31298.539478] 45744 c801 0 8c478d70 2473aa3b 13539 8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31298.539522] 45745 c801 0 8c478d70 50cbef7b 13541 8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31300.541040] -pid- flgs status -client- --rqstp- -timeout ---ops--
> [31300.541044] 46048 c801 0 8c478d70 50cbef7b 17965 8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31300.541108] 46049 c801 0 8c478d70 2473aa3b 17965 8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31302.542632] -pid- flgs status -client- --rqstp- -timeout ---ops--
> [31302.542635] 46922 c801 0 8c478d70 50cbef7b 18000 8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
>
> ---------------------------
>
> [31375.046020] -pid- flgs status -client- --rqstp- -timeout ---ops--
> [31375.046022] 47336 c801 0 8c478d70 cebaa9a6 15964 8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31375.046048] 47337 c801 0 8c478d70 f7145d4d 15964 8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31377.047361] -pid- flgs status -client- --rqstp- -timeout ---ops--
> [31377.047365] 47336 c801 0 8c478d70 cebaa9a6 15364 8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31377.047406] 47337 c801 0 8c478d70 f7145d4d 15364 8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31379.048801] -pid- flgs status -client- --rqstp- -timeout ---ops--
> [31379.048805] 47336 c801 0 8c478d70 cebaa9a6 14764 8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31379.048845] 47337 c801 0 8c478d70 f7145d4d 14764 8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31381.050376] -pid- flgs status -client- --rqstp- -timeout ---ops--
> [31381.050380] 47336 c801 0 8c478d70 cebaa9a6 14163 8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31381.050421] 47337 c801 0 8c478d70 f7145d4d 14163 8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31383.051746] -pid- flgs status -client- --rqstp- -timeout ---ops--
> [31383.051749] 47336 c801 0 8c478d70 cebaa9a6 13563 8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31383.051775] 47337 c801 0 8c478d70 f7145d4d 13563 8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31385.053182] -pid- flgs status -client- --rqstp- -timeout ---ops--
> [31385.053185] 47342 c801 0 8c478d70 f7145d4d 17980 8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31387.054422] -pid- flgs status -client- --rqstp- -timeout ---ops--
> [31387.054426] 47818 c801 0 8c478d70 f7145d4d 17761 8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> [31387.054467] 47819 c801 0 8c478d70 cebaa9a6 17762 8531859
> nfsv4 READ a:call_status [sunrpc] q:xprt_pending
>
>
> - mike
>
>
> On Sun, Aug 8, 2021 at 5:28 PM Mike Javorski <[email protected]> wrote:
> >
> > Appreciate the insights Neil. I will make an attempt to
> > trigger/capture the fault if I can.
> >
> > - mike
> >
> > On Sun, Aug 8, 2021 at 5:01 PM NeilBrown <[email protected]> wrote:
> > >
> > > On Mon, 09 Aug 2021, Mike Javorski wrote:
> > > > I have been experiencing nfs file access hangs with multiple release
> > > > versions of the 5.13.x linux kernel. In each case, all file transfers
> > > > freeze for 5-10 seconds and then resume. This seems worse when reading
> > > > through many files sequentially.
> > >
> > > A particularly useful debugging tool for NFS freezes is to run
> > >
> > > rpcdebug -m rpc -c all
> > >
> > > while the system appears frozen. As you only have a 5-10 second window
> > > this might be tricky.
> > > Setting or clearing debug flags in the rpc module (whether they are
> > > already set or not) has a side effect of listing all RPC "tasks" which are
> > > waiting for a reply. Seeing that task list can often be useful.
> > >
> > > The task list appears in "dmesg" output. If there are no tasks
> > > waiting, nothing will be written, which might lead you to think it didn't
> > > work.
> > >
> > > As Chuck hinted, tcpdump is invaluable for this sort of problem.
> > > tcpdump -s 0 -w /tmp/somefile.pcap port 2049
> > >
> > > will capture NFS traffic. If this can start before a hang, and finish
> > > after, it may contain useful information. Doing that in a way that
> > > doesn't create an enormous file might be a challenge. It would help if
> > > you found a way to trigger the problem. Take note of the circumstances
> > > when it seems to happen the most. If you can only produce a large file,
> > > we can probably still work with it.
> > > tshark -r /tmp/somefile.pcap
> > > will report the capture one line per packet. You can look for the
> > > appropriate timestamp, note the frame numbers, and use "editcap"
> > > to extract a suitable range of packets.
> > >
> > > NeilBrown
> > >
> > >
> > > >
> > > > My server:
> > > > - Archlinux w/ a distribution provided kernel package
> > > > - filesystems exported with "rw,sync,no_subtree_check,insecure" options
> > > >
> > > > Client:
> > > > - Archlinux w/ latest distribution provided kernel (5.13.9-arch1-1 at writing)
> > > > - nfs mounted via /net autofs with "soft,nodev,nosuid" options
> > > > (ver=4.2 is indicated in mount)
> > > >
> > > > I have tried the 5.13.x kernel several times since the first arch
> > > > release (most recently with 5.13.9-arch1-1), all with similar results.
> > > > Each time, I am forced to downgrade the linux package to a 5.12.x
> > > > kernel (5.12.15-arch1 as of writing) to clear up the transfer issues
> > > > and stabilize performance. No other changes are made between tests. I
> > > > have confirmed the freezing behavior using both ext4 and btrfs
> > > > filesystems exported from this server.
> > > >
> > > > At this point I would appreciate some guidance in what to provide in
> > > > order to diagnose and resolve this issue. I don't have a lot of kernel
> > > > debugging experience, so instruction would be helpful.
> > > >
> > > > - mike
> > > >
> > > >
>
>
This sounds a bit like my bugzilla ticket:
https://bugzilla.kernel.org/show_bug.cgi?id=213887
But in my case the client rarely recovers from the timeout so the hang
becomes permanent. My setup is a bit more complicated though (NFS
re-exporting).
Daire
On Tue, 10 Aug 2021 at 02:29, NeilBrown <[email protected]> wrote:
>
> On Tue, 10 Aug 2021, Mike Javorski wrote:
> > Managed to get the rpcdebug info for some freezes today. I had a loop
> > running to call the rpcdebug command that Neil provided and then sleep
> > for 2 seconds. Each triggering event was the opening of a new video
> > file (I was shuffling through files with mpv). I haven't managed to
> > get a tcp capture yet while trying to get one that isn't too massive;
> > the timing is tricky, and this doesn't happen every time.
> >
>
> Thanks for the logs.
>
> They show consistent "a:call_status" and "q:xprt_pending". They also
> show a timeout in the range 10000-20000. This number is in 'jiffies',
> which is normally millisecs, so 10 to 20 seconds.
>
> That means the RPC layer (thinks it has) sent a request (made a call)
> and is waiting for a reply to arrive over the network transport (i.e.
> the TCP connection). However that would typically involve a much longer
> timeout. For NFSv3 and 4.0, 60 seconds is the norm. For NFSv4.1 I
> *think* there is no timeout as NFS never resends requests unless the TCP
> connection is broken. So I don't know what to make of that.
>
> Given the fact that it recovers in 5-10 seconds, it seems very likely
> that it actually has sent the request (presumably the timeout doesn't
> expire). Most likely the server is being delayed in replying. A
> tcpdump trace would confirm this.
>
> I don't think you've said anything about what NFS server you are using.
> Is it Linux, or something else? If Linux, what kernel do you run there?
>
> One thing that might cause a delay when accessing a file is if some
> other client has a 'delegation' and takes a while to return it.
> e.g.
> client2 creates a file for writing
> server gives it a 'write delegation'
> client2 queues up lots of write requests
> client1 asks to read the file.
> server says to client2 "please return the delegation"
> client2 starts flushing its data, but either takes too long or
> has some other hiccup. Eventually it is returned, or the server
> revokes it
> server tells client1 it can read now.
>
> This is leading to me asking: do you have other NFS clients that might
> be accessing the same file, or might have accessed it recently?
>
> NeilBrown
>
>
> > -----------------------------
> >
> > [31229.965368] -pid- flgs status -client- --rqstp- -timeout ---ops--
> > [31229.965370] 44224 c801 0 8c478d70 2473aa3b 15395 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31229.965411] 44225 c801 0 8c478d70 a7d3b9b8 15395 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31231.966982] -pid- flgs status -client- --rqstp- -timeout ---ops--
> > [31231.966986] 44224 c801 0 8c478d70 2473aa3b 14795 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31231.967027] 44225 c801 0 8c478d70 a7d3b9b8 14795 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31233.968741] -pid- flgs status -client- --rqstp- -timeout ---ops--
> > [31233.968745] 44224 c801 0 8c478d70 2473aa3b 14194 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31233.968787] 44225 c801 0 8c478d70 a7d3b9b8 14194 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> >
> > -----------------------------
> >
> > [31292.534529] -pid- flgs status -client- --rqstp- -timeout ---ops--
> > [31292.534530] 45744 c801 0 8c478d70 2473aa3b 15340 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31292.534554] 45745 c801 0 8c478d70 50cbef7b 15342 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31294.536203] -pid- flgs status -client- --rqstp- -timeout ---ops--
> > [31294.536206] 45744 c801 0 8c478d70 2473aa3b 14740 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31294.536232] 45745 c801 0 8c478d70 50cbef7b 14742 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31296.537826] -pid- flgs status -client- --rqstp- -timeout ---ops--
> > [31296.537830] 45744 c801 0 8c478d70 2473aa3b 14139 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31296.537871] 45745 c801 0 8c478d70 50cbef7b 14141 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31298.539474] -pid- flgs status -client- --rqstp- -timeout ---ops--
> > [31298.539478] 45744 c801 0 8c478d70 2473aa3b 13539 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31298.539522] 45745 c801 0 8c478d70 50cbef7b 13541 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31300.541040] -pid- flgs status -client- --rqstp- -timeout ---ops--
> > [31300.541044] 46048 c801 0 8c478d70 50cbef7b 17965 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31300.541108] 46049 c801 0 8c478d70 2473aa3b 17965 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31302.542632] -pid- flgs status -client- --rqstp- -timeout ---ops--
> > [31302.542635] 46922 c801 0 8c478d70 50cbef7b 18000 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> >
> > ---------------------------
> >
> > [31375.046020] -pid- flgs status -client- --rqstp- -timeout ---ops--
> > [31375.046022] 47336 c801 0 8c478d70 cebaa9a6 15964 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31375.046048] 47337 c801 0 8c478d70 f7145d4d 15964 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31377.047361] -pid- flgs status -client- --rqstp- -timeout ---ops--
> > [31377.047365] 47336 c801 0 8c478d70 cebaa9a6 15364 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31377.047406] 47337 c801 0 8c478d70 f7145d4d 15364 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31379.048801] -pid- flgs status -client- --rqstp- -timeout ---ops--
> > [31379.048805] 47336 c801 0 8c478d70 cebaa9a6 14764 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31379.048845] 47337 c801 0 8c478d70 f7145d4d 14764 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31381.050376] -pid- flgs status -client- --rqstp- -timeout ---ops--
> > [31381.050380] 47336 c801 0 8c478d70 cebaa9a6 14163 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31381.050421] 47337 c801 0 8c478d70 f7145d4d 14163 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31383.051746] -pid- flgs status -client- --rqstp- -timeout ---ops--
> > [31383.051749] 47336 c801 0 8c478d70 cebaa9a6 13563 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31383.051775] 47337 c801 0 8c478d70 f7145d4d 13563 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31385.053182] -pid- flgs status -client- --rqstp- -timeout ---ops--
> > [31385.053185] 47342 c801 0 8c478d70 f7145d4d 17980 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31387.054422] -pid- flgs status -client- --rqstp- -timeout ---ops--
> > [31387.054426] 47818 c801 0 8c478d70 f7145d4d 17761 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31387.054467] 47819 c801 0 8c478d70 cebaa9a6 17762 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> >
> >
> > - mike
> >
> >
> > On Sun, Aug 8, 2021 at 5:28 PM Mike Javorski <[email protected]> wrote:
> > >
> > > Appreciate the insights Neil. I will make an attempt to
> > > trigger/capture the fault if I can.
> > >
> > > - mike
> > >
> > > On Sun, Aug 8, 2021 at 5:01 PM NeilBrown <[email protected]> wrote:
> > > >
> > > > On Mon, 09 Aug 2021, Mike Javorski wrote:
> > > > > I have been experiencing nfs file access hangs with multiple release
> > > > > versions of the 5.13.x linux kernel. In each case, all file transfers
> > > > > freeze for 5-10 seconds and then resume. This seems worse when reading
> > > > > through many files sequentially.
> > > >
> > > > A particularly useful debugging tool for NFS freezes is to run
> > > >
> > > > rpcdebug -m rpc -c all
> > > >
> > > > while the system appears frozen. As you only have a 5-10 second window
> > > > this might be tricky.
> > > > Setting or clearing debug flags in the rpc module (whether they are
> > > > already set or not) has a side effect of listing all RPC "tasks" which are
> > > > waiting for a reply. Seeing that task list can often be useful.
> > > >
> > > > The task list appears in "dmesg" output. If there are no tasks
> > > > waiting, nothing will be written, which might lead you to think it didn't
> > > > work.
> > > >
> > > > As Chuck hinted, tcpdump is invaluable for this sort of problem.
> > > > tcpdump -s 0 -w /tmp/somefile.pcap port 2049
> > > >
> > > > will capture NFS traffic. If this can start before a hang, and finish
> > > > after, it may contain useful information. Doing that in a way that
> > > > doesn't create an enormous file might be a challenge. It would help if
> > > > you found a way to trigger the problem. Take note of the circumstances
> > > > when it seems to happen the most. If you can only produce a large file,
> > > > we can probably still work with it.
> > > > tshark -r /tmp/somefile.pcap
> > > > will report the capture one line per packet. You can look for the
> > > > appropriate timestamp, note the frame numbers, and use "editcap"
> > > > to extract a suitable range of packets.
> > > >
> > > > NeilBrown
> > > >
> > > >
> > > > >
> > > > > My server:
> > > > > - Archlinux w/ a distribution provided kernel package
> > > > > - filesystems exported with "rw,sync,no_subtree_check,insecure" options
> > > > >
> > > > > Client:
> > > > > - Archlinux w/ latest distribution provided kernel (5.13.9-arch1-1 at writing)
> > > > > - nfs mounted via /net autofs with "soft,nodev,nosuid" options
> > > > > (ver=4.2 is indicated in mount)
> > > > >
> > > > > I have tried the 5.13.x kernel several times since the first arch
> > > > > release (most recently with 5.13.9-arch1-1), all with similar results.
> > > > > Each time, I am forced to downgrade the linux package to a 5.12.x
> > > > > kernel (5.12.15-arch1 as of writing) to clear up the transfer issues
> > > > > and stabilize performance. No other changes are made between tests. I
> > > > > have confirmed the freezing behavior using both ext4 and btrfs
> > > > > filesystems exported from this server.
> > > > >
> > > > > At this point I would appreciate some guidance in what to provide in
> > > > > order to diagnose and resolve this issue. I don't have a lot of kernel
> > > > > debugging experience, so instruction would be helpful.
> > > > >
> > > > > - mike
> > > > >
> > > > >
> >
> >
Neil:
Apologies for the delay, your message didn't get properly flagged in my email.
To answer your questions, both client (my Desktop PC) and server (my
NAS) are running ArchLinux; client w/ current kernel (5.13.9), server
w/ current or alternate testing kernels (see below).
The server is running the kernel NFS server.
The client is using the built-in NFS client, and the "mount" output
indicates it is using NFS with "ver=4.2".
Both machines are plugged into the same network switch.
At time of testing, there is only one client active.
The files in this case are static media files (some years old), and
there are no co-occurring writes to the files being read (or even to
the same directory).
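For what it's worth, the negotiated options can be double-checked on the
client (this is just the standard nfsstat mount listing, nothing exotic):

  # prints each NFS mount with the options actually in effect (vers, proto, ...)
  nfsstat -m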
It was suggested on my ArchLinux bug report that I try kernels from
just before and just after the 5.13 NFS changes landed, to see if they
trigger the behavior, as well as a vanilla 5.14 rc kernel. I still
need to test more; I was able to get the freezes with 5.14-rc4, but
results for the earlier 5.13 kernels are so far inconclusive. I
intend to spend some time this weekend attempting to get the tcpdump.
My initial attempts wound up with 400+Mb files which would be
difficult to ship and use for diagnostics.
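If it comes down to pinpointing the exact change, I assume the usual
route would be to bisect the server kernel between the last good and
first bad releases, roughly like this (untested sketch; the tags and
the test step are placeholders):

  git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  cd linux
  git bisect start
  git bisect bad v5.13      # first series showing the freezes
  git bisect good v5.12     # last known-good series
  # build/boot the kernel git checks out, run the NFS read workload,
  # then report the result and repeat until a first bad commit is named:
  git bisect good           # or: git bisect bad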
- mike
On Mon, Aug 9, 2021 at 6:28 PM NeilBrown <[email protected]> wrote:
>
> On Tue, 10 Aug 2021, Mike Javorski wrote:
> > Managed to get the rpcdebug info for some freezes today. I had a loop
> > running to call the rpcdebug command that Neil provided and then sleep
> > for 2 seconds. Each triggering event was the opening of a new video
> > file (I was shuffling through files with mpv). I haven't managed to
> > get a tcp capture yet while trying to get one that isn't too massive;
> > the timing is tricky, and this doesn't happen every time.
> >
>
> Thanks for the logs.
>
> They show consistent "a:call_status" and "q:xprt_pending". They also
> show a timeout in the range 10000-20000. This number is in 'jiffies',
> which is normally millisecs, so 10 to 20 seconds.
>
> That means the RPC layer (thinks it has) sent a request (made a call)
> and is waiting for a reply to arrive over the network transport (i.e.
> the TCP connection). However that would typically involve a much longer
> timeout. For NFSv3 and 4.0, 60 seconds is the norm. For NFSv4.1 I
> *think* there is no timeout as NFS never resends requests unless the TCP
> connection is broken. So I don't know what to make of that.
>
> Given the fact that it recovers in 5-10 seconds, it seems very likely
> that it actually has sent the request (presumably the timeout doesn't
> expire). Most likely the server is being delayed in replying. A
> tcpdump trace would confirm this.
>
> I don't think you've said anything about what NFS server you are using.
> Is it Linux, or something else? If Linux, what kernel do you run there?
>
> One thing that might cause a delay when accessing a file is if some
> other client has a 'delegation' and takes a while to return it.
> e.g.
> client2 creates a file for writing
> server gives it a 'write delegation'
> client2 queues up lots of write requests
> client1 asks to read the file.
> server says to client2 "please return the delegation"
> client2 starts flushing its data, but either takes too long or
> has some other hiccup. Eventually it is returned, or the server
> revokes it
> server tells client1 it can read now.
>
> This is leading to me asking: do you have other NFS clients that might
> be accessing the same file, or might have accessed it recently?
>
> NeilBrown
>
>
> > -----------------------------
> >
> > [31229.965368] -pid- flgs status -client- --rqstp- -timeout ---ops--
> > [31229.965370] 44224 c801 0 8c478d70 2473aa3b 15395 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31229.965411] 44225 c801 0 8c478d70 a7d3b9b8 15395 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31231.966982] -pid- flgs status -client- --rqstp- -timeout ---ops--
> > [31231.966986] 44224 c801 0 8c478d70 2473aa3b 14795 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31231.967027] 44225 c801 0 8c478d70 a7d3b9b8 14795 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31233.968741] -pid- flgs status -client- --rqstp- -timeout ---ops--
> > [31233.968745] 44224 c801 0 8c478d70 2473aa3b 14194 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31233.968787] 44225 c801 0 8c478d70 a7d3b9b8 14194 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> >
> > -----------------------------
> >
> > [31292.534529] -pid- flgs status -client- --rqstp- -timeout ---ops--
> > [31292.534530] 45744 c801 0 8c478d70 2473aa3b 15340 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31292.534554] 45745 c801 0 8c478d70 50cbef7b 15342 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31294.536203] -pid- flgs status -client- --rqstp- -timeout ---ops--
> > [31294.536206] 45744 c801 0 8c478d70 2473aa3b 14740 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31294.536232] 45745 c801 0 8c478d70 50cbef7b 14742 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31296.537826] -pid- flgs status -client- --rqstp- -timeout ---ops--
> > [31296.537830] 45744 c801 0 8c478d70 2473aa3b 14139 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31296.537871] 45745 c801 0 8c478d70 50cbef7b 14141 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31298.539474] -pid- flgs status -client- --rqstp- -timeout ---ops--
> > [31298.539478] 45744 c801 0 8c478d70 2473aa3b 13539 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31298.539522] 45745 c801 0 8c478d70 50cbef7b 13541 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31300.541040] -pid- flgs status -client- --rqstp- -timeout ---ops--
> > [31300.541044] 46048 c801 0 8c478d70 50cbef7b 17965 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31300.541108] 46049 c801 0 8c478d70 2473aa3b 17965 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31302.542632] -pid- flgs status -client- --rqstp- -timeout ---ops--
> > [31302.542635] 46922 c801 0 8c478d70 50cbef7b 18000 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> >
> > ---------------------------
> >
> > [31375.046020] -pid- flgs status -client- --rqstp- -timeout ---ops--
> > [31375.046022] 47336 c801 0 8c478d70 cebaa9a6 15964 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31375.046048] 47337 c801 0 8c478d70 f7145d4d 15964 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31377.047361] -pid- flgs status -client- --rqstp- -timeout ---ops--
> > [31377.047365] 47336 c801 0 8c478d70 cebaa9a6 15364 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31377.047406] 47337 c801 0 8c478d70 f7145d4d 15364 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31379.048801] -pid- flgs status -client- --rqstp- -timeout ---ops--
> > [31379.048805] 47336 c801 0 8c478d70 cebaa9a6 14764 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31379.048845] 47337 c801 0 8c478d70 f7145d4d 14764 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31381.050376] -pid- flgs status -client- --rqstp- -timeout ---ops--
> > [31381.050380] 47336 c801 0 8c478d70 cebaa9a6 14163 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31381.050421] 47337 c801 0 8c478d70 f7145d4d 14163 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31383.051746] -pid- flgs status -client- --rqstp- -timeout ---ops--
> > [31383.051749] 47336 c801 0 8c478d70 cebaa9a6 13563 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31383.051775] 47337 c801 0 8c478d70 f7145d4d 13563 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31385.053182] -pid- flgs status -client- --rqstp- -timeout ---ops--
> > [31385.053185] 47342 c801 0 8c478d70 f7145d4d 17980 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31387.054422] -pid- flgs status -client- --rqstp- -timeout ---ops--
> > [31387.054426] 47818 c801 0 8c478d70 f7145d4d 17761 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> > [31387.054467] 47819 c801 0 8c478d70 cebaa9a6 17762 8531859
> > nfsv4 READ a:call_status [sunrpc] q:xprt_pending
> >
> >
> > - mike
> >
> >
> > On Sun, Aug 8, 2021 at 5:28 PM Mike Javorski <[email protected]> wrote:
> > >
> > > Appreciate the insights Neil. I will make an attempt to
> > > trigger/capture the fault if I can.
> > >
> > > - mike
> > >
> > > On Sun, Aug 8, 2021 at 5:01 PM NeilBrown <[email protected]> wrote:
> > > >
> > > > On Mon, 09 Aug 2021, Mike Javorski wrote:
> > > > > I have been experiencing nfs file access hangs with multiple release
> > > > > versions of the 5.13.x linux kernel. In each case, all file transfers
> > > > > freeze for 5-10 seconds and then resume. This seems worse when reading
> > > > > through many files sequentially.
> > > >
> > > > A particularly useful debugging tool for NFS freezes is to run
> > > >
> > > > rpcdebug -m rpc -c all
> > > >
> > > > while the system appears frozen. As you only have a 5-10 second window
> > > > this might be tricky.
> > > > Setting or clearing debug flags in the rpc module (whether they are
> > > > already set or not) has a side effect of listing all RPC "tasks" which are
> > > > waiting for a reply. Seeing that task list can often be useful.
> > > >
> > > > The task list appears in "dmesg" output. If there are no tasks
> > > > waiting, nothing will be written, which might lead you to think it didn't
> > > > work.
> > > >
> > > > As Chuck hinted, tcpdump is invaluable for this sort of problem.
> > > > tcpdump -s 0 -w /tmp/somefile.pcap port 2049
> > > >
> > > > will capture NFS traffic. If this can start before a hang, and finish
> > > > after, it may contain useful information. Doing that in a way that
> > > > doesn't create an enormous file might be a challenge. It would help if
> > > > you found a way to trigger the problem. Take note of the circumstances
> > > > when it seems to happen the most. If you can only produce a large file,
> > > > we can probably still work with it.
> > > > tshark -r /tmp/somefile.pcap
> > > > will report the capture one line per packet. You can look for the
> > > > appropriate timestamp, note the frame numbers, and use "editcap"
> > > > to extract a suitable range of packets.
> > > >
> > > > NeilBrown
> > > >
> > > >
> > > > >
> > > > > My server:
> > > > > - Archlinux w/ a distribution provided kernel package
> > > > > - filesystems exported with "rw,sync,no_subtree_check,insecure" options
> > > > >
> > > > > Client:
> > > > > - Archlinux w/ latest distribution provided kernel (5.13.9-arch1-1 at writing)
> > > > > - nfs mounted via /net autofs with "soft,nodev,nosuid" options
> > > > > (ver=4.2 is indicated in mount)
> > > > >
> > > > > I have tried the 5.13.x kernel several times since the first arch
> > > > > release (most recently with 5.13.9-arch1-1), all with similar results.
> > > > > Each time, I am forced to downgrade the linux package to a 5.12.x
> > > > > kernel (5.12.15-arch1 as of writing) to clear up the transfer issues
> > > > > and stabilize performance. No other changes are made between tests. I
> > > > > have confirmed the freezing behavior using both ext4 and btrfs
> > > > > filesystems exported from this server.
> > > > >
> > > > > At this point I would appreciate some guidance in what to provide in
> > > > > order to diagnose and resolve this issue. I don't have a lot of kernel
> > > > > debugging experience, so instruction would be helpful.
> > > > >
> > > > > - mike
> > > > >
> > > > >
> >
> >
The "semi-known-good" has been the client. I tried updating the server
multiple times to a 5.13 kernel and each time had to downgrade to the
last 5.12 kernel that ArchLinux released (5.12.15) to stabilize
performance. At each attempt, the client was running the same 5.13
kernel that was being deployed to the server. I never downgraded the
client.
Thank you for the scripts and all the details, I will test things out
this weekend when I can dedicate time to it.
- mike
On Thu, Aug 12, 2021 at 7:39 PM NeilBrown <[email protected]> wrote:
>
> On Fri, 13 Aug 2021, Mike Javorski wrote:
> > Neil:
> >
> > Apologies for the delay, your message didn't get properly flagged in my email.
>
> :-)
>
> >
> > To answer your questions, both client (my Desktop PC) and server (my
> > NAS) are running ArchLinux; client w/ current kernel (5.13.9), server
> > w/ current or alternate testing kernels (see below).
>
> So the bug could be in the server or the client. I assume you are
> careful to test a client against a known-good server, or a server against
> a known-good client.
>
> > I
> > intend to spend some time this weekend attempting to get the tcpdump.
> > My initial attempts wound up with 400+Mb files which would be
> > difficult to ship and use for diagnostics.
>
> Rather than you sending me the dump, I'll send you the code.
>
> Run
> tshark -r filename -d tcp.port==2049,rpc -Y 'tcp.port==2049 && rpc.time > 1'
>
> This will ensure the NFS traffic is actually decoded as NFS and then
> report only NFS(rpc) replies that arrive more than 1 second after the
> request.
> You can add
>
> -T fields -e frame.number -e rpc.time
>
> to find out what the actual delay was.
>
> If it reports any, that will be interesting. Try with a larger time if
> necessary to get a modest number of hits. Using editcap and the given
> frame number you can select out 1000 packets either side of the problem
> and that should compress to be small enough to transport.
>
> However it might not find anything. If the reply never arrives, you'll
> never get a reply with a long timeout. So we need to check that
> everything got a reply...
>
> tshark -r filename -d tcp.port==2049,rpc \
> -Y 'tcp.port==2049 && rpc.msg == 0' -T fields \
> -e rpc.xid -e frame.number | sort > /tmp/requests
>
> tshark -r filename -d tcp.port==2049,rpc \
> -Y 'tcp.port==2049 && rpc.msg == 1' -T fields \
> -e rpc.xid -e frame.number | sort > /tmp/replies
>
> join -a1 /tmp/requests /tmp/replies | awk 'NF==2'
>
> This should list the xid and frame number of all requests that didn't
> get a reply. Again, editcap can extract a range of frames into a file of
> manageable size.
>
> Another possibility is that requests are getting replies, but the reply
> says "NFS4ERR_DELAY"
>
> tshark -r filename -d tcp.port==2049,rpc -Y nfs.nfsstat4==10008
>
> should report any reply with that error code.
>
> Hopefully something there will be interesting.
>
> NeilBrown
On Fri, 13 Aug 2021, Mike Javorski wrote:
> Neil:
>
> Apologies for the delay, your message didn't get properly flagged in my email.
:-)
>
> To answer your questions, both client (my Desktop PC) and server (my
> NAS) are running ArchLinux; client w/ current kernel (5.13.9), server
> w/ current or alternate testing kernels (see below).
So the bug could be in the server or the client. I assume you are
careful to test a client against a known-good server, or a server against
a known-good client.
> I
> intend to spend some time this weekend attempting to get the tcpdump.
> My initial attempts wound up with 400+Mb files which would be
> difficult to ship and use for diagnostics.
Rather than you sending me the dump, I'll send you the code.
Run
tshark -r filename -d tcp.port==2049,rpc -Y 'tcp.port==2049 && rpc.time > 1'
This will ensure the NFS traffic is actually decoded as NFS and then
report only NFS(rpc) replies that arrive more than 1 second after the
request.
You can add
-T fields -e frame.number -e rpc.time
to find out what the actual delay was.
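Put together, that would be something like:

  tshark -r filename -d tcp.port==2049,rpc \
     -Y 'tcp.port==2049 && rpc.time > 1' \
     -T fields -e frame.number -e rpc.time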
If it reports any, that will be interesting. Try with a larger time if
necessary to get a modest number of hits. Using editcap and the given
frame number you can select out 1000 packets either side of the problem
and that should compress to be small enough to transport.
However it might not find anything. If the reply never arrives, you'll
never get a reply with a long timeout. So we need to check that
everything got a reply...
tshark -r filename -d tcp.port==2049,rpc \
-Y 'tcp.port==2049 && rpc.msg == 0' -T fields \
-e rpc.xid -e frame.number | sort > /tmp/requests
tshark -r filename -d tcp.port==2049,rpc \
-Y 'tcp.port==2049 && rpc.msg == 1' -T fields \
-e rpc.xid -e frame.number | sort > /tmp/replies
join -a1 /tmp/requests /tmp/replies | awk 'NF==2'
This should list the xid and frame number of all requests that didn't
get a reply. Again, editcap can extract a range of frames into a file of
manageable size.
Another possibility is that requests are getting replies, but the reply
says "NFS4ERR_DELAY"
tshark -r filename -d tcp.port==2049,rpc -Y nfs.nfsstat4==10008
should report any reply with that error code.
Hopefully something there will be interesting.
NeilBrown
I managed to get a cap with several discrete freezes in it, and I
included a chunk with 5 of them in the span of ~3000 frames. I added
packet comments at each frame that the tshark command reported as > 1
sec RPC wait. Just search for "Freeze" in (wire|t)shark in packet
details. This is with kernel 5.13.10 provided by Arch (See
https://github.com/archlinux/linux/compare/a37da2be8e6c85...v5.13.10-arch1
for diff vs mainline, nothing NFS/RPC related I can identify).
I tried unsuccessfully to get any failures with the 5.12.15 kernel.
https://drive.google.com/file/d/1T42iX9xCdF9Oe4f7JXsnWqD8oJPrpMqV/view?usp=sharing
File should be downloadable anonymously.
- mike
On Thu, Aug 12, 2021 at 7:53 PM Mike Javorski <[email protected]> wrote:
>
> The "semi-known-good" has been the client. I tried updating the server
> multiple times to a 5.13 kernel and each time had to downgrade to the
> last 5.12 kernel that ArchLinux released (5.12.15) to stabilize
> performance. At each attempt, the client was running the same 5.13
> kernel that was being deployed to the server. I never downgraded the
> client.
>
> Thank you for the scripts and all the details, I will test things out
> this weekend when I can dedicate time to it.
>
> - mike
>
> On Thu, Aug 12, 2021 at 7:39 PM NeilBrown <[email protected]> wrote:
> >
> > On Fri, 13 Aug 2021, Mike Javorski wrote:
> > > Neil:
> > >
> > > Apologies for the delay, your message didn't get properly flagged in my email.
> >
> > :-)
> >
> > >
> > > To answer your questions, both client (my Desktop PC) and server (my
> > > NAS) are running ArchLinux; client w/ current kernel (5.13.9), server
> > > w/ current or alternate testing kernels (see below).
> >
> > So the bug could be in the server or the client. I assume you are
> > careful to test a client against a know-good server, or a server against
> > a known-good client.
> >
> > > I
> > > intend to spend some time this weekend attempting to get the tcpdump.
> > > My initial attempts wound up with 400+Mb files which would be
> > > difficult to ship and use for diagnostics.
> >
> > Rather than you sending me the dump, I'll send you the code.
> >
> > Run
> > tshark -r filename -d tcp.port==2049,rpc -Y 'tcp.port==2049 && rpc.time > 1'
> >
> > This will ensure the NFS traffic is actually decoded as NFS and then
> > report only NFS(rpc) replies that arrive more than 1 second after the
> > request.
> > You can add
> >
> > -T fields -e frame.number -e rpc.time
> >
> > to find out what the actual delay was.
> >
> > If it reports any, that will be interesting. Try with a larger time if
> > necessary to get a modest number of hits. Using editcap and the given
> > frame number you can select out 1000 packets either side of the problem
> > and that should compress to be small enough to transport.
> >
> > However it might not find anything. If the reply never arrives, you'll
> > never get a reply with a long timeout. So we need to check that
> > everything got a reply...
> >
> > tshark -r filename -t tcp.port==2049,rpc \
> > -Y 'tcp.port==2049 && rpc.msg == 0' -T fields \
> > -e rpc.xid -e frame.number | sort > /tmp/requests
> >
> > tshark -r filename -t tcp.port==2049,rpc \
> > -Y 'tcp.port==2049 && rpc.msg == 1' -T fields \
> > -e rpc.xid -e frame.number | sort > /tmp/replies
> >
> > join -a1 /tmp/requests /tmp/replies | awk 'NF==2'
> >
> > This should list the xid and frame number of all requests that didn't
> > get a reply. Again, editcap can extract a range of frames into a file of
> > manageable size.
> >
> > Another possibility is that requests are getting replies, but the reply
> > says "NFS4ERR_DELAY"
> >
> > tshark -r filename -t tcp.port==2049,rpc -Y nfs.nfsstat4==10008
> >
> > should report any reply with that error code.
> >
> > Hopefully something there will be interesting.
> >
> > NeilBrown
On Sun, 15 Aug 2021, Mike Javorski wrote:
> I managed to get a cap with several discreet freezes in it, and I
> included a chunk with 5 of them in the span of ~3000 frames. I added
> packet comments at each frame that the tshark command reported as > 1
> sec RPC wait. Just search for "Freeze" in (wire|t)shark in packet
> details. This is with kernel 5.13.10 provided by Arch (See
> https://github.com/archlinux/linux/compare/a37da2be8e6c85...v5.13.10-arch1
> for diff vs mainline, nothing NFS/RPC related I can identify).
>
> I tried unsuccessfully to get any failures with the 5.12.15 kernel.
>
> https://drive.google.com/file/d/1T42iX9xCdF9Oe4f7JXsnWqD8oJPrpMqV/view?usp=sharing
>
> File should be downloadable anonymously.
Got it, thanks.
There are 3 RPC replies that are more than 1 second after the request.
The replies are in frames 983, 1005, and 1777. These roughly correspond to
where you added the "Freeze" annotation (I didn't know you could do that!).
983:
This is the (Start of the) reply to a READ Request in frame 980.
It is a 128K read. The whole reply has arrived by frame 1004, 2ms
later.
The request (Frame 980) is followed 13ms later by a TCP retransmit
of the request, and then (20 usec later) a TCP ACK from the server.
The fact that the client needed to retransmit seems a little odd
but as it is the only retransmit in the whole capture, I don't think
we can read much into it.
1005:
This is the reply to a 128K READ request in frame 793 - earlier than the
previous one.
So there were two READ requests, then a 2 second delay then both
replies in reverse order.
1777:
This is a similar READ reply - to 1761.
There were READs in 1760, 1761, and 1775
1760 is replied to almost immediately
1761 gets a reply in 4 seconds (1777)
1775 never gets a reply (in the available packet capture).
Looking at other delays ... most READs get a reply in under a millisec.
There are about 20 where the reply is more than 1ms - the longest delay
not already mentioned is 90ms with reply 1857.
The pattern here is
READ req (1)
GETATTR req
GETATTR reply
READ req (2)
READ reply (1)
pause
READ reply (2)
I suspect this is the same problem occurring, but it isn't so
noticeable.
My first thought was that the reply might be getting stuck in the TCP
transmit queue on the server, but checking the TSval in the TCP
timestamp option shows that - for frame 983, which shows a 2-second delay
- the TSval is also 2 seconds later than the previous packet. So the
delay happens before the TCP-level decision to create the packet.
So I cannot see any real evidence to suggest a TCP-level problem.
The times of 2 or 4 seconds - and maybe even 90ms - seem unlikely to be
caused by an NFSd problem.
So my guess is that the delay comes from the filesystem. Maybe.
What filesystem are you exporting?
How can we check this? Probably by turning on nfsd tracing.
There are a bunch of tracepoints that relate to reading:
trace_nfsd_read_start
trace_nfsd_read_done
trace_nfsd_read_io_done
trace_nfsd_read_err
trace_nfsd_read_splice
trace_nfsd_read_vector
trace_nfsd_read_start
trace_nfsd_read_done
Enabling them might be useful, as you should be able to see whether the
delay was within one read request, or between two read requests.
But I don't have much (if any) experience in enabling trace points. I
really should try that some day. Maybe you can find guidance on using
these tracepoints somewhere ... or maybe you already know how :-)
NeilBrown
> On Aug 15, 2021, at 9:20 PM, NeilBrown <[email protected]> wrote:
>
> On Sun, 15 Aug 2021, Mike Javorski wrote:
>> I managed to get a cap with several discreet freezes in it, and I
>> included a chunk with 5 of them in the span of ~3000 frames. I added
>> packet comments at each frame that the tshark command reported as > 1
>> sec RPC wait. Just search for "Freeze" in (wire|t)shark in packet
>> details. This is with kernel 5.13.10 provided by Arch (See
>> https://github.com/archlinux/linux/compare/a37da2be8e6c85...v5.13.10-arch1
>> for diff vs mainline, nothing NFS/RPC related I can identify).
>>
>> I tried unsuccessfully to get any failures with the 5.12.15 kernel.
>>
>> https://drive.google.com/file/d/1T42iX9xCdF9Oe4f7JXsnWqD8oJPrpMqV/view?usp=sharing
>>
>> File should be downloadable anonymously.
>
> Got it, thanks.
>
> There are 3 RPC replies that are more than 1 second after the request.
> The replies are in frames 983, 1005, 1777 These roughly correspond to
> where you added the "Freeze" annotation (I didn't know you could do that!).
>
> 983:
> This is the (Start of the) reply to a READ Request in frame 980.
> It is a 128K read. The whole reply has arrived by frame 1004, 2ms
> later.
> The request (Frame 980) is followed 13ms later by a TCP retransmit
> of the request and the (20usec later) a TCP ACK from the server.
>
> The fact that the client needed to retransmit seems a little odd
> but as it is the only retransmit in the whole capture, I don't think
> we can read much into it.
>
> 1005:
> This is the reply to a 128K READ request in frame 793 - earlier than
> previous one.
> So there were two READ requests, then a 2 second delay then both
> replies in reverse order.
>
> 1777:
> This is a similar READ reply - to 1761.
> There were READs in 1760, 1761, and 1775
> 1760 is replied to almost immediately
> 1761 gets a reply in 4 seconds (1777)
> 1775 never gets a reply (in the available packet capture).
>
> Looking at other delays ... most READs get a reply in under a millisec.
> There are about 20 where the reply is more than 1ms - the longest delay
> not already mentioned is 90ms with reply 1857.
> The pattern here is
> READ req (1)
> GETATTR req
> GETATTR reply
> READ req (2)
> READ reply (1)
> pause
> READ reply (2)
>
> I suspect this is the same problem occurring, but it isn't so
> noticeable.
>
> My first thought was that the reply might be getting stuck in the TCP
> transmit queue on the server, but checking the TSval in the TCP
> timestamp option shows that - for frame 983 which shows a 2second delay
> - the TSval is also 2seconds later than the previous packet. So the
> delay happens before the TCP-level decision to create the packet.
>
> So I cannot see any real evidence to suggest a TCP-level problem.
> The time of 2 or 4 seconds - and maybe even 90ms - seem unlikely to be
> caused by an NFSd problem.
>
> So my guess is that the delay comes from the filesystem. Maybe.
> What filesystem are you exporting?
>
> How can we check this? Probably by turning on nfsd tracing.
> There are a bunch of tracepoints that related to reading:
>
> trace_nfsd_read_start
> trace_nfsd_read_done
> trace_nfsd_read_io_done
> trace_nfsd_read_err
> trace_nfsd_read_splice
> trace_nfsd_read_vector
> trace_nfsd_read_start
> trace_nfsd_read_done
>
> Maybe enabling them might be useful as you should be able to see if the
> delay was within one read request, or between two read requests.
> But I don't have much (if any) experience in enabling trace points. I
> really should try that some day. Maybe you can find guidance on using
> these tracepoint somewhere ... or maybe you already know how :-)
Try something like:
# trace-cmd record -e nfsd:nfsd_read_\*
When the test is done, ^C the trace-cmd program. The trace records
are stored in a file called trace.dat. You can view them with:
# trace-cmd report | less
The trace.dat file is portable. It carries the format specifiers
for all recorded events, so the trace records can be viewed on
other systems.
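(If trace-cmd isn't handy, roughly the same can be done by hand through
tracefs - a sketch, assuming the default mount point:

    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_start/enable
    echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_done/enable
    cat /sys/kernel/tracing/trace_pipe > /tmp/nfsd-read.trace

but trace-cmd is the easier route.)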
--
Chuck Lever
Neil:
A possible insight, these captures were all made on the server, so the
rpc reply wouldn't be delayed (at least if the server is the one doing
the replying). I did the captures there because I wanted to get as
close to possible issues as I could. If it makes sense, I can try to
capture from the client side, let me know.
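(A capture on either end would be something along these lines, the
interface name being a placeholder:

    tcpdump -i enp3s0 -w /tmp/nfs-freeze.pcap port 2049
)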
Filesystem being exported is BTRFS, a 6 disk array. I jumped to nfs
being the trouble because that is my visibility to the issue, but it's
certainly possible for it to be the underlying filesystem. Maybe the
nfsd traces will give insight to this.
- mike
On Sun, Aug 15, 2021 at 6:20 PM NeilBrown <[email protected]> wrote:
>
> On Sun, 15 Aug 2021, Mike Javorski wrote:
> > I managed to get a cap with several discreet freezes in it, and I
> > included a chunk with 5 of them in the span of ~3000 frames. I added
> > packet comments at each frame that the tshark command reported as > 1
> > sec RPC wait. Just search for "Freeze" in (wire|t)shark in packet
> > details. This is with kernel 5.13.10 provided by Arch (See
> > https://github.com/archlinux/linux/compare/a37da2be8e6c85...v5.13.10-arch1
> > for diff vs mainline, nothing NFS/RPC related I can identify).
> >
> > I tried unsuccessfully to get any failures with the 5.12.15 kernel.
> >
> > https://drive.google.com/file/d/1T42iX9xCdF9Oe4f7JXsnWqD8oJPrpMqV/view?usp=sharing
> >
> > File should be downloadable anonymously.
>
> Got it, thanks.
>
> There are 3 RPC replies that are more than 1 second after the request.
> The replies are in frames 983, 1005, 1777 These roughly correspond to
> where you added the "Freeze" annotation (I didn't know you could do that!).
>
> 983:
> This is the (Start of the) reply to a READ Request in frame 980.
> It is a 128K read. The whole reply has arrived by frame 1004, 2ms
> later.
> The request (Frame 980) is followed 13ms later by a TCP retransmit
> of the request and the (20usec later) a TCP ACK from the server.
>
> The fact that the client needed to retransmit seems a little odd
> but as it is the only retransmit in the whole capture, I don't think
> we can read much into it.
>
> 1005:
> This is the reply to a 128K READ request in frame 793 - earlier than
> previous one.
> So there were two READ requests, then a 2 second delay then both
> replies in reverse order.
>
> 1777:
> This is a similar READ reply - to 1761.
> There were READs in 1760, 1761, and 1775
> 1760 is replied to almost immediately
> 1761 gets a reply in 4 seconds (1777)
> 1775 never gets a reply (in the available packet capture).
>
> Looking at other delays ... most READs get a reply in under a millisec.
> There are about 20 where the reply is more than 1ms - the longest delay
> not already mentioned is 90ms with reply 1857.
> The pattern here is
> READ req (1)
> GETATTR req
> GETATTR reply
> READ req (2)
> READ reply (1)
> pause
> READ reply (2)
>
> I suspect this is the same problem occurring, but it isn't so
> noticeable.
>
> My first thought was that the reply might be getting stuck in the TCP
> transmit queue on the server, but checking the TSval in the TCP
> timestamp option shows that - for frame 983 which shows a 2second delay
> - the TSval is also 2seconds later than the previous packet. So the
> delay happens before the TCP-level decision to create the packet.
>
> So I cannot see any real evidence to suggest a TCP-level problem.
> The time of 2 or 4 seconds - and maybe even 90ms - seem unlikely to be
> caused by an NFSd problem.
>
> So my guess is that the delay comes from the filesystem. Maybe.
> What filesystem are you exporting?
>
> How can we check this? Probably by turning on nfsd tracing.
> There are a bunch of tracepoints that related to reading:
>
> trace_nfsd_read_start
> trace_nfsd_read_done
> trace_nfsd_read_io_done
> trace_nfsd_read_err
> trace_nfsd_read_splice
> trace_nfsd_read_vector
> trace_nfsd_read_start
> trace_nfsd_read_done
>
> Maybe enabling them might be useful as you should be able to see if the
> delay was within one read request, or between two read requests.
> But I don't have much (if any) experience in enabling trace points. I
> really should try that some day. Maybe you can find guidance on using
> these tracepoint somewhere ... or maybe you already know how :-)
>
> NeilBrown
Thanks for the pointer Chuck
I ran the trace capture and was able to trigger two freezes in
relatively short order. Here is that trace file:
https://drive.google.com/file/d/1qk_VIMkj8aTeQIg5O0W3AvWyeDSWL3vW/view?usp=sharing
- mike
On Mon, Aug 16, 2021 at 6:21 AM Chuck Lever III <[email protected]> wrote:
>
>
>
> > On Aug 15, 2021, at 9:20 PM, NeilBrown <[email protected]> wrote:
> >
> > On Sun, 15 Aug 2021, Mike Javorski wrote:
> >> I managed to get a cap with several discreet freezes in it, and I
> >> included a chunk with 5 of them in the span of ~3000 frames. I added
> >> packet comments at each frame that the tshark command reported as > 1
> >> sec RPC wait. Just search for "Freeze" in (wire|t)shark in packet
> >> details. This is with kernel 5.13.10 provided by Arch (See
> >> https://github.com/archlinux/linux/compare/a37da2be8e6c85...v5.13.10-arch1
> >> for diff vs mainline, nothing NFS/RPC related I can identify).
> >>
> >> I tried unsuccessfully to get any failures with the 5.12.15 kernel.
> >>
> >> https://drive.google.com/file/d/1T42iX9xCdF9Oe4f7JXsnWqD8oJPrpMqV/view?usp=sharing
> >>
> >> File should be downloadable anonymously.
> >
> > Got it, thanks.
> >
> > There are 3 RPC replies that are more than 1 second after the request.
> > The replies are in frames 983, 1005, 1777 These roughly correspond to
> > where you added the "Freeze" annotation (I didn't know you could do that!).
> >
> > 983:
> > This is the (Start of the) reply to a READ Request in frame 980.
> > It is a 128K read. The whole reply has arrived by frame 1004, 2ms
> > later.
> > The request (Frame 980) is followed 13ms later by a TCP retransmit
> > of the request and the (20usec later) a TCP ACK from the server.
> >
> > The fact that the client needed to retransmit seems a little odd
> > but as it is the only retransmit in the whole capture, I don't think
> > we can read much into it.
> >
> > 1005:
> > This is the reply to a 128K READ request in frame 793 - earlier than
> > previous one.
> > So there were two READ requests, then a 2 second delay then both
> > replies in reverse order.
> >
> > 1777:
> > This is a similar READ reply - to 1761.
> > There were READs in 1760, 1761, and 1775
> > 1760 is replied to almost immediately
> > 1761 gets a reply in 4 seconds (1777)
> > 1775 never gets a reply (in the available packet capture).
> >
> > Looking at other delays ... most READs get a reply in under a millisec.
> > There are about 20 where the reply is more than 1ms - the longest delay
> > not already mentioned is 90ms with reply 1857.
> > The pattern here is
> > READ req (1)
> > GETATTR req
> > GETATTR reply
> > READ req (2)
> > READ reply (1)
> > pause
> > READ reply (2)
> >
> > I suspect this is the same problem occurring, but it isn't so
> > noticeable.
> >
> > My first thought was that the reply might be getting stuck in the TCP
> > transmit queue on the server, but checking the TSval in the TCP
> > timestamp option shows that - for frame 983 which shows a 2second delay
> > - the TSval is also 2seconds later than the previous packet. So the
> > delay happens before the TCP-level decision to create the packet.
> >
> > So I cannot see any real evidence to suggest a TCP-level problem.
> > The time of 2 or 4 seconds - and maybe even 90ms - seem unlikely to be
> > caused by an NFSd problem.
> >
> > So my guess is that the delay comes from the filesystem. Maybe.
> > What filesystem are you exporting?
> >
> > How can we check this? Probably by turning on nfsd tracing.
> > There are a bunch of tracepoints that related to reading:
> >
> > trace_nfsd_read_start
> > trace_nfsd_read_done
> > trace_nfsd_read_io_done
> > trace_nfsd_read_err
> > trace_nfsd_read_splice
> > trace_nfsd_read_vector
> > trace_nfsd_read_start
> > trace_nfsd_read_done
> >
> > Maybe enabling them might be useful as you should be able to see if the
> > delay was within one read request, or between two read requests.
> > But I don't have much (if any) experience in enabling trace points. I
> > really should try that some day. Maybe you can find guidance on using
> > these tracepoint somewhere ... or maybe you already know how :-)
>
> Try something like:
>
> # trace-cmd record -e nfsd:nfsd_read_\*
>
> When the test is done, ^C the trace-cmd program. The trace records
> are stored in a file called trace.dat. You can view them with:
>
> # trace-cmd report | less
>
> The trace.dat file is portable. It carries the format specifiers
> for all records events, so the trace records can be viewed on
> other systems.
>
>
> --
> Chuck Lever
>
>
>
On Tue, 17 Aug 2021, Mike Javorski wrote:
> Neil:
>
> A possible insight, these captures were all made on the server, so the
> rpc reply wouldn't be delayed (at least if the server is the one doing
> the replying). I did the captures there because I wanted to get as
> close to possible issues as I could. If it makes sense, I can try to
> capture from the client side, let me know.
Sometimes it does make sense to compare both ends of the piece of wire.
However in this case the data collected at the server points to delays
on the server, so it is unlikely that there are delays or losses between
server and client.
Thanks,
NeilBrown
>
> Filesystem being exported is BTRFS, a 6 disk array. I jumped to nfs
> being the trouble because that is my visibility to the issue, but it's
> certainly possible for it to be the underlying filesystem. Maybe the
> nfsd traces will give insight to this.
>
> - mike
>
> On Sun, Aug 15, 2021 at 6:20 PM NeilBrown <[email protected]> wrote:
> >
> > On Sun, 15 Aug 2021, Mike Javorski wrote:
> > > I managed to get a cap with several discreet freezes in it, and I
> > > included a chunk with 5 of them in the span of ~3000 frames. I added
> > > packet comments at each frame that the tshark command reported as > 1
> > > sec RPC wait. Just search for "Freeze" in (wire|t)shark in packet
> > > details. This is with kernel 5.13.10 provided by Arch (See
> > > https://github.com/archlinux/linux/compare/a37da2be8e6c85...v5.13.10-arch1
> > > for diff vs mainline, nothing NFS/RPC related I can identify).
> > >
> > > I tried unsuccessfully to get any failures with the 5.12.15 kernel.
> > >
> > > https://drive.google.com/file/d/1T42iX9xCdF9Oe4f7JXsnWqD8oJPrpMqV/view?usp=sharing
> > >
> > > File should be downloadable anonymously.
> >
> > Got it, thanks.
> >
> > There are 3 RPC replies that are more than 1 second after the request.
> > The replies are in frames 983, 1005, 1777 These roughly correspond to
> > where you added the "Freeze" annotation (I didn't know you could do that!).
> >
> > 983:
> > This is the (Start of the) reply to a READ Request in frame 980.
> > It is a 128K read. The whole reply has arrived by frame 1004, 2ms
> > later.
> > The request (Frame 980) is followed 13ms later by a TCP retransmit
> > of the request and the (20usec later) a TCP ACK from the server.
> >
> > The fact that the client needed to retransmit seems a little odd
> > but as it is the only retransmit in the whole capture, I don't think
> > we can read much into it.
> >
> > 1005:
> > This is the reply to a 128K READ request in frame 793 - earlier than
> > previous one.
> > So there were two READ requests, then a 2 second delay then both
> > replies in reverse order.
> >
> > 1777:
> > This is a similar READ reply - to 1761.
> > There were READs in 1760, 1761, and 1775
> > 1760 is replied to almost immediately
> > 1761 gets a reply in 4 seconds (1777)
> > 1775 never gets a reply (in the available packet capture).
> >
> > Looking at other delays ... most READs get a reply in under a millisec.
> > There are about 20 where the reply is more than 1ms - the longest delay
> > not already mentioned is 90ms with reply 1857.
> > The pattern here is
> > READ req (1)
> > GETATTR req
> > GETATTR reply
> > READ req (2)
> > READ reply (1)
> > pause
> > READ reply (2)
> >
> > I suspect this is the same problem occurring, but it isn't so
> > noticeable.
> >
> > My first thought was that the reply might be getting stuck in the TCP
> > transmit queue on the server, but checking the TSval in the TCP
> > timestamp option shows that - for frame 983 which shows a 2second delay
> > - the TSval is also 2seconds later than the previous packet. So the
> > delay happens before the TCP-level decision to create the packet.
> >
> > So I cannot see any real evidence to suggest a TCP-level problem.
> > The time of 2 or 4 seconds - and maybe even 90ms - seem unlikely to be
> > caused by an NFSd problem.
> >
> > So my guess is that the delay comes from the filesystem. Maybe.
> > What filesystem are you exporting?
> >
> > How can we check this? Probably by turning on nfsd tracing.
> > There are a bunch of tracepoints that related to reading:
> >
> > trace_nfsd_read_start
> > trace_nfsd_read_done
> > trace_nfsd_read_io_done
> > trace_nfsd_read_err
> > trace_nfsd_read_splice
> > trace_nfsd_read_vector
> > trace_nfsd_read_start
> > trace_nfsd_read_done
> >
> > Maybe enabling them might be useful as you should be able to see if the
> > delay was within one read request, or between two read requests.
> > But I don't have much (if any) experience in enabling trace points. I
> > really should try that some day. Maybe you can find guidance on using
> > these tracepoint somewhere ... or maybe you already know how :-)
> >
> > NeilBrown
>
>
On Tue, 17 Aug 2021, Mike Javorski wrote:
> Thanks for the pointer Chuck
>
> I ran the trace capture and was able to trigger two freezes in
> relatively short order. Here is that trace file:
> https://drive.google.com/file/d/1qk_VIMkj8aTeQIg5O0W3AvWyeDSWL3vW/view?usp=sharing
>
There are gaps of 5.3sec, 4sec, 1sec and 1sec between adjacent trace
points.
The longest gap between 'start' and 'done' for any given xid is 354msec.
So it doesn't look like the filesystem read is causing the problem.
The long gaps between adjacent records are:
0.354581 xid=0x4c6ac2b6
0.418340 xid=0x6a71c2b6
1.013260 xid=0x6f71c2b6
1.020742 xid=0x0f71c2b6
4.064527 xid=0x6171c2b6
5.396559 xid=0xd66dc2b6
The fact that 1, 1, and 4 are so close to a whole number of seconds seems
unlikely to be a coincidence. It isn't clear what it might mean,
though.
I don't have any immediate ideas where to look next. I'll let you know
if I come up with something.
NeilBrown
On Tue, 17 Aug 2021, NeilBrown wrote:
> On Tue, 17 Aug 2021, Mike Javorski wrote:
> > Thanks for the pointer Chuck
> >
> > I ran the trace capture and was able to trigger two freezes in
> > relatively short order. Here is that trace file:
> > https://drive.google.com/file/d/1qk_VIMkj8aTeQIg5O0W3AvWyeDSWL3vW/view?usp=sharing
> >
>
> There are gaps of 5.3sec, 4sec, 1sec and 1sec between adjacent trace
> points.
> The longest gap between 'start' and 'done' for any given xid is 354msec.
> So it doesn't look like the filesystem read causing the problem.
>
> The long gaps between adjacent records are:
> 0.354581 xid=0x4c6ac2b6
> 0.418340 xid=0x6a71c2b6
> 1.013260 xid=0x6f71c2b6
> 1.020742 xid=0x0f71c2b6
> 4.064527 xid=0x6171c2b6
> 5.396559 xid=0xd66dc2b6
>
> The fact 1, 1, and 4 are so close to a precise number of seconds seems
> unlikely to be a co-incidence. It isn't clear what it might mean
> though.
>
> I don't have any immediae ideas where to look next. I'll let you know
> if I come up with something.
I had a separate bug report
https://bugzilla.suse.com/show_bug.cgi?id=1189508
with broadly similar symptoms which strongly points to a network-layer
problem. So I'm digging back into that tcpdump.
The TCP ACK for READ requests usually goes out in 100 or 200
microseconds. Sometimes longer - by a factor of about 100.
% tshark -r /tmp/javmorin-nfsfreeze-5.13.10-cap1\ .test.pcap -d tcp.port==2049,rpc -Y 'tcp.port==2049 && rpc.msgtyp==0 && nfs.opcode==25' -T fields -e tcp.seq -e frame.time_epoch -e frame.number | sed 's/$/ READ/' > /tmp/read-times
% tshark -r /tmp/javmorin-nfsfreeze-5.13.10-cap1\ .test.pcap -d tcp.port==2049,rpc -Y 'tcp.srcport==2049' -T fields -e tcp.ack -e frame.time_epoch -e frame.number | sed 's/$/ ACK/' > /tmp/ack-times
% sort -n /tmp/read-times /tmp/ack-times | awk '$4 == "ACK" && readtime {printf "%f %d\n", ($2 - readtime), $3; readtime=0} $4 == "READ" {readtime=$2}' | sort -n | tail -4
0.000360 441
0.012777 982
0.013318 1086
0.042356 1776
Three times (out of 245) the TCP ACK was 12 milliseconds or longer. The
first and last were times when the reply was delayed by a couple of
seconds at least. The other (1086) was the reply to 1085 - no obvious
delay to the READ reply.
These unusual delays (short though they are) suggest something odd in
the network layer - particularly as they are sometimes followed by a
much larger delay in a READ reply.
It might be as simple as a memory allocation delay. It might be
something deep in the networking layer.
If you still have (or could generate) some larger tcp dumps, you could
try that code and see if 5.12 shows any ACK delays, and if 5.13 shows
more ACK delays than actually turn into READ delays. If 5.13 easily shows a few ACK
delays that 5.12 doesn't, then they might make a git-bisect more
reliable. Having thousands of READs in the trace rather than just 245
should produce more reliable data.
As I suggested in the bug report I linked above, I am suspicious of TCP
offload when I see symptoms like this. You can use "ethtool" to turn
off the various offload features. Doing that and trying to reproduce
without offload might also provide useful information.
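For example (eth0 standing in for the server's actual interface):

    ethtool -K eth0 tso off gso off gro off tx off

turns off segmentation and checksum offload, and "ethtool -k eth0" shows
what is currently enabled.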
NeilBrown
Neil:
I did still have the original/source cap for that 5.13 file. I ran the
command you gave and got the last 10 entries:
0.013086 31352
0.013318 28399
0.013339 3521
0.013783 32423
0.014519 32481
0.014773 6145
0.015063 19034
0.015829 25892
0.042356 29089
0.042581 5779
I did have another test cap which was from an early 5.13 dev compile
prior to the nfs patches (what the Archlinux devs asked me to try). In
that file (which is 1.8G vs the 400M of the above file), the last 10
entries are:
0.042038 575
0.042190 87093
0.042313 30183
0.042377 34941
0.042521 53593
0.042801 113067
0.042930 1174
0.043285 87326
0.043851 61066
0.107649 115114
As that seems even worse than the above, I think I need to get a clean
5.12 capture to see if it actually was "better" beforehand.
Unfortunately I am stuck on a work issue tonight, but I will test it
either tomorrow or at the latest, this weekend.
Thanks for following up on this.
- mike
On Thu, Aug 19, 2021 at 5:31 PM NeilBrown <[email protected]> wrote:
>
> On Tue, 17 Aug 2021, NeilBrown wrote:
> > On Tue, 17 Aug 2021, Mike Javorski wrote:
> > > Thanks for the pointer Chuck
> > >
> > > I ran the trace capture and was able to trigger two freezes in
> > > relatively short order. Here is that trace file:
> > > https://drive.google.com/file/d/1qk_VIMkj8aTeQIg5O0W3AvWyeDSWL3vW/view?usp=sharing
> > >
> >
> > There are gaps of 5.3sec, 4sec, 1sec and 1sec between adjacent trace
> > points.
> > The longest gap between 'start' and 'done' for any given xid is 354msec.
> > So it doesn't look like the filesystem read causing the problem.
> >
> > The long gaps between adjacent records are:
> > 0.354581 xid=0x4c6ac2b6
> > 0.418340 xid=0x6a71c2b6
> > 1.013260 xid=0x6f71c2b6
> > 1.020742 xid=0x0f71c2b6
> > 4.064527 xid=0x6171c2b6
> > 5.396559 xid=0xd66dc2b6
> >
> > The fact 1, 1, and 4 are so close to a precise number of seconds seems
> > unlikely to be a co-incidence. It isn't clear what it might mean
> > though.
> >
> > I don't have any immediae ideas where to look next. I'll let you know
> > if I come up with something.
>
> I had separate bug report
> https://bugzilla.suse.com/show_bug.cgi?id=1189508
> with broadly similar symptoms which strongly points to network-layer
> problem. So I'm digging back into that tcpdump.
>
> The TCP ACK for READ requests usually goes out in 100 or 200
> microseconds. Sometimes longer - by a factor of about 100.
>
> % tshark -r /tmp/javmorin-nfsfreeze-5.13.10-cap1\ .test.pcap -d tcp.port==2049,rpc -Y 'tcp.port==2049 && rpc.msgtyp==0 && nfs.opcode==25' -T fields -e tcp.seq -e frame.time_epoch -e frame.number | sed 's/$/ READ/' > /tmp/read-times
> % tshark -r /tmp/javmorin-nfsfreeze-5.13.10-cap1\ .test.pcap -d tcp.port==2049,rpc -Y 'tcp.srcport==2049' -T fields -e tcp.ack -e frame.time_epoch -e frame.number | sed 's/$/ ACK/' > /tmp/ack-times
> % sort -n /tmp/read-times /tmp/ack-times | awk '$4 == "ACK" && readtime {printf "%f %d\n", ($2 - readtime), $3; readtime=0} $4 == "READ" {readtime=$2}' | sort -n | tail -4
> 0.000360 441
> 0.012777 982
> 0.013318 1086
> 0.042356 1776
>
> Three times (out of 245) the TCP ACK was 12 milliseconds or longer. The
> first and last were times when the reply was delayed by a couple of
> seconds at least. The other (1086) was the reply to 1085 - no obvious
> delay to the READ reply.
>
> These unusual delays (short though they are) suggest something odd in
> the network layer - particularly as they are sometimes followed by a
> much larger delay in a READ reply.
>
> It might be as simple as a memory allocation delay. It might be
> something deep in the networking layer.
>
> If you still have (or could generate) some larger tcp dumps, you could
> try that code and see if 5.12 shows any ACK delays, and if 5.13 shows
> more delays than turn into READ delays. IF 5.13 easily shows a few ACK
> delays thats 5.12 doesn't, then they might make a git-bisect more
> reliable. Having thousands of READs in the trace rather than just 245
> should produce more reliable data.
>
>
> As I suggested in the bug report I linked above, I am suspicious of TCP
> offload when I see symptoms like this. You can use "ethtool" to turn
> off that various offload features. Doing that and trying to reproduce
> without offload might also provide useful information.
>
> NeilBrown
OK, so new/fresh captures, reading the same set of files via NFS in
roughly the same timing/sequence (client unchanged between runs)
5.12.15-arch1:
===============
0.042320 124082
0.042594 45837
0.043176 19598
0.044092 63667
0.044613 28192
0.045045 131268
0.045982 116572
0.058507 162444
0.369620 153520
0.825167 164682
5.13.12-arch1: (no freezes)
===============
0.040766 12679
0.041565 64532
0.041799 55440
0.042091 159640
0.042105 75075
0.042134 177776
0.042706 40
0.043334 35322
0.045480 183618
0.204246 83997
Since I didn't get any freezes, I waited a bit, tried again and got a
capture with several freezes...
5.13.12-arch1: (with freezes)
===============
0.042143 55425
0.042252 64787
0.042411 57362
0.042441 34461
0.042503 67041
0.042553 64812
0.042592 55179
0.042715 67002
0.042835 66977
0.043308 64849
Not sure what to make of this, but to my (admittedly untrained) eye,
the two 5.13.12 sets are very similar to each other as well as to the
5.12.15 sample, so I am not sure this gives any indication of what
is causing the freezes.
- mike
On Thu, Aug 19, 2021 at 5:52 PM Mike Javorski <[email protected]> wrote:
>
> Neil:
>
> I did still have the original/source cap for that 5.13 file. I ran the
> command you gave and got the 10 last entries:
> 0.013086 31352
> 0.013318 28399
> 0.013339 3521
> 0.013783 32423
> 0.014519 32481
> 0.014773 6145
> 0.015063 19034
> 0.015829 25892
> 0.042356 29089
> 0.042581 5779
>
> I did have another test cap which was from an early 5.13 dev compile
> prior to the nfs patches (what the Archlinux devs asked me to try). In
> that file (which is 1.8G vs the 400M of the above file), the 10 last
> entries are:
> 0.042038 575
> 0.042190 87093
> 0.042313 30183
> 0.042377 34941
> 0.042521 53593
> 0.042801 113067
> 0.042930 1174
> 0.043285 87326
> 0.043851 61066
> 0.107649 115114
>
> As that seems even worse than the above, I think I need to get a clean
> 5.12 capture to see if it actually was "better" beforehand.
> Unfortunately I am stuck on a work issue tonight, but I will test it
> either tomorrow or at the latest, this weekend.
>
> Thanks for following up on this.
>
> - mike
>
> On Thu, Aug 19, 2021 at 5:31 PM NeilBrown <[email protected]> wrote:
> >
> > On Tue, 17 Aug 2021, NeilBrown wrote:
> > > On Tue, 17 Aug 2021, Mike Javorski wrote:
> > > > Thanks for the pointer Chuck
> > > >
> > > > I ran the trace capture and was able to trigger two freezes in
> > > > relatively short order. Here is that trace file:
> > > > https://drive.google.com/file/d/1qk_VIMkj8aTeQIg5O0W3AvWyeDSWL3vW/view?usp=sharing
> > > >
> > >
> > > There are gaps of 5.3sec, 4sec, 1sec and 1sec between adjacent trace
> > > points.
> > > The longest gap between 'start' and 'done' for any given xid is 354msec.
> > > So it doesn't look like the filesystem read causing the problem.
> > >
> > > The long gaps between adjacent records are:
> > > 0.354581 xid=0x4c6ac2b6
> > > 0.418340 xid=0x6a71c2b6
> > > 1.013260 xid=0x6f71c2b6
> > > 1.020742 xid=0x0f71c2b6
> > > 4.064527 xid=0x6171c2b6
> > > 5.396559 xid=0xd66dc2b6
> > >
> > > The fact 1, 1, and 4 are so close to a precise number of seconds seems
> > > unlikely to be a co-incidence. It isn't clear what it might mean
> > > though.
> > >
> > > I don't have any immediae ideas where to look next. I'll let you know
> > > if I come up with something.
> >
> > I had separate bug report
> > https://bugzilla.suse.com/show_bug.cgi?id=1189508
> > with broadly similar symptoms which strongly points to network-layer
> > problem. So I'm digging back into that tcpdump.
> >
> > The TCP ACK for READ requests usually goes out in 100 or 200
> > microseconds. Sometimes longer - by a factor of about 100.
> >
> > % tshark -r /tmp/javmorin-nfsfreeze-5.13.10-cap1\ .test.pcap -d tcp.port==2049,rpc -Y 'tcp.port==2049 && rpc.msgtyp==0 && nfs.opcode==25' -T fields -e tcp.seq -e frame.time_epoch -e frame.number | sed 's/$/ READ/' > /tmp/read-times
> > % tshark -r /tmp/javmorin-nfsfreeze-5.13.10-cap1\ .test.pcap -d tcp.port==2049,rpc -Y 'tcp.srcport==2049' -T fields -e tcp.ack -e frame.time_epoch -e frame.number | sed 's/$/ ACK/' > /tmp/ack-times
> > % sort -n /tmp/read-times /tmp/ack-times | awk '$4 == "ACK" && readtime {printf "%f %d\n", ($2 - readtime), $3; readtime=0} $4 == "READ" {readtime=$2}' | sort -n | tail -4
> > 0.000360 441
> > 0.012777 982
> > 0.013318 1086
> > 0.042356 1776
> >
> > Three times (out of 245) the TCP ACK was 12 milliseconds or longer. The
> > first and last were times when the reply was delayed by a couple of
> > seconds at least. The other (1086) was the reply to 1085 - no obvious
> > delay to the READ reply.
> >
> > These unusual delays (short though they are) suggest something odd in
> > the network layer - particularly as they are sometimes followed by a
> > much larger delay in a READ reply.
> >
> > It might be as simple as a memory allocation delay. It might be
> > something deep in the networking layer.
> >
> > If you still have (or could generate) some larger tcp dumps, you could
> > try that code and see if 5.12 shows any ACK delays, and if 5.13 shows
> > more delays than turn into READ delays. IF 5.13 easily shows a few ACK
> > delays thats 5.12 doesn't, then they might make a git-bisect more
> > reliable. Having thousands of READs in the trace rather than just 245
> > should produce more reliable data.
> >
> >
> > As I suggested in the bug report I linked above, I am suspicious of TCP
> > offload when I see symptoms like this. You can use "ethtool" to turn
> > off that various offload features. Doing that and trying to reproduce
> > without offload might also provide useful information.
> >
> > NeilBrown
On Sun, 22 Aug 2021, Mike Javorski wrote:
> OK, so new/fresh captures, reading the same set of files via NFS in
> roughly the same timing/sequence (client unchanged between runs)
>
> 5.12.15-arch1:
> ===============
> 0.042320 124082
> 0.042594 45837
> 0.043176 19598
> 0.044092 63667
> 0.044613 28192
> 0.045045 131268
> 0.045982 116572
> 0.058507 162444
> 0.369620 153520
> 0.825167 164682
>
> 5.13.12-arch1: (no freezes)
> ===============
> 0.040766 12679
> 0.041565 64532
> 0.041799 55440
> 0.042091 159640
> 0.042105 75075
> 0.042134 177776
> 0.042706 40
> 0.043334 35322
> 0.045480 183618
> 0.204246 83997
>
> Since I didn't get any freezes, I waited a bit, tried again and got a
> capture with several freezes...
>
> 5.13.12-arch1: (with freezes)
> ===============
> 0.042143 55425
> 0.042252 64787
> 0.042411 57362
> 0.042441 34461
> 0.042503 67041
> 0.042553 64812
> 0.042592 55179
> 0.042715 67002
> 0.042835 66977
> 0.043308 64849
>
> Not sure what to make of this, but to my (admittedly untrainted) eye,
> the two 5.13.12 sets are very similar to each other as well as to the
> 5.12.15 sample, I am not sure if this is giving any indication to what
> is causing the freezes.
In the trace that I have, most times (242 of 245) were 0.000360 or less.
Only 3 were greater.
As your traces are much bigger you naturally have more that are greater -
all of the last 10 and probably more.
I was hoping that 5.12 wouldn't show any large delays, but clearly it
does. However it is still possible that there are substantially fewer.
Rather than using tail, please pipe the list of times into
awk '$1 < 0.001 { fast +=1 } $1 >= 0.001 {slow += 1} END { print fast, slow, slow / (fast + slow) }'
and report the results. If the final number (the fraction) is reliably
significantly smaller for 5.12 than for 5.13 (whether it freezes or
not), then this metric may still be useful.
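For example, reusing the read-times/ack-times files from the earlier
commands (a sketch, same temporary file names assumed):

    sort -n /tmp/read-times /tmp/ack-times |
      awk '$4 == "ACK" && readtime {printf "%f %d\n", ($2 - readtime), $3; readtime=0}
           $4 == "READ" {readtime=$2}' |
      awk '$1 < 0.001 { fast += 1 } $1 >= 0.001 { slow += 1 }
           END { print fast, slow, slow / (fast + slow) }'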
By the way, disabling the various offload options didn't appear to make
a difference for the other person who has reported this problem.
Thanks,
NeilBrown
5.12.15-arch1:
===============
14723 137 0.00921938
5.13.12-arch1: (no freezes)
===============
15333 206 0.013257
5.13.12-arch1: (with freezes)
===============
9230 299 0.0313779
So a decent bump w/ 5.13, and even more with the freezes.
This machine is an older Opteron server w/ nforce networking. Not sure
how much offloading it actually does.
- mike
On Sat, Aug 21, 2021 at 8:42 PM NeilBrown <[email protected]> wrote:
>
> On Sun, 22 Aug 2021, Mike Javorski wrote:
> > OK, so new/fresh captures, reading the same set of files via NFS in
> > roughly the same timing/sequence (client unchanged between runs)
> >
> > 5.12.15-arch1:
> > ===============
> > 0.042320 124082
> > 0.042594 45837
> > 0.043176 19598
> > 0.044092 63667
> > 0.044613 28192
> > 0.045045 131268
> > 0.045982 116572
> > 0.058507 162444
> > 0.369620 153520
> > 0.825167 164682
> >
> > 5.13.12-arch1: (no freezes)
> > ===============
> > 0.040766 12679
> > 0.041565 64532
> > 0.041799 55440
> > 0.042091 159640
> > 0.042105 75075
> > 0.042134 177776
> > 0.042706 40
> > 0.043334 35322
> > 0.045480 183618
> > 0.204246 83997
> >
> > Since I didn't get any freezes, I waited a bit, tried again and got a
> > capture with several freezes...
> >
> > 5.13.12-arch1: (with freezes)
> > ===============
> > 0.042143 55425
> > 0.042252 64787
> > 0.042411 57362
> > 0.042441 34461
> > 0.042503 67041
> > 0.042553 64812
> > 0.042592 55179
> > 0.042715 67002
> > 0.042835 66977
> > 0.043308 64849
> >
> > Not sure what to make of this, but to my (admittedly untrainted) eye,
> > the two 5.13.12 sets are very similar to each other as well as to the
> > 5.12.15 sample, I am not sure if this is giving any indication to what
> > is causing the freezes.
>
> In the trace that I have, most times (242 of 245) were 0.000360 or less.
> Only 3 were greater.
> As your traces are much bigger you naturally have more that are great -
> all of the last 10 and probably more.
>
> I has hoping that 5.12 wouldn't show any large delays, but clearly it
> does. However it is still possible that there are substantially fewer.
>
> Rather than using tail, please pipe the list of times into
>
> awk '$1 < 0.001 { fast +=1 } $1 >= 0.001 {slow += 1} END { print fast, slow, slow / (fast + slow) }'
>
> and report the results. If the final number (the fraction) is reliably
> significantly smaller for 5.12 than for 5.13 (whether it freezes or
> not), then this metric may still be useful.
>
> By the way, disabling the various offload options didn't appear to make
> a different for the other person who has reported this problem.
>
> Thanks,
> NeilBrown
On Sun, 22 Aug 2021, Mike Javorski wrote:
> 5.12.15-arch1:
> ===============
> 14723 137 0.00921938
>
> 5.13.12-arch1: (no freezes)
> ===============
> 15333 206 0.013257
>
> 5.13.12-arch1: (with freezes)
> ===============
> 9230 299 0.0313779
>
> So a decent bump w/ 5.13, and even more with the freezes.
Thanks. 9->13 isn't as much as I was hoping for, but it might still be
significant.
> This machine is an older Opteron server w/ nforce networking. Not sure
> how much offloading it actually does.
The tcpdump showed the server sending very large packets, which means
that it is at least doing TCP segmentation offload and checksum offload.
Those are very common these days.
Thanks,
NeilBrown
I finally had the chance to test with the offloading disabled on the
server. Unfortunately, the issues remain, and may even be more frequent,
but I can't rule out confirmation bias on that one.
Not sure where to go from here.
- mike
On Sun, Aug 22, 2021 at 3:00 PM NeilBrown <[email protected]> wrote:
>
> On Sun, 22 Aug 2021, Mike Javorski wrote:
> > 5.12.15-arch1:
> > ===============
> > 14723 137 0.00921938
> >
> > 5.13.12-arch1: (no freezes)
> > ===============
> > 15333 206 0.013257
> >
> > 5.13.12-arch1: (with freezes)
> > ===============
> > 9230 299 0.0313779
> >
> > So a decent bump w/ 5.13, and even more with the freezes.
>
> Thanks. 9->13 isn't as much as I was hoping for, but it might still be
> significant.
>
> > This machine is an older Opteron server w/ nforce networking. Not sure
> > how much offloading it actually does.
>
> The tcpdump showed the server sending very large packets, which means
> that it is at least doing TCP segmentation offload and checksum offload.
> Those are very common these days.
>
> Thanks,
> NeilBrown
On Fri, 27 Aug 2021, Mike Javorski wrote:
> I finally had the chance to test with the offloading disabled on the
> server. Unfortunately, the issues remain, may even be more frequent,
> but I can't rule out confirmation bias on that one.
>
> Not sure where to go from here.
Have you tried NFSv3? Or 4.0?
I am (or was) leaning away from thinking it was an nfsd problem, but we
cannot rule anything out until we know what is happening.
I plan to have another dig through the code changes today.
NeilBrown
I just tried the same mount with 4 different nfsvers values: 3, 4.0, 4.1 and 4.2
At first I thought it might be "working" because I only got freezes
with 4.2, but I went back and re-tested (to be sure) and got
freezes with all 4 versions. So the nfsvers setting doesn't seem to
have an impact. I did verify at each pass that the 'nfsvers=' value
was present and correct in the mount output.
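(For reference, forcing a particular version on a test mount looks roughly
like this, with the server name and path being placeholders:

    mount -t nfs -o nfsvers=3,soft,nodev,nosuid nas:/export/test /mnt/test

and nfsvers=4.0, 4.1 or 4.2 for the other passes.)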
FYI: another user posted on the archlinux reddit with a similar issue;
I suggested they try with a 5.12 kernel and that "solved" the issue
for them as well.
- mike
On Thu, Aug 26, 2021 at 2:44 PM NeilBrown <[email protected]> wrote:
>
> On Fri, 27 Aug 2021, Mike Javorski wrote:
> > I finally had the chance to test with the offloading disabled on the
> > server. Unfortunately, the issues remain, may even be more frequent,
> > but I can't rule out confirmation bias on that one.
> >
> > Not sure where to go from here.
>
> Have you tried NFSv3? Or 4.0?
> I am (or was) leaning away from thinking it was an nfsd problem, but we
> cannot rule anything out until we know that is happening.
>
> I plan to have another dig through the code changes today.
>
> NeilBrown
[[Mel: if you read through to the end you'll see why I cc:ed you on this]]
On Fri, 27 Aug 2021, Mike Javorski wrote:
> I just tried the same mount with 4 different nfsvers values: 3, 4.0, 4.1 and 4.2
>
> At first I thought it might be "working" because I only got freezes
> with 4.2 at first, but I went back and re-tested (to be sure) and got
> freezes with all 4 versions. So the nfsvers setting doesn't seem to
> have an impact. I did verify at each pass that the 'nfsvers=' value
> was present and correct in the mount output.
>
> FYI: another user posted on the archlinux reddit with a similar issue,
> I suggested they try with a 5.12 kernel and that "solved" the issue
> for them as well.
well... I have good news and I have bad news.
First the good.
I reviewed all the symptoms again, and browsed the commits between
working and not-working, and the only pattern that made any sense was
that there was some issue with memory allocation. The pauses - I
reasoned - were most likely pauses while allocating memory.
So instead of testing in a VM with 2G of RAM, I tried 512MB, and
suddenly the problem was trivial to reproduce. Specifically I created a
(sparse) 1GB file on the test VM, exported it over NFS, and ran "md5sum"
on the file from an NFS client. With 5.12 this reliably takes about 90 seconds
(as it does with 2G RAM). On 5.13 and 512MB RAM, it usually takes a lot
longer: 5, 6, 7, or 8 minutes (and assorted seconds).
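(Roughly, with the paths being illustrative:

    # on the small-memory server VM
    truncate -s 1G /srv/nfs/bigfile
    exportfs -o rw,sync,no_subtree_check '*:/srv/nfs'

    # on the client
    mount -t nfs server:/srv/nfs /mnt
    time md5sum /mnt/bigfile
)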
The most questionable nfsd/memory-related patch in 5.13 is
Commit f6e70aab9dfe ("SUNRPC: refresh rq_pages using a bulk page allocator")
I reverted that and now the problem is no longer there. Gone. 90 seconds
every time.
Now the bad news: I don't know why. That patch should be a good patch,
with a small performance improvement, particularly at very high loads.
(maybe even a big improvement at very high loads).
The problem must be in alloc_pages_bulk_array(), which is a new
interface, so not possible to bisect.
So I might have a look at the code next week, but I've cc:ed Mel Gorman
in case he comes up with some ideas sooner.
For now, you can just revert that patch.
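(e.g. on top of a 5.13 stable tree, something like

    git revert f6e70aab9dfe

and rebuild.)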
Thanks for all the testing you did!! It certainly helped.
NeilBrown
Neil:
I am actually compiling a 5.13.13 kernel with the patch that Chuck
suggested earlier right now. I am doing the full compile matching the
distro compile as I don't have a targeted kernel config ready to go
(it's been years), and I want to test like for like anyway. It should
be ready to install in the AM, my time, so I will test with that first
tomorrow and see if it resolves the issue; if not, I will report back
and then try your revert suggestion. On the issue of memory though, my
server has 16GB of memory (and free currently shows ~1GB unused, and
~11GB in buffers/caches), so this really shouldn't be an available
memory issue, but I guess we'll find out.
Thanks for the info.
- mike
On Thu, Aug 26, 2021 at 10:27 PM NeilBrown <[email protected]> wrote:
>
>
> [[Mel: if you read through to the end you'll see why I cc:ed you on this]]
>
> On Fri, 27 Aug 2021, Mike Javorski wrote:
> > I just tried the same mount with 4 different nfsvers values: 3, 4.0, 4.1 and 4.2
> >
> > At first I thought it might be "working" because I only got freezes
> > with 4.2 at first, but I went back and re-tested (to be sure) and got
> > freezes with all 4 versions. So the nfsvers setting doesn't seem to
> > have an impact. I did verify at each pass that the 'nfsvers=' value
> > was present and correct in the mount output.
> >
> > FYI: another user posted on the archlinux reddit with a similar issue,
> > I suggested they try with a 5.12 kernel and that "solved" the issue
> > for them as well.
>
> well... I have good news and I have bad news.
>
> First the good.
> I reviewed all the symptoms again, and browsed the commits between
> working and not-working, and the only pattern that made any sense was
> that there was some issue with memory allocation. The pauses - I
> reasoned - were most likely pauses while allocating memory.
>
> So instead of testing in a VM with 2G of RAM, I tried 512MB, and
> suddenly the problem was trivial to reproduce. Specifically I created a
> (sparse) 1GB file on the test VM, exported it over NFS, and ran "md5sum"
> on the file from an NFS client. With 5.12 this reliably takes about 90 seconds
> (as it does with 2G RAM). On 5.13 and 512MB RAM, it usually takes a lot
> longer. 5, 6, 7, 8 minutes (and assorted seconds).
>
> The most questionable nfsd/ memory related patch in 5.13 is
>
> Commit f6e70aab9dfe ("SUNRPC: refresh rq_pages using a bulk page allocator")
>
> I reverted that and now the problem is no longer there. Gone. 90seconds
> every time.
>
> Now the bad news: I don't know why. That patch should be a good patch,
> with a small performance improvement, particularly at very high loads.
> (maybe even a big improvement at very high loads).
> The problem must be in alloc_pages_bulk_array(), which is a new
> interface, so not possible to bisect.
>
> So I might have a look at the code next week, but I've cc:ed Mel Gorman
> in case he comes up with some ideas sooner.
>
> For now, you can just revert that patch.
>
> Thanks for all the testing you did!! It certainly helped.
>
> NeilBrown
>
On Fri, 27 Aug 2021, Mike Javorski wrote:
> Neil:
>
> I am actually compiling a 5.13.13 kernel with the patch that Chuck
> suggested earlier right now. I am doing the full compile matching the
> distro compile as I don't have a targeted kernel config ready to go
> (it's been years), and I want to test like for like anyway. It should
> be ready to install in the AM, my time, so I will test with that first
> tomorrow and see if it resolves the issue, if not, I will report back
> and then try your revert suggestion. On the issue of memory though, my
> server has 16GB of memory (and free currently shows ~1GB unused, and
> ~11GB in buffers/caches), so this really shouldn't be an available
> memory issue, but I guess we'll find out.
>
> Thanks for the info.
Take your time.
Just FYI, the fix Chuck identified doesn't match your symptoms.
That bug can only occur if
/sys/module/sunrpc/parameters/svc_rpc_per_connection_limit
is non-zero. When it does occur, the TCP connection completely
freezes - no further traffic. It won't even close.
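(Quick check on the server:

    cat /sys/module/sunrpc/parameters/svc_rpc_per_connection_limit

A value of 0 means the limit is disabled, so that code path never triggers.)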
I took a break and got some fresh air and now I understand the problem.
Please try the patch below, not the revert I suggested.
The pause can, I think, be caused by fragmented memory - not just low
memory. If only 1/16 of your memory is free, it could easily be
fragmented.
Thanks,
NeilBrown
Subject: [PATCH] SUNRPC: don't pause on incomplete allocation
alloc_pages_bulk_array() attempts to allocate at least one page based on
the provided pages, and then opportunistically allocates more if that
can be done without dropping the spinlock.
So if it returns fewer than requested, that could just mean that it
needed to drop the lock. In that case, try again immediately.
Only pause for a time if no progress could be made.
Signed-off-by: NeilBrown <[email protected]>
---
net/sunrpc/svc_xprt.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index d66a8e44a1ae..99268dd95519 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -662,7 +662,7 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
 {
 	struct svc_serv *serv = rqstp->rq_server;
 	struct xdr_buf *arg = &rqstp->rq_arg;
-	unsigned long pages, filled;
+	unsigned long pages, filled, prev;
 
 	pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
 	if (pages > RPCSVC_MAXPAGES) {
@@ -672,11 +672,14 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
 		pages = RPCSVC_MAXPAGES;
 	}
 
-	for (;;) {
+	for (prev = 0;; prev = filled) {
 		filled = alloc_pages_bulk_array(GFP_KERNEL, pages,
 						rqstp->rq_pages);
 		if (filled == pages)
 			break;
+		if (filled > prev)
+			/* Made progress, don't sleep yet */
+			continue;
 
 		set_current_state(TASK_INTERRUPTIBLE);
 		if (signalled() || kthread_should_stop()) {
> On Aug 27, 2021, at 3:14 AM, NeilBrown <[email protected]> wrote:
>
> Subject: [PATCH] SUNRPC: don't pause on incomplete allocation
>
> alloc_pages_bulk_array() attempts to allocate at least one page based on
> the provided pages, and then opportunistically allocates more if that
> can be done without dropping the spinlock.
>
> So if it returns fewer than requested, that could just mean that it
> needed to drop the lock. In that case, try again immediately.
>
> Only pause for a time if no progress could be made.
The case I was worried about was "no pages available on the
pcplist", in which case, alloc_pages_bulk_array() resorts
to calling __alloc_pages() and returns only one new page.
"No progess" would mean even __alloc_pages() failed.
So this patch would behave essentially like the
pre-alloc_pages_bulk_array() code: call alloc_page() for
each empty struct_page in the array without pausing. That
seems correct to me.
I would add
Fixes: f6e70aab9dfe ("SUNRPC: refresh rq_pages using a bulk page allocator")
> Signed-off-by: NeilBrown <[email protected]>
> ---
> net/sunrpc/svc_xprt.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index d66a8e44a1ae..99268dd95519 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -662,7 +662,7 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
> {
> struct svc_serv *serv = rqstp->rq_server;
> struct xdr_buf *arg = &rqstp->rq_arg;
> - unsigned long pages, filled;
> + unsigned long pages, filled, prev;
>
> pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
> if (pages > RPCSVC_MAXPAGES) {
> @@ -672,11 +672,14 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
> pages = RPCSVC_MAXPAGES;
> }
>
> - for (;;) {
> + for (prev = 0;; prev = filled) {
> filled = alloc_pages_bulk_array(GFP_KERNEL, pages,
> rqstp->rq_pages);
> if (filled == pages)
> break;
> + if (filled > prev)
> + /* Made progress, don't sleep yet */
> + continue;
>
> set_current_state(TASK_INTERRUPTIBLE);
> if (signalled() || kthread_should_stop()) {
--
Chuck Lever
Chuck:
I just booted a 5.13.13 kernel with your suggested patch. No freezes
on the first test, but that sometimes happens so I will let the server
settle some and try it again later in the day (which also would align
with Neil's comment on memory fragmentation being a contributor).
Neil:
I have started a compile with the above kernel + your patch to test
next unless you or Chuck determine that it isn't needed, or that I
should test both patches separately. As the above is already merged to
5.14, it seemed logical to just add your patch on top.
I will also try to set up a VM to test your md5sum scenario with the
various kernels, since it's a much faster thing to test.
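Roughly, the timing harness I have in mind just reads every file in
full and reports the elapsed wall-clock time; a sketch is below (not
the exact script I'll run; timing md5sum over the NFS mount does the
same job):

/*
 * Sketch: read every file named on the command line in full and print
 * the elapsed wall-clock time. Point it at files on the NFS mount.
 */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        static char buf[1 << 16];
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 1; i < argc; i++) {
                int fd = open(argv[i], O_RDONLY);

                if (fd < 0) {
                        perror(argv[i]);
                        continue;
                }
                while (read(fd, buf, sizeof(buf)) > 0)
                        ;       /* discard data; only elapsed time matters */
                close(fd);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("%.3f seconds for %d file(s)\n",
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9,
               argc - 1);
        return 0;
}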
- mike
On Fri, Aug 27, 2021 at 7:13 AM Chuck Lever III <[email protected]> wrote:
>
>
> > On Aug 27, 2021, at 3:14 AM, NeilBrown <[email protected]> wrote:
> >
> > Subject: [PATCH] SUNRPC: don't pause on incomplete allocation
> >
> > alloc_pages_bulk_array() attempts to allocate at least one page based on
> > the provided pages, and then opportunistically allocates more if that
> > can be done without dropping the spinlock.
> >
> > So if it returns fewer than requested, that could just mean that it
> > needed to drop the lock. In that case, try again immediately.
> >
> > Only pause for a time if no progress could be made.
>
> The case I was worried about was "no pages available on the
> pcplist", in which case, alloc_pages_bulk_array() resorts
> to calling __alloc_pages() and returns only one new page.
>
> "No progess" would mean even __alloc_pages() failed.
>
> So this patch would behave essentially like the
> pre-alloc_pages_bulk_array() code: call alloc_page() for
> each empty struct_page in the array without pausing. That
> seems correct to me.
>
>
> I would add
>
> Fixes: f6e70aab9dfe ("SUNRPC: refresh rq_pages using a bulk page allocator")
>
>
> > Signed-off-by: NeilBrown <[email protected]>
> > ---
> > net/sunrpc/svc_xprt.c | 7 +++++--
> > 1 file changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> > index d66a8e44a1ae..99268dd95519 100644
> > --- a/net/sunrpc/svc_xprt.c
> > +++ b/net/sunrpc/svc_xprt.c
> > @@ -662,7 +662,7 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
> > {
> > struct svc_serv *serv = rqstp->rq_server;
> > struct xdr_buf *arg = &rqstp->rq_arg;
> > - unsigned long pages, filled;
> > + unsigned long pages, filled, prev;
> >
> > pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
> > if (pages > RPCSVC_MAXPAGES) {
> > @@ -672,11 +672,14 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
> > pages = RPCSVC_MAXPAGES;
> > }
> >
> > - for (;;) {
> > + for (prev = 0;; prev = filled) {
> > filled = alloc_pages_bulk_array(GFP_KERNEL, pages,
> > rqstp->rq_pages);
> > if (filled == pages)
> > break;
> > + if (filled > prev)
> > + /* Made progress, don't sleep yet */
> > + continue;
> >
> > set_current_state(TASK_INTERRUPTIBLE);
> > if (signalled() || kthread_should_stop()) {
>
> --
> Chuck Lever
>
>
>
OK, an update: after several hours of spaced-out testing sessions, the
first patch seems to have resolved the issue. There may be a very tiny
bit of lag that still occurs when opening/processing new files, but so
far on this kernel I have not had any multi-second freezes. I am still
waiting on the kernel with Neil's patch to compile (compiling on this
underpowered server so it's taking several hours), but I think the
testing there will just be to see if it still works, and then to try
testing in a memory-constrained VM to see if I can recreate Neil's
experiment. I will likely have to do this over the
weekend given the kernel compile delay + fiddling with a VM.
Chuck: I don't mean to overstep bounds, but is it possible to get that
patch pulled into 5.13 stable? That may help things for several people
while 5.14 goes through its shakedown in archlinux prior to release.
- mike
On Fri, Aug 27, 2021 at 10:07 AM Mike Javorski <[email protected]> wrote:
>
> Chuck:
> I just booted a 5.13.13 kernel with your suggested patch. No freezes
> on the first test, but that sometimes happens so I will let the server
> settle some and try it again later in the day (which also would align
> with Neil's comment on memory fragmentation being a contributor).
>
> Neil:
> I have started a compile with the above kernel + your patch to test
> next unless you or Chuck determine that it isn't needed, or that I
> should test both patches discreetly. As the above is already merged to
> 5.14 it seemed logical to just add your patch on top.
>
> I will also try to set up a vm to test your md5sum scenario with the
> various kernels since it's a much faster thing to test.
>
> - mike
>
> On Fri, Aug 27, 2021 at 7:13 AM Chuck Lever III <[email protected]> wrote:
> >
> >
> > > On Aug 27, 2021, at 3:14 AM, NeilBrown <[email protected]> wrote:
> > >
> > > Subject: [PATCH] SUNRPC: don't pause on incomplete allocation
> > >
> > > alloc_pages_bulk_array() attempts to allocate at least one page based on
> > > the provided pages, and then opportunistically allocates more if that
> > > can be done without dropping the spinlock.
> > >
> > > So if it returns fewer than requested, that could just mean that it
> > > needed to drop the lock. In that case, try again immediately.
> > >
> > > Only pause for a time if no progress could be made.
> >
> > The case I was worried about was "no pages available on the
> > pcplist", in which case, alloc_pages_bulk_array() resorts
> > to calling __alloc_pages() and returns only one new page.
> >
> > "No progess" would mean even __alloc_pages() failed.
> >
> > So this patch would behave essentially like the
> > pre-alloc_pages_bulk_array() code: call alloc_page() for
> > each empty struct_page in the array without pausing. That
> > seems correct to me.
> >
> >
> > I would add
> >
> > Fixes: f6e70aab9dfe ("SUNRPC: refresh rq_pages using a bulk page allocator")
> >
> >
> > > Signed-off-by: NeilBrown <[email protected]>
> > > ---
> > > net/sunrpc/svc_xprt.c | 7 +++++--
> > > 1 file changed, 5 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> > > index d66a8e44a1ae..99268dd95519 100644
> > > --- a/net/sunrpc/svc_xprt.c
> > > +++ b/net/sunrpc/svc_xprt.c
> > > @@ -662,7 +662,7 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
> > > {
> > > struct svc_serv *serv = rqstp->rq_server;
> > > struct xdr_buf *arg = &rqstp->rq_arg;
> > > - unsigned long pages, filled;
> > > + unsigned long pages, filled, prev;
> > >
> > > pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
> > > if (pages > RPCSVC_MAXPAGES) {
> > > @@ -672,11 +672,14 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
> > > pages = RPCSVC_MAXPAGES;
> > > }
> > >
> > > - for (;;) {
> > > + for (prev = 0;; prev = filled) {
> > > filled = alloc_pages_bulk_array(GFP_KERNEL, pages,
> > > rqstp->rq_pages);
> > > if (filled == pages)
> > > break;
> > > + if (filled > prev)
> > > + /* Made progress, don't sleep yet */
> > > + continue;
> > >
> > > set_current_state(TASK_INTERRUPTIBLE);
> > > if (signalled() || kthread_should_stop()) {
> >
> > --
> > Chuck Lever
> >
> >
> >
> On Aug 27, 2021, at 6:00 PM, Mike Javorski <[email protected]> wrote:
>
> OK, an update. Several hours of spaced out testing sessions and the
> first patch seems to have resolved the issue. There may be a very tiny
> bit of lag that still occurs when opening/processing new files, but so
> far on this kernel I have not had any multi-second freezes. I am still
> waiting on the kernel with Neil's patch to compile (compiling on this
> underpowered server so it's taking several hours), but I think the
> testing there will just be to see if I can show it works still, and
> then to try and test in a memory constrained VM. To see if I can
> recreate Neil's experiment. Likely will have to do this over the
> weekend given the kernel compile delay + fiddling with a VM.
Thanks for your testing!
> Chuck: I don't mean to overstep bounds, but is it possible to get that
> patch pulled into 5.13 stable? That may help things for several people
> while 5.14 goes through it's shakedown in archlinux prior to release.
The patch had a Fixes: tag, so it should get automatically backported
to every kernel that has the broken commit. If you don't see it in
a subsequent 5.13 stable kernel, you are free to ask the stable
maintainers to consider it.
> - mike
>
> On Fri, Aug 27, 2021 at 10:07 AM Mike Javorski <[email protected]> wrote:
>>
>> Chuck:
>> I just booted a 5.13.13 kernel with your suggested patch. No freezes
>> on the first test, but that sometimes happens so I will let the server
>> settle some and try it again later in the day (which also would align
>> with Neil's comment on memory fragmentation being a contributor).
>>
>> Neil:
>> I have started a compile with the above kernel + your patch to test
>> next unless you or Chuck determine that it isn't needed, or that I
>> should test both patches discreetly. As the above is already merged to
>> 5.14 it seemed logical to just add your patch on top.
>>
>> I will also try to set up a vm to test your md5sum scenario with the
>> various kernels since it's a much faster thing to test.
>>
>> - mike
>>
>> On Fri, Aug 27, 2021 at 7:13 AM Chuck Lever III <[email protected]> wrote:
>>>
>>>
>>>> On Aug 27, 2021, at 3:14 AM, NeilBrown <[email protected]> wrote:
>>>>
>>>> Subject: [PATCH] SUNRPC: don't pause on incomplete allocation
>>>>
>>>> alloc_pages_bulk_array() attempts to allocate at least one page based on
>>>> the provided pages, and then opportunistically allocates more if that
>>>> can be done without dropping the spinlock.
>>>>
>>>> So if it returns fewer than requested, that could just mean that it
>>>> needed to drop the lock. In that case, try again immediately.
>>>>
>>>> Only pause for a time if no progress could be made.
>>>
>>> The case I was worried about was "no pages available on the
>>> pcplist", in which case, alloc_pages_bulk_array() resorts
>>> to calling __alloc_pages() and returns only one new page.
>>>
>>> "No progess" would mean even __alloc_pages() failed.
>>>
>>> So this patch would behave essentially like the
>>> pre-alloc_pages_bulk_array() code: call alloc_page() for
>>> each empty struct_page in the array without pausing. That
>>> seems correct to me.
>>>
>>>
>>> I would add
>>>
>>> Fixes: f6e70aab9dfe ("SUNRPC: refresh rq_pages using a bulk page allocator")
>>>
>>>
>>>> Signed-off-by: NeilBrown <[email protected]>
>>>> ---
>>>> net/sunrpc/svc_xprt.c | 7 +++++--
>>>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
>>>> index d66a8e44a1ae..99268dd95519 100644
>>>> --- a/net/sunrpc/svc_xprt.c
>>>> +++ b/net/sunrpc/svc_xprt.c
>>>> @@ -662,7 +662,7 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
>>>> {
>>>> struct svc_serv *serv = rqstp->rq_server;
>>>> struct xdr_buf *arg = &rqstp->rq_arg;
>>>> - unsigned long pages, filled;
>>>> + unsigned long pages, filled, prev;
>>>>
>>>> pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
>>>> if (pages > RPCSVC_MAXPAGES) {
>>>> @@ -672,11 +672,14 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
>>>> pages = RPCSVC_MAXPAGES;
>>>> }
>>>>
>>>> - for (;;) {
>>>> + for (prev = 0;; prev = filled) {
>>>> filled = alloc_pages_bulk_array(GFP_KERNEL, pages,
>>>> rqstp->rq_pages);
>>>> if (filled == pages)
>>>> break;
>>>> + if (filled > prev)
>>>> + /* Made progress, don't sleep yet */
>>>> + continue;
>>>>
>>>> set_current_state(TASK_INTERRUPTIBLE);
>>>> if (signalled() || kthread_should_stop()) {
>>>
>>> --
>>> Chuck Lever
>>>
>>>
>>>
--
Chuck Lever
I had some time this evening (and the kernel finally compiled), and
wanted to get this tested.
The TL;DR: Both patches are needed
Below are the test results from my replication of Neil's test. It is
readily apparent that both the 5.13.13 kernel AND the 5.13.13 kernel
with the 82011c80b3ec fix exhibit the randomness in read times that
were observed. The 5.13.13 kernel with both the 82011c80b3ec and
f6e70aab9dfe fixes brings the performance back in line with the
5.12.15 kernel which I tested as a baseline.
Please forgive the inconsistency in sample counts. This was running as
a while loop, and I just let it go long enough that the behavior was
consistent. Only change to the VM between tests was the different
kernel + a reboot. The testing PC had a consistent workload during the
entire set of tests.
Test 0: 5.13.10 (base kernel in VM image, just for kicks)
==================================================
Samples 30
Min 6.839
Max 19.998
Median 9.638
75-P 10.898
95-P 12.939
99-P 18.005
Test 1: 5.12.15 (known good)
==================================================
Samples 152
Min 1.997
Max 2.333
Median 2.171
75-P 2.230
95-P 2.286
99-P 2.312
Test 2: 5.13.13 (known bad)
==================================================
Samples 42
Min 3.587
Max 15.803
Median 6.039
75-P 6.452
95-P 10.293
99-P 15.540
Test 3: 5.13.13 + 82011c80b3ec fix
==================================================
Samples 44
Min 4.309
Max 37.040
Median 6.615
75-P 10.224
95-P 19.516
99-P 36.650
Test 4: 5.13.13 + 82011c80b3ec fix + f6e70aab9dfe fix
==================================================
Samples 131
Min 2.013
Max 2.397
Median 2.169
75-P 2.211
95-P 2.283
99-P 2.348
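For reference, summary numbers like the ones above can be reproduced
with a small nearest-rank percentile helper over the per-run times (a
sketch, not necessarily the exact script I used), reading one sample
per line on stdin:

/*
 * Sketch: read one elapsed-time sample per line on stdin and print
 * min/max/median plus nearest-rank 75th/95th/99th percentiles.
 */
#include <stdio.h>
#include <stdlib.h>

static int cmp(const void *a, const void *b)
{
        double x = *(const double *)a, y = *(const double *)b;

        return (x > y) - (x < y);
}

static double pct(const double *v, int n, double p)
{
        int idx = (int)(p * n + 0.5);   /* nearest-rank, clamped */

        if (idx < 1)
                idx = 1;
        if (idx > n)
                idx = n;
        return v[idx - 1];
}

int main(void)
{
        double v[4096];
        int n = 0;

        while (n < 4096 && scanf("%lf", &v[n]) == 1)
                n++;
        if (!n)
                return 1;
        qsort(v, n, sizeof(v[0]), cmp);
        printf("Samples %d\nMin %.3f\nMax %.3f\nMedian %.3f\n",
               n, v[0], v[n - 1], pct(v, n, 0.50));
        printf("75-P %.3f\n95-P %.3f\n99-P %.3f\n",
               pct(v, n, 0.75), pct(v, n, 0.95), pct(v, n, 0.99));
        return 0;
}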
I am going to run the kernel w/ both fixes over the weekend, but
things look good at this point.
- mike
On Fri, Aug 27, 2021 at 4:49 PM Chuck Lever III <[email protected]> wrote:
>
>
> > On Aug 27, 2021, at 6:00 PM, Mike Javorski <[email protected]> wrote:
> >
> > OK, an update. Several hours of spaced out testing sessions and the
> > first patch seems to have resolved the issue. There may be a very tiny
> > bit of lag that still occurs when opening/processing new files, but so
> > far on this kernel I have not had any multi-second freezes. I am still
> > waiting on the kernel with Neil's patch to compile (compiling on this
> > underpowered server so it's taking several hours), but I think the
> > testing there will just be to see if I can show it works still, and
> > then to try and test in a memory constrained VM. To see if I can
> > recreate Neil's experiment. Likely will have to do this over the
> > weekend given the kernel compile delay + fiddling with a VM.
>
> Thanks for your testing!
>
>
> > Chuck: I don't mean to overstep bounds, but is it possible to get that
> > patch pulled into 5.13 stable? That may help things for several people
> > while 5.14 goes through it's shakedown in archlinux prior to release.
>
> The patch had a Fixes: tag, so it should get automatically backported
> to every kernel that has the broken commit. If you don't see it in
> a subsequent 5.13 stable kernel, you are free to ask the stable
> maintainers to consider it.
>
>
> > - mike
> >
> > On Fri, Aug 27, 2021 at 10:07 AM Mike Javorski <[email protected]> wrote:
> >>
> >> Chuck:
> >> I just booted a 5.13.13 kernel with your suggested patch. No freezes
> >> on the first test, but that sometimes happens so I will let the server
> >> settle some and try it again later in the day (which also would align
> >> with Neil's comment on memory fragmentation being a contributor).
> >>
> >> Neil:
> >> I have started a compile with the above kernel + your patch to test
> >> next unless you or Chuck determine that it isn't needed, or that I
> >> should test both patches discreetly. As the above is already merged to
> >> 5.14 it seemed logical to just add your patch on top.
> >>
> >> I will also try to set up a vm to test your md5sum scenario with the
> >> various kernels since it's a much faster thing to test.
> >>
> >> - mike
> >>
> >> On Fri, Aug 27, 2021 at 7:13 AM Chuck Lever III <[email protected]> wrote:
> >>>
> >>>
> >>>> On Aug 27, 2021, at 3:14 AM, NeilBrown <[email protected]> wrote:
> >>>>
> >>>> Subject: [PATCH] SUNRPC: don't pause on incomplete allocation
> >>>>
> >>>> alloc_pages_bulk_array() attempts to allocate at least one page based on
> >>>> the provided pages, and then opportunistically allocates more if that
> >>>> can be done without dropping the spinlock.
> >>>>
> >>>> So if it returns fewer than requested, that could just mean that it
> >>>> needed to drop the lock. In that case, try again immediately.
> >>>>
> >>>> Only pause for a time if no progress could be made.
> >>>
> >>> The case I was worried about was "no pages available on the
> >>> pcplist", in which case, alloc_pages_bulk_array() resorts
> >>> to calling __alloc_pages() and returns only one new page.
> >>>
> >>> "No progess" would mean even __alloc_pages() failed.
> >>>
> >>> So this patch would behave essentially like the
> >>> pre-alloc_pages_bulk_array() code: call alloc_page() for
> >>> each empty struct_page in the array without pausing. That
> >>> seems correct to me.
> >>>
> >>>
> >>> I would add
> >>>
> >>> Fixes: f6e70aab9dfe ("SUNRPC: refresh rq_pages using a bulk page allocator")
> >>>
> >>>
> >>>> Signed-off-by: NeilBrown <[email protected]>
> >>>> ---
> >>>> net/sunrpc/svc_xprt.c | 7 +++++--
> >>>> 1 file changed, 5 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> >>>> index d66a8e44a1ae..99268dd95519 100644
> >>>> --- a/net/sunrpc/svc_xprt.c
> >>>> +++ b/net/sunrpc/svc_xprt.c
> >>>> @@ -662,7 +662,7 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
> >>>> {
> >>>> struct svc_serv *serv = rqstp->rq_server;
> >>>> struct xdr_buf *arg = &rqstp->rq_arg;
> >>>> - unsigned long pages, filled;
> >>>> + unsigned long pages, filled, prev;
> >>>>
> >>>> pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
> >>>> if (pages > RPCSVC_MAXPAGES) {
> >>>> @@ -672,11 +672,14 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
> >>>> pages = RPCSVC_MAXPAGES;
> >>>> }
> >>>>
> >>>> - for (;;) {
> >>>> + for (prev = 0;; prev = filled) {
> >>>> filled = alloc_pages_bulk_array(GFP_KERNEL, pages,
> >>>> rqstp->rq_pages);
> >>>> if (filled == pages)
> >>>> break;
> >>>> + if (filled > prev)
> >>>> + /* Made progress, don't sleep yet */
> >>>> + continue;
> >>>>
> >>>> set_current_state(TASK_INTERRUPTIBLE);
> >>>> if (signalled() || kthread_should_stop()) {
> >>>
> >>> --
> >>> Chuck Lever
> >>>
> >>>
> >>>
>
> --
> Chuck Lever
>
>
>
> On Aug 27, 2021, at 11:22 PM, Mike Javorski <[email protected]> wrote:
>
> I had some time this evening (and the kernel finally compiled), and
> wanted to get this tested.
>
> The TL;DR: Both patches are needed
>
> Below are the test results from my replication of Neil's test. It is
> readily apparent that both the 5.13.13 kernel AND the 5.13.13 kernel
> with the 82011c80b3ec fix exhibit the randomness in read times that
> were observed. The 5.13.13 kernel with both the 82011c80b3ec and
> f6e70aab9dfe fixes brings the performance back in line with the
> 5.12.15 kernel which I tested as a baseline.
>
> Please forgive the inconsistency in sample counts. This was running as
> a while loop, and I just let it go long enough that the behavior was
> consistent. Only change to the VM between tests was the different
> kernel + a reboot. The testing PC had a consistent workload during the
> entire set of tests.
>
> Test 0: 5.13.10 (base kernel in VM image, just for kicks)
> ==================================================
> Samples 30
> Min 6.839
> Max 19.998
> Median 9.638
> 75-P 10.898
> 95-P 12.939
> 99-P 18.005
>
> Test 1: 5.12.15 (known good)
> ==================================================
> Samples 152
> Min 1.997
> Max 2.333
> Median 2.171
> 75-P 2.230
> 95-P 2.286
> 99-P 2.312
>
> Test 2: 5.13.13 (known bad)
> ==================================================
> Samples 42
> Min 3.587
> Max 15.803
> Median 6.039
> 75-P 6.452
> 95-P 10.293
> 99-P 15.540
>
> Test 3: 5.13.13 + 82011c80b3ec fix
> ==================================================
> Samples 44
> Min 4.309
> Max 37.040
> Median 6.615
> 75-P 10.224
> 95-P 19.516
> 99-P 36.650
>
> Test 4: 5.13.13 + 82011c80b3ec fix + f6e70aab9dfe fix
> ==================================================
> Samples 131
> Min 2.013
> Max 2.397
> Median 2.169
> 75-P 2.211
> 95-P 2.283
> 99-P 2.348
>
> I am going to run the kernel w/ both fixes over the weekend, but
> things look good at this point.
>
> - mike
I've targeted Neil's fix for the first 5.15-rc NFSD pull request.
I'd like to have Mel's Reviewed-by or Acked-by, though.
I will add a Fixes: tag if Neil doesn't repost (no reason to at
this point) so the fix should get backported automatically to
recent stable kernels.
> On Fri, Aug 27, 2021 at 4:49 PM Chuck Lever III <[email protected]> wrote:
>>
>>
>>> On Aug 27, 2021, at 6:00 PM, Mike Javorski <[email protected]> wrote:
>>>
>>> OK, an update. Several hours of spaced out testing sessions and the
>>> first patch seems to have resolved the issue. There may be a very tiny
>>> bit of lag that still occurs when opening/processing new files, but so
>>> far on this kernel I have not had any multi-second freezes. I am still
>>> waiting on the kernel with Neil's patch to compile (compiling on this
>>> underpowered server so it's taking several hours), but I think the
>>> testing there will just be to see if I can show it works still, and
>>> then to try and test in a memory constrained VM. To see if I can
>>> recreate Neil's experiment. Likely will have to do this over the
>>> weekend given the kernel compile delay + fiddling with a VM.
>>
>> Thanks for your testing!
>>
>>
>>> Chuck: I don't mean to overstep bounds, but is it possible to get that
>>> patch pulled into 5.13 stable? That may help things for several people
>>> while 5.14 goes through it's shakedown in archlinux prior to release.
>>
>> The patch had a Fixes: tag, so it should get automatically backported
>> to every kernel that has the broken commit. If you don't see it in
>> a subsequent 5.13 stable kernel, you are free to ask the stable
>> maintainers to consider it.
>>
>>
>>> - mike
>>>
>>> On Fri, Aug 27, 2021 at 10:07 AM Mike Javorski <[email protected]> wrote:
>>>>
>>>> Chuck:
>>>> I just booted a 5.13.13 kernel with your suggested patch. No freezes
>>>> on the first test, but that sometimes happens so I will let the server
>>>> settle some and try it again later in the day (which also would align
>>>> with Neil's comment on memory fragmentation being a contributor).
>>>>
>>>> Neil:
>>>> I have started a compile with the above kernel + your patch to test
>>>> next unless you or Chuck determine that it isn't needed, or that I
>>>> should test both patches discreetly. As the above is already merged to
>>>> 5.14 it seemed logical to just add your patch on top.
>>>>
>>>> I will also try to set up a vm to test your md5sum scenario with the
>>>> various kernels since it's a much faster thing to test.
>>>>
>>>> - mike
>>>>
>>>> On Fri, Aug 27, 2021 at 7:13 AM Chuck Lever III <[email protected]> wrote:
>>>>>
>>>>>
>>>>>> On Aug 27, 2021, at 3:14 AM, NeilBrown <[email protected]> wrote:
>>>>>>
>>>>>> Subject: [PATCH] SUNRPC: don't pause on incomplete allocation
>>>>>>
>>>>>> alloc_pages_bulk_array() attempts to allocate at least one page based on
>>>>>> the provided pages, and then opportunistically allocates more if that
>>>>>> can be done without dropping the spinlock.
>>>>>>
>>>>>> So if it returns fewer than requested, that could just mean that it
>>>>>> needed to drop the lock. In that case, try again immediately.
>>>>>>
>>>>>> Only pause for a time if no progress could be made.
>>>>>
>>>>> The case I was worried about was "no pages available on the
>>>>> pcplist", in which case, alloc_pages_bulk_array() resorts
>>>>> to calling __alloc_pages() and returns only one new page.
>>>>>
>>>>> "No progess" would mean even __alloc_pages() failed.
>>>>>
>>>>> So this patch would behave essentially like the
>>>>> pre-alloc_pages_bulk_array() code: call alloc_page() for
>>>>> each empty struct_page in the array without pausing. That
>>>>> seems correct to me.
>>>>>
>>>>>
>>>>> I would add
>>>>>
>>>>> Fixes: f6e70aab9dfe ("SUNRPC: refresh rq_pages using a bulk page allocator")
>>>>>
>>>>>
>>>>>> Signed-off-by: NeilBrown <[email protected]>
>>>>>> ---
>>>>>> net/sunrpc/svc_xprt.c | 7 +++++--
>>>>>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
>>>>>> index d66a8e44a1ae..99268dd95519 100644
>>>>>> --- a/net/sunrpc/svc_xprt.c
>>>>>> +++ b/net/sunrpc/svc_xprt.c
>>>>>> @@ -662,7 +662,7 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
>>>>>> {
>>>>>> struct svc_serv *serv = rqstp->rq_server;
>>>>>> struct xdr_buf *arg = &rqstp->rq_arg;
>>>>>> - unsigned long pages, filled;
>>>>>> + unsigned long pages, filled, prev;
>>>>>>
>>>>>> pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
>>>>>> if (pages > RPCSVC_MAXPAGES) {
>>>>>> @@ -672,11 +672,14 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
>>>>>> pages = RPCSVC_MAXPAGES;
>>>>>> }
>>>>>>
>>>>>> - for (;;) {
>>>>>> + for (prev = 0;; prev = filled) {
>>>>>> filled = alloc_pages_bulk_array(GFP_KERNEL, pages,
>>>>>> rqstp->rq_pages);
>>>>>> if (filled == pages)
>>>>>> break;
>>>>>> + if (filled > prev)
>>>>>> + /* Made progress, don't sleep yet */
>>>>>> + continue;
>>>>>>
>>>>>> set_current_state(TASK_INTERRUPTIBLE);
>>>>>> if (signalled() || kthread_should_stop()) {
>>>>>
>>>>> --
>>>>> Chuck Lever
>>>>>
>>>>>
>>>>>
>>
>> --
>> Chuck Lever
>>
>>
>>
--
Chuck Lever
alloc_pages_bulk_array() attempts to allocate at least one page based on
the provided pages, and then opportunistically allocates more if that
can be done without dropping the spinlock.
So if it returns fewer than requested, that could just mean that it
needed to drop the lock. In that case, try again immediately.
Only pause for a time if no progress could be made.
Reported-and-tested-by: Mike Javorski <[email protected]>
Reported-and-tested-by: Lothar Paltins <[email protected]>
Fixes: f6e70aab9dfe ("SUNRPC: refresh rq_pages using a bulk page allocator")
Signed-off-by: NeilBrown <[email protected]>
---
I decided I would resend, as I thought the for() loops could be clearer.
This patch should perform exactly the same as the previous one.
net/sunrpc/svc_xprt.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index d66a8e44a1ae..e74d5cf3cbb4 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -662,7 +662,7 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
 {
 	struct svc_serv *serv = rqstp->rq_server;
 	struct xdr_buf *arg = &rqstp->rq_arg;
-	unsigned long pages, filled;
+	unsigned long pages, filled, ret;
 
 	pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
 	if (pages > RPCSVC_MAXPAGES) {
@@ -672,11 +672,12 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
 		pages = RPCSVC_MAXPAGES;
 	}
 
-	for (;;) {
-		filled = alloc_pages_bulk_array(GFP_KERNEL, pages,
-						rqstp->rq_pages);
-		if (filled == pages)
-			break;
+	for (filled = 0; filled < pages; filled = ret) {
+		ret = alloc_pages_bulk_array(GFP_KERNEL, pages,
+					     rqstp->rq_pages);
+		if (ret > filled)
+			/* Made progress, don't sleep yet */
+			continue;
 
 		set_current_state(TASK_INTERRUPTIBLE);
 		if (signalled() || kthread_should_stop()) {
--
2.32.0
On Mon, Aug 30, 2021 at 08:36:34AM +1000, NeilBrown wrote:
>
> alloc_pages_bulk_array() attempts to allocate at least one page based on
> the provided pages, and then opportunistically allocates more if that
> can be done without dropping the spinlock.
>
> So if it returns fewer than requested, that could just mean that it
> needed to drop the lock. In that case, try again immediately.
>
> Only pause for a time if no progress could be made.
>
> Reported-and-tested-by: Mike Javorski <[email protected]>
> Reported-and-tested-by: Lothar Paltins <[email protected]>
> Fixes: f6e70aab9dfe ("SUNRPC: refresh rq_pages using a bulk page allocator")
> Signed-off-by: NeilBrown <[email protected]>
Acked-by: Mel Gorman <[email protected]>
--
Mel Gorman
SUSE Labs
For what it's worth, another positive report on this patch:
https://bugzilla.kernel.org/show_bug.cgi?id=213869#c10
--b.
On Mon, Aug 30, 2021 at 08:36:34AM +1000, NeilBrown wrote:
>
> alloc_pages_bulk_array() attempts to allocate at least one page based on
> the provided pages, and then opportunistically allocates more if that
> can be done without dropping the spinlock.
>
> So if it returns fewer than requested, that could just mean that it
> needed to drop the lock. In that case, try again immediately.
>
> Only pause for a time if no progress could be made.
>
> Reported-and-tested-by: Mike Javorski <[email protected]>
> Reported-and-tested-by: Lothar Paltins <[email protected]>
> Fixes: f6e70aab9dfe ("SUNRPC: refresh rq_pages using a bulk page allocator")
> Signed-off-by: NeilBrown <[email protected]>
> ---
>
> I decided I would resend, as I thought the for() loops could be clearer.
> This patch should perform exactly the same as the previous one.
>
>
> net/sunrpc/svc_xprt.c | 13 +++++++------
> 1 file changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index d66a8e44a1ae..e74d5cf3cbb4 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -662,7 +662,7 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
> {
> struct svc_serv *serv = rqstp->rq_server;
> struct xdr_buf *arg = &rqstp->rq_arg;
> - unsigned long pages, filled;
> + unsigned long pages, filled, ret;
>
> pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
> if (pages > RPCSVC_MAXPAGES) {
> @@ -672,11 +672,12 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
> pages = RPCSVC_MAXPAGES;
> }
>
> - for (;;) {
> - filled = alloc_pages_bulk_array(GFP_KERNEL, pages,
> - rqstp->rq_pages);
> - if (filled == pages)
> - break;
> + for (filled = 0; filled < pages; filled = ret) {
> + ret = alloc_pages_bulk_array(GFP_KERNEL, pages,
> + rqstp->rq_pages);
> + if (ret > filled)
> + /* Made progress, don't sleep yet */
> + continue;
>
> set_current_state(TASK_INTERRUPTIBLE);
> if (signalled() || kthread_should_stop()) {
> --
> 2.32.0
Hi Chuck.
I noticed that you sent in the 5.15 pull request but Neil's fix
(e38b3f20059426a0adbde014ff71071739ab5226 in your tree) missed the
pull and thus the fix isn't going to be backported to 5.14 in the near
term. Is there another 5.15 pull planned in the not too distant future
so this will get flagged for back-porting, or do I need to reach out
to someone to expressly pull it into 5.14? If the latter, can you
point me in the right direction of who to ask (I assume it's someone
other than Greg KH)?
Thanks
- mike
On Sat, Aug 28, 2021 at 11:23 AM Chuck Lever III <[email protected]> wrote:
>
>
>
> > On Aug 27, 2021, at 11:22 PM, Mike Javorski <[email protected]> wrote:
> >
> > I had some time this evening (and the kernel finally compiled), and
> > wanted to get this tested.
> >
> > The TL;DR: Both patches are needed
> >
> > Below are the test results from my replication of Neil's test. It is
> > readily apparent that both the 5.13.13 kernel AND the 5.13.13 kernel
> > with the 82011c80b3ec fix exhibit the randomness in read times that
> > were observed. The 5.13.13 kernel with both the 82011c80b3ec and
> > f6e70aab9dfe fixes brings the performance back in line with the
> > 5.12.15 kernel which I tested as a baseline.
> >
> > Please forgive the inconsistency in sample counts. This was running as
> > a while loop, and I just let it go long enough that the behavior was
> > consistent. Only change to the VM between tests was the different
> > kernel + a reboot. The testing PC had a consistent workload during the
> > entire set of tests.
> >
> > Test 0: 5.13.10 (base kernel in VM image, just for kicks)
> > ==================================================
> > Samples 30
> > Min 6.839
> > Max 19.998
> > Median 9.638
> > 75-P 10.898
> > 95-P 12.939
> > 99-P 18.005
> >
> > Test 1: 5.12.15 (known good)
> > ==================================================
> > Samples 152
> > Min 1.997
> > Max 2.333
> > Median 2.171
> > 75-P 2.230
> > 95-P 2.286
> > 99-P 2.312
> >
> > Test 2: 5.13.13 (known bad)
> > ==================================================
> > Samples 42
> > Min 3.587
> > Max 15.803
> > Median 6.039
> > 75-P 6.452
> > 95-P 10.293
> > 99-P 15.540
> >
> > Test 3: 5.13.13 + 82011c80b3ec fix
> > ==================================================
> > Samples 44
> > Min 4.309
> > Max 37.040
> > Median 6.615
> > 75-P 10.224
> > 95-P 19.516
> > 99-P 36.650
> >
> > Test 4: 5.13.13 + 82011c80b3ec fix + f6e70aab9dfe fix
> > ==================================================
> > Samples 131
> > Min 2.013
> > Max 2.397
> > Median 2.169
> > 75-P 2.211
> > 95-P 2.283
> > 99-P 2.348
> >
> > I am going to run the kernel w/ both fixes over the weekend, but
> > things look good at this point.
> >
> > - mike
>
> I've targeted Neil's fix for the first 5.15-rc NFSD pull request.
> I'd like to have Mel's Reviewed-by or Acked-by, though.
>
> I will add a Fixes: tag if Neil doesn't repost (no reason to at
> this point) so the fix should get backported automatically to
> recent stable kernels.
>
>
> > On Fri, Aug 27, 2021 at 4:49 PM Chuck Lever III <[email protected]> wrote:
> >>
> >>
> >>> On Aug 27, 2021, at 6:00 PM, Mike Javorski <[email protected]> wrote:
> >>>
> >>> OK, an update. Several hours of spaced out testing sessions and the
> >>> first patch seems to have resolved the issue. There may be a very tiny
> >>> bit of lag that still occurs when opening/processing new files, but so
> >>> far on this kernel I have not had any multi-second freezes. I am still
> >>> waiting on the kernel with Neil's patch to compile (compiling on this
> >>> underpowered server so it's taking several hours), but I think the
> >>> testing there will just be to see if I can show it works still, and
> >>> then to try and test in a memory constrained VM. To see if I can
> >>> recreate Neil's experiment. Likely will have to do this over the
> >>> weekend given the kernel compile delay + fiddling with a VM.
> >>
> >> Thanks for your testing!
> >>
> >>
> >>> Chuck: I don't mean to overstep bounds, but is it possible to get that
> >>> patch pulled into 5.13 stable? That may help things for several people
> >>> while 5.14 goes through it's shakedown in archlinux prior to release.
> >>
> >> The patch had a Fixes: tag, so it should get automatically backported
> >> to every kernel that has the broken commit. If you don't see it in
> >> a subsequent 5.13 stable kernel, you are free to ask the stable
> >> maintainers to consider it.
> >>
> >>
> >>> - mike
> >>>
> >>> On Fri, Aug 27, 2021 at 10:07 AM Mike Javorski <[email protected]> wrote:
> >>>>
> >>>> Chuck:
> >>>> I just booted a 5.13.13 kernel with your suggested patch. No freezes
> >>>> on the first test, but that sometimes happens so I will let the server
> >>>> settle some and try it again later in the day (which also would align
> >>>> with Neil's comment on memory fragmentation being a contributor).
> >>>>
> >>>> Neil:
> >>>> I have started a compile with the above kernel + your patch to test
> >>>> next unless you or Chuck determine that it isn't needed, or that I
> >>>> should test both patches discreetly. As the above is already merged to
> >>>> 5.14 it seemed logical to just add your patch on top.
> >>>>
> >>>> I will also try to set up a vm to test your md5sum scenario with the
> >>>> various kernels since it's a much faster thing to test.
> >>>>
> >>>> - mike
> >>>>
> >>>> On Fri, Aug 27, 2021 at 7:13 AM Chuck Lever III <[email protected]> wrote:
> >>>>>
> >>>>>
> >>>>>> On Aug 27, 2021, at 3:14 AM, NeilBrown <[email protected]> wrote:
> >>>>>>
> >>>>>> Subject: [PATCH] SUNRPC: don't pause on incomplete allocation
> >>>>>>
> >>>>>> alloc_pages_bulk_array() attempts to allocate at least one page based on
> >>>>>> the provided pages, and then opportunistically allocates more if that
> >>>>>> can be done without dropping the spinlock.
> >>>>>>
> >>>>>> So if it returns fewer than requested, that could just mean that it
> >>>>>> needed to drop the lock. In that case, try again immediately.
> >>>>>>
> >>>>>> Only pause for a time if no progress could be made.
> >>>>>
> >>>>> The case I was worried about was "no pages available on the
> >>>>> pcplist", in which case, alloc_pages_bulk_array() resorts
> >>>>> to calling __alloc_pages() and returns only one new page.
> >>>>>
> >>>>> "No progess" would mean even __alloc_pages() failed.
> >>>>>
> >>>>> So this patch would behave essentially like the
> >>>>> pre-alloc_pages_bulk_array() code: call alloc_page() for
> >>>>> each empty struct_page in the array without pausing. That
> >>>>> seems correct to me.
> >>>>>
> >>>>>
> >>>>> I would add
> >>>>>
> >>>>> Fixes: f6e70aab9dfe ("SUNRPC: refresh rq_pages using a bulk page allocator")
> >>>>>
> >>>>>
> >>>>>> Signed-off-by: NeilBrown <[email protected]>
> >>>>>> ---
> >>>>>> net/sunrpc/svc_xprt.c | 7 +++++--
> >>>>>> 1 file changed, 5 insertions(+), 2 deletions(-)
> >>>>>>
> >>>>>> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> >>>>>> index d66a8e44a1ae..99268dd95519 100644
> >>>>>> --- a/net/sunrpc/svc_xprt.c
> >>>>>> +++ b/net/sunrpc/svc_xprt.c
> >>>>>> @@ -662,7 +662,7 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
> >>>>>> {
> >>>>>> struct svc_serv *serv = rqstp->rq_server;
> >>>>>> struct xdr_buf *arg = &rqstp->rq_arg;
> >>>>>> - unsigned long pages, filled;
> >>>>>> + unsigned long pages, filled, prev;
> >>>>>>
> >>>>>> pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
> >>>>>> if (pages > RPCSVC_MAXPAGES) {
> >>>>>> @@ -672,11 +672,14 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
> >>>>>> pages = RPCSVC_MAXPAGES;
> >>>>>> }
> >>>>>>
> >>>>>> - for (;;) {
> >>>>>> + for (prev = 0;; prev = filled) {
> >>>>>> filled = alloc_pages_bulk_array(GFP_KERNEL, pages,
> >>>>>> rqstp->rq_pages);
> >>>>>> if (filled == pages)
> >>>>>> break;
> >>>>>> + if (filled > prev)
> >>>>>> + /* Made progress, don't sleep yet */
> >>>>>> + continue;
> >>>>>>
> >>>>>> set_current_state(TASK_INTERRUPTIBLE);
> >>>>>> if (signalled() || kthread_should_stop()) {
> >>>>>
> >>>>> --
> >>>>> Chuck Lever
> >>>>>
> >>>>>
> >>>>>
> >>
> >> --
> >> Chuck Lever
> >>
> >>
> >>
>
> --
> Chuck Lever
>
>
>
> On Sep 4, 2021, at 1:41 PM, Mike Javorski <[email protected]> wrote:
>
> Hi Chuck.
>
> I noticed that you sent in the 5.15 pull request but Neil's fix
> (e38b3f20059426a0adbde014ff71071739ab5226 in your tree) missed the
> pull and thus the fix isn't going to be backported to 5.14 in the near
> term. Is there another 5.15 pull planned in the not too distant future
> so this will get flagged for back-porting,
Yes. The final version of Neil’s patch was just a little late for the initial v5.15 NFSD pull request (IMO) so it’s queued for the next PR, probably this week.
> or do I need to reach out to someone to expressly pull it into 5.14? If the latter, can you
> point me in the right direction of who to ask (I assume it's someone
> other than Greg KH)?
>
> Thanks
>
> - mike
>
>
>> On Sat, Aug 28, 2021 at 11:23 AM Chuck Lever III <[email protected]> wrote:
>>
>>
>>
>>>> On Aug 27, 2021, at 11:22 PM, Mike Javorski <[email protected]> wrote:
>>>
>>> I had some time this evening (and the kernel finally compiled), and
>>> wanted to get this tested.
>>>
>>> The TL;DR: Both patches are needed
>>>
>>> Below are the test results from my replication of Neil's test. It is
>>> readily apparent that both the 5.13.13 kernel AND the 5.13.13 kernel
>>> with the 82011c80b3ec fix exhibit the randomness in read times that
>>> were observed. The 5.13.13 kernel with both the 82011c80b3ec and
>>> f6e70aab9dfe fixes brings the performance back in line with the
>>> 5.12.15 kernel which I tested as a baseline.
>>>
>>> Please forgive the inconsistency in sample counts. This was running as
>>> a while loop, and I just let it go long enough that the behavior was
>>> consistent. Only change to the VM between tests was the different
>>> kernel + a reboot. The testing PC had a consistent workload during the
>>> entire set of tests.
>>>
>>> Test 0: 5.13.10 (base kernel in VM image, just for kicks)
>>> ==================================================
>>> Samples 30
>>> Min 6.839
>>> Max 19.998
>>> Median 9.638
>>> 75-P 10.898
>>> 95-P 12.939
>>> 99-P 18.005
>>>
>>> Test 1: 5.12.15 (known good)
>>> ==================================================
>>> Samples 152
>>> Min 1.997
>>> Max 2.333
>>> Median 2.171
>>> 75-P 2.230
>>> 95-P 2.286
>>> 99-P 2.312
>>>
>>> Test 2: 5.13.13 (known bad)
>>> ==================================================
>>> Samples 42
>>> Min 3.587
>>> Max 15.803
>>> Median 6.039
>>> 75-P 6.452
>>> 95-P 10.293
>>> 99-P 15.540
>>>
>>> Test 3: 5.13.13 + 82011c80b3ec fix
>>> ==================================================
>>> Samples 44
>>> Min 4.309
>>> Max 37.040
>>> Median 6.615
>>> 75-P 10.224
>>> 95-P 19.516
>>> 99-P 36.650
>>>
>>> Test 4: 5.13.13 + 82011c80b3ec fix + f6e70aab9dfe fix
>>> ==================================================
>>> Samples 131
>>> Min 2.013
>>> Max 2.397
>>> Median 2.169
>>> 75-P 2.211
>>> 95-P 2.283
>>> 99-P 2.348
>>>
>>> I am going to run the kernel w/ both fixes over the weekend, but
>>> things look good at this point.
>>>
>>> - mike
>>
>> I've targeted Neil's fix for the first 5.15-rc NFSD pull request.
>> I'd like to have Mel's Reviewed-by or Acked-by, though.
>>
>> I will add a Fixes: tag if Neil doesn't repost (no reason to at
>> this point) so the fix should get backported automatically to
>> recent stable kernels.
>>
>>
>>> On Fri, Aug 27, 2021 at 4:49 PM Chuck Lever III <[email protected]> wrote:
>>>>
>>>>
>>>>> On Aug 27, 2021, at 6:00 PM, Mike Javorski <[email protected]> wrote:
>>>>>
>>>>> OK, an update. Several hours of spaced out testing sessions and the
>>>>> first patch seems to have resolved the issue. There may be a very tiny
>>>>> bit of lag that still occurs when opening/processing new files, but so
>>>>> far on this kernel I have not had any multi-second freezes. I am still
>>>>> waiting on the kernel with Neil's patch to compile (compiling on this
>>>>> underpowered server so it's taking several hours), but I think the
>>>>> testing there will just be to see if I can show it works still, and
>>>>> then to try and test in a memory constrained VM. To see if I can
>>>>> recreate Neil's experiment. Likely will have to do this over the
>>>>> weekend given the kernel compile delay + fiddling with a VM.
>>>>
>>>> Thanks for your testing!
>>>>
>>>>
>>>>> Chuck: I don't mean to overstep bounds, but is it possible to get that
>>>>> patch pulled into 5.13 stable? That may help things for several people
>>>>> while 5.14 goes through it's shakedown in archlinux prior to release.
>>>>
>>>> The patch had a Fixes: tag, so it should get automatically backported
>>>> to every kernel that has the broken commit. If you don't see it in
>>>> a subsequent 5.13 stable kernel, you are free to ask the stable
>>>> maintainers to consider it.
>>>>
>>>>
>>>>> - mike
>>>>>
>>>>> On Fri, Aug 27, 2021 at 10:07 AM Mike Javorski <[email protected]> wrote:
>>>>>>
>>>>>> Chuck:
>>>>>> I just booted a 5.13.13 kernel with your suggested patch. No freezes
>>>>>> on the first test, but that sometimes happens so I will let the server
>>>>>> settle some and try it again later in the day (which also would align
>>>>>> with Neil's comment on memory fragmentation being a contributor).
>>>>>>
>>>>>> Neil:
>>>>>> I have started a compile with the above kernel + your patch to test
>>>>>> next unless you or Chuck determine that it isn't needed, or that I
>>>>>> should test both patches discreetly. As the above is already merged to
>>>>>> 5.14 it seemed logical to just add your patch on top.
>>>>>>
>>>>>> I will also try to set up a vm to test your md5sum scenario with the
>>>>>> various kernels since it's a much faster thing to test.
>>>>>>
>>>>>> - mike
>>>>>>
>>>>>> On Fri, Aug 27, 2021 at 7:13 AM Chuck Lever III <[email protected]> wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On Aug 27, 2021, at 3:14 AM, NeilBrown <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Subject: [PATCH] SUNRPC: don't pause on incomplete allocation
>>>>>>>>
>>>>>>>> alloc_pages_bulk_array() attempts to allocate at least one page based on
>>>>>>>> the provided pages, and then opportunistically allocates more if that
>>>>>>>> can be done without dropping the spinlock.
>>>>>>>>
>>>>>>>> So if it returns fewer than requested, that could just mean that it
>>>>>>>> needed to drop the lock. In that case, try again immediately.
>>>>>>>>
>>>>>>>> Only pause for a time if no progress could be made.
>>>>>>>
>>>>>>> The case I was worried about was "no pages available on the
>>>>>>> pcplist", in which case, alloc_pages_bulk_array() resorts
>>>>>>> to calling __alloc_pages() and returns only one new page.
>>>>>>>
>>>>>>> "No progess" would mean even __alloc_pages() failed.
>>>>>>>
>>>>>>> So this patch would behave essentially like the
>>>>>>> pre-alloc_pages_bulk_array() code: call alloc_page() for
>>>>>>> each empty struct_page in the array without pausing. That
>>>>>>> seems correct to me.
>>>>>>>
>>>>>>>
>>>>>>> I would add
>>>>>>>
>>>>>>> Fixes: f6e70aab9dfe ("SUNRPC: refresh rq_pages using a bulk page allocator")
>>>>>>>
>>>>>>>
>>>>>>>> Signed-off-by: NeilBrown <[email protected]>
>>>>>>>> ---
>>>>>>>> net/sunrpc/svc_xprt.c | 7 +++++--
>>>>>>>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
>>>>>>>> index d66a8e44a1ae..99268dd95519 100644
>>>>>>>> --- a/net/sunrpc/svc_xprt.c
>>>>>>>> +++ b/net/sunrpc/svc_xprt.c
>>>>>>>> @@ -662,7 +662,7 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
>>>>>>>> {
>>>>>>>> struct svc_serv *serv = rqstp->rq_server;
>>>>>>>> struct xdr_buf *arg = &rqstp->rq_arg;
>>>>>>>> - unsigned long pages, filled;
>>>>>>>> + unsigned long pages, filled, prev;
>>>>>>>>
>>>>>>>> pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
>>>>>>>> if (pages > RPCSVC_MAXPAGES) {
>>>>>>>> @@ -672,11 +672,14 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
>>>>>>>> pages = RPCSVC_MAXPAGES;
>>>>>>>> }
>>>>>>>>
>>>>>>>> - for (;;) {
>>>>>>>> + for (prev = 0;; prev = filled) {
>>>>>>>> filled = alloc_pages_bulk_array(GFP_KERNEL, pages,
>>>>>>>> rqstp->rq_pages);
>>>>>>>> if (filled == pages)
>>>>>>>> break;
>>>>>>>> + if (filled > prev)
>>>>>>>> + /* Made progress, don't sleep yet */
>>>>>>>> + continue;
>>>>>>>>
>>>>>>>> set_current_state(TASK_INTERRUPTIBLE);
>>>>>>>> if (signalled() || kthread_should_stop()) {
>>>>>>>
>>>>>>> --
>>>>>>> Chuck Lever
>>>>>>>
>>>>>>>
>>>>>>>
>>>>
>>>> --
>>>> Chuck Lever
>>>>
>>>>
>>>>
>>
>> --
>> Chuck Lever
>>
>>
>>
Chuck:
I see that the patch was merged to Linus' branch, but there have been
2 stable patch releases since and the patch hasn't been pulled in. You
mentioned I should reach out to the stable maintainers in this
instance; is the [email protected] list the appropriate place to
make such a request?
Thanks.
- mike
On Sat, Sep 4, 2021 at 7:02 PM Chuck Lever III <[email protected]> wrote:
>
>
> > On Sep 4, 2021, at 1:41 PM, Mike Javorski <[email protected]> wrote:
> >
> > Hi Chuck.
> >
> > I noticed that you sent in the 5.15 pull request but Neil's fix
> > (e38b3f20059426a0adbde014ff71071739ab5226 in your tree) missed the
> > pull and thus the fix isn't going to be backported to 5.14 in the near
> > term. Is there another 5.15 pull planned in the not too distant future
> > so this will get flagged for back-porting,
>
> Yes. The final version of Neil’s patch was just a little late for the initial v5.15 NFSD pull request (IMO) so it’s queued for the next PR, probably this week.
>
>
> > or do I need to reach out to someone to expressly pull it into 5.14? If the latter, can you
> > point me in the right direction of who to ask (I assume it's someone
> > other than Greg KH)?
> >
> > Thanks
> >
> > - mike
> >
> >
> >> On Sat, Aug 28, 2021 at 11:23 AM Chuck Lever III <[email protected]> wrote:
> >>
> >>
> >>
> >>>> On Aug 27, 2021, at 11:22 PM, Mike Javorski <[email protected]> wrote:
> >>>
> >>> I had some time this evening (and the kernel finally compiled), and
> >>> wanted to get this tested.
> >>>
> >>> The TL;DR: Both patches are needed
> >>>
> >>> Below are the test results from my replication of Neil's test. It is
> >>> readily apparent that both the 5.13.13 kernel AND the 5.13.13 kernel
> >>> with the 82011c80b3ec fix exhibit the randomness in read times that
> >>> were observed. The 5.13.13 kernel with both the 82011c80b3ec and
> >>> f6e70aab9dfe fixes brings the performance back in line with the
> >>> 5.12.15 kernel which I tested as a baseline.
> >>>
> >>> Please forgive the inconsistency in sample counts. This was running as
> >>> a while loop, and I just let it go long enough that the behavior was
> >>> consistent. Only change to the VM between tests was the different
> >>> kernel + a reboot. The testing PC had a consistent workload during the
> >>> entire set of tests.
> >>>
> >>> Test 0: 5.13.10 (base kernel in VM image, just for kicks)
> >>> ==================================================
> >>> Samples 30
> >>> Min 6.839
> >>> Max 19.998
> >>> Median 9.638
> >>> 75-P 10.898
> >>> 95-P 12.939
> >>> 99-P 18.005
> >>>
> >>> Test 1: 5.12.15 (known good)
> >>> ==================================================
> >>> Samples 152
> >>> Min 1.997
> >>> Max 2.333
> >>> Median 2.171
> >>> 75-P 2.230
> >>> 95-P 2.286
> >>> 99-P 2.312
> >>>
> >>> Test 2: 5.13.13 (known bad)
> >>> ==================================================
> >>> Samples 42
> >>> Min 3.587
> >>> Max 15.803
> >>> Median 6.039
> >>> 75-P 6.452
> >>> 95-P 10.293
> >>> 99-P 15.540
> >>>
> >>> Test 3: 5.13.13 + 82011c80b3ec fix
> >>> ==================================================
> >>> Samples 44
> >>> Min 4.309
> >>> Max 37.040
> >>> Median 6.615
> >>> 75-P 10.224
> >>> 95-P 19.516
> >>> 99-P 36.650
> >>>
> >>> Test 4: 5.13.13 + 82011c80b3ec fix + f6e70aab9dfe fix
> >>> ==================================================
> >>> Samples 131
> >>> Min 2.013
> >>> Max 2.397
> >>> Median 2.169
> >>> 75-P 2.211
> >>> 95-P 2.283
> >>> 99-P 2.348
> >>>
> >>> I am going to run the kernel w/ both fixes over the weekend, but
> >>> things look good at this point.
> >>>
> >>> - mike
> >>
> >> I've targeted Neil's fix for the first 5.15-rc NFSD pull request.
> >> I'd like to have Mel's Reviewed-by or Acked-by, though.
> >>
> >> I will add a Fixes: tag if Neil doesn't repost (no reason to at
> >> this point) so the fix should get backported automatically to
> >> recent stable kernels.
> >>
> >>
> >>> On Fri, Aug 27, 2021 at 4:49 PM Chuck Lever III <[email protected]> wrote:
> >>>>
> >>>>
> >>>>> On Aug 27, 2021, at 6:00 PM, Mike Javorski <[email protected]> wrote:
> >>>>>
> >>>>> OK, an update. Several hours of spaced out testing sessions and the
> >>>>> first patch seems to have resolved the issue. There may be a very tiny
> >>>>> bit of lag that still occurs when opening/processing new files, but so
> >>>>> far on this kernel I have not had any multi-second freezes. I am still
> >>>>> waiting on the kernel with Neil's patch to compile (compiling on this
> >>>>> underpowered server so it's taking several hours), but I think the
> >>>>> testing there will just be to see if I can show it works still, and
> >>>>> then to try and test in a memory constrained VM. To see if I can
> >>>>> recreate Neil's experiment. Likely will have to do this over the
> >>>>> weekend given the kernel compile delay + fiddling with a VM.
> >>>>
> >>>> Thanks for your testing!
> >>>>
> >>>>
> >>>>> Chuck: I don't mean to overstep bounds, but is it possible to get that
> >>>>> patch pulled into 5.13 stable? That may help things for several people
> >>>>> while 5.14 goes through it's shakedown in archlinux prior to release.
> >>>>
> >>>> The patch had a Fixes: tag, so it should get automatically backported
> >>>> to every kernel that has the broken commit. If you don't see it in
> >>>> a subsequent 5.13 stable kernel, you are free to ask the stable
> >>>> maintainers to consider it.
> >>>>
> >>>>
> >>>>> - mike
> >>>>>
> >>>>> On Fri, Aug 27, 2021 at 10:07 AM Mike Javorski <[email protected]> wrote:
> >>>>>>
> >>>>>> Chuck:
> >>>>>> I just booted a 5.13.13 kernel with your suggested patch. No freezes
> >>>>>> on the first test, but that sometimes happens so I will let the server
> >>>>>> settle some and try it again later in the day (which also would align
> >>>>>> with Neil's comment on memory fragmentation being a contributor).
> >>>>>>
> >>>>>> Neil:
> >>>>>> I have started a compile with the above kernel + your patch to test
> >>>>>> next unless you or Chuck determine that it isn't needed, or that I
> >>>>>> should test both patches discreetly. As the above is already merged to
> >>>>>> 5.14 it seemed logical to just add your patch on top.
> >>>>>>
> >>>>>> I will also try to set up a vm to test your md5sum scenario with the
> >>>>>> various kernels since it's a much faster thing to test.
> >>>>>>
> >>>>>> - mike
> >>>>>>
> >>>>>> On Fri, Aug 27, 2021 at 7:13 AM Chuck Lever III <[email protected]> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>> On Aug 27, 2021, at 3:14 AM, NeilBrown <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> Subject: [PATCH] SUNRPC: don't pause on incomplete allocation
> >>>>>>>>
> >>>>>>>> alloc_pages_bulk_array() attempts to allocate at least one page based on
> >>>>>>>> the provided pages, and then opportunistically allocates more if that
> >>>>>>>> can be done without dropping the spinlock.
> >>>>>>>>
> >>>>>>>> So if it returns fewer than requested, that could just mean that it
> >>>>>>>> needed to drop the lock. In that case, try again immediately.
> >>>>>>>>
> >>>>>>>> Only pause for a time if no progress could be made.
> >>>>>>>
> >>>>>>> The case I was worried about was "no pages available on the
> >>>>>>> pcplist", in which case, alloc_pages_bulk_array() resorts
> >>>>>>> to calling __alloc_pages() and returns only one new page.
> >>>>>>>
> >>>>>>> "No progess" would mean even __alloc_pages() failed.
> >>>>>>>
> >>>>>>> So this patch would behave essentially like the
> >>>>>>> pre-alloc_pages_bulk_array() code: call alloc_page() for
> >>>>>>> each empty struct_page in the array without pausing. That
> >>>>>>> seems correct to me.
> >>>>>>>
> >>>>>>>
> >>>>>>> I would add
> >>>>>>>
> >>>>>>> Fixes: f6e70aab9dfe ("SUNRPC: refresh rq_pages using a bulk page allocator")
> >>>>>>>
> >>>>>>>
> >>>>>>>> Signed-off-by: NeilBrown <[email protected]>
> >>>>>>>> ---
> >>>>>>>> net/sunrpc/svc_xprt.c | 7 +++++--
> >>>>>>>> 1 file changed, 5 insertions(+), 2 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> >>>>>>>> index d66a8e44a1ae..99268dd95519 100644
> >>>>>>>> --- a/net/sunrpc/svc_xprt.c
> >>>>>>>> +++ b/net/sunrpc/svc_xprt.c
> >>>>>>>> @@ -662,7 +662,7 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
> >>>>>>>> {
> >>>>>>>> struct svc_serv *serv = rqstp->rq_server;
> >>>>>>>> struct xdr_buf *arg = &rqstp->rq_arg;
> >>>>>>>> - unsigned long pages, filled;
> >>>>>>>> + unsigned long pages, filled, prev;
> >>>>>>>>
> >>>>>>>> pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
> >>>>>>>> if (pages > RPCSVC_MAXPAGES) {
> >>>>>>>> @@ -672,11 +672,14 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
> >>>>>>>> pages = RPCSVC_MAXPAGES;
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> - for (;;) {
> >>>>>>>> + for (prev = 0;; prev = filled) {
> >>>>>>>> filled = alloc_pages_bulk_array(GFP_KERNEL, pages,
> >>>>>>>> rqstp->rq_pages);
> >>>>>>>> if (filled == pages)
> >>>>>>>> break;
> >>>>>>>> + if (filled > prev)
> >>>>>>>> + /* Made progress, don't sleep yet */
> >>>>>>>> + continue;
> >>>>>>>>
> >>>>>>>> set_current_state(TASK_INTERRUPTIBLE);
> >>>>>>>> if (signalled() || kthread_should_stop()) {
> >>>>>>>
> >>>>>>> --
> >>>>>>> Chuck Lever
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>
> >>>> --
> >>>> Chuck Lever
> >>>>
> >>>>
> >>>>
> >>
> >> --
> >> Chuck Lever
> >>
> >>
> >>
> On Sep 15, 2021, at 10:45 PM, Mike Javorski <[email protected]> wrote:
>
> Chuck:
>
> I see that the patch was merged to Linus' branch, but there have been
> 2 stable patch releases since and the patch hasn't been pulled in.
The v5.15 merge window just closed on Monday. I'm sure there is a
long queue of fixes to be pulled into stable. I'd give it more time.
> You
> mentioned I should reach out to the stable maintainers in this
> instance, is the [email protected] list the appropriate place to
> make such a request?
That could be correct, but the automated process has worked for me
long enough that I'm no longer sure how to make such a request.
--
Chuck Lever
Thanks for the insight, Chuck. I will give it a bit more time as you
suggest; in the meantime I will ask if the arch devs can pull the fix
into their kernel.
I didn't see anyone making "backport fixes" requests in the stable
list archive recently, so it doesn't seem to be the norm, but I don't
see any other methods.
- mike
On Thu, Sep 16, 2021 at 11:58 AM Chuck Lever III <[email protected]> wrote:
>
>
>
> > On Sep 15, 2021, at 10:45 PM, Mike Javorski <[email protected]> wrote:
> >
> > Chuck:
> >
> > I see that the patch was merged to Linus' branch, but there have been
> > 2 stable patch releases since and the patch hasn't been pulled in.
>
> The v5.15 merge window just closed on Monday. I'm sure there is a
> long queue of fixes to be pulled into stable. I'd give it more time.
>
>
> > You
> > mentioned I should reach out to the stable maintainers in this
> > instance, is the [email protected] list the appropriate place to
> > make such a request?
>
> That could be correct, but the automated process has worked for me
> long enough that I'm no longer sure how to make such a request.
>
> --
> Chuck Lever
>
>
>
On Thu, Sep 16, 2021 at 12:21:28PM -0700, Mike Javorski wrote:
> Thanks for the insight Chuck. I will give it a bit more time as you
> suggest; I will ask if the arch devs can pull in the fix to their
> kernel in the meantime.
> I didn't see anyone making "backport fixes" requests in the stable
> list archive recently, so it doesn't seem to be the norm, but I don't
> see any other methods.
Yes, I think you just send them mail.
But like Chuck says they're probably just working through a bit of a
backlog.
--b.
>
> - mike
>
> On Thu, Sep 16, 2021 at 11:58 AM Chuck Lever III <[email protected]> wrote:
> >
> >
> >
> > > On Sep 15, 2021, at 10:45 PM, Mike Javorski <[email protected]> wrote:
> > >
> > > Chuck:
> > >
> > > I see that the patch was merged to Linus' branch, but there have been
> > > 2 stable patch releases since and the patch hasn't been pulled in.
> >
> > The v5.15 merge window just closed on Monday. I'm sure there is a
> > long queue of fixes to be pulled into stable. I'd give it more time.
> >
> >
> > > You
> > > mentioned I should reach out to the stable maintainers in this
> > > instance, is the [email protected] list the appropriate place to
> > > make such a request?
> >
> > That could be correct, but the automated process has worked for me
> > long enough that I'm no longer sure how to make such a request.
> >
> > --
> > Chuck Lever
> >
> >
> >