Hello,
I'm currently analyzing an NFS performance problem and hope this is the right place to ask about it; kindly redirect me otherwise.
I initially thought it was a distribution-specific server issue, but my current observations point to an NFS client issue.
I created a test directory tree on the NFS server as follows:
mkdir -p /nfs/server && mkdir /nfs/server/testdir{0..99} && touch /nfs/server/testdir{0..99}/testfile{0..499} && touch /nfs/server/testdir{0..99}/{500..999}
This should create 100 directories with 1,000 files each, for a grand total of 100,000 files.
The NFS server exports it as follows:
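A quick sanity check of the resulting file count (just one way to do it, run on the server):
nfs.server.local$ find /nfs/server -type f | wc -l    # expected: 100000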
/nfs nfs.client.local(rw,sync,no_subtree_check,no_root_squash)
On the NFS server itself, "time du -shxc /nfs/server" takes around 0.4 seconds and reports a total of around 3.6M of data.
Now, the following is a shortened excerpt from an NFS client; "..." represents omitted data points similar to the surrounding ones, and "insgesamt" in the output is simply the German locale's word for "total".
Note that I reproduced this issue to varying degrees on different hardware and with different server/client combinations.
This example used Debian 11 (kernel 5.10.0-19-amd64) as the server and Ubuntu 22.04 (kernel 5.15.0-56-generic) as the client.
(I additionally tested with other Debian, Ubuntu, Fedora, and Arch Linux clients.)
----------------------------------------------------------------
nfs.client.local$ sudo umount /mnt; { sudo mount -t nfs -o 'nfsvers=4.2,rw,sync,acregmin=30,acdirmin=60,acregmax=120,acdirmax=120' nfs.server.local:/nfs/server /mnt ; { for i in {1..256}; do { echo "==== $i >> " ; /usr/bin/time -f "real %e seconds" -- du -shxc /mnt 2>&1; } | sed ':a;N;$!ba;s/\n/ | /g'; done } }; sudo umount /mnt;
==== 1 >> | 3,6M /mnt | 3,6M insgesamt | real 1.53 seconds
==== 2 >> | 3,6M /mnt | 3,6M insgesamt | real 0.35 seconds
==== 3 >> | 3,6M /mnt | 3,6M insgesamt | real 0.27 seconds
...
==== 105 >> | 3,6M /mnt | 3,6M insgesamt | real 0.27 seconds
==== 106 >> | 3,6M /mnt | 3,6M insgesamt | real 0.27 seconds
==== 107 >> | 3,6M /mnt | 3,6M insgesamt | real 24.96 seconds
==== 108 >> | 3,6M /mnt | 3,6M insgesamt | real 0.25 seconds
==== 109 >> | 3,6M /mnt | 3,6M insgesamt | real 0.25 seconds
...
==== 238 >> | 3,6M /mnt | 3,6M insgesamt | real 0.26 seconds
==== 239 >> | 3,6M /mnt | 3,6M insgesamt | real 0.26 seconds
==== 240 >> | 3,6M /mnt | 3,6M insgesamt | real 25.78 seconds
==== 241 >> | 3,6M /mnt | 3,6M insgesamt | real 0.25 seconds
==== 242 >> | 3,6M /mnt | 3,6M insgesamt | real 0.25 seconds
...
----------------------------------------------------------------
What I observe now is that:
1) The very first operation after mounting (see [*1]) takes around 1.53 seconds, which is expected to be slower since the network is involved and nothing is cached yet.
2) The second and subsequent operations take around 0.3 seconds, faster than the first because the local cache is now populated.
3) The 107th (and 240th) operations are slower; they occur when the cache has become older than acregmin=30, understandably.
4) However, the 107th (and 240th) operations are much, much slower than the initial first operation that filled the cache.
[*1] The fast cache-filling behaviour occurs only after a fresh mount (which is why my test tries to ensure a fresh mount) OR for files newly created on the server (e.g. a new directory parallel to the one that was requested previously). I can expand on that if requested, but I am unsure whether it is relevant to the core issue at hand.
I have debugged this issue with tcpdump and noticed that the first operation(s) use READDIR while the later ones do not (they use individual GETATTR requests instead).
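The per-operation counts can also be checked without packet captures, assuming nfsstat from nfs-utils is available on the client, by snapshotting the client counters around a single run and diffing them (the readdir vs. getattr deltas tell the same story):
nfs.client.local$ nfsstat -c -4 > /tmp/ops.before; du -shxc /mnt > /dev/null; nfsstat -c -4 > /tmp/ops.after; diff /tmp/ops.before /tmp/ops.after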
I have toyed around with several mount options quite a bit, but have not found a way around this.
However much I increase actimeo (or its derived options acregmin, acdirmin, acregmax, acdirmax), the initial cache fill seems to be fast, but the refill after the cache times out is slow.
This in turn makes it hard to choose reasonable values for actimeo (or its derived options) because the production system in question uses NFS to serve home directories.
One of the real directories, for example, holds about 4 GB and around 100,000 files.
Others are much larger in volume (400 GB) but at least in a similar range of file count (500,000).
I initially observed these very slow, supposed-to-be-cache-filling operations taking longer than the cache timeout itself, so the cache never really became valid at all, resulting in very poor performance.
That can be reproduced artificially for observation in the above example by mounting with actimeo=3, for example:
----------------------------------------------------------------
nfs.client.local$ sudo umount /mnt; { sudo mount -t nfs -o 'nfsvers=4.2,rw,sync,actimeo=3' nfs.server.local:/nfs/server /mnt ; { for i in {1..256}; do { echo "==== $i >> " ; /usr/bin/time -f "real %e seconds" -- du -shxc /mnt 2>&1; } | sed ':a;N;$!ba;s/\n/ | /g'; done } }; sudo umount /mnt;
umount: /mnt: not mounted.
==== 1 >> | 3.6M /mnt | 3.6M total | real 1.34 seconds
==== 2 >> | 3.6M /mnt | 3.6M total | real 0.67 seconds
==== 3 >> | 3.6M /mnt | 3.6M total | real 0.61 seconds
==== 4 >> | 3.6M /mnt | 3.6M total | real 0.54 seconds
==== 5 >> | 3.6M /mnt | 3.6M total | real 61.25 seconds
==== 6 >> | 3.6M /mnt | 3.6M total | real 62.69 seconds
==== 7 >> | 3.6M /mnt | 3.6M total | real 59.70 seconds
...
----------------------------------------------------------------
I _think_ that I have sometimes (randomly?) observed it successfully filling the cache in that or similar scenarios, but was not able to reproduce that on purpose.
That might have been on the production system, where a concurrent request managed to fill the cache.
From what I gathered around the internet and understood, there seem to be heuristics involved when the client decides which operations to transmit to the server.
Also, the timed-out cache seems to be creating what some have called a "getattr storm", which I understand in theory.
But why does the first request manage to be smarter about it, since it gathers the same information about the exact same files?
I would be happy if I could maintain the initial non-cached time (about 1.5 seconds in the examples above), but none of "noac", "lookupcache=none", or "actimeo=0" seems to let me achieve that.
Is there a way to improve that situation, and if so, how?
best regards
Theodor Mittermair
On 5 Dec 2022, at 21:18, Theodor Mittermair wrote:
> Hello,
Hi Theodor,
.. snip ..
> From what I gathered around the internet and understood, there seem to be
> heuristics involved when the client decides which operations to transmit to
> the server. Also, the timed-out cache seems to be creating what some have
> called a "getattr storm", which I understand in theory.
When `du` gathers information, it does so by switching between two syscalls:
getdents() and stat() (or some equivalents). The getdents() syscall causes
the NFS client to perform either READDIR or READDIRPLUS - the choice of
which is governed by a heuristic. The heuristic can only intelligently
determine which readdir operation to use based on whether the program is
performing this pattern of getdents(), stat(), stat(), stat(), getdents(),
stat(), stat(), stat(). The way it can tell is by checking if each inode's
attributes have been cached, so the cache timeouts end up coming into play.
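If you want to see that interleaving from user space, tracing just those syscalls during one pass should show it; depending on the coreutils/glibc versions, du's stat() may arrive as newfstatat or statx, so I'd match on all of them:
nfs.client.local$ strace -e trace=getdents64,newfstatat,statx du -shxc /mnt 2>&1 | head -n 40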
> But why does the first request manage to be smarter about it, since it
> gathers the same information about the exact same files?
It's not smarter, it just optimistically uses READDIRPLUS on the very first
call of getdents() for a directory, but can only do so if the directory's
dentries have not yet been cached. If they /are/ cached, but each dentry's
individual attributes have timed out, then the client must send an
individual GETATTR for each entry.
What is happening for you is that your attribute caches for each inode are
timing out, but the overall directory's dentry list is not changing.
There's no need to send /any/ readdir operations - so the heuristic doesn't
send READDIRPLUS and you end up with one full pile of getdents() results of
individual GETATTRs for every entry.
If your server is returning a large dtpref (the preferred data transfer size
for readdir) and there's some latency on round-trip operations, you'll see
this stack up quickly into exactly the results you've presented.
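As a rough back-of-the-envelope check (the round-trip time here is an assumption, not something from your data): at ~0.25 ms per round trip, 100,000 sequential GETATTRs come to roughly 25 seconds, which lines up with the slow passes you measured, while the initial READDIRPLUS fill needs only on the order of a few hundred round trips for the same entries.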
There's a patch that may go into v6.2 to help this:
https://lore.kernel.org/linux-nfs/[email protected]/
.. if you have the ability to test it in your setup, I'd be interested in
the results.
This heuristic's behavior is becoming harder to change, because over time
many setups have come to depend on certain performance characteristics, and
changes in this area create unexpected performance regressions.
> I would be happy if I could maintain the initial non-cached time (about 1.5
> seconds in the examples above), but none of "noac", "lookupcache=none", or
> "actimeo=0" seems to let me achieve that.
>
> Is there a way to improve that situation, and if so, how?
Hopefully, the above patch will help. We've all had wild ideas: maybe we
should also only do uncached readdir if lookupcache=none? It's a bit
surprising that you'd opt to forgo all caching just to optimize this `du`
case. I don't think that's what you want, as it will negatively impact
other workloads.
I also think that if you were to dump all the directories' page caches
in between your calls to `du`, you'd get performance consistent with your
first pass... something with POSIX_FADV_DONTNEED to fadvise(), but I'd be
leery of depending on this behavior, since it's only a hint.
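A blunter, system-wide experiment (not the per-directory fadvise idea, and certainly not a production fix) would be to drop the client's page, dentry and inode caches between runs, which should push the heuristic back onto the uncached READDIRPLUS path at the cost of throwing away far more state than you'd want on a busy client:
nfs.client.local$ sync; echo 3 | sudo tee /proc/sys/vm/drop_caches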
I also wonder if glibc might be willing to check a hint (like an
environment variable?) about how big a buffer to pass to getdents(), since I
suspect that might also be nice for some FUSE filesystems.
Ben
> On Dec 6, 2022, at 8:21 AM, Benjamin Coddington <[email protected]> wrote:
>
> On 5 Dec 2022, at 21:18, Theodor Mittermair wrote:
>
>> Hello,
>
> Hi Theodor,
>
> .. snip ..
>
>> From what I gathered around the internet and understood, there seem to be
>> heuristics involved when the client decides which operations to transmit to
>> the server. Also, the timed-out cache seems to be creating what some have
>> called a "getattr storm", which I understand in theory.
>
> When `du` gathers information, it does so by switching between two syscalls:
> getdents() and stat() (or some equivalents). The getdents() syscall causes
> the NFS client to perform either READDIR or READDIRPLUS - the choice of
> which is governed by a heuristic. The heuristic can only intelligently
> determine which readdir operation to use based on whether the program is
> performing this pattern of getdents(), stat(), stat(), stat(), getdents(),
> stat(), stat(), stat(). The way it can tell is by checking if each inode's
> attributes have been cached, so the cache timeouts end up coming into play.
>
>> But why does the first request manage to be smarter about it, since it
>> gathers the same information about the exact same files?
>
> It's not smarter, it just optimistically uses READDIRPLUS on the very first
> call of getdents() for a directory, but can only do so if the directory's
> dentries have not yet been cached. If they /are/ cached, but each dentry's
> individual attributes have timed out, then the client must send an
> individual GETATTR for each entry.
>
> What is happening for you is that your attribute caches for each inode are
> timing out, but the overall directory's dentry list is not changing.
> There's no need to send /any/ readdir operations - so the heuristic doesn't
> send READDIRPLUS and you end up with one full pile of getdents() results of
> individual GETATTRs for every entry.
Are those GETATTRs emitted one at a time sequentially, or concurrently?
--
Chuck Lever
On 6 Dec 2022, at 9:32, Chuck Lever III wrote:
>> On Dec 6, 2022, at 8:21 AM, Benjamin Coddington <[email protected]> wrote:
>>
>> On 5 Dec 2022, at 21:18, Theodor Mittermair wrote:
>>
>>> Hello,
>>
>> Hi Theodor,
>>
>> .. snip ..
>>
>>> From what I gathered around the internet and understood, there seem to be
>>> heuristics involved when the client decides which operations to transmit to
>>> the server. Also, the timed-out cache seems to be creating what some have
>>> called a "getattr storm", which I understand in theory.
>>
>> When `du` gathers information, it does so by switching between two syscalls:
>> getdents() and stat() (or some equivalents). The getdents() syscall causes
>> the NFS client to perform either READDIR or READDIRPLUS - the choice of
>> which is governed by a heuristic. The heuristic can only intelligently
>> determine which readdir operation to use based on whether the program is
>> performing this pattern of getdents(), stat(), stat(), stat(), getdents(),
>> stat(), stat(), stat(). The way it can tell is by checking if each inode's
>> attributes have been cached, so the cache timeouts end up coming into play.
>>
>>> But why does the first request manage to be smarter about it, since it
>>> gathers the same information about the exact same files?
>>
>> It's not smarter, it just optimistically uses READDIRPLUS on the very first
>> call of getdents() for a directory, but can only do so if the directory's
>> dentries have not yet been cached. If they /are/ cached, but each dentry's
>> individual attributes have timed out, then the client must send an
>> individual GETATTR for each entry.
>>
>> What is happening for you is that your attribute caches for each inode are
>> timing out, but the overall directory's dentry list is not changing.
>> There's no need to send /any/ readdir operations - so the heuristic doesn't
>> send READDIRPLUS and you end up with one full pile of getdents() results of
>> individual GETATTRs for every entry.
>
> Are those GETATTRs emitted one at a time sequentially, or concurrently?
Sequentially, since `du` loops over the getdents() entries, calling stat() on each.
Ben