2019-09-20 18:41:19

by Alkis Georgopoulos

Subject: Re: rsize,wsize=1M causes severe lags in 10/100 Mbps

On 9/20/19 1:16 AM, Daniel Forrest wrote:
>>> What may be happening here is something I have noticed with glibc.
>>>
>>> - statfs reports the rsize/wsize as the block size of the filesystem.
>>>
>>> - glibc uses the block size as the default buffer size for
>>> fread/fwrite.
>>>
>>> If an application is using fread/fwrite on an NFS mounted file with
>>> an rsize/wsize of 1M it will try to fill a 1MB buffer.
>>>
>>> I have often changed mounts to use rsize/wsize=64K to alleviate this.
>>
>> That sounds like an abuse of the filesystem block size. There is
>> nothing in the POSIX definition of either fread() or fwrite() that
>> requires glibc to do this:
>> https://pubs.opengroup.org/onlinepubs/9699919799/functions/fread.html
>> https://pubs.opengroup.org/onlinepubs/9699919799/functions/fwrite.html
>>
>
> It looks like this was fixed in glibc 2.25:
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=4099


This is likely not the exact issue I'm experiencing, as I'm testing
with e.g. glibc 2.27-3ubuntu1 on Ubuntu 18.04 and kernel 5.0, which
should already include that fix.
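
For reference, the block size that statfs reports on the mount, and the
read() sizes a reader actually issues, can be checked with something
like this (the mount point and test file are just examples from my
setup):

# block size reported by statfs for the NFS mount
# (this is what older glibc used to size stdio buffers)
stat -f -c 'statfs block size: %s bytes' /mnt

# sizes of the read() syscalls actually hitting the mount
strace -e trace=read -o /tmp/reads.log cat /mnt/bigfile > /dev/null
awk '/^read/ { print $NF }' /tmp/reads.log | sort -n | uniq -c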

New benchmark, measuring the boot time of a netbooted client,
from right after the kernel is loaded to the display manager screen:

1) On 10 Mbps:
a) tcp,timeo=600,rsize=32K: 304 secs
b) tcp,timeo=600,rsize=1M: 618 secs

2) On 100 Mbps:
a) tcp,timeo=600,rsize=32K: 40 secs
b) tcp,timeo=600,rsize=1M: 84 secs

3) On 1000 Mbps:
a) tcp,timeo=600,rsize=32K: 20 secs
b) tcp,timeo=600,rsize=1M: 24 secs

32K is always faster, even on full gigabit.
On gigabit, disk access had to be *significantly* faster for the boot
time to drop by 4 seconds. In the 10/100 cases, rsize=1M is pretty much
unusable.
There are no writes involved over NFS; they go to a local
tmpfs/overlayfs.
Would it make sense for me to measure the *boot bandwidth* in each case,
to see if more data (readahead) is downloaded with rsize=1M?
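
(In case it helps, the way I'd measure that is to read the NIC byte
counter on the client right when the display manager appears, and/or
the NFS client op counters; eth0 is just an example interface name:)

# total bytes received since boot, read on the client
cat /sys/class/net/eth0/statistics/rx_bytes

# NFS client RPC/op counters
nfsstat -c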

I can do whatever benchmarks and test whatever parameters you tell me
to, but I do not know the NFS/kernel internals to be able to explain why
this happens.

The reason I investigated this is that I developed the new version of
ltsp.org (GPL netbooting software), where we switched from
squashfs-over-NBD to squashfs-over-NFS, and netbooting was extremely
slow until I lowered rsize to 32K. So I thought I'd share my findings,
in case a lower rsize makes a better default for everyone (or reveals a
problem elsewhere).
With rsize=32K, squashfs-over-NFS is as speedy as squashfs-over-NBD, but
a lot more stable.

Of course the same rsize findings apply to an NFS /home too (mounted
without nfsmount), or to plain transfers of large or small files, not
just to /.

Btw,
https://www.kernel.org/doc/Documentation/filesystems/nfs/nfsroot.txt
says the kernel nfsroot defaults are timeo=7,rsize=4096,wsize=4096. This
is about the internal kernel netbooting support, not klibc nfsmount;
but I haven't tested it, as it would involve compiling a kernel with my
NIC driver built in.
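
If I ever do test the in-kernel nfsroot path, I guess the command line
would look something like this (server/path from my setup; the options
are only illustrative, I haven't verified what the in-kernel parser
accepts):

root=/dev/nfs ip=dhcp nfsroot=10.161.254.11:/srv/ltsp,tcp,rsize=32768,wsize=32768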

Thank you,
Alkis Georgopoulos
LTSP developer


2019-09-20 19:08:49

by Alkis Georgopoulos

Subject: Re: rsize,wsize=1M causes severe lags in 10/100 Mbps

On 9/20/19 12:25 PM, Alkis Georgopoulos wrote:
> This is likely not the exact issue I'm experiencing, as I'm testing
> with e.g. glibc 2.27-3ubuntu1 on Ubuntu 18.04 and kernel 5.0, which
> should already include that fix.
>
> New benchmark, measuring the boot time of a netbooted client,
> from right after the kernel is loaded to the display manager screen:
>
> 1) On 10 Mbps:
> a) tcp,timeo=600,rsize=32K: 304 secs
> b) tcp,timeo=600,rsize=1M: 618 secs
>
> 2) On 100 Mbps:
> a) tcp,timeo=600,rsize=32K: 40 secs
> b) tcp,timeo=600,rsize=1M: 84 secs
>
> 3) On 1000 Mbps:
> a) tcp,timeo=600,rsize=32K: 20 secs
> b) tcp,timeo=600,rsize=1M: 24 secs
>
> 32K is always faster, even on full gigabit.
> On gigabit, disk access had to be *significantly* faster for the boot
> time to drop by 4 seconds. In the 10/100 cases, rsize=1M is pretty much
> unusable.
> There are no writes involved over NFS; they go to a local
> tmpfs/overlayfs.
> Would it make sense for me to measure the *boot bandwidth* in each case,
> to see if more data (readahead) is downloaded with rsize=1M?


I did test the boot bandwidth (I mean how many MB were transferred).
On ext4-over-NFS, with tmpfs-and-overlayfs to make root writable:

2) On 100 Mbps:
a) tcp,timeo=600,rsize=32K: 471MB
b) tcp,timeo=600,rsize=1M: 1250MB

So it is indeed slower because it's transferring more data than the
client needs.
Maybe it's a different or a new aspect of the readahead issue that you
guys mentioned above.
Is it possible that NFS is always sending 1 MB chunks even when the
actual data needed is smaller?
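
(If it helps, I guess the actual average READ reply size can be seen
from the client's per-mount counters; I'm assuming ops are field 2 and
bytes received are field 6 on the READ: line of /proc/self/mountstats:)

awk '$1 == "READ:" && $2 > 0 {
    printf "avg READ reply: %.0f bytes over %d ops\n", $6/$2, $2
}' /proc/self/mountstats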

If you want me to test more things, I can;
if you consider it a problem with glibc etc. that shouldn't involve
this mailing list, I can try to report it there...

Thank you,
Alkis Georgopoulos

2019-09-20 19:09:21

by Alkis Georgopoulos

Subject: Re: rsize,wsize=1M causes severe lags in 10/100 Mbps

On 9/20/19 12:48 PM, Alkis Georgopoulos wrote:
> I did test the boot bandwidth (I mean how many MB were transferred).
> On ext4-over-NFS, with tmpfs-and-overlayfs to make root writable:


I also tested with the kernel netbooting default of rsize=4K to compare.
All on 100 Mbps, tcp,timeo=600:

| rsize | MB to boot | sec to boot |
|-------|------------|-------------|
| 1M    |       1250 |          84 |
| 32K   |        471 |          40 |
| 4K    |        320 |          31 |
| 2K    |        355 |          34 |

It appears that matching rsize to the filesystem cluster size (4K)
gives the best results.
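
So for 100 Mbps I'm currently leaning towards mounting with something
like (server/path from my setup):

mount -t nfs -o tcp,timeo=600,rsize=4096,wsize=4096 \
    10.161.254.11:/srv/ltsp /mnt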

Thank you,
Alkis Georgopoulos

2019-09-22 19:15:19

by Alkis Georgopoulos

Subject: Re: rsize,wsize=1M causes severe lags in 10/100 Mbps

I think it's caused by the kernel readahead, not glibc readahead.
TL;DR: This solves the problem:
echo 4 > /sys/devices/virtual/bdi/0:58/read_ahead_kb

Question: how to configure NFS/kernel to automatically set that?

Long version:
Doing step (4) below results in a tremendous speedup:

1) mount -t nfs -o tcp,timeo=600,rsize=1048576,wsize=1048576 \
   10.161.254.11:/srv/ltsp /mnt

2) cat /proc/fs/nfsfs/volumes
We see the DEV number from there, e.g. 0:58

3) cat /sys/devices/virtual/bdi/0:58/read_ahead_kb
15360
I assume this means the kernel will try to read ahead up to 15 MB for
each accessed file; 15360 KB appears to be 15 x the 1 MB rsize. *THIS
IS THE PROBLEM*. For non-NFS devices, this value is 128 (KB).

4) echo 4 > /sys/devices/virtual/bdi/0:58/read_ahead_kb

5) Test. Traffic should now be a *lot* lower, and speed a *lot* higher.
E.g. my NFS booting tests:
- read_ahead_kb=15360 (the default) => 1160 MB traffic to boot
- read_ahead_kb=128 => 324 MB traffic
- read_ahead_kb=4 => 223 MB traffic

So the question that remains is how to properly configure either NFS or
the kernel to use small readahead values for NFS mounts.

I'm currently doing it with this workaround:
for f in $(awk '/^v[0-9]/ { print $4 }' < /proc/fs/nfsfs/volumes); do
    echo 4 > "/sys/devices/virtual/bdi/$f/read_ahead_kb"
done
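
An untested sketch of how this might be automated with udev, assuming
bdi devices get an "add" uevent and that the anonymous 0:* ones are the
NFS/virtual mounts (the rule file name and the 128 value are arbitrary):

# /etc/udev/rules.d/90-nfs-readahead.rules (untested sketch)
# Note: 0:* also covers other virtual filesystems (tmpfs, overlayfs...),
# so this caps their readahead too.
SUBSYSTEM=="bdi", ACTION=="add", KERNEL=="0:*", ATTR{read_ahead_kb}="128"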

Thanks,
Alkis