Hello,
I often take snapshots in order to move kvm VMs from one nfs share to
another while they're running, or to take backups. Sometimes I have very
large VMs (1.1 TB) which take a very long time (40 minutes - 2 hours) to
back up or move. They also write between 20 - 60 GB of data while being
backed up or moved. Once the backup or move is done, the dirty snapshot
data needs to be merged into the parent disk. While doing this I often
experience I/O stalls within the VMs in the range of 1 - 20 seconds,
sometimes worse. I have some very latency-sensitive VMs which crash or
misbehave after 15-second I/O stalls, so I would like to know if there
is some tuning I can do to make these I/O stalls shorter.
- I already tried setting vm.dirty_expire_centisecs=100, which appears to
  make it better, but it does not keep the stalls under 15 seconds. Ideally,
  I/O stalls would be no longer than 1 second.
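For reference, this is how I set it (a sketch; the sysctl.d file name is just
an example):

# apply at runtime
sysctl -w vm.dirty_expire_centisecs=100
# persist across reboots
echo 'vm.dirty_expire_centisecs = 100' > /etc/sysctl.d/90-writeback.conf
sysctl --system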
This is how you can reproduce the issue:
- NFS Server:
mkdir /ssd
apt install -y nfs-kernel-server
echo '/ssd 0.0.0.0/0.0.0.0(rw,no_root_squash,no_subtree_check,sync)' > /etc/exports
exportfs -ra
- NFS Client / KVM Host:
mount server:/ssd /mnt
# Put a VM on /mnt and start it.
# Create a snapshot:
virsh snapshot-create-as --domain testy guest-state1 --diskspec vda,file=/mnt/overlay.qcow2 --disk-only --atomic --no-metadata
- In the VM:
# Write some data (in my case 6 GB of data are written in 60 seconds due
# to the nfs client being connected with a 1 Gbit/s link)
fio --ioengine=libaio --filesize=32G --ramp_time=2s --runtime=1m --numjobs=1 --direct=1 --verify=0 --randrepeat=0 --group_reporting --directory=/mnt --name=write --blocksize=1m --iodepth=1 --readwrite=write --unlink=1
# Do some synchronous I/O
while true; do date | tee -a date.log; sync; sleep 1; done
- On the NFS Client / KVM host:
# Merge the snapshot into the parent disk
time virsh blockcommit testy vda --active --pivot --delete
Successfully pivoted
real 1m4.666s
user 0m0.017s
sys 0m0.007s
I exported the nfs share with sync on purpose because I often use drbd
in sync mode (protocol C) to replicate the data on the nfs server to a
site which is 200 km away using a 10 Gbit/s link.
The result is:
(testy) [~] while true; do date | tee -a date.log; sync; sleep 1; done
Sun May 5 12:53:36 CEST 2024
Sun May 5 12:53:37 CEST 2024
Sun May 5 12:53:38 CEST 2024
Sun May 5 12:53:39 CEST 2024
Sun May 5 12:53:40 CEST 2024
Sun May 5 12:53:41 CEST 2024 < here I started virsh blockcommit
Sun May 5 12:53:45 CEST 2024
Sun May 5 12:53:50 CEST 2024
Sun May 5 12:53:59 CEST 2024
Sun May 5 12:54:04 CEST 2024
Sun May 5 12:54:22 CEST 2024
Sun May 5 12:54:23 CEST 2024
Sun May 5 12:54:27 CEST 2024
Sun May 5 12:54:32 CEST 2024
Sun May 5 12:54:40 CEST 2024
Sun May 5 12:54:42 CEST 2024
Sun May 5 12:54:45 CEST 2024
Sun May 5 12:54:46 CEST 2024
Sun May 5 12:54:47 CEST 2024
Sun May 5 12:54:48 CEST 2024
Sun May 5 12:54:49 CEST 2024
This is with 'vm.dirty_expire_centisecs=100'; with the default value
'vm.dirty_expire_centisecs=3000' it is worse.
I/O stalls:
- 4 seconds
- 9 seconds
- 5 seconds
- 18 seconds
- 4 seconds
- 5 seconds
- 8 seconds
- 2 seconds
- 3 seconds
With the default vm.dirty_expire_centisecs=3000 I get something like this:
(testy) [~] while true; do date | tee -a date.log; sync; sleep 1; done
Sun May 5 11:51:33 CEST 2024
Sun May 5 11:51:34 CEST 2024
Sun May 5 11:51:35 CEST 2024
Sun May 5 11:51:37 CEST 2024
Sun May 5 11:51:38 CEST 2024
Sun May 5 11:51:39 CEST 2024
Sun May 5 11:51:40 CEST 2024 << virsh blockcommit
Sun May 5 11:51:49 CEST 2024
Sun May 5 11:52:07 CEST 2024
Sun May 5 11:52:08 CEST 2024
Sun May 5 11:52:27 CEST 2024
Sun May 5 11:52:45 CEST 2024
Sun May 5 11:52:47 CEST 2024
Sun May 5 11:52:48 CEST 2024
Sun May 5 11:52:49 CEST 2024
I/O stalls:
- 9 seconds
- 18 seconds
- 19 seconds
- 18 seconds
- 1 second
I'm open to any suggestions which improve the situation. I often have a 10
Gbit/s network and a lot of dirty buffer cache, but at the same time I often
replicate synchronously to a second site 200 km away, which only gives me
around 100 MB/s write performance.
With vm.dirty_expire_centisecs=10 it is even worse:
(testy) [~] while true; do date | tee -a date.log; sync; sleep 1; done
Sun May 5 13:25:31 CEST 2024
Sun May 5 13:25:32 CEST 2024
Sun May 5 13:25:33 CEST 2024
Sun May 5 13:25:34 CEST 2024
Sun May 5 13:25:35 CEST 2024
Sun May 5 13:25:36 CEST 2024
Sun May 5 13:25:37 CEST 2024 < virsh blockcommit
Sun May 5 13:26:00 CEST 2024
Sun May 5 13:26:01 CEST 2024
Sun May 5 13:26:06 CEST 2024
Sun May 5 13:26:11 CEST 2024
Sun May 5 13:26:40 CEST 2024
Sun May 5 13:26:42 CEST 2024
Sun May 5 13:26:43 CEST 2024
Sun May 5 13:26:44 CEST 2024
I/O stalls:
- 23 seconds
- 5 seconds
- 5 seconds
- 29 seconds
- 1 second
Cheers,
Thomas
On 5 May 2024, at 7:29, Thomas Glanzmann wrote:
> [...]
> - NFS Client / KVM Host:
> mount server:/ssd /mnt
> # Put a VM on /mnt and start it.
> # Create a snapshot:
> virsh snapshot-create-as --domain testy guest-state1 --diskspec vda,file=/mnt/overlay.qcow2 --disk-only --atomic --no-metadata
What NFS version ends up getting mounted here? You might eliminate some
head-of-line blocking issues with the "nconnect=16" mount option to open
additional TCP connections.
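For example, on the client (a sketch; server and path are the ones from your
reproduction steps):

mount -t nfs -o nconnect=16 server:/ssd /mnt
# note: nconnect takes effect when the client first mounts from that server;
# remounting an existing mount will not change it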
My view of what could be happening is that the IO from your guest's process
is congesting with the IO from your 'virsh blockcommit' process, and we
don't currently have a great way to classify and queue IO from various
sources in various ways.
Ben
On Sun, 2024-05-05 at 13:29 +0200, Thomas Glanzmann wrote:
> [...]
Two suggestions (a minimal sketch of both follows):
1. Try mounting the NFS partition on which these VMs reside with the
   "write=eager" mount option. That ensures that the kernel kicks
   off the write of the block immediately once QEMU has scheduled it
   for writeback. Note, however, that the kernel does not wait for
   that write to complete (i.e. these writes are all asynchronous).
2. Alternatively, try playing with the 'vm.dirty_ratio' or
   'vm.dirty_bytes' values in order to trigger writeback at an
   earlier time. With the default value of vm.dirty_ratio=20, you
   can end up caching up to 20% of your total memory's worth of
   dirty data before the VM triggers writeback over that 1 Gbit/s link.
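A minimal sketch of both, with example values (not tuned recommendations):

# 1. eager writeback on the NFS client mount
mount -t nfs -o nconnect=16,write=eager server:/ssd /mnt
# 2. trigger writeback earlier than the vm.dirty_ratio=20 default
sysctl -w vm.dirty_ratio=5
sysctl -w vm.dirty_background_ratio=2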
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
Hello Ben and Trond,
> On 5 May 2024, at 7:29, Thomas Glanzmann wrote (paraphrased):
> When committing 20 - 60 GB snapshots of kvm VMs stored on NFS, I get 20+
> second I/O stalls.
> When doing backups and migrations with kvm on NFS I get I/O stalls in
> the guest. How can I avoid that?
* Benjamin Coddington <[email protected]> [2024-05-06 13:25]:
> What NFS version ends up getting mounted here?
NFS 4.2 (the output below already has your and Trond's options added):
172.31.0.1:/nfs on /mnt type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,nconnect=16,timeo=600,retrans=2,sec=sys,clientaddr=172.31.0.6,local_lock=none,write=eager,addr=172.31.0.1)
> You might eliminate some head-of-line blocking issues with the
> "nconnect=16" mount option to open additional TCP connections.
> My view of what could be happening is that the IO from your guest's process
> is congesting with the IO from your 'virsh blockcommit' process, and we
> don't currently have a great way to classify and queue IO from various
> sources in various ways.
Thank you for reminding me of nconnect. I evaluated it with VMware ESX, saw no
benefit when benchmarking it with a single VM, and dismissed it. But of course
it makes sense when there is more than one concurrent I/O stream.
* Trond Myklebust <[email protected]> [2024-05-06 15:47]:
> Two suggestions:
> 1. Try mounting the NFS partition on which these VMs reside with the
> "write=eager" mount option. That ensures that the kernel kicks
> off the write of the block immediately once QEMU has scheduled it
> for writeback. Note, however, that the kernel does not wait for
> that write to complete (i.e. these writes are all asynchronous).
> 2. Alternatively, try playing with the 'vm.dirty_ratio' or
> 'vm.dirty_bytes' values in order to trigger writeback at an
> earlier time. With the default value of vm.dirty_ratio=20, you
> can end up caching up to 20% of your total memory's worth of
> dirty data before the VM triggers writeback over that 1 Gbit/s link.
Thank you for the write=eager option; I was not aware of it. I often run into
problems where a 10 Gbit/s network pipe fills up my buffer cache, which then
tries to destage 128 GB * 0.2 = 25.6 GB of dirty data to disks that can't keep
up in my case, resulting in long I/O stalls. Usually my disks can take between
100 MB/s (synchronously replicated drbd link, 200 km) and 500 MB/s (SATA SSDs).
I tried to tell the kernel to destage faster (vm.dirty_expire_centisecs=100),
which improved some workloads but not all.
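As an illustration of that sizing, capping dirty data in absolute terms keeps
the backlog to a few seconds' worth at ~100 MB/s (example values only; setting
the *_bytes sysctls overrides the corresponding *_ratio ones):

# ~1 GiB hard cap, roughly 10 seconds to drain at 100 MB/s
sysctl -w vm.dirty_bytes=1073741824
# start background writeback at ~256 MiB
sysctl -w vm.dirty_background_bytes=268435456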
So, I think I found a solution to my problem by doing the following:
- Increase NFSD threads to 128:
cat > /etc/nfs.conf.d/storage.conf <<'EOF'
[nfsd]
threads = 128
[mountd]
threads = 8
EOF
echo 128 > /proc/fs/nfsd/threads
- Mount the nfs volume with -o nconnect=16,write=eager
- Use iothreads and cache=none.
<iothreads>2</iothreads>
<driver name='qemu' type='qcow2' cache='none' discard='unmap' iothread='1'/>
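For context, a sketch of where those two elements sit in the domain XML (the
disk path and target below are placeholders, not my exact config):

<domain type='kvm'>
  <!-- ... -->
  <iothreads>2</iothreads>
  <devices>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none' discard='unmap' iothread='1'/>
      <source file='/mnt/testy.qcow2'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <!-- ... -->
  </devices>
</domain>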
By doing the above I no longer see any I/O stalls longer than one second (at
most a 2 second gap between timestamps in my date loop).
Thank you two again for helping me out with this.
Cheers,
Thomas
PS: With cache=writethrough and without I/O threads, guest I/O stalls for the entire time the blockcommit runs.