LinuxLists.cc - Qemu KVM thread spins at 100% CPU usage on scsi hot-unplug (kernel 6.6.8 guest)

2023-12-28 02:10:30

Subject: Qemu KVM thread spins at 100% CPU usage on scsi hot-unplug (kernel 6.6.8 guest)

Hi all,

I'm trying to summarise what I'm seeing - please feel free to contact me directly for any further information that I may
have missed. I'm also not subscribed to either kernel.org mailing list, so please CC me in any replies.

History:
At some point in kernel 6.6.x, SCSI hotplug in qemu VMs broke. This was mostly fixed by the following commit, which was
included in release 6.6.8:
commit 5cc8d88a1b94b900fd74abda744c29ff5845430b
Author: Bjorn Helgaas <[email protected]>
Date: Thu Dec 14 09:08:56 2023 -0600
Revert "PCI: acpiphp: Reassign resources on bridge if necessary"

After this commit, the SCSI block device is hotplugged correctly, and a device node such as /dev/sdX appears within the qemu VM.

New problem:

When the same SCSI block device is hot-unplugged, the QEMU KVM process will spin at 100% CPU usage. The guest shows no
CPU being used via top, but the host will continue to spin in the KVM thread until the VM is rebooted.

Further information:

Guest: Fedora 39 with kernel 6.6.8 packages from:
https://koji.fedoraproject.org/koji/buildinfo?buildID=2336239

Host: Proxmox 8.1.3 with kernel 6.5.11-7-pve

Messages when a drive is hot-plugged to the guest via:
# qm set 104 -scsi1 /dev/sde

Dec 21 19:44:02 kernel: pci 0000:09:02.0: [1af4:1004] type 00 class 0x010000
Dec 21 19:44:02 kernel: pci 0000:09:02.0: reg 0x10: [io 0x0000-0x003f]
Dec 21 19:44:02 kernel: pci 0000:09:02.0: reg 0x14: [mem 0x00000000-0x00000fff]
Dec 21 19:44:02 kernel: pci 0000:09:02.0: reg 0x20: [mem 0x00000000-0x00003fff 64bit pref]
Dec 21 19:44:02 kernel: pci 0000:09:02.0: BAR 4: assigned [mem 0xc080004000-0xc080007fff 64bit pref]
Dec 21 19:44:02 kernel: pci 0000:09:02.0: BAR 1: assigned [mem 0xc1801000-0xc1801fff]
Dec 21 19:44:02 kernel: pci 0000:09:02.0: BAR 0: assigned [io 0x6040-0x607f]
Dec 21 19:44:02 kernel: virtio-pci 0000:09:02.0: enabling device (0000 -> 0003)
Dec 21 19:44:02 kernel: scsi host7: Virtio SCSI HBA
Dec 21 19:44:02 kernel: scsi 7:0:0:1: Direct-Access QEMU QEMU HARDDISK 2.5+ PQ: 0 ANSI: 5
Dec 21 19:44:02 kernel: sd 7:0:0:1: Power-on or device reset occurred
Dec 21 19:44:02 kernel: sd 7:0:0:1: Attached scsi generic sg1 type 0
Dec 21 19:44:02 kernel: sd 7:0:0:1: LUN assignments on this target have changed. The Linux SCSI layer does not automatically remap LUN assignments.
Dec 21 19:44:02 kernel: sd 7:0:0:1: [sdb] 3906994318 512-byte logical blocks: (2.00 TB/1.82 TiB)
Dec 21 19:44:02 kernel: sd 7:0:0:1: [sdb] Write Protect is off
Dec 21 19:44:02 kernel: sd 7:0:0:1: [sdb] Mode Sense: 63 00 00 08
Dec 21 19:44:02 kernel: sd 7:0:0:1: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Dec 21 19:44:02 kernel: sd 7:0:0:1: [sdb] Attached SCSI disk

Device node is then available as /dev/sdb as expected.

Hot-unplugging the device in proxmox is done via:
# /usr/sbin/qm set 104 --delete scsi1

where 104 is the VM ID within the proxmox host. I have been trawling through the Perl code for the `qm` utility to
see how that translates to a qemu command, but haven't nailed anything down yet. The code for the qm utility is here:
https://git.proxmox.com/?p=qemu-server.git;a=tree;h=refs/heads/master;hb=refs/heads/master

After the qm command is executed, the device node disappears correctly from the running VM, and the VM seems to operate
as normal. The spinning within the KVM thread seems to only affect the host.

--
Steven Haigh

[email protected]
https://crc.id.au

2023-12-28 13:18:24

by Lukas Wunner

Subject: Re: Qemu KVM thread spins at 100% CPU usage on scsi hot-unplug (kernel 6.6.8 guest)

On Thu, Dec 28, 2023 at 01:03:10PM +1100, Steven Haigh wrote:
> At some point in kernel 6.6.x, SCSI hotplug in qemu VMs broke. This was
> mostly fixed in the following commit to release 6.6.8:
> commit 5cc8d88a1b94b900fd74abda744c29ff5845430b
> Author: Bjorn Helgaas <[email protected]>
> Date: Thu Dec 14 09:08:56 2023 -0600
> Revert "PCI: acpiphp: Reassign resources on bridge if necessary"
>
> After this commit, the SCSI block device is hotplugged correctly, and a device node as /dev/sdX appears within the qemu VM.
>
> New problem:
>
> When the same SCSI block device is hot-unplugged, the QEMU KVM process will
> spin at 100% CPU usage. The guest shows no CPU being used via top, but the
> host will continue to spin in the KVM thread until the VM is rebooted.

Find out the PID of the qemu process on the host, then cat /proc/$PID/stack
to see where the CPU time is spent.
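
One caveat with the suggestion above: /proc/$PID/stack shows the kernel stack of the main thread only, while a QEMU process runs its vCPUs and I/O threads as separate tasks under /proc/$PID/task/. The following is a minimal Python sketch, not a definitive tool, that samples per-thread CPU time twice and reports the busiest thread, so that its stack can be read from /proc/$PID/task/$TID/stack. The function names and the `proc_root` parameter are illustrative assumptions; reading the stack file normally requires root, and the stack may well be empty if the thread is spinning in user space rather than in the kernel.

```python
import os
import time


def read_thread_cpu_ticks(pid, proc_root="/proc"):
    """Return {tid: utime + stime in clock ticks} for every thread of pid."""
    ticks = {}
    task_dir = os.path.join(proc_root, str(pid), "task")
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "stat")) as f:
                stat = f.read()
        except OSError:
            continue  # thread exited between listdir() and open()
        # The comm field is wrapped in parentheses and may contain spaces
        # (e.g. "CPU 0/KVM"), so split after the last ')'.
        fields = stat.rsplit(")", 1)[1].split()
        # fields[0] is the state (field 3 in proc(5)); utime is field 14
        # and stime is field 15, i.e. indexes 11 and 12 here.
        ticks[int(tid)] = int(fields[11]) + int(fields[12])
    return ticks


def busiest_thread(before, after):
    """Return (tid, delta_ticks) for the thread that used the most CPU."""
    deltas = {tid: after[tid] - before[tid] for tid in after if tid in before}
    if not deltas:
        return None
    tid = max(deltas, key=deltas.get)
    return tid, deltas[tid]


def sample_busiest_thread(pid, interval=1.0, proc_root="/proc"):
    """Sample CPU ticks twice, `interval` seconds apart, and compare."""
    before = read_thread_cpu_ticks(pid, proc_root)
    time.sleep(interval)
    after = read_thread_cpu_ticks(pid, proc_root)
    return busiest_thread(before, after)


def read_kernel_stack(pid, tid, proc_root="/proc"):
    """Read the kernel stack of one thread (usually needs root)."""
    path = os.path.join(proc_root, str(pid), "task", str(tid), "stack")
    with open(path) as f:
        return f.read()
```

Calling `sample_busiest_thread(pid)` with the PID of the qemu process on the host, and then `read_kernel_stack(pid, tid)` with the returned thread ID, would give the stack of the spinning thread rather than that of the main thread.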

2023-12-29 05:47:19

by Steven Haigh

Subject: Re: Qemu KVM thread spins at 100% CPU usage on scsi hot-unplug (kernel 6.6.8 guest)

On 29/12/23 00:18, Lukas Wunner wrote:
> On Thu, Dec 28, 2023 at 01:03:10PM +1100, Steven Haigh wrote:
>> At some point in kernel 6.6.x, SCSI hotplug in qemu VMs broke. This was
>> mostly fixed in the following commit to release 6.6.8:
>> commit 5cc8d88a1b94b900fd74abda744c29ff5845430b
>> Author: Bjorn Helgaas <[email protected]>
>> Date: Thu Dec 14 09:08:56 2023 -0600
>> Revert "PCI: acpiphp: Reassign resources on bridge if necessary"
>>
>> After this commit, the SCSI block device is hotplugged correctly, and a device node as /dev/sdX appears within the qemu VM.
>>
>> New problem:
>>
>> When the same SCSI block device is hot-unplugged, the QEMU KVM process will
>> spin at 100% CPU usage. The guest shows no CPU being used via top, but the
>> host will continue to spin in the KVM thread until the VM is rebooted.
>
> Find out the PID of the qemu process on the host, then cat /proc/$PID/stack
> to see where the CPU time is spent.

Thanks for the tip - I'll certainly do that.

Annoyingly, since I originally posted this report and then followed it up with a new report to the kernel.org lists, I
have been unable to reproduce this problem. I have successfully done ~22 SCSI hotplug / remove cycles and none of them
reproduced the issue.

Kernel versions are still the same on both proxmox host and the Fedora guest - however I see an update on the host of
the qemu-kvm packages in Proxmox. The proxmox host hasn't even been rebooted in this time.

I wonder if the initial revert included in 6.6.8 fixed the main problem, and the later update to qemu-kvm packages on
the proxmox host followed by the last reboot of the VM with the new KVM package sorted the second issue.

Seeing as I can no longer reproduce this reliably - whereas it was 100% reproducible prior, maybe I'm now chasing ghosts.

I'll still continue to monitor - as I normally do this SCSI hotplug ~3 times per week doing backups to different
external HDDs - so if I do observe it again, I'll grab the stack and reply to this thread again with what I can find.

Until then, I don't want to waste other people's time also chasing ghosts :)

--
Steven Haigh

[email protected]
https://crc.id.au

2024-01-03 09:53:10

by Fiona Ebner

Subject: Re: Qemu KVM thread spins at 100% CPU usage on scsi hot-unplug (kernel 6.6.8 guest)

Hi,

On 29.12.23 at 06:46, Steven Haigh wrote:
> On 29/12/23 00:18, Lukas Wunner wrote:
>> On Thu, Dec 28, 2023 at 01:03:10PM +1100, Steven Haigh wrote:
>>> At some point in kernel 6.6.x, SCSI hotplug in qemu VMs broke. This was
>>> mostly fixed in the following commit to release 6.6.8:
>>>     commit 5cc8d88a1b94b900fd74abda744c29ff5845430b
>>>     Author: Bjorn Helgaas <[email protected]>
>>>     Date:   Thu Dec 14 09:08:56 2023 -0600
>>>     Revert "PCI: acpiphp: Reassign resources on bridge if necessary"
>>>
>>> After this commit, the SCSI block device is hotplugged correctly, and
>>> a device node as /dev/sdX appears within the qemu VM.
>>>
>>> New problem:
>>>
>>> When the same SCSI block device is hot-unplugged, the QEMU KVM
>>> process will
>>> spin at 100% CPU usage. The guest shows no CPU being used via top,
>>> but the
>>> host will continue to spin in the KVM thread until the VM is rebooted.
>>
>> Find out the PID of the qemu process on the host, then cat
>> /proc/$PID/stack
>> to see where the CPU time is spent.
>
> Thanks for the tip - I'll certainly do that.
>
> Annoyingly, since I posted this report originally, then adding in a new
> report to the kernel.org lists in this, I have been unable to reproduce
> this problem. I have successfully done ~22 scsi hotplug / remove cycles
> and none resulted in reproducing the issue.
>
> Kernel versions are still the same on both proxmox host and the Fedora
> guest - however I see an update on the host of the qemu-kvm packages in
> Proxmox. The proxmox host hasn't even been rebooted in this time.
>
> I wonder if the initial revert included in 6.6.8 fixed the main problem,
> and the later update to qemu-kvm packages on the proxmox host followed
> by the last reboot of the VM with the new KVM package sorted the second
> issue.
>
> Seeing as I can no longer reproduce this reliably - whereas it was 100%
> reproducible prior, maybe I'm now chasing ghosts.
>

That sounds likely. Version pve-qemu-kvm=8.1.2-5 had a regression where
an IO thread in QEMU could start spinning after a drain (which happens
during hotplug on the QEMU side). It was introduced by an attempted fix
for a much rarer problem [0] and was reverted in pve-qemu-kvm=8.1.2-6
[1]. A proper fix is still being worked on [2].

[0]:
https://git.proxmox.com/?p=pve-qemu.git;a=commit;h=6b7c1815e1c89cb66ff48fbba6da69fe6d254630
[1]:
https://git.proxmox.com/?p=pve-qemu.git;a=commit;h=2a49e667bae33f2a5c6ba6b59a0cd26387f73a27
[2]: https://lists.nongnu.org/archive/html/qemu-devel/2023-12/msg01900.html
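
To check whether a given host has the affected package installed, something along these lines might help. It is only a sketch: the two version strings are the ones named above, the helper names are made up for illustration, and it assumes `dpkg-query` is available on the host (as it should be on a Debian-based Proxmox install). Only an exact match is treated as affected, since that is the only version this thread identifies.

```python
import subprocess

# Version strings taken from this thread.
AFFECTED_VERSION = "8.1.2-5"  # had the spinning I/O thread regression
REVERTED_IN_VERSION = "8.1.2-6"  # regression reverted here


def is_affected(installed_version: str) -> bool:
    """True only for the exact package version named as affected."""
    return installed_version.strip() == AFFECTED_VERSION


def installed_pve_qemu_version() -> str:
    """Ask dpkg for the installed pve-qemu-kvm version (host side)."""
    result = subprocess.run(
        ["dpkg-query", "-W", "-f=${Version}", "pve-qemu-kvm"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

Note that this only looks at the installed package. A VM whose QEMU process was started before the upgrade presumably keeps running the old binary until that process is restarted.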

Best Regards,
Fiona

> I'll still continue to monitor - as I normally do this SCSI hotplug ~3
> times per week doing backups to different external HDDs - so if I do
> observe it again, I'll grab the stack and reply to this thread again
> with what I can find.
>
> Until then, I don't want to waste other peoples time also chasing ghosts :)
>