2022-08-25 23:58:34

by Laba, SlawomirX

[permalink] [raw]
Subject: BUG: Virtual machine fails to start on 6.0-rc2

We were testing the changes for our VF devices and noticed an issue when starting VMs with passthrough VFs. We then moved back to mainline kernel and reproduced the issue on 6.0-rc2

We noticed that the startup of the KVM hangs.

Steps to reproduce:
Create a VF from the PF interface.
Configure VM XML with the VF PCI.
Start the KVM.

To isolate the issue we moved back to kernel 5.19 and it was working fine.

Working tag v5.19
Tested failing commit 4c612826bec1

[root@localhost sl]# uname -r
6.0.0-rc2-00159-g4c612826bec1
[root@localhost sl]# echo 1 > /sys/class/net/ens785f3/device/sriov_numvfs
[root@localhost sl]# virsh start rhel_9_0_first
^C
[root@localhost sl]# virsh list --all
Id Name State
-------------------------------
1 rhel_9_0_first paused

Dmesg:
[ +0.042400] iavf: Intel(R) Ethernet Adaptive Virtual Function Network Driver
[ +0.000004] Copyright (c) 2013 - 2018 Intel Corporation.
[ +0.000309] iavf 0000:18:19.0: enabling device (0000 -> 0002)
[ +0.073471] iavf 0000:18:19.0: Invalid MAC address 00:00:00:00:00:00, using random
[ +0.000674] iavf 0000:18:19.0: Multiqueue Enabled: Queue pair count = 16
[ +0.000466] iavf 0000:18:19.0: MAC address: 5a:0c:b5:f7:4f:0b
[ +0.000003] iavf 0000:18:19.0: GRO is enabled
[ +0.005941] iavf 0000:18:19.0 ens785f3v0: renamed from eth0
[ +0.179174] IPv6: ADDRCONF(NETDEV_CHANGE): ens785f3v0: link becomes ready
[ +0.000040] iavf 0000:18:19.0 ens785f3v0: NIC Link is Up Speed is 25 Gbps Full Duplex
[ +26.408503] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[ +0.399621] VFIO - User Level meta-driver version: 0.3
[ +0.151579] iavf 0000:18:19.0: Remove device
[ +0.292158] ice 0000:18:00.3 ens785f3: Setting MAC 52:54:00:9f:ea:de on VF 0. VF driver will be reinitialized
[ +0.083676] ice 0000:18:00.3: Clearing port VLAN on VF 0
[ +0.155905] tun: Universal TUN/TAP device driver, 1.6
[ +0.000976] virbr0: port 1(vnet0) entered blocking state
[ +0.000017] virbr0: port 1(vnet0) entered disabled state
[ +0.000052] device vnet0 entered promiscuous mode
[ +0.000244] virbr0: port 1(vnet0) entered blocking state
[ +0.000003] virbr0: port 1(vnet0) entered listening state
[ +2.019924] virbr0: port 1(vnet0) entered learning state
[ +2.047997] virbr0: port 1(vnet0) entered forwarding state
[ +0.000018] virbr0: topology change detected, propagating
[Aug25 19:12] INFO: task khugepaged:507 blocked for more than 122 seconds.
[ +0.000016] Tainted: G W I 6.0.0-rc2-00159-g4c612826bec1 #1
[ +0.000010] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ +0.000010] task:khugepaged state:D stack: 0 pid: 507 ppid: 2 flags:0x00004000
[ +0.000004] Call Trace:
[ +0.000001] <TASK>
[ +0.000003] __schedule+0x1bc/0x550
[ +0.000006] ? osq_unlock+0xf/0x90
[ +0.000005] schedule+0x5d/0xd0
[ +0.000002] rwsem_down_write_slowpath+0x2c9/0x5e0
[ +0.000004] ? find_vma+0x64/0x70
[ +0.000004] collapse_huge_page+0x1f8/0x8a0
[ +0.000004] ? _raw_spin_unlock+0x14/0x30
[ +0.000002] ? preempt_count_add+0x70/0xa0
[ +0.000005] ? _raw_spin_lock_irqsave+0x21/0x30
[ +0.000001] ? lock_timer_base+0x61/0x80
[ +0.000005] khugepaged_scan_pmd+0x33d/0x7b0
[ +0.000003] khugepaged_scan_mm_slot+0x155/0x440
[ +0.000003] khugepaged+0x189/0x3e0
[ +0.000002] ? preempt_count_add+0x70/0xa0
[ +0.000002] ? _raw_spin_unlock_irqrestore+0x1e/0x40
[ +0.000002] ? khugepaged_scan_mm_slot+0x440/0x440
[ +0.000001] kthread+0xf0/0x120
[ +0.000003] ? kthread_complete_and_exit+0x20/0x20
[ +0.000003] ret_from_fork+0x1f/0x30
[ +0.000006] </TASK>

On the working version there is a line that enables the interface for the VF which is missing on non-working one:
[ +0.911730] vfio-pci 0000:18:19.0: enabling device (0000 -> 0002)


2022-08-26 00:15:26

by Alex Williamson

[permalink] [raw]
Subject: Re: BUG: Virtual machine fails to start on 6.0-rc2

On Thu, 25 Aug 2022 23:42:47 +0000
"Laba, SlawomirX" <[email protected]> wrote:

> We were testing the changes for our VF devices and noticed an issue
> when starting VMs with passthrough VFs. We then moved back to
> mainline kernel and reproduced the issue on 6.0-rc2
>
> We noticed that the startup of the KVM hangs.
>
> Steps to reproduce:
> Create a VF from the PF interface.
> Configure VM XML with the VF PCI.
> Start the KVM.
>
> To isolate the issue we moved back to kernel 5.19 and it was working
> fine.
>
> Working tag v5.19
> Tested failing commit 4c612826bec1

Does this resolve the issue?

https://lore.kernel.org/all/166015037385.760108.16881097713975517242.stgit@omen/

This is currently in Andrew's mm-hotfixes-unstable branch and we're
waiting for it to be merged to mainline. Thanks,

Alex

> [root@localhost sl]# uname -r
> 6.0.0-rc2-00159-g4c612826bec1
> [root@localhost sl]# echo 1 >
> /sys/class/net/ens785f3/device/sriov_numvfs [root@localhost sl]#
> virsh start rhel_9_0_first ^C
> [root@localhost sl]# virsh list --all
> Id Name State
> -------------------------------
> 1 rhel_9_0_first paused
>
> Dmesg:
> [ +0.042400] iavf: Intel(R) Ethernet Adaptive Virtual Function
> Network Driver [ +0.000004] Copyright (c) 2013 - 2018 Intel
> Corporation. [ +0.000309] iavf 0000:18:19.0: enabling device (0000
> -> 0002) [ +0.073471] iavf 0000:18:19.0: Invalid MAC address
> 00:00:00:00:00:00, using random [ +0.000674] iavf 0000:18:19.0:
> Multiqueue Enabled: Queue pair count = 16 [ +0.000466] iavf
> 0000:18:19.0: MAC address: 5a:0c:b5:f7:4f:0b [ +0.000003] iavf
> 0000:18:19.0: GRO is enabled [ +0.005941] iavf 0000:18:19.0
> ens785f3v0: renamed from eth0 [ +0.179174] IPv6:
> ADDRCONF(NETDEV_CHANGE): ens785f3v0: link becomes ready [ +0.000040]
> iavf 0000:18:19.0 ens785f3v0: NIC Link is Up Speed is 25 Gbps Full
> Duplex [ +26.408503] bridge: filtering via arp/ip/ip6tables is no
> longer available by default. Update your scripts to load br_netfilter
> if you need this. [ +0.399621] VFIO - User Level meta-driver
> version: 0.3 [ +0.151579] iavf 0000:18:19.0: Remove device [
> +0.292158] ice 0000:18:00.3 ens785f3: Setting MAC 52:54:00:9f:ea:de
> on VF 0. VF driver will be reinitialized [ +0.083676] ice
> 0000:18:00.3: Clearing port VLAN on VF 0 [ +0.155905] tun: Universal
> TUN/TAP device driver, 1.6 [ +0.000976] virbr0: port 1(vnet0)
> entered blocking state [ +0.000017] virbr0: port 1(vnet0) entered
> disabled state [ +0.000052] device vnet0 entered promiscuous mode [
> +0.000244] virbr0: port 1(vnet0) entered blocking state [ +0.000003]
> virbr0: port 1(vnet0) entered listening state [ +2.019924] virbr0:
> port 1(vnet0) entered learning state [ +2.047997] virbr0: port
> 1(vnet0) entered forwarding state [ +0.000018] virbr0: topology
> change detected, propagating [Aug25 19:12] INFO: task khugepaged:507
> blocked for more than 122 seconds. [ +0.000016] Tainted: G
> W I 6.0.0-rc2-00159-g4c612826bec1 #1 [ +0.000010] "echo 0
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [
> > +0.000010] task:khugepaged state:D stack: 0 pid: 507 ppid:
> > 2 flags:0x00004000 [ +0.000004] Call Trace:
> [ +0.000001] <TASK>
> [ +0.000003] __schedule+0x1bc/0x550
> [ +0.000006] ? osq_unlock+0xf/0x90
> [ +0.000005] schedule+0x5d/0xd0
> [ +0.000002] rwsem_down_write_slowpath+0x2c9/0x5e0
> [ +0.000004] ? find_vma+0x64/0x70
> [ +0.000004] collapse_huge_page+0x1f8/0x8a0
> [ +0.000004] ? _raw_spin_unlock+0x14/0x30
> [ +0.000002] ? preempt_count_add+0x70/0xa0
> [ +0.000005] ? _raw_spin_lock_irqsave+0x21/0x30
> [ +0.000001] ? lock_timer_base+0x61/0x80
> [ +0.000005] khugepaged_scan_pmd+0x33d/0x7b0
> [ +0.000003] khugepaged_scan_mm_slot+0x155/0x440
> [ +0.000003] khugepaged+0x189/0x3e0
> [ +0.000002] ? preempt_count_add+0x70/0xa0
> [ +0.000002] ? _raw_spin_unlock_irqrestore+0x1e/0x40
> [ +0.000002] ? khugepaged_scan_mm_slot+0x440/0x440
> [ +0.000001] kthread+0xf0/0x120
> [ +0.000003] ? kthread_complete_and_exit+0x20/0x20
> [ +0.000003] ret_from_fork+0x1f/0x30
> [ +0.000006] </TASK>
>
> On the working version there is a line that enables the interface for
> the VF which is missing on non-working one: [ +0.911730] vfio-pci
> 0000:18:19.0: enabling device (0000 -> 0002)
>

2022-08-26 18:50:43

by Laba, SlawomirX

[permalink] [raw]
Subject: RE: BUG: Virtual machine fails to start on 6.0-rc2

> -----Original Message-----
> From: Alex Williamson <[email protected]>
> Sent: Friday, August 26, 2022 2:08 AM
> To: Laba, SlawomirX <[email protected]>
> Cc: [email protected]; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; Brandeburg, Jesse
> <[email protected]>; Nguyen, Anthony L
> <[email protected]>; Andrew Morton <akpm@linux-
> foundation.org>
> Subject: Re: BUG: Virtual machine fails to start on 6.0-rc2
>
> On Thu, 25 Aug 2022 23:42:47 +0000
> "Laba, SlawomirX" <[email protected]> wrote:
>
> > We were testing the changes for our VF devices and noticed an issue
> > when starting VMs with passthrough VFs. We then moved back to mainline
> > kernel and reproduced the issue on 6.0-rc2
> >
> > We noticed that the startup of the KVM hangs.
> >
> > Steps to reproduce:
> > Create a VF from the PF interface.
> > Configure VM XML with the VF PCI.
> > Start the KVM.
> >
> > To isolate the issue we moved back to kernel 5.19 and it was working
> > fine.
> >
> > Working tag v5.19
> > Tested failing commit 4c612826bec1
>
> Does this resolve the issue?
>
> https://lore.kernel.org/all/166015037385.760108.16881097713975517242.stgit
> @omen/
>
> This is currently in Andrew's mm-hotfixes-unstable branch and we're waiting
> for it to be merged to mainline. Thanks,
>
> Alex

With the patch you linked, the issue is gone.

Thanks,
Slawek