2017-03-23 15:09:18

by Thorsten Leemhuis

[permalink] [raw]
Subject: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

Hi Christoph! Hi Michael!

(Mail roughly based on text from
https://bugzilla.kernel.org/show_bug.cgi?id=194911 )

I'm seeing random crashes during boot every few boot attempts when
running Linux 4.11-rc/mainline in a Fedora 26 guest under a CentOS7 host
(CPU: Intel(R) Pentium(R) CPU G3220) using KVM. Sometimes when the guest
actually booted the network did not work. To get some impressions of the
crashes I got see this gallery:
https://plus.google.com/+ThorstenLeemhuis/posts/FjyyGjNtrrG

Richard W.M. Jones and Adam Williamson see the same problems. See above
bug for details. It seems they ran into the problem in the past few
days, so I assume it's still present in mainline (I'm travelling
currently and haven't had time for proper tests since last last Friday
(pre-rc3); but I thought it's time to get the problem to the lists).

Long story short: Richard and I did bisections and we both found that
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=07ec51480b5e
("virtio_pci: use shared interrupts for virtqueues") is the first bad
commit. Any idea what might be wrong? Do you need more details from us
to fix this?

Ciao, Thorsten

P.S.: Sorry, I should have written this mail a few days ago after filing
above bug report, but I didn't get around to it :-/


2017-03-23 14:56:26

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

Does the patch from Jason in the

"[REGRESSION] 07ec51480b5e ("virtio_pci: use shared interrupts for virtqueues") causes crashes in guest"

thread fix the issue for you?

2017-03-23 14:59:11

by Richard W.M. Jones

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Thu, Mar 23, 2017 at 03:51:25PM +0100, Thorsten Leemhuis wrote:
> Hi Christoph! Hi Michael!
>
> (Mail roughly based on text from
> https://bugzilla.kernel.org/show_bug.cgi?id=194911 )
>
> I'm seeing random crashes during boot every few boot attempts when
> running Linux 4.11-rc/mainline in a Fedora 26 guest under a CentOS7 host
> (CPU: Intel(R) Pentium(R) CPU G3220) using KVM. Sometimes when the guest
> actually booted the network did not work. To get some impressions of the
> crashes I got see this gallery:
> https://plus.google.com/+ThorstenLeemhuis/posts/FjyyGjNtrrG
>
> Richard W.M. Jones and Adam Williamson see the same problems. See above
> bug for details. It seems they ran into the problem in the past few
> days, so I assume it's still present in mainline (I'm travelling
> currently and haven't had time for proper tests since last last Friday
> (pre-rc3); but I thought it's time to get the problem to the lists).
>
> Long story short: Richard and I did bisections and we both found that
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=07ec51480b5e
> ("virtio_pci: use shared interrupts for virtqueues") is the first bad
> commit. Any idea what might be wrong? Do you need more details from us
> to fix this?

Laura Abbott posted a kernel RPM which works for me. She has had to
revert quite a number of commits, which are detailed in this comment:

https://bugzilla.redhat.com/show_bug.cgi?id=1430297#c7

Her reverting patch is also attached.

Rich.

--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html


Attachments:
(No filename) (1.71 kB)
virtio_revert.patch (52.77 kB)
Download all attachments

2017-03-23 15:00:02

by Richard W.M. Jones

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Thu, Mar 23, 2017 at 03:56:22PM +0100, Christoph Hellwig wrote:
> Does the patch from Jason in the
>
> "[REGRESSION] 07ec51480b5e ("virtio_pci: use shared interrupts for virtqueues") causes crashes in guest"
>
> thread fix the issue for you?

I didn't see this thread before. I'll check that out for you now.

Rich.

--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines. Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v

2017-03-23 15:02:00

by Thorsten Leemhuis

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On 23.03.2017 15:56, Christoph Hellwig wrote:
> Does the patch from Jason in the
> "[REGRESSION] 07ec51480b5e ("virtio_pci: use shared interrupts for virtqueues") causes crashes in guest"
> thread fix the issue for you?

Ha, sorry, I'm travelling and wasn't aware that Laura earlier today did
what I should have done a few days ago: bring the issue to the proper
mailing lists.

I'll give the patch a try. Thx for pointing it out and sorry for the
noise. Ciao, Thorsten

2017-03-23 15:20:02

by Richard W.M. Jones

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Thu, Mar 23, 2017 at 03:56:22PM +0100, Christoph Hellwig wrote:
> Does the patch from Jason in the
>
> "[REGRESSION] 07ec51480b5e ("virtio_pci: use shared interrupts for virtqueues") causes crashes in guest"
>
> thread fix the issue for you?

In brief, yes it does. I followed up on that thread.

Thanks,

Rich.

--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW

2017-03-27 09:07:47

by Mike Galbraith

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Thu, 2017-03-23 at 15:56 +0100, Christoph Hellwig wrote:
> Does the patch from Jason in the
>
> "[REGRESSION] 07ec51480b5e ("virtio_pci: use shared interrupts for virtqueues") causes crashes in guest"
>
> thread fix the issue for you?

That seems to eliminate explosions, but not the below.

07ec51480b5e causes me some kworker grief in -rt too (100% CPU), but
that's as yet not been stared at (too darn [busy/lazy], pick one;).

virgin 4.11-rc4+referenced patch, config=enterprise-ish.
...
[ 158.400210] PM: Hibernation mode set to 'shutdown'
[ 158.607439] PM: Syncing filesystems ...
[ 158.986595] PM: done.
[ 158.986771] Freezing user space processes ... (elapsed 0.001 seconds) done.
[ 158.988758] PM: Marking nosave pages: [mem 0x00000000-0x00000fff]
[ 158.989156] PM: Marking nosave pages: [mem 0x0009f000-0x000fffff]
[ 158.989550] PM: Marking nosave pages: [mem 0xbffde000-0xffffffff]
[ 158.990200] PM: Basic memory bitmaps created
[ 158.990468] PM: Preallocating image memory... done (allocated 395798 pages)
[ 159.114650] PM: Allocated 1583192 kbytes in 0.12 seconds (13193.26 MB/s)
[ 159.115203] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[ 159.119378] ------------[ cut here ]------------
[ 159.122606] WARNING: CPU: 3 PID: 509 at drivers/pci/msi.c:1251 pci_irq_vector+0xcf/0xe0
[ 159.123194] Modules linked in: fuse(E) ebtable_filter(E) ebtables(E) rpcsec_gss_krb5(E) nfsv4(E) dns_resolver(E) nfs(E) fscache(E) nf_log_ipv6(E) xt_pkttype(E) nf_log_ipv4(E) nf_log_common(E) xt_LOG(E) xt_limit(E) af_packet(E) iscsi_ibft(E) iscsi_boot_sysfs(E) ip6t_REJECT(E) xt_tcpudp(E) nf_conntrack_ipv6(E) nf_defrag_ipv6(E) ip6table_raw(E) ipt_REJECT(E) iptable_raw(E) xt_CT(E) iptable_filter(E) ip6table_mangle(E) nf_conntrack_netbios_ns(E) nf_conntrack_broadcast(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) ip_tables(E) xt_conntrack(E) nf_conntrack(E) libcrc32c(E) ip6table_filter(E) ip6_tables(E) x_tables(E) snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) joydev(E) snd_pcm(E) snd_timer(E) snd(E) crct10dif_pclmul(E) soundcore(E) crc32_pclmul(E) 8139too(E) ghash_clmulni_intel(E)
[ 159.128123] pcbc(E) aesni_intel(E) ppdev(E) i2c_piix4(E) aes_x86_64(E) virtio_balloon(E) crypto_simd(E) parport_pc(E) glue_helper(E) parport(E) button(E) pcspkr(E) cryptd(E) serio_raw(E) acpi_cpufreq(E) nfsd(E) dm_mod(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) ext4(E) crc16(E) jbd2(E) mbcache(E) hid_generic(E) usbhid(E) sr_mod(E) cdrom(E) ata_generic(E) virtio_rng(E) virtio_blk(E) virtio_console(E) ata_piix(E) floppy(E) ehci_pci(E) qxl(E) drm_kms_helper(E) syscopyarea(E) uhci_hcd(E) ahci(E) ehci_hcd(E) sysfillrect(E) crc32c_intel(E) sysimgblt(E) libahci(E) fb_sys_fops(E) ttm(E) virtio_pci(E) virtio_ring(E) usbcore(E) virtio(E) 8139cp(E) drm(E) libata(E) mii(E) sg(E) scsi_mod(E) autofs4(E)
[ 159.132177] CPU: 3 PID: 509 Comm: kworker/u16:6 Tainted: G E 4.11.0-default #28
[ 159.132677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20161202_174313-build11a 04/01/2014
[ 159.133428] Workqueue: events_unbound async_run_entry_fn
[ 159.133768] Call Trace:
[ 159.133933] dump_stack+0x63/0x90
[ 159.134161] __warn+0xd1/0xf0
[ 159.134360] ? pci_pm_poweroff+0x100/0x100
[ 159.134627] warn_slowpath_null+0x1d/0x20
[ 159.134889] pci_irq_vector+0xcf/0xe0
[ 159.135134] vp_synchronize_vectors+0x46/0x60 [virtio_pci]
[ 159.135486] vp_reset+0x37/0x40 [virtio_pci]
[ 159.135780] virtcons_freeze+0x23/0xa0 [virtio_console]
[ 159.136116] virtio_device_freeze+0x6b/0x80 [virtio]
[ 159.136431] virtio_pci_freeze+0x1d/0x40 [virtio_pci]
[ 159.136756] pci_pm_freeze+0x5f/0xe0
[ 159.136999] dpm_run_callback+0x59/0x180
[ 159.137252] __device_suspend+0x127/0x3c0
[ 159.137513] ? pm_dev_dbg+0x80/0x80
[ 159.137740] async_suspend+0x1f/0xa0
[ 159.137973] async_run_entry_fn+0x39/0x170
[ 159.138250] process_one_work+0x16c/0x450
[ 159.138514] worker_thread+0x137/0x4e0
[ 159.138761] kthread+0x10c/0x140
[ 159.138970] ? rescuer_thread+0x3c0/0x3c0
[ 159.139235] ? kthread_park+0x90/0x90
[ 159.139476] ret_from_fork+0x2c/0x40
[ 159.139721] ---[ end trace d66daafbe82e66e7 ]---
[ 159.728658] PM: freeze of devices complete after 611.743 msecs
[ 159.729321] PM: late freeze of devices complete after 0.243 msecs
[ 159.730921] PM: noirq freeze of devices complete after 1.145 msecs
[ 159.731507] Disabling non-boot CPUs ...
[ 159.732004] Unregister pv shared memory for cpu 1
[ 159.739017] smpboot: CPU 1 is now offline
[ 159.765702] Unregister pv shared memory for cpu 2
[ 159.770757] smpboot: CPU 2 is now offline
[ 159.797684] Unregister pv shared memory for cpu 3
[ 159.799545] smpboot: CPU 3 is now offline
[ 159.821934] Unregister pv shared memory for cpu 4
[ 159.823759] smpboot: CPU 4 is now offline
[ 159.848588] Unregister pv shared memory for cpu 5
[ 159.850375] smpboot: CPU 5 is now offline
[ 159.872907] Unregister pv shared memory for cpu 6
[ 159.874598] smpboot: CPU 6 is now offline
[ 159.896898] Unregister pv shared memory for cpu 7
[ 159.898517] smpboot: CPU 7 is now offline
[ 159.916620] PM: Creating hibernation image:
[ 159.998033] PM: Need to copy 394338 pages
[ 159.998370] PM: Normal pages needed: 394338 + 1024, available pages: 1702516
[1266874520.353525] kvm-clock: cpu 0, msr 2:3ff54001, primary cpu clock, resume
[1266874520.354849] Suspended for 67.457 seconds
[1266874520.354877] Enabling non-boot CPUs ...
[1266874520.366683] x86: Booting SMP configuration:
[1266874520.367028] smpboot: Booting Node 0 Processor 1 APIC 0x1
[1266874520.367514] kvm-clock: cpu 1, msr 2:3ff54041, secondary cpu clock
[1266874520.370107] KVM setup async PF for cpu 1
[1266874520.370425] kvm-stealtime: cpu 1, msr 23fc4d900
[1266874520.370858] cache: parent cpu1 should not be sleeping
[1266874520.371408] CPU1 is up
[1266874520.386657] smpboot: Booting Node 0 Processor 2 APIC 0x2
[1266874520.387189] kvm-clock: cpu 2, msr 2:3ff54081, secondary cpu clock
[1266874520.389787] KVM setup async PF for cpu 2
[1266874520.390142] kvm-stealtime: cpu 2, msr 23fc8d900
[1266874520.390604] cache: parent cpu2 should not be sleeping
[1266874520.391178] CPU2 is up
[1266874520.410687] smpboot: Booting Node 0 Processor 3 APIC 0x3
[1266874520.411288] kvm-clock: cpu 3, msr 2:3ff540c1, secondary cpu clock
[1266874520.413851] KVM setup async PF for cpu 3
[1266874520.414162] kvm-stealtime: cpu 3, msr 23fccd900
[1266874520.414566] cache: parent cpu3 should not be sleeping
[1266874520.415071] CPU3 is up
[1266874520.430924] smpboot: Booting Node 0 Processor 4 APIC 0x4
[1266874520.431459] kvm-clock: cpu 4, msr 2:3ff54101, secondary cpu clock
[1266874520.434048] KVM setup async PF for cpu 4
[1266874520.434381] kvm-stealtime: cpu 4, msr 23fd0d900
[1266874520.434822] cache: parent cpu4 should not be sleeping
[1266874520.435456] CPU4 is up
[1266874520.454957] smpboot: Booting Node 0 Processor 5 APIC 0x5
[1266874520.455822] kvm-clock: cpu 5, msr 2:3ff54141, secondary cpu clock
[1266874520.458507] KVM setup async PF for cpu 5
[1266874520.458834] kvm-stealtime: cpu 5, msr 23fd4d900
[1266874520.459439] cache: parent cpu5 should not be sleeping
[1266874520.460320] CPU5 is up
[1266874520.478907] smpboot: Booting Node 0 Processor 6 APIC 0x6
[1266874520.479586] kvm-clock: cpu 6, msr 2:3ff54181, secondary cpu clock
[1266874520.482176] KVM setup async PF for cpu 6
[1266874520.482510] kvm-stealtime: cpu 6, msr 23fd8d900
[1266874520.482936] cache: parent cpu6 should not be sleeping
[1266874520.483466] CPU6 is up
[1266874520.502988] smpboot: Booting Node 0 Processor 7 APIC 0x7
[1266874520.503775] kvm-clock: cpu 7, msr 2:3ff541c1, secondary cpu clock
[1266874520.506519] KVM setup async PF for cpu 7
[1266874520.506884] kvm-stealtime: cpu 7, msr 23fdcd900
[1266874520.507438] cache: parent cpu7 should not be sleeping
[1266874520.508141] CPU7 is up
[1266874520.517708] PM: noirq restore of devices complete after 3.294 msecs
[1266874520.518547] PM: early restore of devices complete after 0.129 msecs
[1266874520.553012] rtc_cmos 00:00: System wakeup disabled by ACPI
[1266874520.562260] usb usb2: root hub lost power or was reset
[1266874520.562269] usb usb1: root hub lost power or was reset
[1266874520.563296] usb usb4: root hub lost power or was reset
[1266874520.564993] ------------[ cut here ]------------
[1266874520.564998] WARNING: CPU: 0 PID: 507 at drivers/pci/msi.c:1261 pci_irq_vector+0xb5/0xe0
[1266874520.564998] Modules linked in: fuse(E) ebtable_filter(E) ebtables(E) rpcsec_gss_krb5(E) nfsv4(E) dns_resolver(E) nfs(E) fscache(E) nf_log_ipv6(E) xt_pkttype(E) nf_log_ipv4(E) nf_log_common(E) xt_LOG(E) xt_limit(E) af_packet(E) iscsi_ibft(E) iscsi_boot_sysfs(E) ip6t_REJECT(E) xt_tcpudp(E) nf_conntrack_ipv6(E) nf_defrag_ipv6(E) ip6table_raw(E) ipt_REJECT(E) iptable_raw(E) xt_CT(E) iptable_filter(E) ip6table_mangle(E) nf_conntrack_netbios_ns(E) nf_conntrack_broadcast(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) ip_tables(E) xt_conntrack(E) nf_conntrack(E) libcrc32c(E) ip6table_filter(E) ip6_tables(E) x_tables(E) snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) joydev(E) snd_pcm(E) snd_timer(E) snd(E) crct10dif_pclmul(E) soundcore(E) crc32_pclmul(E) 8139too(E) ghash_clmulni_intel(E)
[1266874520.565022] pcbc(E) aesni_intel(E) ppdev(E) i2c_piix4(E) aes_x86_64(E) virtio_balloon(E) crypto_simd(E) parport_pc(E) glue_helper(E) parport(E) button(E) pcspkr(E) cryptd(E) serio_raw(E) acpi_cpufreq(E) nfsd(E) dm_mod(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) ext4(E) crc16(E) jbd2(E) mbcache(E) hid_generic(E) usbhid(E) sr_mod(E) cdrom(E) ata_generic(E) virtio_rng(E) virtio_blk(E) virtio_console(E) ata_piix(E) floppy(E) ehci_pci(E) qxl(E) drm_kms_helper(E) syscopyarea(E) uhci_hcd(E) ahci(E) ehci_hcd(E) sysfillrect(E) crc32c_intel(E) sysimgblt(E) libahci(E) fb_sys_fops(E) ttm(E) virtio_pci(E) virtio_ring(E) usbcore(E) virtio(E) 8139cp(E) drm(E) libata(E) mii(E) sg(E) scsi_mod(E) autofs4(E)
[1266874520.565048] CPU: 0 PID: 507 Comm: kworker/u16:4 Tainted: G W E 4.11.0-default #28
[1266874520.565049] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20161202_174313-build11a 04/01/2014
[1266874520.565052] Workqueue: events_unbound async_run_entry_fn
[1266874520.565052] Call Trace:
[1266874520.565058] dump_stack+0x63/0x90
[1266874520.565060] __warn+0xd1/0xf0
[1266874520.565064] ? pci_pm_suspend_noirq+0x190/0x190
[1266874520.565065] warn_slowpath_null+0x1d/0x20
[1266874520.565067] pci_irq_vector+0xb5/0xe0
[1266874520.565069] vp_synchronize_vectors+0x46/0x60 [virtio_pci]
[1266874520.565071] vp_reset+0x37/0x40 [virtio_pci]
[1266874520.565073] virtio_device_restore+0x22/0x110 [virtio]
[1266874520.565074] virtio_pci_restore+0x36/0x40 [virtio_pci]
[1266874520.565076] pci_pm_restore+0x79/0xb0
[1266874520.565078] dpm_run_callback+0x59/0x180
[1266874520.565079] device_resume+0xe7/0x210
[1266874520.565080] ? pm_dev_dbg+0x80/0x80
[1266874520.565082] async_resume+0x1d/0x50
[1266874520.565083] async_run_entry_fn+0x39/0x170
[1266874520.565084] process_one_work+0x16c/0x450
[1266874520.565085] worker_thread+0x137/0x4e0
[1266874520.565088] kthread+0x10c/0x140
[1266874520.565089] ? rescuer_thread+0x3c0/0x3c0
[1266874520.565090] ? kthread_park+0x90/0x90
[1266874520.565092] ret_from_fork+0x2c/0x40
[1266874520.565094] ---[ end trace d66daafbe82e66e8 ]---
[1266874520.600698] usb usb3: root hub lost power or was reset
[1266874520.882457] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[1266874520.883385] ata5: SATA link down (SStatus 0 SControl 300)
[1266874520.884202] ata4: SATA link down (SStatus 0 SControl 300)
[1266874520.884923] ata6: SATA link down (SStatus 0 SControl 300)
[1266874520.885592] ata3: SATA link down (SStatus 0 SControl 300)
[1266874520.886347] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[1266874520.887101] ata1.00: configured for UDMA/100
[1266874520.887835] ata2.00: configured for UDMA/100
[1266874520.946407] usb 3-1: reset high-speed USB device number 2 using ehci-pci
[1266874521.096211] PM: restore of devices complete after 543.539 msecs
[1266874521.097984] PM: Image restored successfully.
[1266874521.099356] PM: Basic memory bitmaps freed
[1266874521.100669] Restarting tasks ... done.

2017-03-27 17:06:09

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

Hi Mike,

does the patch below fix that issue for you?

diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c
index df548a6fb844..fd1b06368b1f 100644
--- a/drivers/virtio/virtio_pci_common.c
+++ b/drivers/virtio/virtio_pci_common.c
@@ -176,7 +176,7 @@ static int vp_find_vqs_msix(struct virtio_device *vdev, unsigned nvqs,
if (err < 0)
return err;

- vp_dev->msix_vectors = nvectors;
+ vp_dev->msix_vectors = err; /* number of vectors allocated */
vp_dev->msix_names = kmalloc_array(nvectors,
sizeof(*vp_dev->msix_names), GFP_KERNEL);
if (!vp_dev->msix_names)

2017-03-27 18:16:33

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Mon, Mar 27, 2017 at 07:05:40PM +0200, Christoph Hellwig wrote:
> Hi Mike,
>
> does the patch below fix that issue for you?
>
> diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c
> index df548a6fb844..fd1b06368b1f 100644
> --- a/drivers/virtio/virtio_pci_common.c
> +++ b/drivers/virtio/virtio_pci_common.c
> @@ -176,7 +176,7 @@ static int vp_find_vqs_msix(struct virtio_device *vdev, unsigned nvqs,
> if (err < 0)
> return err;
>
> - vp_dev->msix_vectors = nvectors;
> + vp_dev->msix_vectors = err; /* number of vectors allocated */
> vp_dev->msix_names = kmalloc_array(nvectors,
> sizeof(*vp_dev->msix_names), GFP_KERNEL);
> if (!vp_dev->msix_names)

Can this sometimes allocate less vectors than min number requested then?
I didn't realize.

In that case we probably should change if (err < 0)
to if (err != nvectors) and similarly for when we try
to get 2 vectors.


Mike, could you pls send lspci -vv that shows up after
boot?

--
MST

2017-03-27 18:27:07

by Mike Galbraith

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Mon, 2017-03-27 at 19:05 +0200, Christoph Hellwig wrote:
> Hi Mike,
>
> does the patch below fix that issue for you?

Thanks, I'll give it a go in the A.M.

BTW, WRT RT woes with $subject, I tried booting a generic kernel with
threadirqs, and bingo, same deal, just a bit more painful than for RT,
where there's no watchdog moaning accompanying the (preemptible) spin.

[ 28.346311] NMI watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [kworker/7:1:108]
[ 28.347536] Modules linked in: virtio_rng(E) virtio_blk(E) virtio_console(E) ata_piix(E) qxl(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) ttm(E) ahci(E) libahci(E) drm(E) ehci_pci(E) uhci_hcd(E) ehci_hcd(E) usbcore(E) libata(E) virtio_pci(E) virtio_ring(E) virtio(E) 8139cp(E) floppy(E) mii(E) sg(E) scsi_mod(E) autofs4(E)
[ 28.351160] CPU: 7 PID: 108 Comm: kworker/7:1 Tainted: G E 4.11.0-default #30
[ 28.352085] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20161202_174313-build11a 04/01/2014
[ 28.353547] Workqueue: events control_work_handler [virtio_console]
[ 28.354450] task: ffff8802370d4440 task.stack: ffffc900010d8000
[ 28.355281] RIP: 0010:__send_control_msg+0xbd/0xd0 [virtio_console]
[ 28.356005] RSP: 0018:ffffc900010dbd20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[ 28.356987] RAX: 0000000000000000 RBX: ffff880231c31ec8 RCX: ffff880231cb1000
[ 28.357866] RDX: 0000000000000001 RSI: ffffc900010dbd2c RDI: ffff880234f87400
[ 28.358738] RBP: ffffc900010dbd78 R08: 0000000001080020 R09: ffffc900010dbd30
[ 28.359718] R10: ffff88023fdddc00 R11: ffffffffffffffc8 R12: ffff880234f87400
[ 28.360653] R13: ffff880231c31ea8 R14: 0000000000000001 R15: 0000000000000003
[ 28.361510] FS: 0000000000000000(0000) GS:ffff88023fdc0000(0000) knlGS:0000000000000000
[ 28.362433] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 28.363177] CR2: 00007f4da0f40000 CR3: 0000000001c09000 CR4: 00000000001406e0
[ 28.363994] Call Trace:
[ 28.364420] add_port+0x23f/0x3d0 [virtio_console]
[ 28.365094] ? _raw_spin_unlock_irqrestore+0x24/0x40
[ 28.365765] handle_control_message.constprop.32+0x2c2/0x2e0 [virtio_console]
[ 28.366622] control_work_handler+0x52/0xb7 [virtio_console]
[ 28.367291] process_one_work+0x15c/0x440
[ 28.367869] worker_thread+0x137/0x4b0
[ 28.368426] kthread+0x10c/0x140
[ 28.368921] ? process_one_work+0x440/0x440
[ 28.369477] ? kthread_create_on_node+0x40/0x40
[ 28.370067] ret_from_fork+0x2c/0x40
[ 28.370611] Code: 57 e1 48 83 c4 30 31 c0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 4c 89 e7 e8 03 93 f7 ff eb 0e 4c 89 e7 e8 89 84 f7 ff 84 c0 75 d1 f3 90 <48> 8d 75 b4 4c 89 e7 e8 57 91 f7 ff 48 85 c0 74 e1 eb bc 0f 1f

2017-03-28 01:01:39

by Mike Galbraith

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Mon, 2017-03-27 at 19:05 +0200, Christoph Hellwig wrote:
> Hi Mike,
>
> does the patch below fix that issue for you?

Nope, warnings are alive and well.

> diff --git a/drivers/virtio/virtio_pci_common.c
> b/drivers/virtio/virtio_pci_common.c
> index df548a6fb844..fd1b06368b1f 100644
> --- a/drivers/virtio/virtio_pci_common.c
> +++ b/drivers/virtio/virtio_pci_common.c
> @@ -176,7 +176,7 @@ static int vp_find_vqs_msix(struct virtio_device
> *vdev, unsigned nvqs,
> if (err < 0)
> return err;
>
> - vp_dev->msix_vectors = nvectors;
> + vp_dev->msix_vectors = err; /* number of vectors allocated
> */
> vp_dev->msix_names = kmalloc_array(nvectors,
> sizeof(*vp_dev->msix_names), GFP_KERNEL);
> if (!vp_dev->msix_names)

2017-03-28 01:15:56

by Mike Galbraith

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Mon, 2017-03-27 at 21:16 +0300, Michael S. Tsirkin wrote:

> Mike, could you pls send lspci -vv that shows up after
> boot?

Presuming you mean the virtual box..

00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
Subsystem: Red Hat, Inc Qemu virtual machine
Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
Subsystem: Red Hat, Inc Qemu virtual machine
Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II] (prog-if 80 [Master])
Subsystem: Red Hat, Inc Qemu virtual machine
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Region 0: [virtual] Memory at 000001f0 (32-bit, non-prefetchable) [size=8]
Region 1: [virtual] Memory at 000003f0 (type 3, non-prefetchable)
Region 2: [virtual] Memory at 00000170 (32-bit, non-prefetchable) [size=8]
Region 3: [virtual] Memory at 00000370 (type 3, non-prefetchable)
Region 4: I/O ports at c240 [size=16]
Kernel driver in use: ata_piix
Kernel modules: ata_piix, pata_acpi, ata_generic

00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
Subsystem: Red Hat, Inc Qemu virtual machine
Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 9
Kernel driver in use: piix4_smbus
Kernel modules: i2c_piix4

00:02.0 VGA compatible controller: Red Hat, Inc. QXL paravirtual graphic card (rev 04) (prog-if 00 [VGA controller])
Subsystem: Red Hat, Inc QEMU Virtual Machine
Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 10
Region 0: Memory at f4000000 (32-bit, non-prefetchable) [size=64M]
Region 1: Memory at f8000000 (32-bit, non-prefetchable) [size=64M]
Region 2: Memory at fc054000 (32-bit, non-prefetchable) [size=8K]
Region 3: I/O ports at c140 [size=32]
Expansion ROM at 000c0000 [disabled] [size=128K]
Kernel driver in use: qxl
Kernel modules: qxl

00:03.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8100/8101L/8139 PCI Fast Ethernet Adapter (rev 20)
Subsystem: Red Hat, Inc QEMU Virtual Machine
Physical Slot: 3
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 11
Region 0: I/O ports at c000 [size=256]
Region 1: Memory at fc056000 (32-bit, non-prefetchable) [size=256]
Expansion ROM at fc000000 [disabled] [size=256K]
Kernel driver in use: 8139cp
Kernel modules: 8139cp, 8139too

00:04.0 Audio device: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) High Definition Audio Controller (rev 01)
Subsystem: Red Hat, Inc QEMU Virtual Machine
Physical Slot: 4
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 29
Region 0: Memory at fc050000 (32-bit, non-prefetchable) [size=16K]
Capabilities: [60] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee00000 Data: 4091
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel

00:05.0 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02) (prog-if 01 [AHCI 1.0])
Subsystem: Red Hat, Inc QEMU Virtual Machine
Physical Slot: 5
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 24
Region 4: I/O ports at c160 [size=32]
Region 5: Memory at fc057000 (32-bit, non-prefetchable) [size=4K]
Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee00000 Data: 4041
Capabilities: [a8] SATA HBA v1.0 BAR4 Offset=00000004
Kernel driver in use: ahci
Kernel modules: ahci

00:06.0 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 03) (prog-if 00 [UHCI])
Subsystem: Red Hat, Inc QEMU Virtual Machine
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 10
Region 4: I/O ports at c180 [size=32]
Kernel driver in use: uhci_hcd
Kernel modules: uhci_hcd

00:06.1 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 03) (prog-if 00 [UHCI])
Subsystem: Red Hat, Inc QEMU Virtual Machine
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin B routed to IRQ 11
Region 4: I/O ports at c1a0 [size=32]
Kernel driver in use: uhci_hcd
Kernel modules: uhci_hcd

00:06.2 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 03) (prog-if 00 [UHCI])
Subsystem: Red Hat, Inc QEMU Virtual Machine
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin C routed to IRQ 11
Region 4: I/O ports at c1c0 [size=32]
Kernel driver in use: uhci_hcd
Kernel modules: uhci_hcd

00:06.7 USB controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 03) (prog-if 20 [EHCI])
Subsystem: Red Hat, Inc QEMU Virtual Machine
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin D routed to IRQ 10
Region 0: Memory at fc058000 (32-bit, non-prefetchable) [size=4K]
Kernel driver in use: ehci-pci
Kernel modules: ehci_pci

00:07.0 Communication controller: Red Hat, Inc Virtio console
Subsystem: Red Hat, Inc Device 0003
Physical Slot: 7
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 11
Region 0: I/O ports at c1e0 [size=32]
Region 1: Memory at fc059000 (32-bit, non-prefetchable) [size=4K]
Capabilities: [40] MSI-X: Enable+ Count=2 Masked-
Vector table: BAR=1 offset=00000000
PBA: BAR=1 offset=00000800
Kernel driver in use: virtio-pci
Kernel modules: virtio_pci

00:08.0 Unclassified device [00ff]: Red Hat, Inc Virtio memory balloon
Subsystem: Red Hat, Inc Device 0005
Physical Slot: 8
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 11
Region 0: I/O ports at c200 [size=32]
Kernel driver in use: virtio-pci
Kernel modules: virtio_pci

00:09.0 Unclassified device [00ff]: Red Hat, Inc Virtio RNG
Subsystem: Red Hat, Inc Device 0004
Physical Slot: 9
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 10
Region 0: I/O ports at c220 [size=32]
Kernel driver in use: virtio-pci
Kernel modules: virtio_pci

00:0a.0 SCSI storage controller: Red Hat, Inc Virtio block device
Subsystem: Red Hat, Inc Device 0002
Physical Slot: 10
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 10
Region 0: I/O ports at c100 [size=64]
Region 1: Memory at fc05a000 (32-bit, non-prefetchable) [size=4K]
Capabilities: [40] MSI-X: Enable+ Count=2 Masked-
Vector table: BAR=1 offset=00000000
PBA: BAR=1 offset=00000800
Kernel driver in use: virtio-pci
Kernel modules: virtio_pci

2017-03-28 02:36:04

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Tue, Mar 28, 2017 at 03:08:20AM +0200, Mike Galbraith wrote:
> On Mon, 2017-03-27 at 21:16 +0300, Michael S. Tsirkin wrote:
>
> > Mike, could you pls send lspci -vv that shows up after
> > boot?
>
> Presuming you mean the virtual box..

Yes. Hmm nothing strange here. Can you pls post your QEMU
command line so I can try reproducing?

2017-03-28 03:16:26

by Mike Galbraith

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Tue, 2017-03-28 at 05:35 +0300, Michael S. Tsirkin wrote:
> On Tue, Mar 28, 2017 at 03:08:20AM +0200, Mike Galbraith wrote:
> > On Mon, 2017-03-27 at 21:16 +0300, Michael S. Tsirkin wrote:
> >
> > > Mike, could you pls send lspci -vv that shows up after
> > > boot?
> >
> > Presuming you mean the virtual box..
>
> Yes. Hmm nothing strange here. Can you pls post your QEMU
> command line so I can try reproducing?

I don't start from the command line, I poke buttons in gui tool for
virt-weenies, below is ps result (hope your monitor is 37 feet wide).

/usr/bin/qemu-system-x86_64 -name opensuse42.1 -S -machine pc-i440fx-2.3,accel=kvm,usb=off,vmport=off -cpu Haswell-noTSX,+abm,+pdpe1gb,+rdrand,+f16c,+osxsave,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme -m 8192 -realtime mlock=off -smp 8,sockets=1,cores=1,threads=8 -uuid afff4e95-262d-41ca-9189-f40c87c9375b -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/opensuse42.1.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x6.0x7 -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x6 -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x6.0x1 -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x6.0x2 -device ahci,id=sata0,bus=pci.0,addr=0x5 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x7 -drive file=/abuild/lib/libvirt/images/opensuse42.1.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=off -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0xa,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/dev/sr0,if=none,media=cdrom,id=drive-sata0-0-0,readonly=on,format=raw -device ide-cd,bus=sata0.0,drive=drive-sata0-0-0,id=sata0-0-0 -netdev tap,fd=22,id=hostnet0 -device rtl8139,netdev=hostnet0,id=net0,mac=52:54:00:be:db:82,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev spicevmc,id=charchannel0,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0 -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channel/target/opensuse42.1.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -device usb-tablet,id=input0 -spice port=5900,addr=127.0.0.1,disable-ticketing,seamless-migration=on -k de -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vgamem_mb=16,bus=pci.0,addr=0x2 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -chardev spicevmc,id=charredir0,name=usbredir -device usb-redir,chardev=charredir0,id=redir0 -chardev spicevmc,id=charredir1,name=usbredir -device usb-redir,chardev=charredir1,id=redir1 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -object rng-random,id=objrng0,filename=/dev/random -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x9 -msg timestamp=on

If you want the xml instead, holler.

-Mike

2017-03-28 15:38:52

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Tue, Mar 28, 2017 at 05:16:13AM +0200, Mike Galbraith wrote:
> On Tue, 2017-03-28 at 05:35 +0300, Michael S. Tsirkin wrote:
> > On Tue, Mar 28, 2017 at 03:08:20AM +0200, Mike Galbraith wrote:
> > > On Mon, 2017-03-27 at 21:16 +0300, Michael S. Tsirkin wrote:
> > >
> > > > Mike, could you pls send lspci -vv that shows up after
> > > > boot?
> > >
> > > Presuming you mean the virtual box..
> >
> > Yes. Hmm nothing strange here. Can you pls post your QEMU
> > command line so I can try reproducing?
>
> I don't start from the command line, I poke buttons in gui tool for
> virt-weenies, below is ps result (hope your monitor is 37 feet wide).
>
> /usr/bin/qemu-system-x86_64 -name opensuse42.1 -S -machine pc-i440fx-2.3,accel=kvm,usb=off,vmport=off -cpu Haswell-noTSX,+abm,+pdpe1gb,+rdrand,+f16c,+osxsave,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme -m 8192 -realtime mlock=off -smp 8,sockets=1,cores=1,threads=8 -uuid afff4e95-262d-41ca-9189-f40c87c9375b -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/opensuse42.1.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x6.0x7 -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x6 -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x6.0x1 -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x6.0x2 -device ahci,id=sata0,bus=pci.0,addr=0x5 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x7 -drive file=/abuild/lib/libvirt/images/opensuse42.1.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=off -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0xa,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/dev/sr0,if=none,media=cdrom,id=drive-sata0-0-0,readonly=on,format=raw -device ide-cd,bus=sata0.0,drive=drive-sata0-0-0,id=sata0-0-0 -netdev tap,fd=22,id=hostnet0 -device rtl8139,netdev=hostnet0,id=net0,mac=52:54:00:be:db:82,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev spicevmc,id=charchannel0,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0 -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channel/target/opensuse42.1.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -device usb-tablet,id=input0 -spice port=5900,addr=127.0.0.1,disable-ticketing,seamless-migration=on -k de -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vgamem_mb=16,bus=pci.0,addr=0x2 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -chardev spicevmc,id=charredir0,name=usbredir -device usb-redir,chardev=charredir0,id=redir0 -chardev spicevmc,id=charredir1,name=usbredir -device usb-redir,chardev=charredir1,id=redir1 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -object rng-random,id=objrng0,filename=/dev/random -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x9 -msg timestamp=on
>
> If you want the xml instead, holler.
>
> -Mike

No, that's fine, thanks. Anything specific that you do to trigger this?

--
MST

2017-03-28 16:34:26

by Mike Galbraith

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Tue, 2017-03-28 at 18:37 +0300, Michael S. Tsirkin wrote:

> Anything specific that you do to trigger this?

Nope, all I have to do is to poke kde Power/Session Hibernate button.

Not that it should matter, but the vm is a full clone of my 42.1 box,
including git server/repos etc, so has all whistles/bells/lard.

-Mike

2017-03-28 17:27:11

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Tue, Mar 28, 2017 at 06:33:53PM +0200, Mike Galbraith wrote:
> On Tue, 2017-03-28 at 18:37 +0300, Michael S. Tsirkin wrote:
>
> > Anything specific that you do to trigger this?
>
> Nope, all I have to do is to poke kde Power/Session Hibernate button.

Oh so you actually start hypernate? Is this what you mean when
you say "poke"?

> Not that it should matter, but the vm is a full clone of my 42.1 box,
> including git server/repos etc, so has all whistles/bells/lard.
>
> -Mike

2017-03-28 17:48:08

by Mike Galbraith

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Tue, 2017-03-28 at 20:27 +0300, Michael S. Tsirkin wrote:
> On Tue, Mar 28, 2017 at 06:33:53PM +0200, Mike Galbraith wrote:
> > On Tue, 2017-03-28 at 18:37 +0300, Michael S. Tsirkin wrote:
> >
> > > Anything specific that you do to trigger this?
> >
> > Nope, all I have to do is to poke kde Power/Session Hibernate
> > button.
>
> Oh so you actually start hypernate? Is this what you mean when
> you say "poke"?

s/hyper/hiber, but yes, and button poking == mouse clicking.

-Mike

2017-03-29 06:23:38

by Mike Galbraith

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Mon, 2017-03-27 at 20:18 +0200, Mike Galbraith wrote:

> BTW, WRT RT woes with $subject, I tried booting a generic kernel with
> threadirqs, and bingo, same deal, just a bit more painful than for RT,
> where there's no watchdog moaning accompanying the (preemptible) spin.

BTW++: the last hunk of this bandaid may be a bug fix. With only the
first two, box tried to use uninitialized stuff on hibernate, went
boom. Looks like that may be possible without help from me.

--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -2058,7 +2058,7 @@ static int virtcons_probe(struct virtio_
portdev->max_nr_ports = 1;

/* Don't test MULTIPORT at all if we're rproc: not a valid feature! */
- if (!is_rproc_serial(vdev) &&
+ if (!is_rproc_serial(vdev) && !IS_ENABLED(CONFIG_IRQ_FORCED_THREADING) &&
virtio_cread_feature(vdev, VIRTIO_CONSOLE_F_MULTIPORT,
struct virtio_console_config, max_nr_ports,
&portdev->max_nr_ports) == 0) {
@@ -2179,7 +2179,9 @@ static struct virtio_device_id id_table[

static unsigned int features[] = {
VIRTIO_CONSOLE_F_SIZE,
+#ifndef CONFIG_IRQ_FORCED_THREADING
VIRTIO_CONSOLE_F_MULTIPORT,
+#endif
};

static struct virtio_device_id rproc_serial_id_table[] = {
@@ -2202,14 +2204,16 @@ static int virtcons_freeze(struct virtio

vdev->config->reset(vdev);

- virtqueue_disable_cb(portdev->c_ivq);
+ if (use_multiport(portdev))
+ virtqueue_disable_cb(portdev->c_ivq);
cancel_work_sync(&portdev->control_work);
cancel_work_sync(&portdev->config_work);
/*
* Once more: if control_work_handler() was running, it would
* enable the cb as the last step.
*/
- virtqueue_disable_cb(portdev->c_ivq);
+ if (use_multiport(portdev))
+ virtqueue_disable_cb(portdev->c_ivq);
remove_controlq_data(portdev);

list_for_each_entry(port, &portdev->ports, list) {

2017-03-29 20:11:42

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Wed, Mar 29, 2017 at 08:23:22AM +0200, Mike Galbraith wrote:
> On Mon, 2017-03-27 at 20:18 +0200, Mike Galbraith wrote:
>
> > BTW, WRT RT woes with $subject, I tried booting a generic kernel with
> > threadirqs, and bingo, same deal, just a bit more painful than for RT,
> > where there's no watchdog moaning accompanying the (preemptible) spin.
>
> BTW++: the last hunk of this bandaid may be a bug fix. With only the
> first two, box tried to use uninitialized stuff on hibernate, went
> boom. Looks like that may be possible without help from me.
>
> --- a/drivers/char/virtio_console.c
> +++ b/drivers/char/virtio_console.c
> @@ -2058,7 +2058,7 @@ static int virtcons_probe(struct virtio_
> portdev->max_nr_ports = 1;
>
> /* Don't test MULTIPORT at all if we're rproc: not a valid feature! */
> - if (!is_rproc_serial(vdev) &&
> + if (!is_rproc_serial(vdev) && !IS_ENABLED(CONFIG_IRQ_FORCED_THREADING) &&
> virtio_cread_feature(vdev, VIRTIO_CONSOLE_F_MULTIPORT,
> struct virtio_console_config, max_nr_ports,
> &portdev->max_nr_ports) == 0) {
> @@ -2179,7 +2179,9 @@ static struct virtio_device_id id_table[
>
> static unsigned int features[] = {
> VIRTIO_CONSOLE_F_SIZE,
> +#ifndef CONFIG_IRQ_FORCED_THREADING
> VIRTIO_CONSOLE_F_MULTIPORT,
> +#endif
> };
>
> static struct virtio_device_id rproc_serial_id_table[] = {
> @@ -2202,14 +2204,16 @@ static int virtcons_freeze(struct virtio
>
> vdev->config->reset(vdev);
>
> - virtqueue_disable_cb(portdev->c_ivq);
> + if (use_multiport(portdev))
> + virtqueue_disable_cb(portdev->c_ivq);
> cancel_work_sync(&portdev->control_work);
> cancel_work_sync(&portdev->config_work);
> /*
> * Once more: if control_work_handler() was running, it would
> * enable the cb as the last step.
> */
> - virtqueue_disable_cb(portdev->c_ivq);
> + if (use_multiport(portdev))
> + virtqueue_disable_cb(portdev->c_ivq);
> remove_controlq_data(portdev);
>
> list_for_each_entry(port, &portdev->ports, list) {


Poking at this some more, I was able to reproduce at
least some warnings. I still do not see a spin
but is there a chance this helps your case too?

commit 85039ca3162295759cf986aa753778043a90012c
Author: Michael S. Tsirkin <[email protected]>
Date: Wed Mar 29 23:02:28 2017 +0300

virtio_pci: fix msix vector tracking on cleanup

virtio pci tracks allocated vectors in a variable: msix_vectors. This
isn't reset on del_vqs, as a result if reset is called after vqs are
deleted we try to synchronize non-existing irqs producing a (probably
harmless) warning.

Fixes: 07ec51480b5e ("virtio_pci: use shared interrupts for virtqueues")
Signed-off-by: Michael S. Tsirkin <[email protected]>

diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c
index baae423..a70bed6 100644
--- a/drivers/virtio/virtio_pci_common.c
+++ b/drivers/virtio/virtio_pci_common.c
@@ -151,6 +151,7 @@ void vp_del_vqs(struct virtio_device *vdev)
}

free_irq(pci_irq_vector(vp_dev->pci_dev, 0), vp_dev);
+ vp_dev->msix_vectors = 0;
pci_free_irq_vectors(vp_dev->pci_dev);
}

@@ -294,6 +295,7 @@ static int vp_find_vqs_msix(struct virtio_device *vdev, unsigned nvqs,
out_free_msix_names:
kfree(vp_dev->msix_names);
out_free_irq_vectors:
+ vp_dev->msix_vectors = 0;
pci_free_irq_vectors(vp_dev->pci_dev);
return err;
}

2017-03-29 20:19:35

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Wed, Mar 29, 2017 at 08:23:22AM +0200, Mike Galbraith wrote:
> On Mon, 2017-03-27 at 20:18 +0200, Mike Galbraith wrote:
>
> > BTW, WRT RT woes with $subject, I tried booting a generic kernel with
> > threadirqs, and bingo, same deal, just a bit more painful than for RT,
> > where there's no watchdog moaning accompanying the (preemptible) spin.
>
> BTW++: the last hunk of this bandaid may be a bug fix. With only the
> first two, box tried to use uninitialized stuff on hibernate, went
> boom. Looks like that may be possible without help from me.
>
> --- a/drivers/char/virtio_console.c
> +++ b/drivers/char/virtio_console.c
> @@ -2058,7 +2058,7 @@ static int virtcons_probe(struct virtio_
> portdev->max_nr_ports = 1;
>
> /* Don't test MULTIPORT at all if we're rproc: not a valid feature! */
> - if (!is_rproc_serial(vdev) &&
> + if (!is_rproc_serial(vdev) && !IS_ENABLED(CONFIG_IRQ_FORCED_THREADING) &&
> virtio_cread_feature(vdev, VIRTIO_CONSOLE_F_MULTIPORT,
> struct virtio_console_config, max_nr_ports,
> &portdev->max_nr_ports) == 0) {
> @@ -2179,7 +2179,9 @@ static struct virtio_device_id id_table[
>
> static unsigned int features[] = {
> VIRTIO_CONSOLE_F_SIZE,
> +#ifndef CONFIG_IRQ_FORCED_THREADING
> VIRTIO_CONSOLE_F_MULTIPORT,
> +#endif
> };

These look kind of questionable.
Is this part needed?

> static struct virtio_device_id rproc_serial_id_table[] = {
> @@ -2202,14 +2204,16 @@ static int virtcons_freeze(struct virtio
>
> vdev->config->reset(vdev);
>
> - virtqueue_disable_cb(portdev->c_ivq);
> + if (use_multiport(portdev))
> + virtqueue_disable_cb(portdev->c_ivq);
> cancel_work_sync(&portdev->control_work);
> cancel_work_sync(&portdev->config_work);
> /*
> * Once more: if control_work_handler() was running, it would
> * enable the cb as the last step.
> */
> - virtqueue_disable_cb(portdev->c_ivq);
> + if (use_multiport(portdev))
> + virtqueue_disable_cb(portdev->c_ivq);
> remove_controlq_data(portdev);
>
> list_for_each_entry(port, &portdev->ports, list) {

This looks real. No idea why would interrupt sharing
trigger anything like this but go figure.
Can you pls submit this separately with
a signature?

--
MST

2017-03-30 03:10:29

by Mike Galbraith

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Wed, 2017-03-29 at 23:10 +0300, Michael S. Tsirkin wrote:

> Poking at this some more, I was able to reproduce at
> least some warnings. I still do not see a spin
> but is there a chance this helps your case too?

Well, it's down to one warning, clean on the way back up.

WRT spin, you should need do nothing more than boot with threadirqs,
that's 100% repeatable here in absolutely virgin source. Attaching
(obese enterprise-ish) config.

[ 174.147626] ------------[ cut here ]------------
[ 174.147640] WARNING: CPU: 7 PID: 339 at drivers/pci/msi.c:1251 pci_irq_vector+0xcb/0xe0
[ 174.147640] Modules linked in: dm_mod(E) fuse(E) ebtable_filter(E) ebtables(E) nf_log_ipv6(E) rpcsec_gss_krb5(E) xt_pkttype(E) nfsv4(E) nf_log_ipv4(E) nf_log_common(E) dns_resolver(E) xt_LOG(E) xt_limit(E) nfs(E) fscache(E) af_packet(E) iscsi_ibft(E) iscsi_boot_sysfs(E) ip6t_REJECT(E) xt_tcpudp(E) nf_conntrack_ipv6(E) nf_defrag_ipv6(E) ip6table_raw(E) ipt_REJECT(E) iptable_raw(E) xt_CT(E) iptable_filter(E) ip6table_mangle(E) nf_conntrack_netbios_ns(E) nf_conntrack_broadcast(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) ip_tables(E) xt_conntrack(E) nf_conntrack(E) libcrc32c(E) ip6table_filter(E) ip6_tables(E) x_tables(E) snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_codec(E) joydev(E) snd_hda_core(E) snd_hwdep(E) snd_pcm(E) snd_timer(E) crct10dif_pclmul(E) snd(E) crc32_pclmul(E) ghash_clmulni_intel(E)
[ 174.147664] pcbc(E) soundcore(E) 8139too(E) aesni_intel(E) i2c_piix4(E) ppdev(E) aes_x86_64(E) virtio_balloon(E) crypto_simd(E) parport_pc(E) glue_helper(E) serio_raw(E) pcspkr(E) parport(E) cryptd(E) button(E) acpi_cpufreq(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) ext4(E) crc16(E) jbd2(E) mbcache(E) hid_generic(E) usbhid(E) sr_mod(E) cdrom(E) ata_generic(E) ata_piix(E) virtio_console(E) virtio_blk(E) virtio_rng(E) qxl(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) ehci_pci(E) sysimgblt(E) ahci(E) fb_sys_fops(E) libahci(E) ttm(E) uhci_hcd(E) ehci_hcd(E) virtio_pci(E) virtio_ring(E) drm(E) crc32c_intel(E) 8139cp(E) libata(E) usbcore(E) mii(E) virtio(E) floppy(E) sg(E) scsi_mod(E) autofs4(E)
[ 174.147702] CPU: 7 PID: 339 Comm: kworker/u16:3 Tainted: G E 4.11.0-default #2
[ 174.147702] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20161202_174313-build11a 04/01/2014
[ 174.147707] Workqueue: events_unbound async_run_entry_fn
[ 174.147708] Call Trace:
[ 174.147713] ? dump_stack+0x5c/0x85
[ 174.147718] ? __warn+0xc4/0xe0
[ 174.147721] ? pci_pm_poweroff+0xf0/0xf0
[ 174.147722] ? pci_irq_vector+0xcb/0xe0
[ 174.147725] ? vp_synchronize_vectors+0x3e/0x50 [virtio_pci]
[ 174.147727] ? virtcons_freeze+0x1f/0xa0 [virtio_console]
[ 174.147729] ? virtio_pci_freeze+0x19/0x40 [virtio_pci]
[ 174.147730] ? pci_pm_freeze+0x59/0xe0
[ 174.147737] ? dpm_run_callback+0x4d/0x170
[ 174.147738] ? __device_suspend+0x11f/0x3b0
[ 174.147739] ? pm_dev_dbg+0x70/0x70
[ 174.147739] ? async_suspend+0x1a/0x90
[ 174.147740] ? async_run_entry_fn+0x34/0x160
[ 174.147742] ? process_one_work+0x164/0x430
[ 174.147743] ? worker_thread+0x135/0x4d0
[ 174.147744] ? kthread+0xff/0x140
[ 174.147745] ? rescuer_thread+0x3c0/0x3c0
[ 174.147746] ? kthread_park+0x80/0x80
[ 174.147753] ? ret_from_fork+0x26/0x40
[ 174.147754] ---[ end trace 02cd3f1b527dc954 ]---


Attachments:
config.xz (36.91 kB)

2017-03-30 03:54:18

by Mike Galbraith

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Wed, 2017-03-29 at 23:19 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > > > &portdev->max_nr_ports) == 0) {
> > @@ -2179,7 +2179,9 @@ static struct virtio_device_id id_table[
> >
> > static unsigned int features[] = {
> > > > > > VIRTIO_CONSOLE_F_SIZE,
> > +#ifndef CONFIG_IRQ_FORCED_THREADING
> > > > > > VIRTIO_CONSOLE_F_MULTIPORT,
> > +#endif
> > };
>
> These look kind of questionable.
> Is this part needed?

I would have sworn it was, but double checking, nope, it's not.

Hm, so I could make a prettier bandaid with a runtime check.. but it'd
remain a bandaid, so I'll go do some beans 'n' biscuits work instead.

-Mike

2017-03-30 07:20:49

by Mike Galbraith

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Thu, 2017-03-30 at 05:10 +0200, Mike Galbraith wrote:

> WRT spin, you should need do nothing more than boot with threadirqs,
> that's 100% repeatable here in absolutely virgin source.

No idea why virtqueue_get_buf() in __send_control_msg() fails forever
with threadirqs, but marking that vq as being busted (it clearly is)
results in one gripe, and a vbox that seemingly cares not one whit that
something went missing. CONFIG_DEBUG_SHIRQ OTOH notices, mutters
something that sounds like "idiot" when I hibernate the thing ;-)

diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index e9b7e0b3cabe..831406dae1cb 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -567,6 +567,7 @@ static ssize_t __send_control_msg(struct ports_device *portdev, u32 port_id,
struct scatterlist sg[1];
struct virtqueue *vq;
unsigned int len;
+ unsigned long deadline = jiffies+1;

if (!use_multiport(portdev))
return 0;
@@ -583,9 +584,13 @@ static ssize_t __send_control_msg(struct ports_device *portdev, u32 port_id,

if (virtqueue_add_outbuf(vq, sg, 1, &portdev->cpkt, GFP_ATOMIC) == 0) {
virtqueue_kick(vq);
- while (!virtqueue_get_buf(vq, &len)
- && !virtqueue_is_broken(vq))
+ while (!virtqueue_get_buf(vq, &len) && !virtqueue_is_broken(vq)) {
cpu_relax();
+ if (time_after(jiffies, deadline)) {
+ trace_printk("Aw crap, I'm stuck.. breaking device\n");
+ virtio_break_device(portdev->vdev);
+ }
+ }
}

spin_unlock(&portdev->c_ovq_lock);

2017-03-31 03:22:36

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Fri, Mar 31, 2017 at 04:23:35AM +0300, Michael S. Tsirkin wrote:
> On Thu, Mar 30, 2017 at 09:20:35AM +0200, Mike Galbraith wrote:
> > On Thu, 2017-03-30 at 05:10 +0200, Mike Galbraith wrote:
> >
> > > WRT spin, you should need do nothing more than boot with threadirqs,
> > > that's 100% repeatable here in absolutely virgin source.
> >
> > No idea why virtqueue_get_buf() in __send_control_msg() fails forever
> > with threadirqs, but marking that vq as being busted (it clearly is)
> > results in one gripe, and a vbox that seemingly cares not one whit that
> > something went missing. CONFIG_DEBUG_SHIRQ OTOH notices, mutters
> > something that sounds like "idiot" when I hibernate the thing ;-)
> >
> > diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
> > index e9b7e0b3cabe..831406dae1cb 100644
> > --- a/drivers/char/virtio_console.c
> > +++ b/drivers/char/virtio_console.c
> > @@ -567,6 +567,7 @@ static ssize_t __send_control_msg(struct ports_device *portdev, u32 port_id,
> > struct scatterlist sg[1];
> > struct virtqueue *vq;
> > unsigned int len;
> > + unsigned long deadline = jiffies+1;
> >
> > if (!use_multiport(portdev))
> > return 0;
> > @@ -583,9 +584,13 @@ static ssize_t __send_control_msg(struct ports_device *portdev, u32 port_id,
> >
> > if (virtqueue_add_outbuf(vq, sg, 1, &portdev->cpkt, GFP_ATOMIC) == 0) {
> > virtqueue_kick(vq);
> > - while (!virtqueue_get_buf(vq, &len)
> > - && !virtqueue_is_broken(vq))
> > + while (!virtqueue_get_buf(vq, &len) && !virtqueue_is_broken(vq)) {
> > cpu_relax();
> > + if (time_after(jiffies, deadline)) {
> > + trace_printk("Aw crap, I'm stuck.. breaking device\n");
> > + virtio_break_device(portdev->vdev);
> > + }
> > + }
> > }
> >
> > spin_unlock(&portdev->c_ovq_lock);
>
>
> OK so with your help I was able to reproduce. Surprisingly easy:
>
> 1. add threadirqs
> 2. add to qemu -device virtio-serial-pci -no-shutdown
> 3. within guest, do echo disk > /sys/power/state
>
> This produces a warning. Looking deeper into it, I find:
> the device has 64 vqs. This line
>
> err = request_irq(pci_irq_vector(vp_dev->pci_dev, msix_vec),
> vring_interrupt, IRQF_SHARED,
> vp_dev->msix_names[j], vqs[i]);
>
> fails after assigning interrupts to 33 vqs.
> Is there a limit to how many threaded irqs can share a line?

In fact it fails on the 33'rd one, and I see this:

/*
* Unlikely to have 32 resp 64 irqs sharing one line,
* but who knows.
*/
if (thread_mask == ~0UL) {
printk(KERN_ERR "%s +%d\n", __FILE__, __LINE__);
ret = -EBUSY;
goto out_mask;
}


I'm not sure why does it fail after 32 on 64 bit, but as
virtio devices aren't limited to 32 vqs it looks like we
should go back to requesting the irq only once for all vqs.

Christoph, should I just revert for now, or do you
want to look into a smaller patch for this?

Another question is looking into intx support - that
should work but it seems to be broken at the moment.


>
> If so we need to rethink the whole approach.
>
> Still looking into it.
>
> Christoph, any idea?
>
>
> --
> MST

2017-03-31 08:21:05

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Fri, Mar 31, 2017 at 06:22:31AM +0300, Michael S. Tsirkin wrote:
> I'm not sure why does it fail after 32 on 64 bit, but as
> virtio devices aren't limited to 32 vqs it looks like we
> should go back to requesting the irq only once for all vqs.

Meh.

>
> Christoph, should I just revert for now, or do you
> want to look into a smaller patch for this?

I think we'll need to do a different patch than just a simple revert,
mostly because so much infrastructure depends on the patch.

I'll take a look over the weekend.

> Another question is looking into intx support - that
> should work but it seems to be broken at the moment.

Does it? I'm pretty sure I tested it back when I came up with the
series by artifically disabling MSI-X in the kernel. I can try this
again, though.

2017-03-31 16:47:09

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Fri, Mar 31, 2017 at 10:20:49AM +0200, Christoph Hellwig wrote:
> On Fri, Mar 31, 2017 at 06:22:31AM +0300, Michael S. Tsirkin wrote:
> > I'm not sure why does it fail after 32 on 64 bit, but as
> > virtio devices aren't limited to 32 vqs it looks like we
> > should go back to requesting the irq only once for all vqs.
>
> Meh.
>
> >
> > Christoph, should I just revert for now, or do you
> > want to look into a smaller patch for this?
>
> I think we'll need to do a different patch than just a simple revert,
> mostly because so much infrastructure depends on the patch.
>
> I'll take a look over the weekend.
>
> > Another question is looking into intx support - that
> > should work but it seems to be broken at the moment.
>
> Does it? I'm pretty sure I tested it back when I came up with the
> series by artifically disabling MSI-X in the kernel. I can try this
> again, though.

I'm not 100% sure - what I see is that we do not handle failure to
request irqs correctly, we seem to fall back on intx but
the following freeze then blows up trying to free non-existing
vectors.

Does not seem to trigger with just msix off so maybe that is
simply failure to recover from an error correctly.


--
MST

2017-04-03 14:18:26

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

Mike,

can you try the patch below?

---
>From fe41a30b54878cc631623b7511267125e0da4b15 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <[email protected]>
Date: Mon, 3 Apr 2017 14:51:35 +0200
Subject: virtio_pci: don't use shared irq for virtqueues

Reimplement the shared irq feature manually, as we might have a larger
number of virtqueues than the core shared interrupt code can handle
in threaded interrupt mode.

Signed-off-by: Christoph Hellwig <[email protected]>
---
drivers/virtio/virtio_pci_common.c | 142 +++++++++++++++++++++----------------
drivers/virtio/virtio_pci_common.h | 1 +
2 files changed, 83 insertions(+), 60 deletions(-)

diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c
index 590534910dc6..6dd719543410 100644
--- a/drivers/virtio/virtio_pci_common.c
+++ b/drivers/virtio/virtio_pci_common.c
@@ -137,6 +137,9 @@ void vp_del_vqs(struct virtio_device *vdev)
kfree(vp_dev->msix_vector_map);
}

+ /* free the shared virtuqueue irq if we don't use per-vq irqs */
+ if (vp_dev->shared_vq_vec)
+ free_irq(pci_irq_vector(vp_dev->pci_dev, 1), vp_dev);
free_irq(pci_irq_vector(vp_dev->pci_dev, 0), vp_dev);
pci_free_irq_vectors(vp_dev->pci_dev);
}
@@ -147,10 +150,10 @@ static int vp_find_vqs_msix(struct virtio_device *vdev, unsigned nvqs,
{
struct virtio_pci_device *vp_dev = to_vp_device(vdev);
const char *name = dev_name(&vp_dev->vdev.dev);
- int i, j, err = -ENOMEM, allocated_vectors, nvectors;
+ struct pci_dev *pdev = vp_dev->pci_dev;
+ int i, err = -ENOMEM, nvectors;
unsigned flags = PCI_IRQ_MSIX;
- bool shared = false;
- u16 msix_vec;
+ u16 msix_vec = 0;

if (desc) {
flags |= PCI_IRQ_AFFINITY;
@@ -162,19 +165,18 @@ static int vp_find_vqs_msix(struct virtio_device *vdev, unsigned nvqs,
if (callbacks[i])
nvectors++;

- /* Try one vector per queue first. */
- err = pci_alloc_irq_vectors_affinity(vp_dev->pci_dev, nvectors,
- nvectors, flags, desc);
+ /* Try one vector for config and one per queue first. */
+ err = pci_alloc_irq_vectors_affinity(pdev, nvectors, nvectors, flags,
+ desc);
if (err < 0) {
/* Fallback to one vector for config, one shared for queues. */
- shared = true;
- err = pci_alloc_irq_vectors(vp_dev->pci_dev, 2, 2,
+ nvectors = 2;
+ vp_dev->shared_vq_vec = true;
+ err = pci_alloc_irq_vectors(pdev, nvectors, nvectors,
PCI_IRQ_MSIX);
if (err < 0)
return err;
}
- if (err < 0)
- return err;

vp_dev->msix_vectors = nvectors;
vp_dev->msix_names = kmalloc_array(nvectors,
@@ -194,79 +196,99 @@ static int vp_find_vqs_msix(struct virtio_device *vdev, unsigned nvqs,
}

/* Set the vector used for configuration */
- snprintf(vp_dev->msix_names[0], sizeof(*vp_dev->msix_names),
+ snprintf(vp_dev->msix_names[msix_vec], sizeof(*vp_dev->msix_names),
"%s-config", name);
- err = request_irq(pci_irq_vector(vp_dev->pci_dev, 0), vp_config_changed,
- 0, vp_dev->msix_names[0], vp_dev);
+ err = request_irq(pci_irq_vector(pdev, msix_vec), vp_config_changed, 0,
+ vp_dev->msix_names[msix_vec], vp_dev);
if (err)
goto out_free_msix_affinity_masks;

/* Verify we had enough resources to assign the vector */
- if (vp_dev->config_vector(vp_dev, 0) == VIRTIO_MSI_NO_VECTOR) {
+ if (vp_dev->config_vector(vp_dev, msix_vec) == VIRTIO_MSI_NO_VECTOR) {
err = -EBUSY;
goto out_free_config_irq;
}

- vp_dev->msix_vector_map = kmalloc_array(nvqs,
- sizeof(*vp_dev->msix_vector_map), GFP_KERNEL);
- if (!vp_dev->msix_vector_map)
- goto out_disable_config_irq;
-
- allocated_vectors = j = 1; /* vector 0 is the config interrupt */
- for (i = 0; i < nvqs; ++i) {
- if (!names[i]) {
- vqs[i] = NULL;
- continue;
- }
-
- if (callbacks[i])
- msix_vec = allocated_vectors;
- else
- msix_vec = VIRTIO_MSI_NO_VECTOR;
-
- vqs[i] = vp_dev->setup_vq(vp_dev, i, callbacks[i], names[i],
- msix_vec);
- if (IS_ERR(vqs[i])) {
- err = PTR_ERR(vqs[i]);
- goto out_remove_vqs;
+ msix_vec++;
+
+ /*
+ * Use a different vector for each queue if they are available,
+ * else share the same vector for all VQs.
+ */
+ if (vp_dev->shared_vq_vec) {
+ snprintf(vp_dev->msix_names[msix_vec],
+ sizeof(vp_dev->msix_names[msix_vec]),
+ "%s-virtqueues", name);
+ err = request_irq(pci_irq_vector(pdev, msix_vec),
+ vp_vring_interrupt, 0,
+ vp_dev->msix_names[msix_vec], vp_dev);
+ if (err)
+ goto out_disable_config_irq;
+
+ for (i = 0; i < nvqs; ++i) {
+ if (!names[i]) {
+ vqs[i] = NULL;
+ continue;
+ }
+
+ vqs[i] = vp_dev->setup_vq(vp_dev, i, callbacks[i],
+ names[i], callbacks[i] ?
+ msix_vec : VIRTIO_MSI_NO_VECTOR);
+ if (IS_ERR(vqs[i])) {
+ err = PTR_ERR(vqs[i]);
+ goto out_remove_vqs;
+ }
}
+ } else {
+ vp_dev->msix_vector_map = kmalloc_array(nvqs,
+ sizeof(*vp_dev->msix_vector_map), GFP_KERNEL);
+ if (!vp_dev->msix_vector_map)
+ goto out_disable_config_irq;

- if (msix_vec == VIRTIO_MSI_NO_VECTOR) {
- vp_dev->msix_vector_map[i] = VIRTIO_MSI_NO_VECTOR;
- continue;
- }
+ for (i = 0; i < nvqs; ++i) {
+ if (!names[i]) {
+ vqs[i] = NULL;
+ continue;
+ }

- snprintf(vp_dev->msix_names[j],
- sizeof(*vp_dev->msix_names), "%s-%s",
- dev_name(&vp_dev->vdev.dev), names[i]);
- err = request_irq(pci_irq_vector(vp_dev->pci_dev, msix_vec),
- vring_interrupt, IRQF_SHARED,
- vp_dev->msix_names[j], vqs[i]);
- if (err) {
/* don't free this irq on error */
vp_dev->msix_vector_map[i] = VIRTIO_MSI_NO_VECTOR;
- goto out_remove_vqs;
+
+ vqs[i] = vp_dev->setup_vq(vp_dev, i, callbacks[i],
+ names[i], callbacks[i] ?
+ msix_vec : VIRTIO_MSI_NO_VECTOR);
+ if (IS_ERR(vqs[i])) {
+ err = PTR_ERR(vqs[i]);
+ goto out_remove_vqs;
+ }
+
+ if (!callbacks[i])
+ continue;
+
+ snprintf(vp_dev->msix_names[msix_vec],
+ sizeof(*vp_dev->msix_names), "%s-%s",
+ dev_name(&vp_dev->vdev.dev), names[i]);
+ err = request_irq(pci_irq_vector(pdev, msix_vec),
+ vring_interrupt, IRQF_SHARED,
+ vp_dev->msix_names[msix_vec],
+ vqs[i]);
+ if (err)
+ goto out_remove_vqs;
+ vp_dev->msix_vector_map[i] = msix_vec++;
}
- vp_dev->msix_vector_map[i] = msix_vec;
- j++;
-
- /*
- * Use a different vector for each queue if they are available,
- * else share the same vector for all VQs.
- */
- if (!shared)
- allocated_vectors++;
}

return 0;

out_remove_vqs:
vp_remove_vqs(vdev);
+ if (vp_dev->shared_vq_vec)
+ free_irq(pci_irq_vector(pdev, 1), vp_dev);
kfree(vp_dev->msix_vector_map);
out_disable_config_irq:
vp_dev->config_vector(vp_dev, VIRTIO_MSI_NO_VECTOR);
out_free_config_irq:
- free_irq(pci_irq_vector(vp_dev->pci_dev, 0), vp_dev);
+ free_irq(pci_irq_vector(pdev, 0), vp_dev);
out_free_msix_affinity_masks:
for (i = 0; i < nvectors; i++) {
if (vp_dev->msix_affinity_masks[i])
@@ -276,7 +298,7 @@ static int vp_find_vqs_msix(struct virtio_device *vdev, unsigned nvqs,
out_free_msix_names:
kfree(vp_dev->msix_names);
out_free_irq_vectors:
- pci_free_irq_vectors(vp_dev->pci_dev);
+ pci_free_irq_vectors(pdev);
return err;
}

@@ -346,7 +368,7 @@ int vp_set_vq_affinity(struct virtqueue *vq, int cpu)
if (!vq->callback)
return -EINVAL;

- if (vp_dev->pci_dev->msix_enabled) {
+ if (vp_dev->msix_vector_map) {
int vec = vp_dev->msix_vector_map[vq->index];
struct cpumask *mask = vp_dev->msix_affinity_masks[vec];
unsigned int irq = pci_irq_vector(vp_dev->pci_dev, vec);
diff --git a/drivers/virtio/virtio_pci_common.h b/drivers/virtio/virtio_pci_common.h
index ac8c9d788964..d6d7fb99e47f 100644
--- a/drivers/virtio/virtio_pci_common.h
+++ b/drivers/virtio/virtio_pci_common.h
@@ -72,6 +72,7 @@ struct virtio_pci_device {
int msix_vectors;
/* Map of per-VQ MSI-X vectors, may be NULL */
unsigned *msix_vector_map;
+ bool shared_vq_vec;

struct virtqueue *(*setup_vq)(struct virtio_pci_device *vp_dev,
unsigned idx,
--
2.11.0

2017-04-03 15:49:15

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Mon, Apr 03, 2017 at 04:18:23PM +0200, Christoph Hellwig wrote:
> Mike,
>
> can you try the patch below?

It's really easy to test on qemu so I will - just add a dummy
virtio-serial-pci device with -device virtio-serial-pci and
add threadirqs to kernel command line.

However it doesn't look like this will fix the error recovery
for when request irq fails - it will just make the error less likely.

So we still need to look into that - failure should recover
and use the intx path, ATM it causes hybernation to hang.

> ---
> >From fe41a30b54878cc631623b7511267125e0da4b15 Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig <[email protected]>
> Date: Mon, 3 Apr 2017 14:51:35 +0200
> Subject: virtio_pci: don't use shared irq for virtqueues
>
> Reimplement the shared irq feature manually, as we might have a larger
> number of virtqueues than the core shared interrupt code can handle
> in threaded interrupt mode.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> drivers/virtio/virtio_pci_common.c | 142 +++++++++++++++++++++----------------
> drivers/virtio/virtio_pci_common.h | 1 +
> 2 files changed, 83 insertions(+), 60 deletions(-)
>
> diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c
> index 590534910dc6..6dd719543410 100644
> --- a/drivers/virtio/virtio_pci_common.c
> +++ b/drivers/virtio/virtio_pci_common.c
> @@ -137,6 +137,9 @@ void vp_del_vqs(struct virtio_device *vdev)
> kfree(vp_dev->msix_vector_map);
> }
>
> + /* free the shared virtuqueue irq if we don't use per-vq irqs */
> + if (vp_dev->shared_vq_vec)
> + free_irq(pci_irq_vector(vp_dev->pci_dev, 1), vp_dev);
> free_irq(pci_irq_vector(vp_dev->pci_dev, 0), vp_dev);
> pci_free_irq_vectors(vp_dev->pci_dev);
> }
> @@ -147,10 +150,10 @@ static int vp_find_vqs_msix(struct virtio_device *vdev, unsigned nvqs,
> {
> struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> const char *name = dev_name(&vp_dev->vdev.dev);
> - int i, j, err = -ENOMEM, allocated_vectors, nvectors;
> + struct pci_dev *pdev = vp_dev->pci_dev;
> + int i, err = -ENOMEM, nvectors;
> unsigned flags = PCI_IRQ_MSIX;
> - bool shared = false;
> - u16 msix_vec;
> + u16 msix_vec = 0;
>
> if (desc) {
> flags |= PCI_IRQ_AFFINITY;
> @@ -162,19 +165,18 @@ static int vp_find_vqs_msix(struct virtio_device *vdev, unsigned nvqs,
> if (callbacks[i])
> nvectors++;
>
> - /* Try one vector per queue first. */
> - err = pci_alloc_irq_vectors_affinity(vp_dev->pci_dev, nvectors,
> - nvectors, flags, desc);
> + /* Try one vector for config and one per queue first. */
> + err = pci_alloc_irq_vectors_affinity(pdev, nvectors, nvectors, flags,
> + desc);
> if (err < 0) {
> /* Fallback to one vector for config, one shared for queues. */
> - shared = true;
> - err = pci_alloc_irq_vectors(vp_dev->pci_dev, 2, 2,
> + nvectors = 2;
> + vp_dev->shared_vq_vec = true;
> + err = pci_alloc_irq_vectors(pdev, nvectors, nvectors,
> PCI_IRQ_MSIX);
> if (err < 0)
> return err;
> }
> - if (err < 0)
> - return err;
>
> vp_dev->msix_vectors = nvectors;
> vp_dev->msix_names = kmalloc_array(nvectors,
> @@ -194,79 +196,99 @@ static int vp_find_vqs_msix(struct virtio_device *vdev, unsigned nvqs,
> }
>
> /* Set the vector used for configuration */
> - snprintf(vp_dev->msix_names[0], sizeof(*vp_dev->msix_names),
> + snprintf(vp_dev->msix_names[msix_vec], sizeof(*vp_dev->msix_names),
> "%s-config", name);
> - err = request_irq(pci_irq_vector(vp_dev->pci_dev, 0), vp_config_changed,
> - 0, vp_dev->msix_names[0], vp_dev);
> + err = request_irq(pci_irq_vector(pdev, msix_vec), vp_config_changed, 0,
> + vp_dev->msix_names[msix_vec], vp_dev);
> if (err)
> goto out_free_msix_affinity_masks;
>
> /* Verify we had enough resources to assign the vector */
> - if (vp_dev->config_vector(vp_dev, 0) == VIRTIO_MSI_NO_VECTOR) {
> + if (vp_dev->config_vector(vp_dev, msix_vec) == VIRTIO_MSI_NO_VECTOR) {
> err = -EBUSY;
> goto out_free_config_irq;
> }
>
> - vp_dev->msix_vector_map = kmalloc_array(nvqs,
> - sizeof(*vp_dev->msix_vector_map), GFP_KERNEL);
> - if (!vp_dev->msix_vector_map)
> - goto out_disable_config_irq;
> -
> - allocated_vectors = j = 1; /* vector 0 is the config interrupt */
> - for (i = 0; i < nvqs; ++i) {
> - if (!names[i]) {
> - vqs[i] = NULL;
> - continue;
> - }
> -
> - if (callbacks[i])
> - msix_vec = allocated_vectors;
> - else
> - msix_vec = VIRTIO_MSI_NO_VECTOR;
> -
> - vqs[i] = vp_dev->setup_vq(vp_dev, i, callbacks[i], names[i],
> - msix_vec);
> - if (IS_ERR(vqs[i])) {
> - err = PTR_ERR(vqs[i]);
> - goto out_remove_vqs;
> + msix_vec++;
> +
> + /*
> + * Use a different vector for each queue if they are available,
> + * else share the same vector for all VQs.
> + */
> + if (vp_dev->shared_vq_vec) {
> + snprintf(vp_dev->msix_names[msix_vec],
> + sizeof(vp_dev->msix_names[msix_vec]),
> + "%s-virtqueues", name);
> + err = request_irq(pci_irq_vector(pdev, msix_vec),
> + vp_vring_interrupt, 0,
> + vp_dev->msix_names[msix_vec], vp_dev);
> + if (err)
> + goto out_disable_config_irq;
> +
> + for (i = 0; i < nvqs; ++i) {
> + if (!names[i]) {
> + vqs[i] = NULL;
> + continue;
> + }
> +
> + vqs[i] = vp_dev->setup_vq(vp_dev, i, callbacks[i],
> + names[i], callbacks[i] ?
> + msix_vec : VIRTIO_MSI_NO_VECTOR);
> + if (IS_ERR(vqs[i])) {
> + err = PTR_ERR(vqs[i]);
> + goto out_remove_vqs;
> + }
> }
> + } else {
> + vp_dev->msix_vector_map = kmalloc_array(nvqs,
> + sizeof(*vp_dev->msix_vector_map), GFP_KERNEL);
> + if (!vp_dev->msix_vector_map)
> + goto out_disable_config_irq;
>
> - if (msix_vec == VIRTIO_MSI_NO_VECTOR) {
> - vp_dev->msix_vector_map[i] = VIRTIO_MSI_NO_VECTOR;
> - continue;
> - }
> + for (i = 0; i < nvqs; ++i) {
> + if (!names[i]) {
> + vqs[i] = NULL;
> + continue;
> + }
>
> - snprintf(vp_dev->msix_names[j],
> - sizeof(*vp_dev->msix_names), "%s-%s",
> - dev_name(&vp_dev->vdev.dev), names[i]);
> - err = request_irq(pci_irq_vector(vp_dev->pci_dev, msix_vec),
> - vring_interrupt, IRQF_SHARED,
> - vp_dev->msix_names[j], vqs[i]);
> - if (err) {
> /* don't free this irq on error */
> vp_dev->msix_vector_map[i] = VIRTIO_MSI_NO_VECTOR;
> - goto out_remove_vqs;
> +
> + vqs[i] = vp_dev->setup_vq(vp_dev, i, callbacks[i],
> + names[i], callbacks[i] ?
> + msix_vec : VIRTIO_MSI_NO_VECTOR);
> + if (IS_ERR(vqs[i])) {
> + err = PTR_ERR(vqs[i]);
> + goto out_remove_vqs;
> + }
> +
> + if (!callbacks[i])
> + continue;
> +
> + snprintf(vp_dev->msix_names[msix_vec],
> + sizeof(*vp_dev->msix_names), "%s-%s",
> + dev_name(&vp_dev->vdev.dev), names[i]);
> + err = request_irq(pci_irq_vector(pdev, msix_vec),
> + vring_interrupt, IRQF_SHARED,
> + vp_dev->msix_names[msix_vec],
> + vqs[i]);
> + if (err)
> + goto out_remove_vqs;
> + vp_dev->msix_vector_map[i] = msix_vec++;
> }
> - vp_dev->msix_vector_map[i] = msix_vec;
> - j++;
> -
> - /*
> - * Use a different vector for each queue if they are available,
> - * else share the same vector for all VQs.
> - */
> - if (!shared)
> - allocated_vectors++;
> }
>
> return 0;
>
> out_remove_vqs:
> vp_remove_vqs(vdev);
> + if (vp_dev->shared_vq_vec)
> + free_irq(pci_irq_vector(pdev, 1), vp_dev);
> kfree(vp_dev->msix_vector_map);
> out_disable_config_irq:
> vp_dev->config_vector(vp_dev, VIRTIO_MSI_NO_VECTOR);
> out_free_config_irq:
> - free_irq(pci_irq_vector(vp_dev->pci_dev, 0), vp_dev);
> + free_irq(pci_irq_vector(pdev, 0), vp_dev);
> out_free_msix_affinity_masks:
> for (i = 0; i < nvectors; i++) {
> if (vp_dev->msix_affinity_masks[i])
> @@ -276,7 +298,7 @@ static int vp_find_vqs_msix(struct virtio_device *vdev, unsigned nvqs,
> out_free_msix_names:
> kfree(vp_dev->msix_names);
> out_free_irq_vectors:
> - pci_free_irq_vectors(vp_dev->pci_dev);
> + pci_free_irq_vectors(pdev);
> return err;
> }
>
> @@ -346,7 +368,7 @@ int vp_set_vq_affinity(struct virtqueue *vq, int cpu)
> if (!vq->callback)
> return -EINVAL;
>
> - if (vp_dev->pci_dev->msix_enabled) {
> + if (vp_dev->msix_vector_map) {
> int vec = vp_dev->msix_vector_map[vq->index];
> struct cpumask *mask = vp_dev->msix_affinity_masks[vec];
> unsigned int irq = pci_irq_vector(vp_dev->pci_dev, vec);
> diff --git a/drivers/virtio/virtio_pci_common.h b/drivers/virtio/virtio_pci_common.h
> index ac8c9d788964..d6d7fb99e47f 100644
> --- a/drivers/virtio/virtio_pci_common.h
> +++ b/drivers/virtio/virtio_pci_common.h
> @@ -72,6 +72,7 @@ struct virtio_pci_device {
> int msix_vectors;
> /* Map of per-VQ MSI-X vectors, may be NULL */
> unsigned *msix_vector_map;
> + bool shared_vq_vec;
>
> struct virtqueue *(*setup_vq)(struct virtio_pci_device *vp_dev,
> unsigned idx,
> --
> 2.11.0

2017-04-03 16:14:42

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Mon, Apr 03, 2017 at 04:18:23PM +0200, Christoph Hellwig wrote:
> Mike,
>
> can you try the patch below?
>
> ---
> >From fe41a30b54878cc631623b7511267125e0da4b15 Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig <[email protected]>
> Date: Mon, 3 Apr 2017 14:51:35 +0200
> Subject: virtio_pci: don't use shared irq for virtqueues
>
> Reimplement the shared irq feature manually, as we might have a larger
> number of virtqueues than the core shared interrupt code can handle
> in threaded interrupt mode.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> drivers/virtio/virtio_pci_common.c | 142 +++++++++++++++++++++----------------
> drivers/virtio/virtio_pci_common.h | 1 +
> 2 files changed, 83 insertions(+), 60 deletions(-)

Well the original patch this is trying to fix is
07ec51480b5eb1233f8c1b0f5d7a7c8d1247c507 which dropped just 40 lines
with documentation. It did this by re-using error handling to switch
from per-vq to non-per-vq mode. Now this has separate flows for errors
and per-vq non-per-vq switch and (I think, as a result) is adding 140
lines which doesn't make me very happy.

> diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c
> index 590534910dc6..6dd719543410 100644
> --- a/drivers/virtio/virtio_pci_common.c
> +++ b/drivers/virtio/virtio_pci_common.c
> @@ -137,6 +137,9 @@ void vp_del_vqs(struct virtio_device *vdev)
> kfree(vp_dev->msix_vector_map);
> }
>
> + /* free the shared virtuqueue irq if we don't use per-vq irqs */

typo

> + if (vp_dev->shared_vq_vec)
> + free_irq(pci_irq_vector(vp_dev->pci_dev, 1), vp_dev);

So we used to have enums for 1 and 0. I think it was cleaner.


> free_irq(pci_irq_vector(vp_dev->pci_dev, 0), vp_dev);
> pci_free_irq_vectors(vp_dev->pci_dev);
> }
> @@ -147,10 +150,10 @@ static int vp_find_vqs_msix(struct virtio_device *vdev, unsigned nvqs,
> {
> struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> const char *name = dev_name(&vp_dev->vdev.dev);
> - int i, j, err = -ENOMEM, allocated_vectors, nvectors;
> + struct pci_dev *pdev = vp_dev->pci_dev;
> + int i, err = -ENOMEM, nvectors;
> unsigned flags = PCI_IRQ_MSIX;
> - bool shared = false;
> - u16 msix_vec;
> + u16 msix_vec = 0;
>
> if (desc) {
> flags |= PCI_IRQ_AFFINITY;
> @@ -162,19 +165,18 @@ static int vp_find_vqs_msix(struct virtio_device *vdev, unsigned nvqs,
> if (callbacks[i])
> nvectors++;
>
> - /* Try one vector per queue first. */
> - err = pci_alloc_irq_vectors_affinity(vp_dev->pci_dev, nvectors,
> - nvectors, flags, desc);
> + /* Try one vector for config and one per queue first. */
> + err = pci_alloc_irq_vectors_affinity(pdev, nvectors, nvectors, flags,
> + desc);
> if (err < 0) {
> /* Fallback to one vector for config, one shared for queues. */
> - shared = true;
> - err = pci_alloc_irq_vectors(vp_dev->pci_dev, 2, 2,
> + nvectors = 2;
> + vp_dev->shared_vq_vec = true;
> + err = pci_alloc_irq_vectors(pdev, nvectors, nvectors,
> PCI_IRQ_MSIX);
> if (err < 0)
> return err;
> }
> - if (err < 0)
> - return err;
>
> vp_dev->msix_vectors = nvectors;
> vp_dev->msix_names = kmalloc_array(nvectors,
> @@ -194,79 +196,99 @@ static int vp_find_vqs_msix(struct virtio_device *vdev, unsigned nvqs,
> }
>
> /* Set the vector used for configuration */
> - snprintf(vp_dev->msix_names[0], sizeof(*vp_dev->msix_names),
> + snprintf(vp_dev->msix_names[msix_vec], sizeof(*vp_dev->msix_names),
> "%s-config", name);
> - err = request_irq(pci_irq_vector(vp_dev->pci_dev, 0), vp_config_changed,
> - 0, vp_dev->msix_names[0], vp_dev);
> + err = request_irq(pci_irq_vector(pdev, msix_vec), vp_config_changed, 0,
> + vp_dev->msix_names[msix_vec], vp_dev);
> if (err)
> goto out_free_msix_affinity_masks;
>
> /* Verify we had enough resources to assign the vector */
> - if (vp_dev->config_vector(vp_dev, 0) == VIRTIO_MSI_NO_VECTOR) {
> + if (vp_dev->config_vector(vp_dev, msix_vec) == VIRTIO_MSI_NO_VECTOR) {
> err = -EBUSY;
> goto out_free_config_irq;
> }
>
> - vp_dev->msix_vector_map = kmalloc_array(nvqs,
> - sizeof(*vp_dev->msix_vector_map), GFP_KERNEL);
> - if (!vp_dev->msix_vector_map)
> - goto out_disable_config_irq;
> -
> - allocated_vectors = j = 1; /* vector 0 is the config interrupt */
> - for (i = 0; i < nvqs; ++i) {
> - if (!names[i]) {
> - vqs[i] = NULL;
> - continue;
> - }
> -
> - if (callbacks[i])
> - msix_vec = allocated_vectors;
> - else
> - msix_vec = VIRTIO_MSI_NO_VECTOR;
> -
> - vqs[i] = vp_dev->setup_vq(vp_dev, i, callbacks[i], names[i],
> - msix_vec);
> - if (IS_ERR(vqs[i])) {
> - err = PTR_ERR(vqs[i]);
> - goto out_remove_vqs;
> + msix_vec++;
> +
> + /*
> + * Use a different vector for each queue if they are available,
> + * else share the same vector for all VQs.
> + */
> + if (vp_dev->shared_vq_vec) {
> + snprintf(vp_dev->msix_names[msix_vec],
> + sizeof(vp_dev->msix_names[msix_vec]),
> + "%s-virtqueues", name);
> + err = request_irq(pci_irq_vector(pdev, msix_vec),
> + vp_vring_interrupt, 0,
> + vp_dev->msix_names[msix_vec], vp_dev);
> + if (err)
> + goto out_disable_config_irq;
> +
> + for (i = 0; i < nvqs; ++i) {
> + if (!names[i]) {
> + vqs[i] = NULL;
> + continue;
> + }
> +
> + vqs[i] = vp_dev->setup_vq(vp_dev, i, callbacks[i],
> + names[i], callbacks[i] ?
> + msix_vec : VIRTIO_MSI_NO_VECTOR);
> + if (IS_ERR(vqs[i])) {
> + err = PTR_ERR(vqs[i]);
> + goto out_remove_vqs;
> + }
> }
> + } else {
> + vp_dev->msix_vector_map = kmalloc_array(nvqs,
> + sizeof(*vp_dev->msix_vector_map), GFP_KERNEL);
> + if (!vp_dev->msix_vector_map)
> + goto out_disable_config_irq;
>
> - if (msix_vec == VIRTIO_MSI_NO_VECTOR) {
> - vp_dev->msix_vector_map[i] = VIRTIO_MSI_NO_VECTOR;
> - continue;
> - }
> + for (i = 0; i < nvqs; ++i) {
> + if (!names[i]) {
> + vqs[i] = NULL;
> + continue;
> + }
>
> - snprintf(vp_dev->msix_names[j],
> - sizeof(*vp_dev->msix_names), "%s-%s",
> - dev_name(&vp_dev->vdev.dev), names[i]);
> - err = request_irq(pci_irq_vector(vp_dev->pci_dev, msix_vec),
> - vring_interrupt, IRQF_SHARED,
> - vp_dev->msix_names[j], vqs[i]);
> - if (err) {
> /* don't free this irq on error */
> vp_dev->msix_vector_map[i] = VIRTIO_MSI_NO_VECTOR;
> - goto out_remove_vqs;
> +
> + vqs[i] = vp_dev->setup_vq(vp_dev, i, callbacks[i],
> + names[i], callbacks[i] ?
> + msix_vec : VIRTIO_MSI_NO_VECTOR);
> + if (IS_ERR(vqs[i])) {
> + err = PTR_ERR(vqs[i]);
> + goto out_remove_vqs;
> + }
> +
> + if (!callbacks[i])
> + continue;
> +
> + snprintf(vp_dev->msix_names[msix_vec],
> + sizeof(*vp_dev->msix_names), "%s-%s",
> + dev_name(&vp_dev->vdev.dev), names[i]);
> + err = request_irq(pci_irq_vector(pdev, msix_vec),
> + vring_interrupt, IRQF_SHARED,
> + vp_dev->msix_names[msix_vec],
> + vqs[i]);
> + if (err)
> + goto out_remove_vqs;
> + vp_dev->msix_vector_map[i] = msix_vec++;
> }
> - vp_dev->msix_vector_map[i] = msix_vec;
> - j++;
> -
> - /*
> - * Use a different vector for each queue if they are available,
> - * else share the same vector for all VQs.
> - */
> - if (!shared)
> - allocated_vectors++;
> }
>
> return 0;
>
> out_remove_vqs:
> vp_remove_vqs(vdev);
> + if (vp_dev->shared_vq_vec)
> + free_irq(pci_irq_vector(pdev, 1), vp_dev);
> kfree(vp_dev->msix_vector_map);
> out_disable_config_irq:
> vp_dev->config_vector(vp_dev, VIRTIO_MSI_NO_VECTOR);
> out_free_config_irq:
> - free_irq(pci_irq_vector(vp_dev->pci_dev, 0), vp_dev);
> + free_irq(pci_irq_vector(pdev, 0), vp_dev);
> out_free_msix_affinity_masks:
> for (i = 0; i < nvectors; i++) {
> if (vp_dev->msix_affinity_masks[i])
> @@ -276,7 +298,7 @@ static int vp_find_vqs_msix(struct virtio_device *vdev, unsigned nvqs,
> out_free_msix_names:
> kfree(vp_dev->msix_names);
> out_free_irq_vectors:
> - pci_free_irq_vectors(vp_dev->pci_dev);
> + pci_free_irq_vectors(pdev);
> return err;
> }
>
> @@ -346,7 +368,7 @@ int vp_set_vq_affinity(struct virtqueue *vq, int cpu)
> if (!vq->callback)
> return -EINVAL;
>
> - if (vp_dev->pci_dev->msix_enabled) {
> + if (vp_dev->msix_vector_map) {
> int vec = vp_dev->msix_vector_map[vq->index];
> struct cpumask *mask = vp_dev->msix_affinity_masks[vec];
> unsigned int irq = pci_irq_vector(vp_dev->pci_dev, vec);
> diff --git a/drivers/virtio/virtio_pci_common.h b/drivers/virtio/virtio_pci_common.h
> index ac8c9d788964..d6d7fb99e47f 100644
> --- a/drivers/virtio/virtio_pci_common.h
> +++ b/drivers/virtio/virtio_pci_common.h
> @@ -72,6 +72,7 @@ struct virtio_pci_device {
> int msix_vectors;
> /* Map of per-VQ MSI-X vectors, may be NULL */
> unsigned *msix_vector_map;
> + bool shared_vq_vec;


Pls add documentation. In fact I'd prefer a counter of vectors
used than a separate mode.
>
> struct virtqueue *(*setup_vq)(struct virtio_pci_device *vp_dev,
> unsigned idx,
> --
> 2.11.0

2017-04-03 17:56:44

by Mike Galbraith

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Mon, 2017-04-03 at 16:18 +0200, Christoph Hellwig wrote:
> Mike,
>
> can you try the patch below?

No more spinning kworker woes, but I still have a warning on hibernate,
threadirqs invariant. I'm also seeing intermittent post hibernate hang
funnies in virgin source +- this patch, and without threadirqs.

[ 110.223953] WARNING: CPU: 5 PID: 452 at drivers/pci/msi.c:1261 pci_irq_vector+0xb1/0xe0

-Mike

2017-04-03 18:11:24

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Mon, Apr 03, 2017 at 07:56:32PM +0200, Mike Galbraith wrote:
> On Mon, 2017-04-03 at 16:18 +0200, Christoph Hellwig wrote:
> > Mike,
> >
> > can you try the patch below?
>
> No more spinning kworker woes, but I still have a warning on hibernate,
> threadirqs invariant. I'm also seeing intermittent post hibernate hang
> funnies in virgin source +- this patch, and without threadirqs.
>
> [ 110.223953] WARNING: CPU: 5 PID: 452 at drivers/pci/msi.c:1261 pci_irq_vector+0xb1/0xe0
>
> -Mike

I just sent a patch fixing that.
However I think we want to print a message when MSI fails to work so we
know guest is falling back on legacy interrupts.

--
MST

2017-04-04 04:03:04

by Mike Galbraith

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Mon, 2017-04-03 at 21:11 +0300, Michael S. Tsirkin wrote:
> On Mon, Apr 03, 2017 at 07:56:32PM +0200, Mike Galbraith wrote:
> > On Mon, 2017-04-03 at 16:18 +0200, Christoph Hellwig wrote:
> > > Mike,
> > >
> > > can you try the patch below?
> >
> > No more spinning kworker woes, but I still have a warning on hibernate,
> > threadirqs invariant. I'm also seeing intermittent post hibernate hang
> > funnies in virgin source +- this patch, and without threadirqs.
> >
> > [ 110.223953] WARNING: CPU: 5 PID: 452 at drivers/pci/msi.c:1261 pci_irq_vector+0xb1/0xe0
> >
> > > > -Mike
>
> I just sent a patch fixing that.
> However I think we want to print a message when MSI fails to work so we
> know guest is falling back on legacy interrupts.

The warning persists.

[ 137.656423] WARNING: CPU: 1 PID: 535 at drivers/pci/msi.c:1261 pci_irq_vector+0xb1/0xe0

WRT the post hibernate hang business, that is apparently not part of
the 4.11 woes (at least not solely), as 4.10.8 did not survive a 10
hibernate cycle loop. RT is better at reproducing trouble (shrug, it
frequently is), but it matters not whether I'm running 4.10, master or
master-rt, they will all hang.

WRT gripe, I wedged virtio_pci-fix-msix-vector-tracking-on-cleanup in
on top, but it wasn't impressed.

-Mike

2017-04-04 13:38:11

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Tue, Apr 04, 2017 at 06:02:52AM +0200, Mike Galbraith wrote:
> On Mon, 2017-04-03 at 21:11 +0300, Michael S. Tsirkin wrote:
> > On Mon, Apr 03, 2017 at 07:56:32PM +0200, Mike Galbraith wrote:
> > > On Mon, 2017-04-03 at 16:18 +0200, Christoph Hellwig wrote:
> > > > Mike,
> > > >
> > > > can you try the patch below?
> > >
> > > No more spinning kworker woes, but I still have a warning on hibernate,
> > > threadirqs invariant. I'm also seeing intermittent post hibernate hang
> > > funnies in virgin source +- this patch, and without threadirqs.
> > >
> > > [ 110.223953] WARNING: CPU: 5 PID: 452 at drivers/pci/msi.c:1261 pci_irq_vector+0xb1/0xe0
> > >
> > > > > -Mike
> >
> > I just sent a patch fixing that.
> > However I think we want to print a message when MSI fails to work so we
> > know guest is falling back on legacy interrupts.
>
> The warning persists.
>
> [ 137.656423] WARNING: CPU: 1 PID: 535 at drivers/pci/msi.c:1261 pci_irq_vector+0xb1/0xe0

Can you post the rest of the backtrace? Is it still in the console?

> WRT the post hibernate hang business, that is apparently not part of
> the 4.11 woes (at least not solely), as 4.10.8 did not survive a 10
> hibernate cycle loop. RT is better at reproducing trouble (shrug, it
> frequently is), but it matters not whether I'm running 4.10, master or
> master-rt, they will all hang.
>
> WRT gripe, I wedged virtio_pci-fix-msix-vector-tracking-on-cleanup in
> on top, but it wasn't impressed.
>
> -Mike

2017-04-04 14:18:38

by Mike Galbraith

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Tue, 2017-04-04 at 16:38 +0300, Michael S. Tsirkin wrote:
> On Tue, Apr 04, 2017 at 06:02:52AM +0200, Mike Galbraith wrote:
> > On Mon, 2017-04-03 at 21:11 +0300, Michael S. Tsirkin wrote:
> > > On Mon, Apr 03, 2017 at 07:56:32PM +0200, Mike Galbraith wrote:
> > > > On Mon, 2017-04-03 at 16:18 +0200, Christoph Hellwig wrote:
> > > > > Mike,
> > > > >
> > > > > can you try the patch below?
> > > >
> > > > No more spinning kworker woes, but I still have a warning on
> > > > hibernate,
> > > > threadirqs invariant. I'm also seeing intermittent post
> > > > hibernate hang
> > > > funnies in virgin source +- this patch, and without threadirqs.
> > > >
> > > > [ 110.223953] WARNING: CPU: 5 PID: 452 at
> > > > drivers/pci/msi.c:1261 pci_irq_vector+0xb1/0xe0
> > > >
> > > > > > -Mike
> > >
> > > I just sent a patch fixing that.
> > > However I think we want to print a message when MSI fails to work
> > > so we
> > > know guest is falling back on legacy interrupts.
> >
> > The warning persists.
> >
> > [ 137.656423] WARNING: CPU: 1 PID: 535 at drivers/pci/msi.c:1261
> > pci_irq_vector+0xb1/0xe0
>
> Can you post the rest of the backtrace? Is it still in the console?

This is from a dump of post hibernate loop dying vbox I captured and
squirreled away, so pid is different. I'm not absolutely certain that
I didn't have my local patch set re-applied when I did this, so I'll
rebuild in the a.m.. My stuff is unrelated, so this should be fine.

[ 328.475988] ------------[ cut here ]------------
[ 328.476002] WARNING: CPU: 6 PID: 313 at drivers/pci/msi.c:1261 pci_irq_vector+0xb1/0xe0
[ 328.476003] Modules linked in: fuse(E) ebtable_filter(E) ebtables(E) nf_log_ipv6(E) xt_pkttype(E) nf_log_ipv4(E) nf_log_common(E) xt_LOG(E) xt_limit(E) rpcsec_gss_krb5(E) nfsv4(E) dns_resolver(E) nfs(E) fscache(E) af_packet(E) iscsi_ibft(E) iscsi_boot_sysfs(E) ip6t_REJECT(E) xt_tcpudp(E) nf_conntrack_ipv6(E) nf_defrag_ipv6(E) ip6table_raw(E) ipt_REJECT(E) iptable_raw(E) xt_CT(E) iptable_filter(E) ip6table_mangle(E) nf_conntrack_netbios_ns(E) nf_conntrack_broadcast(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) ip_tables(E) xt_conntrack(E) nf_conntrack(E) libcrc32c(E) ip6table_filter(E) ip6_tables(E) x_tables(E) snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_codec(E) snd_hda_core(E) joydev(E) snd_hwdep(E) snd_pcm(E) snd_timer(E) snd(E) 8139too(E) soundcore(E) i2c_piix4(E) virtio_balloon(E) crct10dif_pclmul(E)
[ 328.476019] crc32_pclmul(E) ppdev(E) ghash_clmulni_intel(E) parport_pc(E) acpi_cpufreq(E) pcbc(E) button(E) parport(E) aesni_intel(E) aes_x86_64(E) serio_raw(E) pcspkr(E) crypto_simd(E) glue_helper(E) cryptd(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) dm_mod(E) grace(E) sunrpc(E) ext4(E) crc16(E) jbd2(E) mbcache(E) hid_generic(E) usbhid(E) sr_mod(E) cdrom(E) ata_generic(E) virtio_blk(E) virtio_rng(E) virtio_console(E) ata_piix(E) qxl(E) drm_kms_helper(E) syscopyarea(E) uhci_hcd(E) ehci_pci(E) sysfillrect(E) sysimgblt(E) ahci(E) fb_sys_fops(E) ehci_hcd(E) libahci(E) crc32c_intel(E) ttm(E) virtio_pci(E) virtio_ring(E) 8139cp(E) virtio(E) usbcore(E) floppy(E) mii(E) drm(E) libata(E) sg(E) scsi_mod(E) autofs4(E)
[ 328.476037] CPU: 6 PID: 313 Comm: kworker/u16:2 Tainted: G E 4.11.0-default #20
[ 328.476038] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20161202_174313-build11a 04/01/2014
[ 328.476041] Workqueue: events_unbound async_run_entry_fn
[ 328.476042] Call Trace:
[ 328.476056] ? dump_stack+0x5c/0x85
[ 328.476058] ? __warn+0xc4/0xe0
[ 328.476060] ? pci_pm_poweroff+0xf0/0xf0
[ 328.476062] ? pci_irq_vector+0xb1/0xe0
[ 328.476064] ? vp_del_vqs+0xcb/0x120 [virtio_pci]
[ 328.476066] ? remove_common+0x60/0x80 [virtio_rng]
[ 328.476067] ? virtrng_freeze+0xa/0x10 [virtio_rng]
[ 328.476068] ? virtio_pci_freeze+0x19/0x40 [virtio_pci]
[ 328.476069] ? pci_pm_freeze+0x59/0xe0
[ 328.476070] ? dpm_run_callback+0x4d/0x170
[ 328.476071] ? __device_suspend+0x11f/0x3b0
[ 328.476072] ? pm_dev_dbg+0x70/0x70
[ 328.476072] ? async_suspend+0x1a/0x90
[ 328.476082] ? async_run_entry_fn+0x34/0x160
[ 328.476083] ? process_one_work+0x164/0x430
[ 328.476084] ? worker_thread+0x135/0x4d0
[ 328.476085] ? kthread+0xff/0x140
[ 328.476086] ? rescuer_thread+0x3c0/0x3c0
[ 328.476087] ? kthread_park+0x80/0x80
[ 328.476088] ? do_group_exit+0x39/0xa0
[ 328.476090] ? ret_from_fork+0x26/0x40
[ 328.476091] ---[ end trace a045c2118936902f ]---

2017-04-04 14:24:43

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Tue, Apr 04, 2017 at 04:18:02PM +0200, Mike Galbraith wrote:
> On Tue, 2017-04-04 at 16:38 +0300, Michael S. Tsirkin wrote:
> > On Tue, Apr 04, 2017 at 06:02:52AM +0200, Mike Galbraith wrote:
> > > On Mon, 2017-04-03 at 21:11 +0300, Michael S. Tsirkin wrote:
> > > > On Mon, Apr 03, 2017 at 07:56:32PM +0200, Mike Galbraith wrote:
> > > > > On Mon, 2017-04-03 at 16:18 +0200, Christoph Hellwig wrote:
> > > > > > Mike,
> > > > > >
> > > > > > can you try the patch below?
> > > > >
> > > > > No more spinning kworker woes, but I still have a warning on
> > > > > hibernate,
> > > > > threadirqs invariant. I'm also seeing intermittent post
> > > > > hibernate hang
> > > > > funnies in virgin source +- this patch, and without threadirqs.
> > > > >
> > > > > [ 110.223953] WARNING: CPU: 5 PID: 452 at
> > > > > drivers/pci/msi.c:1261 pci_irq_vector+0xb1/0xe0
> > > > >
> > > > > > > -Mike
> > > >
> > > > I just sent a patch fixing that.
> > > > However I think we want to print a message when MSI fails to work
> > > > so we
> > > > know guest is falling back on legacy interrupts.
> > >
> > > The warning persists.
> > >
> > > [ 137.656423] WARNING: CPU: 1 PID: 535 at drivers/pci/msi.c:1261
> > > pci_irq_vector+0xb1/0xe0
> >
> > Can you post the rest of the backtrace? Is it still in the console?
>
> This is from a dump of post hibernate loop dying vbox I captured and
> squirreled away, so pid is different. I'm not absolutely certain that
> I didn't have my local patch set re-applied when I did this, so I'll
> rebuild in the a.m.. My stuff is unrelated, so this should be fine.
>
> [ 328.475988] ------------[ cut here ]------------
> [ 328.476002] WARNING: CPU: 6 PID: 313 at drivers/pci/msi.c:1261 pci_irq_vector+0xb1/0xe0
> [ 328.476003] Modules linked in: fuse(E) ebtable_filter(E) ebtables(E) nf_log_ipv6(E) xt_pkttype(E) nf_log_ipv4(E) nf_log_common(E) xt_LOG(E) xt_limit(E) rpcsec_gss_krb5(E) nfsv4(E) dns_resolver(E) nfs(E) fscache(E) af_packet(E) iscsi_ibft(E) iscsi_boot_sysfs(E) ip6t_REJECT(E) xt_tcpudp(E) nf_conntrack_ipv6(E) nf_defrag_ipv6(E) ip6table_raw(E) ipt_REJECT(E) iptable_raw(E) xt_CT(E) iptable_filter(E) ip6table_mangle(E) nf_conntrack_netbios_ns(E) nf_conntrack_broadcast(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) ip_tables(E) xt_conntrack(E) nf_conntrack(E) libcrc32c(E) ip6table_filter(E) ip6_tables(E) x_tables(E) snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_codec(E) snd_hda_core(E) joydev(E) snd_hwdep(E) snd_pcm(E) snd_timer(E) snd(E) 8139too(E) soundcore(E) i2c_piix4(E) virtio_balloon(E) crct10dif_pclmul(E)
> [ 328.476019] crc32_pclmul(E) ppdev(E) ghash_clmulni_intel(E) parport_pc(E) acpi_cpufreq(E) pcbc(E) button(E) parport(E) aesni_intel(E) aes_x86_64(E) serio_raw(E) pcspkr(E) crypto_simd(E) glue_helper(E) cryptd(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) dm_mod(E) grace(E) sunrpc(E) ext4(E) crc16(E) jbd2(E) mbcache(E) hid_generic(E) usbhid(E) sr_mod(E) cdrom(E) ata_generic(E) virtio_blk(E) virtio_rng(E) virtio_console(E) ata_piix(E) qxl(E) drm_kms_helper(E) syscopyarea(E) uhci_hcd(E) ehci_pci(E) sysfillrect(E) sysimgblt(E) ahci(E) fb_sys_fops(E) ehci_hcd(E) libahci(E) crc32c_intel(E) ttm(E) virtio_pci(E) virtio_ring(E) 8139cp(E) virtio(E) usbcore(E) floppy(E) mii(E) drm(E) libata(E) sg(E) scsi_mod(E) autofs4(E)
> [ 328.476037] CPU: 6 PID: 313 Comm: kworker/u16:2 Tainted: G E 4.11.0-default #20
> [ 328.476038] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20161202_174313-build11a 04/01/2014
> [ 328.476041] Workqueue: events_unbound async_run_entry_fn
> [ 328.476042] Call Trace:
> [ 328.476056] ? dump_stack+0x5c/0x85
> [ 328.476058] ? __warn+0xc4/0xe0
> [ 328.476060] ? pci_pm_poweroff+0xf0/0xf0
> [ 328.476062] ? pci_irq_vector+0xb1/0xe0
> [ 328.476064] ? vp_del_vqs+0xcb/0x120 [virtio_pci]
> [ 328.476066] ? remove_common+0x60/0x80 [virtio_rng]
> [ 328.476067] ? virtrng_freeze+0xa/0x10 [virtio_rng]
> [ 328.476068] ? virtio_pci_freeze+0x19/0x40 [virtio_pci]
> [ 328.476069] ? pci_pm_freeze+0x59/0xe0
> [ 328.476070] ? dpm_run_callback+0x4d/0x170
> [ 328.476071] ? __device_suspend+0x11f/0x3b0
> [ 328.476072] ? pm_dev_dbg+0x70/0x70
> [ 328.476072] ? async_suspend+0x1a/0x90
> [ 328.476082] ? async_run_entry_fn+0x34/0x160
> [ 328.476083] ? process_one_work+0x164/0x430
> [ 328.476084] ? worker_thread+0x135/0x4d0
> [ 328.476085] ? kthread+0xff/0x140
> [ 328.476086] ? rescuer_thread+0x3c0/0x3c0
> [ 328.476087] ? kthread_park+0x80/0x80
> [ 328.476088] ? do_group_exit+0x39/0xa0
> [ 328.476090] ? ret_from_fork+0x26/0x40
> [ 328.476091] ---[ end trace a045c2118936902f ]---

Interesting, it's rng this time. I'll try that.

--
MST

2017-04-04 15:30:26

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Tue, Apr 04, 2017 at 04:18:02PM +0200, Mike Galbraith wrote:
> On Tue, 2017-04-04 at 16:38 +0300, Michael S. Tsirkin wrote:
> > On Tue, Apr 04, 2017 at 06:02:52AM +0200, Mike Galbraith wrote:
> > > On Mon, 2017-04-03 at 21:11 +0300, Michael S. Tsirkin wrote:
> > > > On Mon, Apr 03, 2017 at 07:56:32PM +0200, Mike Galbraith wrote:
> > > > > On Mon, 2017-04-03 at 16:18 +0200, Christoph Hellwig wrote:
> > > > > > Mike,
> > > > > >
> > > > > > can you try the patch below?
> > > > >
> > > > > No more spinning kworker woes, but I still have a warning on
> > > > > hibernate,
> > > > > threadirqs invariant. I'm also seeing intermittent post
> > > > > hibernate hang
> > > > > funnies in virgin source +- this patch, and without threadirqs.
> > > > >
> > > > > [ 110.223953] WARNING: CPU: 5 PID: 452 at
> > > > > drivers/pci/msi.c:1261 pci_irq_vector+0xb1/0xe0
> > > > >
> > > > > > > -Mike
> > > >
> > > > I just sent a patch fixing that.
> > > > However I think we want to print a message when MSI fails to work
> > > > so we
> > > > know guest is falling back on legacy interrupts.
> > >
> > > The warning persists.
> > >
> > > [ 137.656423] WARNING: CPU: 1 PID: 535 at drivers/pci/msi.c:1261
> > > pci_irq_vector+0xb1/0xe0
> >
> > Can you post the rest of the backtrace? Is it still in the console?
>
> This is from a dump of post hibernate loop dying vbox I captured and
> squirreled away, so pid is different. I'm not absolutely certain that
> I didn't have my local patch set re-applied when I did this, so I'll
> rebuild in the a.m.. My stuff is unrelated, so this should be fine.
>
> [ 328.475988] ------------[ cut here ]------------
> [ 328.476002] WARNING: CPU: 6 PID: 313 at drivers/pci/msi.c:1261 pci_irq_vector+0xb1/0xe0
> [ 328.476003] Modules linked in: fuse(E) ebtable_filter(E) ebtables(E) nf_log_ipv6(E) xt_pkttype(E) nf_log_ipv4(E) nf_log_common(E) xt_LOG(E) xt_limit(E) rpcsec_gss_krb5(E) nfsv4(E) dns_resolver(E) nfs(E) fscache(E) af_packet(E) iscsi_ibft(E) iscsi_boot_sysfs(E) ip6t_REJECT(E) xt_tcpudp(E) nf_conntrack_ipv6(E) nf_defrag_ipv6(E) ip6table_raw(E) ipt_REJECT(E) iptable_raw(E) xt_CT(E) iptable_filter(E) ip6table_mangle(E) nf_conntrack_netbios_ns(E) nf_conntrack_broadcast(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) ip_tables(E) xt_conntrack(E) nf_conntrack(E) libcrc32c(E) ip6table_filter(E) ip6_tables(E) x_tables(E) snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_codec(E) snd_hda_core(E) joydev(E) snd_hwdep(E) snd_pcm(E) snd_timer(E) snd(E) 8139too(E) soundcore(E) i2c_piix4(E) virtio_balloon(E) crct10dif_pclmul(E)
> [ 328.476019] crc32_pclmul(E) ppdev(E) ghash_clmulni_intel(E) parport_pc(E) acpi_cpufreq(E) pcbc(E) button(E) parport(E) aesni_intel(E) aes_x86_64(E) serio_raw(E) pcspkr(E) crypto_simd(E) glue_helper(E) cryptd(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) dm_mod(E) grace(E) sunrpc(E) ext4(E) crc16(E) jbd2(E) mbcache(E) hid_generic(E) usbhid(E) sr_mod(E) cdrom(E) ata_generic(E) virtio_blk(E) virtio_rng(E) virtio_console(E) ata_piix(E) qxl(E) drm_kms_helper(E) syscopyarea(E) uhci_hcd(E) ehci_pci(E) sysfillrect(E) sysimgblt(E) ahci(E) fb_sys_fops(E) ehci_hcd(E) libahci(E) crc32c_intel(E) ttm(E) virtio_pci(E) virtio_ring(E) 8139cp(E) virtio(E) usbcore(E) floppy(E) mii(E) drm(E) libata(E) sg(E) scsi_mod(E) autofs4(E)
> [ 328.476037] CPU: 6 PID: 313 Comm: kworker/u16:2 Tainted: G E 4.11.0-default #20
> [ 328.476038] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20161202_174313-build11a 04/01/2014
> [ 328.476041] Workqueue: events_unbound async_run_entry_fn
> [ 328.476042] Call Trace:
> [ 328.476056] ? dump_stack+0x5c/0x85
> [ 328.476058] ? __warn+0xc4/0xe0
> [ 328.476060] ? pci_pm_poweroff+0xf0/0xf0
> [ 328.476062] ? pci_irq_vector+0xb1/0xe0
> [ 328.476064] ? vp_del_vqs+0xcb/0x120 [virtio_pci]
> [ 328.476066] ? remove_common+0x60/0x80 [virtio_rng]
> [ 328.476067] ? virtrng_freeze+0xa/0x10 [virtio_rng]
> [ 328.476068] ? virtio_pci_freeze+0x19/0x40 [virtio_pci]
> [ 328.476069] ? pci_pm_freeze+0x59/0xe0
> [ 328.476070] ? dpm_run_callback+0x4d/0x170
> [ 328.476071] ? __device_suspend+0x11f/0x3b0
> [ 328.476072] ? pm_dev_dbg+0x70/0x70
> [ 328.476072] ? async_suspend+0x1a/0x90
> [ 328.476082] ? async_run_entry_fn+0x34/0x160
> [ 328.476083] ? process_one_work+0x164/0x430
> [ 328.476084] ? worker_thread+0x135/0x4d0
> [ 328.476085] ? kthread+0xff/0x140
> [ 328.476086] ? rescuer_thread+0x3c0/0x3c0
> [ 328.476087] ? kthread_park+0x80/0x80
> [ 328.476088] ? do_group_exit+0x39/0xa0
> [ 328.476090] ? ret_from_fork+0x26/0x40
> [ 328.476091] ---[ end trace a045c2118936902f ]---


I couldn't reproduce it - let's make sure we are using the
same tree. Could you pls try

git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git linux-next

It's currently at cc79d42a7d7e57ff64f406a1fd3740afebac0b44
--
MST

2017-04-04 17:40:59

by Mike Galbraith

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Tue, 2017-04-04 at 18:30 +0300, Michael S. Tsirkin wrote:

> I couldn't reproduce it - let's make sure we are using the
> same tree. Could you pls try
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git linux-next
>
> It's currently at cc79d42a7d7e57ff64f406a1fd3740afebac0b44

Things that make ya go hmm...

[ 87.940161] ------------[ cut here ]------------
[ 87.940180] WARNING: CPU: 0 PID: 97 at drivers/pci/msi.c:1251 pci_irq_vector+0xcb/0xe0
[ 87.940181] Modules linked in: dm_mod(E) fuse(E) ebtable_filter(E) ebtables(E) rpcsec_gss_krb5(E) nfsv4(E) dns_resolver(E) nfs(E) fscache(E) nf_log_ipv6(E) xt_pkttype(E) nf_log_ipv4(E) nf_log_common(E) xt_LOG(E) xt_limit(E) af_packet(E) iscsi_ibft(E) iscsi_boot_sysfs(E) ip6t_REJECT(E) xt_tcpudp(E) nf_conntrack_ipv6(E) nf_defrag_ipv6(E) ip6table_raw(E) ipt_REJECT(E) iptable_raw(E) xt_CT(E) iptable_filter(E) ip6table_mangle(E) nf_conntrack_netbios_ns(E) nf_conntrack_broadcast(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) ip_tables(E) xt_conntrack(E) nf_conntrack(E) libcrc32c(E) ip6table_filter(E) ip6_tables(E) x_tables(E) snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_codec(E) joydev(E) snd_hda_core(E) snd_hwdep(E) snd_pcm(E) snd_timer(E) snd(E) 8139too(E) ppdev(E) soundcore(E) parport_pc(E) i2c_piix4(E)
[ 87.940206] parport(E) virtio_balloon(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) ghash_clmulni_intel(E) serio_raw(E) acpi_cpufreq(E) pcbc(E) button(E) aesni_intel(E) pcspkr(E) aes_x86_64(E) crypto_simd(E) glue_helper(E) cryptd(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) ext4(E) crc16(E) jbd2(E) mbcache(E) hid_generic(E) usbhid(E) ata_generic(E) ata_piix(E) sr_mod(E) cdrom(E) virtio_blk(E) virtio_rng(E) virtio_console(E) qxl(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) ehci_pci(E) ttm(E) uhci_hcd(E) ehci_hcd(E) floppy(E) ahci(E) libahci(E) virtio_pci(E) drm(E) virtio_ring(E) virtio(E) usbcore(E) libata(E) 8139cp(E) mii(E) sg(E) scsi_mod(E) autofs4(E)
[ 87.940233] CPU: 0 PID: 97 Comm: kworker/u16:1 Tainted: G E 4.11.0-default #1
[ 87.940234] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20161202_174313-build11a 04/01/2014
[ 87.940240] Workqueue: events_unbound async_run_entry_fn
[ 87.940241] Call Trace:
[ 87.940246] ? dump_stack+0x5c/0x85
[ 87.940255] ? __warn+0xc4/0xe0
[ 87.940258] ? pci_pm_poweroff+0xf0/0xf0
[ 87.940269] ? pci_irq_vector+0xcb/0xe0
[ 87.940272] ? vp_synchronize_vectors+0x3e/0x50 [virtio_pci]
[ 87.940275] ? virtcons_freeze+0x1a/0xd0 [virtio_console]
[ 87.940276] ? virtio_pci_freeze+0x19/0x40 [virtio_pci]
[ 87.940277] ? pci_pm_freeze+0x59/0xe0
[ 87.940281] ? dpm_run_callback+0x4d/0x170
[ 87.940283] ? __device_suspend+0x11f/0x3b0
[ 87.940283] ? pm_dev_dbg+0x70/0x70
[ 87.940284] ? async_suspend+0x1a/0x90
[ 87.940286] ? async_run_entry_fn+0x34/0x160
[ 87.940287] ? process_one_work+0x164/0x430
[ 87.940288] ? worker_thread+0x135/0x4d0
[ 87.940290] ? kthread+0xff/0x140
[ 87.940291] ? rescuer_thread+0x3c0/0x3c0
[ 87.940292] ? kthread_park+0x80/0x80
[ 87.940293] ? kthread_park+0x80/0x80
[ 87.940299] ? ret_from_fork+0x26/0x40
[ 87.940300] ---[ end trace 5d65fe0efc4b61d7 ]---

2017-04-04 17:54:50

by Mike Galbraith

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Tue, 2017-04-04 at 19:40 +0200, Mike Galbraith wrote:
> On Tue, 2017-04-04 at 18:30 +0300, Michael S. Tsirkin wrote:
>
> > I couldn't reproduce it - let's make sure we are using the
> > same tree. Could you pls try
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git linux
> > -next
> >
> > It's currently at cc79d42a7d7e57ff64f406a1fd3740afebac0b44
>
> Things that make ya go hmm...

Making double sure we're on the same page...

git@homer:..git/vhost> git branch
* linux-next
master
git@homer:..git/vhost> git describe
warning: tag 'for_linus' is really 'tags_for_linus' here
for_linus-220128-gcc79d42a7d7e
git@homer:..git/vhost> git status
On branch linux-next
Your branch is up-to-date with 'origin/linux-next'.
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)

modified: Makefile
modified: scripts/setlocalversion

no changes added to commit (use "git add" and/or "git commit -a")
git@homer:..git/vhost>

Modifications are me whacking '+' sign and -rc5.. I don't do those.

2017-04-04 18:00:37

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Tue, Apr 04, 2017 at 07:54:36PM +0200, Mike Galbraith wrote:
> On Tue, 2017-04-04 at 19:40 +0200, Mike Galbraith wrote:
> > On Tue, 2017-04-04 at 18:30 +0300, Michael S. Tsirkin wrote:
> >
> > > I couldn't reproduce it - let's make sure we are using the
> > > same tree. Could you pls try
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git linux
> > > -next
> > >
> > > It's currently at cc79d42a7d7e57ff64f406a1fd3740afebac0b44
> >
> > Things that make ya go hmm...
>
> Making double sure we're on the same page...
>
> git@homer:..git/vhost> git branch
> * linux-next
> master
> git@homer:..git/vhost> git describe
> warning: tag 'for_linus' is really 'tags_for_linus' here
> for_linus-220128-gcc79d42a7d7e
> git@homer:..git/vhost> git status
> On branch linux-next
> Your branch is up-to-date with 'origin/linux-next'.
> Changes not staged for commit:
> (use "git add <file>..." to update what will be committed)
> (use "git checkout -- <file>..." to discard changes in working directory)
>
> modified: Makefile
> modified: scripts/setlocalversion
>
> no changes added to commit (use "git add" and/or "git commit -a")
> git@homer:..git/vhost>
>
> Modifications are me whacking '+' sign and -rc5.. I don't do those.

And just making double sure, the 1st version that has the issue
is 5c34d002dcc7, isn't it? I'm asking because subject says so
but then goes on to list subject from another commit.
This one is:
virtio_pci: remove struct virtio_pci_vq_info

--
MST

2017-04-04 18:39:12

by Mike Galbraith

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Tue, 2017-04-04 at 21:00 +0300, Michael S. Tsirkin wrote:

> And just making double sure, the 1st version that has the issue
> is 5c34d002dcc7, isn't it? I'm asking because subject says so
> but then goes on to list subject from another commit.
> This one is:
> > virtio_pci: remove struct virtio_pci_vq_info

When the hibernation related warnings started I don't know, I wasn't
targeting that, those fell out of subsequent testing. I started out
hunting console breakage point w. threaded irqs, which is 5c34d002dcc7.

-Mike

2017-04-04 19:03:17

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Tue, Apr 04, 2017 at 07:54:36PM +0200, Mike Galbraith wrote:
> On Tue, 2017-04-04 at 19:40 +0200, Mike Galbraith wrote:
> > On Tue, 2017-04-04 at 18:30 +0300, Michael S. Tsirkin wrote:
> >
> > > I couldn't reproduce it - let's make sure we are using the
> > > same tree. Could you pls try
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git linux
> > > -next
> > >
> > > It's currently at cc79d42a7d7e57ff64f406a1fd3740afebac0b44
> >
> > Things that make ya go hmm...
>
> Making double sure we're on the same page...
>
> git@homer:..git/vhost> git branch
> * linux-next
> master
> git@homer:..git/vhost> git describe
> warning: tag 'for_linus' is really 'tags_for_linus' here
> for_linus-220128-gcc79d42a7d7e
> git@homer:..git/vhost> git status
> On branch linux-next
> Your branch is up-to-date with 'origin/linux-next'.
> Changes not staged for commit:
> (use "git add <file>..." to update what will be committed)
> (use "git checkout -- <file>..." to discard changes in working directory)
>
> modified: Makefile
> modified: scripts/setlocalversion
>
> no changes added to commit (use "git add" and/or "git commit -a")
> git@homer:..git/vhost>
>
> Modifications are me whacking '+' sign and -rc5.. I don't do those.


since I couldn't reproduce, I decided it's worth trying to see
what happens if we revert back to before 5c34d002dcc7.


Could you please test a tag "test" in my tree above?
It should point at 6d88af1bf359417eb821370294ba489bdf7f5ab8


That has reverts for code refactorings since 5c34d002dcc7
inclusive. If this finally works, maybe you could
go back and see which of the reverts helps?

The idea is that this only has refactorings nicely isolated,
if all else fails we can even do the reverts without losing
functionality.

--
MST

2017-04-04 21:31:49

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Tue, Apr 04, 2017 at 08:38:35PM +0200, Mike Galbraith wrote:
> On Tue, 2017-04-04 at 21:00 +0300, Michael S. Tsirkin wrote:
>
> > And just making double sure, the 1st version that has the issue
> > is 5c34d002dcc7, isn't it? I'm asking because subject says so
> > but then goes on to list subject from another commit.
> > This one is:
> > > virtio_pci: remove struct virtio_pci_vq_info
>
> When the hibernation related warnings started I don't know, I wasn't
> targeting that, those fell out of subsequent testing.
> I started out
> hunting console breakage point w. threaded irqs, which is 5c34d002dcc7.

OK but 5c34d002dcc7 isn't "virtio_pci: use shared
interrupts for virtqueues".

>
> -Mike

I'm confused at this point. I would appreciate the summary of
which versions were tested and what did you see. Testing
a revert might also help.

Thanks a lot for your testing!

--
MST

2017-04-05 02:55:10

by Mike Galbraith

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Wed, 2017-04-05 at 00:31 +0300, Michael S. Tsirkin wrote:
> On Tue, Apr 04, 2017 at 08:38:35PM +0200, Mike Galbraith wrote:
> > On Tue, 2017-04-04 at 21:00 +0300, Michael S. Tsirkin wrote:
> >
> > > And just making double sure, the 1st version that has the issue
> > > is 5c34d002dcc7, isn't it? I'm asking because subject says so
> > > but then goes on to list subject from another commit.
> > > This one is:
> > > > virtio_pci: remove struct virtio_pci_vq_info
> >
> > When the hibernation related warnings started I don't know, I
> > wasn't
> > targeting that, those fell out of subsequent testing.
> > I started out
> > hunting console breakage point w. threaded irqs, which is
> > 5c34d002dcc7.
>
> OK but 5c34d002dcc7 isn't "virtio_pci: use shared
> interrupts for virtqueues".

Heh, wrong sha.. $subject does however correctly identify in quotes the
origin of the threaded irq woes.

> I'm confused at this point. I would appreciate the summary of
> which versions were tested and what did you see. Testing
> a revert might also help.

I already tested full revert. I went looking for what busted kvm for
RT kernels, extracted the virtio series and quilt bisected that to use
shared interrupts. I was going to just use my little turn off
multiport hacklet to put spinning kworker on the back burner until the
dust settled, but noticed that there was more going on, and none of it
is RT specific (thus freeing up a back burner).

>From there, it's all test what you/Christoph post, as you post it, in
virgin source.

-Mike

2017-04-05 03:09:20

by Mike Galbraith

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Tue, 2017-04-04 at 22:03 +0300, Michael S. Tsirkin wrote:

> since I couldn't reproduce, I decided it's worth trying to see
> what happens if we revert back to before 5c34d002dcc7.
>
>
> Could you please test a tag "test" in my tree above?
> It should point at 6d88af1bf359417eb821370294ba489bdf7f5ab8

Nogo.

git@homer:..git/vhost> git remote update
Fetching origin
git@homer:..git/vhost> git show 6d88af1bf359417eb821370294ba489bdf7f5ab8
fatal: bad object 6d88af1bf359417eb821370294ba489bdf7f5ab8

2017-04-05 03:13:39

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Wed, Apr 05, 2017 at 05:09:09AM +0200, Mike Galbraith wrote:
> On Tue, 2017-04-04 at 22:03 +0300, Michael S. Tsirkin wrote:
>
> > since I couldn't reproduce, I decided it's worth trying to see
> > what happens if we revert back to before 5c34d002dcc7.
> >
> >
> > Could you please test a tag "test" in my tree above?
> > It should point at 6d88af1bf359417eb821370294ba489bdf7f5ab8
>
> Nogo.
>
> git@homer:..git/vhost> git remote update
> Fetching origin
> git@homer:..git/vhost> git show 6d88af1bf359417eb821370294ba489bdf7f5ab8
> fatal: bad object 6d88af1bf359417eb821370294ba489bdf7f5ab8

Maybe because it's a tag not a head. Pls try
git fetch git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git refs/tags/test

--
MST

2017-04-05 03:24:40

by Mike Galbraith

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Wed, 2017-04-05 at 06:13 +0300, Michael S. Tsirkin wrote:
> On Wed, Apr 05, 2017 at 05:09:09AM +0200, Mike Galbraith wrote:
> > On Tue, 2017-04-04 at 22:03 +0300, Michael S. Tsirkin wrote:
> >
> > > since I couldn't reproduce, I decided it's worth trying to see
> > > what happens if we revert back to before 5c34d002dcc7.
> > >
> > >
> > > Could you please test a tag "test" in my tree above?
> > > It should point at 6d88af1bf359417eb821370294ba489bdf7f5ab8
> >
> > Nogo.
> >
> > git@homer:..git/vhost> git remote update
> > Fetching origin
> > git@homer:..git/vhost> git show
> > 6d88af1bf359417eb821370294ba489bdf7f5ab8
> > fatal: bad object 6d88af1bf359417eb821370294ba489bdf7f5ab8
>
> Maybe because it's a tag not a head. Pls try
> git fetch git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git
> refs/tags/test

That worked. Checked out/building.

2017-04-05 03:40:18

by Mike Galbraith

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Wed, 2017-04-05 at 05:24 +0200, Mike Galbraith wrote:
> On Wed, 2017-04-05 at 06:13 +0300, Michael S. Tsirkin wrote:
> > On Wed, Apr 05, 2017 at 05:09:09AM +0200, Mike Galbraith wrote:
> > > On Tue, 2017-04-04 at 22:03 +0300, Michael S. Tsirkin wrote:
> > >
> > > > since I couldn't reproduce, I decided it's worth trying to see
> > > > what happens if we revert back to before 5c34d002dcc7.
> > > >
> > > >
> > > > Could you please test a tag "test" in my tree above?
> > > > It should point at 6d88af1bf359417eb821370294ba489bdf7f5ab8
> > >
> > > Nogo.
> > >
> > > git@homer:..git/vhost> git remote update
> > > Fetching origin
> > > git@homer:..git/vhost> git show
> > > 6d88af1bf359417eb821370294ba489bdf7f5ab8
> > > fatal: bad object 6d88af1bf359417eb821370294ba489bdf7f5ab8
> >
> > Maybe because it's a tag not a head. Pls try
> > git fetch
> > git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git
> > refs/tags/test
>
> That worked. Checked out/building.

vbox hibernated gripe free, w/wo threadirqs.

-Mike

2017-04-05 03:51:50

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Wed, Apr 05, 2017 at 05:40:06AM +0200, Mike Galbraith wrote:
> On Wed, 2017-04-05 at 05:24 +0200, Mike Galbraith wrote:
> > On Wed, 2017-04-05 at 06:13 +0300, Michael S. Tsirkin wrote:
> > > On Wed, Apr 05, 2017 at 05:09:09AM +0200, Mike Galbraith wrote:
> > > > On Tue, 2017-04-04 at 22:03 +0300, Michael S. Tsirkin wrote:
> > > >
> > > > > since I couldn't reproduce, I decided it's worth trying to see
> > > > > what happens if we revert back to before 5c34d002dcc7.
> > > > >
> > > > >
> > > > > Could you please test a tag "test" in my tree above?
> > > > > It should point at 6d88af1bf359417eb821370294ba489bdf7f5ab8
> > > >
> > > > Nogo.
> > > >
> > > > git@homer:..git/vhost> git remote update
> > > > Fetching origin
> > > > git@homer:..git/vhost> git show
> > > > 6d88af1bf359417eb821370294ba489bdf7f5ab8
> > > > fatal: bad object 6d88af1bf359417eb821370294ba489bdf7f5ab8
> > >
> > > Maybe because it's a tag not a head. Pls try
> > > git fetch
> > > git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git
> > > refs/tags/test
> >
> > That worked. Checked out/building.
>
> vbox hibernated gripe free, w/wo threadirqs.
>
> -Mike

Any issues at all left with this tree?
In particular any regressions?

--
MST

2017-04-05 03:52:08

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Wed, Apr 05, 2017 at 05:24:30AM +0200, Mike Galbraith wrote:
> On Wed, 2017-04-05 at 06:13 +0300, Michael S. Tsirkin wrote:
> > On Wed, Apr 05, 2017 at 05:09:09AM +0200, Mike Galbraith wrote:
> > > On Tue, 2017-04-04 at 22:03 +0300, Michael S. Tsirkin wrote:
> > >
> > > > since I couldn't reproduce, I decided it's worth trying to see
> > > > what happens if we revert back to before 5c34d002dcc7.
> > > >
> > > >
> > > > Could you please test a tag "test" in my tree above?
> > > > It should point at 6d88af1bf359417eb821370294ba489bdf7f5ab8
> > >
> > > Nogo.
> > >
> > > git@homer:..git/vhost> git remote update
> > > Fetching origin
> > > git@homer:..git/vhost> git show
> > > 6d88af1bf359417eb821370294ba489bdf7f5ab8
> > > fatal: bad object 6d88af1bf359417eb821370294ba489bdf7f5ab8
> >
> > Maybe because it's a tag not a head. Pls try
> > git fetch git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git
> > refs/tags/test
>
> That worked. Checked out/building.

Thanks a lot for the testing.

--
MST

2017-04-05 04:25:01

by Mike Galbraith

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Wed, 2017-04-05 at 06:51 +0300, Michael S. Tsirkin wrote:

> Any issues at all left with this tree?
> In particular any regressions?

Nothing blatantly obvious in a testdrive that lasted a couple minutes.
I'd have to beat on it a bit to look for things beyond the reported,
but can't afford to do that right now.

-Mike

2017-04-05 06:29:36

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Wed, Apr 05, 2017 at 06:24:50AM +0200, Mike Galbraith wrote:
> On Wed, 2017-04-05 at 06:51 +0300, Michael S. Tsirkin wrote:
>
> > Any issues at all left with this tree?
> > In particular any regressions?
>
> Nothing blatantly obvious in a testdrive that lasted a couple minutes.
> I'd have to beat on it a bit to look for things beyond the reported,
> but can't afford to do that right now.

Can you check where the issues appear? I'd like to do a pure revert
of the shared interrupts, but that three has a lot more in it..

2017-04-05 06:36:14

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Mon, Apr 03, 2017 at 07:14:22PM +0300, Michael S. Tsirkin wrote:
> On Mon, Apr 03, 2017 at 04:18:23PM +0200, Christoph Hellwig wrote:
> > Mike,
> >
> > can you try the patch below?
> >
> > ---
> > >From fe41a30b54878cc631623b7511267125e0da4b15 Mon Sep 17 00:00:00 2001
> > From: Christoph Hellwig <[email protected]>
> > Date: Mon, 3 Apr 2017 14:51:35 +0200
> > Subject: virtio_pci: don't use shared irq for virtqueues
> >
> > Reimplement the shared irq feature manually, as we might have a larger
> > number of virtqueues than the core shared interrupt code can handle
> > in threaded interrupt mode.
> >
> > Signed-off-by: Christoph Hellwig <[email protected]>
> > ---
> > drivers/virtio/virtio_pci_common.c | 142 +++++++++++++++++++++----------------
> > drivers/virtio/virtio_pci_common.h | 1 +
> > 2 files changed, 83 insertions(+), 60 deletions(-)
>
> Well the original patch this is trying to fix is
> 07ec51480b5eb1233f8c1b0f5d7a7c8d1247c507 which dropped just 40 lines
> with documentation. It did this by re-using error handling to switch
> from per-vq to non-per-vq mode. Now this has separate flows for errors
> and per-vq non-per-vq switch and (I think, as a result) is adding 140
> lines which doesn't make me very happy.

The above adds 23 lines. We could entangle both loops again, but I'm
not sure it's going to buy us much.

2017-04-05 06:51:59

by Mike Galbraith

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Wed, 2017-04-05 at 08:29 +0200, Christoph Hellwig wrote:

> Can you check where the issues appear? I'd like to do a pure revert
> of the shared interrupts, but that three has a lot more in it..

Not immediately, one of my several pots is emitting black smoke.

-Mike

2017-04-05 21:38:29

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Wed, Apr 05, 2017 at 08:29:34AM +0200, Christoph Hellwig wrote:
> On Wed, Apr 05, 2017 at 06:24:50AM +0200, Mike Galbraith wrote:
> > On Wed, 2017-04-05 at 06:51 +0300, Michael S. Tsirkin wrote:
> >
> > > Any issues at all left with this tree?
> > > In particular any regressions?
> >
> > Nothing blatantly obvious in a testdrive that lasted a couple minutes.
> > I'd have to beat on it a bit to look for things beyond the reported,
> > but can't afford to do that right now.
>
> Can you check where the issues appear? I'd like to do a pure revert
> of the shared interrupts, but that three has a lot more in it..

What I did is a revert the refactorings while keeping the affinity API -
we can safely postpone them until the next release without loss of
functionality. But that's on top of my testing tree so it has unrelated
stuff as well. I'm rather confident they aren't fixing the issues but
I'll prepare a bugfix-only tree now for testing.

--
MST

2017-04-07 06:03:37

by Mike Galbraith

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Thu, 2017-04-06 at 00:38 +0300, Michael S. Tsirkin wrote:

> What I did is a revert the refactorings while keeping the affinity API -
> we can safely postpone them until the next release without loss of
> functionality. But that's on top of my testing tree so it has unrelated
> stuff as well. I'm rather confident they aren't fixing the issues but
> I'll prepare a bugfix-only tree now for testing.

Test tag works fine here w/wo threadirqs, RT works as well.

-Mike

2017-04-07 06:24:13

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Fri, Apr 07, 2017 at 08:03:19AM +0200, Mike Galbraith wrote:
> On Thu, 2017-04-06 at 00:38 +0300, Michael S. Tsirkin wrote:
>
> > What I did is a revert the refactorings while keeping the affinity API -
> > we can safely postpone them until the next release without loss of
> > functionality. But that's on top of my testing tree so it has unrelated
> > stuff as well. I'm rather confident they aren't fixing the issues but
> > I'll prepare a bugfix-only tree now for testing.
>
> Test tag works fine here w/wo threadirqs, RT works as well.
>
> -Mike

Thanks a lot.
OK I pushed out two new tags
test1 with just the cleanup reverts
test2 with a bugfix in this area


I would very much appreciate your testing report on both -
should be ok but better make sure.
Unfortunately it's past 2am here so I don't have the time to
test - and I'm at a conference so not a lot of time during
the day either.

Christoph, I still think your cleanups were a good idea,
but we need get this release into a stable shape ASAP.
Let's try again for the next release, OK?

--
MST

2017-04-07 06:44:41

by Mike Galbraith

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Fri, 2017-04-07 at 09:24 +0300, Michael S. Tsirkin wrote:
> On Fri, Apr 07, 2017 at 08:03:19AM +0200, Mike Galbraith wrote:

> > Test tag works fine here w/wo threadirqs, RT works as well.
> >
> > -Mike
>
> Thanks a lot.
> OK I pushed out two new tags
> test1 with just the cleanup reverts
> test2 with a bugfix in this area
>
>
> I would very much appreciate your testing report on both -
> should be ok but better make sure.

Ok, once it percolates out I'll do that.

-Mike

2017-04-07 07:06:08

by Mike Galbraith

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Fri, 2017-04-07 at 08:44 +0200, Mike Galbraith wrote:
> On Fri, 2017-04-07 at 09:24 +0300, Michael S. Tsirkin wrote:
> > On Fri, Apr 07, 2017 at 08:03:19AM +0200, Mike Galbraith wrote:
>
> > > Test tag works fine here w/wo threadirqs, RT works as well.
> > >
> > > -Mike
> >
> > Thanks a lot.
> > OK I pushed out two new tags
> > test1 with just the cleanup reverts
> > test2 with a bugfix in this area
> >
> >
> > I would very much appreciate your testing report on both -
> > should be ok but better make sure.
>
> Ok, once it percolates out I'll do that.

for_linus-10-g960bef2a6172 contains a -ENOBUILD merge conflict.

-Mike

2017-04-07 07:22:27

by Mike Galbraith

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Fri, 2017-04-07 at 09:05 +0200, Mike Galbraith wrote:
> On Fri, 2017-04-07 at 08:44 +0200, Mike Galbraith wrote:
> > On Fri, 2017-04-07 at 09:24 +0300, Michael S. Tsirkin wrote:
> > > On Fri, Apr 07, 2017 at 08:03:19AM +0200, Mike Galbraith wrote:
> >
> > > > Test tag works fine here w/wo threadirqs, RT works as well.
> > > >
> > > > -Mike
> > >
> > > Thanks a lot.
> > > OK I pushed out two new tags
> > > test1 with just the cleanup reverts
> > > test2 with a bugfix in this area
> > >
> > >
> > > I would very much appreciate your testing report on both -
> > > should be ok but better make sure.
> >
> > Ok, once it percolates out I'll do that.
>
> for_linus-10-g960bef2a6172 contains a -ENOBUILD merge conflict.

But test2 works fine w/wo threadirqs.

2017-04-07 07:23:28

by Mike Galbraith

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Fri, 2017-04-07 at 09:22 +0200, Mike Galbraith wrote:
> On Fri, 2017-04-07 at 09:05 +0200, Mike Galbraith wrote:
> > On Fri, 2017-04-07 at 08:44 +0200, Mike Galbraith wrote:
> > > On Fri, 2017-04-07 at 09:24 +0300, Michael S. Tsirkin wrote:
> > > > On Fri, Apr 07, 2017 at 08:03:19AM +0200, Mike Galbraith wrote:
> > >
> > > > > Test tag works fine here w/wo threadirqs, RT works as well.
> > > > >
> > > > > -Mike
> > > >
> > > > Thanks a lot.
> > > > OK I pushed out two new tags
> > > > test1 with just the cleanup reverts
> > > > test2 with a bugfix in this area
> > > >
> > > >
> > > > I would very much appreciate your testing report on both -
> > > > should be ok but better make sure.
> > >
> > > Ok, once it percolates out I'll do that.
> >
> > for_linus-10-g960bef2a6172 contains a -ENOBUILD merge conflict.
>
> But test2 works fine w/wo threadirqs.

(CONFIG_DEBUG_SHIRQ=y as well btw)

2017-04-07 13:20:36

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Fri, Apr 07, 2017 at 09:22:02AM +0200, Mike Galbraith wrote:
> On Fri, 2017-04-07 at 09:05 +0200, Mike Galbraith wrote:
> > On Fri, 2017-04-07 at 08:44 +0200, Mike Galbraith wrote:
> > > On Fri, 2017-04-07 at 09:24 +0300, Michael S. Tsirkin wrote:
> > > > On Fri, Apr 07, 2017 at 08:03:19AM +0200, Mike Galbraith wrote:
> > >
> > > > > Test tag works fine here w/wo threadirqs, RT works as well.
> > > > >
> > > > > -Mike
> > > >
> > > > Thanks a lot.
> > > > OK I pushed out two new tags
> > > > test1 with just the cleanup reverts
> > > > test2 with a bugfix in this area
> > > >
> > > >
> > > > I would very much appreciate your testing report on both -
> > > > should be ok but better make sure.
> > >
> > > Ok, once it percolates out I'll do that.
> >
> > for_linus-10-g960bef2a6172 contains a -ENOBUILD merge conflict.
>
> But test2 works fine w/wo threadirqs.

Oops. This is what one gets by pushing at 2am. I fixed that one up
(still didn't even build as I'm in the middle of a conference).
Also it's actually the reverse test2 is just the revert test1 has
one more bugfix.

So I'm inclined to push test2 out to linux-next for now, and will
add test1 later if it fares well.

Mike, your testing is very much appreciated!

--
MST

2017-04-07 13:35:29

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Fri, Apr 07, 2017 at 04:20:12PM +0300, Michael S. Tsirkin wrote:
> On Fri, Apr 07, 2017 at 09:22:02AM +0200, Mike Galbraith wrote:
> > On Fri, 2017-04-07 at 09:05 +0200, Mike Galbraith wrote:
> > > On Fri, 2017-04-07 at 08:44 +0200, Mike Galbraith wrote:
> > > > On Fri, 2017-04-07 at 09:24 +0300, Michael S. Tsirkin wrote:
> > > > > On Fri, Apr 07, 2017 at 08:03:19AM +0200, Mike Galbraith wrote:
> > > >
> > > > > > Test tag works fine here w/wo threadirqs, RT works as well.
> > > > > >
> > > > > > -Mike
> > > > >
> > > > > Thanks a lot.
> > > > > OK I pushed out two new tags
> > > > > test1 with just the cleanup reverts
> > > > > test2 with a bugfix in this area
> > > > >
> > > > >
> > > > > I would very much appreciate your testing report on both -
> > > > > should be ok but better make sure.
> > > >
> > > > Ok, once it percolates out I'll do that.
> > >
> > > for_linus-10-g960bef2a6172 contains a -ENOBUILD merge conflict.
> >
> > But test2 works fine w/wo threadirqs.
>
> Oops. This is what one gets by pushing at 2am. I fixed that one up
> (still didn't even build as I'm in the middle of a conference).
> Also it's actually the reverse test2 is just the revert test1 has
> one more bugfix.
>
> So I'm inclined to push test2 out to linux-next for now, and will
> add test1 later if it fares well.
>
> Mike, your testing is very much appreciated!

Oh wait, I still put the ctx feature patches in there :(
Pls ignore, I'll update when I've fixed it up. Sorry about the noise.

> --
> MST

2017-04-07 14:30:14

by Mike Galbraith

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Fri, 2017-04-07 at 16:35 +0300, Michael S. Tsirkin wrote:

> Oh wait, I still put the ctx feature patches in there :(
> Pls ignore, I'll update when I've fixed it up. Sorry about the noise.

Both worked fine w/wo threadirqs.

-Mike

2017-04-07 18:57:02

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Fri, Apr 07, 2017 at 04:29:53PM +0200, Mike Galbraith wrote:
> On Fri, 2017-04-07 at 16:35 +0300, Michael S. Tsirkin wrote:
>
> > Oh wait, I still put the ctx feature patches in there :(
> > Pls ignore, I'll update when I've fixed it up. Sorry about the noise.
>
> Both worked fine w/wo threadirqs.
>
> -Mike

OK. test3 and test4 are now pushed: test3 should fix your hang,
test4 is trying to fix a crash reported independently.

Will push to linux-next once I hear from you.

--
MST

2017-04-07 21:36:09

by kernel test robot

[permalink] [raw]
Subject: [Random guest crashes since 5c34d002dcc7 ("virtio_pci] 3313bedd74: WARNING:at_include/linux/pci.h:#vp_del_vqs

FYI, we noticed the following commit:

commit: 3313bedd740af10575cc0e22742ee89166e1ded6 ("Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")")
url: https://github.com/0day-ci/linux/commits/Christoph-Hellwig/virtio_pci-don-t-use-shared-irq-for-virtqueues/20170404-140836


in testcase: trinity
with following parameters:

runtime: 300s

test-description: Trinity is a linux system call fuzz tester.
test-url: http://codemonkey.org.uk/projects/trinity/


on test machine: qemu-system-i386 -enable-kvm -smp 2 -m 320M

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


+--------------------------------------------+------------+------------+
| | 89970a04d7 | 3313bedd74 |
+--------------------------------------------+------------+------------+
| boot_successes | 8 | 0 |
| boot_failures | 0 | 19 |
| WARNING:at_include/linux/pci.h:#vp_del_vqs | 0 | 19 |
+--------------------------------------------+------------+------------+



[ 57.630582] WARNING: CPU: 0 PID: 1 at include/linux/pci.h:1365 vp_del_vqs+0xf6/0x100
[ 57.634248] CPU: 0 PID: 1 Comm: swapper Not tainted 4.11.0-rc4-00065-g3313bed #1
[ 57.637078] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
[ 57.641013] Call Trace:
[ 57.642234] dump_stack+0x16/0x19
[ 57.643521] __warn+0xd1/0xf0
[ 57.644630] ? vp_del_vqs+0xf6/0x100
[ 57.646488] warn_slowpath_null+0x2a/0x30
[ 57.648219] vp_del_vqs+0xf6/0x100
[ 57.649606] virtblk_remove+0x6c/0xb0
[ 57.651041] virtio_dev_remove+0x39/0xa0
[ 57.652559] driver_probe_device+0xbe/0x4a0
[ 57.654271] ? klist_next+0x1b/0xf0
[ 57.655740] __driver_attach+0xd9/0x100
[ 57.659805] ? _raw_spin_unlock+0x22/0x30
[ 57.661576] ? klist_next+0x73/0xf0
[ 57.663027] ? bus_uevent_store+0x40/0x40
[ 57.664536] ? driver_probe_device+0x4a0/0x4a0
[ 57.666558] bus_for_each_dev+0x4f/0x80
[ 57.668242] driver_attach+0x1e/0x20
[ 57.669664] ? driver_probe_device+0x4a0/0x4a0
[ 57.691602] bus_add_driver+0x1df/0x280
[ 57.693154] driver_register+0x5d/0xf0
[ 57.694684] ? mm_init+0x151/0x151
[ 57.696301] register_virtio_driver+0x1b/0x30
[ 57.698171] init+0x4e/0x78
[ 57.699416] do_one_initcall+0x79/0x123
[ 57.700773] ? trace_hardirqs_on+0xb/0x10
[ 57.702466] ? kernel_init_freeable+0x17f/0x217
[ 57.704207] kernel_init_freeable+0x19f/0x217
[ 57.706067] ? rest_init+0x120/0x120
[ 57.707460] kernel_init+0x10/0x100
[ 57.708816] ? schedule_tail+0x11/0x50
[ 57.710603] ? rest_init+0x120/0x120
[ 57.712249] ret_from_fork+0x21/0x30
[ 57.713684] ---[ end trace e7da8fd7329ca23e ]---


To reproduce:

git clone https://github.com/01org/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email



Thanks,
Kernel Test Robot


Attachments:
(No filename) (2.97 kB)
config-4.11.0-rc4-00065-g3313bed (114.68 kB)
job-script (3.55 kB)
dmesg.xz (26.95 kB)
Download all attachments

2017-04-08 05:01:53

by Mike Galbraith

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Fri, 2017-04-07 at 21:56 +0300, Michael S. Tsirkin wrote:

> OK. test3 and test4 are now pushed: test3 should fix your hang,
> test4 is trying to fix a crash reported independently.

test3 does not fix the post hibernate hang business that I can easily
reproduce, those are NFS, and at least as old as 4.4. Host/guest,
dunno, put 4.4 on both, guest hangs intermittently.

[<ffffffffa039a550>] __rpc_wait_for_completion_task+0x30/0x30 [sunrpc]
[<ffffffffa039a56e>] rpc_wait_bit_killable+0x1e/0xb0 [sunrpc]
[<ffffffffa039a550>] __rpc_wait_for_completion_task+0x30/0x30 [sunrpc]
[<ffffffff810c6450>] autoremove_wake_function+0x50/0x50
[<ffffffffa038f670>] call_decode+0x850/0x850 [sunrpc]
[<ffffffffa038f670>] call_decode+0x850/0x850 [sunrpc]
[<ffffffffa039b17e>] __rpc_execute+0x14e/0x440 [sunrpc]
[<ffffffff810f7d35>] ktime_get+0x35/0xa0
[<ffffffffa0390900>] rpc_run_task+0x120/0x170 [sunrpc]
[<ffffffffa064a1c6>] nfs4_call_sync_sequence+0x56/0x80 [nfsv4]
[<ffffffffa064acd0>] _nfs4_proc_getattr+0xb0/0xc0 [nfsv4]
[<ffffffff8123f8b2>] path_lookupat+0xd2/0x100
[<ffffffffa0655d9c>] nfs4_proc_getattr+0x5c/0xe0 [nfsv4]
[<ffffffffa0612120>] __nfs_revalidate_inode+0xa0/0x300 [nfs]
[<ffffffffa0612485>] nfs_getattr+0x95/0x250 [nfs]
[<ffffffff812365eb>] vfs_statx+0x7b/0xc0
[<ffffffff81236a60>] SYSC_newstat+0x20/0x40
[<ffffffff81687177>] entry_SYSCALL_64_fastpath+0x1a/0xa9
[<ffffffffffffffff>] 0xffffffffffffffff

I noted no _other_ misbehavior in either kernel, w/wo threadirqs.

-Mike

2017-04-10 21:23:06

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Sat, Apr 08, 2017 at 07:01:34AM +0200, Mike Galbraith wrote:
> On Fri, 2017-04-07 at 21:56 +0300, Michael S. Tsirkin wrote:
>
> > OK. test3 and test4 are now pushed: test3 should fix your hang,
> > test4 is trying to fix a crash reported independently.
>
> test3 does not fix the post hibernate hang business that I can easily
> reproduce, those are NFS, and at least as old as 4.4. Host/guest,
> dunno, put 4.4 on both, guest hangs intermittently.

OK so IIUC you agree it's a good idea to send test4 to Linus, right?
Hybernation's still broken but that's not a regression.

> [<ffffffffa039a550>] __rpc_wait_for_completion_task+0x30/0x30 [sunrpc]
> [<ffffffffa039a56e>] rpc_wait_bit_killable+0x1e/0xb0 [sunrpc]
> [<ffffffffa039a550>] __rpc_wait_for_completion_task+0x30/0x30 [sunrpc]
> [<ffffffff810c6450>] autoremove_wake_function+0x50/0x50
> [<ffffffffa038f670>] call_decode+0x850/0x850 [sunrpc]
> [<ffffffffa038f670>] call_decode+0x850/0x850 [sunrpc]
> [<ffffffffa039b17e>] __rpc_execute+0x14e/0x440 [sunrpc]
> [<ffffffff810f7d35>] ktime_get+0x35/0xa0
> [<ffffffffa0390900>] rpc_run_task+0x120/0x170 [sunrpc]
> [<ffffffffa064a1c6>] nfs4_call_sync_sequence+0x56/0x80 [nfsv4]
> [<ffffffffa064acd0>] _nfs4_proc_getattr+0xb0/0xc0 [nfsv4]
> [<ffffffff8123f8b2>] path_lookupat+0xd2/0x100
> [<ffffffffa0655d9c>] nfs4_proc_getattr+0x5c/0xe0 [nfsv4]
> [<ffffffffa0612120>] __nfs_revalidate_inode+0xa0/0x300 [nfs]
> [<ffffffffa0612485>] nfs_getattr+0x95/0x250 [nfs]
> [<ffffffff812365eb>] vfs_statx+0x7b/0xc0
> [<ffffffff81236a60>] SYSC_newstat+0x20/0x40
> [<ffffffff81687177>] entry_SYSCALL_64_fastpath+0x1a/0xa9
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> I noted no _other_ misbehavior in either kernel, w/wo threadirqs.
>
> -Mike

Interesting. I would guess virtio net does not complete some
packets. So you were unable to find an old guest where this
works fine?

--
MST

2017-04-11 04:19:55

by Mike Galbraith

[permalink] [raw]
Subject: Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

On Tue, 2017-04-11 at 00:23 +0300, Michael S. Tsirkin wrote:
> On Sat, Apr 08, 2017 at 07:01:34AM +0200, Mike Galbraith wrote:
> > On Fri, 2017-04-07 at 21:56 +0300, Michael S. Tsirkin wrote:
> >
> > > OK. test3 and test4 are now pushed: test3 should fix your hang,
> > > test4 is trying to fix a crash reported independently.
> >
> > test3 does not fix the post hibernate hang business that I can easily
> > reproduce, those are NFS, and at least as old as 4.4. Host/guest,
> > dunno, put 4.4 on both, guest hangs intermittently.
>
> OK so IIUC you agree it's a good idea to send test4 to Linus, right?

Well, my box agrees that that is a viable option.

> Hybernation's still broken but that's not a regression.

Yup.

> > [] __rpc_wait_for_completion_task+0x30/0x30 [sunrpc]
> > [] rpc_wait_bit_killable+0x1e/0xb0 [sunrpc]
> > [] __rpc_wait_for_completion_task+0x30/0x30 [sunrpc]
> > [] autoremove_wake_function+0x50/0x50
> > [] call_decode+0x850/0x850 [sunrpc]
> > [] call_decode+0x850/0x850 [sunrpc]
> > [] __rpc_execute+0x14e/0x440 [sunrpc]
> > [] ktime_get+0x35/0xa0
> > [] rpc_run_task+0x120/0x170 [sunrpc]
> > [] nfs4_call_sync_sequence+0x56/0x80 [nfsv4]
> > [] _nfs4_proc_getattr+0xb0/0xc0 [nfsv4]
> > [] path_lookupat+0xd2/0x100
> > [] nfs4_proc_getattr+0x5c/0xe0 [nfsv4]
> > [] __nfs_revalidate_inode+0xa0/0x300 [nfs]
> > [] nfs_getattr+0x95/0x250 [nfs]
> > [] vfs_statx+0x7b/0xc0
> > [] SYSC_newstat+0x20/0x40
> > [] entry_SYSCALL_64_fastpath+0x1a/0xa9
> > [] 0xffffffffffffffff
> >
> > I noted no _other_ misbehavior in either kernel, w/wo threadirqs.
> >
> > > > -Mike
>
> Interesting. I would guess virtio net does not complete some
> packets. So you were unable to find an old guest where this
> works fine?

I just tried my opensuse 13.2 clone. It works markedly less fine,
turns into a brick either on the way down or back up in short order.

-Mike