2024-05-03 18:12:21

by Joseph Salisbury

[permalink] [raw]
Subject: [REGRESSION][v6.8-rc1] virtio-pci: Introduce admin virtqueue

Hi Feng,

During testing, a kernel bug was identified with the suspend/resume
functionality on instances running in a public cloud [0].  This bug is a
regression introduced in v6.8-rc1.  After a kernel bisect, the following
commit was identified as the cause of the regression:

       fd27ef6b44be  ("virtio-pci: Introduce admin virtqueue")

I was hoping to get your feedback, since you are the patch author. Do
you think gathering any additional data will help diagnose this issue? 
This commit is depended upon by other virtio commits, so a revert test
is not really straight forward without reverting all the dependencies.  
Any ideas you have would be greatly appreciated.


Thanks,

Joe

http://pad.lv/2063315



2024-05-08 03:34:23

by Jason Wang

[permalink] [raw]
Subject: Re: [REGRESSION][v6.8-rc1] virtio-pci: Introduce admin virtqueue

On Sat, May 4, 2024 at 2:10 AM Joseph Salisbury
<[email protected]> wrote:
>
> Hi Feng,
>
> During testing, a kernel bug was identified with the suspend/resume
> functionality on instances running in a public cloud [0]. This bug is a
> regression introduced in v6.8-rc1. After a kernel bisect, the following
> commit was identified as the cause of the regression:
>
> fd27ef6b44be ("virtio-pci: Introduce admin virtqueue")

Have a quick glance at the patch it seems it should not damage the
freeze/restore as it should behave as in the past.

But I found something interesting:

1) assumes 1 admin vq which is not what spec said
2) special function for admin virtqueue during freeze/restore, but it
doesn't do anything special than del_vq()
3) lack real users but I guess e.g the destroy_avq() needs to be
synchronized with the one that is using admin virtqueue

>
> I was hoping to get your feedback, since you are the patch author. Do
> you think gathering any additional data will help diagnose this issue?

Yes, please show us

1) the kernel log here.
2) the features that the device has like
/sys/bus/virtio/devices/virtio0/features

> This commit is depended upon by other virtio commits, so a revert test
> is not really straight forward without reverting all the dependencies.
> Any ideas you have would be greatly appreciated.

Thanks

>
>
> Thanks,
>
> Joe
>
> http://pad.lv/2063315
>


2024-05-08 11:18:58

by Catherine Redfield

[permalink] [raw]
Subject: Re: [REGRESSION][v6.8-rc1] virtio-pci: Introduce admin virtqueue

On a VM with the GCP kernel (where we first identified the problem), I see:

1. The full kernel log from `journalctl --system > kernlog` attached. The
specific suspend section is here:

May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
systemd[1]: Reached target sleep.target - Sleep.
May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
systemd[1]: Starting systemd-suspend.service - System Suspend...
May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
systemd-sleep[1413]: Performing sleep operation 'suspend'...
May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel:
PM: suspend entry (deep)
May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel:
Filesystems sync: 0.008 seconds
May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel:
Freezing user space processes
May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel:
Freezing user space processes completed (elapsed 0.001 seconds)
May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel:
OOM killer disabled.
May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel:
Freezing remaining freezable tasks
May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel:
Freezing remaining freezable tasks completed (elapsed 0.000 seconds)
May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel:
printk: Suspending console(s) (use no_console_suspend to debug)
May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel:
port 00:03:0.0: PM: dpm_run_callback(): pm_runtime_force_suspend+0x0/0x130
returns -16
May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel:
port 00:03:0.0: PM: failed to suspend: error -16
May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel:
sd 0:0:1:0: [sda] Synchronizing SCSI cache
May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel:
PM: Some devices failed to suspend, or early wake event detected
May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel:
OOM killer enabled.
May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel:
Restarting tasks ... done.
May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel:
random: crng reseeded on system resumption
May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel:
PM: suspend exit
May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel:
PM: suspend entry (s2idle)
-- Boot 61828bc938b44fc68a8aeedc16a23a9d --
May 08 11:09:03 localhost kernel: Linux version 6.8.0-1007-gcp
(buildd@lcy02-amd64-079) (x86_64-linux-gnu-gcc-13 (Ubuntu 13.2.0-23ubuntu4)
13.2.0, GNU ld (GNU Binutils for Ubuntu) 2.42) #7-Ubuntu SMP Sat Apr 20
00:58:31 UTC 2024 (Ubuntu 6.8.0-1007.7-gcp 6.8.1)
May 08 11:09:03 localhost kernel: Command line:
BOOT_IMAGE=/vmlinuz-6.8.0-1007-gcp
root=PARTUUID=7a949935-6bf2-4cae-b404-803c95163572 ro console=ttyS0,115200
panic=-1

2. The features the devices has:

catred@kernel-test-202405080702:~$ cat
/sys/bus/virtio/devices/virtio0/features
0110000000000000000000000000010000000000000000000000000000000000
catred@kernel-test-202405080702:~$ cat
/sys/bus/virtio/devices/virtio1/features
1110010110011001110000100000010000000000000000000000000000000000
catred@kernel-test-202405080702:~$ cat
/sys/bus/virtio/devices/virtio2/features
1110000000000000000000000000000000000000000000000000000000000000
catred@kernel-test-202405080702:~$ cat
/sys/bus/virtio/devices/virtio3/features
0000000000000000000000000000000000000000000000000000000000000000

Catherine

On Tue, May 7, 2024 at 11:34 PM Jason Wang <[email protected]> wrote:

> On Sat, May 4, 2024 at 2:10 AM Joseph Salisbury
> <[email protected]> wrote:
> >
> > Hi Feng,
> >
> > During testing, a kernel bug was identified with the suspend/resume
> > functionality on instances running in a public cloud [0]. This bug is a
> > regression introduced in v6.8-rc1. After a kernel bisect, the following
> > commit was identified as the cause of the regression:
> >
> > fd27ef6b44be ("virtio-pci: Introduce admin virtqueue")
>
> Have a quick glance at the patch it seems it should not damage the
> freeze/restore as it should behave as in the past.
>
> But I found something interesting:
>
> 1) assumes 1 admin vq which is not what spec said
> 2) special function for admin virtqueue during freeze/restore, but it
> doesn't do anything special than del_vq()
> 3) lack real users but I guess e.g the destroy_avq() needs to be
> synchronized with the one that is using admin virtqueue
>
> >
> > I was hoping to get your feedback, since you are the patch author. Do
> > you think gathering any additional data will help diagnose this issue?
>
> Yes, please show us
>
> 1) the kernel log here.
> 2) the features that the device has like
> /sys/bus/virtio/devices/virtio0/features
>
> > This commit is depended upon by other virtio commits, so a revert test
> > is not really straight forward without reverting all the dependencies.
> > Any ideas you have would be greatly appreciated.
>
> Thanks
>
> >
> >
> > Thanks,
> >
> > Joe
> >
> > http://pad.lv/2063315
> >
>
>


Attachments:
kernlog20240508_0714 (532.37 kB)

2024-05-16 10:01:10

by Jason Wang

[permalink] [raw]
Subject: Re: [REGRESSION][v6.8-rc1] virtio-pci: Introduce admin virtqueue

On Thu, May 16, 2024 at 5:46 PM Catherine Redfield
<[email protected]> wrote:
>
> Feng,
>
> Thank you for providing your debugging steps; I used them on a gce image locally and was not able to replicate the issue. I also attempted to replicate in qemu/virsh using qemu-guest-agent to enable the S3 suspend state, also without success (that is S3 suspend state worked without any problems). I have brought this back to the cloud for further debugging of their config and guest agent to try and determine what the issue is.
>
> Thank you very much for all your help on this issue and time looking into it!
> Catherine

Does this fix the issue? I guess the reason is that GCE is using legacy virtio.

https://lore.kernel.org/kvm/CACGkMEth_9Baewekq862YgZwuozwG96Z3G6oYqHzyCj2JPUZ3g@mail.gmail.com/T/

Thanks

>
> On Thu, May 9, 2024 at 5:03 AM Feng Liu <[email protected]> wrote:
>>
>>
>> On 2024-05-08 a.m.7:18, Catherine Redfield wrote:
>> > *External email: Use caution opening links or attachments*
>> >
>> >
>> > On a VM with the GCP kernel (where we first identified the problem), I see:
>> >
>> > 1. The full kernel log from `journalctl --system > kernlog` attached.
>> > The specific suspend section is here:
>> >
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > systemd[1]: Reached target sleep.target - Sleep.
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > systemd[1]: Starting systemd-suspend.service - System Suspend...
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > systemd-sleep[1413]: Performing sleep operation 'suspend'...
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: PM: suspend entry (deep)
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: Filesystems sync: 0.008 seconds
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: Freezing user space processes
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: Freezing user space processes completed (elapsed 0.001 seconds)
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: OOM killer disabled.
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: Freezing remaining freezable tasks
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: Freezing remaining freezable tasks completed (elapsed 0.000 seconds)
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: printk: Suspending console(s) (use no_console_suspend to debug)
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: port 00:03:0.0: PM: dpm_run_callback():
>> > pm_runtime_force_suspend+0x0/0x130 returns -16
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: port 00:03:0.0: PM: failed to suspend: error -16
>>
>> Thanks Joesph and Catherine's help.
>>
>> Hi,
>>
>> I have alreay synced up with Cananical guys offline about this issue.
>>
>> I can run "suspend/resume" sucessfully on my local server and VM.
>> And "PM: failed to suspend: error -16" looks like not cause by my
>> previous virtio patch ( fd27ef6b44be ("virtio-pci: Introduce admin
>> virtqueue")) which only modified "virtio_device_freeze" about "suspend"
>> action.
>>
>> So I have provide the my steps and debug patch to Joesph and Catherine.
>> I will also sync up the information here, as follow:
>>
>> I have read the qemu code and find a way to trigger "suspend/resume" on
>> my setup, and add some debug message in the latest kerenel
>>
>> My setps are:
>> 1. QEMU cmdline add following
>> ....
>> -global PIIX4_PM.disable_s3=0 \
>> -global PIIX4_PM.disable_s4=1 \
>> ....
>> -netdev type=tap,ifname=tap0,id=hostnet0,script=no,downscript=no \
>> -device
>> virtio-net-pci,netdev=hostnet0,id=net0,mac=$SSH_MAC,bus=pci.0,addr=0x3 \
>> ......
>>
>> 2. In the VM, run "systemctl suspend" to PM suspend the VM into memory
>> 3. In qemu hmp shell, run "system_wakeup" to resume the VM again
>>
>> My VM configuration:
>> NIC: 1 virtio nic emulated by QEMU
>> OS: Ubuntu 22.04.4 LTS
>> kernel: latest kernel, 6.9-rc7: ee5b455b0ada (kernel2/net-next-virito,
>> kernel2/master, master) Merge tag 'slab-for-6.9-rc7-fixes' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab)
>>
>>
>> I add some debug message on the latest kernel, and do above steps to
>> trigger "suspen/resume". Everything of VM is OK, VM could suspend/resume
>> successfully.
>> Follwing is the kernel log:
>> ----------------------------------------------------------------------------
>> ........
>> May 6 15:59:52 feliu-vm kernel: [ 43.446737] PM: suspend entry (deep)
>> May 6 16:00:04 feliu-vm kernel: [ 43.467640] Filesystems sync: 0.020
>> seconds
>> May 6 16:00:04 feliu-vm kernel: [ 43.467923] Freezing user space
>> processes
>> May 6 16:00:04 feliu-vm kernel: [ 43.470294] Freezing user space
>> processes completed (elapsed 0.002 seconds)
>> May 6 16:00:04 feliu-vm kernel: [ 43.470299] OOM killer disabled.
>> May 6 16:00:04 feliu-vm kernel: [ 43.470301] Freezing remaining
>> freezable tasks
>> May 6 16:00:04 feliu-vm kernel: [ 43.471482] Freezing remaining
>> freezable tasks completed (elapsed 0.001 seconds)
>> May 6 16:00:04 feliu-vm kernel: [ 43.471495] printk: Suspending
>> console(s) (use no_console_suspend to debug)
>> May 6 16:00:04 feliu-vm kernel: [ 43.474034] virtio_net virtio0:
>> godeng virtio device freeze
>> May 6 16:00:04 feliu-vm kernel: [ 43.475714] virtio_net virtio0 ens3:
>> godfeng virtnet_freeze done
>> May 6 16:00:04 feliu-vm kernel: [ 43.475717] virtio_net virtio0:
>> godfeng VIRTIO_F_ADMIN_VQ not enabled
>> May 6 16:00:04 feliu-vm kernel: [ 43.475719] virtio_net virtio0:
>> godeng virtio device freeze done
>> ........
>> May 6 16:00:04 feliu-vm kernel: [ 43.535382] smpboot: CPU 1 is now
>> offline
>> May 6 16:00:04 feliu-vm kernel: [ 43.537283] IRQ fixup: irq 1 move in
>> progress, old vector 32
>> May 6 16:00:04 feliu-vm kernel: [ 43.538504] smpboot: CPU 2 is now
>> offline
>> May 6 16:00:04 feliu-vm kernel: [ 43.541392] smpboot: CPU 3 is now
>> offline
>>
>> ......
>>
>> May 6 16:00:04 feliu-vm kernel: [ 54.973285] smpboot: Booting Node 0
>> Processor 15 APIC 0xf
>> May 6 16:00:04 feliu-vm kernel: [ 54.975190] CPU15 is up
>> May 6 16:00:04 feliu-vm kernel: [ 54.976011] ACPI: PM: Waking up from
>> system sleep state S3
>> May 6 16:00:04 feliu-vm kernel: [ 54.986071] virtio_net virtio0:
>> godeng virtio device restore
>> May 6 16:00:04 feliu-vm kernel: [ 54.987563] virtio_net virtio0 ens3:
>> godfeng virtnet_restore done
>> May 6 16:00:04 feliu-vm kernel: [ 54.987635] virtio_net virtio0:
>> godfeng: virtio device restore done
>> ......
>> May 6 16:00:04 feliu-vm kernel: [ 55.307221] ata8: SATA link down
>> (SStatus 0 SControl 300)
>> May 6 16:00:04 feliu-vm kernel: [ 55.442048] OOM killer enabled.
>> May 6 16:00:04 feliu-vm kernel: [ 55.442051] Restarting tasks ... done.
>> May 6 16:00:04 feliu-vm kernel: [ 55.443576] random: crng reseeded on
>> system resumption
>> May 6 16:00:04 feliu-vm kernel: [ 55.443582] PM: suspend exit
>>
>> ----------------------------------------------------------------------------
>>
>> Attachment is the full kernel log. I think maybe it is some configration
>> error.
>>
>>
>> Thanks
>> Feng
>>
>>
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: sd 0:0:1:0: [sda] Synchronizing SCSI cache
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: PM: Some devices failed to suspend, or early wake event detected
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: OOM killer enabled.
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: Restarting tasks ... done.
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: random: crng reseeded on system resumption
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: PM: suspend exit
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: PM: suspend entry (s2idle)
>> > -- Boot 61828bc938b44fc68a8aeedc16a23a9d --
>> > May 08 11:09:03 localhost kernel: Linux version 6.8.0-1007-gcp
>> > (buildd@lcy02-amd64-079) (x86_64-linux-gnu-gcc-13 (Ubuntu
>> > 13.2.0-23ubuntu4) 13.2.0, GNU ld (GNU Binutils for Ubuntu) 2.42)
>> > #7-Ubuntu SMP Sat Apr 20 00:58:31 UTC 2024 (Ubuntu 6.8.0-1007.7-gcp 6.8.1)
>> > May 08 11:09:03 localhost kernel: Command line:
>> > BOOT_IMAGE=/vmlinuz-6.8.0-1007-gcp
>> > root=PARTUUID=7a949935-6bf2-4cae-b404-803c95163572 ro
>> > console=ttyS0,115200 panic=-1
>> >
>> > 2. The features the devices has:
>> >
>> > catred@kernel-test-202405080702:~$ cat
>> > /sys/bus/virtio/devices/virtio0/features
>> > 0110000000000000000000000000010000000000000000000000000000000000
>> > catred@kernel-test-202405080702:~$ cat
>> > /sys/bus/virtio/devices/virtio1/features
>> > 1110010110011001110000100000010000000000000000000000000000000000
>> > catred@kernel-test-202405080702:~$ cat
>> > /sys/bus/virtio/devices/virtio2/features
>> > 1110000000000000000000000000000000000000000000000000000000000000
>> > catred@kernel-test-202405080702:~$ cat
>> > /sys/bus/virtio/devices/virtio3/features
>> > 0000000000000000000000000000000000000000000000000000000000000000
>> >
>> > Catherine
>> >
>> > On Tue, May 7, 2024 at 11:34 PM Jason Wang <[email protected]
>> > <mailto:[email protected]>> wrote:
>> >
>> > On Sat, May 4, 2024 at 2:10 AM Joseph Salisbury
>> > <[email protected]
>> > <mailto:[email protected]>> wrote:
>> > >
>> > > Hi Feng,
>> > >
>> > > During testing, a kernel bug was identified with the suspend/resume
>> > > functionality on instances running in a public cloud [0]. This
>> > bug is a
>> > > regression introduced in v6.8-rc1. After a kernel bisect, the
>> > following
>> > > commit was identified as the cause of the regression:
>> > >
>> > > fd27ef6b44be ("virtio-pci: Introduce admin virtqueue")
>> >
>> > Have a quick glance at the patch it seems it should not damage the
>> > freeze/restore as it should behave as in the past.
>> >
>> > But I found something interesting:
>> >
>> > 1) assumes 1 admin vq which is not what spec said
>> > 2) special function for admin virtqueue during freeze/restore, but it
>> > doesn't do anything special than del_vq()
>> > 3) lack real users but I guess e.g the destroy_avq() needs to be
>> > synchronized with the one that is using admin virtqueue
>> >
>> > >
>> > > I was hoping to get your feedback, since you are the patch author. Do
>> > > you think gathering any additional data will help diagnose this
>> > issue?
>> >
>> > Yes, please show us
>> >
>> > 1) the kernel log here.
>> > 2) the features that the device has like
>> > /sys/bus/virtio/devices/virtio0/features
>> >
>> > > This commit is depended upon by other virtio commits, so a revert
>> > test
>> > > is not really straight forward without reverting all the
>> > dependencies.
>> > > Any ideas you have would be greatly appreciated.
>> >
>> > Thanks
>> >
>> > >
>> > >
>> > > Thanks,
>> > >
>> > > Joe
>> > >
>> > > http://pad.lv/2063315 <http://pad.lv/2063315>
>> > >
>> >