Subject: Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM

[TLDR: I'm adding this regression to regzbot, the Linux kernel
regression tracking bot; most of the text below is compiled from a few
template paragraphs some of you might have seen already.]

Hi, this is your Linux kernel regression tracker speaking.


On 17.01.22 03:12, James D. Turner wrote:
>
> With newer kernels, starting with the v5.14 series, when using a MS
> Windows 10 guest VM with PCI passthrough of an AMD Radeon Pro WX 3200
> discrete GPU, the passed-through GPU will not run above 501 MHz, even
> when it is under 100% load and well below the temperature limit. As a
> result, GPU-intensive software (such as video games) runs unusably
> slowly in the VM.

Thanks for the report. Greg already asked for a bisection, which would
help a lot here.

To be sure this issue doesn't fall through the cracks unnoticed, I'm
adding it to regzbot, my Linux kernel regression tracking bot:

#regzbot ^introduced v5.13..v5.14-rc1
#regzbot ignore-activity

Reminder: when fixing the issue, please add a 'Link:' tag with the URL
to the report (the parent of this mail) using the kernel.org redirector,
as explained in 'Documentation/process/submitting-patches.rst'. Regzbot
then will automatically mark the regression as resolved once the fix
lands in the appropriate tree. For more details about regzbot see footer.

Sending this to everyone who received the initial report, to make all
aware of the tracking. I also hope that messages like this motivate
people to get at least the regression mailing list and ideally even
regzbot directly involved when dealing with regressions, as messages
like this wouldn't be needed then.

Don't worry, I'll send further messages wrt this regression just to
the lists (with a tag in the subject so people can filter them away), as
long as they are intended just for regzbot. With a bit of luck no such
messages will be needed anyway.

Ciao, Thorsten (wearing his 'Linux kernel regression tracker' hat)

P.S.: As a Linux kernel regression tracker I'm getting a lot of reports
on my table. I can only look briefly into most of them; unfortunately
that means I sometimes get things wrong or miss something important.
I hope that's not the case here; if you think it is, don't hesitate to
tell me about it in a public reply, that's in everyone's interest.

BTW, I have no personal interest in this issue, which is tracked using
regzbot, my Linux kernel regression tracking bot
(https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting
this mail to get things rolling again and hence don't need to be CCed
on all further activities wrt this regression.

> In contrast, with older kernels, the passed-through GPU runs at up to
> 1295 MHz (the correct hardware limit), so GPU-intensive software runs at
> a reasonable speed in the VM.
>
> I've confirmed that the issue exists with the following kernel versions:
>
> - v5.16
> - v5.14
> - v5.14-rc1
>
> The issue does not exist with the following kernels:
>
> - v5.13
> - various packaged (non-vanilla) 5.10.* Arch Linux `linux-lts` kernels
>
> So, the issue was introduced between v5.13 and v5.14-rc1. I'm willing to
> bisect the commit history to narrow it down further, if that would be
> helpful.
>
> The configuration details and test results are provided below. In
> summary, for the kernels with this issue, the GPU core stays at a
> constant 0.8 V, the GPU core clock ranges from 214 MHz to 501 MHz, and
> the GPU memory stays at a constant 625 MHz, in the VM. For the correctly
> working kernels, the GPU core ranges from 0.85 V to 1.0 V, the GPU core
> clock ranges from 214 MHz to 1295 MHz, and the GPU memory stays at 1500
> MHz, in the VM.
>
> Please let me know if additional information would be helpful.
>
> Regards,
> James Turner
>
> # Configuration Details
>
> Hardware:
>
> - Dell Precision 7540 laptop
> - CPU: Intel Core i7-9750H (x86-64)
> - Discrete GPU: AMD Radeon Pro WX 3200
> - The internal display is connected to the integrated GPU, and external
> displays are connected to the discrete GPU.
>
> Software:
>
> - KVM host: Arch Linux
> - self-built vanilla kernel (built using Arch Linux `PKGBUILD`
> modified to use vanilla kernel sources from git.kernel.org)
> - libvirt 1:7.10.0-2
> - qemu 6.2.0-2
>
> - KVM guest: Windows 10
> - GPU driver: Radeon Pro Software Version 21.Q3 (Note that I also
> experienced this issue with the 20.Q4 driver, using packaged
> (non-vanilla) Arch Linux kernels on the host, before updating to the
> 21.Q3 driver.)
>
> Kernel config:
>
> - For v5.13, v5.14-rc1, and v5.14, I used
> https://github.com/archlinux/svntogit-packages/blob/89c24952adbfa645d9e1a6f12c572929f7e4e3c7/trunk/config
> (The build script ran `make olddefconfig` on that config file.)
>
> - For v5.16, I used
> https://github.com/archlinux/svntogit-packages/blob/94f84e1ad8a530e54aa34cadbaa76e8dcc439d10/trunk/config
> (The build script ran `make olddefconfig` on that config file.)
>
> I set up the VM with PCI passthrough according to the instructions at
> https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF
>
> I'm passing through the following PCI devices to the VM, as listed by
> `lspci -D -nn`:
>
> 0000:01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Lexa XT [Radeon PRO WX 3200] [1002:6981]
> 0000:01:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X] [1002:aae0]
>
> The host kernel command line includes the following relevant options:
>
> intel_iommu=on vfio-pci.ids=1002:6981,1002:aae0
>
> to enable IOMMU and bind the `vfio-pci` driver to the PCI devices.
>
> My `/etc/mkinitcpio.conf` includes the following line:
>
> MODULES=(vfio_pci vfio vfio_iommu_type1 vfio_virqfd i915 amdgpu)
>
> to load `vfio-pci` before the graphics drivers. (Note that removing
> `i915 amdgpu` has no effect on this issue.)
>
> I'm using libvirt to manage the VM. The relevant portions of the XML
> file are:
>
> <hostdev mode="subsystem" type="pci" managed="yes">
> <source>
> <address domain="0x0000" bus="0x01" slot="0x00" function="0x0"/>
> </source>
> <address type="pci" domain="0x0000" bus="0x06" slot="0x00" function="0x0"/>
> </hostdev>
> <hostdev mode="subsystem" type="pci" managed="yes">
> <source>
> <address domain="0x0000" bus="0x01" slot="0x00" function="0x1"/>
> </source>
> <address type="pci" domain="0x0000" bus="0x07" slot="0x00" function="0x0"/>
> </hostdev>
>
> # Test Results
>
> For testing, I used the following procedure:
>
> 1. Boot the host machine and log in.
>
> 2. Run the following commands to gather information. For all the tests,
> the output was identical.
>
> - `cat /proc/sys/kernel/tainted` printed:
>
> 0
>
> - `hostnamectl | grep "Operating System"` printed:
>
> Operating System: Arch Linux
>
> - `lspci -nnk -d 1002:6981` printed
>
> 01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Lexa XT [Radeon PRO WX 3200] [1002:6981]
> Subsystem: Dell Device [1028:0926]
> Kernel driver in use: vfio-pci
> Kernel modules: amdgpu
>
> - `lspci -nnk -d 1002:aae0` printed
>
> 01:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X] [1002:aae0]
> Subsystem: Dell Device [1028:0926]
> Kernel driver in use: vfio-pci
> Kernel modules: snd_hda_intel
>
> - `sudo dmesg | grep -i vfio` printed the kernel command line and the
> following messages:
>
> VFIO - User Level meta-driver version: 0.3
> vfio-pci 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
> vfio_pci: add [1002:6981[ffffffff:ffffffff]] class 0x000000/00000000
> vfio_pci: add [1002:aae0[ffffffff:ffffffff]] class 0x000000/00000000
> vfio-pci 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
>
> 3. Start the Windows VM using libvirt and log in. Record sensor
> information.
>
> 4. Run a graphically-intensive video game to put the GPU under load.
> Record sensor information.
>
> 5. Stop the game. Record sensor information.
>
> 6. Shut down the VM. Save the output of `sudo dmesg`.
>
> I compared the `sudo dmesg` output for v5.13 and v5.14-rc1 and didn't
> see any relevant differences.
>
> Note that the issue occurs only within the guest VM. When I'm not using
> a VM (after removing `vfio-pci.ids=1002:6981,1002:aae0` from the kernel
> command line so that the PCI devices are bound to their normal `amdgpu`
> and `snd_hda_intel` drivers instead of the `vfio-pci` driver), the GPU
> operates correctly on the host.
>
> ## Linux v5.16 (issue present)
>
> $ cat /proc/version
> Linux version 5.16.0-1 (linux@archlinux) (gcc (GCC) 11.1.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Sun, 16 Jan 2022 01:51:08 +0000
>
> Before running the game:
>
> - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 53.0 degC
> - GPU memory: 625.0 MHz
>
> While running the game:
>
> - GPU core: 501.0 MHz, 0.800 V, 100.0% load, 54.0 degC
> - GPU memory: 625.0 MHz
>
> After stopping the game:
>
> - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 51.0 degC
> - GPU memory: 625.0 MHz
>
> ## Linux v5.14 (issue present)
>
> $ cat /proc/version
> Linux version 5.14.0-1 (linux@archlinux) (gcc (GCC) 11.1.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Sun, 16 Jan 2022 03:19:35 +0000
>
> Before running the game:
>
> - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 50.0 degC
> - GPU memory: 625.0 MHz
>
> While running the game:
>
> - GPU core: 501.0 MHz, 0.800 V, 100.0% load, 54.0 degC
> - GPU memory: 625.0 MHz
>
> After stopping the game:
>
> - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 49.0 degC
> - GPU memory: 625.0 MHz
>
> ## Linux v5.14-rc1 (issue present)
>
> $ cat /proc/version
> Linux version 5.14.0-rc1-1 (linux@archlinux) (gcc (GCC) 11.1.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Sun, 16 Jan 2022 18:31:35 +0000
>
> Before running the game:
>
> - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 50.0 degC
> - GPU memory: 625.0 MHz
>
> While running the game:
>
> - GPU core: 501.0 MHz, 0.800 V, 100.0% load, 54.0 degC
> - GPU memory: 625.0 MHz
>
> After stopping the game:
>
> - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 49.0 degC
> - GPU memory: 625.0 MHz
>
> ## Linux v5.13 (works correctly, issue not present)
>
> $ cat /proc/version
> Linux version 5.13.0-1 (linux@archlinux) (gcc (GCC) 11.1.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Sun, 16 Jan 2022 02:39:18 +0000
>
> Before running the game:
>
> - GPU core: 214.0 MHz, 0.850 V, 0.0% load, 55.0 degC
> - GPU memory: 1500.0 MHz
>
> While running the game:
>
> - GPU core: 1295.0 MHz, 1.000 V, 100.0% load, 67.0 degC
> - GPU memory: 1500.0 MHz
>
> After stopping the game:
>
> - GPU core: 214.0 MHz, 0.850 V, 0.0% load, 52.0 degC
> - GPU memory: 1500.0 MHz
>
>
---
Additional information about regzbot:

If you want to know more about regzbot, check out its web interface, the
getting started guide, and/or the reference documentation:

https://linux-regtracking.leemhuis.info/regzbot/
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md

The last two documents will explain how you can interact with regzbot
yourself if you want to.

Hint for reporters: when reporting a regression, it's in your interest
to tell #regzbot about it in the report, as that will ensure the
regression gets on the radar of regzbot and the regression tracker, who
will make sure the report won't fall through the cracks unnoticed.

Hint for developers: you normally don't need to care about regzbot once
it's involved. Fix the issue as you normally would, just remember to
include a 'Link:' tag to the report in the commit message, as explained
in Documentation/process/submitting-patches.rst.
That aspect was recently made more explicit in commit 1f57bd42b77c:
https://git.kernel.org/linus/1f57bd42b77c


2022-01-19 15:52:14

by James Turner

Subject: Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM

I finished about half of the bisection process today. The log so far is
below. I'll follow up again once I've narrowed it down to a single
commit.

git bisect start
# bad: [e73f0f0ee7541171d89f2e2491130c7771ba58d3] Linux 5.14-rc1
git bisect bad e73f0f0ee7541171d89f2e2491130c7771ba58d3
# good: [62fb9874f5da54fdb243003b386128037319b219] Linux 5.13
git bisect good 62fb9874f5da54fdb243003b386128037319b219
# bad: [e058a84bfddc42ba356a2316f2cf1141974625c9] Merge tag 'drm-next-2021-07-01' of git://anongit.freedesktop.org/drm/drm
git bisect bad e058a84bfddc42ba356a2316f2cf1141974625c9
# good: [a6eaf3850cb171c328a8b0db6d3c79286a1eba9d] Merge tag 'sched-urgent-2021-06-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good a6eaf3850cb171c328a8b0db6d3c79286a1eba9d
# good: [007b312c6f294770de01fbc0643610145012d244] Merge tag 'mac80211-next-for-net-next-2021-06-25' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
git bisect good 007b312c6f294770de01fbc0643610145012d244
# bad: [18703923a66aecf6f7ded0e16d22eb412ddae72f] drm/amdgpu: Fix incorrect register offsets for Sienna Cichlid
git bisect bad 18703923a66aecf6f7ded0e16d22eb412ddae72f
# good: [c99c4d0ca57c978dcc2a2f41ab8449684ea154cc] Merge tag 'amd-drm-next-5.14-2021-05-19' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
git bisect good c99c4d0ca57c978dcc2a2f41ab8449684ea154cc
# good: [43ed3c6c786d996a264fcde68dbb36df6f03b965] Merge tag 'drm-misc-next-2021-06-01' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
git bisect good 43ed3c6c786d996a264fcde68dbb36df6f03b965
# bad: [050cd3d616d96c3a04f4877842a391c0a4fdcc7a] drm/amd/display: Add support for SURFACE_PIXEL_FORMAT_GRPH_ABGR16161616.
git bisect bad 050cd3d616d96c3a04f4877842a391c0a4fdcc7a
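
For what it's worth, the effort stays bounded: each good/bad verdict
halves the remaining range, so a bisection over N candidate commits
needs about log2(N) build/test cycles. A rough sketch (the commit count
is an assumed ballpark for v5.13..v5.14-rc1, not taken from the report):

```shell
# Each `git bisect good/bad` verdict halves the candidate range,
# so roughly log2(N) + 1 test cycles suffice for N commits.
commits=14500   # assumed rough commit count for v5.13..v5.14-rc1
steps=$(awk -v n="$commits" 'BEGIN { print int(log(n)/log(2)) + 1 }')
echo "$steps"   # prints 14
```

That is consistent with the number of good/bad verdicts in the
completed log in the follow-up mail.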

James

2022-01-22 00:29:54

by James Turner

Subject: Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM

Hi all,

I finished the bisection (log below). The issue was introduced in
f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)").

Would any additional information be helpful?

git bisect start
# bad: [e73f0f0ee7541171d89f2e2491130c7771ba58d3] Linux 5.14-rc1
git bisect bad e73f0f0ee7541171d89f2e2491130c7771ba58d3
# good: [62fb9874f5da54fdb243003b386128037319b219] Linux 5.13
git bisect good 62fb9874f5da54fdb243003b386128037319b219
# bad: [e058a84bfddc42ba356a2316f2cf1141974625c9] Merge tag 'drm-next-2021-07-01' of git://anongit.freedesktop.org/drm/drm
git bisect bad e058a84bfddc42ba356a2316f2cf1141974625c9
# good: [a6eaf3850cb171c328a8b0db6d3c79286a1eba9d] Merge tag 'sched-urgent-2021-06-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good a6eaf3850cb171c328a8b0db6d3c79286a1eba9d
# good: [007b312c6f294770de01fbc0643610145012d244] Merge tag 'mac80211-next-for-net-next-2021-06-25' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
git bisect good 007b312c6f294770de01fbc0643610145012d244
# bad: [18703923a66aecf6f7ded0e16d22eb412ddae72f] drm/amdgpu: Fix incorrect register offsets for Sienna Cichlid
git bisect bad 18703923a66aecf6f7ded0e16d22eb412ddae72f
# good: [c99c4d0ca57c978dcc2a2f41ab8449684ea154cc] Merge tag 'amd-drm-next-5.14-2021-05-19' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
git bisect good c99c4d0ca57c978dcc2a2f41ab8449684ea154cc
# good: [43ed3c6c786d996a264fcde68dbb36df6f03b965] Merge tag 'drm-misc-next-2021-06-01' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
git bisect good 43ed3c6c786d996a264fcde68dbb36df6f03b965
# bad: [050cd3d616d96c3a04f4877842a391c0a4fdcc7a] drm/amd/display: Add support for SURFACE_PIXEL_FORMAT_GRPH_ABGR16161616.
git bisect bad 050cd3d616d96c3a04f4877842a391c0a4fdcc7a
# good: [f43ae2d1806c2b8a0934cb4acddd3cf3750d10f8] drm/amdgpu: Fix inconsistent indenting
git bisect good f43ae2d1806c2b8a0934cb4acddd3cf3750d10f8
# good: [6566cae7aef30da8833f1fa0eb854baf33b96676] drm/amd/display: fix odm scaling
git bisect good 6566cae7aef30da8833f1fa0eb854baf33b96676
# good: [5ac1dd89df549648b67f4d5e3a01b2d653914c55] drm/amd/display/dc/dce/dmub_outbox: Convert over to kernel-doc
git bisect good 5ac1dd89df549648b67f4d5e3a01b2d653914c55
# good: [a76eb7d30f700e5bdecc72d88d2226d137b11f74] drm/amd/display/dc/dce110/dce110_hw_sequencer: Include header containing our prototypes
git bisect good a76eb7d30f700e5bdecc72d88d2226d137b11f74
# good: [dd1d82c04e111b5a864638ede8965db2fe6d8653] drm/amdgpu/swsmu/aldebaran: fix check in is_dpm_running
git bisect good dd1d82c04e111b5a864638ede8965db2fe6d8653
# bad: [f9b7f3703ff97768a8dfabd42bdb107681f1da22] drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)
git bisect bad f9b7f3703ff97768a8dfabd42bdb107681f1da22
# good: [f1688bd69ec4b07eda1657ff953daebce7cfabf6] drm/amd/amdgpu:save psp ring wptr to avoid attack
git bisect good f1688bd69ec4b07eda1657ff953daebce7cfabf6
# first bad commit: [f9b7f3703ff97768a8dfabd42bdb107681f1da22] drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)

James

Subject: Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM

Hi, this is your Linux kernel regression tracker speaking.

On 21.01.22 03:13, James Turner wrote:
>
> I finished the bisection (log below). The issue was introduced in
> f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)").

FWIW, that was:

> drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)
> They are global ACPI methods, so make the structures
> global in the driver. This simplifies a number of things
> in the handling of these methods.
>
> v2: reset the handle if verify interface fails (Lijo)
> v3: fix compilation when ACPI is not defined.
>
> Reviewed-by: Lijo Lazar <[email protected]>
> Signed-off-by: Alex Deucher <[email protected]>

In that case we need to get those two and the maintainers for the driver
involved by addressing them with this mail. And to make it easy for them
here is a link and a quote from the original report:

https://lore.kernel.org/all/[email protected]/

```
> Hi,
>
> With newer kernels, starting with the v5.14 series, when using a MS
> Windows 10 guest VM with PCI passthrough of an AMD Radeon Pro WX 3200
> discrete GPU, the passed-through GPU will not run above 501 MHz, even
> when it is under 100% load and well below the temperature limit. As a
> result, GPU-intensive software (such as video games) runs unusably
> slowly in the VM.
>
> In contrast, with older kernels, the passed-through GPU runs at up to
> 1295 MHz (the correct hardware limit), so GPU-intensive software runs at
> a reasonable speed in the VM.
>
> I've confirmed that the issue exists with the following kernel versions:
>
> - v5.16
> - v5.14
> - v5.14-rc1
>
> The issue does not exist with the following kernels:
>
> - v5.13
> - various packaged (non-vanilla) 5.10.* Arch Linux `linux-lts` kernels
>
> So, the issue was introduced between v5.13 and v5.14-rc1. I'm willing to
> bisect the commit history to narrow it down further, if that would be
> helpful.
>
> The configuration details and test results are provided below. In
> summary, for the kernels with this issue, the GPU core stays at a
> constant 0.8 V, the GPU core clock ranges from 214 MHz to 501 MHz, and
> the GPU memory stays at a constant 625 MHz, in the VM. For the correctly
> working kernels, the GPU core ranges from 0.85 V to 1.0 V, the GPU core
> clock ranges from 214 MHz to 1295 MHz, and the GPU memory stays at 1500
> MHz, in the VM.
>
> Please let me know if additional information would be helpful.
>
> Regards,
> James Turner
>
> # Configuration Details
>
> Hardware:
>
> - Dell Precision 7540 laptop
> - CPU: Intel Core i7-9750H (x86-64)
> - Discrete GPU: AMD Radeon Pro WX 3200
> - The internal display is connected to the integrated GPU, and external
> displays are connected to the discrete GPU.
>
> Software:
>
> - KVM host: Arch Linux
> - self-built vanilla kernel (built using Arch Linux `PKGBUILD`
> modified to use vanilla kernel sources from git.kernel.org)
> - libvirt 1:7.10.0-2
> - qemu 6.2.0-2
>
> - KVM guest: Windows 10
> - GPU driver: Radeon Pro Software Version 21.Q3 (Note that I also
> experienced this issue with the 20.Q4 driver, using packaged
> (non-vanilla) Arch Linux kernels on the host, before updating to the
> 21.Q3 driver.)
>
> Kernel config:
>
> - For v5.13, v5.14-rc1, and v5.14, I used
> https://github.com/archlinux/svntogit-packages/blob/89c24952adbfa645d9e1a6f12c572929f7e4e3c7/trunk/config
> (The build script ran `make olddefconfig` on that config file.)
>
> - For v5.16, I used
> https://github.com/archlinux/svntogit-packages/blob/94f84e1ad8a530e54aa34cadbaa76e8dcc439d10/trunk/config
> (The build script ran `make olddefconfig` on that config file.)
>
> I set up the VM with PCI passthrough according to the instructions at
> https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF
>
> I'm passing through the following PCI devices to the VM, as listed by
> `lspci -D -nn`:
>
> 0000:01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Lexa XT [Radeon PRO WX 3200] [1002:6981]
> 0000:01:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X] [1002:aae0]
>
> The host kernel command line includes the following relevant options:
>
> intel_iommu=on vfio-pci.ids=1002:6981,1002:aae0
>
> to enable IOMMU and bind the `vfio-pci` driver to the PCI devices.
>
> My `/etc/mkinitcpio.conf` includes the following line:
>
> MODULES=(vfio_pci vfio vfio_iommu_type1 vfio_virqfd i915 amdgpu)
>
> to load `vfio-pci` before the graphics drivers. (Note that removing
> `i915 amdgpu` has no effect on this issue.)
>
> I'm using libvirt to manage the VM. The relevant portions of the XML
> file are:
>
> <hostdev mode="subsystem" type="pci" managed="yes">
> <source>
> <address domain="0x0000" bus="0x01" slot="0x00" function="0x0"/>
> </source>
> <address type="pci" domain="0x0000" bus="0x06" slot="0x00" function="0x0"/>
> </hostdev>
> <hostdev mode="subsystem" type="pci" managed="yes">
> <source>
> <address domain="0x0000" bus="0x01" slot="0x00" function="0x1"/>
> </source>
> <address type="pci" domain="0x0000" bus="0x07" slot="0x00" function="0x0"/>
> </hostdev>
>
> # Test Results
>
> For testing, I used the following procedure:
>
> 1. Boot the host machine and log in.
>
> 2. Run the following commands to gather information. For all the tests,
> the output was identical.
>
> - `cat /proc/sys/kernel/tainted` printed:
>
> 0
>
> - `hostnamectl | grep "Operating System"` printed:
>
> Operating System: Arch Linux
>
> - `lspci -nnk -d 1002:6981` printed
>
> 01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Lexa XT [Radeon PRO WX 3200] [1002:6981]
> Subsystem: Dell Device [1028:0926]
> Kernel driver in use: vfio-pci
> Kernel modules: amdgpu
>
> - `lspci -nnk -d 1002:aae0` printed
>
> 01:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X] [1002:aae0]
> Subsystem: Dell Device [1028:0926]
> Kernel driver in use: vfio-pci
> Kernel modules: snd_hda_intel
>
> - `sudo dmesg | grep -i vfio` printed the kernel command line and the
> following messages:
>
> VFIO - User Level meta-driver version: 0.3
> vfio-pci 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
> vfio_pci: add [1002:6981[ffffffff:ffffffff]] class 0x000000/00000000
> vfio_pci: add [1002:aae0[ffffffff:ffffffff]] class 0x000000/00000000
> vfio-pci 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
>
> 3. Start the Windows VM using libvirt and log in. Record sensor
> information.
>
> 4. Run a graphically-intensive video game to put the GPU under load.
> Record sensor information.
>
> 5. Stop the game. Record sensor information.
>
> 6. Shut down the VM. Save the output of `sudo dmesg`.
>
> I compared the `sudo dmesg` output for v5.13 and v5.14-rc1 and didn't
> see any relevant differences.
>
> Note that the issue occurs only within the guest VM. When I'm not using
> a VM (after removing `vfio-pci.ids=1002:6981,1002:aae0` from the kernel
> command line so that the PCI devices are bound to their normal `amdgpu`
> and `snd_hda_intel` drivers instead of the `vfio-pci` driver), the GPU
> operates correctly on the host.
>
> ## Linux v5.16 (issue present)
>
> $ cat /proc/version
> Linux version 5.16.0-1 (linux@archlinux) (gcc (GCC) 11.1.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Sun, 16 Jan 2022 01:51:08 +0000
>
> Before running the game:
>
> - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 53.0 degC
> - GPU memory: 625.0 MHz
>
> While running the game:
>
> - GPU core: 501.0 MHz, 0.800 V, 100.0% load, 54.0 degC
> - GPU memory: 625.0 MHz
>
> After stopping the game:
>
> - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 51.0 degC
> - GPU memory: 625.0 MHz
>
> ## Linux v5.14 (issue present)
>
> $ cat /proc/version
> Linux version 5.14.0-1 (linux@archlinux) (gcc (GCC) 11.1.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Sun, 16 Jan 2022 03:19:35 +0000
>
> Before running the game:
>
> - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 50.0 degC
> - GPU memory: 625.0 MHz
>
> While running the game:
>
> - GPU core: 501.0 MHz, 0.800 V, 100.0% load, 54.0 degC
> - GPU memory: 625.0 MHz
>
> After stopping the game:
>
> - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 49.0 degC
> - GPU memory: 625.0 MHz
>
> ## Linux v5.14-rc1 (issue present)
>
> $ cat /proc/version
> Linux version 5.14.0-rc1-1 (linux@archlinux) (gcc (GCC) 11.1.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Sun, 16 Jan 2022 18:31:35 +0000
>
> Before running the game:
>
> - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 50.0 degC
> - GPU memory: 625.0 MHz
>
> While running the game:
>
> - GPU core: 501.0 MHz, 0.800 V, 100.0% load, 54.0 degC
> - GPU memory: 625.0 MHz
>
> After stopping the game:
>
> - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 49.0 degC
> - GPU memory: 625.0 MHz
>
> ## Linux v5.13 (works correctly, issue not present)
>
> $ cat /proc/version
> Linux version 5.13.0-1 (linux@archlinux) (gcc (GCC) 11.1.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Sun, 16 Jan 2022 02:39:18 +0000
>
> Before running the game:
>
> - GPU core: 214.0 MHz, 0.850 V, 0.0% load, 55.0 degC
> - GPU memory: 1500.0 MHz
>
> While running the game:
>
> - GPU core: 1295.0 MHz, 1.000 V, 100.0% load, 67.0 degC
> - GPU memory: 1500.0 MHz
>
> After stopping the game:
>
> - GPU core: 214.0 MHz, 0.850 V, 0.0% load, 52.0 degC
> - GPU memory: 1500.0 MHz

```

Ciao, Thorsten (wearing his 'Linux kernel regression tracker' hat)


#regzbot introduced f9b7f3703ff9
#regzbot title drm: amdgpu: Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM


> Would any additional information be helpful?
>
> git bisect start
> # bad: [e73f0f0ee7541171d89f2e2491130c7771ba58d3] Linux 5.14-rc1
> git bisect bad e73f0f0ee7541171d89f2e2491130c7771ba58d3
> # good: [62fb9874f5da54fdb243003b386128037319b219] Linux 5.13
> git bisect good 62fb9874f5da54fdb243003b386128037319b219
> # bad: [e058a84bfddc42ba356a2316f2cf1141974625c9] Merge tag 'drm-next-2021-07-01' of git://anongit.freedesktop.org/drm/drm
> git bisect bad e058a84bfddc42ba356a2316f2cf1141974625c9
> # good: [a6eaf3850cb171c328a8b0db6d3c79286a1eba9d] Merge tag 'sched-urgent-2021-06-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> git bisect good a6eaf3850cb171c328a8b0db6d3c79286a1eba9d
> # good: [007b312c6f294770de01fbc0643610145012d244] Merge tag 'mac80211-next-for-net-next-2021-06-25' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
> git bisect good 007b312c6f294770de01fbc0643610145012d244
> # bad: [18703923a66aecf6f7ded0e16d22eb412ddae72f] drm/amdgpu: Fix incorrect register offsets for Sienna Cichlid
> git bisect bad 18703923a66aecf6f7ded0e16d22eb412ddae72f
> # good: [c99c4d0ca57c978dcc2a2f41ab8449684ea154cc] Merge tag 'amd-drm-next-5.14-2021-05-19' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
> git bisect good c99c4d0ca57c978dcc2a2f41ab8449684ea154cc
> # good: [43ed3c6c786d996a264fcde68dbb36df6f03b965] Merge tag 'drm-misc-next-2021-06-01' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
> git bisect good 43ed3c6c786d996a264fcde68dbb36df6f03b965
> # bad: [050cd3d616d96c3a04f4877842a391c0a4fdcc7a] drm/amd/display: Add support for SURFACE_PIXEL_FORMAT_GRPH_ABGR16161616.
> git bisect bad 050cd3d616d96c3a04f4877842a391c0a4fdcc7a
> # good: [f43ae2d1806c2b8a0934cb4acddd3cf3750d10f8] drm/amdgpu: Fix inconsistent indenting
> git bisect good f43ae2d1806c2b8a0934cb4acddd3cf3750d10f8
> # good: [6566cae7aef30da8833f1fa0eb854baf33b96676] drm/amd/display: fix odm scaling
> git bisect good 6566cae7aef30da8833f1fa0eb854baf33b96676
> # good: [5ac1dd89df549648b67f4d5e3a01b2d653914c55] drm/amd/display/dc/dce/dmub_outbox: Convert over to kernel-doc
> git bisect good 5ac1dd89df549648b67f4d5e3a01b2d653914c55
> # good: [a76eb7d30f700e5bdecc72d88d2226d137b11f74] drm/amd/display/dc/dce110/dce110_hw_sequencer: Include header containing our prototypes
> git bisect good a76eb7d30f700e5bdecc72d88d2226d137b11f74
> # good: [dd1d82c04e111b5a864638ede8965db2fe6d8653] drm/amdgpu/swsmu/aldebaran: fix check in is_dpm_running
> git bisect good dd1d82c04e111b5a864638ede8965db2fe6d8653
> # bad: [f9b7f3703ff97768a8dfabd42bdb107681f1da22] drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)
> git bisect bad f9b7f3703ff97768a8dfabd42bdb107681f1da22
> # good: [f1688bd69ec4b07eda1657ff953daebce7cfabf6] drm/amd/amdgpu:save psp ring wptr to avoid attack
> git bisect good f1688bd69ec4b07eda1657ff953daebce7cfabf6
> # first bad commit: [f9b7f3703ff97768a8dfabd42bdb107681f1da22] drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)
>
> James
>

2022-01-22 01:58:39

by Alex Deucher

Subject: Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM

On Fri, Jan 21, 2022 at 3:35 AM Thorsten Leemhuis
<[email protected]> wrote:
>
> Hi, this is your Linux kernel regression tracker speaking.
>
> On 21.01.22 03:13, James Turner wrote:
> >
> > I finished the bisection (log below). The issue was introduced in
> > f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)").
>
> FWIW, that was:
>
> > drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)
> > They are global ACPI methods, so make the structures
> > global in the driver. This simplifies a number of things
> > in the handling of these methods.
> >
> > v2: reset the handle if verify interface fails (Lijo)
> > v3: fix compilation when ACPI is not defined.
> >
> > Reviewed-by: Lijo Lazar <[email protected]>
> > Signed-off-by: Alex Deucher <[email protected]>
>
> In that case we need to get those two and the maintainers for the driver
> involved by addressing them with this mail. And to make it easy for them
> here is a link and a quote from the original report:
>
> https://lore.kernel.org/all/[email protected]/

Are you ever loading the amdgpu driver in your tests? If not, I don't
see how this patch would affect anything, as the driver code would
never have executed. Based on your example, it would appear not.
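
From the host side, the quoted `lspci -nnk` output already answers
this: the "Kernel driver in use" line shows vfio-pci, not amdgpu. A
minimal sketch of extracting that field; the `sample` variable here is
stand-in data copied from the report rather than a live query:

```shell
# Sketch: pull the bound driver out of `lspci -nnk` output.
# In practice you would pipe in:  lspci -nnk -d 1002:6981
# `sample` is stand-in data copied from the report above.
sample='01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Lexa XT [Radeon PRO WX 3200] [1002:6981]
	Subsystem: Dell Device [1028:0926]
	Kernel driver in use: vfio-pci
	Kernel modules: amdgpu'

# Extract the value of the "Kernel driver in use" field.
driver=$(printf '%s\n' "$sample" | awk -F': ' '/Kernel driver in use/ { print $2 }')
echo "$driver"   # vfio-pci, i.e. amdgpu never bound on the host
```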

Alex

>
> ```
> > Hi,
> >
> > With newer kernels, starting with the v5.14 series, when using a MS
> > Windows 10 guest VM with PCI passthrough of an AMD Radeon Pro WX 3200
> > discrete GPU, the passed-through GPU will not run above 501 MHz, even
> > when it is under 100% load and well below the temperature limit. As a
> > result, GPU-intensive software (such as video games) runs unusably
> > slowly in the VM.
> >
> > In contrast, with older kernels, the passed-through GPU runs at up to
> > 1295 MHz (the correct hardware limit), so GPU-intensive software runs at
> > a reasonable speed in the VM.
> >
> > I've confirmed that the issue exists with the following kernel versions:
> >
> > - v5.16
> > - v5.14
> > - v5.14-rc1
> >
> > The issue does not exist with the following kernels:
> >
> > - v5.13
> > - various packaged (non-vanilla) 5.10.* Arch Linux `linux-lts` kernels
> >
> > So, the issue was introduced between v5.13 and v5.14-rc1. I'm willing to
> > bisect the commit history to narrow it down further, if that would be
> > helpful.
> >
> > The configuration details and test results are provided below. In
> > summary, for the kernels with this issue, the GPU core stays at a
> > constant 0.8 V, the GPU core clock ranges from 214 MHz to 501 MHz, and
> > the GPU memory stays at a constant 625 MHz, in the VM. For the correctly
> > working kernels, the GPU core ranges from 0.85 V to 1.0 V, the GPU core
> > clock ranges from 214 MHz to 1295 MHz, and the GPU memory stays at 1500
> > MHz, in the VM.
> >
> > Please let me know if additional information would be helpful.
> >
> > Regards,
> > James Turner
> >
> > # Configuration Details
> >
> > Hardware:
> >
> > - Dell Precision 7540 laptop
> > - CPU: Intel Core i7-9750H (x86-64)
> > - Discrete GPU: AMD Radeon Pro WX 3200
> > - The internal display is connected to the integrated GPU, and external
> > displays are connected to the discrete GPU.
> >
> > Software:
> >
> > - KVM host: Arch Linux
> > - self-built vanilla kernel (built using Arch Linux `PKGBUILD`
> > modified to use vanilla kernel sources from git.kernel.org)
> > - libvirt 1:7.10.0-2
> > - qemu 6.2.0-2
> >
> > - KVM guest: Windows 10
> > - GPU driver: Radeon Pro Software Version 21.Q3 (Note that I also
> > experienced this issue with the 20.Q4 driver, using packaged
> > (non-vanilla) Arch Linux kernels on the host, before updating to the
> > 21.Q3 driver.)
> >
> > Kernel config:
> >
> > - For v5.13, v5.14-rc1, and v5.14, I used
> > https://github.com/archlinux/svntogit-packages/blob/89c24952adbfa645d9e1a6f12c572929f7e4e3c7/trunk/config
> > (The build script ran `make olddefconfig` on that config file.)
> >
> > - For v5.16, I used
> > https://github.com/archlinux/svntogit-packages/blob/94f84e1ad8a530e54aa34cadbaa76e8dcc439d10/trunk/config
> > (The build script ran `make olddefconfig` on that config file.)
> >
> > I set up the VM with PCI passthrough according to the instructions at
> > https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF
> >
> > I'm passing through the following PCI devices to the VM, as listed by
> > `lspci -D -nn`:
> >
> > 0000:01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Lexa XT [Radeon PRO WX 3200] [1002:6981]
> > 0000:01:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X] [1002:aae0]
> >
> > The host kernel command line includes the following relevant options:
> >
> > intel_iommu=on vfio-pci.ids=1002:6981,1002:aae0
> >
> > to enable IOMMU and bind the `vfio-pci` driver to the PCI devices.
> >
> > My `/etc/mkinitcpio.conf` includes the following line:
> >
> > MODULES=(vfio_pci vfio vfio_iommu_type1 vfio_virqfd i915 amdgpu)
> >
> > to load `vfio-pci` before the graphics drivers. (Note that removing
> > `i915 amdgpu` has no effect on this issue.)
> >
> > I'm using libvirt to manage the VM. The relevant portions of the XML
> > file are:
> >
> > <hostdev mode="subsystem" type="pci" managed="yes">
> > <source>
> > <address domain="0x0000" bus="0x01" slot="0x00" function="0x0"/>
> > </source>
> > <address type="pci" domain="0x0000" bus="0x06" slot="0x00" function="0x0"/>
> > </hostdev>
> > <hostdev mode="subsystem" type="pci" managed="yes">
> > <source>
> > <address domain="0x0000" bus="0x01" slot="0x00" function="0x1"/>
> > </source>
> > <address type="pci" domain="0x0000" bus="0x07" slot="0x00" function="0x0"/>
> > </hostdev>
> >
> > # Test Results
> >
> > For testing, I used the following procedure:
> >
> > 1. Boot the host machine and log in.
> >
> > 2. Run the following commands to gather information. For all the tests,
> > the output was identical.
> >
> > - `cat /proc/sys/kernel/tainted` printed:
> >
> > 0
> >
> > - `hostnamectl | grep "Operating System"` printed:
> >
> > Operating System: Arch Linux
> >
> > - `lspci -nnk -d 1002:6981` printed
> >
> > 01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Lexa XT [Radeon PRO WX 3200] [1002:6981]
> > Subsystem: Dell Device [1028:0926]
> > Kernel driver in use: vfio-pci
> > Kernel modules: amdgpu
> >
> > - `lspci -nnk -d 1002:aae0` printed
> >
> > 01:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X] [1002:aae0]
> > Subsystem: Dell Device [1028:0926]
> > Kernel driver in use: vfio-pci
> > Kernel modules: snd_hda_intel
> >
> > - `sudo dmesg | grep -i vfio` printed the kernel command line and the
> > following messages:
> >
> > VFIO - User Level meta-driver version: 0.3
> > vfio-pci 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
> > vfio_pci: add [1002:6981[ffffffff:ffffffff]] class 0x000000/00000000
> > vfio_pci: add [1002:aae0[ffffffff:ffffffff]] class 0x000000/00000000
> > vfio-pci 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
> >
> > 3. Start the Windows VM using libvirt and log in. Record sensor
> > information.
> >
> > 4. Run a graphically-intensive video game to put the GPU under load.
> > Record sensor information.
> >
> > 5. Stop the game. Record sensor information.
> >
> > 6. Shut down the VM. Save the output of `sudo dmesg`.
> >
> > I compared the `sudo dmesg` output for v5.13 and v5.14-rc1 and didn't
> > see any relevant differences.
> >
> > Note that the issue occurs only within the guest VM. When I'm not using
> > a VM (after removing `vfio-pci.ids=1002:6981,1002:aae0` from the kernel
> > command line so that the PCI devices are bound to their normal `amdgpu`
> > and `snd_hda_intel` drivers instead of the `vfio-pci` driver), the GPU
> > operates correctly on the host.
> >
> > ## Linux v5.16 (issue present)
> >
> > $ cat /proc/version
> > Linux version 5.16.0-1 (linux@archlinux) (gcc (GCC) 11.1.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Sun, 16 Jan 2022 01:51:08 +0000
> >
> > Before running the game:
> >
> > - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 53.0 degC
> > - GPU memory: 625.0 MHz
> >
> > While running the game:
> >
> > - GPU core: 501.0 MHz, 0.800 V, 100.0% load, 54.0 degC
> > - GPU memory: 625.0 MHz
> >
> > After stopping the game:
> >
> > - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 51.0 degC
> > - GPU memory: 625.0 MHz
> >
> > ## Linux v5.14 (issue present)
> >
> > $ cat /proc/version
> > Linux version 5.14.0-1 (linux@archlinux) (gcc (GCC) 11.1.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Sun, 16 Jan 2022 03:19:35 +0000
> >
> > Before running the game:
> >
> > - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 50.0 degC
> > - GPU memory: 625.0 MHz
> >
> > While running the game:
> >
> > - GPU core: 501.0 MHz, 0.800 V, 100.0% load, 54.0 degC
> > - GPU memory: 625.0 MHz
> >
> > After stopping the game:
> >
> > - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 49.0 degC
> > - GPU memory: 625.0 MHz
> >
> > ## Linux v5.14-rc1 (issue present)
> >
> > $ cat /proc/version
> > Linux version 5.14.0-rc1-1 (linux@archlinux) (gcc (GCC) 11.1.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Sun, 16 Jan 2022 18:31:35 +0000
> >
> > Before running the game:
> >
> > - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 50.0 degC
> > - GPU memory: 625.0 MHz
> >
> > While running the game:
> >
> > - GPU core: 501.0 MHz, 0.800 V, 100.0% load, 54.0 degC
> > - GPU memory: 625.0 MHz
> >
> > After stopping the game:
> >
> > - GPU core: 214.0 MHz, 0.800 V, 0.0% load, 49.0 degC
> > - GPU memory: 625.0 MHz
> >
> > ## Linux v5.13 (works correctly, issue not present)
> >
> > $ cat /proc/version
> > Linux version 5.13.0-1 (linux@archlinux) (gcc (GCC) 11.1.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Sun, 16 Jan 2022 02:39:18 +0000
> >
> > Before running the game:
> >
> > - GPU core: 214.0 MHz, 0.850 V, 0.0% load, 55.0 degC
> > - GPU memory: 1500.0 MHz
> >
> > While running the game:
> >
> > - GPU core: 1295.0 MHz, 1.000 V, 100.0% load, 67.0 degC
> > - GPU memory: 1500.0 MHz
> >
> > After stopping the game:
> >
> > - GPU core: 214.0 MHz, 0.850 V, 0.0% load, 52.0 degC
> > - GPU memory: 1500.0 MHz
>
> ```
>
> Ciao, Thorsten (wearing his 'Linux kernel regression tracker' hat)
>
> P.S.: As a Linux kernel regression tracker I'm getting a lot of reports
> on my table. I can only look briefly into most of them. Unfortunately
> therefore I sometimes will get things wrong or miss something important.
> I hope that's not the case here; if you think it is, don't hesitate to
> tell me about it in a public reply, that's in everyone's interest.
>
> BTW, I have no personal interest in this issue, which is tracked using
> regzbot, my Linux kernel regression tracking bot
> (https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting
> this mail to get things rolling again and hence don't need to be CC on
> all further activities wrt to this regression.
>
> #regzbot introduced f9b7f3703ff9
> #regzbot title drm: amdgpu: Too-low frequency limit for AMD GPU
> PCI-passed-through to Windows VM
>
>
> > Would any additional information be helpful?
> >
> > git bisect start
> > # bad: [e73f0f0ee7541171d89f2e2491130c7771ba58d3] Linux 5.14-rc1
> > git bisect bad e73f0f0ee7541171d89f2e2491130c7771ba58d3
> > # good: [62fb9874f5da54fdb243003b386128037319b219] Linux 5.13
> > git bisect good 62fb9874f5da54fdb243003b386128037319b219
> > # bad: [e058a84bfddc42ba356a2316f2cf1141974625c9] Merge tag 'drm-next-2021-07-01' of git://anongit.freedesktop.org/drm/drm
> > git bisect bad e058a84bfddc42ba356a2316f2cf1141974625c9
> > # good: [a6eaf3850cb171c328a8b0db6d3c79286a1eba9d] Merge tag 'sched-urgent-2021-06-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> > git bisect good a6eaf3850cb171c328a8b0db6d3c79286a1eba9d
> > # good: [007b312c6f294770de01fbc0643610145012d244] Merge tag 'mac80211-next-for-net-next-2021-06-25' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
> > git bisect good 007b312c6f294770de01fbc0643610145012d244
> > # bad: [18703923a66aecf6f7ded0e16d22eb412ddae72f] drm/amdgpu: Fix incorrect register offsets for Sienna Cichlid
> > git bisect bad 18703923a66aecf6f7ded0e16d22eb412ddae72f
> > # good: [c99c4d0ca57c978dcc2a2f41ab8449684ea154cc] Merge tag 'amd-drm-next-5.14-2021-05-19' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
> > git bisect good c99c4d0ca57c978dcc2a2f41ab8449684ea154cc
> > # good: [43ed3c6c786d996a264fcde68dbb36df6f03b965] Merge tag 'drm-misc-next-2021-06-01' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
> > git bisect good 43ed3c6c786d996a264fcde68dbb36df6f03b965
> > # bad: [050cd3d616d96c3a04f4877842a391c0a4fdcc7a] drm/amd/display: Add support for SURFACE_PIXEL_FORMAT_GRPH_ABGR16161616.
> > git bisect bad 050cd3d616d96c3a04f4877842a391c0a4fdcc7a
> > # good: [f43ae2d1806c2b8a0934cb4acddd3cf3750d10f8] drm/amdgpu: Fix inconsistent indenting
> > git bisect good f43ae2d1806c2b8a0934cb4acddd3cf3750d10f8
> > # good: [6566cae7aef30da8833f1fa0eb854baf33b96676] drm/amd/display: fix odm scaling
> > git bisect good 6566cae7aef30da8833f1fa0eb854baf33b96676
> > # good: [5ac1dd89df549648b67f4d5e3a01b2d653914c55] drm/amd/display/dc/dce/dmub_outbox: Convert over to kernel-doc
> > git bisect good 5ac1dd89df549648b67f4d5e3a01b2d653914c55
> > # good: [a76eb7d30f700e5bdecc72d88d2226d137b11f74] drm/amd/display/dc/dce110/dce110_hw_sequencer: Include header containing our prototypes
> > git bisect good a76eb7d30f700e5bdecc72d88d2226d137b11f74
> > # good: [dd1d82c04e111b5a864638ede8965db2fe6d8653] drm/amdgpu/swsmu/aldebaran: fix check in is_dpm_running
> > git bisect good dd1d82c04e111b5a864638ede8965db2fe6d8653
> > # bad: [f9b7f3703ff97768a8dfabd42bdb107681f1da22] drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)
> > git bisect bad f9b7f3703ff97768a8dfabd42bdb107681f1da22
> > # good: [f1688bd69ec4b07eda1657ff953daebce7cfabf6] drm/amd/amdgpu:save psp ring wptr to avoid attack
> > git bisect good f1688bd69ec4b07eda1657ff953daebce7cfabf6
> > # first bad commit: [f9b7f3703ff97768a8dfabd42bdb107681f1da22] drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)
> >
> > James
> >

2022-01-23 00:13:37

by James Turner

Subject: Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM

> Are you ever loading the amdgpu driver in your tests?

Yes, although I'm binding the `vfio-pci` driver to the AMD GPU's PCI
devices via the kernel command line. (See my initial email.) My
understanding is that `vfio-pci` is supposed to keep other drivers, such
as `amdgpu`, from interacting with the GPU, although that's clearly not
what's happening.

I've been testing with `amdgpu` included in the `MODULES` list in
`/etc/mkinitcpio.conf` (which Arch Linux uses to generate the
initramfs). However, I ran some more tests today (results below), this
time without `i915` or `amdgpu` in the `MODULES` list. The `amdgpu`
kernel module still gets loaded. (I think udev loads it automatically?)

Your comment gave me the idea to blacklist the `amdgpu` kernel module.
That does serve as a workaround on my machine – it fixes the behavior
for f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)")
and for the current Arch Linux prebuilt kernel (5.16.2-arch1-1). That's
an acceptable workaround for my machine only because the separate GPU
used by the host is an Intel integrated GPU. That workaround wouldn't
work well for someone with two AMD GPUs.


# New test results

The following tests are set up the same way as in my initial email,
with the following exceptions:

- I've updated libvirt to 1:8.0.0-1.

- I've removed `i915` and `amdgpu` from the `MODULES` list in
`/etc/mkinitcpio.conf`.

For all three of these tests, `lspci` said the following:

% lspci -nnk -d 1002:6981
01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Lexa XT [Radeon PRO WX 3200] [1002:6981]
Subsystem: Dell Device [1028:0926]
Kernel driver in use: vfio-pci
Kernel modules: amdgpu

% lspci -nnk -d 1002:aae0
01:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X] [1002:aae0]
Subsystem: Dell Device [1028:0926]
Kernel driver in use: vfio-pci
Kernel modules: snd_hda_intel


## Version f1688bd69ec4 ("drm/amd/amdgpu:save psp ring wptr to avoid attack")

This is the commit immediately preceding the one which introduced the issue.

% sudo dmesg | grep -i amdgpu
[ 15.840160] [drm] amdgpu kernel modesetting enabled.
[ 15.840884] amdgpu: CRAT table not found
[ 15.840885] amdgpu: Virtual CRAT table created for CPU
[ 15.840893] amdgpu: Topology: Add CPU node

% lsmod | grep amdgpu
amdgpu 7450624 0
gpu_sched 49152 1 amdgpu
drm_ttm_helper 16384 1 amdgpu
ttm 77824 2 amdgpu,drm_ttm_helper
i2c_algo_bit 16384 2 amdgpu,i915
drm_kms_helper 303104 2 amdgpu,i915
drm 581632 11 gpu_sched,drm_kms_helper,amdgpu,drm_ttm_helper,i915,ttm

The passed-through GPU worked properly in the VM.


## Version f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)")

This is the commit which introduced the issue.

% sudo dmesg | grep -i amdgpu
[ 15.319023] [drm] amdgpu kernel modesetting enabled.
[ 15.329468] amdgpu: CRAT table not found
[ 15.329470] amdgpu: Virtual CRAT table created for CPU
[ 15.329482] amdgpu: Topology: Add CPU node

% lsmod | grep amdgpu
amdgpu 7450624 0
gpu_sched 49152 1 amdgpu
drm_ttm_helper 16384 1 amdgpu
ttm 77824 2 amdgpu,drm_ttm_helper
i2c_algo_bit 16384 2 amdgpu,i915
drm_kms_helper 303104 2 amdgpu,i915
drm 581632 11 gpu_sched,drm_kms_helper,amdgpu,drm_ttm_helper,i915,ttm

The passed-through GPU did not run above 501 MHz in the VM.


## Blacklisted `amdgpu`, version f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)")

For this test, I added `module_blacklist=amdgpu` to the kernel command
line to blacklist the `amdgpu` module.

% sudo dmesg | grep -i amdgpu
[ 14.591576] Module amdgpu is blacklisted

% lsmod | grep amdgpu

The passed-through GPU worked properly in the VM.


James

2022-01-23 13:54:40

by Lazar, Lijo

Subject: RE: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM


Hi James,

Could you provide the pp_dpm_* values in sysfs with and without the patch? Also, could you try forcing PCIE to gen3 (through pp_dpm_pcie) if it's not in gen3 when the issue happens?

For details on pp_dpm_*, please check https://dri.freedesktop.org/docs/drm/gpu/amdgpu.html

Thanks,
Lijo
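Lijo's suggestion can be scripted. The sketch below is illustrative
only: the helper names are made up, the PCI address is taken from this
report, and pinning a `pp_dpm_pcie` level requires first switching
`power_dpm_force_performance_level` to `manual`, per the amdgpu sysfs
documentation.

```shell
# Sketch, assuming the WX 3200 sits at 0000:01:00.0 as in this report;
# dump_pp_dpm and force_pcie_level are hypothetical helper names.
SYSFS_DEV="${SYSFS_DEV:-/sys/bus/pci/devices/0000:01:00.0}"

dump_pp_dpm() {
    # Print each pp_dpm_* table that exists and is readable for this device.
    local dev="$1" f
    for f in "$dev"/pp_dpm_sclk "$dev"/pp_dpm_mclk "$dev"/pp_dpm_pcie; do
        [ -r "$f" ] || continue
        echo "== $f =="
        cat "$f"
    done
    return 0
}

force_pcie_level() {
    # Pin a single pp_dpm_pcie level by index: the performance level must
    # be set to "manual" before the pp_dpm_* files accept a selection.
    local dev="$1" level="$2"
    echo manual > "$dev/power_dpm_force_performance_level"
    echo "$level" > "$dev/pp_dpm_pcie"
}

dump_pp_dpm "$SYSFS_DEV"
# To force gen3 (needs root): force_pcie_level "$SYSFS_DEV" 1
```

Note this only works while the device is bound to `amdgpu`; with
`vfio-pci` bound, the `pp_dpm_*` files are not created at all.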


2022-01-23 15:39:39

by James Turner

Subject: Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM

Hi Lijo,

> Could you provide the pp_dpm_* values in sysfs with and without the
> patch? Also, could you try forcing PCIE to gen3 (through pp_dpm_pcie)
> if it's not in gen3 when the issue happens?

AFAICT, I can't access those values while the AMD GPU PCI devices are
bound to `vfio-pci`. However, I can at least access the link speed and
width elsewhere in sysfs. So, I gathered what information I could for
two different cases:

- With the PCI devices bound to `vfio-pci`. With this configuration, I
can start the VM, but the `pp_dpm_*` values are not available since
the devices are bound to `vfio-pci` instead of `amdgpu`.

- Without the PCI devices bound to `vfio-pci` (i.e. after removing the
`vfio-pci.ids=...` kernel command line argument). With this
configuration, I can access the `pp_dpm_*` values, since the PCI
devices are bound to `amdgpu`. However, I cannot use the VM. If I try
to start the VM, the display (both the external monitors attached to
the AMD GPU and the built-in laptop display attached to the Intel
iGPU) completely freezes.

The output shown below was identical for both the good commit:
f1688bd69ec4 ("drm/amd/amdgpu:save psp ring wptr to avoid attack")
and the commit which introduced the issue:
f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)")

Note that the PCI link speed increased to 8.0 GT/s when the GPU was
under heavy load for both versions, but the clock speeds of the GPU were
different under load. (For the good commit, it was 1295 MHz; for the bad
commit, it was 501 MHz.)
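As an aside, switching between the two configurations does not strictly
require editing the kernel command line and rebooting: a PCI device can
be rebound between drivers at runtime through sysfs. A rough sketch,
assuming root and an idle device; `rebind_pci` is a hypothetical helper,
and the sysfs root is parameterized purely so the sketch is testable.

```shell
# Rebind a PCI device from one driver to another at runtime, e.g. from
# vfio-pci to amdgpu so the pp_dpm_* values can be inspected.
rebind_pci() {
    local addr="$1" from="$2" to="$3" root="${4:-/sys/bus/pci}"
    echo "$addr" > "$root/drivers/$from/unbind"
    # driver_override tells the PCI core which driver may claim the
    # device on the next bind, regardless of its id table.
    echo "$to" > "$root/devices/$addr/driver_override"
    echo "$addr" > "$root/drivers/$to/bind"
}

# Example (needs root; do this while the VM is shut down):
#   rebind_pci 0000:01:00.0 vfio-pci amdgpu
#   cat /sys/bus/pci/devices/0000:01:00.0/pp_dpm_sclk
#   rebind_pci 0000:01:00.0 amdgpu vfio-pci
```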


# With the PCI devices bound to `vfio-pci`

## Before starting the VM

% ls /sys/module/amdgpu/drivers/pci:amdgpu
module bind new_id remove_id uevent unbind

% find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
/sys/bus/pci/devices/0000:01:00.0/current_link_width
8
/sys/bus/pci/devices/0000:01:00.0/current_link_speed
8.0 GT/s PCIe

## While running the VM, before placing the AMD GPU under heavy load

% find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
/sys/bus/pci/devices/0000:01:00.0/current_link_width
8
/sys/bus/pci/devices/0000:01:00.0/current_link_speed
2.5 GT/s PCIe

## While running the VM, with the AMD GPU under heavy load

% find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
/sys/bus/pci/devices/0000:01:00.0/current_link_width
8
/sys/bus/pci/devices/0000:01:00.0/current_link_speed
8.0 GT/s PCIe

## While running the VM, after stopping the heavy load on the AMD GPU

% find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
/sys/bus/pci/devices/0000:01:00.0/current_link_width
8
/sys/bus/pci/devices/0000:01:00.0/current_link_speed
2.5 GT/s PCIe

## After stopping the VM

% find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
/sys/bus/pci/devices/0000:01:00.0/current_link_width
8
/sys/bus/pci/devices/0000:01:00.0/current_link_speed
2.5 GT/s PCIe


# Without the PCI devices bound to `vfio-pci`

% ls /sys/module/amdgpu/drivers/pci:amdgpu
0000:01:00.0 module bind new_id remove_id uevent unbind

% for f in /sys/module/amdgpu/drivers/pci:amdgpu/*/pp_dpm_*; do echo "$f"; cat "$f"; echo; done
/sys/module/amdgpu/drivers/pci:amdgpu/0000:01:00.0/pp_dpm_mclk
0: 300Mhz
1: 625Mhz
2: 1500Mhz *

/sys/module/amdgpu/drivers/pci:amdgpu/0000:01:00.0/pp_dpm_pcie
0: 2.5GT/s, x8
1: 8.0GT/s, x16 *

/sys/module/amdgpu/drivers/pci:amdgpu/0000:01:00.0/pp_dpm_sclk
0: 214Mhz
1: 501Mhz
2: 850Mhz
3: 1034Mhz
4: 1144Mhz
5: 1228Mhz
6: 1275Mhz
7: 1295Mhz *

% find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
/sys/bus/pci/devices/0000:01:00.0/current_link_width
8
/sys/bus/pci/devices/0000:01:00.0/current_link_speed
8.0 GT/s PCIe


James

2022-01-24 19:22:27

by Lazar, Lijo

Subject: RE: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM


I'm not able to see how this change would affect gfx/mem DPM alone. Unless Alex has other ideas, would you be able to enable drm debug messages and share the log?

Enabling verbose debug messages is done through the drm.debug parameter, each category being enabled by a bit:

drm.debug=0x1 will enable CORE messages
drm.debug=0x2 will enable DRIVER messages
drm.debug=0x3 will enable CORE and DRIVER messages
...
drm.debug=0x1ff will enable all messages
An interesting feature is that it's possible to enable verbose logging at run-time by echoing the debug value in its sysfs node:

# echo 0xf > /sys/module/drm/parameters/debug

Thanks,
Lijo
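Since the drm.debug categories are bit flags, they can be OR'ed
together. A small helper sketch (hypothetical; only the four lowest
documented bits are covered here) composes the value to write:

```shell
# Compose a drm.debug bitmask from category names.
# Documented low bits: CORE=0x1, DRIVER=0x2, KMS=0x4, PRIME=0x8.
drm_debug_mask() {
    local mask=0 c
    for c in "$@"; do
        case "$c" in
            core)   mask=$((mask | 0x1)) ;;
            driver) mask=$((mask | 0x2)) ;;
            kms)    mask=$((mask | 0x4)) ;;
            prime)  mask=$((mask | 0x8)) ;;
        esac
    done
    printf '0x%x\n' "$mask"
}

drm_debug_mask core driver   # prints 0x3
# Enable at run time (needs root):
#   echo "$(drm_debug_mask core driver)" | sudo tee /sys/module/drm/parameters/debug
```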


2022-01-24 19:38:50

by Alex Deucher

Subject: Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM

On Sat, Jan 22, 2022 at 4:38 PM James Turner
<[email protected]> wrote:
>
> Hi Lijo,
>
> > Could you provide the pp_dpm_* values in sysfs with and without the
> > patch? Also, could you try forcing PCIE to gen3 (through pp_dpm_pcie)
> > if it's not in gen3 when the issue happens?
>
> AFAICT, I can't access those values while the AMD GPU PCI devices are
> bound to `vfio-pci`. However, I can at least access the link speed and
> width elsewhere in sysfs. So, I gathered what information I could for
> two different cases:
>
> - With the PCI devices bound to `vfio-pci`. With this configuration, I
> can start the VM, but the `pp_dpm_*` values are not available since
> the devices are bound to `vfio-pci` instead of `amdgpu`.
>
> - Without the PCI devices bound to `vfio-pci` (i.e. after removing the
> `vfio-pci.ids=...` kernel command line argument). With this
> configuration, I can access the `pp_dpm_*` values, since the PCI
> devices are bound to `amdgpu`. However, I cannot use the VM. If I try
> to start the VM, the display (both the external monitors attached to
> the AMD GPU and the built-in laptop display attached to the Intel
> iGPU) completely freezes.
>
> The output shown below was identical for both the good commit:
> f1688bd69ec4 ("drm/amd/amdgpu:save psp ring wptr to avoid attack")
> and the commit which introduced the issue:
> f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)")
>
> Note that the PCI link speed increased to 8.0 GT/s when the GPU was
> under heavy load for both versions, but the clock speeds of the GPU were
> different under load. (For the good commit, it was 1295 MHz; for the bad
> commit, it was 501 MHz.)
>

Are the ATIF and ATCS ACPI methods available in the guest VM? They
are required for this platform to work correctly from a power
standpoint. One thing that f9b7f3703ff9 did was to get those ACPI
methods executed on certain platforms where they had not been
previously due to a bug in the original implementation. If the
Windows driver doesn't interact with them, it could cause performance
issues. It may have worked by accident before because the ACPI
interfaces may not have been called, leading the Windows driver to
believe this was a standalone dGPU rather than one integrated into a
power/thermal-limited platform.

Alex
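
[Editor's note: one way to check for these methods on the host side is
to decompile the firmware's ACPI tables with the ACPICA tools and grep
for them; the guest would only see them if the VMM exposed equivalent
tables. The acpidump/acpixtract/iasl invocation below is the usual one
but needs root and the ACPICA tools installed, so the grep is
demonstrated on a small hypothetical excerpt rather than a real DSDT:]

```shell
# On the host (sketch; needs root and the ACPICA tools):
#   acpidump -o acpi.dat && acpixtract -a acpi.dat && iasl -d *.dat
# then search the decompiled .dsl sources for the AMD methods.
# Simulated here on a hypothetical two-method excerpt:
cat > dsdt-excerpt.dsl <<'EOF'
Method (ATIF, 2, Serialized) { }
Method (ATPX, 2, Serialized) { }
EOF
grep -oE 'AT(IF|CS|PX)' dsdt-excerpt.dsl
# -> ATIF
#    ATPX
```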


>
> # With the PCI devices bound to `vfio-pci`
>
> ## Before starting the VM
>
> % ls /sys/module/amdgpu/drivers/pci:amdgpu
> module bind new_id remove_id uevent unbind
>
> % find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
> /sys/bus/pci/devices/0000:01:00.0/current_link_width
> 8
> /sys/bus/pci/devices/0000:01:00.0/current_link_speed
> 8.0 GT/s PCIe
>
> ## While running the VM, before placing the AMD GPU under heavy load
>
> % find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
> /sys/bus/pci/devices/0000:01:00.0/current_link_width
> 8
> /sys/bus/pci/devices/0000:01:00.0/current_link_speed
> 2.5 GT/s PCIe
>
> ## While running the VM, with the AMD GPU under heavy load
>
> % find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
> /sys/bus/pci/devices/0000:01:00.0/current_link_width
> 8
> /sys/bus/pci/devices/0000:01:00.0/current_link_speed
> 8.0 GT/s PCIe
>
> ## While running the VM, after stopping the heavy load on the AMD GPU
>
> % find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
> /sys/bus/pci/devices/0000:01:00.0/current_link_width
> 8
> /sys/bus/pci/devices/0000:01:00.0/current_link_speed
> 2.5 GT/s PCIe
>
> ## After stopping the VM
>
> % find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
> /sys/bus/pci/devices/0000:01:00.0/current_link_width
> 8
> /sys/bus/pci/devices/0000:01:00.0/current_link_speed
> 2.5 GT/s PCIe
>
>
> # Without the PCI devices bound to `vfio-pci`
>
> % ls /sys/module/amdgpu/drivers/pci:amdgpu
> 0000:01:00.0 module bind new_id remove_id uevent unbind
>
> % for f in /sys/module/amdgpu/drivers/pci:amdgpu/*/pp_dpm_*; do echo "$f"; cat "$f"; echo; done
> /sys/module/amdgpu/drivers/pci:amdgpu/0000:01:00.0/pp_dpm_mclk
> 0: 300Mhz
> 1: 625Mhz
> 2: 1500Mhz *
>
> /sys/module/amdgpu/drivers/pci:amdgpu/0000:01:00.0/pp_dpm_pcie
> 0: 2.5GT/s, x8
> 1: 8.0GT/s, x16 *
>
> /sys/module/amdgpu/drivers/pci:amdgpu/0000:01:00.0/pp_dpm_sclk
> 0: 214Mhz
> 1: 501Mhz
> 2: 850Mhz
> 3: 1034Mhz
> 4: 1144Mhz
> 5: 1228Mhz
> 6: 1275Mhz
> 7: 1295Mhz *
>
> % find /sys/bus/pci/devices/0000:01:00.0/ -type f -name 'current_link*' -print -exec cat {} \;
> /sys/bus/pci/devices/0000:01:00.0/current_link_width
> 8
> /sys/bus/pci/devices/0000:01:00.0/current_link_speed
> 8.0 GT/s PCIe
>
>
> James

2022-01-24 19:42:53

by Alex Williamson

Subject: Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM

On Mon, 24 Jan 2022 12:04:18 -0500
Alex Deucher <[email protected]> wrote:

> On Sat, Jan 22, 2022 at 4:38 PM James Turner
> <[email protected]> wrote:
> >
> > Hi Lijo,
> >
> > > Could you provide the pp_dpm_* values in sysfs with and without the
> > > patch? Also, could you try forcing PCIE to gen3 (through pp_dpm_pcie)
> > > if it's not in gen3 when the issue happens?
> >
> > AFAICT, I can't access those values while the AMD GPU PCI devices are
> > bound to `vfio-pci`. However, I can at least access the link speed and
> > width elsewhere in sysfs. So, I gathered what information I could for
> > two different cases:
> >
> > - With the PCI devices bound to `vfio-pci`. With this configuration, I
> > can start the VM, but the `pp_dpm_*` values are not available since
> > the devices are bound to `vfio-pci` instead of `amdgpu`.
> >
> > - Without the PCI devices bound to `vfio-pci` (i.e. after removing the
> > `vfio-pci.ids=...` kernel command line argument). With this
> > configuration, I can access the `pp_dpm_*` values, since the PCI
> > devices are bound to `amdgpu`. However, I cannot use the VM. If I try
> > to start the VM, the display (both the external monitors attached to
> > the AMD GPU and the built-in laptop display attached to the Intel
> > iGPU) completely freezes.
> >
> > The output shown below was identical for both the good commit:
> > f1688bd69ec4 ("drm/amd/amdgpu:save psp ring wptr to avoid attack")
> > and the commit which introduced the issue:
> > f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)")
> >
> > Note that the PCI link speed increased to 8.0 GT/s when the GPU was
> > under heavy load for both versions, but the clock speeds of the GPU were
> > different under load. (For the good commit, it was 1295 MHz; for the bad
> > commit, it was 501 MHz.)
> >
>
> Are the ATIF and ATCS ACPI methods available in the guest VM? They
> are required for this platform to work correctly from a power
> standpoint. One thing that f9b7f3703ff9 did was to get those ACPI
> methods executed on certain platforms where they had not been
> previously due to a bug in the original implementation. If the
> Windows driver doesn't interact with them, it could cause performance
> issues. It may have worked by accident before because the ACPI
> interfaces may not have been called, leading the Windows driver to
> believe this was a standalone dGPU rather than one integrated into a
> power/thermal-limited platform.

None of the host ACPI interfaces are available to or accessible by the
guest when assigning a PCI device. Likewise the guest does not have
access to the parent downstream ports of the PCIe link. Thanks,

Alex

2022-01-25 08:38:57

by James Turner

Subject: Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM

Hi Lijo,

> Not able to relate to how it affects gfx/mem DPM alone. Unless Alex
> has other ideas, would you be able to enable drm debug messages and
> share the log?

Sure, I'm happy to provide drm debug messages. Enabling everything
(0x1ff) generates *a lot* of log messages, though. Is there a smaller
subset that would be useful? Fwiw, I don't see much in the full drm logs
about the AMD GPU anyway; it's mostly about the Intel GPU.

All the messages in the system log containing "01:00" or "1002:6981" are
identical between the two versions.

I've posted below the only places in the logs which contain "amd". The
commit with the issue (f9b7f3703ff9) has a few drm log messages from
amdgpu which are not present in the logs for f1688bd69ec4.


# f1688bd69ec4 ("drm/amd/amdgpu:save psp ring wptr to avoid attack")

[drm] amdgpu kernel modesetting enabled.
vga_switcheroo: detected switching method \_SB_.PCI0.GFX0.ATPX handle
ATPX version 1, functions 0x00000033
amdgpu: CRAT table not found
amdgpu: Virtual CRAT table created for CPU
amdgpu: Topology: Add CPU node


# f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)")

[drm] amdgpu kernel modesetting enabled.
vga_switcheroo: detected switching method \_SB_.PCI0.GFX0.ATPX handle
ATPX version 1, functions 0x00000033
[drm:amdgpu_atif_pci_probe_handle.isra.0 [amdgpu]] Found ATIF handle \_SB_.PCI0.GFX0.ATIF
[drm:amdgpu_atif_pci_probe_handle.isra.0 [amdgpu]] ATIF version 1
[drm:amdgpu_acpi_detect [amdgpu]] SYSTEM_PARAMS: mask = 0x6, flags = 0x7
[drm:amdgpu_acpi_detect [amdgpu]] Notification enabled, command code = 0xd9
amdgpu: CRAT table not found
amdgpu: Virtual CRAT table created for CPU
amdgpu: Topology: Add CPU node


Other things I'm willing to try if they'd be useful:

- I could update to the 21.Q4 Radeon Pro driver in the Windows VM. (The
21.Q3 driver is currently installed.)

- I could set up a Linux guest VM with PCI passthrough to compare to the
Windows VM and obtain more debugging information.

- I could build a kernel with a patch applied, e.g. to disable some of
the changes in f9b7f3703ff9.

James
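
[Editor's note: `drm.debug=0x1ff` is a bitmask, one bit per DRM debug
message category. The decode below is illustrative; the category names
follow the DRM_UT_* flags in include/drm/drm_print.h around v5.16 and
should be treated as an approximation, not authoritative:]

```shell
# Decode drm.debug=0x1ff into the DRM_UT_* categories it enables.
# Bit 0 = CORE (0x01), bit 1 = DRIVER (0x02), ... bit 8 = DP (0x100).
mask=0x1ff
cats="CORE DRIVER KMS PRIME ATOMIC VBL STATE LEASE DP"
i=0
for c in $cats; do
  if [ $(( mask >> i & 1 )) -eq 1 ]; then echo "$c"; fi
  i=$((i + 1))
done
```

[The same mask can also be changed at runtime, without rebooting, by
writing it to /sys/module/drm/parameters/debug, which makes it easier
to narrow the mask to a smaller subset when 0x1ff is too noisy.]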

2022-01-25 19:43:31

by Lazar, Lijo

Subject: Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM



On 1/25/2022 5:28 AM, James Turner wrote:
> Hi Lijo,
>
>> Not able to relate to how it affects gfx/mem DPM alone. Unless Alex
>> has other ideas, would you be able to enable drm debug messages and
>> share the log?
>
> Sure, I'm happy to provide drm debug messages. Enabling everything
> (0x1ff) generates *a lot* of log messages, though. Is there a smaller
> subset that would be useful? Fwiw, I don't see much in the full drm logs
> about the AMD GPU anyway; it's mostly about the Intel GPU.
>
> All the messages in the system log containing "01:00" or "1002:6981" are
> identical between the two versions.
>
> I've posted below the only places in the logs which contain "amd". The
> commit with the issue (f9b7f3703ff9) has a few drm log messages from
> amdgpu which are not present in the logs for f1688bd69ec4.
>
>
> # f1688bd69ec4 ("drm/amd/amdgpu:save psp ring wptr to avoid attack")
>
> [drm] amdgpu kernel modesetting enabled.
> vga_switcheroo: detected switching method \_SB_.PCI0.GFX0.ATPX handle
> ATPX version 1, functions 0x00000033
> amdgpu: CRAT table not found
> amdgpu: Virtual CRAT table created for CPU
> amdgpu: Topology: Add CPU node
>
>
> # f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)")
>
> [drm] amdgpu kernel modesetting enabled.
> vga_switcheroo: detected switching method \_SB_.PCI0.GFX0.ATPX handle
> ATPX version 1, functions 0x00000033
> [drm:amdgpu_atif_pci_probe_handle.isra.0 [amdgpu]] Found ATIF handle \_SB_.PCI0.GFX0.ATIF
> [drm:amdgpu_atif_pci_probe_handle.isra.0 [amdgpu]] ATIF version 1
> [drm:amdgpu_acpi_detect [amdgpu]] SYSTEM_PARAMS: mask = 0x6, flags = 0x7
> [drm:amdgpu_acpi_detect [amdgpu]] Notification enabled, command code = 0xd9
> amdgpu: CRAT table not found
> amdgpu: Virtual CRAT table created for CPU
> amdgpu: Topology: Add CPU node
>
>

Hi James,

Specifically, I was looking for any events happening at these two places
because of the patch:

https://elixir.bootlin.com/linux/v5.16/source/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c#L411

https://elixir.bootlin.com/linux/v5.16/source/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c#L653

The patch specifically affects these two. If either of these functions
is invoked on your system, on or before starting the VM, as a result of
the patch, we can work forward from there and check what the side
effect is.

Thanks,
Lijo

> Other things I'm willing to try if they'd be useful:
>
> - I could update to the 21.Q4 Radeon Pro driver in the Windows VM. (The
> 21.Q3 driver is currently installed.)
>
> - I could set up a Linux guest VM with PCI passthrough to compare to the
> Windows VM and obtain more debugging information.
>
> - I could build a kernel with a patch applied, e.g. to disable some of
> the changes in f9b7f3703ff9.
>
> James
>

2022-02-01 15:17:15

by James Turner

Subject: Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM

Hi Lijo,

> Specifically, I was looking for any events happening at these two
> places because of the patch-
>
> https://elixir.bootlin.com/linux/v5.16/source/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c#L411
>
> https://elixir.bootlin.com/linux/v5.16/source/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c#L653

I searched the logs generated with all drm debug messages enabled
(drm.debug=0x1ff) for "device_class", "ATCS", "atcs", "ATIF", and
"atif", for both f1688bd69ec4 and f9b7f3703ff9. Other than the few lines
mentioning ATIF from my previous email, there weren't any matches.

Since "device_class" didn't appear in the logs, we know that
`amdgpu_atif_handler` was not called for either version.

I also patched f9b7f3703ff9 to add the line

DRM_DEBUG_DRIVER("Entered amdgpu_acpi_pcie_performance_request");

at the top (below the variable declarations) of
`amdgpu_acpi_pcie_performance_request`, and then tested again with all
drm debug messages enabled (0x1ff). That debug message didn't show up.

So, `amdgpu_acpi_pcie_performance_request` was not called either, at
least with f9b7f3703ff9. (I didn't try adding this patch to
f1688bd69ec4.)

Would anything else be helpful?

James

Subject: Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM

Top-posting for once, to make this easily accessible to everyone.

Nothing happened here for two weeks now afaics. Was the discussion moved
elsewhere or did it fall through the cracks?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I'm getting a lot of
reports on my table. I can only look briefly into most of them and lack
knowledge about most of the areas they concern. I thus unfortunately
will sometimes get things wrong or miss something important. I hope
that's not the case here; if you think it is, don't hesitate to tell me
in a public reply, it's in everyone's interest to set the public record
straight.

On 30.01.22 01:25, Jim Turner wrote:
> Hi Lijo,
>
>> Specifically, I was looking for any events happening at these two
>> places because of the patch-
>>
>> https://elixir.bootlin.com/linux/v5.16/source/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c#L411
>>
>> https://elixir.bootlin.com/linux/v5.16/source/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c#L653
>
> I searched the logs generated with all drm debug messages enabled
> (drm.debug=0x1ff) for "device_class", "ATCS", "atcs", "ATIF", and
> "atif", for both f1688bd69ec4 and f9b7f3703ff9. Other than the few lines
> mentioning ATIF from my previous email, there weren't any matches.
>
> Since "device_class" didn't appear in the logs, we know that
> `amdgpu_atif_handler` was not called for either version.
>
> I also patched f9b7f3703ff9 to add the line
>
> DRM_DEBUG_DRIVER("Entered amdgpu_acpi_pcie_performance_request");
>
> at the top (below the variable declarations) of
> `amdgpu_acpi_pcie_performance_request`, and then tested again with all
> drm debug messages enabled (0x1ff). That debug message didn't show up.
>
> So, `amdgpu_acpi_pcie_performance_request` was not called either, at
> least with f9b7f3703ff9. (I didn't try adding this patch to
> f1688bd69ec4.)
>
> Would anything else be helpful?