2022-02-16 18:12:56

by Alex Deucher

Subject: Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM

On Tue, Feb 15, 2022 at 9:35 PM James D. Turner wrote:
>
> Hi Alex,
>
> > I guess just querying the ATIF method does something that negatively
> > influences the windows driver in the guest. Perhaps the platform
> > thinks the driver has been loaded since the method has been called so
> > it enables certain behaviors that require ATIF interaction that never
> > happen because the ACPI methods are not available in the guest.
>
> Do you mean the `amdgpu_atif_pci_probe_handle` function? If it would be
> helpful, I could try disabling that function and testing again.

Correct.

>
> > I don't really have a good workaround other than blacklisting the
> > driver since on bare metal the driver needs to use this interface for
> > platform interactions.
>
> I'm not familiar with ATIF, but should `amdgpu_atif_pci_probe_handle`
> really be called for PCI devices which are bound to vfio-pci? I'd expect
> amdgpu to ignore such devices.
>
> As I understand it, starting with
> f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)"),
> the `amdgpu_acpi_detect` function loops over all PCI devices in the
> `PCI_CLASS_DISPLAY_VGA` and `PCI_CLASS_DISPLAY_OTHER` classes to find
> the ATIF and ATCS handles. Maybe skipping over any PCI devices bound to
> vfio-pci would fix the issue? On a related note, shouldn't it also skip
> over any PCI devices with non-AMD vendor IDs?

The ACPI methods are global. There's only one instance of each per
system and they are relevant to all GPUs on the platform. That's why
they are a global resource in the driver. They can be hung off of the
dGPU or APU ACPI namespace, depending on the platform, which is why we
check all of the display devices. I think skipping them would prevent
them from being available if you later bound the amdgpu driver to the
GPU device(s).
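
For illustration, the lookup described above can be modeled in plain userspace C. This is a simplified sketch, not the kernel source: `struct model_pci_dev`, `model_find_atif` and the `skip_vfio` flag are hypothetical names made up for this example, and the real driver queries ACPI handles rather than a boolean field. Only the two class constants match the values in the kernel's `pci_ids.h`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* PCI base class + subclass codes, same values as in the kernel's pci_ids.h. */
#define PCI_CLASS_DISPLAY_VGA   0x0300
#define PCI_CLASS_DISPLAY_OTHER 0x0380

/* Hypothetical stand-in for a PCI device; not the kernel's struct pci_dev. */
struct model_pci_dev {
	unsigned int class_code;  /* 16-bit base class + subclass */
	const char *driver;       /* bound driver name, or NULL if unbound */
	bool has_atif;            /* true if ATIF hangs off this device's ACPI node */
};

/*
 * Walk every display-class device and return the index of the first one
 * whose ACPI namespace carries ATIF, or -1 if none does. When skip_vfio is
 * set, devices bound to vfio-pci are ignored, which models the filter
 * proposed in this thread.
 */
static int model_find_atif(const struct model_pci_dev *devs, size_t n,
			   bool skip_vfio)
{
	assert(devs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		const struct model_pci_dev *d = &devs[i];

		if (d->class_code != PCI_CLASS_DISPLAY_VGA &&
		    d->class_code != PCI_CLASS_DISPLAY_OTHER)
			continue;  /* not a display device */
		if (skip_vfio && d->driver && strcmp(d->driver, "vfio-pci") == 0)
			continue;  /* proposed filter: leave passthrough devices alone */
		if (d->has_atif)
			return (int)i;
	}
	return -1;
}
```

If the only device carrying ATIF is the one bound to vfio-pci, the unfiltered lookup finds it and the filtered one returns -1. That is the trade-off in question: the filter avoids touching the passed-through GPU, but the handle would then be missing if amdgpu were bound to that device later.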

Alex

>
> Regards,
> James


2022-03-07 06:07:37

by Thorsten Leemhuis

Subject: Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM

Hi, this is your Linux kernel regression tracker again. Top-posting once
more, to make this easily accessible to everyone.

What's the status of this? It looks stuck, or did the discussion
continue somewhere else? James, it sounded like you wanted to test
something, did you give it a try? Or is there some reason why I should
stop tracking this regression?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I'm getting a lot of
reports on my table. I can only look briefly into most of them and lack
knowledge about most of the areas they concern. I thus unfortunately
will sometimes get things wrong or miss something important. I hope
that's not the case here; if you think it is, don't hesitate to tell me
in a public reply, it's in everyone's interest to set the public record
straight.

#regzbot poke

On 16.02.22 17:37, Alex Deucher wrote:
> On Tue, Feb 15, 2022 at 9:35 PM James D. Turner wrote:
>>
>> Hi Alex,
>>
>>> I guess just querying the ATIF method does something that negatively
>>> influences the windows driver in the guest. Perhaps the platform
>>> thinks the driver has been loaded since the method has been called so
>>> it enables certain behaviors that require ATIF interaction that never
>>> happen because the ACPI methods are not available in the guest.
>>
>> Do you mean the `amdgpu_atif_pci_probe_handle` function? If it would be
>> helpful, I could try disabling that function and testing again.
>
> Correct.
>
>>
>>> I don't really have a good workaround other than blacklisting the
>>> driver since on bare metal the driver needs to use this interface for
>>> platform interactions.
>>
>> I'm not familiar with ATIF, but should `amdgpu_atif_pci_probe_handle`
>> really be called for PCI devices which are bound to vfio-pci? I'd expect
>> amdgpu to ignore such devices.
>>
>> As I understand it, starting with
>> f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)"),
>> the `amdgpu_acpi_detect` function loops over all PCI devices in the
>> `PCI_CLASS_DISPLAY_VGA` and `PCI_CLASS_DISPLAY_OTHER` classes to find
>> the ATIF and ATCS handles. Maybe skipping over any PCI devices bound to
>> vfio-pci would fix the issue? On a related note, shouldn't it also skip
>> over any PCI devices with non-AMD vendor IDs?
>
> The ACPI methods are global. There's only one instance of each per
> system and they are relevant to all GPUs on the platform. That's why
> they are a global resource in the driver. They can be hung off of the
> dGPU or APU ACPI namespace, depending on the platform, which is why we
> check all of the display devices. I think skipping them would prevent
> them from being available if you later bound the amdgpu driver to the
> GPU device(s).
>
> Alex
>
>>
>> Regards,
>> James
>
>

2022-03-07 07:21:41

by James Turner

Subject: Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM

Hi Thorsten,

My understanding at this point is that the root problem is probably not
in the Linux kernel but rather something else (e.g. the machine firmware
or AMD Windows driver) and that the change in
f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)")
simply exposed the underlying problem.

This week, I'll double-check that this is the case by disabling the
`amdgpu_atif_pci_probe_handle` function and testing again. I'll post the
results here.

James

2022-03-14 06:48:19

by James Turner

Subject: Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM

Hi all,

I've confirmed that changing the `amdgpu_atif_pci_probe_handle` function
to do nothing does make the GPU work properly in the VM. I started with
f9b7f3703ff9 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)")
and changed the function implementation to:

static bool amdgpu_atif_pci_probe_handle(struct pci_dev *pdev)
{
	DRM_DEBUG_DRIVER("Entered amdgpu_atif_pci_probe_handle");
	return false;
}

With that change, the GPU works properly in the VM.

I'm not sure where to go from here. This issue isn't much of a concern
for me anymore, since blacklisting `amdgpu` works for my machine. At
this point, my understanding is that the root problem needs to be fixed
in AMD's Windows GPU driver or Dell's firmware, not the Linux kernel. If
any of the AMD developers on this thread would like to forward it to the
AMD Windows driver team, I'd be happy to work with AMD to fix the issue
properly.

I've added a mention of this issue and workaround to the [Arch Wiki][1]
to make it more discoverable. If anyone has a better place to document
this, please let me know.

Thank you all for your help on this.

[1]: https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF#Too-low_frequency_limit_for_AMD_GPU_passed-through_to_virtual_machine

James

2022-03-17 18:48:28

by Thorsten Leemhuis

Subject: Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM

On 13.03.22 19:33, James Turner wrote:
>
>> My understanding at this point is that the root problem is probably
>> not in the Linux kernel but rather something else (e.g. the machine
>> firmware or AMD Windows driver) and that the change in f9b7f3703ff9
>> ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)") simply
>> exposed the underlying problem.

FWIW: that in the end is irrelevant when it comes to the Linux kernel's
'no regressions' rule. For details see:

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/Documentation/admin-guide/reporting-regressions.rst
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/Documentation/process/handling-regressions.rst

That being said: sometimes for the greater good it's better to not
insist on that. And I guess that might be the case here.

> I'm not sure where to go from here. This issue isn't much of a concern
> for me anymore, since blacklisting `amdgpu` works for my machine. At
> this point, my understanding is that the root problem needs to be fixed
> in AMD's Windows GPU driver or Dell's firmware, not the Linux kernel. If
> any of the AMD developers on this thread would like to forward it to the
> AMD Windows driver team, I'd be happy to work with AMD to fix the issue
> properly.

In that case I'll drop it from the list of regressions, unless what I
wrote above makes you change your mind.

#regzbot invalid: firmware issue exposed by kernel change, user seems to
be happy with a workaround

Thx everyone who participated in handling this.

Ciao, Thorsten

2022-03-18 13:53:01

by Paul Menzel

Subject: Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM

Dear Thorsten, dear James,


Am 17.03.22 um 13:54 schrieb Thorsten Leemhuis:
> On 13.03.22 19:33, James Turner wrote:
>>
>>> My understanding at this point is that the root problem is probably
>>> not in the Linux kernel but rather something else (e.g. the machine
>>> firmware or AMD Windows driver) and that the change in f9b7f3703ff9
>>> ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)") simply
>>> exposed the underlying problem.
>
> FWIW: that in the end is irrelevant when it comes to the Linux kernel's
> 'no regressions' rule. For details see:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/Documentation/admin-guide/reporting-regressions.rst
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/Documentation/process/handling-regressions.rst
>
> That being said: sometimes for the greater good it's better to not
> insist on that. And I guess that might be the case here.

But who decides that? Running stuff in a virtual machine is not that
uncommon.

Should the commit be reverted, and re-added with a more elaborate commit
message documenting the downsides?

Could the user be notified somehow? Can PCI passthrough and a loaded
amdgpu driver be detected, so Linux warns about this?

Also, should this be documented in the code?
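
On the detection question, a rough userspace sketch of what such a check could look at, assuming the usual sysfs layout: each PCI device has a `class` file and, when a driver is bound, a `driver` symlink under `/sys/bus/pci/devices/<address>/`, and a loaded module shows up as `/sys/module/amdgpu`. The helper names and the warning policy below are hypothetical; the helpers only interpret strings that a caller has already read from sysfs, and nothing like this exists in the kernel today as far as this thread shows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Interpret the contents of a sysfs PCI "class" file (e.g. "0x030000\n")
 * and report whether the base class is 0x03 (display controller).
 */
static bool class_is_display(const char *class_text)
{
	char *end;
	unsigned long v;

	assert(class_text != NULL);
	v = strtoul(class_text, &end, 16);
	if (end == class_text)
		return false;  /* no hex digits parsed */
	return ((v >> 16) & 0xff) == 0x03;
}

/*
 * Given the target of a device's sysfs "driver" symlink (for example
 * "../../../bus/pci/drivers/vfio-pci"), report whether it names vfio-pci.
 * A NULL target stands for a device with no driver bound.
 */
static bool driver_is_vfio(const char *link_target)
{
	const char *base;

	if (!link_target)
		return false;
	base = strrchr(link_target, '/');
	base = base ? base + 1 : link_target;
	return strcmp(base, "vfio-pci") == 0;
}

/*
 * Hypothetical policy: warn only when amdgpu is loaded and a display-class
 * device is bound to vfio-pci at the same time.
 */
static bool should_warn(bool amdgpu_loaded, const char *class_text,
			const char *link_target)
{
	return amdgpu_loaded && class_is_display(class_text) &&
	       driver_is_vfio(link_target);
}
```

A caller would read `class` with `fopen`/`fgets`, resolve `driver` with `readlink`, test for `/sys/module/amdgpu` with `access`, and then pass the results to `should_warn` for each device.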

>> I'm not sure where to go from here. This issue isn't much of a concern
>> for me anymore, since blacklisting `amdgpu` works for my machine. At
>> this point, my understanding is that the root problem needs to be fixed
>> in AMD's Windows GPU driver or Dell's firmware, not the Linux kernel. If
>> any of the AMD developers on this thread would like to forward it to the
>> AMD Windows driver team, I'd be happy to work with AMD to fix the issue
>> properly.

(Thorsten, your mailer mangled the quote somehow, so I reformatted it.
That is too bad, as this message is shown when clicking on the link
*marked invalid* in the regzbot Web page [1]. The link is a very nice
feature.)

> In that case I'll drop it from the list of regressions, unless what I
> wrote above makes you change your mind.
>
> #regzbot invalid: firmware issue exposed by kernel change, user seems to
> be happy with a workaround
>
> Thx everyone who participated in handling this.

Should the regression issue be re-opened until the questions above are
answered, and a more user friendly solution is found?


Kind regards,

Paul


[1]: https://linux-regtracking.leemhuis.info/regzbot/resolved/