2022-08-22 14:27:28

by Takashi Iwai

[permalink] [raw]
Subject: [REGRESSION 5.19.x] AMD HD-audio devices missing on 5.19

Hi,

we've received regression reports about the missing HD-audio devices
on AMD platforms, and this turned out to be caused by the commit
512881eacfa72c2136b27b9934b7b27504a9efc2
bus: platform,amba,fsl-mc,PCI: Add device DMA ownership management

The details are found in openSUSE bugzilla:
https://bugzilla.suse.com/show_bug.cgi?id=1202492

The problem seems to be that HD-audio (both onboard analog and HDMI)
PCI devices are assigned to the same IOMMU group as AMD graphics PCI
device, and once after the AMDGPU is initialized beforehand, those
audio devices can't be probed since iommu_device_use_default_domain()
returns -EBUSY.

I'm not sure whether it's specific to PCI bus due to the assignment
logic of those IOMMU groups, or it's about the handling of the active
domain assignment. In anyway, disabling IOMMU works around the
problem, and passing driver_managed_dma flag to the HD-audio driver
was also confirmed to work around it, too.

The problem persists with 6.0-rc1 kernel.

If you have / can give any fix patch or debug patch, let me know; I'll
build test kernels and ask the reporters.


thanks,

Takashi


2022-08-23 01:20:57

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [REGRESSION 5.19.x] AMD HD-audio devices missing on 5.19

On Mon, Aug 22, 2022 at 04:12:59PM +0200, Takashi Iwai wrote:
> Hi,
>
> we've received regression reports about the missing HD-audio devices
> on AMD platforms, and this turned out to be caused by the commit
> 512881eacfa72c2136b27b9934b7b27504a9efc2
> bus: platform,amba,fsl-mc,PCI: Add device DMA ownership management
>
> The details are found in openSUSE bugzilla:
> https://bugzilla.suse.com/show_bug.cgi?id=1202492
>
> The problem seems to be that HD-audio (both onboard analog and HDMI)
> PCI devices are assigned to the same IOMMU group as AMD graphics PCI
> device, and once after the AMDGPU is initialized beforehand, those
> audio devices can't be probed since iommu_device_use_default_domain()
> returns -EBUSY.

Can you describe exactly what drivers are involved in this? If it is
the above commit then several devices are sharing an iommu group and
one of them (well, the only one already attached, I suppose) has made
the group unsharable.

With grep I don't see an obvious place where the AMDGPU driver would
mess with the iommu configuration, so I have no guess.

It would be good to have some debugging to confirm if it is
group->owner (should be impossible, suggests memory corruption if it
is) or group->domain != group->default_domain.

Most likely it is the later, but I can't see how that could happen on
a system like this.. There is no obvious manipulation in AMDGPU, for
instance.

So debugging to find the backtrace for exactly when
group->domain != group->default_domain
Occurs for the troubled group would be necessary.

If you know the group name it would be easy enough to cook a patch to
throw a warn on when group->domain changes

> domain assignment. In anyway, disabling IOMMU works around the
> problem, and passing driver_managed_dma flag to the HD-audio driver
> was also confirmed to work around it, too.

Disabling iommu removes the groups entirely, this disables the check.

driver_managed_dma disables the check entirely - which raises the
question how the driver is even able to work..

If the domain is not the default_domain it is very surprising that DMA
can work at all. Since it does, something really odd has happened.

Jason

2022-08-23 06:14:28

by Takashi Iwai

[permalink] [raw]
Subject: Re: [REGRESSION 5.19.x] AMD HD-audio devices missing on 5.19

On Tue, 23 Aug 2022 03:00:21 +0200,
Jason Gunthorpe wrote:
>
> On Mon, Aug 22, 2022 at 04:12:59PM +0200, Takashi Iwai wrote:
> > Hi,
> >
> > we've received regression reports about the missing HD-audio devices
> > on AMD platforms, and this turned out to be caused by the commit
> > 512881eacfa72c2136b27b9934b7b27504a9efc2
> > bus: platform,amba,fsl-mc,PCI: Add device DMA ownership management
> >
> > The details are found in openSUSE bugzilla:
> > https://bugzilla.suse.com/show_bug.cgi?id=1202492
> >
> > The problem seems to be that HD-audio (both onboard analog and HDMI)
> > PCI devices are assigned to the same IOMMU group as AMD graphics PCI
> > device, and once after the AMDGPU is initialized beforehand, those
> > audio devices can't be probed since iommu_device_use_default_domain()
> > returns -EBUSY.
>
> Can you describe exactly what drivers are involved in this? If it is
> the above commit then several devices are sharing an iommu group and
> one of them (well, the only one already attached, I suppose) has made
> the group unsharable.
>
> With grep I don't see an obvious place where the AMDGPU driver would
> mess with the iommu configuration, so I have no guess.

I have also no concrete clue, either :)
At least, drivers/gpu/drm/amd/amdkfd/kfd_iommu.c calls
amd_iommu_init_device(), and this invokes iommu_attach_group(), which
may change group->domain. But it was just my wild guess, and it might
be others, indeed.

> It would be good to have some debugging to confirm if it is
> group->owner (should be impossible, suggests memory corruption if it
> is) or group->domain != group->default_domain.
>
> Most likely it is the later, but I can't see how that could happen on
> a system like this.. There is no obvious manipulation in AMDGPU, for
> instance.
>
> So debugging to find the backtrace for exactly when
> group->domain != group->default_domain
> Occurs for the troubled group would be necessary.

OK, will try to build a test kernel with some debug prints and ask the
reporters. It may take some time.

> If you know the group name it would be easy enough to cook a patch to
> throw a warn on when group->domain changes
>
> > domain assignment. In anyway, disabling IOMMU works around the
> > problem, and passing driver_managed_dma flag to the HD-audio driver
> > was also confirmed to work around it, too.
>
> Disabling iommu removes the groups entirely, this disables the check.
>
> driver_managed_dma disables the check entirely - which raises the
> question how the driver is even able to work..
>
> If the domain is not the default_domain it is very surprising that DMA
> can work at all. Since it does, something really odd has happened.

Yeah it's something odd ;)

I'm not sure whether the people tested HD-audio capability at all, but
otherwise they might have noticed.


Takashi

2022-08-23 16:06:11

by Takashi Iwai

[permalink] [raw]
Subject: Re: [REGRESSION 5.19.x] AMD HD-audio devices missing on 5.19

On Tue, 23 Aug 2022 08:06:05 +0200,
Takashi Iwai wrote:
>
> On Tue, 23 Aug 2022 03:00:21 +0200,
> Jason Gunthorpe wrote:
> >
> > On Mon, Aug 22, 2022 at 04:12:59PM +0200, Takashi Iwai wrote:
> > > Hi,
> > >
> > > we've received regression reports about the missing HD-audio devices
> > > on AMD platforms, and this turned out to be caused by the commit
> > > 512881eacfa72c2136b27b9934b7b27504a9efc2
> > > bus: platform,amba,fsl-mc,PCI: Add device DMA ownership management
> > >
> > > The details are found in openSUSE bugzilla:
> > > https://bugzilla.suse.com/show_bug.cgi?id=1202492
> > >
> > > The problem seems to be that HD-audio (both onboard analog and HDMI)
> > > PCI devices are assigned to the same IOMMU group as AMD graphics PCI
> > > device, and once after the AMDGPU is initialized beforehand, those
> > > audio devices can't be probed since iommu_device_use_default_domain()
> > > returns -EBUSY.
> >
> > Can you describe exactly what drivers are involved in this? If it is
> > the above commit then several devices are sharing an iommu group and
> > one of them (well, the only one already attached, I suppose) has made
> > the group unsharable.
> >
> > With grep I don't see an obvious place where the AMDGPU driver would
> > mess with the iommu configuration, so I have no guess.
>
> I have also no concrete clue, either :)
> At least, drivers/gpu/drm/amd/amdkfd/kfd_iommu.c calls
> amd_iommu_init_device(), and this invokes iommu_attach_group(), which
> may change group->domain. But it was just my wild guess, and it might
> be others, indeed.
>
> > It would be good to have some debugging to confirm if it is
> > group->owner (should be impossible, suggests memory corruption if it
> > is) or group->domain != group->default_domain.
> >
> > Most likely it is the later, but I can't see how that could happen on
> > a system like this.. There is no obvious manipulation in AMDGPU, for
> > instance.
> >
> > So debugging to find the backtrace for exactly when
> > group->domain != group->default_domain
> > Occurs for the troubled group would be necessary.
>
> OK, will try to build a test kernel with some debug prints and ask the
> reporters. It may take some time.

It was tested now and confirmed that the call path is via AMDGPU, as
expected:
amdgpu_pci_probe ->
amdgpu_driver_load_kms ->
amdgpu_device_init ->
amdgpu_amdkfd_device_init ->
kgd2kfd_device_init ->
kgd2kfd_resume_iommu ->
kfd_iommu_resume ->
amd_iommu_init_device ->
iommu_attach_group ->
__iommu_attach_group

At first AMDGPU driver is probed, and the iommu_attach_group() call
above changes the assigned group->domain. Afterwards, when HD-audio
devices are probed, it fails because:
- Both HD-audio PCI devices belong to the very same IOMMU group as the
AMD graphics PCI device
- PCI core calls iommu_device_use_default_domain() and the check
fails there because group->domain != group->default_domain

Takashi

2022-08-23 20:50:45

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [REGRESSION 5.19.x] AMD HD-audio devices missing on 5.19

On Tue, Aug 23, 2022 at 01:46:36PM +0200, Takashi Iwai wrote:
> It was tested now and confirmed that the call path is via AMDGPU, as
> expected:
> amdgpu_pci_probe ->
> amdgpu_driver_load_kms ->
> amdgpu_device_init ->
> amdgpu_amdkfd_device_init ->
> kgd2kfd_device_init ->
> kgd2kfd_resume_iommu ->
> kfd_iommu_resume ->
> amd_iommu_init_device ->
> iommu_attach_group ->
> __iommu_attach_group

Oh, when you said sound intel I thought this was an Intel CPU..

Yes, there is this hacky private path from the amdgpu to
the amd iommu driver that makes a mess of it here. We discussed it in
this thread:

https://lore.kernel.org/linux-iommu/YgtuJQhY8SNlv9%[email protected]/

But nobody put it together that it would be a problem with this.

Something like this, perhaps, but I didn't check if overriding the
type would cause other problems.

diff --git a/drivers/iommu/amd/iommu_v2.c b/drivers/iommu/amd/iommu_v2.c
index 696d5555be5794..6a1f02c62dffcc 100644
--- a/drivers/iommu/amd/iommu_v2.c
+++ b/drivers/iommu/amd/iommu_v2.c
@@ -777,6 +777,8 @@ int amd_iommu_init_device(struct pci_dev *pdev, int pasids)
if (dev_state->domain == NULL)
goto out_free_states;

+ /* See iommu_is_default_domain() */
+ dev_state->domain->type = IOMMU_DOMAIN_IDENTITY;
amd_iommu_domain_direct_map(dev_state->domain);

ret = amd_iommu_domain_enable_v2(dev_state->domain, pasids);
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 780fb70715770d..fe8bd17f52314b 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -3076,6 +3076,24 @@ static ssize_t iommu_group_store_type(struct iommu_group *group,
return ret;
}

+static bool iommu_is_default_domain(struct iommu_group *group)
+{
+ if (group->domain == group->default_domain)
+ return true;
+
+ /*
+ * If the default domain was set to identity and it is still an identity
+ * domain then we consider this a pass. This happens because of
+ * amd_iommu_init_device() replacing the default idenity domain with an
+ * identity domain that has a different configuration for AMDGPU.
+ */
+ if (group->default_domain &&
+ group->default_domain->type == IOMMU_DOMAIN_IDENTITY &&
+ group->domain && group->domain->type == IOMMU_DOMAIN_IDENTITY)
+ return true;
+ return false;
+}
+
/**
* iommu_device_use_default_domain() - Device driver wants to handle device
* DMA through the kernel DMA API.
@@ -3094,8 +3112,7 @@ int iommu_device_use_default_domain(struct device *dev)

mutex_lock(&group->mutex);
if (group->owner_cnt) {
- if (group->domain != group->default_domain ||
- group->owner) {
+ if (group->owner || iommu_is_default_domain(group)) {
ret = -EBUSY;
goto unlock_out;
}

2022-08-23 21:04:50

by Robin Murphy

[permalink] [raw]
Subject: Re: [REGRESSION 5.19.x] AMD HD-audio devices missing on 5.19

On 2022-08-23 21:28, Jason Gunthorpe wrote:
> On Tue, Aug 23, 2022 at 01:46:36PM +0200, Takashi Iwai wrote:
>> It was tested now and confirmed that the call path is via AMDGPU, as
>> expected:
>> amdgpu_pci_probe ->
>> amdgpu_driver_load_kms ->
>> amdgpu_device_init ->
>> amdgpu_amdkfd_device_init ->
>> kgd2kfd_device_init ->
>> kgd2kfd_resume_iommu ->
>> kfd_iommu_resume ->
>> amd_iommu_init_device ->
>> iommu_attach_group ->
>> __iommu_attach_group
>
> Oh, when you said sound intel I thought this was an Intel CPU..
>
> Yes, there is this hacky private path from the amdgpu to
> the amd iommu driver that makes a mess of it here. We discussed it in
> this thread:
>
> https://lore.kernel.org/linux-iommu/YgtuJQhY8SNlv9%[email protected]/
>
> But nobody put it together that it would be a problem with this.
>
> Something like this, perhaps, but I didn't check if overriding the
> type would cause other problems.
>
> diff --git a/drivers/iommu/amd/iommu_v2.c b/drivers/iommu/amd/iommu_v2.c
> index 696d5555be5794..6a1f02c62dffcc 100644
> --- a/drivers/iommu/amd/iommu_v2.c
> +++ b/drivers/iommu/amd/iommu_v2.c
> @@ -777,6 +777,8 @@ int amd_iommu_init_device(struct pci_dev *pdev, int pasids)
> if (dev_state->domain == NULL)
> goto out_free_states;
>
> + /* See iommu_is_default_domain() */
> + dev_state->domain->type = IOMMU_DOMAIN_IDENTITY;
> amd_iommu_domain_direct_map(dev_state->domain);

Same question as 6 months ago, apparently: allocating an unmanaged
domain with a pagetable then sucking out the pagetable is silly enough,
but if we're going to then also call it a proper identity domain, we
should really just allocate an identity domain directly; but then why
not just enable_v2 on the identity domain that we know is already there
courtesy of def_domain_type?

Robin.


Intentional whitespace-damaged patch based on wrong kernel version to
stress how much it is for illustration purposes only:
----->8-----
diff --git a/drivers/iommu/amd/iommu_v2.c b/drivers/iommu/amd/iommu_v2.c
index fb61bdca4c2c..2925103cd71a 100644
--- a/drivers/iommu/amd/iommu_v2.c
+++ b/drivers/iommu/amd/iommu_v2.c
@@ -135,9 +135,6 @@ static void free_device_state(struct device_state
*dev_state)

iommu_group_put(group);

- /* Everything is down now, free the IOMMUv2 domain */
- iommu_domain_free(dev_state->domain);
-
/* Finally get rid of the device-state */
kfree(dev_state);
}
@@ -730,7 +727,6 @@ EXPORT_SYMBOL(amd_iommu_unbind_pasid);
int amd_iommu_init_device(struct pci_dev *pdev, int pasids)
{
struct device_state *dev_state;
- struct iommu_group *group;
unsigned long flags;
int ret, tmp;
u16 devid;
@@ -773,34 +769,20 @@ int amd_iommu_init_device(struct pci_dev *pdev,
int pasids)
if (dev_state->states == NULL)
goto out_free_dev_state;

- dev_state->domain = iommu_domain_alloc(&pci_bus_type);
- if (dev_state->domain == NULL)
+ dev_state->domain = iommu_get_domain_for_dev(&pdev->dev);
+ if (dev_state->domain->type != IOMMU_DOMAIN_IDENTITY)
goto out_free_states;

- amd_iommu_domain_direct_map(dev_state->domain);
-
ret = amd_iommu_domain_enable_v2(dev_state->domain, pasids);
if (ret)
- goto out_free_domain;
-
- group = iommu_group_get(&pdev->dev);
- if (!group) {
- ret = -EINVAL;
- goto out_free_domain;
- }
-
- ret = iommu_attach_group(dev_state->domain, group);
- if (ret != 0)
- goto out_drop_group;
-
- iommu_group_put(group);
+ goto out_free_states;

spin_lock_irqsave(&state_lock, flags);

if (__get_device_state(devid) != NULL) {
spin_unlock_irqrestore(&state_lock, flags);
ret = -EBUSY;
- goto out_free_domain;
+ goto out_free_states;
}

list_add_tail(&dev_state->list, &state_list);
@@ -809,12 +791,6 @@ int amd_iommu_init_device(struct pci_dev *pdev, int
pasids)

return 0;

-out_drop_group:
- iommu_group_put(group);
-
-out_free_domain:
- iommu_domain_free(dev_state->domain);
-
out_free_states:
free_page((unsigned long)dev_state->states);

2022-08-24 00:26:57

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [REGRESSION 5.19.x] AMD HD-audio devices missing on 5.19

On Tue, Aug 23, 2022 at 10:01:57PM +0100, Robin Murphy wrote:

> > diff --git a/drivers/iommu/amd/iommu_v2.c b/drivers/iommu/amd/iommu_v2.c
> > index 696d5555be5794..6a1f02c62dffcc 100644
> > --- a/drivers/iommu/amd/iommu_v2.c
> > +++ b/drivers/iommu/amd/iommu_v2.c
> > @@ -777,6 +777,8 @@ int amd_iommu_init_device(struct pci_dev *pdev, int pasids)
> > if (dev_state->domain == NULL)
> > goto out_free_states;
> > + /* See iommu_is_default_domain() */
> > + dev_state->domain->type = IOMMU_DOMAIN_IDENTITY;
> > amd_iommu_domain_direct_map(dev_state->domain);
>
> Same question as 6 months ago, apparently: allocating an unmanaged domain
> with a pagetable then sucking out the pagetable is silly enough, but if
> we're going to then also call it a proper identity domain, we should really
> just allocate an identity domain directly; but then why not just enable_v2
> on the identity domain that we know is already there courtesy of
> def_domain_type?

Yeah, nobody who knows this code answered that question either..

Looking at it a bit, I think this comment will start to be a problem:

/*
* Save us all sanity checks whether devices already in the
* domain support IOMMUv2. Just force that the domain has no
* devices attached when it is switched into IOMMUv2 mode.
*/
ret = -EBUSY;
if (domain->dev_cnt > 0 || domain->flags & PD_IOMMUV2_MASK)
goto out;

Beacuse we should have dev_cnt != 0 on the existing identity domain at
this point - worse if the probe order is backwards the sound driver
may even already be running when we reach this.

Plus the challenge of undoing it when the PASID user goes away.

Overall I can see how it is easier and more logical to transition
between two domains. We already have good infrastructure for doing
that.

From a core perspective I don't have a real problem with iommu drivers
using multiple iommu_domains to manage their internal operations, eg
for different operating modes. But you are right that it should be
cleaner and directly allocate the special domains it needs. This would
be much more self-descriptive if it called a function 'allocate v2
identity domain', for instance.

I think it would also make sense for the core to provide some API to
change the default domain (ie dma API domain) of a group, and that
would be a more logical, and self explanatory, API for iommu drivers
to use than attach/detach. ie:

iommu_change_default_domain(group, amd_identity_domain_v2):
iommu_change_default_domain(group, amd_identity_domain_v1):

At least for this effort I wanted something simple enough to backport
that maybe doesn't need to be an expert in the amd iommu to write..
[I checked some more and the hack to change the type looks like it is
OK on the free path, so maybe this even works]

My general hope is that we can convince AMD to work on this once the
generic PASID & PRI series lands, as this entire private path to the
GPU driver and non-standard PASID handling all needs to be aligned
with the upcoming core code. When doing that work it would make sense
to tidy and modernize this better. I added a bunch of AMD people to
this thread to that end. It sure would be good if AMD participated in
that series since they are going to have to use it too.

https://lore.kernel.org/linux-iommu/[email protected]/

Regards,
Jason

2022-08-24 07:29:06

by Thorsten Leemhuis

[permalink] [raw]
Subject: Re: [REGRESSION 5.19.x] AMD HD-audio devices missing on 5.19

[TLDR: I'm adding this regression report to the list of tracked
regressions; all text from me you find below is based on a few templates
paragraphs you might have encountered already already in similar form.]

TWIMC: this mail is primarily send for documentation purposes and for
regzbot, my Linux kernel regression tracking bot. These mails usually
contain '#forregzbot' in the subject, to make them easy to spot and filter.

Hi, this is your Linux kernel regression tracker.

On 22.08.22 16:12, Takashi Iwai wrote:
>
> we've received regression reports about the missing HD-audio devices
> on AMD platforms, and this turned out to be caused by the commit
> 512881eacfa72c2136b27b9934b7b27504a9efc2
> bus: platform,amba,fsl-mc,PCI: Add device DMA ownership management
>
> The details are found in openSUSE bugzilla:
> https://bugzilla.suse.com/show_bug.cgi?id=1202492
>
> The problem seems to be that HD-audio (both onboard analog and HDMI)
> PCI devices are assigned to the same IOMMU group as AMD graphics PCI
> device, and once after the AMDGPU is initialized beforehand, those
> audio devices can't be probed since iommu_device_use_default_domain()
> returns -EBUSY.
>
> I'm not sure whether it's specific to PCI bus due to the assignment
> logic of those IOMMU groups, or it's about the handling of the active
> domain assignment. In anyway, disabling IOMMU works around the
> problem, and passing driver_managed_dma flag to the HD-audio driver
> was also confirmed to work around it, too.
>
> The problem persists with 6.0-rc1 kernel.
>
> If you have / can give any fix patch or debug patch, let me know; I'll
> build test kernels and ask the reporters.

Thanks for the report. To be sure below issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, my Linux kernel regression
tracking bot:

#regzbot introduced 512881eacfa72c2136b27b99 ^
https://bugzilla.suse.com/show_bug.cgi?id=1202492
#regzbot title AMD HD-audio devices missing on 5.19
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply -- ideally with also
telling regzbot about it, as explained here:
https://linux-regtracking.leemhuis.info/tracked-regression/

Reminder for developers: When fixing the issue, add 'Link:' tags
pointing to the report (the mail this one replies to), as explained for
in the Linux kernel's documentation; above webpage explains why this is
important for tracked regressions.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.

2022-09-06 16:40:52

by Takashi Iwai

[permalink] [raw]
Subject: Re: [REGRESSION 5.19.x] AMD HD-audio devices missing on 5.19

On Tue, 06 Sep 2022 17:53:44 +0200,
Jason Gunthorpe wrote:
>
> On Tue, Sep 06, 2022 at 05:52:34PM +0200, Takashi Iwai wrote:
> > On Tue, 06 Sep 2022 17:41:51 +0200,
> > Jason Gunthorpe wrote:
> > >
> > > On Tue, Aug 23, 2022 at 05:28:24PM -0300, Jason Gunthorpe wrote:
> > > > On Tue, Aug 23, 2022 at 01:46:36PM +0200, Takashi Iwai wrote:
> > > > > It was tested now and confirmed that the call path is via AMDGPU, as
> > > > > expected:
> > > > > amdgpu_pci_probe ->
> > > > > amdgpu_driver_load_kms ->
> > > > > amdgpu_device_init ->
> > > > > amdgpu_amdkfd_device_init ->
> > > > > kgd2kfd_device_init ->
> > > > > kgd2kfd_resume_iommu ->
> > > > > kfd_iommu_resume ->
> > > > > amd_iommu_init_device ->
> > > > > iommu_attach_group ->
> > > > > __iommu_attach_group
> > > >
> > > > Oh, when you said sound intel I thought this was an Intel CPU..
> > > >
> > > > Yes, there is this hacky private path from the amdgpu to
> > > > the amd iommu driver that makes a mess of it here. We discussed it in
> > > > this thread:
> > > >
> > > > https://lore.kernel.org/linux-iommu/YgtuJQhY8SNlv9%[email protected]/
> > > >
> > > > But nobody put it together that it would be a problem with this.
> > > >
> > > > Something like this, perhaps, but I didn't check if overriding the
> > > > type would cause other problems.
> > >
> > > Takashi, do we want to do this patch?
> >
> > I really have no much preference regarding the fix for this
> > regression from my side. If you can work on it, it'd be greatly
> > appreciated.
>
> If you say this patch works I will formally propose it, but I have no
> ability to test on this special AMD HW.

Erm, it's not clear which patch you're referring to. The mentioned
URL points to a patch series.

If you can send (or point) a patch for test, I can set up a test
kernel and ask reporters for testing it.


thanks,

Takashi

2022-09-06 17:01:29

by Takashi Iwai

[permalink] [raw]
Subject: Re: [REGRESSION 5.19.x] AMD HD-audio devices missing on 5.19

On Tue, 06 Sep 2022 17:41:51 +0200,
Jason Gunthorpe wrote:
>
> On Tue, Aug 23, 2022 at 05:28:24PM -0300, Jason Gunthorpe wrote:
> > On Tue, Aug 23, 2022 at 01:46:36PM +0200, Takashi Iwai wrote:
> > > It was tested now and confirmed that the call path is via AMDGPU, as
> > > expected:
> > > amdgpu_pci_probe ->
> > > amdgpu_driver_load_kms ->
> > > amdgpu_device_init ->
> > > amdgpu_amdkfd_device_init ->
> > > kgd2kfd_device_init ->
> > > kgd2kfd_resume_iommu ->
> > > kfd_iommu_resume ->
> > > amd_iommu_init_device ->
> > > iommu_attach_group ->
> > > __iommu_attach_group
> >
> > Oh, when you said sound intel I thought this was an Intel CPU..
> >
> > Yes, there is this hacky private path from the amdgpu to
> > the amd iommu driver that makes a mess of it here. We discussed it in
> > this thread:
> >
> > https://lore.kernel.org/linux-iommu/YgtuJQhY8SNlv9%[email protected]/
> >
> > But nobody put it together that it would be a problem with this.
> >
> > Something like this, perhaps, but I didn't check if overriding the
> > type would cause other problems.
>
> Takashi, do we want to do this patch?

I really have no much preference regarding the fix for this
regression from my side. If you can work on it, it'd be greatly
appreciated.


thanks,

Takashi

2022-09-06 17:03:06

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [REGRESSION 5.19.x] AMD HD-audio devices missing on 5.19

On Tue, Sep 06, 2022 at 06:04:27PM +0200, Takashi Iwai wrote:

> If you can send (or point) a patch for test, I can set up a test
> kernel and ask reporters for testing it.

https://lore.kernel.org/all/[email protected]/

Jason

2022-09-06 17:03:33

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [REGRESSION 5.19.x] AMD HD-audio devices missing on 5.19

On Tue, Sep 06, 2022 at 05:52:34PM +0200, Takashi Iwai wrote:
> On Tue, 06 Sep 2022 17:41:51 +0200,
> Jason Gunthorpe wrote:
> >
> > On Tue, Aug 23, 2022 at 05:28:24PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Aug 23, 2022 at 01:46:36PM +0200, Takashi Iwai wrote:
> > > > It was tested now and confirmed that the call path is via AMDGPU, as
> > > > expected:
> > > > amdgpu_pci_probe ->
> > > > amdgpu_driver_load_kms ->
> > > > amdgpu_device_init ->
> > > > amdgpu_amdkfd_device_init ->
> > > > kgd2kfd_device_init ->
> > > > kgd2kfd_resume_iommu ->
> > > > kfd_iommu_resume ->
> > > > amd_iommu_init_device ->
> > > > iommu_attach_group ->
> > > > __iommu_attach_group
> > >
> > > Oh, when you said sound intel I thought this was an Intel CPU..
> > >
> > > Yes, there is this hacky private path from the amdgpu to
> > > the amd iommu driver that makes a mess of it here. We discussed it in
> > > this thread:
> > >
> > > https://lore.kernel.org/linux-iommu/YgtuJQhY8SNlv9%[email protected]/
> > >
> > > But nobody put it together that it would be a problem with this.
> > >
> > > Something like this, perhaps, but I didn't check if overriding the
> > > type would cause other problems.
> >
> > Takashi, do we want to do this patch?
>
> I really have no much preference regarding the fix for this
> regression from my side. If you can work on it, it'd be greatly
> appreciated.

If you say this patch works I will formally propose it, but I have no
ability to test on this special AMD HW.

Thanks,
Jason

2022-09-06 17:04:24

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [REGRESSION 5.19.x] AMD HD-audio devices missing on 5.19

On Tue, Aug 23, 2022 at 05:28:24PM -0300, Jason Gunthorpe wrote:
> On Tue, Aug 23, 2022 at 01:46:36PM +0200, Takashi Iwai wrote:
> > It was tested now and confirmed that the call path is via AMDGPU, as
> > expected:
> > amdgpu_pci_probe ->
> > amdgpu_driver_load_kms ->
> > amdgpu_device_init ->
> > amdgpu_amdkfd_device_init ->
> > kgd2kfd_device_init ->
> > kgd2kfd_resume_iommu ->
> > kfd_iommu_resume ->
> > amd_iommu_init_device ->
> > iommu_attach_group ->
> > __iommu_attach_group
>
> Oh, when you said sound intel I thought this was an Intel CPU..
>
> Yes, there is this hacky private path from the amdgpu to
> the amd iommu driver that makes a mess of it here. We discussed it in
> this thread:
>
> https://lore.kernel.org/linux-iommu/YgtuJQhY8SNlv9%[email protected]/
>
> But nobody put it together that it would be a problem with this.
>
> Something like this, perhaps, but I didn't check if overriding the
> type would cause other problems.

Takashi, do we want to do this patch?

Thanks,
Jason

2022-09-06 17:14:00

by Takashi Iwai

[permalink] [raw]
Subject: Re: [REGRESSION 5.19.x] AMD HD-audio devices missing on 5.19

On Tue, 06 Sep 2022 18:08:50 +0200,
Jason Gunthorpe wrote:
>
> On Tue, Sep 06, 2022 at 06:04:27PM +0200, Takashi Iwai wrote:
>
> > If you can send (or point) a patch for test, I can set up a test
> > kernel and ask reporters for testing it.
>
> https://lore.kernel.org/all/[email protected]/

Ah, that one!

OK, will ask the reporters with a test kernel with that patch.


thanks,

Takashi

2022-09-07 10:18:50

by Takashi Iwai

[permalink] [raw]
Subject: Re: [REGRESSION 5.19.x] AMD HD-audio devices missing on 5.19

On Tue, 06 Sep 2022 18:16:50 +0200,
Takashi Iwai wrote:
>
> On Tue, 06 Sep 2022 18:08:50 +0200,
> Jason Gunthorpe wrote:
> >
> > On Tue, Sep 06, 2022 at 06:04:27PM +0200, Takashi Iwai wrote:
> >
> > > If you can send (or point) a patch for test, I can set up a test
> > > kernel and ask reporters for testing it.
> >
> > https://lore.kernel.org/all/[email protected]/
>
> Ah, that one!
>
> OK, will ask the reporters with a test kernel with that patch.

According to both reporters, unfortunately it caused a hang at boot
somewhere. Only the reset button helped.


Takashi

2022-09-07 13:39:07

by Takashi Iwai

[permalink] [raw]
Subject: Re: [REGRESSION 5.19.x] AMD HD-audio devices missing on 5.19

On Tue, 23 Aug 2022 22:28:24 +0200,
Jason Gunthorpe wrote:
>
> On Tue, Aug 23, 2022 at 01:46:36PM +0200, Takashi Iwai wrote:
> > It was tested now and confirmed that the call path is via AMDGPU, as
> > expected:
> > amdgpu_pci_probe ->
> > amdgpu_driver_load_kms ->
> > amdgpu_device_init ->
> > amdgpu_amdkfd_device_init ->
> > kgd2kfd_device_init ->
> > kgd2kfd_resume_iommu ->
> > kfd_iommu_resume ->
> > amd_iommu_init_device ->
> > iommu_attach_group ->
> > __iommu_attach_group
>
> Oh, when you said sound intel I thought this was an Intel CPU..
>
> Yes, there is this hacky private path from the amdgpu to
> the amd iommu driver that makes a mess of it here. We discussed it in
> this thread:
>
> https://lore.kernel.org/linux-iommu/YgtuJQhY8SNlv9%[email protected]/
>
> But nobody put it together that it would be a problem with this.
>
> Something like this, perhaps, but I didn't check if overriding the
> type would cause other problems.
>
> diff --git a/drivers/iommu/amd/iommu_v2.c b/drivers/iommu/amd/iommu_v2.c
> index 696d5555be5794..6a1f02c62dffcc 100644
> --- a/drivers/iommu/amd/iommu_v2.c
> +++ b/drivers/iommu/amd/iommu_v2.c
> @@ -777,6 +777,8 @@ int amd_iommu_init_device(struct pci_dev *pdev, int pasids)
> if (dev_state->domain == NULL)
> goto out_free_states;
>
> + /* See iommu_is_default_domain() */
> + dev_state->domain->type = IOMMU_DOMAIN_IDENTITY;
> amd_iommu_domain_direct_map(dev_state->domain);
>
> ret = amd_iommu_domain_enable_v2(dev_state->domain, pasids);
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 780fb70715770d..fe8bd17f52314b 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -3076,6 +3076,24 @@ static ssize_t iommu_group_store_type(struct iommu_group *group,
> return ret;
> }
>
> +static bool iommu_is_default_domain(struct iommu_group *group)
> +{
> + if (group->domain == group->default_domain)
> + return true;
> +
> + /*
> + * If the default domain was set to identity and it is still an identity
> + * domain then we consider this a pass. This happens because of
> + * amd_iommu_init_device() replacing the default idenity domain with an
> + * identity domain that has a different configuration for AMDGPU.
> + */
> + if (group->default_domain &&
> + group->default_domain->type == IOMMU_DOMAIN_IDENTITY &&
> + group->domain && group->domain->type == IOMMU_DOMAIN_IDENTITY)
> + return true;
> + return false;
> +}
> +
> /**
> * iommu_device_use_default_domain() - Device driver wants to handle device
> * DMA through the kernel DMA API.
> @@ -3094,8 +3112,7 @@ int iommu_device_use_default_domain(struct device *dev)
>
> mutex_lock(&group->mutex);
> if (group->owner_cnt) {
> - if (group->domain != group->default_domain ||
> - group->owner) {
> + if (group->owner || iommu_is_default_domain(group)) {

Isn't this rather
if (group->owner || !iommu_is_default_domain(group)) {
?

I'll rebuild the kernel with this change and ask reporters again.


Takashi

2022-09-07 14:03:13

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [REGRESSION 5.19.x] AMD HD-audio devices missing on 5.19

On Wed, Sep 07, 2022 at 03:28:31PM +0200, Takashi Iwai wrote:
> > /**
> > * iommu_device_use_default_domain() - Device driver wants to handle device
> > * DMA through the kernel DMA API.
> > @@ -3094,8 +3112,7 @@ int iommu_device_use_default_domain(struct device *dev)
> >
> > mutex_lock(&group->mutex);
> > if (group->owner_cnt) {
> > - if (group->domain != group->default_domain ||
> > - group->owner) {
> > + if (group->owner || iommu_is_default_domain(group)) {
>
> Isn't this rather
> if (group->owner || !iommu_is_default_domain(group)) {
> ?
>
> I'll rebuild the kernel with this change and ask reporters again.

Oh yes, good eyes, that probably crashes on boot too

Jason

2022-09-08 07:09:10

by Takashi Iwai

[permalink] [raw]
Subject: Re: [REGRESSION 5.19.x] AMD HD-audio devices missing on 5.19

On Wed, 07 Sep 2022 15:48:33 +0200,
Jason Gunthorpe wrote:
>
> On Wed, Sep 07, 2022 at 03:28:31PM +0200, Takashi Iwai wrote:
> > > /**
> > > * iommu_device_use_default_domain() - Device driver wants to handle device
> > > * DMA through the kernel DMA API.
> > > @@ -3094,8 +3112,7 @@ int iommu_device_use_default_domain(struct device *dev)
> > >
> > > mutex_lock(&group->mutex);
> > > if (group->owner_cnt) {
> > > - if (group->domain != group->default_domain ||
> > > - group->owner) {
> > > + if (group->owner || iommu_is_default_domain(group)) {
> >
> > Isn't this rather
> > if (group->owner || !iommu_is_default_domain(group)) {
> > ?
> >
> > I'll rebuild the kernel with this change and ask reporters again.
>
> Oh yes, good eyes, that probably crashes on boot too

Indeed the corrected patch seems working; it boots and the HD-audio
device appears again.

Can we go forward with this?


thanks,

Takashi

2022-09-19 16:01:04

by Thorsten Leemhuis

[permalink] [raw]
Subject: Re: [REGRESSION 5.19.x] AMD HD-audio devices missing on 5.19 #forregzbot

TWIMC: this mail is primarily send for documentation purposes and for
regzbot, my Linux kernel regression tracking bot. These mails usually
contain '#forregzbot' in the subject, to make them easy to spot and filter.

On 24.08.22 09:13, Thorsten Leemhuis wrote:

> On 22.08.22 16:12, Takashi Iwai wrote:
>>
>> we've received regression reports about the missing HD-audio devices
>> on AMD platforms, and this turned out to be caused by the commit
>> 512881eacfa72c2136b27b9934b7b27504a9efc2
>> bus: platform,amba,fsl-mc,PCI: Add device DMA ownership management
>>
>> The details are found in openSUSE bugzilla:
>> https://bugzilla.suse.com/show_bug.cgi?id=1202492
>>
>> The problem seems to be that HD-audio (both onboard analog and HDMI)
>> PCI devices are assigned to the same IOMMU group as AMD graphics PCI
>> device, and once after the AMDGPU is initialized beforehand, those
>> audio devices can't be probed since iommu_device_use_default_domain()
>> returns -EBUSY.
>>
>> I'm not sure whether it's specific to PCI bus due to the assignment
>> logic of those IOMMU groups, or it's about the handling of the active
>> domain assignment. In anyway, disabling IOMMU works around the
>> problem, and passing driver_managed_dma flag to the HD-audio driver
>> was also confirmed to work around it, too.
>>
>> The problem persists with 6.0-rc1 kernel.
>>
>> If you have / can give any fix patch or debug patch, let me know; I'll
>> build test kernels and ask the reporters.
>
> Thanks for the report. To be sure below issue doesn't fall through the
> cracks unnoticed, I'm adding it to regzbot, my Linux kernel regression
> tracking bot:
>
> #regzbot introduced 512881eacfa72c2136b27b99 ^
> https://bugzilla.suse.com/show_bug.cgi?id=1202492
> #regzbot title AMD HD-audio devices missing on 5.19
> #regzbot ignore-activity

#regzbot fixed-by: 2380f1e8195ef612deea