Hi,
I have compiled Linux v5.5-rc1 and thought all was good until I
hot-removed a Gigabyte Aorus eGPU from Thunderbolt. The driver for the
GPU was not loaded (blacklisted) so the crash is nothing to do with the
GPU driver.
We had:
- kernel NULL pointer dereference
- refcount_t: underflow; use-after-free.
Attaching dmesg for now; will bisect and come back with results.
I am not sure why only this eGPU causes the issue - all my other devices
do not trigger the bug.
I will shortly re-post my 4-patch series with some trivial alterations for v5.5.
Kind regards,
Nicholas
On Mon, Dec 09, 2019 at 12:34:04PM +0000, Nicholas Johnson wrote:
> Hi,
>
> I have compiled Linux v5.5-rc1 and thought all was good until I
> hot-removed a Gigabyte Aorus eGPU from Thunderbolt. The driver for the
> GPU was not loaded (blacklisted) so the crash is nothing to do with the
> GPU driver.
>
> We had:
> - kernel NULL pointer dereference
> - refcount_t: underflow; use-after-free.
>
> Attaching dmesg for now; will bisect and come back with results.
Looks like something related to iommu. Does it work if you disable it?
(intel_iommu=off in the command line).
On Mon, Dec 09, 2019 at 03:12:39PM +0200, [email protected] wrote:
> On Mon, Dec 09, 2019 at 12:34:04PM +0000, Nicholas Johnson wrote:
> > Hi,
> >
> > I have compiled Linux v5.5-rc1 and thought all was good until I
> > hot-removed a Gigabyte Aorus eGPU from Thunderbolt. The driver for the
> > GPU was not loaded (blacklisted) so the crash is nothing to do with the
> > GPU driver.
> >
> > We had:
> > - kernel NULL pointer dereference
> > - refcount_t: underflow; use-after-free.
> >
> > Attaching dmesg for now; will bisect and come back with results.
>
> Looks like something related to iommu. Does it work if you disable it?
> (intel_iommu=off in the command line).
I thought it could be that, too.
The attachment "dmesg-4" from the original email is with iommu parameters.
The attachment "dmesg-5" from the original email is with no iommu parameters.
Attaching here "dmesg-6" with the iommu explicitly set off like you said.
No difference, still broken. Although, with iommu off, there are less stack traces.
Could it be sysfs-related?
Kind regards,
Nicholas
On Mon, Dec 09, 2019 at 03:12:39PM +0200, [email protected] wrote:
> On Mon, Dec 09, 2019 at 12:34:04PM +0000, Nicholas Johnson wrote:
> > Hi,
> >
> > I have compiled Linux v5.5-rc1 and thought all was good until I
> > hot-removed a Gigabyte Aorus eGPU from Thunderbolt. The driver for the
> > GPU was not loaded (blacklisted) so the crash is nothing to do with the
> > GPU driver.
> >
> > We had:
> > - kernel NULL pointer dereference
> > - refcount_t: underflow; use-after-free.
> >
> > Attaching dmesg for now; will bisect and come back with results.
>
> Looks like something related to iommu. Does it work if you disable it?
> (intel_iommu=off in the command line).
On Mon, Dec 09, 2019 at 03:12:39PM +0200, [email protected] wrote:
> On Mon, Dec 09, 2019 at 12:34:04PM +0000, Nicholas Johnson wrote:
> > Hi,
> >
> > I have compiled Linux v5.5-rc1 and thought all was good until I
> > hot-removed a Gigabyte Aorus eGPU from Thunderbolt. The driver for the
> > GPU was not loaded (blacklisted) so the crash is nothing to do with the
> > GPU driver.
> >
> > We had:
> > - kernel NULL pointer dereference
> > - refcount_t: underflow; use-after-free.
> >
> > Attaching dmesg for now; will bisect and come back with results.
>
> Looks like something related to iommu. Does it work if you disable it?
> (intel_iommu=off in the command line).
I thought it could be that, too.
The attachment "dmesg-4" from the original email is with iommu parameters.
The attachment "dmesg-5" from the original email is with no iommu parameters.
Attaching here "dmesg-6" with the iommu explicitly set off like you said.
No difference, still broken. Although, with iommu off, there are less stack traces.
Could it be sysfs-related?
Kind regards,
Nicholas
On Mon, Dec 09, 2019 at 01:33:49PM +0000, Nicholas Johnson wrote:
> On Mon, Dec 09, 2019 at 03:12:39PM +0200, [email protected] wrote:
> > On Mon, Dec 09, 2019 at 12:34:04PM +0000, Nicholas Johnson wrote:
> > > Hi,
> > >
> > > I have compiled Linux v5.5-rc1 and thought all was good until I
> > > hot-removed a Gigabyte Aorus eGPU from Thunderbolt. The driver for the
> > > GPU was not loaded (blacklisted) so the crash is nothing to do with the
> > > GPU driver.
> > >
> > > We had:
> > > - kernel NULL pointer dereference
> > > - refcount_t: underflow; use-after-free.
> > >
> > > Attaching dmesg for now; will bisect and come back with results.
> >
> > Looks like something related to iommu. Does it work if you disable it?
> > (intel_iommu=off in the command line).
> On Mon, Dec 09, 2019 at 03:12:39PM +0200, [email protected] wrote:
> > On Mon, Dec 09, 2019 at 12:34:04PM +0000, Nicholas Johnson wrote:
> > > Hi,
> > >
> > > I have compiled Linux v5.5-rc1 and thought all was good until I
> > > hot-removed a Gigabyte Aorus eGPU from Thunderbolt. The driver for the
> > > GPU was not loaded (blacklisted) so the crash is nothing to do with the
> > > GPU driver.
> > >
> > > We had:
> > > - kernel NULL pointer dereference
> > > - refcount_t: underflow; use-after-free.
> > >
> > > Attaching dmesg for now; will bisect and come back with results.
> >
> > Looks like something related to iommu. Does it work if you disable it?
> > (intel_iommu=off in the command line).
> I thought it could be that, too.
>
> The attachment "dmesg-4" from the original email is with iommu parameters.
> The attachment "dmesg-5" from the original email is with no iommu parameters.
> Attaching here "dmesg-6" with the iommu explicitly set off like you said.
>
> No difference, still broken. Although, with iommu off, there are less stack traces.
>
> Could it be sysfs-related?
Bisect would probably be the best option to find the culprit commit.
There are couple of commits done for pciehp so reverting them one by one
may help as well:
87d0f2a5536f PCI: pciehp: Prevent deadlock on disconnect
75fcc0ce72e5 PCI: pciehp: Do not disable interrupt twice on suspend
b94ec12dfaee PCI: pciehp: Refactor infinite loop in pcie_poll_cmd()
157c1062fcd8 PCI: pciehp: Avoid returning prematurely from sysfs requests
On Tue, Dec 10, 2019 at 09:28:00AM +0200, [email protected] wrote:
> On Mon, Dec 09, 2019 at 01:33:49PM +0000, Nicholas Johnson wrote:
> > On Mon, Dec 09, 2019 at 03:12:39PM +0200, [email protected] wrote:
> > > On Mon, Dec 09, 2019 at 12:34:04PM +0000, Nicholas Johnson wrote:
> > > > Hi,
> > > >
> > > > I have compiled Linux v5.5-rc1 and thought all was good until I
> > > > hot-removed a Gigabyte Aorus eGPU from Thunderbolt. The driver for the
> > > > GPU was not loaded (blacklisted) so the crash is nothing to do with the
> > > > GPU driver.
> > > >
> > > > We had:
> > > > - kernel NULL pointer dereference
> > > > - refcount_t: underflow; use-after-free.
> > > >
> > > > Attaching dmesg for now; will bisect and come back with results.
> > >
> > > Looks like something related to iommu. Does it work if you disable it?
> > > (intel_iommu=off in the command line).
> > On Mon, Dec 09, 2019 at 03:12:39PM +0200, [email protected] wrote:
> > > On Mon, Dec 09, 2019 at 12:34:04PM +0000, Nicholas Johnson wrote:
> > > > Hi,
> > > >
> > > > I have compiled Linux v5.5-rc1 and thought all was good until I
> > > > hot-removed a Gigabyte Aorus eGPU from Thunderbolt. The driver for the
> > > > GPU was not loaded (blacklisted) so the crash is nothing to do with the
> > > > GPU driver.
> > > >
> > > > We had:
> > > > - kernel NULL pointer dereference
> > > > - refcount_t: underflow; use-after-free.
> > > >
> > > > Attaching dmesg for now; will bisect and come back with results.
> > >
> > > Looks like something related to iommu. Does it work if you disable it?
> > > (intel_iommu=off in the command line).
> > I thought it could be that, too.
> >
> > The attachment "dmesg-4" from the original email is with iommu parameters.
> > The attachment "dmesg-5" from the original email is with no iommu parameters.
> > Attaching here "dmesg-6" with the iommu explicitly set off like you said.
> >
> > No difference, still broken. Although, with iommu off, there are less stack traces.
> >
> > Could it be sysfs-related?
>
> Bisect would probably be the best option to find the culprit commit.
> There are couple of commits done for pciehp so reverting them one by one
> may help as well:
>
> 87d0f2a5536f PCI: pciehp: Prevent deadlock on disconnect
> 75fcc0ce72e5 PCI: pciehp: Do not disable interrupt twice on suspend
> b94ec12dfaee PCI: pciehp: Refactor infinite loop in pcie_poll_cmd()
> 157c1062fcd8 PCI: pciehp: Avoid returning prematurely from sysfs requests
You are not going to believe this. The offending commit is in the SOUND
subsystem. I thought I had messed up the bisect when only sound commits
were showing near the end.
And yes, I double checked.
Reverted, compiled, tested that it started working.
Reapplied, compiled, tested that it stopped working.
Twice.
The following is the culprit responsible for the issues:
commit 586bc4aab878efcf672536f0cdec3d04b6990c94
Author: Alex Deucher <[email protected]>
Date: Fri Nov 22 16:43:50 2019 -0500
ALSA: hda/hdmi - fix vgaswitcheroo detection for AMD
It is playing with PCI devices. Clearly they did not consider
hot-removal. I am guessing it is seeing the audio PCI func of the AMD
card in that Thunderbolt eGPU enclosure.
I will collect information, make a bugzilla report and contact the AMD
team. If anybody wants to be cc'd in then let me know. Sorry for
bothering you and Bjorn with something which actually has nothing
directly to do with the PCI subsystem or Thunderbolt.
I strongly hope that the upcoming Intel Xe GPU driver allows for
surprise-removal in Linux without any crashing of kernel or userspace.
The amdgpu and nouveau drivers do not take to surprise removal kindly,
even without the above sound bug applying to AMD.
Kind regards,
Nicholas
[cc += Alex, Takashi]
On Tue, Dec 10, 2019 at 12:00:23PM +0000, Nicholas Johnson wrote:
> On Tue, Dec 10, 2019 at 09:28:00AM +0200, [email protected] wrote:
> > On Mon, Dec 09, 2019 at 01:33:49PM +0000, Nicholas Johnson wrote:
> > > On Mon, Dec 09, 2019 at 03:12:39PM +0200, [email protected] wrote:
> > > > On Mon, Dec 09, 2019 at 12:34:04PM +0000, Nicholas Johnson wrote:
> > > > > I have compiled Linux v5.5-rc1 and thought all was good until I
> > > > > hot-removed a Gigabyte Aorus eGPU from Thunderbolt. The driver for the
> > > > > GPU was not loaded (blacklisted) so the crash is nothing to do with the
> > > > > GPU driver.
> > > > >
> > > > > We had:
> > > > > - kernel NULL pointer dereference
> > > > > - refcount_t: underflow; use-after-free.
>
> The following is the culprit responsible for the issues:
>
> commit 586bc4aab878efcf672536f0cdec3d04b6990c94
> Author: Alex Deucher <[email protected]>
> Date: Fri Nov 22 16:43:50 2019 -0500
>
> ALSA: hda/hdmi - fix vgaswitcheroo detection for AMD
Does the below fix the issue?
-- >8 --
diff --git a/sound/pci/hda/hda_intel.c b/sound/pci/hda/hda_intel.c
index 35b4526f0d28..b856b89378ac 100644
--- a/sound/pci/hda/hda_intel.c
+++ b/sound/pci/hda/hda_intel.c
@@ -1419,7 +1419,6 @@ static bool atpx_present(void)
return true;
}
}
- pci_dev_put(pdev);
}
return false;
}
On Tue, Dec 10, 2019 at 12:00:23PM +0000, Nicholas Johnson wrote:
> The following is the culprit responsible for the issues:
>
> commit 586bc4aab878efcf672536f0cdec3d04b6990c94
> Author: Alex Deucher <[email protected]>
> Date: Fri Nov 22 16:43:50 2019 -0500
>
> ALSA: hda/hdmi - fix vgaswitcheroo detection for AMD
>
> It is playing with PCI devices. Clearly they did not consider
> hot-removal. I am guessing it is seeing the audio PCI func of the AMD
> card in that Thunderbolt eGPU enclosure.
>
> I will collect information, make a bugzilla report and contact the AMD
> team. If anybody wants to be cc'd in then let me know. Sorry for
> bothering you and Bjorn with something which actually has nothing
> directly to do with the PCI subsystem or Thunderbolt.
No problem. Thanks for taking time to bisect the issue.
On Tue, 10 Dec 2019 13:29:41 +0100,
Lukas Wunner wrote:
>
> [cc += Alex, Takashi]
>
> On Tue, Dec 10, 2019 at 12:00:23PM +0000, Nicholas Johnson wrote:
> > On Tue, Dec 10, 2019 at 09:28:00AM +0200, [email protected] wrote:
> > > On Mon, Dec 09, 2019 at 01:33:49PM +0000, Nicholas Johnson wrote:
> > > > On Mon, Dec 09, 2019 at 03:12:39PM +0200, [email protected] wrote:
> > > > > On Mon, Dec 09, 2019 at 12:34:04PM +0000, Nicholas Johnson wrote:
> > > > > > I have compiled Linux v5.5-rc1 and thought all was good until I
> > > > > > hot-removed a Gigabyte Aorus eGPU from Thunderbolt. The driver for the
> > > > > > GPU was not loaded (blacklisted) so the crash is nothing to do with the
> > > > > > GPU driver.
> > > > > >
> > > > > > We had:
> > > > > > - kernel NULL pointer dereference
> > > > > > - refcount_t: underflow; use-after-free.
> >
> > The following is the culprit responsible for the issues:
> >
> > commit 586bc4aab878efcf672536f0cdec3d04b6990c94
> > Author: Alex Deucher <[email protected]>
> > Date: Fri Nov 22 16:43:50 2019 -0500
> >
> > ALSA: hda/hdmi - fix vgaswitcheroo detection for AMD
>
> Does the below fix the issue?
>
> -- >8 --
> diff --git a/sound/pci/hda/hda_intel.c b/sound/pci/hda/hda_intel.c
> index 35b4526f0d28..b856b89378ac 100644
> --- a/sound/pci/hda/hda_intel.c
> +++ b/sound/pci/hda/hda_intel.c
> @@ -1419,7 +1419,6 @@ static bool atpx_present(void)
> return true;
> }
> }
> - pci_dev_put(pdev);
> }
> return false;
> }
Oh this looks really like a bug, even if this isn't the root cause.
Care to submit a proper patch?
thanks,
Takashi
On Tue, Dec 10, 2019 at 01:29:41PM +0100, Lukas Wunner wrote:
> [cc += Alex, Takashi]
>
> On Tue, Dec 10, 2019 at 12:00:23PM +0000, Nicholas Johnson wrote:
> > On Tue, Dec 10, 2019 at 09:28:00AM +0200, [email protected] wrote:
> > > On Mon, Dec 09, 2019 at 01:33:49PM +0000, Nicholas Johnson wrote:
> > > > On Mon, Dec 09, 2019 at 03:12:39PM +0200, [email protected] wrote:
> > > > > On Mon, Dec 09, 2019 at 12:34:04PM +0000, Nicholas Johnson wrote:
> > > > > > I have compiled Linux v5.5-rc1 and thought all was good until I
> > > > > > hot-removed a Gigabyte Aorus eGPU from Thunderbolt. The driver for the
> > > > > > GPU was not loaded (blacklisted) so the crash is nothing to do with the
> > > > > > GPU driver.
> > > > > >
> > > > > > We had:
> > > > > > - kernel NULL pointer dereference
> > > > > > - refcount_t: underflow; use-after-free.
> >
> > The following is the culprit responsible for the issues:
> >
> > commit 586bc4aab878efcf672536f0cdec3d04b6990c94
> > Author: Alex Deucher <[email protected]>
> > Date: Fri Nov 22 16:43:50 2019 -0500
> >
> > ALSA: hda/hdmi - fix vgaswitcheroo detection for AMD
>
> Does the below fix the issue?
>
> -- >8 --
> diff --git a/sound/pci/hda/hda_intel.c b/sound/pci/hda/hda_intel.c
> index 35b4526f0d28..b856b89378ac 100644
> --- a/sound/pci/hda/hda_intel.c
> +++ b/sound/pci/hda/hda_intel.c
> @@ -1419,7 +1419,6 @@ static bool atpx_present(void)
> return true;
> }
> }
> - pci_dev_put(pdev);
> }
> return false;
> }
Yes, removing the pci_dev_put() as above solves the issue.
Thanks.
Kind regards,
Nicholas.
Nicholas Johnson reports a null pointer deref as well as a refcount
underflow upon hot-removal of a Thunderbolt-attached AMD eGPU.
He's bisected the issue down to commit 586bc4aab878 ("ALSA: hda/hdmi -
fix vgaswitcheroo detection for AMD").
The commit iterates over PCI devices using pci_get_class() and
unreferences each device found, even though pci_get_class()
subsequently unreferences the device as well. Fix it.
Fixes: 586bc4aab878 ("ALSA: hda/hdmi - fix vgaswitcheroo detection for AMD")
Link: https://lore.kernel.org/r/PSXP216MB0438BFEAA0617283A834E11580580@PSXP216MB0438.KORP216.PROD.OUTLOOK.COM/
Reported-and-tested-by: Nicholas Johnson <[email protected]>
Signed-off-by: Lukas Wunner <[email protected]>
Cc: Mika Westerberg <[email protected]>
Cc: Alexander Deucher <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
---
sound/pci/hda/hda_intel.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/sound/pci/hda/hda_intel.c b/sound/pci/hda/hda_intel.c
index 35b4526f0d28..b856b89378ac 100644
--- a/sound/pci/hda/hda_intel.c
+++ b/sound/pci/hda/hda_intel.c
@@ -1419,7 +1419,6 @@ static bool atpx_present(void)
return true;
}
}
- pci_dev_put(pdev);
}
return false;
}
--
2.24.0
On Tue, 10 Dec 2019 14:39:50 +0100,
Lukas Wunner wrote:
>
> Nicholas Johnson reports a null pointer deref as well as a refcount
> underflow upon hot-removal of a Thunderbolt-attached AMD eGPU.
> He's bisected the issue down to commit 586bc4aab878 ("ALSA: hda/hdmi -
> fix vgaswitcheroo detection for AMD").
>
> The commit iterates over PCI devices using pci_get_class() and
> unreferences each device found, even though pci_get_class()
> subsequently unreferences the device as well. Fix it.
>
> Fixes: 586bc4aab878 ("ALSA: hda/hdmi - fix vgaswitcheroo detection for AMD")
> Link: https://lore.kernel.org/r/PSXP216MB0438BFEAA0617283A834E11580580@PSXP216MB0438.KORP216.PROD.OUTLOOK.COM/
> Reported-and-tested-by: Nicholas Johnson <[email protected]>
> Signed-off-by: Lukas Wunner <[email protected]>
> Cc: Mika Westerberg <[email protected]>
> Cc: Alexander Deucher <[email protected]>
> Cc: Bjorn Helgaas <[email protected]>
Applied now. Thanks.
Takashi
> ---
> sound/pci/hda/hda_intel.c | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/sound/pci/hda/hda_intel.c b/sound/pci/hda/hda_intel.c
> index 35b4526f0d28..b856b89378ac 100644
> --- a/sound/pci/hda/hda_intel.c
> +++ b/sound/pci/hda/hda_intel.c
> @@ -1419,7 +1419,6 @@ static bool atpx_present(void)
> return true;
> }
> }
> - pci_dev_put(pdev);
> }
> return false;
> }
> --
> 2.24.0
>
On Tue, Dec 10, 2019 at 02:39:50PM +0100, Lukas Wunner wrote:
> Nicholas Johnson reports a null pointer deref as well as a refcount
> underflow upon hot-removal of a Thunderbolt-attached AMD eGPU.
> He's bisected the issue down to commit 586bc4aab878 ("ALSA: hda/hdmi -
> fix vgaswitcheroo detection for AMD").
>
> The commit iterates over PCI devices using pci_get_class() and
> unreferences each device found, even though pci_get_class()
> subsequently unreferences the device as well. Fix it.
>
> Fixes: 586bc4aab878 ("ALSA: hda/hdmi - fix vgaswitcheroo detection for AMD")
> Link: https://lore.kernel.org/r/PSXP216MB0438BFEAA0617283A834E11580580@PSXP216MB0438.KORP216.PROD.OUTLOOK.COM/
> Reported-and-tested-by: Nicholas Johnson <[email protected]>
> Signed-off-by: Lukas Wunner <[email protected]>
> Cc: Mika Westerberg <[email protected]>
> Cc: Alexander Deucher <[email protected]>
> Cc: Bjorn Helgaas <[email protected]>
> ---
> sound/pci/hda/hda_intel.c | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/sound/pci/hda/hda_intel.c b/sound/pci/hda/hda_intel.c
> index 35b4526f0d28..b856b89378ac 100644
> --- a/sound/pci/hda/hda_intel.c
> +++ b/sound/pci/hda/hda_intel.c
> @@ -1419,7 +1419,6 @@ static bool atpx_present(void)
> return true;
> }
> }
> - pci_dev_put(pdev);
> }
> return false;
> }
> --
> 2.24.0
>
Extremely fast turnaround from bisect to patch.
If you want the bugzilla.kernel.org report then I can do that ASAP, but
I feel that it is redundant now.
Thank you for handling my bug report.
Regards,
Nicholas
On Tue, 10 Dec 2019 14:47:28 +0100,
Nicholas Johnson wrote:
>
> On Tue, Dec 10, 2019 at 02:39:50PM +0100, Lukas Wunner wrote:
> > Nicholas Johnson reports a null pointer deref as well as a refcount
> > underflow upon hot-removal of a Thunderbolt-attached AMD eGPU.
> > He's bisected the issue down to commit 586bc4aab878 ("ALSA: hda/hdmi -
> > fix vgaswitcheroo detection for AMD").
> >
> > The commit iterates over PCI devices using pci_get_class() and
> > unreferences each device found, even though pci_get_class()
> > subsequently unreferences the device as well. Fix it.
> >
> > Fixes: 586bc4aab878 ("ALSA: hda/hdmi - fix vgaswitcheroo detection for AMD")
> > Link: https://lore.kernel.org/r/PSXP216MB0438BFEAA0617283A834E11580580@PSXP216MB0438.KORP216.PROD.OUTLOOK.COM/
> > Reported-and-tested-by: Nicholas Johnson <[email protected]>
> > Signed-off-by: Lukas Wunner <[email protected]>
> > Cc: Mika Westerberg <[email protected]>
> > Cc: Alexander Deucher <[email protected]>
> > Cc: Bjorn Helgaas <[email protected]>
> > ---
> > sound/pci/hda/hda_intel.c | 1 -
> > 1 file changed, 1 deletion(-)
> >
> > diff --git a/sound/pci/hda/hda_intel.c b/sound/pci/hda/hda_intel.c
> > index 35b4526f0d28..b856b89378ac 100644
> > --- a/sound/pci/hda/hda_intel.c
> > +++ b/sound/pci/hda/hda_intel.c
> > @@ -1419,7 +1419,6 @@ static bool atpx_present(void)
> > return true;
> > }
> > }
> > - pci_dev_put(pdev);
> > }
> > return false;
> > }
> > --
> > 2.24.0
> >
> Extremely fast turnaround from bisect to patch.
>
> If you want the bugzilla.kernel.org report then I can do that ASAP, but
> I feel that it is redundant now.
We have now Link tag to track the ML archive, so it should suffice, I
suppose.
> Thank you for handling my bug report.
And thank you for your quick testing.
Takashi
> -----Original Message-----
> From: Lukas Wunner <[email protected]>
> Sent: Tuesday, December 10, 2019 8:40 AM
> To: Takashi Iwai <[email protected]>
> Cc: Jaroslav Kysela <[email protected]>; Deucher, Alexander
> <[email protected]>; Mika Westerberg
> <[email protected]>; Bjorn Helgaas <[email protected]>;
> Nicholas Johnson <[email protected]>; alsa-
> [email protected]; [email protected]; linux-
> [email protected]
> Subject: [PATCH] ALSA: hda/hdmi - Fix duplicate unref of pci_dev
>
> Nicholas Johnson reports a null pointer deref as well as a refcount underflow
> upon hot-removal of a Thunderbolt-attached AMD eGPU.
> He's bisected the issue down to commit 586bc4aab878 ("ALSA: hda/hdmi - fix
> vgaswitcheroo detection for AMD").
>
> The commit iterates over PCI devices using pci_get_class() and unreferences
> each device found, even though pci_get_class() subsequently unreferences
> the device as well. Fix it.
The pci_dev_put() a few lines above should probably be dropped as well.
Alex
>
> Fixes: 586bc4aab878 ("ALSA: hda/hdmi - fix vgaswitcheroo detection for
> AMD")
> Link:
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.
> kernel.org%2Fr%2FPSXP216MB0438BFEAA0617283A834E11580580%40PSXP2
> 16MB0438.KORP216.PROD.OUTLOOK.COM%2F&data=02%7C01%7Calex
> ander.deucher%40amd.com%7C254649b79a6240f3a8aa08d77d76715f%7C3d
> d8961fe4884e608e11a82d994e183d%7C0%7C0%7C637115819998031934&
> ;sdata=qI%2B%2FJv3Z6pvwN7gQFSnZiP31YUAvd%2BKjo0ZqMERFbYU%3D&a
> mp;reserved=0
> Reported-and-tested-by: Nicholas Johnson <nicholas.johnson-
> [email protected]>
> Signed-off-by: Lukas Wunner <[email protected]>
> Cc: Mika Westerberg <[email protected]>
> Cc: Alexander Deucher <[email protected]>
> Cc: Bjorn Helgaas <[email protected]>
> ---
> sound/pci/hda/hda_intel.c | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/sound/pci/hda/hda_intel.c b/sound/pci/hda/hda_intel.c index
> 35b4526f0d28..b856b89378ac 100644
> --- a/sound/pci/hda/hda_intel.c
> +++ b/sound/pci/hda/hda_intel.c
> @@ -1419,7 +1419,6 @@ static bool atpx_present(void)
> return true;
> }
> }
> - pci_dev_put(pdev);
> }
> return false;
> }
> --
> 2.24.0
On Tue, Dec 10, 2019 at 03:34:27PM +0000, Deucher, Alexander wrote:
> > Nicholas Johnson reports a null pointer deref as well as a refcount underflow
> > upon hot-removal of a Thunderbolt-attached AMD eGPU.
> > He's bisected the issue down to commit 586bc4aab878 ("ALSA: hda/hdmi - fix
> > vgaswitcheroo detection for AMD").
> >
> > The commit iterates over PCI devices using pci_get_class() and unreferences
> > each device found, even though pci_get_class() subsequently unreferences
> > the device as well. Fix it.
>
> The pci_dev_put() a few lines above should probably be dropped as well.
That one looks fine to me. The refcount is already increased in the
caller get_bound_vga() via pci_get_domain_bus_and_slot() and it's
increased again in atpx_present() via pci_get_class(). It needs to
be decremented in atpx_present() to avoid leaking a ref.
Thanks,
Lukas
> > diff --git a/sound/pci/hda/hda_intel.c b/sound/pci/hda/hda_intel.c index
> > 35b4526f0d28..b856b89378ac 100644
> > --- a/sound/pci/hda/hda_intel.c
> > +++ b/sound/pci/hda/hda_intel.c
> > @@ -1419,7 +1419,6 @@ static bool atpx_present(void)
> > return true;
> > }
> > }
> > - pci_dev_put(pdev);
> > }
> > return false;
> > }
> > --
> > 2.24.0
> -----Original Message-----
> From: Lukas Wunner <[email protected]>
> Sent: Tuesday, December 10, 2019 10:47 AM
> To: Deucher, Alexander <[email protected]>
> Cc: Takashi Iwai <[email protected]>; Jaroslav Kysela <[email protected]>; Mika
> Westerberg <[email protected]>; Bjorn Helgaas
> <[email protected]>; Nicholas Johnson <nicholas.johnson-
> [email protected]>; [email protected]; linux-
> [email protected]; [email protected]
> Subject: Re: [PATCH] ALSA: hda/hdmi - Fix duplicate unref of pci_dev
>
> On Tue, Dec 10, 2019 at 03:34:27PM +0000, Deucher, Alexander wrote:
> > > Nicholas Johnson reports a null pointer deref as well as a refcount
> > > underflow upon hot-removal of a Thunderbolt-attached AMD eGPU.
> > > He's bisected the issue down to commit 586bc4aab878 ("ALSA: hda/hdmi
> > > - fix vgaswitcheroo detection for AMD").
> > >
> > > The commit iterates over PCI devices using pci_get_class() and
> > > unreferences each device found, even though pci_get_class()
> > > subsequently unreferences the device as well. Fix it.
> >
> > The pci_dev_put() a few lines above should probably be dropped as well.
>
> That one looks fine to me. The refcount is already increased in the caller
> get_bound_vga() via pci_get_domain_bus_and_slot() and it's increased
> again in atpx_present() via pci_get_class(). It needs to be decremented in
> atpx_present() to avoid leaking a ref.
I'm not following. This is part of the same loop as the one you removed. All we are doing is checking whether the ATPX method exists or not om the platform. The pdev may not be the same one as the one in pci_get_domain_bus_and_slot(). The APTX method in the APU's ACPI namespace, not the dGPUs.
Alex
>
> Thanks,
>
> Lukas
>
> > > diff --git a/sound/pci/hda/hda_intel.c b/sound/pci/hda/hda_intel.c
> > > index 35b4526f0d28..b856b89378ac 100644
> > > --- a/sound/pci/hda/hda_intel.c
> > > +++ b/sound/pci/hda/hda_intel.c
> > > @@ -1419,7 +1419,6 @@ static bool atpx_present(void)
> > > return true;
> > > }
> > > }
> > > - pci_dev_put(pdev);
> > > }
> > > return false;
> > > }
> > > --
> > > 2.24.0
On Tue, 10 Dec 2019 16:53:20 +0100,
Deucher, Alexander wrote:
>
> > -----Original Message-----
> > From: Lukas Wunner <[email protected]>
> > Sent: Tuesday, December 10, 2019 10:47 AM
> > To: Deucher, Alexander <[email protected]>
> > Cc: Takashi Iwai <[email protected]>; Jaroslav Kysela <[email protected]>; Mika
> > Westerberg <[email protected]>; Bjorn Helgaas
> > <[email protected]>; Nicholas Johnson <nicholas.johnson-
> > [email protected]>; [email protected]; linux-
> > [email protected]; [email protected]
> > Subject: Re: [PATCH] ALSA: hda/hdmi - Fix duplicate unref of pci_dev
> >
> > On Tue, Dec 10, 2019 at 03:34:27PM +0000, Deucher, Alexander wrote:
> > > > Nicholas Johnson reports a null pointer deref as well as a refcount
> > > > underflow upon hot-removal of a Thunderbolt-attached AMD eGPU.
> > > > He's bisected the issue down to commit 586bc4aab878 ("ALSA: hda/hdmi
> > > > - fix vgaswitcheroo detection for AMD").
> > > >
> > > > The commit iterates over PCI devices using pci_get_class() and
> > > > unreferences each device found, even though pci_get_class()
> > > > subsequently unreferences the device as well. Fix it.
> > >
> > > The pci_dev_put() a few lines above should probably be dropped as well.
> >
> > That one looks fine to me. The refcount is already increased in the caller
> > get_bound_vga() via pci_get_domain_bus_and_slot() and it's increased
> > again in atpx_present() via pci_get_class(). It needs to be decremented in
> > atpx_present() to avoid leaking a ref.
>
> I'm not following. This is part of the same loop as the one you removed. All we are doing is checking whether the ATPX method exists or not om the platform. The pdev may not be the same one as the one in pci_get_domain_bus_and_slot(). The APTX method in the APU's ACPI namespace, not the dGPUs.
Well, the tricky part is that pci_get_class() itself does
unrefeference the old object and reference the new object (if found).
At the end of the loop, nothing is referenced, so it's fine.
OTOH, if you go out of the loop in the middle, you're still keeping
the pdev object reference, so you need to manually unreference it.
Takashi
>
> Alex
>
> >
> > Thanks,
> >
> > Lukas
> >
> > > > diff --git a/sound/pci/hda/hda_intel.c b/sound/pci/hda/hda_intel.c
> > > > index 35b4526f0d28..b856b89378ac 100644
> > > > --- a/sound/pci/hda/hda_intel.c
> > > > +++ b/sound/pci/hda/hda_intel.c
> > > > @@ -1419,7 +1419,6 @@ static bool atpx_present(void)
> > > > return true;
> > > > }
> > > > }
> > > > - pci_dev_put(pdev);
> > > > }
> > > > return false;
> > > > }
> > > > --
> > > > 2.24.0
>
On Tue, Dec 10, 2019 at 03:53:20PM +0000, Deucher, Alexander wrote:
> > On Tue, Dec 10, 2019 at 03:34:27PM +0000, Deucher, Alexander wrote:
> > > > Nicholas Johnson reports a null pointer deref as well as a refcount
> > > > underflow upon hot-removal of a Thunderbolt-attached AMD eGPU.
> > > > He's bisected the issue down to commit 586bc4aab878 ("ALSA: hda/hdmi
> > > > - fix vgaswitcheroo detection for AMD").
> > > >
> > > > The commit iterates over PCI devices using pci_get_class() and
> > > > unreferences each device found, even though pci_get_class()
> > > > subsequently unreferences the device as well. Fix it.
> > >
> > > The pci_dev_put() a few lines above should probably be dropped as well.
> >
> > That one looks fine to me. The refcount is already increased in the caller
> > get_bound_vga() via pci_get_domain_bus_and_slot() and it's increased
> > again in atpx_present() via pci_get_class(). It needs to be decremented in
> > atpx_present() to avoid leaking a ref.
>
> I'm not following. This is part of the same loop as the one you removed.
> All we are doing is checking whether the ATPX method exists or not om the
> platform. The pdev may not be the same one as the one in
> pci_get_domain_bus_and_slot(). The APTX method in the APU's ACPI namespace,
> not the dGPUs.
Okay. Still, atpx_present() doesn't pass the found pci_dev back to the
caller, so it would be leaked if the ref isn't returned.
The situation is different for the pci_dev_put() I removed: The ref is
returned by pci_get_class() on the next loop iteration.
Thanks,
Lukas
> > > > diff --git a/sound/pci/hda/hda_intel.c b/sound/pci/hda/hda_intel.c
> > > > index 35b4526f0d28..b856b89378ac 100644
> > > > --- a/sound/pci/hda/hda_intel.c
> > > > +++ b/sound/pci/hda/hda_intel.c
> > > > @@ -1419,7 +1419,6 @@ static bool atpx_present(void)
> > > > return true;
> > > > }
> > > > }
> > > > - pci_dev_put(pdev);
> > > > }
> > > > return false;
> > > > }
> > > > --
> > > > 2.24.0
> -----Original Message-----
> From: Takashi Iwai <[email protected]>
> Sent: Tuesday, December 10, 2019 11:11 AM
> To: Deucher, Alexander <[email protected]>
> Cc: Lukas Wunner <[email protected]>; Jaroslav Kysela <[email protected]>;
> Mika Westerberg <[email protected]>; Bjorn Helgaas
> <[email protected]>; Nicholas Johnson <nicholas.johnson-
> [email protected]>; [email protected]; linux-
> [email protected]; [email protected]
> Subject: Re: [PATCH] ALSA: hda/hdmi - Fix duplicate unref of pci_dev
>
> On Tue, 10 Dec 2019 16:53:20 +0100,
> Deucher, Alexander wrote:
> >
> > > -----Original Message-----
> > > From: Lukas Wunner <[email protected]>
> > > Sent: Tuesday, December 10, 2019 10:47 AM
> > > To: Deucher, Alexander <[email protected]>
> > > Cc: Takashi Iwai <[email protected]>; Jaroslav Kysela <[email protected]>;
> > > Mika Westerberg <[email protected]>; Bjorn Helgaas
> > > <[email protected]>; Nicholas Johnson <nicholas.johnson-
> > > [email protected]>; [email protected]; linux-
> > > [email protected]; [email protected]
> > > Subject: Re: [PATCH] ALSA: hda/hdmi - Fix duplicate unref of pci_dev
> > >
> > > On Tue, Dec 10, 2019 at 03:34:27PM +0000, Deucher, Alexander wrote:
> > > > > Nicholas Johnson reports a null pointer deref as well as a
> > > > > refcount underflow upon hot-removal of a Thunderbolt-attached
> AMD eGPU.
> > > > > He's bisected the issue down to commit 586bc4aab878 ("ALSA:
> > > > > hda/hdmi
> > > > > - fix vgaswitcheroo detection for AMD").
> > > > >
> > > > > The commit iterates over PCI devices using pci_get_class() and
> > > > > unreferences each device found, even though pci_get_class()
> > > > > subsequently unreferences the device as well. Fix it.
> > > >
> > > > The pci_dev_put() a few lines above should probably be dropped as
> well.
> > >
> > > That one looks fine to me. The refcount is already increased in the
> > > caller
> > > get_bound_vga() via pci_get_domain_bus_and_slot() and it's increased
> > > again in atpx_present() via pci_get_class(). It needs to be
> > > decremented in
> > > atpx_present() to avoid leaking a ref.
> >
> > I'm not following. This is part of the same loop as the one you removed. All
> we are doing is checking whether the ATPX method exists or not om the
> platform. The pdev may not be the same one as the one in
> pci_get_domain_bus_and_slot(). The APTX method in the APU's ACPI
> namespace, not the dGPUs.
>
> Well, the tricky part is that pci_get_class() itself does unrefeference the old
> object and reference the new object (if found).
> At the end of the loop, nothing is referenced, so it's fine.
> OTOH, if you go out of the loop in the middle, you're still keeping the pdev
> object reference, so you need to manually unreference it.
>
Ah, I see what you are saying. Thanks. Patch is:
Reviewed-by: Alex Deucher <[email protected]>
>
> Takashi
>
> >
> > Alex
> >
> > >
> > > Thanks,
> > >
> > > Lukas
> > >
> > > > > diff --git a/sound/pci/hda/hda_intel.c
> > > > > b/sound/pci/hda/hda_intel.c index 35b4526f0d28..b856b89378ac
> > > > > 100644
> > > > > --- a/sound/pci/hda/hda_intel.c
> > > > > +++ b/sound/pci/hda/hda_intel.c
> > > > > @@ -1419,7 +1419,6 @@ static bool atpx_present(void)
> > > > > return true;
> > > > > }
> > > > > }
> > > > > - pci_dev_put(pdev);
> > > > > }
> > > > > return false;
> > > > > }
> > > > > --
> > > > > 2.24.0
> >
On 2019/12/10 20:46, Takashi Iwai wrote:
> On Tue, 10 Dec 2019 13:29:41 +0100,
> Lukas Wunner wrote:
>>
>> [cc += Alex, Takashi]
>>
>> On Tue, Dec 10, 2019 at 12:00:23PM +0000, Nicholas Johnson wrote:
>>> On Tue, Dec 10, 2019 at 09:28:00AM +0200, [email protected] wrote:
>>>> On Mon, Dec 09, 2019 at 01:33:49PM +0000, Nicholas Johnson wrote:
>>>>> On Mon, Dec 09, 2019 at 03:12:39PM +0200, [email protected] wrote:
>>>>>> On Mon, Dec 09, 2019 at 12:34:04PM +0000, Nicholas Johnson wrote:
>>>>>>> I have compiled Linux v5.5-rc1 and thought all was good until I
>>>>>>> hot-removed a Gigabyte Aorus eGPU from Thunderbolt. The driver for the
>>>>>>> GPU was not loaded (blacklisted) so the crash is nothing to do with the
>>>>>>> GPU driver.
>>>>>>>
>>>>>>> We had:
>>>>>>> - kernel NULL pointer dereference
>>>>>>> - refcount_t: underflow; use-after-free.
>>>
>>> The following is the culprit responsible for the issues:
>>>
>>> commit 586bc4aab878efcf672536f0cdec3d04b6990c94
>>> Author: Alex Deucher <[email protected]>
>>> Date: Fri Nov 22 16:43:50 2019 -0500
>>>
>>> ALSA: hda/hdmi - fix vgaswitcheroo detection for AMD
>>
>> Does the below fix the issue?
>>
>> -- >8 --
>> diff --git a/sound/pci/hda/hda_intel.c b/sound/pci/hda/hda_intel.c
>> index 35b4526f0d28..b856b89378ac 100644
>> --- a/sound/pci/hda/hda_intel.c
>> +++ b/sound/pci/hda/hda_intel.c
>> @@ -1419,7 +1419,6 @@ static bool atpx_present(void)
>> return true;
>> }
>> }
>> - pci_dev_put(pdev);
>> }
>> return false;
>> }
>
> Oh this looks really like a bug, even if this isn't the root cause.
>
> Care to submit a proper patch?
>
Hi Nicholas
I have disassembled the vmlinux by objdump -Sdl, but is it too big to
send as an attachment. Just take a look at the key information:
The file dmesg-5 show trace:
[ 71.693210] general protection fault: 0000 [#1] SMP PTI
[ 71.693215] CPU: 1 PID: 184 Comm: irq/128-pciehp Not tainted 5.5.0-rc1 #2
[ 71.693217] Hardware name: Dell Inc. XPS 13 9370/09DKKT, BIOS 1.11.1
07/11/2019
[ 71.693223] RIP: 0010:pci_remove_bus_device+0x9a/0x100
[ 71.693250] Call Trace:
[ 71.693256] pci_remove_bus_device+0x31/0x100
[ 71.693259] pci_remove_bus_device+0x31/0x100
[ 71.693261] pci_stop_and_remove_bus_device+0x1b/0x20
.....
[ 71.709515] Call Trace:
[ 71.709527] kobject_put+0xfc/0x1c0
[ 71.709536] __device_link_free_srcu+0x51/0x60
[ 71.709538] srcu_invoke_callbacks+0xd2/0x170
[ 71.709547] process_one_work+0x1ec/0x3a0
[ 71.709550] worker_thread+0x4d/0x400
[ 71.709554] kthread+0x104/0x140
[ 71.709557] ? process_one_work+0x3a0/0x3a0
[ 71.709559] ? kthread_park+0x90/0x90
[ 71.709563] ret_from_fork+0x35/0x40
the disassembling file
ffffffff814eea30 <pci_remove_bus_device>:
pci_remove_bus_device():
/home/higon/linux/drivers/pci/remove.c:86
static void pci_remove_bus_device(struct pci_dev *dev)
{
ffffffff814eea30: e8 bb 2e 51 00 callq ffffffff81a018f0 <__fentry__>
ffffffff814eea35: 41 55 push %r13
ffffffff814eea37: 41 54 push %r12
ffffffff814eea39: 55 push %rbp
ffffffff814eea3a: 53 push %rbx
ffffffff814eea3b: 48 89 fd mov %rdi,%rbp
/home/higon/linux/drivers/pci/remove.c:87
struct pci_bus *bus = dev->subordinate;
ffffffff814eea3e: 4c 8b 6f 18 mov 0x18(%rdi),%r13
/home/higon/linux/drivers/pci/remove.c:90
struct pci_dev *child, *tmp;
........
static inline void list_del(struct list_head *entry)
{
__list_del_entry(entry);
entry->next = LIST_POISON1;
ffffffff814eeac0: 48 b8 00 01 00 00 00 movabs $0xdead000000000100,%rax
ffffffff814eeac7: 00 ad de
ffffffff814eeaca: 48 89 45 00 mov %rax,0x0(%rbp)
/home/higon/linux/./include/linux/list.h:141
entry->prev = LIST_POISON2;
ffffffff814eeace: 48 b8 22 01 00 00 00 movabs $0xdead000000000122,%rax
ffffffff814eead5: 00 ad de
ffffffff814eead8: 48 89 45 08 mov %rax,0x8(%rbp)
pci_destroy_dev():
/home/higon/linux/drivers/pci/remove.c:39
ffffffff814eeadc: e8 2f 23 bf ff callq ffffffff810e0e10 <up_write>
/home/higon/linux/drivers/pci/remove.c:41
In the file source/sound/pci/hda/hda_intel.c, atpx_present can not call
pci_dev_put(pdev) because the reference count for pdev is always
decremented if it is not NULL after call pci_get_class.
Resource of pdev will be released when the reference count for it
decremented to zero. Hotplug service dereferences NULL pointer when
remove the device from the device lists.
static bool atpx_present(void)
{
struct pci_dev *pdev = NULL;
acpi_handle dhandle, atpx_handle;
acpi_status status;
while ((pdev = pci_get_class(PCI_BASE_CLASS_DISPLAY << 16, pdev)) !=
NULL) {
dhandle = ACPI_HANDLE(&pdev->dev);
if (dhandle) {
status = acpi_get_handle(dhandle, "ATPX", &atpx_handle);
if (!ACPI_FAILURE(status)) {
pci_dev_put(pdev);
return true;
}
}
pci_dev_put(pdev);
}
return false;
}
struct pci_dev *pci_get_class(unsigned int class, struct pci_dev *from)
{
struct pci_device_id id = {
.vendor = PCI_ANY_ID,
.device = PCI_ANY_ID,
.subvendor = PCI_ANY_ID,
.subdevice = PCI_ANY_ID,
.class_mask = PCI_ANY_ID,
.class = class,
};
return pci_get_dev_by_id(&id, from);
}
static struct pci_dev *pci_get_dev_by_id(const struct pci_device_id *id,
struct pci_dev *from)
{
struct device *dev;
struct device *dev_start = NULL;
struct pci_dev *pdev = NULL;
WARN_ON(in_interrupt());
if (from)
dev_start = &from->dev;
dev = bus_find_device(&pci_bus_type, dev_start, (void *)id,
match_pci_dev_by_id);
if (dev)
pdev = to_pci_dev(dev);
pci_dev_put(from);
return pdev;
}
Thanks,
Jiasen Lin
>
> thanks,
>
> Takashi
>