2020-04-21 19:10:33

by Alex Xu (Hello71)

Subject: Unrecoverable AER error when resuming from RAM (hda regression in 5.7-rc2)

With 5.7-rc2, after resuming from suspend to RAM, I get:

[ 55.679382] pcieport 0000:00:03.1: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:00.0
[ 55.679405] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 55.679410] pcieport 0000:00:03.1: AER: device [1022:1453] error status/mask=00100000/04400000
[ 55.679414] pcieport 0000:00:03.1: AER: [20] UnsupReq (First)
[ 55.679417] pcieport 0000:00:03.1: AER: TLP Header: 40000004 0a0000ff fffc0e80 00000000
[ 55.679423] amdgpu 0000:0a:00.0: AER: can't recover (no error_detected callback)
[ 55.679425] snd_hda_intel 0000:0a:00.1: AER: can't recover (no error_detected callback)
[ 55.679455] pcieport 0000:00:03.1: AER: device recovery failed

Then the display freezes and the system basically falls apart (can't
even sudo reboot -f, need to use magic sysrq).

I bisected this to "ALSA: hda: Skip controller resume if not needed".
Setting snd_hda_intel.power_save=0 resolves the issue.

I am using an ASRock B450 Pro4 with Realtek HDA codec:

[ 1.009400] snd_hda_intel 0000:0a:00.1: enabling device (0000 -> 0002)
[ 1.009425] snd_hda_intel 0000:0a:00.1: Force to non-snoop mode
[ 1.009653] snd_hda_intel 0000:0c:00.3: enabling device (0000 -> 0002)
[ 1.021452] snd_hda_codec_generic hdaudioC0D0: ignore pin 0x7, too many assigned pins
[ 1.021461] snd_hda_codec_generic hdaudioC0D0: ignore pin 0x9, too many assigned pins
[ 1.021471] snd_hda_codec_generic hdaudioC0D0: ignore pin 0xb, too many assigned pins
[ 1.021480] snd_hda_codec_generic hdaudioC0D0: ignore pin 0xd, too many assigned pins
[ 1.021482] snd_hda_codec_generic hdaudioC0D0: autoconfig for Generic: line_outs=0 (0x0/0x0/0x0/0x0/0x0) type:line
[ 1.021482] snd_hda_codec_generic hdaudioC0D0: speaker_outs=0 (0x0/0x0/0x0/0x0/0x0)
[ 1.021483] snd_hda_codec_generic hdaudioC0D0: hp_outs=0 (0x0/0x0/0x0/0x0/0x0)
[ 1.021484] snd_hda_codec_generic hdaudioC0D0: mono: mono_out=0x0
[ 1.021484] snd_hda_codec_generic hdaudioC0D0: dig-out=0x3/0x5
[ 1.021485] snd_hda_codec_generic hdaudioC0D0: inputs:
[ 1.046053] snd_hda_codec_realtek hdaudioC1D0: autoconfig for ALC892: line_outs=1 (0x14/0x0/0x0/0x0/0x0) type:line
[ 1.046054] snd_hda_codec_realtek hdaudioC1D0: speaker_outs=0 (0x0/0x0/0x0/0x0/0x0)
[ 1.046055] snd_hda_codec_realtek hdaudioC1D0: hp_outs=1 (0x1b/0x0/0x0/0x0/0x0)
[ 1.046055] snd_hda_codec_realtek hdaudioC1D0: mono: mono_out=0x0
[ 1.046056] snd_hda_codec_realtek hdaudioC1D0: inputs:
[ 1.046057] snd_hda_codec_realtek hdaudioC1D0: Front Mic=0x19
[ 1.046058] snd_hda_codec_realtek hdaudioC1D0: Rear Mic=0x18
[ 1.046058] snd_hda_codec_realtek hdaudioC1D0: Line=0x1a

I also have an ASUS RX 480 graphics card with HDMI audio output.


2020-04-21 19:42:59

by Takashi Iwai

Subject: Re: Unrecoverable AER error when resuming from RAM (hda regression in 5.7-rc2)

On Tue, 21 Apr 2020 21:08:44 +0200,
Alex Xu (Hello71) wrote:
>
> With 5.7-rc2, after resuming from suspend to RAM, I get:
>
> [ 55.679382] pcieport 0000:00:03.1: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:00.0
> [ 55.679405] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> [ 55.679410] pcieport 0000:00:03.1: AER: device [1022:1453] error status/mask=00100000/04400000
> [ 55.679414] pcieport 0000:00:03.1: AER: [20] UnsupReq (First)
> [ 55.679417] pcieport 0000:00:03.1: AER: TLP Header: 40000004 0a0000ff fffc0e80 00000000
> [ 55.679423] amdgpu 0000:0a:00.0: AER: can't recover (no error_detected callback)
> [ 55.679425] snd_hda_intel 0000:0a:00.1: AER: can't recover (no error_detected callback)
> [ 55.679455] pcieport 0000:00:03.1: AER: device recovery failed
>
> Then the display freezes and the system basically falls apart (can't
> even sudo reboot -f, need to use magic sysrq).
>
> I bisected this to "ALSA: hda: Skip controller resume if not needed".
> Setting snd_hda_intel.power_save=0 resolves the issue.

Hrm, it means the condition to skip the controller resume doesn't fit
well. Does the patch below help?

But looking at the dmesg output:
> [ 1.021452] snd_hda_codec_generic hdaudioC0D0: ignore pin 0x7, too many assigned pins
> [ 1.021461] snd_hda_codec_generic hdaudioC0D0: ignore pin 0x9, too many assigned pins
> [ 1.021471] snd_hda_codec_generic hdaudioC0D0: ignore pin 0xb, too many assigned pins
> [ 1.021480] snd_hda_codec_generic hdaudioC0D0: ignore pin 0xd, too many assigned pins
> [ 1.021482] snd_hda_codec_generic hdaudioC0D0: autoconfig for Generic: line_outs=0 (0x0/0x0/0x0/0x0/0x0) type:line
> [ 1.021482] snd_hda_codec_generic hdaudioC0D0: speaker_outs=0 (0x0/0x0/0x0/0x0/0x0)
> [ 1.021483] snd_hda_codec_generic hdaudioC0D0: hp_outs=0 (0x0/0x0/0x0/0x0/0x0)
> [ 1.021484] snd_hda_codec_generic hdaudioC0D0: mono: mono_out=0x0
> [ 1.021484] snd_hda_codec_generic hdaudioC0D0: dig-out=0x3/0x5
> [ 1.021485] snd_hda_codec_generic hdaudioC0D0: inputs:

... it looks like snd-hda-codec-generic is used for HDMI/DP codec.
This can't work well. Did you enable CONFIG_SND_HDA_HDMI?

In any case, please provide the alsa-info.sh output. Run the script with
the --no-upload option and attach the output.


thanks,

Takashi

---
--- a/sound/pci/hda/hda_intel.c
+++ b/sound/pci/hda/hda_intel.c
@@ -1060,7 +1060,7 @@ static int azx_resume(struct device *dev)

 	/* check for the forced resume */
 	list_for_each_codec(codec, &chip->bus) {
-		if (hda_codec_need_resume(codec)) {
+		if (!codec->relaxed_resume) {
 			forced_resume = true;
 			break;
 		}

2020-04-22 20:53:53

by Bjorn Helgaas

Subject: Re: Unrecoverable AER error when resuming from RAM (hda regression in 5.7-rc2)

[+cc Rafael, linux-pm]

On Tue, Apr 21, 2020 at 03:08:44PM -0400, Alex Xu (Hello71) wrote:
> With 5.7-rc2, after resuming from suspend to RAM, I get:
>
> [ 55.679382] pcieport 0000:00:03.1: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:00.0
> [ 55.679405] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> [ 55.679410] pcieport 0000:00:03.1: AER: device [1022:1453] error status/mask=00100000/04400000
> [ 55.679414] pcieport 0000:00:03.1: AER: [20] UnsupReq (First)
> [ 55.679417] pcieport 0000:00:03.1: AER: TLP Header: 40000004 0a0000ff fffc0e80 00000000
> [ 55.679423] amdgpu 0000:0a:00.0: AER: can't recover (no error_detected callback)
> [ 55.679425] snd_hda_intel 0000:0a:00.1: AER: can't recover (no error_detected callback)
> [ 55.679455] pcieport 0000:00:03.1: AER: device recovery failed

I'm not at all confident in my decoding skills, but I *think* the TLP
header decodes to:

  Fmt           010b        3 DW header with data (32-bit address)
  Type          00000b      MWr
  Length        0x4         4 DW = 16 bytes
  Requester ID  0x0a00      0a:00.0
  Byte enables  0xff
  Address       0xfffc0e80

which would mean the 0a:00.0 GPU did a 16-byte write to 0xfffc0e80,
and the 00:03.1 Root Port reported that as an Unsupported Request.
I don't know why that would be unless the address is invalid.
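The field extraction above can be sketched in C. This is only an illustration, assuming the standard layout of a 3 DW memory-request header; the struct and function names (tlp_hdr, tlp_decode) are made up for this example and are not kernel APIs.

```c
#include <stdint.h>

/*
 * Illustrative only: fields of a 3 DW memory-request TLP header as
 * printed in the AER "TLP Header:" line. Names are hypothetical.
 */
struct tlp_hdr {
	unsigned int fmt;        /* DW0 bits 31:29 */
	unsigned int type;       /* DW0 bits 28:24 */
	unsigned int length_dw;  /* DW0 bits 9:0, payload length in DWs */
	unsigned int requester;  /* DW1 bits 31:16, bus/device/function */
	unsigned int tag;        /* DW1 bits 15:8 */
	unsigned int last_be;    /* DW1 bits 7:4 */
	unsigned int first_be;   /* DW1 bits 3:0 */
	uint32_t addr;           /* DW2 bits 31:2 (low two bits cleared) */
};

/* Split the first three logged DWs into their header fields. */
static struct tlp_hdr tlp_decode(const uint32_t dw[3])
{
	struct tlp_hdr h;

	h.fmt = (dw[0] >> 29) & 0x7;
	h.type = (dw[0] >> 24) & 0x1f;
	h.length_dw = dw[0] & 0x3ff;
	h.requester = (dw[1] >> 16) & 0xffff;
	h.tag = (dw[1] >> 8) & 0xff;
	h.last_be = (dw[1] >> 4) & 0xf;
	h.first_be = dw[1] & 0xf;
	h.addr = dw[2] & ~(uint32_t)0x3;
	return h;
}
```

Feeding it the logged values 40000004 0a0000ff fffc0e80 gives Fmt 010b, Type 00000b, Length 4, Requester ID 0x0a00 and Address 0xfffc0e80, matching the table above. The "Byte enables 0xff" entry corresponds to last_be and first_be both being 0xf.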

Maybe that's supposed to be an MSI address? Maybe a complete dmesg or
/proc/iomem would have a clue?

I feel like this UR issue could be a PCI core issue or maybe some sort
of misuse of PCI power management, but I can't seem to get traction on
it.

> Then the display freezes and the system basically falls apart (can't
> even sudo reboot -f, need to use magic sysrq).
>
> I bisected this to "ALSA: hda: Skip controller resume if not needed".
> Setting snd_hda_intel.power_save=0 resolves the issue.

FWIW, the complete citation is c4c8dd6ef807 ("ALSA: hda: Skip
controller resume if not needed"),
https://git.kernel.org/linus/c4c8dd6ef807, which first appeared in
v5.7-rc2.

> I am using an ASRock B450 Pro4 with Realtek HDA codec:
>
> [ 1.009400] snd_hda_intel 0000:0a:00.1: enabling device (0000 -> 0002)
> [ 1.009425] snd_hda_intel 0000:0a:00.1: Force to non-snoop mode
> [ 1.009653] snd_hda_intel 0000:0c:00.3: enabling device (0000 -> 0002)
> [ 1.021452] snd_hda_codec_generic hdaudioC0D0: ignore pin 0x7, too many assigned pins
> [ 1.021461] snd_hda_codec_generic hdaudioC0D0: ignore pin 0x9, too many assigned pins
> [ 1.021471] snd_hda_codec_generic hdaudioC0D0: ignore pin 0xb, too many assigned pins
> [ 1.021480] snd_hda_codec_generic hdaudioC0D0: ignore pin 0xd, too many assigned pins
> [ 1.021482] snd_hda_codec_generic hdaudioC0D0: autoconfig for Generic: line_outs=0 (0x0/0x0/0x0/0x0/0x0) type:line
> [ 1.021482] snd_hda_codec_generic hdaudioC0D0: speaker_outs=0 (0x0/0x0/0x0/0x0/0x0)
> [ 1.021483] snd_hda_codec_generic hdaudioC0D0: hp_outs=0 (0x0/0x0/0x0/0x0/0x0)
> [ 1.021484] snd_hda_codec_generic hdaudioC0D0: mono: mono_out=0x0
> [ 1.021484] snd_hda_codec_generic hdaudioC0D0: dig-out=0x3/0x5
> [ 1.021485] snd_hda_codec_generic hdaudioC0D0: inputs:
> [ 1.046053] snd_hda_codec_realtek hdaudioC1D0: autoconfig for ALC892: line_outs=1 (0x14/0x0/0x0/0x0/0x0) type:line
> [ 1.046054] snd_hda_codec_realtek hdaudioC1D0: speaker_outs=0 (0x0/0x0/0x0/0x0/0x0)
> [ 1.046055] snd_hda_codec_realtek hdaudioC1D0: hp_outs=1 (0x1b/0x0/0x0/0x0/0x0)
> [ 1.046055] snd_hda_codec_realtek hdaudioC1D0: mono: mono_out=0x0
> [ 1.046056] snd_hda_codec_realtek hdaudioC1D0: inputs:
> [ 1.046057] snd_hda_codec_realtek hdaudioC1D0: Front Mic=0x19
> [ 1.046058] snd_hda_codec_realtek hdaudioC1D0: Rear Mic=0x18
> [ 1.046058] snd_hda_codec_realtek hdaudioC1D0: Line=0x1a
>
> I also have an ASUS RX 480 graphics card with HDMI audio output.

2020-04-22 21:28:22

by Takashi Iwai

Subject: Re: Unrecoverable AER error when resuming from RAM (hda regression in 5.7-rc2)

On Wed, 22 Apr 2020 22:50:28 +0200,
Bjorn Helgaas wrote:
>
> [+cc Rafael, linux-pm]
>
> On Tue, Apr 21, 2020 at 03:08:44PM -0400, Alex Xu (Hello71) wrote:
> > With 5.7-rc2, after resuming from suspend to RAM, I get:
> >
> > [ 55.679382] pcieport 0000:00:03.1: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:00.0
> > [ 55.679405] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> > [ 55.679410] pcieport 0000:00:03.1: AER: device [1022:1453] error status/mask=00100000/04400000
> > [ 55.679414] pcieport 0000:00:03.1: AER: [20] UnsupReq (First)
> > [ 55.679417] pcieport 0000:00:03.1: AER: TLP Header: 40000004 0a0000ff fffc0e80 00000000
> > [ 55.679423] amdgpu 0000:0a:00.0: AER: can't recover (no error_detected callback)
> > [ 55.679425] snd_hda_intel 0000:0a:00.1: AER: can't recover (no error_detected callback)
> > [ 55.679455] pcieport 0000:00:03.1: AER: device recovery failed
>
> I'm not at all confident in my decoding skills, but I *think* the TLP
> header decodes to:
>
>   Fmt           010b        3 DW header with data (32-bit address)
>   Type          00000b      MWr
>   Length        0x4         4 DW = 16 bytes
>   Requester ID  0x0a00      0a:00.0
>   Byte enables  0xff
>   Address       0xfffc0e80
>
> which would mean the 0a:00.0 GPU did a 16-byte write to 0xfffc0e80,
> and the 00:03.1 Root Port reported that as an Unsupported Request.
> I don't know why that would be unless the address is invalid.
>
> Maybe that's supposed to be an MSI address? Maybe a complete dmesg or
> /proc/iomem would have a clue?
>
> I feel like this UR issue could be a PCI core issue or maybe some sort
> of misuse of PCI power management, but I can't seem to get traction on
> it.
>
> > Then the display freezes and the system basically falls apart (can't
> > even sudo reboot -f, need to use magic sysrq).
> >
> > I bisected this to "ALSA: hda: Skip controller resume if not needed".
> > Setting snd_hda_intel.power_save=0 resolves the issue.
>
> FWIW, the complete citation is c4c8dd6ef807 ("ALSA: hda: Skip
> controller resume if not needed"),
> https://git.kernel.org/linus/c4c8dd6ef807, which first appeared in
> v5.7-rc2.

Yes, and I posted the fix patch right now:
https://lore.kernel.org/r/[email protected]

The possible cause was the tricky resume code that both HD-audio
controller (the parent PCI device) and the codec devices used.

At least the patch above seems to work on the reporter's machine.
Now we need a bit more testing before merging, but it looks promising,
so far.


thanks,

Takashi

2020-04-22 23:22:55

by Bjorn Helgaas

Subject: Re: Unrecoverable AER error when resuming from RAM (hda regression in 5.7-rc2)

On Wed, Apr 22, 2020 at 11:25:04PM +0200, Takashi Iwai wrote:
> On Wed, 22 Apr 2020 22:50:28 +0200,
> Bjorn Helgaas wrote:
> > ...
> > I feel like this UR issue could be a PCI core issue or maybe some sort
> > of misuse of PCI power management, but I can't seem to get traction on
> > it.
> >
> > > Then the display freezes and the system basically falls apart (can't
> > > even sudo reboot -f, need to use magic sysrq).
> > >
> > > I bisected this to "ALSA: hda: Skip controller resume if not needed".
> > > Setting snd_hda_intel.power_save=0 resolves the issue.
> >
> > FWIW, the complete citation is c4c8dd6ef807 ("ALSA: hda: Skip
> > controller resume if not needed"),
> > https://git.kernel.org/linus/c4c8dd6ef807, which first appeared in
> > v5.7-rc2.
>
> Yes, and I posted the fix patch right now:
> https://lore.kernel.org/r/[email protected]
>
> The possible cause was the tricky resume code that both HD-audio
> controller (the parent PCI device) and the codec devices used.
>
> At least the patch above seems to work on the reporter's machine.
> Now we need a bit more testing before merging, but it looks promising,
> so far.

Great, I'm glad you figured something out because I sure wasn't
getting anywhere!

Maybe this is a tangent, but I can't figure out what
snd_power_change_state() is doing. It *looks* like it's supposed to
change the PCI power state, but I gave up trying to figure out where
it actually touches the device.

It seems like sound has more magic in power management than other
device types, which makes me wonder if we're not providing the right
interfaces or something.

Bjorn

2020-04-23 07:10:06

by Takashi Iwai

Subject: Re: Unrecoverable AER error when resuming from RAM (hda regression in 5.7-rc2)

On Thu, 23 Apr 2020 01:21:27 +0200,
Bjorn Helgaas wrote:
>
> On Wed, Apr 22, 2020 at 11:25:04PM +0200, Takashi Iwai wrote:
> > On Wed, 22 Apr 2020 22:50:28 +0200,
> > Bjorn Helgaas wrote:
> > > ...
> > > I feel like this UR issue could be a PCI core issue or maybe some sort
> > > of misuse of PCI power management, but I can't seem to get traction on
> > > it.
> > >
> > > > Then the display freezes and the system basically falls apart (can't
> > > > even sudo reboot -f, need to use magic sysrq).
> > > >
> > > > I bisected this to "ALSA: hda: Skip controller resume if not needed".
> > > > Setting snd_hda_intel.power_save=0 resolves the issue.
> > >
> > > FWIW, the complete citation is c4c8dd6ef807 ("ALSA: hda: Skip
> > > controller resume if not needed"),
> > > https://git.kernel.org/linus/c4c8dd6ef807, which first appeared in
> > > v5.7-rc2.
> >
> > Yes, and I posted the fix patch right now:
> > https://lore.kernel.org/r/[email protected]
> >
> > The possible cause was the tricky resume code that both HD-audio
> > controller (the parent PCI device) and the codec devices used.
> >
> > At least the patch above seems to work on the reporter's machine.
> > Now we need a bit more testing before merging, but it looks promising,
> > so far.
>
> Great, I'm glad you figured something out because I sure wasn't
> getting anywhere!
>
> Maybe this is a tangent, but I can't figure out what
> snd_power_change_state() is doing. It *looks* like it's supposed to
> change the PCI power state, but I gave up trying to figure out where
> it actually touches the device.

Not really, it merely updates the internal state field stored in the
sound card object, see in include/sound/core.h:

static inline void snd_power_change_state(struct snd_card *card, unsigned int state)
{
	card->power_state = state;
	wake_up(&card->power_sleep);
}

The sound API explicitly blocks operations during suspend/resume using
this card-level state and its wait queue.
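
To illustrate the pattern (record the state, then wake anyone sleeping on it), here is a rough userspace analogue built on pthreads. It is only a sketch: the kernel uses its own wait queues rather than condition variables, and every name below (struct card, card_power_change_state, card_power_wait, POWER_D0, POWER_D3HOT) is hypothetical and chosen for this example.

```c
#include <pthread.h>

/* Hypothetical power states for this sketch, not the kernel's constants. */
enum { POWER_D0 = 0, POWER_D3HOT = 3 };

/* Userspace stand-in for the power_state/power_sleep pair in the card. */
struct card {
	pthread_mutex_t lock;
	pthread_cond_t power_sleep;
	unsigned int power_state;
};

/* A card that starts out suspended. */
#define CARD_SUSPENDED_INIT \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, POWER_D3HOT }

/* Analogue of snd_power_change_state(): store the state, wake waiters. */
static void card_power_change_state(struct card *c, unsigned int state)
{
	pthread_mutex_lock(&c->lock);
	c->power_state = state;
	pthread_cond_broadcast(&c->power_sleep);
	pthread_mutex_unlock(&c->lock);
}

/* Block the caller until the card reports the wanted state. */
static void card_power_wait(struct card *c, unsigned int state)
{
	pthread_mutex_lock(&c->lock);
	while (c->power_state != state)
		pthread_cond_wait(&c->power_sleep, &c->lock);
	pthread_mutex_unlock(&c->lock);
}
```

In this sketch, an operation that must not run while the card is suspended would call card_power_wait(&card, POWER_D0) first, and the resume path would call card_power_change_state(&card, POWER_D0) once the hardware is back. Nothing here touches the device itself, which matches the point above: the helper only updates bookkeeping and wakes sleepers.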


thanks,

Takashi