2020-04-26 16:04:07

by Nicholas Johnson

Subject: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state

Hi all,

Since Linux v5.7-rc1 / commit 4fdda2e66de0 ("drm/amdgpu/runpm: enable
runpm on baco capable VI+ asics"), my AMD R9 Nano has been using runpm /
BACO. You can tell visually when it sleeps, because the fan on the
graphics card is switched off to save power. It did not spin down the
fan in v5.6.x.

This is great (I love it), except that when it is sleeping, the PCIe
audio function of the GPU has issues if anything tries to access it. You
get dmesg errors such as these:

snd_hda_intel 0000:08:00.1: spurious response 0x0:0x0, last cmd=0x170500
snd_hda_intel 0000:08:00.1: azx_get_response timeout, switching to polling mode: last cmd=0x001f0500
snd_hda_intel 0000:08:00.1: No response from codec, disabling MSI: last cmd=0x001f0500
snd_hda_intel 0000:08:00.1: No response from codec, resetting bus: last cmd=0x001f0500
snd_hda_codec_hdmi hdaudioC1D0: Unable to sync register 0x2f0d00. -11

The above is with the Fiji XT GPU at 0000:08:00.0 in a Thunderbolt
enclosure (not that Thunderbolt should affect it, but I feel I should
mention it just in case). I dropped a lot of duplicate dmesg lines, as
some of them repeated a lot of times before the driver gave up.
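A minimal sketch, not part of the original report, of how repeated HD-audio lines like the ones above could be collapsed into counts when reading a long dmesg. The function name `summarize_hda_errors` and the regular expression are illustrative assumptions; only exact repeats are grouped, and the optional `[  12.345678]` timestamp prefix is stripped first.

```python
import re
import sys
from collections import Counter

# Matches lines such as
#   snd_hda_intel 0000:08:00.1: No response from codec, resetting bus: last cmd=0x001f0500
#   snd_hda_codec_hdmi hdaudioC1D0: Unable to sync register 0x2f0d00. -11
# with an optional leading dmesg timestamp.
HDA_LINE = re.compile(r"^(?:\[\s*\d+\.\d+\]\s*)?(snd_hda_\w+) (\S+): (.*)$")


def summarize_hda_errors(lines):
    """Return [(count, message)] for snd_hda_* lines, in first-seen order."""
    counts = Counter()
    for line in lines:
        match = HDA_LINE.match(line.strip())
        if match is None:
            continue  # not an HD-audio driver message
        driver, device, text = match.groups()
        counts[f"{driver} {device}: {text}"] += 1
    # Counter keeps insertion order, so the first occurrence decides the position.
    return [(count, message) for message, count in counts.items()]


if __name__ == "__main__":
    # Example: dmesg | python3 this_script.py
    for count, message in summarize_hda_errors(sys.stdin):
        print(f"{count:5d}x {message}")
```

Piping `dmesg` into the script prints each distinct HD-audio message once, prefixed with how often it occurred.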

I offer this patch to disable runpm for Fiji while a fix is found, if
you decide that is the best approach. Regardless, I will gladly test any
patches you come up with instead and confirm that the above issue has
been fixed.

I cannot tell if any other GPUs are affected. The only other cards to
which I have access are a couple of AMD R9 280X (Tahiti XT), which use
the radeon driver instead of the amdgpu driver.

Kind regards,
Nicholas Johnson

Nicholas Johnson (1):
drm/amdgpu/runpm: Disable runpm on Fiji due to audio register timeout

drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 1 +
1 file changed, 1 insertion(+)

--
2.26.2


2020-04-26 16:04:54

by Nicholas Johnson

Subject: [PATCH 1/1] drm/amdgpu/runpm: Disable runpm on Fiji due to audio register timeout

Since commit 4fdda2e66de0 ("drm/amdgpu/runpm: enable runpm on baco
capable VI+ asics"), runpm has been enabled on AMD Fiji GPUs. This
allows the GPU to enter BACO state, as evidenced by the fan on the
graphics card turning off. When it is in this state, accesses to the
registers of the PCIe audio function on the GPU time out, leading to
dmesg errors such as the following:

snd_hda_intel 0000:08:00.1: spurious response 0x0:0x0, last cmd=0x170500
snd_hda_intel 0000:08:00.1: azx_get_response timeout, switching to polling mode: last cmd=0x001f0500
snd_hda_intel 0000:08:00.1: No response from codec, disabling MSI: last cmd=0x001f0500
snd_hda_intel 0000:08:00.1: No response from codec, resetting bus: last cmd=0x001f0500
snd_hda_codec_hdmi hdaudioC1D0: Unable to sync register 0x2f0d00. -11

Pending a fix for the above problem, disable runpm on Fiji.

Signed-off-by: Nicholas Johnson <[email protected]>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index fd1dc3236..cbb55d2f1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -172,6 +172,7 @@ int amdgpu_driver_load_kms(struct drm_device *dev, unsigned long flags)
 	else if (amdgpu_device_supports_baco(dev) &&
 		 (amdgpu_runtime_pm != 0) &&
 		 (adev->asic_type >= CHIP_TOPAZ) &&
+		 (adev->asic_type != CHIP_FIJI) &&
 		 (adev->asic_type != CHIP_VEGA10) &&
 		 (adev->asic_type != CHIP_VEGA20) &&
 		 (adev->asic_type != CHIP_ARCTURUS)) /* enable runpm on VI+ */
--
2.26.2

2020-04-27 14:24:28

by Deucher, Alexander

Subject: RE: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state

[AMD Public Use]

> -----Original Message-----
> From: Nicholas Johnson <[email protected]>
> Sent: Sunday, April 26, 2020 12:02 PM
> To: [email protected]
> Cc: Deucher, Alexander <[email protected]>; Koenig, Christian
> <[email protected]>; Zhou, David(ChunMing)
> <[email protected]>; Nicholas Johnson <nicholas.johnson-
> [email protected]>
> Subject: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state
>
> Hi all,
>
> Since Linux v5.7-rc1 / commit 4fdda2e66de0 ("drm/amdgpu/runpm: enable
> runpm on baco capable VI+ asics"), my AMD R9 Nano has been using runpm /
> BACO. You can tell visually when it sleeps, because the fan on the graphics
> card is switched off to save power. It did not spin down the fan in v5.6.x.
>
> This is great (I love it), except that when it is sleeping, the PCIe audio function
> of the GPU has issues if anything tries to access it. You get dmesg errors such
> as these:
>
> snd_hda_intel 0000:08:00.1: spurious response 0x0:0x0, last cmd=0x170500
> snd_hda_intel 0000:08:00.1: azx_get_response timeout, switching to polling mode: last cmd=0x001f0500
> snd_hda_intel 0000:08:00.1: No response from codec, disabling MSI: last cmd=0x001f0500
> snd_hda_intel 0000:08:00.1: No response from codec, resetting bus: last cmd=0x001f0500
> snd_hda_codec_hdmi hdaudioC1D0: Unable to sync register 0x2f0d00. -11
>
> The above is with the Fiji XT GPU at 0000:08:00.0 in a Thunderbolt enclosure
> (not that Thunderbolt should affect it, but I feel I should mention it just in
> case). I dropped a lot of duplicate dmesg lines, as some of them repeated a
> lot of times before the driver gave up.
>
> I offer this patch to disable runpm for Fiji while a fix is found, if you decide
> that is the best approach. Regardless, I will gladly test any patches you come
> up with instead and confirm that the above issue has been fixed.
>
> I cannot tell if any other GPUs are affected. The only other cards to which I
> have access are a couple of AMD R9 280X (Tahiti XT), which use radeon driver
> instead of amdgpu driver.

Adding a few more people. Do you know what is accessing the audio? The audio should have a dependency on the GPU device. The GPU won't enter runtime pm until the audio has entered runtime pm and vice versa on resume. Please attach a copy of your dmesg output and lspci output.

Alex

2020-04-27 15:20:09

by Takashi Iwai

Subject: Re: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state

On Mon, 27 Apr 2020 16:22:21 +0200,
Deucher, Alexander wrote:
>
> [AMD Public Use]
>
> > -----Original Message-----
> > From: Nicholas Johnson <[email protected]>
> > Sent: Sunday, April 26, 2020 12:02 PM
> > To: [email protected]
> > Cc: Deucher, Alexander <[email protected]>; Koenig, Christian
> > <[email protected]>; Zhou, David(ChunMing)
> > <[email protected]>; Nicholas Johnson <nicholas.johnson-
> > [email protected]>
> > Subject: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state
> >
> > Hi all,
> >
> > Since Linux v5.7-rc1 / commit 4fdda2e66de0 ("drm/amdgpu/runpm: enable
> > runpm on baco capable VI+ asics"), my AMD R9 Nano has been using runpm /
> > BACO. You can tell visually when it sleeps, because the fan on the graphics
> > card is switched off to save power. It did not spin down the fan in v5.6.x.
> >
> > This is great (I love it), except that when it is sleeping, the PCIe audio function
> > of the GPU has issues if anything tries to access it. You get dmesg errors such
> > as these:
> >
> > snd_hda_intel 0000:08:00.1: spurious response 0x0:0x0, last cmd=0x170500
> > snd_hda_intel 0000:08:00.1: azx_get_response timeout, switching to polling mode: last cmd=0x001f0500
> > snd_hda_intel 0000:08:00.1: No response from codec, disabling MSI: last cmd=0x001f0500
> > snd_hda_intel 0000:08:00.1: No response from codec, resetting bus: last cmd=0x001f0500
> > snd_hda_codec_hdmi hdaudioC1D0: Unable to sync register 0x2f0d00. -11
> >
> > The above is with the Fiji XT GPU at 0000:08:00.0 in a Thunderbolt enclosure
> > (not that Thunderbolt should affect it, but I feel I should mention it just in
> > case). I dropped a lot of duplicate dmesg lines, as some of them repeated a
> > lot of times before the driver gave up.
> >
> > I offer this patch to disable runpm for Fiji while a fix is found, if you decide
> > that is the best approach. Regardless, I will gladly test any patches you come
> > up with instead and confirm that the above issue has been fixed.
> >
> > I cannot tell if any other GPUs are affected. The only other cards to which I
> > have access are a couple of AMD R9 280X (Tahiti XT), which use radeon driver
> > instead of amdgpu driver.
>
> Adding a few more people. Do you know what is accessing the audio? The audio should have a dependency on the GPU device. The GPU won't enter runtime pm until the audio has entered runtime pm and vice versa on resume. Please attach a copy of your dmesg output and lspci output.

Also, please retest with the fresh 5.7-rc3. There was a known
regression regarding HD-audio PM in 5.7-rc1/rc2, and it's been fixed
there (commit 8d6762af302d).


thanks,

Takashi

2020-04-27 17:26:24

by Nicholas Johnson

Subject: Re: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state

On Mon, Apr 27, 2020 at 05:15:55PM +0200, Takashi Iwai wrote:
> On Mon, 27 Apr 2020 16:22:21 +0200,
> Deucher, Alexander wrote:
> >
> > [AMD Public Use]
> >
> > > -----Original Message-----
> > > From: Nicholas Johnson <[email protected]>
> > > Sent: Sunday, April 26, 2020 12:02 PM
> > > To: [email protected]
> > > Cc: Deucher, Alexander <[email protected]>; Koenig, Christian
> > > <[email protected]>; Zhou, David(ChunMing)
> > > <[email protected]>; Nicholas Johnson <nicholas.johnson-
> > > [email protected]>
> > > Subject: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state
> > >
> > > Hi all,
> > >
> > > Since Linux v5.7-rc1 / commit 4fdda2e66de0 ("drm/amdgpu/runpm: enable
> > > runpm on baco capable VI+ asics"), my AMD R9 Nano has been using runpm /
> > > BACO. You can tell visually when it sleeps, because the fan on the graphics
> > > card is switched off to save power. It did not spin down the fan in v5.6.x.
> > >
> > > This is great (I love it), except that when it is sleeping, the PCIe audio function
> > > of the GPU has issues if anything tries to access it. You get dmesg errors such
> > > as these:
> > >
> > > snd_hda_intel 0000:08:00.1: spurious response 0x0:0x0, last cmd=0x170500
> > > snd_hda_intel 0000:08:00.1: azx_get_response timeout, switching to polling mode: last cmd=0x001f0500
> > > snd_hda_intel 0000:08:00.1: No response from codec, disabling MSI: last cmd=0x001f0500
> > > snd_hda_intel 0000:08:00.1: No response from codec, resetting bus: last cmd=0x001f0500
> > > snd_hda_codec_hdmi hdaudioC1D0: Unable to sync register 0x2f0d00. -11
> > >
> > > The above is with the Fiji XT GPU at 0000:08:00.0 in a Thunderbolt enclosure
> > > (not that Thunderbolt should affect it, but I feel I should mention it just in
> > > case). I dropped a lot of duplicate dmesg lines, as some of them repeated a
> > > lot of times before the driver gave up.
> > >
> > > I offer this patch to disable runpm for Fiji while a fix is found, if you decide
> > > that is the best approach. Regardless, I will gladly test any patches you come
> > > up with instead and confirm that the above issue has been fixed.
> > >
> > > I cannot tell if any other GPUs are affected. The only other cards to which I
> > > have access are a couple of AMD R9 280X (Tahiti XT), which use radeon driver
> > > instead of amdgpu driver.
> >
> > Adding a few more people. Do you know what is accessing the audio? The audio should have a dependency on the GPU device. The GPU won't enter runtime pm until the audio has entered runtime pm and vice versa on resume. Please attach a copy of your dmesg output and lspci output.

pci 0000:08:00.1: D0 power state depends on 0000:08:00.0
The above must be the dependency of which you speak from dmesg.
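A minimal sketch, added for illustration and not something run in this thread, of how the runtime PM state of the audio function and its GPU function could be compared from user space. It assumes the GPU is function 0 of the same slot as the audio function (as with 0000:08:00.0 and 0000:08:00.1 here) and reads the standard `power/runtime_status` sysfs attribute. The helper names are hypothetical.

```python
import re
from pathlib import Path

# domain:bus:device.function, e.g. 0000:08:00.1
PCI_ADDR = re.compile(r"^([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2})\.([0-7])$")


def gpu_function_of(audio_addr):
    """Map an audio function address to function 0 of the same slot (assumed GPU)."""
    match = PCI_ADDR.match(audio_addr.lower())
    if match is None:
        raise ValueError(f"not a PCI address: {audio_addr!r}")
    return f"{match.group(1)}.0"


def runtime_status(addr, root="/sys/bus/pci/devices"):
    """Read power/runtime_status for a PCI function, or None if it cannot be read."""
    try:
        return (Path(root) / addr / "power" / "runtime_status").read_text().strip()
    except OSError:
        return None


def looks_inconsistent(audio_status, gpu_status):
    """Flag the suspicious case: audio function awake while the GPU is suspended."""
    return audio_status == "active" and gpu_status == "suspended"


if __name__ == "__main__":
    audio = "0000:08:00.1"  # address taken from the dmesg lines in this thread
    gpu = gpu_function_of(audio)
    audio_state, gpu_state = runtime_status(audio), runtime_status(gpu)
    print(f"{audio}: {audio_state}")
    print(f"{gpu}: {gpu_state}")
    if looks_inconsistent(audio_state, gpu_state):
        print("audio function is active while the GPU function is suspended")
```

If the dependency works as intended, the audio function should not report `active` while the GPU function reports `suspended`; sampling both at the moment the error appears would show whether that ever happens.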

Accessing the audio? I did not have a single method for triggering it.
Sometimes it happened on shutdown. Sometimes when restarting gdm.
Sometimes when playing with audio settings in Cinnamon Desktop. But most
often when changing displays. It might have something to do with the
audio device associated with a monitor being created when the monitor is
found. If an audio device is created, then pulseaudio might touch it.
Sorry, this is a very verbose "not quite sure".

To trigger the bug, this time I did the following:

1. Boot laptop without Fiji and log in

2. Attach Fiji via Thunderbolt (no displays attached to Fiji) and
approve Thunderbolt device

3. Log in again because the session gets killed when GPU is hot-added

4. Wait for Fiji to fall asleep (fan stops)

5. Open "dmesg -w" on laptop display

6. Attach display to DisplayPort on Fiji (it should still stay asleep)

7. Do WindowsKey+P to activate external display. The error appears in
dmesg window that instant.

Could it be a race condition when waking the card up?

I cannot get the graphics card fan to spin down if the Thunderbolt
enclosure is attached at boot time. It only does it if hot-added.

If you think it will help, I can take out the Fiji and put it in a test
rig and try to replicate the issue without Thunderbolt, but it looks
like it will not spin the fan down if Fiji is attached at boot time.

Question, why would the fan not spin down if Fiji is attached at boot
time, and how would one make the said fan turn off? Aside from being
useful for pinning down the audio register issue, I would like to make
sure the power savings are realised whenever the GPU is not being used.

>
> Also, please retest with the fresh 5.7-rc3. There was a known
> regression regarding HD-audio PM in 5.7-rc1/rc2, and it's been fixed
> there (commit 8d6762af302d).
Linux v5.7-rc3 still has the same problem, unfortunately.

The dmesg is attached.

Thanks for your replies. Kind regards,
Nicholas

>
>
> thanks,
>
> Takashi


Attachments:
(No filename) (4.84 kB)
dmesg-2020-04-28 (110.21 kB)

2020-04-27 18:30:37

by Alex Deucher

Subject: Re: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state

On Mon, Apr 27, 2020 at 2:07 PM Nicholas Johnson
<[email protected]> wrote:
>
> On Mon, Apr 27, 2020 at 05:15:55PM +0200, Takashi Iwai wrote:
> > On Mon, 27 Apr 2020 16:22:21 +0200,
> > Deucher, Alexander wrote:
> > >
> > > [AMD Public Use]
> > >
> > > > -----Original Message-----
> > > > From: Nicholas Johnson <[email protected]>
> > > > Sent: Sunday, April 26, 2020 12:02 PM
> > > > To: [email protected]
> > > > Cc: Deucher, Alexander <[email protected]>; Koenig, Christian
> > > > <[email protected]>; Zhou, David(ChunMing)
> > > > <[email protected]>; Nicholas Johnson <nicholas.johnson-
> > > > [email protected]>
> > > > Subject: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state
> > > >
> > > > Hi all,
> > > >
> > > > Since Linux v5.7-rc1 / commit 4fdda2e66de0 ("drm/amdgpu/runpm: enable
> > > > runpm on baco capable VI+ asics"), my AMD R9 Nano has been using runpm /
> > > > BACO. You can tell visually when it sleeps, because the fan on the graphics
> > > > card is switched off to save power. It did not spin down the fan in v5.6.x.
> > > >
> > > > This is great (I love it), except that when it is sleeping, the PCIe audio function
> > > > of the GPU has issues if anything tries to access it. You get dmesg errors such
> > > > as these:
> > > >
> > > > snd_hda_intel 0000:08:00.1: spurious response 0x0:0x0, last cmd=0x170500
> > > > snd_hda_intel 0000:08:00.1: azx_get_response timeout, switching to polling mode: last cmd=0x001f0500
> > > > snd_hda_intel 0000:08:00.1: No response from codec, disabling MSI: last cmd=0x001f0500
> > > > snd_hda_intel 0000:08:00.1: No response from codec, resetting bus: last cmd=0x001f0500
> > > > snd_hda_codec_hdmi hdaudioC1D0: Unable to sync register 0x2f0d00. -11
> > > >
> > > > The above is with the Fiji XT GPU at 0000:08:00.0 in a Thunderbolt enclosure
> > > > (not that Thunderbolt should affect it, but I feel I should mention it just in
> > > > case). I dropped a lot of duplicate dmesg lines, as some of them repeated a
> > > > lot of times before the driver gave up.
> > > >
> > > > I offer this patch to disable runpm for Fiji while a fix is found, if you decide
> > > > that is the best approach. Regardless, I will gladly test any patches you come
> > > > up with instead and confirm that the above issue has been fixed.
> > > >
> > > > I cannot tell if any other GPUs are affected. The only other cards to which I
> > > > have access are a couple of AMD R9 280X (Tahiti XT), which use radeon driver
> > > > instead of amdgpu driver.
> > >
> > > Adding a few more people. Do you know what is accessing the audio? The audio should have a dependency on the GPU device. The GPU won't enter runtime pm until the audio has entered runtime pm and vice versa on resume. Please attach a copy of your dmesg output and lspci output.
>
> pci 0000:08:00.1: D0 power state depends on 0000:08:00.0
> The above must be the dependency of which you speak from dmesg.
>
> Accessing the audio? I did not have a single method for triggering it.
> Sometimes it happened on shutdown. Sometimes when restarting gdm.
> Sometimes when playing with audio settings in Cinnamon Desktop. But most
> often when changing displays. It might have something to do with the
> audio device associated with a monitor being created when the monitor is
> found. If an audio device is created, then pulseaudio might touch it.
> Sorry, this is a very verbose "not quite sure".
>
> To trigger the bug, this time I did the following:
>
> 1. Boot laptop without Fiji and log in
>
> 2. Attach Fiji via Thunderbolt (no displays attached to Fiji) and
> approve Thunderbolt device
>
> 3. Log in again because the session gets killed when GPU is hot-added
>
> 4. Wait for Fiji to fall asleep (fan stops)
>
> 5. Open "dmesg -w" on laptop display
>
> 6. Attach display to DisplayPort on Fiji (it should still stay asleep)
>
> 7. Do WindowsKey+P to activate external display. The error appears in
> dmesg window that instant.
>
> Could it be a race condition when waking the card up?
>
> I cannot get the graphics card fan to spin down if the Thunderbolt
> enclosure is attached at boot time. It only does it if hot-added.
>
> If you think it will help, I can take out the Fiji and put it in a test
> rig and try to replicate the issue without Thunderbolt, but it looks
> like it will not spin the fan down if Fiji is attached at boot time.
>
> Question, why would the fan not spin down if Fiji is attached at boot
> time, and how would one make the said fan turn off? Aside from being
> useful for pinning down the audio register issue, I would like to make
> sure the power savings are realised whenever the GPU is not being used.

Presumably something is using the device. Maybe a framebuffer console
or X? Or maybe something like tlp has disabled runtime pm on your
device? You can see the current status by reading the files in
/sys/class/drm/cardX/device/power/ . Replace cardX with card0, card1,
etc. depending on which device is the radeon card.
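A minimal sketch of reading those files, offered as an illustration and not as part of the original reply. It assumes the usual runtime PM attributes under `power/` (`control`, `runtime_status`, `runtime_active_time`, `runtime_suspended_time`); the function names are hypothetical. A `control` value of `on` means user space has asked for the device to stay powered, which is what a tool like tlp might set.

```python
import re
from pathlib import Path

# Standard runtime PM attributes exposed under <device>/power/ in sysfs.
POWER_FILES = (
    "control",
    "runtime_status",
    "runtime_active_time",
    "runtime_suspended_time",
)


def read_power_status(power_dir):
    """Return {attribute: value}; None marks a missing or unreadable file."""
    status = {}
    for name in POWER_FILES:
        try:
            status[name] = (Path(power_dir) / name).read_text().strip()
        except OSError:
            status[name] = None
    return status


def runpm_blocked_by_userspace(status):
    """True when power/control is 'on', i.e. runtime PM is disabled for the device."""
    return status.get("control") == "on"


if __name__ == "__main__":
    for card in sorted(Path("/sys/class/drm").glob("card*")):
        if not re.fullmatch(r"card\d+", card.name):
            continue  # skip connector entries such as card0-DP-1
        status = read_power_status(card / "device" / "power")
        print(card.name, status)
        if runpm_blocked_by_userspace(status):
            print(f"  {card.name}: power/control is 'on', runtime PM is disabled")
```

Running it with the enclosure attached at boot and again after a hot-add would show whether `control` or `runtime_status` differs between the two cases.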

FWIW, I have a fiji board in a desktop system and it worked fine when
this code was enabled.

Alex

>
> >
> > Also, please retest with the fresh 5.7-rc3. There was a known
> > regression regarding HD-audio PM in 5.7-rc1/rc2, and it's been fixed
> > there (commit 8d6762af302d).
> Linux v5.7-rc3 still has the same problem, unfortunately.
>
> The dmesg is attached.
>
> Thanks for your replies. Kind regards,
> Nicholas
>
> >
> >
> > thanks,
> >
> > Takashi
> _______________________________________________
> amd-gfx mailing list
> [email protected]
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

2020-04-27 18:43:58

by Takashi Iwai

Subject: Re: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state

On Mon, 27 Apr 2020 20:28:12 +0200,
Alex Deucher wrote:
>
> On Mon, Apr 27, 2020 at 2:07 PM Nicholas Johnson
> <[email protected]> wrote:
> >
> > On Mon, Apr 27, 2020 at 05:15:55PM +0200, Takashi Iwai wrote:
> > > On Mon, 27 Apr 2020 16:22:21 +0200,
> > > Deucher, Alexander wrote:
> > > >
> > > > [AMD Public Use]
> > > >
> > > > > -----Original Message-----
> > > > > From: Nicholas Johnson <[email protected]>
> > > > > Sent: Sunday, April 26, 2020 12:02 PM
> > > > > To: [email protected]
> > > > > Cc: Deucher, Alexander <[email protected]>; Koenig, Christian
> > > > > <[email protected]>; Zhou, David(ChunMing)
> > > > > <[email protected]>; Nicholas Johnson <nicholas.johnson-
> > > > > [email protected]>
> > > > > Subject: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state
> > > > >
> > > > > Hi all,
> > > > >
> > > > > Since Linux v5.7-rc1 / commit 4fdda2e66de0 ("drm/amdgpu/runpm: enable
> > > > > runpm on baco capable VI+ asics"), my AMD R9 Nano has been using runpm /
> > > > > BACO. You can tell visually when it sleeps, because the fan on the graphics
> > > > > card is switched off to save power. It did not spin down the fan in v5.6.x.
> > > > >
> > > > > This is great (I love it), except that when it is sleeping, the PCIe audio function
> > > > > of the GPU has issues if anything tries to access it. You get dmesg errors such
> > > > > as these:
> > > > >
> > > > > snd_hda_intel 0000:08:00.1: spurious response 0x0:0x0, last cmd=0x170500
> > > > > snd_hda_intel 0000:08:00.1: azx_get_response timeout, switching to polling mode: last cmd=0x001f0500
> > > > > snd_hda_intel 0000:08:00.1: No response from codec, disabling MSI: last cmd=0x001f0500
> > > > > snd_hda_intel 0000:08:00.1: No response from codec, resetting bus: last cmd=0x001f0500
> > > > > snd_hda_codec_hdmi hdaudioC1D0: Unable to sync register 0x2f0d00. -11
> > > > >
> > > > > The above is with the Fiji XT GPU at 0000:08:00.0 in a Thunderbolt enclosure
> > > > > (not that Thunderbolt should affect it, but I feel I should mention it just in
> > > > > case). I dropped a lot of duplicate dmesg lines, as some of them repeated a
> > > > > lot of times before the driver gave up.
> > > > >
> > > > > I offer this patch to disable runpm for Fiji while a fix is found, if you decide
> > > > > that is the best approach. Regardless, I will gladly test any patches you come
> > > > > up with instead and confirm that the above issue has been fixed.
> > > > >
> > > > > I cannot tell if any other GPUs are affected. The only other cards to which I
> > > > > have access are a couple of AMD R9 280X (Tahiti XT), which use radeon driver
> > > > > instead of amdgpu driver.
> > > >
> > > > Adding a few more people. Do you know what is accessing the audio? The audio should have a dependency on the GPU device. The GPU won't enter runtime pm until the audio has entered runtime pm and vice versa on resume. Please attach a copy of your dmesg output and lspci output.
> >
> > pci 0000:08:00.1: D0 power state depends on 0000:08:00.0
> > The above must be the dependency of which you speak from dmesg.
> >
> > Accessing the audio? I did not have a single method for triggering it.
> > Sometimes it happened on shutdown. Sometimes when restarting gdm.
> > Sometimes when playing with audio settings in Cinnamon Desktop. But most
> > often when changing displays. It might have something to do with the
> > audio device associated with a monitor being created when the monitor is
> > found. If an audio device is created, then pulseaudio might touch it.
> > Sorry, this is a very verbose "not quite sure".
> >
> > To trigger the bug, this time I did the following:
> >
> > 1. Boot laptop without Fiji and log in
> >
> > 2. Attach Fiji via Thunderbolt (no displays attached to Fiji) and
> > approve Thunderbolt device
> >
> > 3. Log in again because the session gets killed when GPU is hot-added
> >
> > 4. Wait for Fiji to fall asleep (fan stops)
> >
> > 5. Open "dmesg -w" on laptop display
> >
> > 6. Attach display to DisplayPort on Fiji (it should still stay asleep)
> >
> > 7. Do WindowsKey+P to activate external display. The error appears in
> > dmesg window that instant.
> >
> > Could it be a race condition when waking the card up?
> >
> > I cannot get the graphics card fan to spin down if the Thunderbolt
> > enclosure is attached at boot time. It only does it if hot-added.
> >
> > If you think it will help, I can take out the Fiji and put it in a test
> > rig and try to replicate the issue without Thunderbolt, but it looks
> > like it will not spin the fan down if Fiji is attached at boot time.
> >
> > Question, why would the fan not spin down if Fiji is attached at boot
> > time, and how would one make the said fan turn off? Aside from being
> > useful for pinning down the audio register issue, I would like to make
> > sure the power savings are realised whenever the GPU is not being used.
>
> Presumably something is using the device. Maybe a framebuffer console
> or X? Or maybe something like tlp has disabled runtime pm on your
> device? You can see the current status by reading the files in
> /sys/class/drm/cardX/device/power/ . Replace cardX with card0, card1,
> etc. depending on which device is the radeon card.
>
> FWIW, I have a fiji board in a desktop system and it worked fine when
> this code was enabled.

Is the new DC code used for Fiji boards? IIRC, the audio component
binding from amdgpu is enabled only for DC, and without the audio
component binding the runtime PM won't be linked up, hence you can't
power up GPU from the audio side access automatically.


Takashi

>
> Alex
>
> >
> > >
> > > Also, please retest with the fresh 5.7-rc3. There was a known
> > > regression regarding HD-audio PM in 5.7-rc1/rc2, and it's been fixed
> > > there (commit 8d6762af302d).
> > Linux v5.7-rc3 still has the same problem, unfortunately.
> >
> > The dmesg is attached.
> >
> > Thanks for your replies. Kind regards,
> > Nicholas
> >
> > >
> > >
> > > thanks,
> > >
> > > Takashi
>

2020-04-27 18:46:19

by Alex Deucher

Subject: Re: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state

On Mon, Apr 27, 2020 at 2:39 PM Takashi Iwai <[email protected]> wrote:
>
> On Mon, 27 Apr 2020 20:28:12 +0200,
> Alex Deucher wrote:
> >
> > On Mon, Apr 27, 2020 at 2:07 PM Nicholas Johnson
> > <[email protected]> wrote:
> > >
> > > On Mon, Apr 27, 2020 at 05:15:55PM +0200, Takashi Iwai wrote:
> > > > On Mon, 27 Apr 2020 16:22:21 +0200,
> > > > Deucher, Alexander wrote:
> > > > >
> > > > > [AMD Public Use]
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Nicholas Johnson <[email protected]>
> > > > > > Sent: Sunday, April 26, 2020 12:02 PM
> > > > > > To: [email protected]
> > > > > > Cc: Deucher, Alexander <[email protected]>; Koenig, Christian
> > > > > > <[email protected]>; Zhou, David(ChunMing)
> > > > > > <[email protected]>; Nicholas Johnson <nicholas.johnson-
> > > > > > [email protected]>
> > > > > > Subject: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state
> > > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > Since Linux v5.7-rc1 / commit 4fdda2e66de0 ("drm/amdgpu/runpm: enable
> > > > > > runpm on baco capable VI+ asics"), my AMD R9 Nano has been using runpm /
> > > > > > BACO. You can tell visually when it sleeps, because the fan on the graphics
> > > > > > card is switched off to save power. It did not spin down the fan in v5.6.x.
> > > > > >
> > > > > > This is great (I love it), except that when it is sleeping, the PCIe audio function
> > > > > > of the GPU has issues if anything tries to access it. You get dmesg errors such
> > > > > > as these:
> > > > > >
> > > > > > snd_hda_intel 0000:08:00.1: spurious response 0x0:0x0, last cmd=0x170500
> > > > > > snd_hda_intel 0000:08:00.1: azx_get_response timeout, switching to polling mode: last cmd=0x001f0500
> > > > > > snd_hda_intel 0000:08:00.1: No response from codec, disabling MSI: last cmd=0x001f0500
> > > > > > snd_hda_intel 0000:08:00.1: No response from codec, resetting bus: last cmd=0x001f0500
> > > > > > snd_hda_codec_hdmi hdaudioC1D0: Unable to sync register 0x2f0d00. -11
> > > > > >
> > > > > > The above is with the Fiji XT GPU at 0000:08:00.0 in a Thunderbolt enclosure
> > > > > > (not that Thunderbolt should affect it, but I feel I should mention it just in
> > > > > > case). I dropped a lot of duplicate dmesg lines, as some of them repeated a
> > > > > > lot of times before the driver gave up.
> > > > > >
> > > > > > I offer this patch to disable runpm for Fiji while a fix is found, if you decide
> > > > > > that is the best approach. Regardless, I will gladly test any patches you come
> > > > > > up with instead and confirm that the above issue has been fixed.
> > > > > >
> > > > > > I cannot tell if any other GPUs are affected. The only other cards to which I
> > > > > > have access are a couple of AMD R9 280X (Tahiti XT), which use radeon driver
> > > > > > instead of amdgpu driver.
> > > > >
> > > > > Adding a few more people. Do you know what is accessing the audio? The audio should have a dependency on the GPU device. The GPU won't enter runtime pm until the audio has entered runtime pm and vice versa on resume. Please attach a copy of your dmesg output and lspci output.
> > >
> > > pci 0000:08:00.1: D0 power state depends on 0000:08:00.0
> > > The above must be the dependency of which you speak from dmesg.
> > >
> > > Accessing the audio? I did not have a single method for triggering it.
> > > Sometimes it happened on shutdown. Sometimes when restarting gdm.
> > > Sometimes when playing with audio settings in Cinnamon Desktop. But most
> > > often when changing displays. It might have something to do with the
> > > audio device associated with a monitor being created when the monitor is
> > > found. If an audio device is created, then pulseaudio might touch it.
> > > Sorry, this is a very verbose "not quite sure".
> > >
> > > To trigger the bug, this time I did the following:
> > >
> > > 1. Boot laptop without Fiji and log in
> > >
> > > 2. Attach Fiji via Thunderbolt (no displays attached to Fiji) and
> > > approve Thunderbolt device
> > >
> > > 3. Log in again because the session gets killed when GPU is hot-added
> > >
> > > 4. Wait for Fiji to fall asleep (fan stops)
> > >
> > > 5. Open "dmesg -w" on laptop display
> > >
> > > 6. Attach display to DisplayPort on Fiji (it should still stay asleep)
> > >
> > > 7. Do WindowsKey+P to activate external display. The error appears in
> > > dmesg window that instant.
> > >
> > > Could it be a race condition when waking the card up?
> > >
> > > I cannot get the graphics card fan to spin down if the Thunderbolt
> > > enclosure is attached at boot time. It only does it if hot-added.
> > >
> > > If you think it will help, I can take out the Fiji and put it in a test
> > > rig and try to replicate the issue without Thunderbolt, but it looks
> > > like it will not spin the fan down if Fiji is attached at boot time.
> > >
> > > Question, why would the fan not spin down if Fiji is attached at boot
> > > time, and how would one make the said fan turn off? Aside from being
> > > useful for pinning down the audio register issue, I would like to make
> > > sure the power savings are realised whenever the GPU is not being used.
> >
> > Presumably something is using the device. Maybe a framebuffer console
> > or X? Or maybe something like tlp has disabled runtime pm on your
> > device? You can see the current status by reading the files in
> > /sys/class/drm/cardX/device/power/ . Replace cardX with card0, card1,
> > etc. depending on which device is the radeon card.
> >
> > FWIW, I have a fiji board in a desktop system and it worked fine when
> > this code was enabled.
>
> Is the new DC code used for Fiji boards? IIRC, the audio component
> binding from amdgpu is enabled only for DC, and without the audio
> component binding the runtime PM won't be linked up, hence you can't
> power up GPU from the audio side access automatically.
>

Yes, DC is enabled by default for all cards with runtime pm enabled.

Alex

>
> Takashi
>
> >
> > Alex
> >
> > >
> > > >
> > > > Also, please retest with the fresh 5.7-rc3. There was a known
> > > > regression regarding HD-audio PM in 5.7-rc1/rc2, and it's been fixed
> > > > there (commit 8d6762af302d).
> > > Linux v5.7-rc3 still has the same problem, unfortunately.
> > >
> > > The dmesg is attached.
> > >
> > > Thanks for your replies. Kind regards,
> > > Nicholas
> > >
> > > >
> > > >
> > > > thanks,
> > > >
> > > > Takashi
> >

2020-04-28 07:59:22

by Takashi Iwai

Subject: Re: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state

On Mon, 27 Apr 2020 20:43:54 +0200,
Alex Deucher wrote:
>
> On Mon, Apr 27, 2020 at 2:39 PM Takashi Iwai <[email protected]> wrote:
> >
> > On Mon, 27 Apr 2020 20:28:12 +0200,
> > Alex Deucher wrote:
> > >
> > > On Mon, Apr 27, 2020 at 2:07 PM Nicholas Johnson
> > > <[email protected]> wrote:
> > > >
> > > > On Mon, Apr 27, 2020 at 05:15:55PM +0200, Takashi Iwai wrote:
> > > > > On Mon, 27 Apr 2020 16:22:21 +0200,
> > > > > Deucher, Alexander wrote:
> > > > > >
> > > > > > [AMD Public Use]
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Nicholas Johnson <[email protected]>
> > > > > > > Sent: Sunday, April 26, 2020 12:02 PM
> > > > > > > To: [email protected]
> > > > > > > Cc: Deucher, Alexander <[email protected]>; Koenig, Christian
> > > > > > > <[email protected]>; Zhou, David(ChunMing)
> > > > > > > <[email protected]>; Nicholas Johnson <nicholas.johnson-
> > > > > > > [email protected]>
> > > > > > > Subject: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state
> > > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > Since Linux v5.7-rc1 / commit 4fdda2e66de0 ("drm/amdgpu/runpm: enable
> > > > > > > runpm on baco capable VI+ asics"), my AMD R9 Nano has been using runpm /
> > > > > > > BACO. You can tell visually when it sleeps, because the fan on the graphics
> > > > > > > card is switched off to save power. It did not spin down the fan in v5.6.x.
> > > > > > >
> > > > > > > This is great (I love it), except that when it is sleeping, the PCIe audio function
> > > > > > > of the GPU has issues if anything tries to access it. You get dmesg errors such
> > > > > > > as these:
> > > > > > >
> > > > > > > > snd_hda_intel 0000:08:00.1: spurious response 0x0:0x0, last cmd=0x170500
> > > > > > > > snd_hda_intel 0000:08:00.1: azx_get_response timeout, switching to polling mode: last cmd=0x001f0500
> > > > > > > > snd_hda_intel 0000:08:00.1: No response from codec, disabling MSI: last cmd=0x001f0500
> > > > > > > > snd_hda_intel 0000:08:00.1: No response from codec, resetting bus: last cmd=0x001f0500
> > > > > > > > snd_hda_codec_hdmi hdaudioC1D0: Unable to sync register 0x2f0d00. -11
> > > > > > >
> > > > > > > The above is with the Fiji XT GPU at 0000:08:00.0 in a Thunderbolt enclosure
> > > > > > > (not that Thunderbolt should affect it, but I feel I should mention it just in
> > > > > > > case). I dropped a lot of duplicate dmesg lines, as some of them repeated a
> > > > > > > lot of times before the driver gave up.
> > > > > > >
> > > > > > > I offer this patch to disable runpm for Fiji while a fix is found, if you decide
> > > > > > > that is the best approach. Regardless, I will gladly test any patches you come
> > > > > > > up with instead and confirm that the above issue has been fixed.
> > > > > > >
> > > > > > > I cannot tell if any other GPUs are affected. The only other cards to which I
> > > > > > > have access are a couple of AMD R9 280X (Tahiti XT), which use radeon driver
> > > > > > > instead of amdgpu driver.
> > > > > >
> > > > > > Adding a few more people. Do you know what is accessing the audio? The audio should have a dependency on the GPU device. The GPU won't enter runtime pm until the audio has entered runtime pm and vice versa on resume. Please attach a copy of your dmesg output and lspci output.
> > > >
> > > > pci 0000:08:00.1: D0 power state depends on 0000:08:00.0
> > > > The above must be the dependency of which you speak from dmesg.
> > > >
> > > > Accessing the audio? I did not have a single method for triggering it.
> > > > Sometimes it happened on shutdown. Sometimes when restarting gdm.
> > > > Sometimes when playing with audio settings in Cinnamon Desktop. But most
> > > > often when changing displays. It might have something to do with the
> > > > audio device associated with a monitor being created when the monitor is
> > > > found. If an audio device is created, then pulseaudio might touch it.
> > > > Sorry, this is a very verbose "not quite sure".
> > > >
> > > > To trigger the bug, this time I did the following:
> > > >
> > > > 1. Boot laptop without Fiji and log in
> > > >
> > > > 2. Attach Fiji via Thunderbolt (no displays attached to Fiji) and
> > > > approve Thunderbolt device
> > > >
> > > > 3. Log in again because the session gets killed when GPU is hot-added
> > > >
> > > > 4. Wait for Fiji to fall asleep (fan stops)
> > > >
> > > > 5. Open "dmesg -w" on laptop display
> > > >
> > > > 6. Attach display to DisplayPort on Fiji (it should still stay asleep)
> > > >
> > > > 7. Do WindowsKey+P to activate external display. The error appears in
> > > > dmesg window that instant.
> > > >
> > > > Could it be a race condition when waking the card up?
> > > >
> > > > I cannot get the graphics card fan to spin down if the Thunderbolt
> > > > enclosure is attached at boot time. It only does it if hot-added.
> > > >
> > > > If you think it will help, I can take out the Fiji and put it in a test
> > > > rig and try to replicate the issue without Thunderbolt, but it looks
> > > > like it will not spin the fan down if Fiji is attached at boot time.
> > > >
> > > > Question, why would the fan not spin down if Fiji is attached at boot
> > > > time, and how would one make the said fan turn off? Aside from being
> > > > useful for pinning down the audio register issue, I would like to make
> > > > sure the power savings are realised whenever the GPU is not being used.
> > >
> > > Presumably something is using the device. Maybe a framebuffer console
> > > or X? Or maybe the something like tlp has disabled runtime pm on your
> > > device? You can see the current status by reading the files in
> > > /sys/class/drm/cardX/device/power/ . Replace cardX with card0, card1,
> > > etc. depending on which device is the radeon card.
> > >
> > > FWIW, I have a fiji board in a desktop system and it worked fine when
> > > this code was enabled.
> >
> > Is the new DC code used for Fiji boards? IIRC, the audio component
> > binding from amdgpu is enabled only for DC, and without the audio
> > component binding the runtime PM won't be linked up, hence you can't
> > power up GPU from the audio side access automatically.
> >
>
> Yes, DC is enabled by default for all cards with runtime pm enabled.

OK, thanks, I found that amdgpu got bound via component in the dmesg
output, too:

[ 21.294927] snd_hda_intel 0000:08:00.1: bound 0000:08:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])

This is shortly after the amdgpu driver gets initialized.
Later we see another initialization phase:

[ 26.904127] rfkill: input handler enabled
[ 37.264152] [drm] PCIE GART of 1024M enabled (table at 0x000000F400000000).

There are about 10 seconds between them. Then it complained about something:


[ 37.363287] [drm] UVD initialized successfully.
[ 37.473340] [drm] VCE initialized successfully.
[ 37.477942] amdgpu 0000:08:00.0: [drm] Cannot find any crtc or sizes

... and, going further, it hits the HD-audio error:


[ 38.936624] [drm] fb mappable at 0x4B0696000
[ 38.936626] [drm] vram apper at 0x4B0000000
[ 38.936626] [drm] size 33177600
[ 38.936627] [drm] fb depth is 24
[ 38.936627] [drm] pitch is 15360
[ 38.936673] amdgpu 0000:08:00.0: fb1: amdgpudrmfb frame buffer device
[ 40.092223] snd_hda_intel 0000:08:00.1: azx_get_response timeout, switching to polling mode: last cmd=0x00170500

After this point, HD-audio communication was screwed up.

The last cmd in the above message is the AC_SET_POWER_STATE verb setting
the root node to D0, i.e. the very first command to power up the codec.
The remaining commands are also about powering up each node, so the
errors as a whole indicate that the power-up at runtime resume failed.
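
For anyone who wants to unpack the "last cmd" values by hand, here is a minimal sketch. It assumes the usual 32-bit HD-audio command layout for 12-bit verbs (codec address in bits 31:28, node ID in bits 27:20, verb in bits 19:8, parameter in bits 7:0). The names `HdaCmd` and `decode_hda_cmd` are made up for illustration and are not kernel functions.

```python
from typing import NamedTuple


class HdaCmd(NamedTuple):
    codec: int  # codec address, bits 31:28
    nid: int    # node ID (including the indirect bit), bits 27:20
    verb: int   # 12-bit verb, bits 19:8
    param: int  # 8-bit parameter, bits 7:0


def decode_hda_cmd(cmd: int) -> HdaCmd:
    """Split a raw 32-bit HD-audio command word into its fields."""
    return HdaCmd(
        codec=(cmd >> 28) & 0xF,
        nid=(cmd >> 20) & 0xFF,
        verb=(cmd >> 8) & 0xFFF,
        param=cmd & 0xFF,
    )


# The two command words that show up in the log excerpts of this thread.
for raw in (0x00170500, 0x001F0500):
    c = decode_hda_cmd(raw)
    print(f"{raw:#010x}: codec={c.codec} nid={c.nid:#04x} "
          f"verb={c.verb:#05x} param={c.param:#04x}")
```

With this layout, 0x00170500 comes out as codec 0, node 0x01, verb 0x705 with parameter 0x00, which is consistent with the set-power-state-to-D0 reading above; 0x001f0500 is the matching get verb (0xf05) on the same node.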

So, this looks to me as if the device gets runtime-resumed at a bad
moment?


thanks,

Takashi
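
Spotting quiet periods such as the roughly 10-second one noted above can be automated. The following is only a sketch: it assumes the default dmesg timestamp prefix ("[ seconds.microseconds]"), and both the helper name `find_gaps` and the 5-second threshold are arbitrary choices for illustration.

```python
import re

# Matches the "[ 37.264152]" style prefix that dmesg prints by default.
_TS = re.compile(r"^\[\s*(\d+\.\d+)\]")


def find_gaps(lines, min_gap=5.0):
    """Yield (gap_seconds, previous_line, line) for quiet periods in a log."""
    prev_ts = prev_line = None
    for line in lines:
        m = _TS.match(line)
        if not m:
            continue  # skip lines that carry no timestamp
        ts = float(m.group(1))
        if prev_ts is not None and ts - prev_ts >= min_gap:
            yield ts - prev_ts, prev_line, line
        prev_ts, prev_line = ts, line


log = [
    "[ 26.904127] rfkill: input handler enabled",
    "[ 37.264152] [drm] PCIE GART of 1024M enabled (table at 0x000000F400000000).",
    "[ 37.363287] [drm] UVD initialized successfully.",
]
for gap, before, after in find_gaps(log):
    print(f"{gap:.3f}s of silence before: {after}")
```

Fed the three lines quoted in this message, it reports a single gap, the one between the rfkill line and the PCIE GART line.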

2020-04-28 14:51:18

by Nicholas Johnson

Subject: Re: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state

On Tue, Apr 28, 2020 at 09:57:24AM +0200, Takashi Iwai wrote:
> On Mon, 27 Apr 2020 20:43:54 +0200,
> Alex Deucher wrote:
> >
> > On Mon, Apr 27, 2020 at 2:39 PM Takashi Iwai <[email protected]> wrote:
> > >
> > > On Mon, 27 Apr 2020 20:28:12 +0200,
> > > Alex Deucher wrote:
> > > >
> > > > On Mon, Apr 27, 2020 at 2:07 PM Nicholas Johnson
> > > > <[email protected]> wrote:
> > > > >
> > > > > On Mon, Apr 27, 2020 at 05:15:55PM +0200, Takashi Iwai wrote:
> > > > > > On Mon, 27 Apr 2020 16:22:21 +0200,
> > > > > > Deucher, Alexander wrote:
> > > > > > >
> > > > > > > [AMD Public Use]
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Nicholas Johnson <[email protected]>
> > > > > > > > Sent: Sunday, April 26, 2020 12:02 PM
> > > > > > > > To: [email protected]
> > > > > > > > Cc: Deucher, Alexander <[email protected]>; Koenig, Christian
> > > > > > > > <[email protected]>; Zhou, David(ChunMing)
> > > > > > > > <[email protected]>; Nicholas Johnson <nicholas.johnson-
> > > > > > > > [email protected]>
> > > > > > > > Subject: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state
> > > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > Since Linux v5.7-rc1 / commit 4fdda2e66de0 ("drm/amdgpu/runpm: enable
> > > > > > > > runpm on baco capable VI+ asics"), my AMD R9 Nano has been using runpm /
> > > > > > > > BACO. You can tell visually when it sleeps, because the fan on the graphics
> > > > > > > > card is switched off to save power. It did not spin down the fan in v5.6.x.
> > > > > > > >
> > > > > > > > This is great (I love it), except that when it is sleeping, the PCIe audio function
> > > > > > > > of the GPU has issues if anything tries to access it. You get dmesg errors such
> > > > > > > > as these:
> > > > > > > >
> > > > > > > > > snd_hda_intel 0000:08:00.1: spurious response 0x0:0x0, last cmd=0x170500
> > > > > > > > > snd_hda_intel 0000:08:00.1: azx_get_response timeout, switching to polling mode: last cmd=0x001f0500
> > > > > > > > > snd_hda_intel 0000:08:00.1: No response from codec, disabling MSI: last cmd=0x001f0500
> > > > > > > > > snd_hda_intel 0000:08:00.1: No response from codec, resetting bus: last cmd=0x001f0500
> > > > > > > > > snd_hda_codec_hdmi hdaudioC1D0: Unable to sync register 0x2f0d00. -11
> > > > > > > >
> > > > > > > > The above is with the Fiji XT GPU at 0000:08:00.0 in a Thunderbolt enclosure
> > > > > > > > (not that Thunderbolt should affect it, but I feel I should mention it just in
> > > > > > > > case). I dropped a lot of duplicate dmesg lines, as some of them repeated a
> > > > > > > > lot of times before the driver gave up.
> > > > > > > >
> > > > > > > > I offer this patch to disable runpm for Fiji while a fix is found, if you decide
> > > > > > > > that is the best approach. Regardless, I will gladly test any patches you come
> > > > > > > > up with instead and confirm that the above issue has been fixed.
> > > > > > > >
> > > > > > > > I cannot tell if any other GPUs are affected. The only other cards to which I
> > > > > > > > have access are a couple of AMD R9 280X (Tahiti XT), which use radeon driver
> > > > > > > > instead of amdgpu driver.
> > > > > > >
> > > > > > > Adding a few more people. Do you know what is accessing the audio? The audio should have a dependency on the GPU device. The GPU won't enter runtime pm until the audio has entered runtime pm and vice versa on resume. Please attach a copy of your dmesg output and lspci output.
> > > > >
> > > > > pci 0000:08:00.1: D0 power state depends on 0000:08:00.0
> > > > > The above must be the dependency of which you speak from dmesg.
> > > > >
> > > > > Accessing the audio? I did not have a single method for triggering it.
> > > > > Sometimes it happened on shutdown. Sometimes when restarting gdm.
> > > > > Sometimes when playing with audio settings in Cinnamon Desktop. But most
> > > > > often when changing displays. It might have something to do with the
> > > > > audio device associated with a monitor being created when the monitor is
> > > > > found. If an audio device is created, then pulseaudio might touch it.
> > > > > Sorry, this is a very verbose "not quite sure".
> > > > >
> > > > > To trigger the bug, this time I did the following:
> > > > >
> > > > > 1. Boot laptop without Fiji and log in
> > > > >
> > > > > 2. Attach Fiji via Thunderbolt (no displays attached to Fiji) and
> > > > > approve Thunderbolt device
> > > > >
> > > > > 3. Log in again because the session gets killed when GPU is hot-added
> > > > >
> > > > > 4. Wait for Fiji to fall asleep (fan stops)
> > > > >
> > > > > 5. Open "dmesg -w" on laptop display
> > > > >
> > > > > 6. Attach display to DisplayPort on Fiji (it should still stay asleep)
> > > > >
> > > > > 7. Do WindowsKey+P to activate external display. The error appears in
> > > > > dmesg window that instant.
> > > > >
> > > > > Could it be a race condition when waking the card up?
> > > > >
> > > > > I cannot get the graphics card fan to spin down if the Thunderbolt
> > > > > enclosure is attached at boot time. It only does it if hot-added.
> > > > >
> > > > > If you think it will help, I can take out the Fiji and put it in a test
> > > > > rig and try to replicate the issue without Thunderbolt, but it looks
> > > > > like it will not spin the fan down if Fiji is attached at boot time.
> > > > >
> > > > > Question, why would the fan not spin down if Fiji is attached at boot
> > > > > time, and how would one make the said fan turn off? Aside from being
> > > > > useful for pinning down the audio register issue, I would like to make
> > > > > sure the power savings are realised whenever the GPU is not being used.
> > > >
> > > > Presumably something is using the device. Maybe a framebuffer console
> > > > or X? Or maybe the something like tlp has disabled runtime pm on your
> > > > device? You can see the current status by reading the files in
> > > > /sys/class/drm/cardX/device/power/ . Replace cardX with card0, card1,
> > > > etc. depending on which device is the radeon card.
I had card1 = Fiji stuck awake and card2 = Fiji asleep (both in separate
Thunderbolt enclosures).

The sysfs values in /sys/class/drm/card{0,1}/device/power/ were the
same.

The "Tunables" tab in powertop did not help.

I compiled a kernel without fbcon and the behaviour did not change.

But moving from Arch to Ubuntu changed the behaviour. I am still
investigating.
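
A comparison like this can be scripted rather than done by eye. This is a rough sketch, assuming only that the power/ directory contains small text files; the helper names `read_power_attrs` and `diff_power_attrs` are invented here.

```python
from pathlib import Path


def read_power_attrs(power_dir):
    """Return {attribute name: contents} for the files in a power/ directory."""
    attrs = {}
    for entry in sorted(Path(power_dir).iterdir()):
        if not entry.is_file():
            continue
        try:
            attrs[entry.name] = entry.read_text().strip()
        except OSError as err:
            # Some attributes fail to read when they do not apply to a device.
            attrs[entry.name] = f"<unreadable: {err.strerror}>"
    return attrs


def diff_power_attrs(dir_a, dir_b):
    """Return {name: (value_a, value_b)} for the attributes that differ."""
    a, b = read_power_attrs(dir_a), read_power_attrs(dir_b)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }
```

It would be called as, for example, `diff_power_attrs("/sys/class/drm/card1/device/power", "/sys/class/drm/card2/device/power")`. Time counters such as runtime_active_time will naturally differ between two cards, so `control` and `runtime_status` are the entries worth watching.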

> > > >
> > > > FWIW, I have a fiji board in a desktop system and it worked fine when
> > > > this code was enabled.
> > >
> > > Is the new DC code used for Fiji boards? IIRC, the audio component
> > > binding from amdgpu is enabled only for DC, and without the audio
> > > component binding the runtime PM won't be linked up, hence you can't
> > > power up GPU from the audio side access automatically.
> > >
> >
> > Yes, DC is enabled by default for all cards with runtime pm enabled.
>
> OK, thanks, I found that amdgpu got bound via component in the dmesg
> output, too:
>
> [ 21.294927] snd_hda_intel 0000:08:00.1: bound 0000:08:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
>
> This is the place soon after amdgpu driver gets initialized.
> Then we see later another initialization phase:
>
> [ 26.904127] rfkill: input handler enabled
> [ 37.264152] [drm] PCIE GART of 1024M enabled (table at 0x000000F400000000).
>
> here shows 10 seconds between them. Then, it complained something:
>
>
> [ 37.363287] [drm] UVD initialized successfully.
> [ 37.473340] [drm] VCE initialized successfully.
> [ 37.477942] amdgpu 0000:08:00.0: [drm] Cannot find any crtc or sizes

The above would be me hitting WindowsKey+P to change screens, but with
no DisplayPort attached to Fiji, hence it being unable to find a crtc.

>
> ... and go further, and hitting HD-audio error:
>
That would be me having attached the DisplayPort and done WindowsKey+P
again.

>
> [ 38.936624] [drm] fb mappable at 0x4B0696000
> [ 38.936626] [drm] vram apper at 0x4B0000000
> [ 38.936626] [drm] size 33177600
> [ 38.936627] [drm] fb depth is 24
> [ 38.936627] [drm] pitch is 15360
> [ 38.936673] amdgpu 0000:08:00.0: fb1: amdgpudrmfb frame buffer device
> [ 40.092223] snd_hda_intel 0000:08:00.1: azx_get_response timeout, switching to polling mode: last cmd=0x00170500
>
> After this point, HD-audio communication was screwed up.
>
> This lastcmd in the above message is AC_SET_POWER_STATE verb for the
> root node to D0, so the very first command to power up the codec.
> The rest commands are also about the power up of each node, so the
> whole error indicate that the power up at runtime resume failed.
>
> So, this looks to me as if the device gets runtime-resumed at the bad
> moment?
It does. However, this is not going to be easy to pin down.

I moved from Arch to Ubuntu, and it behaves differently. I cannot
trigger the bug in Ubuntu. Plus, it puts the GPUs to sleep, even if
attached at boot, unlike Arch. I will continue to try to trigger it. But
even if this is a problem with the Linux distribution, it should not be
able to trigger a kernel mode bug, so we should persist with finding it.

Regards,
Nicholas

>
>
> thanks,
>
> Takashi

2020-04-29 07:41:45

by Takashi Iwai

Subject: Re: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state

On Tue, 28 Apr 2020 16:48:45 +0200,
Nicholas Johnson wrote:
>
> > > > >
> > > > > FWIW, I have a fiji board in a desktop system and it worked fine when
> > > > > this code was enabled.
> > > >
> > > > Is the new DC code used for Fiji boards? IIRC, the audio component
> > > > binding from amdgpu is enabled only for DC, and without the audio
> > > > component binding the runtime PM won't be linked up, hence you can't
> > > > power up GPU from the audio side access automatically.
> > > >
> > >
> > > Yes, DC is enabled by default for all cards with runtime pm enabled.
> >
> > OK, thanks, I found that amdgpu got bound via component in the dmesg
> > output, too:
> >
> > [ 21.294927] snd_hda_intel 0000:08:00.1: bound 0000:08:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
> >
> > This is the place soon after amdgpu driver gets initialized.
> > Then we see later another initialization phase:
> >
> > [ 26.904127] rfkill: input handler enabled
> > [ 37.264152] [drm] PCIE GART of 1024M enabled (table at 0x000000F400000000).
> >
> > here shows 10 seconds between them. Then, it complained something:
> >
> >
> > [ 37.363287] [drm] UVD initialized successfully.
> > [ 37.473340] [drm] VCE initialized successfully.
> > [ 37.477942] amdgpu 0000:08:00.0: [drm] Cannot find any crtc or sizes
>
> The above would be me hitting WindowsKey+P to change screens, but with
> no DisplayPort attached to Fiji, hence it unable to find crtc.
>
> >
> > ... and go further, and hitting HD-audio error:
> >
> That would be me having attached the DisplayPort and done WindowsKey+P
> again.
>
> > [ 38.936624] [drm] fb mappable at 0x4B0696000
> > [ 38.936626] [drm] vram apper at 0x4B0000000
> > [ 38.936626] [drm] size 33177600
> > [ 38.936627] [drm] fb depth is 24
> > [ 38.936627] [drm] pitch is 15360
> > [ 38.936673] amdgpu 0000:08:00.0: fb1: amdgpudrmfb frame buffer device
> > [ 40.092223] snd_hda_intel 0000:08:00.1: azx_get_response timeout, switching to polling mode: last cmd=0x00170500
> >
> > After this point, HD-audio communication was screwed up.
> >
> > This lastcmd in the above message is AC_SET_POWER_STATE verb for the
> > root node to D0, so the very first command to power up the codec.
> > The rest commands are also about the power up of each node, so the
> > whole error indicate that the power up at runtime resume failed.
> >
> > So, this looks to me as if the device gets runtime-resumed at the bad
> > moment?
> It does. However, this is not going to be easy to pin down.
>
> I moved from Arch to Ubuntu, and it behaves differently. I cannot
> trigger the bug in Ubuntu. Plus, it puts the GPUs asleep, even if
> attached at boot, unlike Arch. I will continue to try to trigger it. But
> even if this is a problem with the Linux distribution, it should not be
> able to trigger a kernel mode bug, so we should persist with finding it.

Sure, that's a bug to be fixed.

This made me wonder what happens if we load the HD-audio driver very
late. Could you try blacklisting the snd-hda-intel module, then loading it
manually after plugging in the DP monitor and activating it?

Also, could you track who called the problematic power-up sequence,
e.g. by adding WARN_ON_ONCE()?

Last but not least, please check the /proc/asound/card1/eld#* files (there
should be both card0 and card1, or similar, that contain eld#* files; one is
for i915 and the other for amdgpu) before and after plugging. This
shows whether the audio connection was recognized or not.


thanks,

Takashi

--- a/sound/hda/hdac_controller.c
+++ b/sound/hda/hdac_controller.c
@@ -224,6 +224,7 @@ void snd_hdac_bus_update_rirb(struct hdac_bus *bus)
 			dev_err_ratelimited(bus->dev,
 				"spurious response %#x:%#x, last cmd=%#08x\n",
 				res, res_ex, bus->last_cmd[addr]);
+			WARN_ON_ONCE(1);
 		}
 	}
 }

2020-04-29 15:32:56

by Nicholas Johnson

Subject: Re: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state

On Wed, Apr 29, 2020 at 09:37:41AM +0200, Takashi Iwai wrote:
> On Tue, 28 Apr 2020 16:48:45 +0200,
> Nicholas Johnson wrote:
> >
> > > > > >
> > > > > > FWIW, I have a fiji board in a desktop system and it worked fine when
> > > > > > this code was enabled.
> > > > >
> > > > > Is the new DC code used for Fiji boards? IIRC, the audio component
> > > > > binding from amdgpu is enabled only for DC, and without the audio
> > > > > component binding the runtime PM won't be linked up, hence you can't
> > > > > power up GPU from the audio side access automatically.
> > > > >
> > > >
> > > > Yes, DC is enabled by default for all cards with runtime pm enabled.
> > >
> > > OK, thanks, I found that amdgpu got bound via component in the dmesg
> > > output, too:
> > >
> > > [ 21.294927] snd_hda_intel 0000:08:00.1: bound 0000:08:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
> > >
> > > This is the place soon after amdgpu driver gets initialized.
> > > Then we see later another initialization phase:
> > >
> > > [ 26.904127] rfkill: input handler enabled
> > > [ 37.264152] [drm] PCIE GART of 1024M enabled (table at 0x000000F400000000).
> > >
> > > here shows 10 seconds between them. Then, it complained something:
> > >
> > >
> > > [ 37.363287] [drm] UVD initialized successfully.
> > > [ 37.473340] [drm] VCE initialized successfully.
> > > [ 37.477942] amdgpu 0000:08:00.0: [drm] Cannot find any crtc or sizes
> >
> > The above would be me hitting WindowsKey+P to change screens, but with
> > no DisplayPort attached to Fiji, hence it unable to find crtc.
> >
> > >
> > > ... and go further, and hitting HD-audio error:
> > >
> > That would be me having attached the DisplayPort and done WindowsKey+P
> > again.
> >
> > > [ 38.936624] [drm] fb mappable at 0x4B0696000
> > > [ 38.936626] [drm] vram apper at 0x4B0000000
> > > [ 38.936626] [drm] size 33177600
> > > [ 38.936627] [drm] fb depth is 24
> > > [ 38.936627] [drm] pitch is 15360
> > > [ 38.936673] amdgpu 0000:08:00.0: fb1: amdgpudrmfb frame buffer device
> > > [ 40.092223] snd_hda_intel 0000:08:00.1: azx_get_response timeout, switching to polling mode: last cmd=0x00170500
> > >
> > > After this point, HD-audio communication was screwed up.
> > >
> > > This lastcmd in the above message is AC_SET_POWER_STATE verb for the
> > > root node to D0, so the very first command to power up the codec.
> > > The rest commands are also about the power up of each node, so the
> > > whole error indicate that the power up at runtime resume failed.
> > >
> > > So, this looks to me as if the device gets runtime-resumed at the bad
> > > moment?
> > It does. However, this is not going to be easy to pin down.
> >
> > I moved from Arch to Ubuntu, and it behaves differently. I cannot
> > trigger the bug in Ubuntu. Plus, it puts the GPUs asleep, even if
> > attached at boot, unlike Arch. I will continue to try to trigger it. But
> > even if this is a problem with the Linux distribution, it should not be
> > able to trigger a kernel mode bug, so we should persist with finding it.
>
> Sure, that's a bug to be fixed.
>
> This made me thinking what happens if we load the HD-audio driver very
> late. Could you try to blacklist snd-hda-intel module, then load it
> manually after plugging the DP monitor and activating it?
Attached dmesg-blacklist-*

It is interesting. If I enable the monitor with the module unloaded, and
then load the module, I cannot trigger the bug, even when disabling the
monitor, waiting for the GPU to sleep, and then waking it again.

Even if I wake the monitor up, put it to sleep again, and then insmod while
it is sleeping, it does not cause the bug when waking again.

Is there anything special about the first time the monitor is used?

>
> Also, could you track who called the problematic power-up sequence,
> e.g. by adding WARN_ON_ONCE()?
Attached dmesg-warning

>
> Last but not least, please check /proc/asound/card1/eld#* files (there
> are both card0 and card1 or such that contain eld#* files, and one is
> for i915 and another for amdgpu) before and after plugging. This
> shows whether the audio connection was recognized or not.
Before plugging: card not yet attached, so the sysfs for that card not
yet created

After plugging (and insmod snd-hda-intel.ko):
codec#0 codec#2 eld#2.0 eld#2.1 eld#2.2 eld#2.3 eld#2.4 eld#2.5 eld#2.6 eld#2.7 eld#2.8 id pcm0c pcm0p pcm10p pcm3p pcm7p pcm8p pcm9p

Regards,
Nicholas

>
>
> thanks,
>
> Takashi
>
> --- a/sound/hda/hdac_controller.c
> +++ b/sound/hda/hdac_controller.c
> @@ -224,6 +224,7 @@ void snd_hdac_bus_update_rirb(struct hdac_bus *bus)
> dev_err_ratelimited(bus->dev,
> "spurious response %#x:%#x, last cmd=%#08x\n",
> res, res_ex, bus->last_cmd[addr]);
> + WARN_ON_ONCE(1);
> }
> }
> }


Attachments:
(No filename) (4.83 kB)
dmesg-warning (110.73 kB)
dmesg-blacklist-0 (94.62 kB)
dmesg-blacklist-1 (127.48 kB)

2020-04-29 15:47:34

by Takashi Iwai

Subject: Re: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state

On Wed, 29 Apr 2020 17:27:17 +0200,
Nicholas Johnson wrote:
>
> On Wed, Apr 29, 2020 at 09:37:41AM +0200, Takashi Iwai wrote:
> > On Tue, 28 Apr 2020 16:48:45 +0200,
> > Nicholas Johnson wrote:
> > >
> > > > > > >
> > > > > > > FWIW, I have a fiji board in a desktop system and it worked fine when
> > > > > > > this code was enabled.
> > > > > >
> > > > > > Is the new DC code used for Fiji boards? IIRC, the audio component
> > > > > > binding from amdgpu is enabled only for DC, and without the audio
> > > > > > component binding the runtime PM won't be linked up, hence you can't
> > > > > > power up GPU from the audio side access automatically.
> > > > > >
> > > > >
> > > > > Yes, DC is enabled by default for all cards with runtime pm enabled.
> > > >
> > > > OK, thanks, I found that amdgpu got bound via component in the dmesg
> > > > output, too:
> > > >
> > > > [ 21.294927] snd_hda_intel 0000:08:00.1: bound 0000:08:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
> > > >
> > > > This is the place soon after amdgpu driver gets initialized.
> > > > Then we see later another initialization phase:
> > > >
> > > > [ 26.904127] rfkill: input handler enabled
> > > > [ 37.264152] [drm] PCIE GART of 1024M enabled (table at 0x000000F400000000).
> > > >
> > > > here shows 10 seconds between them. Then, it complained something:
> > > >
> > > >
> > > > [ 37.363287] [drm] UVD initialized successfully.
> > > > [ 37.473340] [drm] VCE initialized successfully.
> > > > [ 37.477942] amdgpu 0000:08:00.0: [drm] Cannot find any crtc or sizes
> > >
> > > The above would be me hitting WindowsKey+P to change screens, but with
> > > no DisplayPort attached to Fiji, hence it unable to find crtc.
> > >
> > > >
> > > > ... and go further, and hitting HD-audio error:
> > > >
> > > That would be me having attached the DisplayPort and done WindowsKey+P
> > > again.
> > >
> > > > [ 38.936624] [drm] fb mappable at 0x4B0696000
> > > > [ 38.936626] [drm] vram apper at 0x4B0000000
> > > > [ 38.936626] [drm] size 33177600
> > > > [ 38.936627] [drm] fb depth is 24
> > > > [ 38.936627] [drm] pitch is 15360
> > > > [ 38.936673] amdgpu 0000:08:00.0: fb1: amdgpudrmfb frame buffer device
> > > > [ 40.092223] snd_hda_intel 0000:08:00.1: azx_get_response timeout, switching to polling mode: last cmd=0x00170500
> > > >
> > > > After this point, HD-audio communication was screwed up.
> > > >
> > > > This lastcmd in the above message is AC_SET_POWER_STATE verb for the
> > > > root node to D0, so the very first command to power up the codec.
> > > > The rest commands are also about the power up of each node, so the
> > > > whole error indicate that the power up at runtime resume failed.
> > > >
> > > > So, this looks to me as if the device gets runtime-resumed at the bad
> > > > moment?
> > > It does. However, this is not going to be easy to pin down.
> > >
> > > I moved from Arch to Ubuntu, and it behaves differently. I cannot
> > > trigger the bug in Ubuntu. Plus, it puts the GPUs asleep, even if
> > > attached at boot, unlike Arch. I will continue to try to trigger it. But
> > > even if this is a problem with the Linux distribution, it should not be
> > > able to trigger a kernel mode bug, so we should persist with finding it.
> >
> > Sure, that's a bug to be fixed.
> >
> > This made me thinking what happens if we load the HD-audio driver very
> > late. Could you try to blacklist snd-hda-intel module, then load it
> > manually after plugging the DP monitor and activating it?
> Attached dmesg-blacklist-*
>
> It is interesting. If I enable the monitor with the module unloaded, and
> then load the module, I cannot trigger the bug, even if disabling the
> monitor, waiting for GPU to sleep, and then waking again.
>
> Even if I wake monitor up, put to sleep again, and then insmod when
> sleeping, it does not cause bug when waking again.

Thanks, that's good news, at least.

> Is there anything special about the first time the monitor is used?

My wild guess is that the audio controller got powered up too early,
before the graphics side became ready. Basically, the HD-audio PCI
controller for HDMI audio is a shadow component of the graphics chip,
so it can't work before the GPU is set up properly.


> > Also, could you track who called the problematic power-up sequence,
> > e.g. by adding WARN_ON_ONCE()?
> Attached dmesg-warning

This showed that it's triggered by the runtime PM resume from opening
the PCM device. That is, a desktop application (most likely
PulseAudio) tried to open the stream because it detected something.

This raises the suspicion that PA received a false-positive notification
about the HDMI audio connection, so...

> > Last but not least, please check /proc/asound/card1/eld#* files (there
> > are both card0 and card1 or such that contain eld#* files, and one is
> > for i915 and another for amdgpu) before and after plugging. This
> > shows whether the audio connection was recognized or not.
> Before plugging: card not yet attached, so the sysfs for that card not
> yet created
>
> After plugging (and insmod snd-hda-intel.ko):
> codec#0 codec#2 eld#2.0 eld#2.1 eld#2.2 eld#2.3 eld#2.4 eld#2.5 eld#2.6 eld#2.7 eld#2.8 id pcm0c pcm0p pcm10p pcm3p pcm7p pcm8p pcm9p

... here comes the question again. What's interesting here is the
contents of the eld#* proc files. If, at the moment the problem appears,
any of the eld#* files wrongly shows the state as connected, it may
confuse user-space into opening the PCM stream.

Note that you should have multiple /proc/asound/card[0-9]* directories
and one of them is for i915 and another for amdgpu. The interesting
information is only about the latter.


thanks,

Takashi

2020-04-29 15:52:32

by Alex Deucher

Subject: Re: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state

On Wed, Apr 29, 2020 at 11:27 AM Nicholas Johnson
<[email protected]> wrote:
>
> On Wed, Apr 29, 2020 at 09:37:41AM +0200, Takashi Iwai wrote:
> > On Tue, 28 Apr 2020 16:48:45 +0200,
> > Nicholas Johnson wrote:
> > >
> > > > > > >
> > > > > > > FWIW, I have a fiji board in a desktop system and it worked fine when
> > > > > > > this code was enabled.
> > > > > >
> > > > > > Is the new DC code used for Fiji boards? IIRC, the audio component
> > > > > > binding from amdgpu is enabled only for DC, and without the audio
> > > > > > component binding the runtime PM won't be linked up, hence you can't
> > > > > > power up GPU from the audio side access automatically.
> > > > > >
> > > > >
> > > > > Yes, DC is enabled by default for all cards with runtime pm enabled.
> > > >
> > > > OK, thanks, I found that amdgpu got bound via component in the dmesg
> > > > output, too:
> > > >
> > > > [ 21.294927] snd_hda_intel 0000:08:00.1: bound 0000:08:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
> > > >
> > > > This is the place soon after amdgpu driver gets initialized.
> > > > Then we see later another initialization phase:
> > > >
> > > > [ 26.904127] rfkill: input handler enabled
> > > > [ 37.264152] [drm] PCIE GART of 1024M enabled (table at 0x000000F400000000).
> > > >
> > > > here shows 10 seconds between them. Then, it complained something:
> > > >
> > > >
> > > > [ 37.363287] [drm] UVD initialized successfully.
> > > > [ 37.473340] [drm] VCE initialized successfully.
> > > > [ 37.477942] amdgpu 0000:08:00.0: [drm] Cannot find any crtc or sizes
> > >
> > > The above would be me hitting WindowsKey+P to change screens, but with
> > > no DisplayPort attached to Fiji, hence it unable to find crtc.
> > >
> > > >
> > > > ... and go further, and hitting HD-audio error:
> > > >
> > > That would be me having attached the DisplayPort and done WindowsKey+P
> > > again.
> > >
> > > > [ 38.936624] [drm] fb mappable at 0x4B0696000
> > > > [ 38.936626] [drm] vram apper at 0x4B0000000
> > > > [ 38.936626] [drm] size 33177600
> > > > [ 38.936627] [drm] fb depth is 24
> > > > [ 38.936627] [drm] pitch is 15360
> > > > [ 38.936673] amdgpu 0000:08:00.0: fb1: amdgpudrmfb frame buffer device
> > > > [ 40.092223] snd_hda_intel 0000:08:00.1: azx_get_response timeout, switching to polling mode: last cmd=0x00170500
> > > >
> > > > After this point, HD-audio communication was screwed up.
> > > >
> > > > This lastcmd in the above message is AC_SET_POWER_STATE verb for the
> > > > root node to D0, so the very first command to power up the codec.
> > > > The rest commands are also about the power up of each node, so the
> > > > whole error indicate that the power up at runtime resume failed.
> > > >
> > > > So, this looks to me as if the device gets runtime-resumed at the bad
> > > > moment?
> > > It does. However, this is not going to be easy to pin down.
> > >
> > > I moved from Arch to Ubuntu, and it behaves differently. I cannot
> > > trigger the bug in Ubuntu. Plus, it puts the GPUs asleep, even if
> > > attached at boot, unlike Arch. I will continue to try to trigger it. But
> > > even if this is a problem with the Linux distribution, it should not be
> > > able to trigger a kernel mode bug, so we should persist with finding it.
> >
> > Sure, that's a bug to be fixed.
> >
> > This made me thinking what happens if we load the HD-audio driver very
> > late. Could you try to blacklist snd-hda-intel module, then load it
> > manually after plugging the DP monitor and activating it?
> Attached dmesg-blacklist-*
>
> It is interesting. If I enable the monitor with the module unloaded, and
> then load the module, I cannot trigger the bug, even if disabling the
> monitor, waiting for GPU to sleep, and then waking again.
>
> Even if I wake monitor up, put to sleep again, and then insmod when
> sleeping, it does not cause bug when waking again.
>
> Is there anything special about the first time the monitor is used?
>

What do you mean by used? Do you mean plugged in to the GPU or used
in the GUI? It might be easier to debug this without a GUI involved.
Can you try this at runlevel 3 or something equivalent for your
distro?

When the GPU is powered up, the driver gets an interrupt when a
display is hotplugged and generates an event and userspace
applications can listen for these events. When the GPU is powered
down, there's no interrupt. I think most GUIs poll GPUs periodically
to handle this case so they can detect a new display even when the GPU
is off. Maybe we are getting some sort of race here. GUI queries GPU
driver, causes GPU to wake up, checks attached displays, GPU driver
resets runtime pm timer. GPU goes back to sleep. The detection
updates the ELD data which causes the HDA driver to wake up. It
assumes the hw is on and tries to query it. In the meantime, the GPU
has already powered everything down again.

Alex

> >
> > Also, could you track who called the problematic power-up sequence,
> > e.g. by adding WARN_ON_ONCE()?
> Attached dmesg-warning
>
> >
> > Last but not least, please check /proc/asound/card1/eld#* files (there
> > are both card0 and card1 or such that contain eld#* files, and one is
> > for i915 and another for amdgpu) before and after plugging. This
> > shows whether the audio connection was recognized or not.
> Before plugging: card not yet attached, so the sysfs for that card not
> yet created
>
> After plugging (and insmod snd-hda-intel.ko):
> codec#0 codec#2 eld#2.0 eld#2.1 eld#2.2 eld#2.3 eld#2.4 eld#2.5 eld#2.6 eld#2.7 eld#2.8 id pcm0c pcm0p pcm10p pcm3p pcm7p pcm8p pcm9p
>
> Regards,
> Nicholas
>
> >
> >
> > thanks,
> >
> > Takashi
> >
> > --- a/sound/hda/hdac_controller.c
> > +++ b/sound/hda/hdac_controller.c
> > @@ -224,6 +224,7 @@ void snd_hdac_bus_update_rirb(struct hdac_bus *bus)
> > dev_err_ratelimited(bus->dev,
> > "spurious response %#x:%#x, last cmd=%#08x\n",
> > res, res_ex, bus->last_cmd[addr]);
> > + WARN_ON_ONCE(1);
> > }
> > }
> > }

2020-04-29 16:08:09

by Takashi Iwai

Subject: Re: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state

On Wed, 29 Apr 2020 17:47:47 +0200,
Alex Deucher wrote:
>
> On Wed, Apr 29, 2020 at 11:27 AM Nicholas Johnson
> <[email protected]> wrote:
> >
> > On Wed, Apr 29, 2020 at 09:37:41AM +0200, Takashi Iwai wrote:
> > > On Tue, 28 Apr 2020 16:48:45 +0200,
> > > Nicholas Johnson wrote:
> > > >
> > > > > > > >
> > > > > > > > FWIW, I have a fiji board in a desktop system and it worked fine when
> > > > > > > > this code was enabled.
> > > > > > >
> > > > > > > Is the new DC code used for Fiji boards? IIRC, the audio component
> > > > > > > binding from amdgpu is enabled only for DC, and without the audio
> > > > > > > component binding the runtime PM won't be linked up, hence you can't
> > > > > > > power up GPU from the audio side access automatically.
> > > > > > >
> > > > > >
> > > > > > Yes, DC is enabled by default for all cards with runtime pm enabled.
> > > > >
> > > > > OK, thanks, I found that amdgpu got bound via component in the dmesg
> > > > > output, too:
> > > > >
> > > > > [ 21.294927] snd_hda_intel 0000:08:00.1: bound 0000:08:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
> > > > >
> > > > > This is the place soon after amdgpu driver gets initialized.
> > > > > Then we see later another initialization phase:
> > > > >
> > > > > [ 26.904127] rfkill: input handler enabled
> > > > > [ 37.264152] [drm] PCIE GART of 1024M enabled (table at 0x000000F400000000).
> > > > >
> > > > > here shows 10 seconds between them. Then, it complained something:
> > > > >
> > > > >
> > > > > [ 37.363287] [drm] UVD initialized successfully.
> > > > > [ 37.473340] [drm] VCE initialized successfully.
> > > > > [ 37.477942] amdgpu 0000:08:00.0: [drm] Cannot find any crtc or sizes
> > > >
> > > > The above would be me hitting WindowsKey+P to change screens, but with
> > > > no DisplayPort attached to Fiji, hence it unable to find crtc.
> > > >
> > > > >
> > > > > ... and go further, and hitting HD-audio error:
> > > > >
> > > > That would be me having attached the DisplayPort and done WindowsKey+P
> > > > again.
> > > >
> > > > > [ 38.936624] [drm] fb mappable at 0x4B0696000
> > > > > [ 38.936626] [drm] vram apper at 0x4B0000000
> > > > > [ 38.936626] [drm] size 33177600
> > > > > [ 38.936627] [drm] fb depth is 24
> > > > > [ 38.936627] [drm] pitch is 15360
> > > > > [ 38.936673] amdgpu 0000:08:00.0: fb1: amdgpudrmfb frame buffer device
> > > > > [ 40.092223] snd_hda_intel 0000:08:00.1: azx_get_response timeout, switching to polling mode: last cmd=0x00170500
> > > > >
> > > > > After this point, HD-audio communication was screwed up.
> > > > >
> > > > > This lastcmd in the above message is AC_SET_POWER_STATE verb for the
> > > > > root node to D0, so the very first command to power up the codec.
> > > > > The rest commands are also about the power up of each node, so the
> > > > > whole error indicate that the power up at runtime resume failed.
> > > > >
> > > > > So, this looks to me as if the device gets runtime-resumed at the bad
> > > > > moment?
> > > > It does. However, this is not going to be easy to pin down.
> > > >
> > > > I moved from Arch to Ubuntu, and it behaves differently. I cannot
> > > > trigger the bug in Ubuntu. Plus, it puts the GPUs asleep, even if
> > > > attached at boot, unlike Arch. I will continue to try to trigger it. But
> > > > even if this is a problem with the Linux distribution, it should not be
> > > > able to trigger a kernel mode bug, so we should persist with finding it.
> > >
> > > Sure, that's a bug to be fixed.
> > >
> > > This made me thinking what happens if we load the HD-audio driver very
> > > late. Could you try to blacklist snd-hda-intel module, then load it
> > > manually after plugging the DP monitor and activating it?
> > Attached dmesg-blacklist-*
> >
> > It is interesting. If I enable the monitor with the module unloaded, and
> > then load the module, I cannot trigger the bug, even if disabling the
> > monitor, waiting for GPU to sleep, and then waking again.
> >
> > Even if I wake monitor up, put to sleep again, and then insmod when
> > sleeping, it does not cause bug when waking again.
> >
> > Is there anything special about the first time the monitor is used?
> >
>
> What do you mean by used? Do you mean plugged in to the GPU or used
> in the GUI? It might be easier to debug this without a GUI involved.
> Can you try this at runlevel 3 or something equivalent for your
> distro?
>
> When the GPU is powered up, the driver gets an interrupt when a
> display is hotplugged and generates an event and userspace
> applications can listen for these events. When the GPU is powered
> down, there's no interrupt. I think most GUIs poll GPUs periodically
> to handle this case so they can detect a new display even when the GPU
> is off. Maybe we are getting some sort of race here. GUI queries GPU
> driver, causes GPU to wake up, checks attached displays, GPU driver
> resets runtime pm timer. GPU goes back to sleep. The detection
> updates the ELD data which causes the HDA driver to wake up. It
> assumes the hw is on and tries to query it. In the meantime, the GPU
> has already powered everything down again.

Well, but the code path there is the runtime PM resume of the audio
device, which means that the GPU must have been runtime-resumed again
beforehand via the device link. So, it should have worked from the
beginning, but in reality it did not -- that is, apparently there is
some inconsistency in the initial attempt of the runtime resume...
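
For reference, the command that fails in this runtime-resume path (last cmd=0x00170500 in the dmesg quoted above) can be split into its fields by hand. The sketch below is a userspace illustration only. It assumes the usual 12-bit-verb command layout (codec address, node ID, verb, 8-bit payload) and ignores bit 27 (indirect node addressing). The names hda_cmd_fields, decode_hda_cmd, encode_hda_cmd and print_hda_cmd are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Verb IDs from the HD-audio specification (see include/sound/hda_verbs.h). */
#define VERB_SET_POWER_STATE 0x705u
#define VERB_GET_POWER_STATE 0xf05u

/* Hypothetical helper type for this sketch; not a kernel structure. */
struct hda_cmd_fields {
	unsigned int codec_addr; /* bits 31:28 */
	unsigned int nid;        /* bits 26:20, node ID */
	unsigned int verb;       /* bits 19:8, 12-bit verb ID */
	unsigned int param;      /* bits 7:0, verb payload */
};

/* Split a 32-bit command word, as printed after "last cmd=", into fields. */
struct hda_cmd_fields decode_hda_cmd(uint32_t cmd)
{
	struct hda_cmd_fields f;

	f.codec_addr = (cmd >> 28) & 0xfu;
	f.nid = (cmd >> 20) & 0x7fu;
	f.verb = (cmd >> 8) & 0xfffu;
	f.param = cmd & 0xffu;
	return f;
}

/* Inverse operation, with a range check on every field. */
uint32_t encode_hda_cmd(unsigned int codec_addr, unsigned int nid,
			unsigned int verb, unsigned int param)
{
	assert(codec_addr <= 0xfu && nid <= 0x7fu);
	assert(verb <= 0xfffu && param <= 0xffu);
	return ((uint32_t)codec_addr << 28) | ((uint32_t)nid << 20) |
	       ((uint32_t)verb << 8) | (uint32_t)param;
}

/* Print one decoded command in a human-readable form. */
void print_hda_cmd(uint32_t cmd)
{
	struct hda_cmd_fields f = decode_hda_cmd(cmd);
	const char *name = f.verb == VERB_SET_POWER_STATE ? "SET_POWER_STATE" :
			   f.verb == VERB_GET_POWER_STATE ? "GET_POWER_STATE" :
			   "other";

	printf("cmd=0x%08x codec=%u nid=0x%02x verb=0x%03x (%s) param=0x%02x\n",
	       (unsigned int)cmd, f.codec_addr, f.nid, f.verb, name, f.param);
}
```

Decoding 0x00170500 this way gives codec 0, node 0x01, verb 0x705 (set power state) and payload 0x00 (D0), which matches the description of it as the very first power-up command. The 0x001f0500 seen in the original report decodes to verb 0xf05, the corresponding "get power state" query on the same node.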


Takashi

2020-04-29 16:22:40

by Alex Deucher

Subject: Re: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state

On Wed, Apr 29, 2020 at 12:05 PM Takashi Iwai <[email protected]> wrote:
>
> On Wed, 29 Apr 2020 17:47:47 +0200,
> Alex Deucher wrote:
> >
> > On Wed, Apr 29, 2020 at 11:27 AM Nicholas Johnson
> > <[email protected]> wrote:
> > >
> > > On Wed, Apr 29, 2020 at 09:37:41AM +0200, Takashi Iwai wrote:
> > > > On Tue, 28 Apr 2020 16:48:45 +0200,
> > > > Nicholas Johnson wrote:
> > > > >
> > > > > > > > >
> > > > > > > > > FWIW, I have a fiji board in a desktop system and it worked fine when
> > > > > > > > > this code was enabled.
> > > > > > > >
> > > > > > > > Is the new DC code used for Fiji boards? IIRC, the audio component
> > > > > > > > binding from amdgpu is enabled only for DC, and without the audio
> > > > > > > > component binding the runtime PM won't be linked up, hence you can't
> > > > > > > > power up GPU from the audio side access automatically.
> > > > > > > >
> > > > > > >
> > > > > > > Yes, DC is enabled by default for all cards with runtime pm enabled.
> > > > > >
> > > > > > OK, thanks, I found that amdgpu got bound via component in the dmesg
> > > > > > output, too:
> > > > > >
> > > > > > [ 21.294927] snd_hda_intel 0000:08:00.1: bound 0000:08:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
> > > > > >
> > > > > > This is the place soon after amdgpu driver gets initialized.
> > > > > > Then we see later another initialization phase:
> > > > > >
> > > > > > [ 26.904127] rfkill: input handler enabled
> > > > > > [ 37.264152] [drm] PCIE GART of 1024M enabled (table at 0x000000F400000000).
> > > > > >
> > > > > > here shows 10 seconds between them. Then, it complained something:
> > > > > >
> > > > > >
> > > > > > [ 37.363287] [drm] UVD initialized successfully.
> > > > > > [ 37.473340] [drm] VCE initialized successfully.
> > > > > > [ 37.477942] amdgpu 0000:08:00.0: [drm] Cannot find any crtc or sizes
> > > > >
> > > > > The above would be me hitting WindowsKey+P to change screens, but with
> > > > > no DisplayPort attached to Fiji, hence it unable to find crtc.
> > > > >
> > > > > >
> > > > > > ... and go further, and hitting HD-audio error:
> > > > > >
> > > > > That would be me having attached the DisplayPort and done WindowsKey+P
> > > > > again.
> > > > >
> > > > > > [ 38.936624] [drm] fb mappable at 0x4B0696000
> > > > > > [ 38.936626] [drm] vram apper at 0x4B0000000
> > > > > > [ 38.936626] [drm] size 33177600
> > > > > > [ 38.936627] [drm] fb depth is 24
> > > > > > [ 38.936627] [drm] pitch is 15360
> > > > > > [ 38.936673] amdgpu 0000:08:00.0: fb1: amdgpudrmfb frame buffer device
> > > > > > [ 40.092223] snd_hda_intel 0000:08:00.1: azx_get_response timeout, switching to polling mode: last cmd=0x00170500
> > > > > >
> > > > > > After this point, HD-audio communication was screwed up.
> > > > > >
> > > > > > This lastcmd in the above message is AC_SET_POWER_STATE verb for the
> > > > > > root node to D0, so the very first command to power up the codec.
> > > > > > The rest commands are also about the power up of each node, so the
> > > > > > whole error indicate that the power up at runtime resume failed.
> > > > > >
> > > > > > So, this looks to me as if the device gets runtime-resumed at the bad
> > > > > > moment?
> > > > > It does. However, this is not going to be easy to pin down.
> > > > >
> > > > > I moved from Arch to Ubuntu, and it behaves differently. I cannot
> > > > > trigger the bug in Ubuntu. Plus, it puts the GPUs asleep, even if
> > > > > attached at boot, unlike Arch. I will continue to try to trigger it. But
> > > > > even if this is a problem with the Linux distribution, it should not be
> > > > > able to trigger a kernel mode bug, so we should persist with finding it.
> > > >
> > > > Sure, that's a bug to be fixed.
> > > >
> > > > This made me thinking what happens if we load the HD-audio driver very
> > > > late. Could you try to blacklist snd-hda-intel module, then load it
> > > > manually after plugging the DP monitor and activating it?
> > > Attached dmesg-blacklist-*
> > >
> > > It is interesting. If I enable the monitor with the module unloaded, and
> > > then load the module, I cannot trigger the bug, even if disabling the
> > > monitor, waiting for GPU to sleep, and then waking again.
> > >
> > > Even if I wake monitor up, put to sleep again, and then insmod when
> > > sleeping, it does not cause bug when waking again.
> > >
> > > Is there anything special about the first time the monitor is used?
> > >
> >
> > What do you mean by used? Do you mean plugged in to the GPU or used
> > in the GUI? It might be easier to debug this without a GUI involved.
> > Can you try this at runlevel 3 or something equivalent for your
> > distro?
> >
> > When the GPU is powered up, the driver gets an interrupt when a
> > display is hotplugged and generates an event and userspace
> > applications can listen for these events. When the GPU is powered
> > down, there's no interrupt. I think most GUIs poll GPUs periodically
> > to handle this case so they can detect a new display even when the GPU
> > is off. Maybe we are getting some sort of race here. GUI queries GPU
> > driver, causes GPU to wake up, checks attached displays, GPU driver
> > resets runtime pm timer. GPU goes back to sleep. The detection
> > updates the ELD data which causes the HDA driver to wake up. It
> > assumes the hw is on and tries to query it. In the meantime, the GPU
> > has already powered everything down again.
>
> Well, but the code path there is the runtime PM resume of the audio
> device and it means that GPU must have been runtime-resumed again
> beforehand via the device link. So, it should have worked from the
> beginning but in reality not -- that is, apparently some inconsistency
> is found in the initial attempt of the runtime resume...

Yeah, it should be covered, but I wonder if there is something in the
ELD update sequence that needs to call pm_runtime_get_sync()? The ELD
sequence on AMD GPUs doesn't work the same as on other vendors. The
GPU driver has a backdoor into the HDA device's verbs to update
the audio state rather than doing it via an ELD buffer update. We
still update the ELD buffer for consistency. Maybe when the GPU
driver sets the audio state at monitor detection time that triggers an
interrupt or something on the HDA side which races with the CPU and
the power down of the GPU. That still seems unlikely though since the
runtime pm on the GPU side defaults to a 5 second suspend timer.

Alex

2020-04-30 15:17:10

by Takashi Iwai

Subject: Re: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state

On Wed, 29 Apr 2020 18:19:57 +0200,
Alex Deucher wrote:
>
> On Wed, Apr 29, 2020 at 12:05 PM Takashi Iwai <[email protected]> wrote:
> > Well, but the code path there is the runtime PM resume of the audio
> > device and it means that GPU must have been runtime-resumed again
> > beforehand via the device link. So, it should have worked from the
> > beginning but in reality not -- that is, apparently some inconsistency
> > is found in the initial attempt of the runtime resume...
>
> Yeah, it should be covered, but I wonder if there is something in the
> ELD update sequence that needs to call pm_runtime_get_sync()? The ELD
> sequence on AMD GPUs doesn't work the same as on other vendors. The
> GPU driver has a backdoor into the HDA device's verbs to set update
> the audio state rather than doing it via an ELD buffer update. We
> still update the ELD buffer for consistency. Maybe when the GPU
> driver sets the audio state at monitor detection time that triggers an
> interrupt or something on the HDA side which races with the CPU and
> the power down of the GPU. That still seems unlikely though since the
> runtime pm on the GPU side defaults to a 5 second suspend timer.

I'm not sure whether it's a race between the runtime suspend of the GPU
and the runtime resume of audio. My wild guess is rather that it's about
the timing at which the GPU notifies the audio driver: the audio driver
then notifies user-space, and user-space opens the stream, which in turn
invokes the runtime resume of the GPU. But the GPU side is still busy
processing, so the audio access proceeds before the GPU finishes its
initialization job.

Nicholas, could you try the patch below and see whether the problem
still appears? The patch artificially delays the notification and ELD
update by 300 msec. If this works, it points to a timing problem.


thanks,

Takashi

--- a/sound/pci/hda/patch_hdmi.c
+++ b/sound/pci/hda/patch_hdmi.c
@@ -767,6 +767,7 @@ static void check_presence_and_report(struct hda_codec *codec, hda_nid_t nid,
 	if (pin_idx < 0)
 		return;
 	mutex_lock(&spec->pcm_lock);
+	get_pin(spec, pin_idx)->repoll_count = 1;
 	hdmi_present_sense(get_pin(spec, pin_idx), 1);
 	mutex_unlock(&spec->pcm_lock);
 }
@@ -1647,7 +1648,10 @@ static void sync_eld_via_acomp(struct hda_codec *codec,
 				      per_pin->dev_id, &eld->monitor_present,
 				      eld->eld_buffer, ELD_MAX_SIZE);
 	eld->eld_valid = (eld->eld_size > 0);
-	update_eld(codec, per_pin, eld, 0);
+	if (per_pin->repoll_count)
+		schedule_delayed_work(&per_pin->work, msecs_to_jiffies(300));
+	else
+		update_eld(codec, per_pin, eld, 0);
 	mutex_unlock(&per_pin->lock);
 }

@@ -1669,6 +1673,11 @@ static void hdmi_repoll_eld(struct work_struct *work)
 	struct hdmi_spec *spec = codec->spec;
 	struct hda_jack_tbl *jack;

+	if (codec_has_acomp(codec)) {
+		per_pin->repoll_count = 0;
+		goto check;
+	}
+
 	jack = snd_hda_jack_tbl_get_mst(codec, per_pin->pin_nid,
 					per_pin->dev_id);
 	if (jack)
@@ -1677,6 +1686,7 @@ static void hdmi_repoll_eld(struct work_struct *work)
 	if (per_pin->repoll_count++ > 6)
 		per_pin->repoll_count = 0;

+ check:
 	mutex_lock(&spec->pcm_lock);
 	hdmi_present_sense(per_pin, per_pin->repoll_count);
 	mutex_unlock(&spec->pcm_lock);
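
The delay in the patch above is handed to schedule_delayed_work() in jiffies, so the only value a tester needs to change is the millisecond argument of msecs_to_jiffies(). As a rough illustration of what that conversion yields, here is a simplified userspace model. It is not the kernel implementation; it only rounds up to whole ticks. The function names and the hz parameter (standing in for the kernel's build-time CONFIG_HZ) are assumptions made for this sketch, and 1000 and 3000 are merely example values next to the 300 msec used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Simplified userspace model of msecs_to_jiffies(): convert milliseconds
 * to timer ticks, rounding up. This is NOT the kernel implementation;
 * hz stands in for the kernel's build-time CONFIG_HZ value.
 */
unsigned long long model_msecs_to_jiffies(unsigned int ms, unsigned int hz)
{
	assert(hz > 0);
	return ((unsigned long long)ms * hz + 999ULL) / 1000ULL;
}

/* Print the tick counts for the patch's 300 ms and two example values. */
void print_delay_table(unsigned int hz)
{
	static const unsigned int delays_ms[] = { 300, 1000, 3000 };
	size_t i;

	for (i = 0; i < sizeof(delays_ms) / sizeof(delays_ms[0]); i++)
		printf("%4u ms -> %llu jiffies at HZ=%u\n", delays_ms[i],
		       model_msecs_to_jiffies(delays_ms[i], hz), hz);
}
```

With an assumed HZ of 250, the 300 msec in the patch corresponds to 75 ticks. The real tick rate depends on the kernel configuration, so the numbers are only indicative.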

2020-04-30 16:54:25

by Nicholas Johnson

Subject: Re: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state

On Thu, Apr 30, 2020 at 05:14:56PM +0200, Takashi Iwai wrote:
> On Wed, 29 Apr 2020 18:19:57 +0200,
> Alex Deucher wrote:
> >
> > On Wed, Apr 29, 2020 at 12:05 PM Takashi Iwai <[email protected]> wrote:
> > > Well, but the code path there is the runtime PM resume of the audio
> > > device and it means that GPU must have been runtime-resumed again
> > > beforehand via the device link. So, it should have worked from the
> > > beginning but in reality not -- that is, apparently some inconsistency
> > > is found in the initial attempt of the runtime resume...
> >
> > Yeah, it should be covered, but I wonder if there is something in the
> > ELD update sequence that needs to call pm_runtime_get_sync()? The ELD
> > sequence on AMD GPUs doesn't work the same as on other vendors. The
> > GPU driver has a backdoor into the HDA device's verbs to set update
> > the audio state rather than doing it via an ELD buffer update. We
> > still update the ELD buffer for consistency. Maybe when the GPU
> > driver sets the audio state at monitor detection time that triggers an
> > interrupt or something on the HDA side which races with the CPU and
> > the power down of the GPU. That still seems unlikely though since the
> > runtime pm on the GPU side defaults to a 5 second suspend timer.
>
> I'm not sure whether it's the race between runtime suspend of GPU vs
> runtime resume of audio. My wild guess is rather that it's the timing
> GPU notifies to the audio; then the audio driver notifies to
> user-space and user-space opens the stream, which in turn invokes the
> runtime resume of GPU. But in GPU side, it's still under processing,
> so it proceeds before the GPU finishes its initialization job.
>
> Nicholas, could you try the patch below and see whether the problem
> still appears? The patch artificially delays the notification and ELD
> update for 300msec. If this works, it means the timing problem.
The bug still occurred after applying the patch.

But you were absolutely correct - it just needed to be increased to
3000ms - then the bug stopped.

Now the question is, what do we do now that we know this?

Also, are you still interested in the contents of the ELD# files? I can
dump them all into a file at some specific moment in time which you
request, if needed.

Thanks.
Regards, Nicholas

>
>
> thanks,
>
> Takashi
>
> --- a/sound/pci/hda/patch_hdmi.c
> +++ b/sound/pci/hda/patch_hdmi.c
> @@ -767,6 +767,7 @@ static void check_presence_and_report(struct hda_codec *codec, hda_nid_t nid,
> if (pin_idx < 0)
> return;
> mutex_lock(&spec->pcm_lock);
> + get_pin(spec, pin_idx)->repoll_count = 1;
> hdmi_present_sense(get_pin(spec, pin_idx), 1);
> mutex_unlock(&spec->pcm_lock);
> }
> @@ -1647,7 +1648,10 @@ static void sync_eld_via_acomp(struct hda_codec *codec,
> per_pin->dev_id, &eld->monitor_present,
> eld->eld_buffer, ELD_MAX_SIZE);
> eld->eld_valid = (eld->eld_size > 0);
> - update_eld(codec, per_pin, eld, 0);
> + if (per_pin->repoll_count)
> + schedule_delayed_work(&per_pin->work, msecs_to_jiffies(300));
> + else
> + update_eld(codec, per_pin, eld, 0);
> mutex_unlock(&per_pin->lock);
> }
>
> @@ -1669,6 +1673,11 @@ static void hdmi_repoll_eld(struct work_struct *work)
> struct hdmi_spec *spec = codec->spec;
> struct hda_jack_tbl *jack;
>
> + if (codec_has_acomp(codec)) {
> + per_pin->repoll_count = 0;
> + goto check;
> + }
> +
> jack = snd_hda_jack_tbl_get_mst(codec, per_pin->pin_nid,
> per_pin->dev_id);
> if (jack)
> @@ -1677,6 +1686,7 @@ static void hdmi_repoll_eld(struct work_struct *work)
> if (per_pin->repoll_count++ > 6)
> per_pin->repoll_count = 0;
>
> + check:
> mutex_lock(&spec->pcm_lock);
> hdmi_present_sense(per_pin, per_pin->repoll_count);
> mutex_unlock(&spec->pcm_lock);

2020-04-30 17:03:46

by Takashi Iwai

Subject: Re: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state

On Thu, 30 Apr 2020 18:52:20 +0200,
Nicholas Johnson wrote:
>
> On Thu, Apr 30, 2020 at 05:14:56PM +0200, Takashi Iwai wrote:
> > On Wed, 29 Apr 2020 18:19:57 +0200,
> > Alex Deucher wrote:
> > >
> > > On Wed, Apr 29, 2020 at 12:05 PM Takashi Iwai <[email protected]> wrote:
> > > > Well, but the code path there is the runtime PM resume of the audio
> > > > device and it means that GPU must have been runtime-resumed again
> > > > beforehand via the device link. So, it should have worked from the
> > > > beginning but in reality not -- that is, apparently some inconsistency
> > > > is found in the initial attempt of the runtime resume...
> > >
> > > Yeah, it should be covered, but I wonder if there is something in the
> > > ELD update sequence that needs to call pm_runtime_get_sync()? The ELD
> > > sequence on AMD GPUs doesn't work the same as on other vendors. The
> > > GPU driver has a backdoor into the HDA device's verbs to set update
> > > the audio state rather than doing it via an ELD buffer update. We
> > > still update the ELD buffer for consistency. Maybe when the GPU
> > > driver sets the audio state at monitor detection time that triggers an
> > > interrupt or something on the HDA side which races with the CPU and
> > > the power down of the GPU. That still seems unlikely though since the
> > > runtime pm on the GPU side defaults to a 5 second suspend timer.
> >
> > I'm not sure whether it's the race between runtime suspend of GPU vs
> > runtime resume of audio. My wild guess is rather that it's the timing
> > GPU notifies to the audio; then the audio driver notifies to
> > user-space and user-space opens the stream, which in turn invokes the
> > runtime resume of GPU. But in GPU side, it's still under processing,
> > so it proceeds before the GPU finishes its initialization job.
> >
> > Nicholas, could you try the patch below and see whether the problem
> > still appears? The patch artificially delays the notification and ELD
> > update for 300msec. If this works, it means the timing problem.
> The bug still occurred after applying the patch.
>
> But you were absolutely correct - it just needed to be increased to
> 3000ms - then the bug stopped.

Interesting. 3 seconds is too long, but I guess 1 second would work
as well?

In any case, the success with a long delay means that the sound setup
works once the GPU has fully completed its runtime resume.

> Now the question is, what do we do now that we know this?
>
> Also, are you still interested in the contents of the ELD# files? I can
> dump them all into a file at some specific moment in time which you
> request, if needed.

Yes, please take a snapshot before plugging, right after plugging,
and right after enabling. I'm not sure whether your monitor supports
audio; the ELD contents should show that, at least.


thanks,

Takashi

2020-04-30 17:41:01

by Nicholas Johnson

Subject: Re: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state

On Thu, Apr 30, 2020 at 07:01:08PM +0200, Takashi Iwai wrote:
> On Thu, 30 Apr 2020 18:52:20 +0200,
> Nicholas Johnson wrote:
> >
> > On Thu, Apr 30, 2020 at 05:14:56PM +0200, Takashi Iwai wrote:
> > > On Wed, 29 Apr 2020 18:19:57 +0200,
> > > Alex Deucher wrote:
> > > >
> > > > On Wed, Apr 29, 2020 at 12:05 PM Takashi Iwai <[email protected]> wrote:
> > > > > Well, but the code path there is the runtime PM resume of the audio
> > > > > device and it means that GPU must have been runtime-resumed again
> > > > > beforehand via the device link. So, it should have worked from the
> > > > > beginning but in reality not -- that is, apparently some inconsistency
> > > > > is found in the initial attempt of the runtime resume...
> > > >
> > > > Yeah, it should be covered, but I wonder if there is something in the
> > > > ELD update sequence that needs to call pm_runtime_get_sync()? The ELD
> > > > sequence on AMD GPUs doesn't work the same as on other vendors. The
> > > > GPU driver has a backdoor into the HDA device's verbs to set update
> > > > the audio state rather than doing it via an ELD buffer update. We
> > > > still update the ELD buffer for consistency. Maybe when the GPU
> > > > driver sets the audio state at monitor detection time that triggers an
> > > > interrupt or something on the HDA side which races with the CPU and
> > > > the power down of the GPU. That still seems unlikely though since the
> > > > runtime pm on the GPU side defaults to a 5 second suspend timer.
> > >
> > > I'm not sure whether it's the race between runtime suspend of GPU vs
> > > runtime resume of audio. My wild guess is rather that it's the timing
> > > GPU notifies to the audio; then the audio driver notifies to
> > > user-space and user-space opens the stream, which in turn invokes the
> > > runtime resume of GPU. But in GPU side, it's still under processing,
> > > so it proceeds before the GPU finishes its initialization job.
> > >
> > > Nicholas, could you try the patch below and see whether the problem
> > > still appears? The patch artificially delays the notification and ELD
> > > update for 300msec. If this works, it means the timing problem.
> > The bug still occurred after applying the patch.
> >
> > But you were absolutely correct - it just needed to be increased to
> > 3000ms - then the bug stopped.
>
> Interesting. 3 seconds are too long, but I guess 1 second would work
> as well?
1000ms indeed worked as well.
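
Taken together, the three trials so far (300 msec fails; 1000 msec and 3000 msec work) bracket the delay this setup needs. The bookkeeping can be sketched as below. The type and function names are hypothetical, and since this is a race observed on a single machine, the resulting interval is indicative rather than a hard bound.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* One experiment: the delay patched in, and whether the bug appeared. */
struct delay_trial {
	unsigned int delay_ms;
	int bug_seen;
};

/* Hypothetical result type: the interval that should contain the threshold. */
struct delay_bracket {
	unsigned int longest_failing_ms;  /* stays 0 if no trial failed */
	unsigned int shortest_working_ms; /* stays UINT_MAX if no trial worked */
};

/* Scan all trials and keep the tightest failing/working pair. */
struct delay_bracket bracket_delay(const struct delay_trial *trials, size_t n)
{
	struct delay_bracket b = { 0, UINT_MAX };
	size_t i;

	assert(trials != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (trials[i].bug_seen) {
			if (trials[i].delay_ms > b.longest_failing_ms)
				b.longest_failing_ms = trials[i].delay_ms;
		} else if (trials[i].delay_ms < b.shortest_working_ms) {
			b.shortest_working_ms = trials[i].delay_ms;
		}
	}
	return b;
}
```

Feeding in the three results above yields a longest failing delay of 300 msec and a shortest working delay of 1000 msec, so the time the GPU needs before the audio side can be touched safely appears to lie somewhere between those two values on this system.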

>
> In anyway, the success with a long delay means that the sound setup
> after the full runtime resume of GPU seems working.
>
> > Now the question is, what do we do now that we know this?
> >
> > Also, are you still interested in the contents of the ELD# files? I can
> > dump them all into a file at some specific moment in time which you
> > request, if needed.
>
> Yes, please take the snapshot before plugging, right after plugging
> and right after enabling. I'm not sure whether your monitor supports
> the audio, and ELD contents should show that, at least.
The monitor supports audio. There is a 3.5mm audio out jack, but no
inbuilt speakers, although Samsung did sell a sound bar to suit it. The
sound bar, which I do not own, presumably attaches via the 3.5mm jack.

I am not sure if by plugging, you mean hot-adding Thunderbolt GPU or
plugging the monitor to the GPU, so I have covered extra cases to be
sure. I have taken the eld# files with the 1000ms patch applied, so the
error is not triggered.

####
Before hot-adding the Thunderbolt GPU:
/proc/asound/card1 not present
####
####
After hot-adding the GPU with no monitor attached:

/proc/asound/card1 contains:
eld#0.0 eld#0.1 eld#0.2 eld#0.3 eld#0.4 eld#0.5

All of the above have the same contents:

monitor_present 0
eld_valid 0
####
####
Monitor attached to Fiji GPU but not enabled:

Same as above
####
####
Monitor enabled:

All files with same contents except for eld#0.1 which looks like:

monitor_present 1
eld_valid 1
monitor_name U32E850
connection_type DisplayPort
eld_version [0x2] CEA-861D or below
edid_version [0x3] CEA-861-B, C or D
manufacture_id 0x2d4c
product_id 0xce3
port_id 0x0
support_hdcp 0
support_ai 0
audio_sync_delay 0
speakers [0x1] FL/FR
sad_count 1
sad0_coding_type [0x1] LPCM
sad0_channels 2
sad0_rates [0xe0] 32000 44100 48000
sad0_bits [0xe0000] 16 20 24
####
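
For anyone wanting to reproduce these snapshots, the dump can be scripted. This is only a minimal sketch: the function name dump_eld is made up for illustration, and /proc/asound/card1 is assumed as the default because that is where the card appeared on this machine; the card index may differ elsewhere.

```shell
#!/bin/sh
# Print every eld#* file found in an ALSA card's proc directory,
# each preceded by a header line with the file name.
# dump_eld is a hypothetical helper name; the directory is an argument
# so that a different card index can be used.
dump_eld() {
    card_dir=${1:-/proc/asound/card1}
    if [ ! -d "$card_dir" ]; then
        # Matches the "before hot-adding" case: the card does not exist yet.
        echo "$card_dir not present"
        return 0
    fi
    for f in "$card_dir"/eld#*; do
        [ -e "$f" ] || continue    # the glob matched nothing
        echo "== ${f##*/} =="
        cat "$f"
    done
}
```

Running something like `dump_eld > eld-monitor-enabled.txt` at each of the requested moments gives one file per snapshot that can be attached to a reply.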

Cheers.
Regards, Nicholas.

>
>
> thanks,
>
> Takashi

2020-04-30 17:53:04

by Takashi Iwai

Subject: Re: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state

On Thu, 30 Apr 2020 19:38:16 +0200,
Nicholas Johnson wrote:
>
> On Thu, Apr 30, 2020 at 07:01:08PM +0200, Takashi Iwai wrote:
> > On Thu, 30 Apr 2020 18:52:20 +0200,
> > Nicholas Johnson wrote:
> > >
> > > On Thu, Apr 30, 2020 at 05:14:56PM +0200, Takashi Iwai wrote:
> > > > On Wed, 29 Apr 2020 18:19:57 +0200,
> > > > Alex Deucher wrote:
> > > > >
> > > > > On Wed, Apr 29, 2020 at 12:05 PM Takashi Iwai <[email protected]> wrote:
> > > > > > Well, but the code path there is the runtime PM resume of the audio
> > > > > > device and it means that GPU must have been runtime-resumed again
> > > > > > beforehand via the device link. So, it should have worked from the
> > > > > > beginning but in reality not -- that is, apparently some inconsistency
> > > > > > is found in the initial attempt of the runtime resume...
> > > > >
> > > > > Yeah, it should be covered, but I wonder if there is something in the
> > > > > ELD update sequence that needs to call pm_runtime_get_sync()? The ELD
> > > > > sequence on AMD GPUs doesn't work the same as on other vendors. The
> > > > > GPU driver has a backdoor into the HDA device's verbs to set update
> > > > > the audio state rather than doing it via an ELD buffer update. We
> > > > > still update the ELD buffer for consistency. Maybe when the GPU
> > > > > driver sets the audio state at monitor detection time that triggers an
> > > > > interrupt or something on the HDA side which races with the CPU and
> > > > > the power down of the GPU. That still seems unlikely though since the
> > > > > runtime pm on the GPU side defaults to a 5 second suspend timer.
> > > >
> > > > I'm not sure whether it's the race between runtime suspend of GPU vs
> > > > runtime resume of audio. My wild guess is rather that it's the timing
> > > > GPU notifies to the audio; then the audio driver notifies to
> > > > user-space and user-space opens the stream, which in turn invokes the
> > > > runtime resume of GPU. But in GPU side, it's still under processing,
> > > > so it proceeds before the GPU finishes its initialization job.
> > > >
> > > > Nicholas, could you try the patch below and see whether the problem
> > > > still appears? The patch artificially delays the notification and ELD
> > > > update for 300msec. If this works, it means the timing problem.
> > > The bug still occurred after applying the patch.
> > >
> > > But you were absolutely correct - it just needed to be increased to
> > > 3000ms - then the bug stopped.
> >
> > Interesting. 3 seconds are too long, but I guess 1 second would work
> > as well?
> 1000ms indeed worked as well.
>
> >
> > In anyway, the success with a long delay means that the sound setup
> > after the full runtime resume of GPU seems working.
> >
> > > Now the question is, what do we do now that we know this?
> > >
> > > Also, are you still interested in the contents of the ELD# files? I can
> > > dump them all into a file at some specific moment in time which you
> > > request, if needed.
> >
> > Yes, please take the snapshot before plugging, right after plugging
> > and right after enabling. I'm not sure whether your monitor supports
> > the audio, and ELD contents should show that, at least.
> The monitor supports the audio. There is 3.5mm audio out jack. No
> inbuilt speakers, although Samsung did sell a sound bar to suit it. The
> sound bar, which I do not own, presumably attaches via 3.5mm jack.
>
> I am not sure if by plugging, you mean hot-adding Thunderbolt GPU or
> plugging the monitor to the GPU, so I have covered extra cases to be
> sure. I have taken the eld# files with the 1000ms patch applied, so the
> error is not triggered.

OK, thanks. If I understand correctly...

> ####
> Before hot-adding the Thunderbolt GPU:
> /proc/asound/card1 not present
> ####
> ####
> After hot-adding the GPU with no monitor attached:
>
> /proc/asound/card1 contains:
> eld#0.0 eld#0.1 eld#0.2 eld#0.3 eld#0.4 eld#0.5
>
> All of the above have the same contents:
>
> monitor_present 0
> eld_valid 0
> ####
> ####
> Monitor attached to Fiji GPU but not enabled:
>
> Same as above
> ####
> ####
> Monitor enabled:

... the error is triggered at this moment, right?


> All files with same contents except for eld#0.1 which looks like:
>
> monitor_present 1
> eld_valid 1
> monitor_name U32E850
> connection_type DisplayPort
> eld_version [0x2] CEA-861D or below
> edid_version [0x3] CEA-861-B, C or D
> manufacture_id 0x2d4c
> product_id 0xce3
> port_id 0x0
> support_hdcp 0
> support_ai 0
> audio_sync_delay 0
> speakers [0x1] FL/FR
> sad_count 1
> sad0_coding_type [0x1] LPCM
> sad0_channels 2
> sad0_rates [0xe0] 32000 44100 48000
> sad0_bits [0xe0000] 16 20 24

So your monitor supports the audio :)


Takashi

2020-05-02 07:14:15

by Takashi Iwai

Subject: Re: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state

On Thu, 30 Apr 2020 19:49:03 +0200,
Takashi Iwai wrote:
>
> On Thu, 30 Apr 2020 19:38:16 +0200,
> Nicholas Johnson wrote:
> >
> > On Thu, Apr 30, 2020 at 07:01:08PM +0200, Takashi Iwai wrote:
> > > On Thu, 30 Apr 2020 18:52:20 +0200,
> > > Nicholas Johnson wrote:
> > > >
> > > > On Thu, Apr 30, 2020 at 05:14:56PM +0200, Takashi Iwai wrote:
> > > > > On Wed, 29 Apr 2020 18:19:57 +0200,
> > > > > Alex Deucher wrote:
> > > > > >
> > > > > > On Wed, Apr 29, 2020 at 12:05 PM Takashi Iwai <[email protected]> wrote:
> > > > > > > Well, but the code path there is the runtime PM resume of the audio
> > > > > > > device and it means that GPU must have been runtime-resumed again
> > > > > > > beforehand via the device link. So, it should have worked from the
> > > > > > > beginning but in reality not -- that is, apparently some inconsistency
> > > > > > > is found in the initial attempt of the runtime resume...
> > > > > >
> > > > > > Yeah, it should be covered, but I wonder if there is something in the
> > > > > > ELD update sequence that needs to call pm_runtime_get_sync()? The ELD
> > > > > > sequence on AMD GPUs doesn't work the same as on other vendors. The
> > > > > > GPU driver has a backdoor into the HDA device's verbs to set update
> > > > > > the audio state rather than doing it via an ELD buffer update. We
> > > > > > still update the ELD buffer for consistency. Maybe when the GPU
> > > > > > driver sets the audio state at monitor detection time that triggers an
> > > > > > interrupt or something on the HDA side which races with the CPU and
> > > > > > the power down of the GPU. That still seems unlikely though since the
> > > > > > runtime pm on the GPU side defaults to a 5 second suspend timer.
> > > > >
> > > > > I'm not sure whether it's the race between runtime suspend of GPU vs
> > > > > runtime resume of audio. My wild guess is rather that it's the timing
> > > > > GPU notifies to the audio; then the audio driver notifies to
> > > > > user-space and user-space opens the stream, which in turn invokes the
> > > > > runtime resume of GPU. But in GPU side, it's still under processing,
> > > > > so it proceeds before the GPU finishes its initialization job.
> > > > >
> > > > > Nicholas, could you try the patch below and see whether the problem
> > > > > still appears? The patch artificially delays the notification and ELD
> > > > > update for 300msec. If this works, it means the timing problem.
> > > > The bug still occurred after applying the patch.
> > > >
> > > > But you were absolutely correct - it just needed to be increased to
> > > > 3000ms - then the bug stopped.
> > >
> > > Interesting. 3 seconds are too long, but I guess 1 second would work
> > > as well?
> > 1000ms indeed worked as well.
> >
> > >
> > > In anyway, the success with a long delay means that the sound setup
> > > after the full runtime resume of GPU seems working.
> > >
> > > > Now the question is, what do we do now that we know this?
> > > >
> > > > Also, are you still interested in the contents of the ELD# files? I can
> > > > dump them all into a file at some specific moment in time which you
> > > > request, if needed.
> > >
> > > Yes, please take the snapshot before plugging, right after plugging
> > > and right after enabling. I'm not sure whether your monitor supports
> > > the audio, and ELD contents should show that, at least.
> > The monitor supports the audio. There is 3.5mm audio out jack. No
> > inbuilt speakers, although Samsung did sell a sound bar to suit it. The
> > sound bar, which I do not own, presumably attaches via 3.5mm jack.
> >
> > I am not sure if by plugging, you mean hot-adding Thunderbolt GPU or
> > plugging the monitor to the GPU, so I have covered extra cases to be
> > sure. I have taken the eld# files with the 1000ms patch applied, so the
> > error is not triggered.
>
> OK, thanks. If I understand correctly...
>
> > ####
> > Before hot-adding the Thunderbolt GPU:
> > /proc/asound/card1 not present
> > ####
> > ####
> > After hot-adding the GPU with no monitor attached:
> >
> > /proc/asound/card1 contains:
> > eld#0.0 eld#0.1 eld#0.2 eld#0.3 eld#0.4 eld#0.5
> >
> > All of the above have the same contents:
> >
> > monitor_present 0
> > eld_valid 0
> > ####
> > ####
> > Monitor attached to Fiji GPU but not enabled:
> >
> > Same as above
> > ####
> > ####
> > Monitor enabled:
>
> ... the error is triggered at this moment, right?
>
>
> > All files with same contents except for eld#0.1 which looks like:
> >
> > monitor_present 1
> > eld_valid 1
> > monitor_name U32E850
> > connection_type DisplayPort
> > eld_version [0x2] CEA-861D or below
> > edid_version [0x3] CEA-861-B, C or D
> > manufacture_id 0x2d4c
> > product_id 0xce3
> > port_id 0x0
> > support_hdcp 0
> > support_ai 0
> > audio_sync_delay 0
> > speakers [0x1] FL/FR
> > sad_count 1
> > sad0_coding_type [0x1] LPCM
> > sad0_channels 2
> > sad0_rates [0xe0] 32000 44100 48000
> > sad0_bits [0xe0000] 16 20 24
>
> So your monitor supports the audio :)

I took another look at the actual code and found that I had remembered
it wrongly: the device link isn't created in the audio component
framework but only on the graphics side, so currently only i915 has it.

Could you try the patch below?

Note that the i915 binding has no DL_FLAG_PM_RUNTIME, as it manages
power via the get_power/put_power ops pair.


thanks,

Takashi

--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -673,6 +673,12 @@ static int amdgpu_dm_audio_component_bind(struct device *kdev,
 	struct amdgpu_device *adev = dev->dev_private;
 	struct drm_audio_component *acomp = data;

+	if (!device_link_add(hda_kdev, kdev, DL_FLAG_STATELESS |
+			     DL_FLAG_PM_RUNTIME)) {
+		DRM_ERROR("DM: cannot add device link to audio device\n");
+		return -ENOMEM;
+	}
+
 	acomp->ops = &amdgpu_dm_audio_component_ops;
 	acomp->dev = kdev;
 	adev->dm.audio_component = acomp;
@@ -690,6 +696,8 @@ static void amdgpu_dm_audio_component_unbind(struct device *kdev,
 	acomp->ops = NULL;
 	acomp->dev = NULL;
 	adev->dm.audio_component = NULL;
+
+	device_link_remove(hda_kdev, kdev);
 }

static const struct component_ops amdgpu_dm_audio_component_bind_ops = {

2020-05-02 07:21:09

by Lukas Wunner

Subject: Re: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state

On Sat, May 02, 2020 at 09:11:58AM +0200, Takashi Iwai wrote:
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> @@ -673,6 +673,12 @@ static int amdgpu_dm_audio_component_bind(struct device *kdev,
> struct amdgpu_device *adev = dev->dev_private;
> struct drm_audio_component *acomp = data;
>
> + if (!device_link_add(hda_kdev, kdev, DL_FLAG_STATELESS |
> + DL_FLAG_PM_RUNTIME)) {
> + DRM_ERROR("DM: cannot add device link to audio device\n");
> + return -ENOMEM;
> + }
> +

Doesn't this duplicate drivers/pci/quirks.c:quirk_gpu_hda() ?

2020-05-02 07:32:28

by Takashi Iwai

Subject: Re: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state

On Sat, 02 May 2020 09:17:28 +0200,
Lukas Wunner wrote:
>
> On Sat, May 02, 2020 at 09:11:58AM +0200, Takashi Iwai wrote:
> > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > @@ -673,6 +673,12 @@ static int amdgpu_dm_audio_component_bind(struct device *kdev,
> > struct amdgpu_device *adev = dev->dev_private;
> > struct drm_audio_component *acomp = data;
> >
> > + if (!device_link_add(hda_kdev, kdev, DL_FLAG_STATELESS |
> > + DL_FLAG_PM_RUNTIME)) {
> > + DRM_ERROR("DM: cannot add device link to audio device\n");
> > + return -ENOMEM;
> > + }
> > +
>
> Doesn't this duplicate drivers/pci/quirks.c:quirk_gpu_hda() ?

Gah, you're right, that was the place I overlooked.
It was a typical "false Eureka right-after-wakeup" phenomenon :)
Need a vaccine aka coffee...

So the runtime PM dependency must already be in place there, and the
problem is not a missing dependency link but really some other
timing issue. Back to square one.
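
One way to watch the timing from user space is to poll the runtime PM state of both PCI functions in sysfs while enabling the monitor. The following is only a sketch: the helper names rpm_status and watch_rpm are made up, and the addresses 0000:08:00.0 (GPU) and 0000:08:00.1 (audio) are taken from the dmesg output in the original report, so they need adjusting on other machines.

```shell
#!/bin/sh
# Print "<address> <state>" for one PCI function, where the state is read
# from the power/runtime_status attribute in sysfs (for example "active"
# or "suspended"). The sysfs root is an optional second argument so the
# helper can be tried against a test directory.
rpm_status() {
    bdf=$1
    root=${2:-/sys/bus/pci/devices}
    f="$root/$bdf/power/runtime_status"
    if [ -r "$f" ]; then
        printf '%s %s\n' "$bdf" "$(cat "$f")"
    else
        printf '%s unavailable\n' "$bdf"
    fi
}

# Poll the GPU and its audio function until interrupted with Ctrl-C.
# The interval in seconds is an optional argument (default 1).
watch_rpm() {
    while :; do
        rpm_status 0000:08:00.0
        rpm_status 0000:08:00.1
        sleep "${1:-1}"
    done
}
```

Leaving `watch_rpm` running in a second terminal should show whether the GPU is still reported as suspended or resuming at the moment the snd_hda_intel errors appear in dmesg.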


thanks,

Takashi

2020-05-02 10:13:53

by Takashi Iwai

Subject: Re: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state

On Sat, 02 May 2020 09:27:31 +0200,
Takashi Iwai wrote:
>
> On Sat, 02 May 2020 09:17:28 +0200,
> Lukas Wunner wrote:
> >
> > On Sat, May 02, 2020 at 09:11:58AM +0200, Takashi Iwai wrote:
> > > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > > @@ -673,6 +673,12 @@ static int amdgpu_dm_audio_component_bind(struct device *kdev,
> > > struct amdgpu_device *adev = dev->dev_private;
> > > struct drm_audio_component *acomp = data;
> > >
> > > + if (!device_link_add(hda_kdev, kdev, DL_FLAG_STATELESS |
> > > + DL_FLAG_PM_RUNTIME)) {
> > > + DRM_ERROR("DM: cannot add device link to audio device\n");
> > > + return -ENOMEM;
> > > + }
> > > +
> >
> > Doesn't this duplicate drivers/pci/quirks.c:quirk_gpu_hda() ?
>
> Gah, you're right, that was the place I overlooked.
> It was a typical "false Eureka right-after-wakeup" phenomenon :)
> Need a vaccine aka coffee...
>
> So the runtime PM dependency must be already placed there, and the
> problem is not the lack of the dependency tree but the really other
> timing issue. Back to square.

One interesting test is to open the stream while the mode isn't set
yet and see whether the same problem appears.
Namely, after the monitor is connected but before any mode is set, run
aplay directly, e.g.
  aplay -Dhdmi:1,0 foo.wav
You might need to wrap the command with pasuspender if PA is active.
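
As a rough sketch of how the command above could be assembled: the helper name hdmi_play_cmd is made up, and card 1, device 0 and foo.wav are just the values from the example, so the real numbers should be taken from `aplay -l`. The helper only prints the command line instead of running it, and it assumes the PulseAudio daemon process is named pulseaudio.

```shell
#!/bin/sh
# Print the aplay command for an HDMI/DP PCM given card, device and file.
# If a process named "pulseaudio" is found, the command is prefixed with
# pasuspender so that PulseAudio releases the device during playback.
# hdmi_play_cmd is a hypothetical helper; it does not start playback itself.
hdmi_play_cmd() {
    card=$1
    dev=$2
    wav=$3
    cmd="aplay -D hdmi:$card,$dev $wav"
    if pgrep -x pulseaudio >/dev/null 2>&1; then
        cmd="pasuspender -- $cmd"
    fi
    echo "$cmd"
}
```

For example, `hdmi_play_cmd 1 0 foo.wav` prints a line that can be pasted into the terminal right after the monitor is connected. If the hdmi: PCM name is not defined by the local ALSA configuration, a plughw:CARD,DEVICE name built from the numbers shown by `aplay -l` may be an alternative.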


Takashi

2020-05-06 15:21:00

by Nicholas Johnson

Subject: Re: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state

On Sat, May 02, 2020 at 12:09:13PM +0200, Takashi Iwai wrote:
> On Sat, 02 May 2020 09:27:31 +0200,
> Takashi Iwai wrote:
> >
> > On Sat, 02 May 2020 09:17:28 +0200,
> > Lukas Wunner wrote:
> > >
> > > On Sat, May 02, 2020 at 09:11:58AM +0200, Takashi Iwai wrote:
> > > > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > > > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > > > @@ -673,6 +673,12 @@ static int amdgpu_dm_audio_component_bind(struct device *kdev,
> > > > struct amdgpu_device *adev = dev->dev_private;
> > > > struct drm_audio_component *acomp = data;
> > > >
> > > > + if (!device_link_add(hda_kdev, kdev, DL_FLAG_STATELESS |
> > > > + DL_FLAG_PM_RUNTIME)) {
> > > > + DRM_ERROR("DM: cannot add device link to audio device\n");
> > > > + return -ENOMEM;
> > > > + }
> > > > +
> > >
> > > Doesn't this duplicate drivers/pci/quirks.c:quirk_gpu_hda() ?
> >
> > Gah, you're right, that was the place I overlooked.
> > It was a typical "false Eureka right-after-wakeup" phenomenon :)
> > Need a vaccine aka coffee...
> >
> > So the runtime PM dependency must be already placed there, and the
> > problem is not the lack of the dependency tree but the really other
> > timing issue. Back to square.
>
> One interesting test is to open the stream while the mode isn't set
> yet and see whether the same problem appears.
> Namely, after the monitor is connected but no mode is set, run
> directly like
> aplay -Dhdmi:1,0 foo.wav
> You might need to wrap the command with pasuspender if PA is active.
I could not figure out how to specify the interface for aplay; all I
could do was omit it and let aplay pick the default device (which can
change). I even used aplay -L and aplay -l to list the devices, but I
could not get it working.

Is there anything else I can try? I did not apply the last patch, since
it was pointed out that the device link is already added by a quirk.

Regards,
Nicholas
>
>
> Takashi