2021-10-30 19:34:23

by Ken Moffat

Subject: amdgpu hang on picasso

When I tried 5.15-rc7 on my picasso APU (Ryzen 5 3400G), trying to
run 'startx' (I'm using X11 and logging in to a tty) the output
messages from X11 stopped after a few lines (normally, the desktop
shows before I can read anything) and keyboard/mouse were
inoperative - had to use Magic SysRQ to sync and reboot.

The log showed
Oct 28 03:02:21 deluxe klogd: [ 31.347235] amdgpu 0000:09:00.0: amdgpu: failed to write reg 28b4 wait reg 28c6
Oct 28 03:02:34 deluxe klogd: [ 44.280185] amdgpu 0000:09:00.0: amdgpu: failed to write reg 1a6f4 wait reg 1a706
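
For anyone comparing their own logs against the two lines above, here is a minimal sketch of pulling the register numbers out of such messages. The regular expression and the field names are illustrative assumptions based only on the two lines quoted here, not on any documented amdgpu log format.

```python
import re

# Pattern derived from the two log lines quoted in this report.
# The group names (ts, pci, reg, wait) are labels chosen for this sketch.
REG_FAIL = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+amdgpu (?P<pci>[0-9a-f:.]+): amdgpu: "
    r"failed to write reg (?P<reg>[0-9a-f]+) wait reg (?P<wait>[0-9a-f]+)"
)


def parse_reg_failures(lines):
    """Return one dict per 'failed to write reg' line, with registers as ints."""
    failures = []
    for line in lines:
        m = REG_FAIL.search(line)
        if m:
            failures.append({
                "time": float(m["ts"]),          # seconds since boot
                "pci": m["pci"],                 # PCI address of the GPU
                "reg": int(m["reg"], 16),        # register that was written
                "wait_reg": int(m["wait"], 16),  # register that was polled
            })
    return failures


SAMPLE = [
    "Oct 28 03:02:21 deluxe klogd: [ 31.347235] amdgpu 0000:09:00.0: "
    "amdgpu: failed to write reg 28b4 wait reg 28c6",
    "Oct 28 03:02:34 deluxe klogd: [ 44.280185] amdgpu 0000:09:00.0: "
    "amdgpu: failed to write reg 1a6f4 wait reg 1a706",
]

if __name__ == "__main__":
    for f in parse_reg_failures(SAMPLE):
        print(f"{f['time']:.6f}s {f['pci']} reg={f['reg']:#x} wait={f['wait_reg']:#x}")
```

Feeding it the lines of a saved kernel log gives a quick list of which registers timed out and when.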

I started bisecting after confirming that Linus' tree with head at
f25a5481af12 still showed the problem. That identified the
following commit, which reverts cleanly and allows Xorg to start:

commit 714d9e4574d54596973ee3b0624ee4a16264d700
Author: Yifan Zhang <[email protected]>
Date: Tue Sep 28 15:42:35 2021 +0800

drm/amdgpu: init iommu after amdkfd device init

This patch is to fix clinfo failure in Raven/Picasso:

Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.2 AMD-APP (3364.0)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback

Platform Name: AMD Accelerated Parallel Processing Number of devices: 0

Signed-off-by: Yifan Zhang <[email protected]>
Reviewed-by: James Zhu <[email protected]>
Tested-by: James Zhu <[email protected]>
Acked-by: Felix Kuehling <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
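
The bisect and revert steps described above can be sketched roughly as follows. This is only an outline under stated assumptions: the report does not say which revision was used as the known-good starting point, so the "v5.14" tag below is a placeholder, and the repository path is hypothetical.

```python
import subprocess

BAD_HEAD = "f25a5481af12"  # from the report: head that still shows the hang
SUSPECT = "714d9e4574d54596973ee3b0624ee4a16264d700"  # commit the bisect found


def bisect_start_command(bad, good):
    """Return the argv that opens a bisect session between a bad and a good revision."""
    return ["git", "bisect", "start", bad, good]


def revert_command(commit):
    """Return the argv that reverts one commit without opening an editor."""
    return ["git", "revert", "--no-edit", commit]


def run(argv, repo="."):
    """Run one git command inside the kernel tree at `repo` (hypothetical path)."""
    return subprocess.run(argv, cwd=repo, check=True)


if __name__ == "__main__":
    # "v5.14" is an assumed last-known-good tag; the report does not name one.
    print(" ".join(bisect_start_command(BAD_HEAD, "v5.14")))
    print(" ".join(revert_command(SUSPECT)))
```

After starting the session, each build is tested and marked with `git bisect good` or `git bisect bad` until git names the first bad commit; the revert command is then applied on top of the failing tree to confirm the result.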

I've got a laptop with raven; I'll try to find time in the next few
days to test whether it also shows the problem.

ĸen
--
A capitalist society is one where individuals own and acquire
property, at least for a few months until cooler objects come out.
-- Late Night Mash


2021-10-31 03:08:50

by Ken Moffat

Subject: Re: amdgpu hang on picasso

On Sat, Oct 30, 2021 at 07:52:28PM +0100, Ken Moffat wrote:
> When I tried 5.15-rc7 on my picasso APU (Ryzen 5 3400G), trying to
> run 'startx' (I'm using X11 and logging in to a tty) the output
> messages from X11 stopped after a few lines (normally, the desktop
> shows before I can read anything) and keyboard/mouse were
> inoperative - had to use Magic SysRQ to sync and reboot.
>
> The log showed
> Oct 28 03:02:21 deluxe klogd: [ 31.347235] amdgpu 0000:09:00.0: amdgpu: failed to write reg 28b4 wait reg 28c6
> Oct 28 03:02:34 deluxe klogd: [ 44.280185] amdgpu 0000:09:00.0: amdgpu: failed to write reg 1a6f4 wait reg 1a706
>
> I started bisecting after confirming that Linus' tree with head at
> f25a5481af12 still showed the problem. That identified the
> following commit, which reverts cleanly and allows Xorg to start:
>
> commit 714d9e4574d54596973ee3b0624ee4a16264d700
> Author: Yifan Zhang <[email protected]>
> Date: Tue Sep 28 15:42:35 2021 +0800
>
> drm/amdgpu: init iommu after amdkfd device init
>
> This patch is to fix clinfo failure in Raven/Picasso:
>
> Number of platforms: 1
> Platform Profile: FULL_PROFILE
> Platform Version: OpenCL 2.2 AMD-APP (3364.0)
> Platform Name: AMD Accelerated Parallel Processing
> Platform Vendor: Advanced Micro Devices, Inc.
> Platform Extensions: cl_khr_icd cl_amd_event_callback
>
> Platform Name: AMD Accelerated Parallel Processing Number of devices: 0
>
> Signed-off-by: Yifan Zhang <[email protected]>
> Reviewed-by: James Zhu <[email protected]>
> Tested-by: James Zhu <[email protected]>
> Acked-by: Felix Kuehling <[email protected]>
> Signed-off-by: Alex Deucher <[email protected]>
>
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> I've got a laptop with raven; I'll try to find time in the next few
> days to test whether it also shows the problem.
>
The laptop (AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx) works
fine without reverting that patch; only the picasso has the problem.

ĸen
--
A capitalist society is one where individuals own and acquire
property, at least for a few months until cooler objects come out.
-- Late Night Mash

2021-11-01 16:54:11

by Ken Moffat

Subject: Re: amdgpu hang on picasso

On Mon, Nov 01, 2021 at 03:20:43PM +0000, Zhu, James wrote:
> [AMD Official Use Only]
>
> Hi Ken
>
> can you share the entire dmesg log?
>
>
> Thanks & Best Regards!
>
>
> James Zhu
>

I'm attaching it - booted vanilla rc7, saved dmesg, then ran startx
and did an emergency sync. At that point I tried to change to a
different tty, and to my surprise managed that by hitting some sort
of random ctrl-alt-Fn sequence (i.e. it didn't work at first, then
the screen blanked and I got to a tty login prompt).

So here's the second version with some more messages at the end.

ĸen
> ________________________________
> From: Ken Moffat <[email protected]>
> Sent: Saturday, October 30, 2021 2:52 PM
> To: Deucher, Alexander <[email protected]>; Zhang, Yifan <[email protected]>; Zhu, James <[email protected]>
> Cc: Kuehling, Felix <[email protected]>; Linux Kernel Mailing List <[email protected]>
> Subject: amdgpu hang on picasso
>
> When I tried 5.15-rc7 on my picasso APU (Ryzen 5 3400G), trying to
> run 'startx' (I'm using X11 and logging in to a tty) the output
> messages from X11 stopped after a few lines (normally, the desktop
> shows before I can read anything) and keyboard/mouse were
> inoperative - had to use Magic SysRQ to sync and reboot.
>
> The log showed
> Oct 28 03:02:21 deluxe klogd: [ 31.347235] amdgpu 0000:09:00.0: amdgpu: failed to write reg 28b4 wait reg 28c6
> Oct 28 03:02:34 deluxe klogd: [ 44.280185] amdgpu 0000:09:00.0: amdgpu: failed to write reg 1a6f4 wait reg 1a706
>
> I started bisecting after confirming that Linus' tree with head at
> f25a5481af12 still showed the problem. That identified the
> following commit, which reverts cleanly and allows Xorg to start:
>
> commit 714d9e4574d54596973ee3b0624ee4a16264d700
> Author: Yifan Zhang <[email protected]>
> Date: Tue Sep 28 15:42:35 2021 +0800
>
> drm/amdgpu: init iommu after amdkfd device init
>
> This patch is to fix clinfo failure in Raven/Picasso:
>
> Number of platforms: 1
> Platform Profile: FULL_PROFILE
> Platform Version: OpenCL 2.2 AMD-APP (3364.0)
> Platform Name: AMD Accelerated Parallel Processing
> Platform Vendor: Advanced Micro Devices, Inc.
> Platform Extensions: cl_khr_icd cl_amd_event_callback
>
> Platform Name: AMD Accelerated Parallel Processing Number of devices: 0
>
> Signed-off-by: Yifan Zhang <[email protected]>
> Reviewed-by: James Zhu <[email protected]>
> Tested-by: James Zhu <[email protected]>
> Acked-by: Felix Kuehling <[email protected]>
> Signed-off-by: Alex Deucher <[email protected]>
>
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> I've got a laptop with raven; I'll try to find time in the next few
> days to test whether it also shows the problem.
>
> ĸen

--
Vetinari smiled. "Can you keep a secret, Mister Lipwig?"
"Oh, yes, sir. I've kept lots."
"Capital. And the point is, so can I. You do not need to know."


Attachments:
(No filename) (3.07 kB)
dmesg.picasso (65.42 kB)

2021-11-01 20:52:15

by Ken Moffat

Subject: Re: amdgpu hang on picasso

On Mon, Nov 01, 2021 at 06:32:11PM +0000, Zhu, James wrote:
> [AMD Official Use Only]
>
> Hi Kent,
>
just Ken, not Kent

> You can also share /var/log/kern.log
>
> I saw some issue from your attached log:
>
> [ 2.135852] amdgpu 0000:09:00.0: Direct firmware load for amdgpu/picasso_ta.bin failed with error -2
> [ 2.135854] amdgpu 0000:09:00.0: amdgpu: psp v10.0: Failed to load firmware "amdgpu/picasso_ta.bin"
> [ 2.135856] amdgpu 0000:09:00.0: amdgpu: PSP runtime database doesn't exist
> Can you try the latest firmware?
>

I didn't know that newer firmware was available; what I had dates
from when I got this box. I've just downloaded
linux-firmware-20211027.tar.gz (last time I looked at firmware I
needed to get the individual files). So, I didn't have that
picasso_ta and I do now. But it still doesn't load - the mention of
a runtime database makes me think it might be for wifi? If so,
this machine is wired only.
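
As a side note on the "failed with error -2" message quoted above: the number is a negated errno value, and errno 2 is ENOENT ("No such file or directory"). Below is a minimal sketch that decodes such a line and checks whether the named file exists under a firmware directory. The `/lib/firmware` default and the function name are assumptions made for this sketch; a distribution may also ship compressed firmware or load the driver from an initramfs, so a file being present on disk does not by itself show that the driver could see it at load time.

```python
import errno
import re
from pathlib import Path

# Matches the kernel's "Direct firmware load ... failed" message quoted above.
FW_FAIL = re.compile(
    r"Direct firmware load for (?P<name>\S+) failed with error (?P<err>-?\d+)"
)


def decode_firmware_failure(line, firmware_root="/lib/firmware"):
    """Decode a firmware load failure line.

    `firmware_root` is an assumed default location, not taken from the report.
    Returns None when the line is not a firmware load failure.
    """
    m = FW_FAIL.search(line)
    if m is None:
        return None
    code = abs(int(m["err"]))  # the kernel logs the errno negated
    path = Path(firmware_root) / m["name"]
    return {
        "name": m["name"],
        "errno": errno.errorcode.get(code, "UNKNOWN"),
        "path": str(path),
        "present": path.exists(),  # only checks the uncompressed file name
    }


LINE = ("[ 2.135852] amdgpu 0000:09:00.0: Direct firmware load for "
        "amdgpu/picasso_ta.bin failed with error -2")

if __name__ == "__main__":
    print(decode_firmware_failure(LINE))
```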

> [ 2.308871] amdgpu 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x10438a380 flags=0x0070]
> Can you try adding amd_iommu=off to the grub options?
>
I forgot to do that until I came to reply. I've attached the kernel
log from running with only the updated firmware as sys.log.gz.

Turning off the iommu was pretty disastrous: the screen went blank
before I could log in. Turns out it had oopsed (from 2.316893
seconds in the attached sys.log.iommu_off.gz).
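
For reference, a small sketch of two checks related to the IOMMU test above: extracting the fields from an AMD-Vi IO_PAGE_FAULT line like the one quoted, and confirming that a kernel command line actually contains `amd_iommu=off`. The helper names are made up for this sketch, and the regular expression is based only on the single event line quoted in this thread.

```python
import re

# Based on the one AMD-Vi event line quoted in this thread.
PAGE_FAULT = re.compile(
    r"AMD-Vi: Event logged \[IO_PAGE_FAULT "
    r"domain=0x(?P<domain>[0-9a-f]+) "
    r"address=0x(?P<address>[0-9a-f]+) "
    r"flags=0x(?P<flags>[0-9a-f]+)\]"
)


def parse_page_fault(line):
    """Return domain, address and flags as ints, or None if the line does not match."""
    m = PAGE_FAULT.search(line)
    if m is None:
        return None
    return {key: int(value, 16) for key, value in m.groupdict().items()}


def has_kernel_param(cmdline, param):
    """True if `param` appears as a whole whitespace-separated token in `cmdline`."""
    return param in cmdline.split()


def current_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as f:
        return f.read().strip()


LINE = ("[ 2.308871] amdgpu 0000:09:00.0: AMD-Vi: Event logged "
        "[IO_PAGE_FAULT domain=0x0000 address=0x10438a380 flags=0x0070]")

if __name__ == "__main__":
    print(parse_page_fault(LINE))
    # Hypothetical command line, shown only to illustrate the check.
    print(has_kernel_param("root=/dev/sda1 ro amd_iommu=off", "amd_iommu=off"))
```

On a running system, `has_kernel_param(current_cmdline(), "amd_iommu=off")` shows whether the option actually reached the kernel.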

Thanks.

ĸen


Attachments:
(No filename) (1.44 kB)
sys.log.gz (18.83 kB)
sys.log.iommu_off.gz (17.70 kB)