2020-08-09 11:54:44

by Ignat Insarov

Subject: Non-deterministically boot into dark screen with `amdgpu`

Hello!

This is an issue report. I am not familiar with the Linux kernel
development procedure, so please direct me to a more appropriate or
specialized medium if this is not the right avenue.

My laptop (Ryzen 7 Pro CPU/GPU) boots into dark screen more often than
not. Screen blackness correlates with a line in the `systemd` journal
that says `RAM width Nbits DDR4`, where N is either 128 (resulting in
dark screen) or 64 (resulting in a healthy boot). The number seems to
be chosen at random with bias towards 128. This has been going on for
a while, so here are some statistics:

* 356 boots proceed far enough to attempt mode setting.
* 82 boots set RAM width to 64 bits and presumably succeed.
* 274 boots set RAM width to 128 bits and presumably fail.
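
A tally like the one above can be reproduced from saved journal text with a few lines of Python. This is only a minimal sketch: the regular expression assumes the message looks exactly like the `RAM width Nbits DDR4` line quoted earlier, and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumed message shape, taken from the `RAM width Nbits DDR4` line above.
RAM_WIDTH = re.compile(r"RAM width (\d+)bits DDR4")


def tally_ram_widths(lines):
    """Count how often each RAM width appears in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = RAM_WIDTH.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def shares(counts):
    """Return each width's fraction of all counted boots."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {width: n / total for width, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical sample lines standing in for real journal output.
    sample = [
        "[drm] RAM width 128bits DDR4",
        "[drm] RAM width 64bits DDR4",
        "[drm] RAM width 128bits DDR4",
    ]
    counts = tally_ram_widths(sample)
    for width, share in sorted(shares(counts).items()):
        print(f"{width} bits: {counts[width]} boots ({share:.1%})")
```

Passing an open file or `sys.stdin` instead of `sample` works the same way, since the function only needs an iterable of lines. With the numbers reported here, 274 of 356 boots is roughly 77% at 128 bits and 82 of 356 is roughly 23% at 64 bits.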

The issue is prevented with the `nomodeset` kernel option.

I reported this previously (about a year ago) on the forum of my Linux
distribution.[1] The issue still persists as of linux 5.8.0.

The details of my graphics controller, as well as some journal
excerpts, can be seen at [1]. One thing that has changed since then is
that on failure, there now appears a null pointer dereference error. I
am attaching the log of kernel messages from the most recent failed
boot — please request more information if needed.

I appreciate any directions and advice as to how I may go about fixing
this annoyance.

[1]: https://bbs.archlinux.org/viewtopic.php?id=248273


Attachments:
kernel.log (56.80 kB)

2020-08-10 06:45:19

by Alexander Monakov

Subject: Re: Non-deterministically boot into dark screen with `amdgpu`

Hi,

you should Cc a specialized mailing list and a relevant maintainer,
otherwise your email is likely to be ignored as LKML is an incredibly
high-volume list. Adding amd-gfx and Alex Deucher.

More thoughts below.

On Sun, 9 Aug 2020, Ignat Insarov wrote:

> Hello!
>
> This is an issue report. I am not familiar with the Linux kernel
> development procedure, so please direct me to a more appropriate or
> specialized medium if this is not the right avenue.
>
> My laptop (Ryzen 7 Pro CPU/GPU) boots into dark screen more often than
> not. Screen blackness correlates with a line in the `systemd` journal
> that says `RAM width Nbits DDR4`, where N is either 128 (resulting in
> dark screen) or 64 (resulting in a healthy boot). The number seems to
> be chosen at random with bias towards 128. This has been going on for
> a while, so here are some statistics:
>
> * 356 boots proceed far enough to attempt mode setting.
> * 82 boots set RAM width to 64 bits and presumably succeed.
> * 274 boots set RAM width to 128 bits and presumably fail.
>
> The issue is prevented with the `nomodeset` kernel option.
>
> I reported this previously (about a year ago) on the forum of my Linux
> distribution.[1] The issue still persists as of linux 5.8.0.
>
> The details of my graphics controller, as well as some journal
> excerpts, can be seen at [1]. One thing that has changed since then is
> that on failure, there now appears a null pointer dereference error. I
> am attaching the log of kernel messages from the most recent failed
> boot — please request more information if needed.
>
> I appreciate any directions and advice as to how I may go about fixing
> this annoyance.
>
> [1]: https://bbs.archlinux.org/viewtopic.php?id=248273


On the forum you show that in the "success" case there's one less "BIOS
signature incorrect" message. This implies that amdgpu_get_bios() in
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
gets the video BIOS from a different source. If that happens every time
(one "signature incorrect" message for "success", two for "failure")
that may be relevant to the problem you're experiencing.

If you don't mind patching and rebuilding the kernel I suggest adding
debug printks to the aforementioned function to see exactly which methods
fail with a wrong signature and which succeed.
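
To illustrate what such per-method logging would reveal, here is a small user-space model in Python. It is not the kernel code: the source names and byte strings are hypothetical, and the only check it models is the standard 0x55 0xAA option ROM header, which I am assuming is what the "signature incorrect" message refers to.

```python
# User-space model only. Source names and images are hypothetical and do
# not mirror the real amdgpu_get_bios() implementation.
ROM_SIGNATURE = b"\x55\xaa"  # standard PCI option ROM header bytes


def has_rom_signature(image):
    """Return True if the image starts with the option ROM header."""
    return image is not None and image[:2] == ROM_SIGNATURE


def pick_bios(sources, log=print):
    """Try each (name, fetch) pair in order and log every outcome.

    Returns (name, image) for the first source whose image passes the
    signature check, or (None, None) if every source fails.
    """
    for name, fetch in sources:
        image = fetch()
        if image is None:
            log(f"{name}: no image")
        elif has_rom_signature(image):
            log(f"{name}: signature ok, using this image")
            return name, image
        else:
            log(f"{name}: signature incorrect, found {image[:2].hex()}")
    return None, None


if __name__ == "__main__":
    # Made-up sources: the first returns junk, the second a plausible image.
    demo_sources = [
        ("source_a", lambda: b"\x00\x00junk"),
        ("source_b", lambda: b"\x55\xaa" + b"\x00" * 14),
    ]
    pick_bios(demo_sources)
```

With one log line per attempt, a boot that prints two "signature incorrect" messages can be told apart from one that prints a single message by which source finally succeeded.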

Also might be worthwhile to check if there's a BIOS update for your laptop.

Alexander

2020-08-10 20:36:52

by Alex Deucher

Subject: Re: Non-deterministically boot into dark screen with `amdgpu`

On Mon, Aug 10, 2020 at 7:46 AM Christian König wrote:
>
> Hi guys,
>
> On 10.08.20 at 08:43, Alexander Monakov wrote:
>
> > Hi,
> >
> > you should Cc a specialized mailing list and a relevant maintainer,
> > otherwise your email is likely to be ignored as LKML is an incredibly
> > high-volume list. Adding amd-gfx and Alex Deucher.
>
> Thanks for forwarding this. AFAIK we haven't heard of this bug before, but Alex might already know more about it.
>
> > More thoughts below.
> >
> > On Sun, 9 Aug 2020, Ignat Insarov wrote:
> >
> > > Hello!
> > >
> > > This is an issue report. I am not familiar with the Linux kernel
> > > development procedure, so please direct me to a more appropriate or
> > > specialized medium if this is not the right avenue.
> > >
> > > My laptop (Ryzen 7 Pro CPU/GPU) boots into dark screen more often than
> > > not. Screen blackness correlates with a line in the `systemd` journal
> > > that says `RAM width Nbits DDR4`, where N is either 128 (resulting in
> > > dark screen) or 64 (resulting in a healthy boot). The number seems to
> > > be chosen at random with bias towards 128. This has been going on for
> > > a while, so here are some statistics:
> > >
> > > * 356 boots proceed far enough to attempt mode setting.
> > > * 82 boots set RAM width to 64 bits and presumably succeed.
> > > * 274 boots set RAM width to 128 bits and presumably fail.
> > >
> > > The issue is prevented with the `nomodeset` kernel option.
> > >
> > > I reported this previously (about a year ago) on the forum of my Linux
> > > distribution.[1] The issue still persists as of linux 5.8.0.
> > >
> > > The details of my graphics controller, as well as some journal
> > > excerpts, can be seen at [1]. One thing that has changed since then is
> > > that on failure, there now appears a null pointer dereference error. I
> > > am attaching the log of kernel messages from the most recent failed
> > > boot; please request more information if needed.
> > >
> > > I appreciate any directions and advice as to how I may go about fixing
> > > this annoyance.
> > >
> > > [1]: https://bbs.archlinux.org/viewtopic.php?id=248273
> >
> > On the forum you show that in the "success" case there's one less "BIOS
> > signature incorrect" message. This implies that amdgpu_get_bios() in
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
> > gets the video BIOS from a different source. If that happens every time
> > (one "signature incorrect" message for "success", two for "failure")
> > that may be relevant to the problem you're experiencing.
> >
> > If you don't mind patching and rebuilding the kernel I suggest adding
> > debug printks to the aforementioned function to see exactly which methods
> > fail with a wrong signature and which succeed.
> >
> > Also might be worthwhile to check if there's a BIOS update for your laptop.
>
> It might also be a good idea to try the latest amd-staging-drm-next branch from Alex's repository (bear with me, I don't have the link at hand, but it should be easy to find).
>
> Opening a bug report or searching the existing ones for something similar under https://gitlab.freedesktop.org/drm/amd/-/issues might be a good idea as well.
>
> And I completely agree that this sounds like an issue getting the BIOS image.

I've not heard of an issue like this either. Best to file a gitlab
bug and attach your full dmesg output in both the working and
non-working cases and we can go from there.
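
When comparing the two dmesg captures before attaching them, it can help to strip the timestamps first so that only real differences show up. A minimal Python sketch follows; the timestamp pattern assumes the usual `[   12.345678]` prefix, and the sample lines are invented.

```python
import difflib
import re

# Assumes the common "[   12.345678] " prefix at the start of each line.
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def strip_timestamps(lines):
    """Drop the leading timestamp and trailing newline from each line."""
    return [TIMESTAMP.sub("", line.rstrip("\n")) for line in lines]


def diff_logs(good, bad):
    """Return a unified diff of two logs, ignoring timestamps."""
    return list(
        difflib.unified_diff(
            strip_timestamps(good),
            strip_timestamps(bad),
            fromfile="working",
            tofile="failing",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Invented sample lines standing in for two saved dmesg captures.
    working = [
        "[    1.000000] [drm] RAM width 64bits DDR4\n",
        "[    1.100000] example line common to both boots\n",
    ]
    failing = [
        "[    2.500000] [drm] RAM width 128bits DDR4\n",
        "[    2.700000] example line common to both boots\n",
    ]
    for line in diff_logs(working, failing):
        print(line)
```

The full, unmodified logs are still what should be attached to the bug; the diff is only a quick way to spot where the two boots diverge.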

Alex

>
> Thanks,
> Christian.
>
> > Alexander
>
> _______________________________________________
> amd-gfx mailing list
> [email protected]
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx