2022-03-29 09:00:20

by Paul Menzel

Subject: Re: 回复: Re: [PATCH] drm/amdgpu: resolve s3 hang for r7340

Dear 李真,


[Your mailer formatted the message oddly. Maybe configure it to send
plain-text email only, with no HTML parts (as is common in the Linux
kernel community), or, if that is not possible, switch to another mail
client (Mozilla Thunderbird, …).]


On 29.03.22 at 04:54, 李真能 wrote:

[…]

> *Date:* 2022-03-28 15:38
> *From:* Paul Menzel

[…]

> On 28.03.22 at 09:36, Paul Menzel wrote:
> > Dear Zhenneng,
> >
> >
> > Thank you for your patch.
> >
> > On 28.03.22 at 06:05, Zhenneng Li wrote:
> >> This is a workaround for s3 hang for r7340(amdgpu).
> >
> > Is it hanging when resuming from S3?
>
> Yes, this function is delayed work that runs after the graphics card is
> initialized.

Thank you for clarifying.

> > Maybe also use the line below for
> > the commit message summary:
> >
> > drm/amdgpu: Add 1 ms delay to init handler to fix s3 resume hang
> >
> > Also, please add a space before the ( in “r7340(amdgpu)”.
> >
> >> When we test s3 with r7340 on arm64 platform, graphics card will hang up,
> >> the error message are as follows:
> >> Mar 4 01:14:11 greatwall-GW-XXXXXX-XXX kernel: [ 1.599374][ 7] [ T291] amdgpu 0000:02:00.0: fb0: amdgpudrmfb frame buffer device
> >> Mar 4 01:14:11 greatwall-GW-XXXXXX-XXX kernel: [ 1.612869][ 7] [ T291] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* late_init of IP blockfailed -22
> >> Mar 4 01:14:11 greatwall-GW-XXXXXX-XXX kernel: [ 1.623392][ 7] [ T291] amdgpu 0000:02:00.0: amdgpu_device_ip_late_init failed
> >> Mar 4 01:14:11 greatwall-GW-XXXXXX-XXX kernel: [ 1.630696][ 7] [ T291] amdgpu 0000:02:00.0: Fatal error during GPU init
> >> Mar 4 01:14:11 greatwall-GW-XXXXXX-XXX kernel: [ 1.637477][ 7] [ T291] [drm] amdgpu: finishing device.
> >
> > The prefix at the beginning is not really needed, only the part after
> > `kernel: `.
> >
> > Maybe also add the output of `lspci -nn -s …` for that r7340 device.
> >
> >> Change-Id: I5048b3894c0ca9faf2f4847ddab61f9eb17b4823
> >
> > Without the Gerrit instance this belongs to, the Change-Id is of no use
> > to the public.
> >
> >> Signed-off-by: Zhenneng Li
> >> ---
> >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++
> >> 1 file changed, 2 insertions(+)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> index 3987ecb24ef4..1eced991b5b2 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> @@ -2903,6 +2903,8 @@ static void amdgpu_device_delayed_init_work_handler(struct work_struct *work)
> >> container_of(work, struct amdgpu_device, delayed_init_work.work);
> >> int r;
> >> + mdelay(1);
> >> +
> >
>
> > Wow, I wonder how long it took you to find that workaround.
>
> About 3 months. I tried changing the delay of this delayed work
> (amdgpu_device_delayed_init_work_handler) from 2000 ms to 2500 ms, and using
> mb() instead of mdelay(1), but neither helped. I do not know the reason. The
> bug occurs with a probability of about one in ten thousand. Do you know the
> possible reasons?

Oh, it’s not even always reproducible. That is hard. Did you try another
graphics card or another ARM board to rule out hardware-specific issues?

Sorry, I do not know the reason. Maybe the developers with access to
non-public datasheets and errata know.

> >> r = amdgpu_ib_ring_tests(adev);
> >> if (r)
> >> DRM_ERROR("ib ring test failed (%d).\n", r);


Kind regards,

Paul