2021-05-12 09:58:55

by Hans de Goede

Subject: 5.13 i915/PAT regression on Braswell, adding nopat to the kernel command line works around this

Hi All,

I'm not sure if this is an i915 bug, or caused by changes elsewhere in the kernel,
so I thought it would be best to just send out an email and then see from there.

With 5.13-rc1 gdm fails to show and dmesg contains:

[ 38.504613] x86/PAT: Xwayland:683 map pfn RAM range req write-combining for [mem 0x23883000-0x23883fff], got write-back
<repeated lots of times for different ranges>
[ 39.484766] x86/PAT: gnome-shell:632 map pfn RAM range req write-combining for [mem 0x1c6a3000-0x1c6a3fff], got write-back
<repeated lots of times for different ranges>
[ 54.314858] Asynchronous wait on fence 0000:00:02.0:gnome-shell[632]:a timed out (hint:intel_cursor_plane_create [i915])
[ 58.339769] i915 0000:00:02.0: [drm] GPU HANG: ecode 8:1:86dfdffb, in gnome-shell [632]
[ 58.341161] i915 0000:00:02.0: [drm] Resetting rcs0 for stopped heartbeat on rcs0
[ 58.341267] i915 0000:00:02.0: [drm] gnome-shell[632] context reset due to GPU hang
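
Since the x86/PAT message repeats many times, a small script can tally it per process and per requested/granted memory type. The following is only a sketch: the regular expression is derived from the two lines quoted above, and the function name summarize_pat_mismatches is hypothetical. Other kernel versions may word the message differently.

```python
import re
from collections import Counter

# Pattern modelled on the two x86/PAT lines quoted in this report (an assumption:
# it is not taken from the kernel source and may not match other kernel versions).
PAT_LINE = re.compile(
    r"x86/PAT: (?P<comm>\S+):(?P<pid>\d+) map pfn RAM range "
    r"req (?P<req>[\w-]+) for "
    r"\[mem (?P<start>0x[0-9a-f]+)-(?P<end>0x[0-9a-f]+)\], "
    r"got (?P<got>[\w-]+)"
)


def summarize_pat_mismatches(lines):
    """Count PAT mismatch messages per (process, requested type, granted type)."""
    counts = Counter()
    for line in lines:
        match = PAT_LINE.search(line)
        if match:
            counts[(match["comm"], match["req"], match["got"])] += 1
    return counts
```

Passing the lines of a saved dmesg log to summarize_pat_mismatches() gives one counter entry per process and type pair, which makes it easier to see whether every mismatch is a write-combining request that was answered with write-back.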

Because of the PAT errors I tried adding "nopat" to the kernel command line,
and I'm happy to report that this works around the problem.
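
When testing the workaround it helps to confirm that the running kernel was really booted with the option. A minimal sketch, assuming a Linux system where /proc/cmdline is readable; the helper names has_kernel_param and running_with_nopat are hypothetical, and the simple whitespace split ignores quoted arguments that contain spaces.

```python
def has_kernel_param(cmdline, name):
    """Return True if `name` appears as a bare flag or as `name=value`."""
    for token in cmdline.split():
        if token == name or token.startswith(name + "="):
            return True
    return False


def running_with_nopat(path="/proc/cmdline"):
    """Check the command line of the running kernel for the nopat flag."""
    with open(path) as f:
        return has_kernel_param(f.read(), "nopat")
```

Calling running_with_nopat() after a reboot shows whether the boot loader change took effect before any conclusions are drawn from the presence or absence of the PAT messages.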

Any hints on how to debug this further (without doing a full git bisect) would be
appreciated.

Regards,

Hans


2021-05-12 11:17:28

by Peter Zijlstra

Subject: Re: 5.13 i915/PAT regression on Braswell, adding nopat to the kernel command line works around this

On Wed, May 12, 2021 at 11:57:02AM +0200, Hans de Goede wrote:
> Hi All,
>
> I'm not sure if this is an i915 bug, or caused by changes elsewhere in the kernel,
> so I thought it would be best to just send out an email and then see from there.
>
> With 5.13-rc1 gdm fails to show and dmesg contains:
>
> [ 38.504613] x86/PAT: Xwayland:683 map pfn RAM range req write-combining for [mem 0x23883000-0x23883fff], got write-back
> <repeated lots of times for different ranges>
> [ 39.484766] x86/PAT: gnome-shell:632 map pfn RAM range req write-combining for [mem 0x1c6a3000-0x1c6a3fff], got write-back
> <repeated lots of times for different ranges>
> [ 54.314858] Asynchronous wait on fence 0000:00:02.0:gnome-shell[632]:a timed out (hint:intel_cursor_plane_create [i915])
> [ 58.339769] i915 0000:00:02.0: [drm] GPU HANG: ecode 8:1:86dfdffb, in gnome-shell [632]
> [ 58.341161] i915 0000:00:02.0: [drm] Resetting rcs0 for stopped heartbeat on rcs0
> [ 58.341267] i915 0000:00:02.0: [drm] gnome-shell[632] context reset due to GPU hang
>
> Because of the PAT errors I tried adding "nopat" to the kernel command line,
> and I'm happy to report that this works around the problem.
>
> Any hints on how to debug this further (without doing a full git bisect) would be
> appreciated.

IIRC it's because of 74ffa5a3e685 ("mm: add remap_pfn_range_notrack"),
which added a sanity check to make sure expectations were met. It turns
out they were not.

The bug is not new, the warning is. AFAIK the i915 team is aware, but
other than that I've not followed.

2021-05-12 12:01:17

by Christoph Hellwig

Subject: Re: 5.13 i915/PAT regression on Braswell, adding nopat to the kernel command line works around this

On Wed, May 12, 2021 at 01:15:03PM +0200, Peter Zijlstra wrote:
> IIRC it's because of 74ffa5a3e685 ("mm: add remap_pfn_range_notrack"),
> which added a sanity check to make sure expectations were met. It turns
> out they were not.
>
> The bug is not new, the warning is. AFAIK the i915 team is aware, but
> other than that I've not followed.


The actual culprit is b12d691ea5e0 ("i915: fix remap_io_sg to verify the
pgprot"), but otherwise agreed. Somehow the i915 maintainers all seem
to be on vacation as the previous report did not manage to trigger any
kind of reply.

2021-05-12 12:53:13

by Hans de Goede

Subject: Re: 5.13 i915/PAT regression on Braswell, adding nopat to the kernel command line works around this

Hi,

On 5/12/21 1:57 PM, Christoph Hellwig wrote:
> On Wed, May 12, 2021 at 01:15:03PM +0200, Peter Zijlstra wrote:
>> IIRC it's because of 74ffa5a3e685 ("mm: add remap_pfn_range_notrack"),
>> which added a sanity check to make sure expectations were met. It turns
>> out they were not.
>>
>> The bug is not new, the warning is. AFAIK the i915 team is aware, but
>> other than that I've not followed.
>
>
> The actual culprit is b12d691ea5e0 ("i915: fix remap_io_sg to verify the
> pgprot"), but otherwise agreed. Somehow the i915 maintainers all seem
> to be on vacation as the previous report did not manage to trigger any
> kind of reply.

I can confirm that reverting that commit restores i915 functionality with
5.13-rc1 on the Braswell machine on which I have been testing this.

Regards,

Hans

2021-05-19 21:01:16

by Jani Nikula

Subject: Re: [Intel-gfx] 5.13 i915/PAT regression on Braswell, adding nopat to the kernel command line works around this

On Wed, 12 May 2021, Christoph Hellwig wrote:
> On Wed, May 12, 2021 at 01:15:03PM +0200, Peter Zijlstra wrote:
>> IIRC it's because of 74ffa5a3e685 ("mm: add remap_pfn_range_notrack"),
>> which added a sanity check to make sure expectations were met. It turns
>> out they were not.
>>
>> The bug is not new, the warning is. AFAIK the i915 team is aware, but
>> other than that I've not followed.
>
>
> The actual culprit is b12d691ea5e0 ("i915: fix remap_io_sg to verify the
> pgprot"), but otherwise agreed. Somehow the i915 maintainers all seem
> to be on vacation as the previous report did not manage to trigger any
> kind of reply.

We are aware. I've been rattling the cages to get more attention.


BR,
Jani.


--
Jani Nikula, Intel Open Source Graphics Center