2023-09-19 15:57:43

by Oleksandr Natalenko

[permalink] [raw]
Subject: [REGRESSION] [BISECTED] Panic in gen8_ggtt_insert_entries() with v6.5

/cc Matthew Wilcox and Andrew Morton because of folios (please see below).

On sobota 2. září 2023 18:14:12 CEST Oleksandr Natalenko wrote:
> Hello.
>
> Since v6.5 kernel the following HW:
>
> * Lenovo T460s laptop with Skylake GT2 [HD Graphics 520] (rev 07)
> * Lenovo T490s laptop with WhiskeyLake-U GT2 [UHD Graphics 620] (rev 02)
>
> is affected by the following crash once KDE on either X11 or Wayland is started:
>
> i915 0000:00:02.0: enabling device (0006 -> 0007)
> i915 0000:00:02.0: vgaarb: deactivate vga console
> i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=mem
> i915 0000:00:02.0: [drm] Finished loading DMC firmware i915/skl_dmc_ver1_27.bin (v1.27)
> [drm] Initialized i915 1.6.0 20201103 for 0000:00:02.0 on minor 1
> fbcon: i915drmfb (fb0) is primary device
> i915 0000:00:02.0: [drm] fb0: i915drmfb frame buffer device
> …
> memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=674 'kwin_wayland'
> BUG: unable to handle page fault for address: ffffb422c2800000
> #PF: supervisor write access in kernel mode
> #PF: error_code(0x0002) - not-present page
> PGD 100000067 P4D 100000067 PUD 1001df067 PMD 10d1cf067 PTE 0
> Oops: 0002 [#1] PREEMPT SMP PTI
> CPU: 1 PID: 674 Comm: kwin_wayland Not tainted 6.5.0-pf1 #1 a6c58ff41a7b8bb16a19f5af9e0e9bce20f9f38d
> Hardware name: LENOVO 20FAS2BM0F/20FAS2BM0F, BIOS N1CET90W (1.58 ) 11/15/2022
> RIP: 0010:gen8_ggtt_insert_entries+0xc2/0x140 [i915]
> …
> Call Trace:
> <TASK>
> intel_ggtt_bind_vma+0x3e/0x60 [i915 a83fdc6539431252dba13053979a8b680af86836]
> i915_vma_bind+0x216/0x4b0 [i915 a83fdc6539431252dba13053979a8b680af86836]
> i915_vma_pin_ww+0x405/0xa80 [i915 a83fdc6539431252dba13053979a8b680af86836]
> __i915_ggtt_pin+0x5a/0x130 [i915 a83fdc6539431252dba13053979a8b680af86836]
> i915_ggtt_pin+0x78/0x1f0 [i915 a83fdc6539431252dba13053979a8b680af86836]
> __intel_context_do_pin_ww+0x312/0x700 [i915 a83fdc6539431252dba13053979a8b680af86836]
> i915_gem_do_execbuffer+0xfc6/0x2720 [i915 a83fdc6539431252dba13053979a8b680af86836]
> i915_gem_execbuffer2_ioctl+0x111/0x260 [i915 a83fdc6539431252dba13053979a8b680af86836]
> drm_ioctl_kernel+0xca/0x170
> drm_ioctl+0x30f/0x580
> __x64_sys_ioctl+0x94/0xd0
> do_syscall_64+0x5d/0x90
> entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> …
> note: kwin_wayland[674] exited with irqs disabled
>
> RIP seems to translate into this:
>
> $ scripts/faddr2line drivers/gpu/drm/i915/gt/intel_ggtt.o gen8_ggtt_insert_entries+0xc2
> gen8_ggtt_insert_entries+0xc2/0x150:
> writeq at /home/pf/work/devel/own/pf-kernel/linux/./arch/x86/include/asm/io.h:99
> (inlined by) gen8_set_pte at /home/pf/work/devel/own/pf-kernel/linux/drivers/gpu/drm/i915/gt/intel_ggtt.c:257
> (inlined by) gen8_ggtt_insert_entries at /home/pf/work/devel/own/pf-kernel/linux/drivers/gpu/drm/i915/gt/intel_ggtt.c:300
>
> Probably, recent PTE-related changes are relevant:
>
> $ git log --oneline --no-merges v6.4..v6.5 -- drivers/gpu/drm/i915/gt/intel_ggtt.c
> 3532e75dfadcf drm/i915/uc: perma-pin firmwares
> 4722e2ebe6f21 drm/i915/gt: Fix second parameter type of pre-gen8 pte_encode callbacks
> 9275277d53248 drm/i915: use pat_index instead of cache_level
> 5e352e32aec23 drm/i915: preparation for using PAT index
> 341ad0e8e2542 drm/i915/mtl: Add PTE encode function
>
> Also note Lenovo T14s laptop with TigerLake-LP GT2 [Iris Xe Graphics] (rev 01) is not affected by this issue.
>
> Full dmesg with DRM debug enabled is available in the bugreport I've reported earlier [1]. I'm sending this email to make the issue more visible.
>
> Please help.
>
> Thanks.
>
> [1] https://gitlab.freedesktop.org/drm/intel/-/issues/9256

Matthew,

Andrzej asked me to try to revert commits 0b62af28f249, e0b72c14d8dc and 1e0877d58b1e, and reverting those fixed the i915 crash for me. The e0b72c14d8dc and 1e0877d58b1e commits look like just prerequisites, so I assume 0b62af28f249 ("i915: convert shmem_sg_free_table() to use a folio_batch") is the culprit here.

Could you please check this?

Our conversation with Andrzej is available at drm-intel GitLab [1].

Thanks.

[1] https://gitlab.freedesktop.org/drm/intel/-/issues/9256

--
Oleksandr Natalenko (post-factum)


Attachments:
signature.asc (849.00 B)
This is a digitally signed message part.

2023-09-19 18:01:54

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [REGRESSION] [BISECTED] Panic in gen8_ggtt_insert_entries() with v6.5

On Tue, Sep 19, 2023 at 10:26:42AM +0200, Oleksandr Natalenko wrote:
> Andrzej asked me to try to revert commits 0b62af28f249, e0b72c14d8dc and 1e0877d58b1e, and reverting those fixed the i915 crash for me. The e0b72c14d8dc and 1e0877d58b1e commits look like just prerequisites, so I assume 0b62af28f249 ("i915: convert shmem_sg_free_table() to use a folio_batch") is the culprit here.
>
> Could you please check this?
>
> Our conversation with Andrzej is available at drm-intel GitLab [1].
>
> Thanks.
>
> [1] https://gitlab.freedesktop.org/drm/intel/-/issues/9256

Wow, that is some great debugging. Thanks for all the time & effort
you and others have invested. Sorry for breaking your system.

You're almost right about the "prerequisites", but it's in the other
direction; 0b62af28f249 is a prerequisite for the later two cleanups,
so reverting all three is necessary to test 0b62af28f249.

It seems to me that you've isolated the problem to constructing overly
long sg lists. I didn't realise that was going to be a problem, so
that's my fault.

Could I ask you to try this patch? I'll follow up with another patch
later because I think I made another assumption that may not be valid.

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
index 8f1633c3fb93..73a4a4eb29e0 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
@@ -100,6 +100,7 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, struct sg_table *st,
st->nents = 0;
for (i = 0; i < page_count; i++) {
struct folio *folio;
+ unsigned long nr_pages;
const unsigned int shrink[] = {
I915_SHRINK_BOUND | I915_SHRINK_UNBOUND,
0,
@@ -150,6 +151,8 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, struct sg_table *st,
}
} while (1);

+ nr_pages = min_t(unsigned long,
+ folio_nr_pages(folio), page_count - i);
if (!i ||
sg->length >= max_segment ||
folio_pfn(folio) != next_pfn) {
@@ -157,13 +160,13 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, struct sg_table *st,
sg = sg_next(sg);

st->nents++;
- sg_set_folio(sg, folio, folio_size(folio), 0);
+ sg_set_folio(sg, folio, nr_pages * PAGE_SIZE, 0);
} else {
/* XXX: could overflow? */
- sg->length += folio_size(folio);
+ sg->length += nr_pages * PAGE_SIZE;
}
- next_pfn = folio_pfn(folio) + folio_nr_pages(folio);
- i += folio_nr_pages(folio) - 1;
+ next_pfn = folio_pfn(folio) + nr_pages;
+ i += nr_pages - 1;

/* Check that the i965g/gm workaround works. */
GEM_BUG_ON(gfp & __GFP_DMA32 && next_pfn >= 0x00100000UL);

2023-09-19 19:20:57

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [REGRESSION] [BISECTED] Panic in gen8_ggtt_insert_entries() with v6.5

On Tue, Sep 19, 2023 at 08:11:47PM +0200, Oleksandr Natalenko wrote:
> I can confirm this one fixes the issue for me on T460s laptop. Thank you!

Yay!

> Should you submit it, please add:
>
> Fixes: 0b62af28f2 ("i915: convert shmem_sg_free_table() to use a folio_batch")

Thanks for collecting all these; you're making my life really easy.
One minor point is that the standard format for Fixes: is 12 characters,
not 10. eg,

Documentation/process/5.Posting.rst: Fixes: 1f2e3d4c5b6a ("The first line of the commit specified by the first 12 characters of its SHA-1 ID")

I have this in my .gitconfig:

[pretty]
fixes = Fixes: %h (\"%s\")

and in .git/config,

[core]
abbrev = 12

I'm working on the commit message now.

2023-09-19 22:35:27

by Oleksandr Natalenko

[permalink] [raw]
Subject: Re: [REGRESSION] [BISECTED] Panic in gen8_ggtt_insert_entries() with v6.5

Hello.

On úterý 19. září 2023 17:43:40 CEST Matthew Wilcox wrote:
> On Tue, Sep 19, 2023 at 10:26:42AM +0200, Oleksandr Natalenko wrote:
> > Andrzej asked me to try to revert commits 0b62af28f249, e0b72c14d8dc and 1e0877d58b1e, and reverting those fixed the i915 crash for me. The e0b72c14d8dc and 1e0877d58b1e commits look like just prerequisites, so I assume 0b62af28f249 ("i915: convert shmem_sg_free_table() to use a folio_batch") is the culprit here.
> >
> > Could you please check this?
> >
> > Our conversation with Andrzej is available at drm-intel GitLab [1].
> >
> > Thanks.
> >
> > [1] https://gitlab.freedesktop.org/drm/intel/-/issues/9256
>
> Wow, that is some great debugging. Thanks for all the time & effort
> you and others have invested. Sorry for breaking your system.
>
> You're almost right about the "prerequisites", but it's in the other
> direction; 0b62af28f249 is a prerequisite for the later two cleanups,
> so reverting all three is necessary to test 0b62af28f249.
>
> It seems to me that you've isolated the problem to constructing overly
> long sg lists. I didn't realise that was going to be a problem, so
> that's my fault.
>
> Could I ask you to try this patch? I'll follow up with another patch
> later because I think I made another assumption that may not be valid.

I can confirm this one fixes the issue for me on T460s laptop. Thank you!

Should you submit it, please add:

Fixes: 0b62af28f2 ("i915: convert shmem_sg_free_table() to use a folio_batch")
Cc: [email protected] # 6.5.x
Link: https://gitlab.freedesktop.org/drm/intel/-/issues/9256
Link: https://lore.kernel.org/lkml/[email protected]/
Reported-by: Oleksandr Natalenko <[email protected]>
Tested-by: Oleksandr Natalenko <[email protected]>

> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
> index 8f1633c3fb93..73a4a4eb29e0 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
> @@ -100,6 +100,7 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, struct sg_table *st,
> st->nents = 0;
> for (i = 0; i < page_count; i++) {
> struct folio *folio;
> + unsigned long nr_pages;
> const unsigned int shrink[] = {
> I915_SHRINK_BOUND | I915_SHRINK_UNBOUND,
> 0,
> @@ -150,6 +151,8 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, struct sg_table *st,
> }
> } while (1);
>
> + nr_pages = min_t(unsigned long,
> + folio_nr_pages(folio), page_count - i);
> if (!i ||
> sg->length >= max_segment ||
> folio_pfn(folio) != next_pfn) {
> @@ -157,13 +160,13 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, struct sg_table *st,
> sg = sg_next(sg);
>
> st->nents++;
> - sg_set_folio(sg, folio, folio_size(folio), 0);
> + sg_set_folio(sg, folio, nr_pages * PAGE_SIZE, 0);
> } else {
> /* XXX: could overflow? */
> - sg->length += folio_size(folio);
> + sg->length += nr_pages * PAGE_SIZE;
> }
> - next_pfn = folio_pfn(folio) + folio_nr_pages(folio);
> - i += folio_nr_pages(folio) - 1;
> + next_pfn = folio_pfn(folio) + nr_pages;
> + i += nr_pages - 1;
>
> /* Check that the i965g/gm workaround works. */
> GEM_BUG_ON(gfp & __GFP_DMA32 && next_pfn >= 0x00100000UL);

--
Oleksandr Natalenko (post-factum)


Attachments:
signature.asc (849.00 B)
This is a digitally signed message part.

2023-09-20 00:00:38

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [REGRESSION] [BISECTED] Panic in gen8_ggtt_insert_entries() with v6.5

On Tue, Sep 19, 2023 at 04:43:41PM +0100, Matthew Wilcox wrote:
> Could I ask you to try this patch? I'll follow up with another patch
> later because I think I made another assumption that may not be valid.

Ah, no, never mind. I thought we could start in the middle of a folio,
but we always start constructing the sg list from index 0 of the file,
so we always start at the first page of a folio. If this patch solves
your problem, then I think we're done.