2019-08-09 12:46:12

by Martin Wilck

[permalink] [raw]
Subject: 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages

This happened to me today, running kernel 5.3.0-rc3-1.g571863b-default
(5.3-rc3 with just a few patches on top), after starting a KVM virtual
machine. The X screen was frozen. Remote login via ssh was still
possible, thus I was able to retrieve basic logs.

sysrq-w showed two blocked processes (kcompactd0 and KVM). After a
minute, the same two processes were still blocked. KVM seems to try to
acquire a lock that kcompactd is holding. kcompactd is waiting for IO
to complete on pages owned by the i915 driver.

kcompactd stack:

Aug 09 12:12:48 apollon.suse.de kernel: sysrq: Show Blocked State
Aug 09 12:12:48 apollon.suse.de
kernel: task PC stack pid father
Aug 09 12:12:48 apollon.suse.de kernel:
kcompactd0 D 0 43 2 0x80004000
Aug 09 12:12:48 apollon.suse.de kernel: Call Trace:
Aug 09 12:12:48 apollon.suse.de kernel: ? __schedule+0x2af/0x6a0
Aug 09 12:12:48 apollon.suse.de kernel: schedule+0x33/0x90
Aug 09 12:12:48 apollon.suse.de kernel: io_schedule+0x12/0x40
Aug 09 12:12:48 apollon.suse.de kernel: __lock_page+0x123/0x200
Aug 09 12:12:48 apollon.suse.de kernel: ?
gen8_ppgtt_clear_pdp+0xc0/0x140 [i915]
Aug 09 12:12:48 apollon.suse.de kernel: ?
file_fdatawait_range+0x20/0x20
Aug 09 12:12:48 apollon.suse.de kernel: set_page_dirty_lock+0x49/0x50
Aug 09 12:12:48 apollon.suse.de
kernel: i915_gem_userptr_put_pages+0x13f/0x1c0 [i915]
Aug 09 12:12:48 apollon.suse.de
kernel: __i915_gem_object_put_pages+0x5e/0xa0 [i915]
Aug 09 12:12:48 apollon.suse.de
kernel: userptr_mn_invalidate_range_start+0x1ff/0x220 [i915]
Aug 09 12:12:48 apollon.suse.de
kernel: __mmu_notifier_invalidate_range_start+0x57/0xa0
Aug 09 12:12:48 apollon.suse.de kernel: try_to_unmap_one+0xa0b/0xae0
Aug 09 12:12:48 apollon.suse.de kernel: ? __mod_lruvec_state+0x3f/0xf0
Aug 09 12:12:48 apollon.suse.de kernel: rmap_walk_file+0xf2/0x250
Aug 09 12:12:48 apollon.suse.de kernel: try_to_unmap+0xa6/0xe0
Aug 09 12:12:48 apollon.suse.de kernel: ? page_remove_rmap+0x290/0x290
Aug 09 12:12:48 apollon.suse.de kernel: ? page_not_mapped+0x20/0x20
Aug 09 12:12:48 apollon.suse.de kernel: ? page_get_anon_vma+0x80/0x80
Aug 09 12:12:48 apollon.suse.de kernel: migrate_pages+0x8cd/0xbc0
Aug 09 12:12:48 apollon.suse.de kernel: ?
fast_isolate_freepages+0x6b0/0x6b0
Aug 09 12:12:48 apollon.suse.de kernel: ? move_freelist_tail+0xb0/0xb0
Aug 09 12:12:48 apollon.suse.de kernel: compact_zone+0x669/0xc80
Aug 09 12:12:48 apollon.suse.de kernel: ?
entry_SYSCALL_64_after_hwframe+0xb8/0xbe
Aug 09 12:12:48 apollon.suse.de kernel: kcompactd_do_work+0x120/0x290


KVM stack:

Aug 09 12:12:48 apollon.suse.de kernel: CPU 0/KVM D 0
25189 1 0x00000320
Aug 09 12:12:48 apollon.suse.de kernel: Call Trace:
Aug 09 12:12:48 apollon.suse.de kernel: ? __schedule+0x2af/0x6a0
Aug 09 12:12:48 apollon.suse.de kernel: schedule+0x33/0x90
Aug 09 12:12:48 apollon.suse.de
kernel: schedule_preempt_disabled+0xa/0x10
Aug 09 12:12:48 apollon.suse.de
kernel: __mutex_lock.isra.0+0x172/0x4d0
Aug 09 12:12:48 apollon.suse.de
kernel: userptr_mn_invalidate_range_start+0x1bf/0x220 [i915]
Aug 09 12:12:48 apollon.suse.de
kernel: __mmu_notifier_invalidate_range_start+0x57/0xa0
Aug 09 12:12:48 apollon.suse.de kernel: try_to_unmap_one+0xa0b/0xae0
Aug 09 12:12:48 apollon.suse.de kernel: rmap_walk_file+0xf2/0x250
Aug 09 12:12:48 apollon.suse.de kernel: try_to_unmap+0xa6/0xe0
Aug 09 12:12:48 apollon.suse.de kernel: ? page_remove_rmap+0x290/0x290
Aug 09 12:12:48 apollon.suse.de kernel: ? page_not_mapped+0x20/0x20
Aug 09 12:12:48 apollon.suse.de kernel: ? page_get_anon_vma+0x80/0x80
Aug 09 12:12:48 apollon.suse.de kernel: migrate_pages+0x8cd/0xbc0
Aug 09 12:12:48 apollon.suse.de kernel: ?
fast_isolate_freepages+0x6b0/0x6b0
Aug 09 12:12:48 apollon.suse.de kernel: ? move_freelist_tail+0xb0/0xb0
Aug 09 12:12:48 apollon.suse.de kernel: compact_zone+0x669/0xc80
Aug 09 12:12:48 apollon.suse.de kernel: compact_zone_order+0xc6/0xf0
Aug 09 12:12:48 apollon.suse.de
kernel: try_to_compact_pages+0xcc/0x2a0
Aug 09 12:12:48 apollon.suse.de
kernel: __alloc_pages_direct_compact+0x7c/0x150
Aug 09 12:12:48 apollon.suse.de
kernel: __alloc_pages_slowpath+0x1ee/0xd00
Aug 09 12:12:48 apollon.suse.de kernel: ? vmx_vcpu_load+0x100/0x120
[kvm_intel]

Full logs can be found under https://pastebin.com/KJ6tccj4
I haven't yet tried if this is reproducible.

Regards
Martin


2019-08-09 12:54:56

by Chris Wilson

[permalink] [raw]
Subject: Re: 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages

Quoting Martin Wilck (2019-08-09 13:41:42)
> This happened to me today, running kernel 5.3.0-rc3-1.g571863b-default
> (5.3-rc3 with just a few patches on top), after starting a KVM virtual
> machine. The X screen was frozen. Remote login via ssh was still
> possible, thus I was able to retrieve basic logs.
>
> sysrq-w showed two blocked processes (kcompactd0 and KVM). After a
> minute, the same two processes were still blocked. KVM seems to try to
> acquire a lock that kcompactd is holding. kcompactd is waiting for IO
> to complete on pages owned by the i915 driver.

My bad, it's known. We haven't decided on whether to revert the
unfortunate recursive locking (and so hit another warn elsewhere) or to
ignore the dirty pages (and so risk losing data across swap).

cb6d7c7dc7ff ("drm/i915/userptr: Acquire the page lock around set_page_dirty()")
-Chris

2019-09-10 14:34:28

by Leho Kraav

[permalink] [raw]
Subject: Re: 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages

On Fri, Aug 09, 2019 at 01:53:43PM +0100, Chris Wilson wrote:
> Quoting Martin Wilck (2019-08-09 13:41:42)
> > This happened to me today, running kernel 5.3.0-rc3-1.g571863b-default
> > (5.3-rc3 with just a few patches on top), after starting a KVM virtual
> > machine. The X screen was frozen. Remote login via ssh was still
> > possible, thus I was able to retrieve basic logs.
> >
> > sysrq-w showed two blocked processes (kcompactd0 and KVM). After a
> > minute, the same two processes were still blocked. KVM seems to try to
> > acquire a lock that kcompactd is holding. kcompactd is waiting for IO
> > to complete on pages owned by the i915 driver.
>
> My bad, it's known. We haven't decided on whether to revert the
> unfortunate recursive locking (and so hit another warn elsewhere) or to
> ignore the dirty pages (and so risk losing data across swap).
>
> cb6d7c7dc7ff ("drm/i915/userptr: Acquire the page lock around set_page_dirty()")
> -Chris

Hi Chris. Is this exactly what I'm hitting at
https://bugs.freedesktop.org/show_bug.cgi?id=111500 perhaps?

It reliably breaks the graphics userland, as the machine consistently
freezes at any random moment.

Any workaround options, even if with a performance penalty? Revert
cb6d7c7dc7ff but side effects?

5.3 has useful NVMe power mgmt updates for laptops, I'd like to stick
with the newest if possible.

--
Leho Kraav, senior technology & digital marketing architect

2019-09-12 12:02:30

by Leho Kraav

[permalink] [raw]
Subject: Re: 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages

On Thu, Sep 12, 2019 at 11:23:09AM +0000, Martin Wilck wrote:
>
> There's a considerable risk that many users will start seeing this
> regression when 5.3 is released. I am not aware of a workaround.
>
> Is there an alternative to reverting aa56a292ce62 ("drm/i915/userptr:
> Acquire the page lock around set_page_dirty()")? And if we do, what
> would be the consequences? Would other patches need to be reverted,
> too?

I've been running with revert patch for a couple of days and have not
encountered any kernel warnings thus far, nor any other ill effects that
could be attributed to this locking mechanism.

But I'm far from familiar with these subsystems.

Graphics does not hang anymore.

I've also received developer feedback in private that this really should
be fixed before 5.3 release.

--
Leho Kraav, senior technology & digital marketing architect

2019-09-12 12:05:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages

On Thu, Sep 12, 2019 at 12:51 PM Martin Wilck <[email protected]> wrote:
>
> Is there an alternative to reverting aa56a292ce62 ("drm/i915/userptr:
> Acquire the page lock around set_page_dirty()")? And if we do, what
> would be the consequences? Would other patches need to be reverted,
> too?

Looking at that commit, and the backtrace of the lockup, I think that
reverting it is the correct thing to do.

You can't take the page lock in invalidate_range(), since it's called
from try_to_unmap(), which is called with the page lock already held.

So commit aa56a292ce62 is just fundamentally completely wrong and
should be reverted.

Linus

2019-09-12 12:32:00

by Martin Wilck

[permalink] [raw]
Subject: Re: 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages

Hi Chris,

On Tue, 2019-09-10 at 17:20 +0300, Leho Kraav wrote:
> On Fri, Aug 09, 2019 at 01:53:43PM +0100, Chris Wilson wrote:
> > Quoting Martin Wilck (2019-08-09 13:41:42)
> > > This happened to me today, running kernel 5.3.0-rc3-1.g571863b-
> > > default
> > > (5.3-rc3 with just a few patches on top), after starting a KVM
> > > virtual
> > > machine. The X screen was frozen. Remote login via ssh was still
> > > possible, thus I was able to retrieve basic logs.
> > >
> > > sysrq-w showed two blocked processes (kcompactd0 and KVM). After
> > > a
> > > minute, the same two processes were still blocked. KVM seems to
> > > try to
> > > acquire a lock that kcompactd is holding. kcompactd is waiting
> > > for IO
> > > to complete on pages owned by the i915 driver.
> >
> > My bad, it's known. We haven't decided on whether to revert the
> > unfortunate recursive locking (and so hit another warn elsewhere)
> > or to
> > ignore the dirty pages (and so risk losing data across swap).
> >
> > cb6d7c7dc7ff ("drm/i915/userptr: Acquire the page lock around
> > set_page_dirty()")
> > -Chris
>
> Hi Chris. Is this exactly what I'm hitting at
> https://bugs.freedesktop.org/show_bug.cgi?id=111500 perhaps?
>
> It reliably breaks the graphics userland, as the machine consistently
> freezes at any random moment.
>
> Any workaround options, even if with a performance penalty? Revert
> cb6d7c7dc7ff but side effects?
>
> 5.3 has useful NVMe power mgmt updates for laptops, I'd like to stick
> with the newest if possible.

There's a considerable risk that many users will start seeing this
regression when 5.3 is released. I am not aware of a workaround.

Is there an alternative to reverting aa56a292ce62 ("drm/i915/userptr:
Acquire the page lock around set_page_dirty()")? And if we do, what
would be the consequences? Would other patches need to be reverted,
too?

Thanks,
Martin

2019-09-12 12:51:57

by Chris Wilson

[permalink] [raw]
Subject: Re: 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages

Quoting Linus Torvalds (2019-09-12 12:59:25)
> On Thu, Sep 12, 2019 at 12:51 PM Martin Wilck <[email protected]> wrote:
> >
> > Is there an alternative to reverting aa56a292ce62 ("drm/i915/userptr:
> > Acquire the page lock around set_page_dirty()")? And if we do, what
> > would be the consequences? Would other patches need to be reverted,
> > too?
>
> Looking at that commit, and the backtrace of the lockup, I think that
> reverting it is the correct thing to do.
>
> You can't take the page lock in invalidate_range(), since it's called
> from try_to_unmap(), which is called with the page lock already held.
>
> So commit aa56a292ce62 is just fundamentally completely wrong and
> should be reverted.

There's still the dilemma that we get called without the page lock, but
at this moment in time in order to hit 5.3, it needs a revert sent
directly to Linus.
-Chris

2019-09-12 17:30:37

by Chris Wilson

[permalink] [raw]
Subject: [PATCH] Revert "drm/i915/userptr: Acquire the page lock around set_page_dirty()"

The userptr put_pages can be called from inside try_to_unmap, and so
enters with the page lock held on one of the object's backing pages. We
cannot take the page lock ourselves for fear of recursion.

Reported-by: Lionel Landwerlin <[email protected]>
Reported-by: Martin Wilck <[email protected]>
Reported-by: Leo Kraav <[email protected]>
Fixes: aa56a292ce62 ("drm/i915/userptr: Acquire the page lock around set_page_dirty()")
References: https://bugzilla.kernel.org/show_bug.cgi?id=203317
Signed-off-by: Chris Wilson <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Cc: Jani Nikula <[email protected]>
Cc: Joonas Lahtinen <[email protected]>
Cc: [email protected]
---
drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 10 +---------
1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
index 74da35611d7c..11b231c187c5 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
@@ -672,15 +672,7 @@ i915_gem_userptr_put_pages(struct drm_i915_gem_object *obj,

for_each_sgt_page(page, sgt_iter, pages) {
if (obj->mm.dirty)
- /*
- * As this may not be anonymous memory (e.g. shmem)
- * but exist on a real mapping, we have to lock
- * the page in order to dirty it -- holding
- * the page reference is not sufficient to
- * prevent the inode from being truncated.
- * Play safe and take the lock.
- */
- set_page_dirty_lock(page);
+ set_page_dirty(page);

mark_page_accessed(page);
put_page(page);
--
2.23.0

2019-09-12 20:44:37

by Tvrtko Ursulin

[permalink] [raw]
Subject: Re: [Intel-gfx] [PATCH] Revert "drm/i915/userptr: Acquire the page lock around set_page_dirty()"


On 12/09/2019 13:56, Chris Wilson wrote:
> The userptr put_pages can be called from inside try_to_unmap, and so
> enters with the page lock held on one of the object's backing pages. We
> cannot take the page lock ourselves for fear of recursion.
>
> Reported-by: Lionel Landwerlin <[email protected]>
> Reported-by: Martin Wilck <[email protected]>
> Reported-by: Leo Kraav <[email protected]>
> Fixes: aa56a292ce62 ("drm/i915/userptr: Acquire the page lock around set_page_dirty()")
> References: https://bugzilla.kernel.org/show_bug.cgi?id=203317
> Signed-off-by: Chris Wilson <[email protected]>
> Cc: Tvrtko Ursulin <[email protected]>
> Cc: Jani Nikula <[email protected]>
> Cc: Joonas Lahtinen <[email protected]>
> Cc: [email protected]
> ---
> drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 10 +---------
> 1 file changed, 1 insertion(+), 9 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
> index 74da35611d7c..11b231c187c5 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
> @@ -672,15 +672,7 @@ i915_gem_userptr_put_pages(struct drm_i915_gem_object *obj,
>
> for_each_sgt_page(page, sgt_iter, pages) {
> if (obj->mm.dirty)
> - /*
> - * As this may not be anonymous memory (e.g. shmem)
> - * but exist on a real mapping, we have to lock
> - * the page in order to dirty it -- holding
> - * the page reference is not sufficient to
> - * prevent the inode from being truncated.
> - * Play safe and take the lock.
> - */
> - set_page_dirty_lock(page);
> + set_page_dirty(page);
>
> mark_page_accessed(page);
> put_page(page);
>

Acked-by: Tvrtko Ursulin <[email protected]>

Regards,

Tvrtko