Subject: Re: Regression in i915 for 4.11-rc1 - bisected to commit 69df05e11ab8
To: Chris Wilson <chris@chris-wilson.co.uk>,
        LKML <linux-kernel@vger.kernel.org>,
        Tvrtko Ursulin <tvrtko.ursulin@intel.com>,
        intel-gfx@lists.freedesktop.org,
        Jani Nikula <jani.nikula@linux.intel.com>,
        Daniel Vetter <daniel.vetter@intel.com>,
        Thorsten Leemhuis <regressions@leemhuis.info>
References: <558064cb-f489-a743-79cb-c88fd06f17aa@lwfinger.net>
 <20170323204452.GN27773@nuc-i3427.alporthouse.com>
From: Larry Finger <Larry.Finger@lwfinger.net>
Message-ID: <3f59e4e1-10f6-aa06-dfb8-d1443d0920a0@lwfinger.net>
Date: Thu, 23 Mar 2017 16:23:49 -0500
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
 Thunderbird/45.8.0
MIME-Version: 1.0
In-Reply-To: <20170323204452.GN27773@nuc-i3427.alporthouse.com>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 3555
Lines: 69

On 03/23/2017 03:44 PM, Chris Wilson wrote:
> On Thu, Mar 23, 2017 at 01:19:43PM -0500, Larry Finger wrote:
>> Since kernel 4.11-rc1, my desktop (Plasma5/KDE) has encountered
>> intermittent hangs with the following information in the logs:
>>
>> linux-4v1g.suse kernel: [drm] GPU HANG: ecode 7:0:0xf3cffffe, in
>> plasmashell [1283], reason: Hang on render ring, action: reset
>> linux-4v1g.suse kernel: [drm] GPU hangs can indicate a bug anywhere
>> in the entire gfx stack, including userspace.
>> linux-4v1g.suse kernel: [drm] Please file a _new_ bug report on
>> bugs.freedesktop.org against DRI -> DRM/Intel
>> linux-4v1g.suse kernel: [drm] drm/i915 developers can then reassign
>> to the right component if it's not a kernel issue.
>> linux-4v1g.suse kernel: [drm] The gpu crash dump is required to
>> analyze gpu hangs, so please always attach it.
>> linux-4v1g.suse kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
>> linux-4v1g.suse kernel: drm/i915: Resetting chip after gpu hang
>>
>> This problem was added to
>> https://bugs.freedesktop.org/show_bug.cgi?id=99380, but it probably
>> is a different bug, as the OP in that report has problems with
>> kernel 4.10.x, whereas my problem did not appear until 4.11.
>
> Close. Actually that patch touches code you are not using (oa-perf and
> gvt), the real culprit was e8a9c58fcd9a ("drm/i915: Unify active context
> tracking between legacy/execlists/guc").
>
> The fix
>
> commit 5d4bac5503fcc67dd7999571e243cee49371aef7
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Wed Mar 22 20:59:30 2017 +0000
>
>     drm/i915: Restore marking context objects as dirty on pinning
>
>     Commit e8a9c58fcd9a ("drm/i915: Unify active context tracking between
>     legacy/execlists/guc") converted the legacy intel_ringbuffer submission
>     to the same context pinning mechanism as execlists - that is to pin the
>     context until the subsequent request is retired. Previously it used the
>     vma retirement of the context object to keep itself pinned until the
>     next request (after i915_vma_move_to_active()). In the conversion, I
>     missed that the vma retirement was also responsible for marking the
>     object as dirty. Mark the context object as dirty when pinning
>     (equivalent to execlists) which ensures that if the context is swapped
>     out due to mempressure or suspend/hibernation, when it is loaded back in
>     it does so with the previous state (and not all zero).
>
>     Fixes: e8a9c58fcd9a ("drm/i915: Unify active context tracking between legacy/execlists/guc")
>     Reported-by: Dennis Gilmore <dennis@ausil.us>
>     Reported-by: Mathieu Marquer <mathieu.marquer@gmail.com>
>     Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99993
>     Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100181
>     Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>     Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>     Cc: <drm-intel-fixes@lists.freedesktop.org> # v4.11-rc1
>     Link: http://patchwork.freedesktop.org/patch/msgid/20170322205930.12762-1-chris@chris-wilson.co.uk
>     Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>
> went in this morning and so will be upstreamed ~next week.
> -Chris
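
As a rough illustration of the failure mode the commit message above
describes, here is a toy user-space model in C. It is not the i915 code:
every name in it (toy_ctx_obj, toy_pin, toy_swap_out, toy_roundtrip) is
made up for this sketch, and the "write back only if dirty" eviction rule
is a simplifying assumption. It only shows why state written by the GPU
can come back as all zero when nothing marks the object dirty.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define STATE_WORDS 4

/* Hypothetical stand-in for a context object; not the i915 structure. */
struct toy_ctx_obj {
	unsigned int pages[STATE_WORDS];   /* resident copy of the state */
	unsigned int backing[STATE_WORDS]; /* backing store, e.g. swap */
	bool resident;
	bool dirty;
};

/* Evict the object: contents are written back only if marked dirty. */
static void toy_swap_out(struct toy_ctx_obj *obj)
{
	if (obj->dirty)
		memcpy(obj->backing, obj->pages, sizeof(obj->pages));
	memset(obj->pages, 0, sizeof(obj->pages));
	obj->resident = false;
	obj->dirty = false;
}

/* Bring the object back from the backing store. */
static void toy_swap_in(struct toy_ctx_obj *obj)
{
	assert(!obj->resident);
	memcpy(obj->pages, obj->backing, sizeof(obj->pages));
	obj->resident = true;
}

/* Pin for use; mark_dirty models marking the object dirty on pinning. */
static void toy_pin(struct toy_ctx_obj *obj, bool mark_dirty)
{
	if (!obj->resident)
		toy_swap_in(obj);
	if (mark_dirty)
		obj->dirty = true;
}

/* The "GPU" saves its state into the object without CPU involvement. */
static void toy_gpu_save_state(struct toy_ctx_obj *obj, unsigned int value)
{
	for (int i = 0; i < STATE_WORDS; i++)
		obj->pages[i] = value;
}

/* Pin, let the GPU save state, evict, pin again; return what survived. */
unsigned int toy_roundtrip(bool mark_dirty)
{
	struct toy_ctx_obj obj = {0};

	toy_pin(&obj, mark_dirty);
	toy_gpu_save_state(&obj, 0xf00d);
	toy_swap_out(&obj);
	toy_pin(&obj, mark_dirty);
	return obj.pages[0];
}
```

In this model toy_roundtrip(false) returns 0, because the saved state is
dropped at eviction, while toy_roundtrip(true) returns 0xf00d.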

Thanks. Bisecting a bug that is this hard to trigger is error-prone, so I am
surprised that the only step I got wrong was the last one. BTW, my kernel with
69df05e11ab8 reverted failed after 20 hours. I was about to write again when
your fix arrived. Good timing.

If your patch does not fix my problem, I will let you know.

Larry