Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756800AbdCWVXx (ORCPT ); Thu, 23 Mar 2017 17:23:53 -0400
Received: from mail-ot0-f195.google.com ([74.125.82.195]:35621 "EHLO mail-ot0-f195.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751612AbdCWVXw (ORCPT ); Thu, 23 Mar 2017 17:23:52 -0400
Subject: Re: Regression in i915 for 4.11-rc1 - bisected to commit 69df05e11ab8
To: Chris Wilson, LKML, Tvrtko Ursulin, intel-gfx@lists.freedesktop.org, Jani Nikula, Daniel Vetter, Thorsten Leemhuis
References: <558064cb-f489-a743-79cb-c88fd06f17aa@lwfinger.net> <20170323204452.GN27773@nuc-i3427.alporthouse.com>
From: Larry Finger
Message-ID: <3f59e4e1-10f6-aa06-dfb8-d1443d0920a0@lwfinger.net>
Date: Thu, 23 Mar 2017 16:23:49 -0500
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0
MIME-Version: 1.0
In-Reply-To: <20170323204452.GN27773@nuc-i3427.alporthouse.com>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 3555
Lines: 69

On 03/23/2017 03:44 PM, Chris Wilson wrote:
> On Thu, Mar 23, 2017 at 01:19:43PM -0500, Larry Finger wrote:
>> Since kernel 4.11-rc1, my desktop (Plasma5/KDE) has encountered
>> intermittent hangs with the following information in the logs:
>>
>> linux-4v1g.suse kernel: [drm] GPU HANG: ecode 7:0:0xf3cffffe, in
>> plasmashell [1283], reason: Hang on render ring, action: reset
>> linux-4v1g.suse kernel: [drm] GPU hangs can indicate a bug anywhere
>> in the entire gfx stack, including userspace.
>> linux-4v1g.suse kernel: [drm] Please file a _new_ bug report on
>> bugs.freedesktop.org against DRI -> DRM/Intel
>> linux-4v1g.suse kernel: [drm] drm/i915 developers can then reassign
>> to the right component if it's not a kernel issue.
>> linux-4v1g.suse kernel: [drm] The gpu crash dump is required to
>> analyze gpu hangs, so please always attach it.
>> linux-4v1g.suse kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
>> linux-4v1g.suse kernel: drm/i915: Resetting chip after gpu hang
>>
>> This problem was added to
>> https://bugs.freedesktop.org/show_bug.cgi?id=99380, but it probably
>> is a different bug, as the OP in that report has problems with
>> kernel 4.10.x, whereas my problem did not appear until 4.11.
>
> Close. Actually that patch touches code you are not using (oa-perf and
> gvt), the real culprit was e8a9c58fcd9a ("drm/i915: Unify active context
> tracking between legacy/execlists/guc").
>
> The fix
>
> commit 5d4bac5503fcc67dd7999571e243cee49371aef7
> Author: Chris Wilson
> Date: Wed Mar 22 20:59:30 2017 +0000
>
> drm/i915: Restore marking context objects as dirty on pinning
>
> Commit e8a9c58fcd9a ("drm/i915: Unify active context tracking between
> legacy/execlists/guc") converted the legacy intel_ringbuffer submission
> to the same context pinning mechanism as execlists - that is to pin the
> context until the subsequent request is retired. Previously it used the
> vma retirement of the context object to keep itself pinned until the
> next request (after i915_vma_move_to_active()). In the conversion, I
> missed that the vma retirement was also responsible for marking the
> object as dirty. Mark the context object as dirty when pinning
> (equivalent to execlists) which ensures that if the context is swapped
> out due to mempressure or suspend/hibernation, when it is loaded back in
> it does so with the previous state (and not all zero).
>
> Fixes: e8a9c58fcd9a ("drm/i915: Unify active context tracking between legacy/execlists/guc")
> Reported-by: Dennis Gilmore
> Reported-by: Mathieu Marquer
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99993
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100181
> Signed-off-by: Chris Wilson
> Cc: Tvrtko Ursulin
> Cc: # v4.11-rc1
> Link: http://patchwork.freedesktop.org/patch/msgid/20170322205930.12762-1-chris@chris-wilson.co.uk
> Reviewed-by: Tvrtko Ursulin
>
> went in this morning and so will be upstreamed ~next week.
> -Chris

Thanks. With a bug that is difficult to trigger, bisection is difficult. I am
surprised that the only step I got wrong was the last one.

BTW, my reversion failed after 20 hours. I was ready to write again when I got
your fix. Good timing.

If your patch does not fix my problem, I will let you know.

Larry