2012-10-23 17:07:16

by Justin P. Mattock

Subject: [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung

This is happening both with MAINLINE and NEXT.

Basically the system runs fine, then under load it becomes really
sluggish and unresponsive. I was able to get the dmesg output of the error:

[ 7745.007008] ath9k 0000:05:00.0 wlan0: disabling VHT as WMM/QoS is not supported by the AP
[ 7745.007736] wlan0: associate with 68:7f:74:b8:05:82 (try 1/3)
[ 7745.011456] wlan0: RX AssocResp from 68:7f:74:b8:05:82 (capab=0x411 status=0 aid=5)
[ 7745.011529] wlan0: associated
[ 8120.812482] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[ 8120.812642] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[ 8122.328682] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[ 8122.328845] [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
[ 8122.328850] [drm:i915_reset] *ERROR* Failed to reset chip.

full log is here: http://fpaste.org/7xH8/

As for good kernels, the last one I remember working is 3.6.0-rc1. I can try
a bisect on this once I get the time, or test a patch if anybody has one.

Justin P. Mattock


2012-10-23 17:39:47

by Daniel Vetter

Subject: Re: [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung

On Tue, Oct 23, 2012 at 10:06:52AM -0700, Justin P. Mattock wrote:
> This is happening both with MAINLINE and NEXT.
>
> basically system is running fine, then under load system becomes
> really sluggish and unresponsive. I was able to get dmesg of the
> error..:
>
> [ 7745.007008] ath9k 0000:05:00.0 wlan0: disabling VHT as WMM/QoS is
> not supported by the AP
> [ 7745.007736] wlan0: associate with 68:7f:74:b8:05:82 (try 1/3)
> [ 7745.011456] wlan0: RX AssocResp from 68:7f:74:b8:05:82
> (capab=0x411 status=0 aid=5)
> [ 7745.011529] wlan0: associated
> [ 8120.812482] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer
> elapsed... GPU hung
> [ 8120.812642] [drm] capturing error event; look for more
> information in /debug/dri/0/i915_error_state
> [ 8122.328682] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer
> elapsed... GPU hung
> [ 8122.328845] [drm:i915_reset] *ERROR* GPU hanging too fast,
> declaring wedged!
> [ 8122.328850] [drm:i915_reset] *ERROR* Failed to reset chip.
>
> full log is here: http://fpaste.org/7xH8/
>
> as for good kernels from what I remember 3.6.0-rc1. I can try a
> bisect on this once I get the time. or if anybody has a patch I can
> test.

Can you please rehang your machine, and then grab the i915_error_state
from debugfs? That contains the GPU hang dump we need to diagnose things.

And the bisect would obviously be awesome.

Thanks, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
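
For reference, a minimal sketch of grabbing the dump requested above. It assumes debugfs is mounted at /sys/kernel/debug (the dmesg output prints /debug/..., and the mount point varies between systems) and that the GPU is DRM device 0. The helper name save_error_state is made up for illustration, and the read normally needs root.

```shell
#!/bin/sh
# Sketch: copy the i915 error state out of debugfs and compress it.
# Assumptions (not from the thread): debugfs root defaults to
# /sys/kernel/debug, the GPU is DRM device 0, and save_error_state
# is a hypothetical helper name.

save_error_state() {
    debugfs_root="${1:-/sys/kernel/debug}"
    out="${2:-i915_error_state.txt}"
    src="$debugfs_root/dri/0/i915_error_state"

    if [ ! -r "$src" ]; then
        echo "cannot read $src (debugfs mounted? running as root?)" >&2
        return 1
    fi

    # Copy first, then compress; the dump can be large.
    cat "$src" > "$out" && gzip -f "$out"
}

# Typical use, as root, after the hang has been reproduced:
#   mount -t debugfs none /sys/kernel/debug   # only if not already mounted
#   save_error_state /sys/kernel/debug /tmp/i915_error_state.txt
```

Compressing the copy keeps the attachment small enough for a bug tracker or a mail.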

2012-10-25 05:22:58

by Justin P. Mattock

Subject: Re: [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung

>
>
> On Tue, Oct 23, 2012 at 10:06:52AM -0700, Justin P. Mattock wrote:
> > This is happening both with MAINLINE and NEXT.
> >
> > basically system is running fine, then under load system becomes
> > really sluggish and unresponsive. I was able to get dmesg of the
> > error..:
> >
> > [ 7745.007008] ath9k 0000:05:00.0 wlan0: disabling VHT as WMM/QoS is
> > not supported by the AP
> > [ 7745.007736] wlan0: associate with 68:7f:74:b8:05:82 (try 1/3)
> > [ 7745.011456] wlan0: RX AssocResp from 68:7f:74:b8:05:82
> > (capab=0x411 status=0 aid=5)
> > [ 7745.011529] wlan0: associated
> > [ 8120.812482] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer
> > elapsed... GPU hung
> > [ 8120.812642] [drm] capturing error event; look for more
> > information in /debug/dri/0/i915_error_state
> > [ 8122.328682] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer
> > elapsed... GPU hung
> > [ 8122.328845] [drm:i915_reset] *ERROR* GPU hanging too fast,
> > declaring wedged!
> > [ 8122.328850] [drm:i915_reset] *ERROR* Failed to reset chip.
> >
> > full log is here: http://fpaste.org/7xH8/
> >
> > as for good kernels from what I remember 3.6.0-rc1. I can try a
> > bisect on this once I get the time. or if anybody has a patch I can
> > test.
>
> Can you please rehang your machine, and then grab the i915_error_state
> from debugfs? That contains the GPU hang dump we need to diagnose things.
>
> And the bisect would obviously be awesome.
>
> Thanks, Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch

It took a while to trigger, but the hang finally fired off.

Here is a link to the file, intel_error_decode:
http://www.filefactory.com/file/22bypyjhs4mx

The file was too large to send to the list. Let me know if you need more
info on this.
Also, if anybody has ideas on how to trigger this reliably, that would be
appreciated, so the bisect can be more precise. Right now I don't even think
a bisect is worth it: without a reliable trigger it could go astray and
point to the wrong commit (which has happened in the past). But then again,
you never know.

Justin P. Mattock
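
A hedged sketch of how such a bisect could be set up when the trigger is unreliable. The helper name start_bisect is hypothetical. Using the v3.6-rc1 tag as the good revision follows the report at the top of the thread, but whether that tag is really good on this machine is an assumption that would need checking first.

```shell
#!/bin/sh
# Sketch: start a bisect between a known-good and a known-bad revision.
# start_bisect is a hypothetical helper; run it inside the kernel tree.

start_bisect() {
    good="$1"          # e.g. v3.6-rc1 (assumed good, per the report)
    bad="${2:-HEAD}"   # defaults to the currently checked-out revision
    git bisect start "$bad" "$good"
}

# At each step: build, boot, and load the machine for long enough.
#   git bisect bad     # the GPU hung
#   git bisect good    # only after a run clearly longer than the usual time-to-hang
#   git bisect skip    # could not decide (e.g. build failure); git picks another commit
#   git bisect reset   # when finished, return to the original checkout
```

With a hang that takes hours to appear, a false "good" is the main risk, since a single wrong answer sends the bisect to an unrelated commit. Marking a step good only after a generous soak time reduces that risk at the cost of a slower bisect.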

2012-10-25 08:16:12

by Daniel Vetter

Subject: Re: [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung

On Thu, Oct 25, 2012 at 7:22 AM, Justin P. Mattock
<[email protected]> wrote:
>
> here is a link to the file..: intel_error_decode
> http://www.filefactory.com/file/22bypyjhs4mx

I haven't figured out how to access this thing. Can you please file a
bug report on bugs.freedesktop.org and attach it there?

Thanks, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2012-10-25 08:48:05

by Chris Wilson

Subject: Re: [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung

On Thu, 25 Oct 2012 10:16:08 +0200, Daniel Vetter <[email protected]> wrote:
> On Thu, Oct 25, 2012 at 7:22 AM, Justin P. Mattock
> <[email protected]> wrote:
> >
> > here is a link to the file..: intel_error_decode
> > http://www.filefactory.com/file/22bypyjhs4mx
>
> I haven't figured out how to access this thing. Can you please file a
> bug report on bugs.freedesktop.org and attach it there?

No worries, it is another ILK hang similar to the ones reported earlier
- it just seems the ring stops advancing. Hopefully it is a missing w/a
from http://cgit.freedesktop.org/~danvet/drm/log/?h=ilk-wa-pile
-Chris

--
Chris Wilson, Intel Open Source Technology Centre

2012-10-26 04:44:23

by Justin P. Mattock

Subject: Re: [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung

On 10/25/2012 01:47 AM, Chris Wilson wrote:
> On Thu, 25 Oct 2012 10:16:08 +0200, Daniel Vetter <[email protected]> wrote:
>> On Thu, Oct 25, 2012 at 7:22 AM, Justin P. Mattock
>> <[email protected]> wrote:
>>>
>>> here is a link to the file..: intel_error_decode
>>> http://www.filefactory.com/file/22bypyjhs4mx
>>
>> I haven't figured out how to access this thing. Can you please file a
>> bug report on bugs.freedesktop.org and attach it there?

Oops.. I filed it with the kernel bugzilla. Maybe you can just add a CC:
https://bugzilla.kernel.org/show_bug.cgi?id=49571

>
> No worries, it is another ILK hang similar to the ones reported earlier
> - it just seems the ring stops advancing. Hopefully it is a missing w/a
> from http://cgit.freedesktop.org/~danvet/drm/log/?h=ilk-wa-pile
> -Chris
>

well, if this means building libdrm etc., then that's not a problem, just
more time consuming if anything. Perhaps there is an *.rpm that I can test?


Justin P. Mattock

2012-10-26 08:05:15

by Daniel Vetter

Subject: Re: [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung

On Fri, Oct 26, 2012 at 6:43 AM, Justin P. Mattock
<[email protected]> wrote:
>>
>> No worries, it is another ILK hang similar to the ones reported earlier
>> - it just seems the ring stops advancing. Hopefully it is a missing w/a
>> from http://cgit.freedesktop.org/~danvet/drm/log/?h=ilk-wa-pile
>> -Chris
>>
>
> well if this means building libdrm etc.. then that's not a problem, more time
> consuming if anything. perhaps an *.rpm that I can test to see?

It's not libdrm, the above is just a kernel git tree with a bunch of
ironlake workarounds.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2012-10-26 17:36:09

by Justin P. Mattock

Subject: Re: [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung

On 10/26/2012 01:05 AM, Daniel Vetter wrote:
> On Fri, Oct 26, 2012 at 6:43 AM, Justin P. Mattock
> <[email protected]> wrote:
>>>
>>> No worries, it is another ILK hang similar to the ones reported earlier
>>> - it just seems the ring stops advancing. Hopefully it is a missing w/a
>>> from http://cgit.freedesktop.org/~danvet/drm/log/?h=ilk-wa-pile
>>> -Chris
>>>
>>
>> well if this means building libdrm etc.. then that's not a problem, more time
>> consuming if anything. perhaps an *.rpm that I can test to see?
>
> It's not libdrm, the above is just a kernel git tree with a bunch of
> ironlake workarounds.
> -Daniel
>


Hmm.. in that case maybe I should pull and run that kernel to see
whether the crash still occurs before bisecting (if at all).
Will do once I get time to download it.

Justin P. Mattock

2012-10-26 20:57:47

by Justin P. Mattock

Subject: Re: [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung

On 10/26/2012 01:05 AM, Daniel Vetter wrote:
> On Fri, Oct 26, 2012 at 6:43 AM, Justin P. Mattock
> <[email protected]> wrote:
>>>
>>> No worries, it is another ILK hang similar to the ones reported earlier
>>> - it just seems the ring stops advancing. Hopefully it is a missing w/a
>>> from http://cgit.freedesktop.org/~danvet/drm/log/?h=ilk-wa-pile
>>> -Chris
>>>
>>
>> well if this means building libdrm etc.. then that's not a problem, more time
>> consuming if anything. perhaps an *.rpm that I can test to see?
>
> It's not libdrm, the above is just a kernel git tree with a bunch of
> ironlake workarounds.
> -Daniel
>


nice..

:~/drm> git clone git://people.freedesktop.org/~danvet/drm
Cloning into 'drm'...
remote: Counting objects: 2728390, done.
remote: Compressing objects: 100% (418606/418606), done.
remote: Total 2728390 (delta 2293727), reused 2717443 (delta 2282880)
Receiving objects: 100% (2728390/2728390), 637.95 MiB | 599 KiB/s, done.
Resolving deltas: 100% (2293727/2293727), done.
warning: remote HEAD refers to nonexistent ref, unable to checkout.


so now I have to go on a witch hunt for 600MB's in my system.

Justin P. Mattock

2012-10-27 13:56:45

by Daniel Vetter

Subject: Re: [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung

On Fri, Oct 26, 2012 at 10:57 PM, Justin P. Mattock
<[email protected]> wrote:
>
> :~/drm> git clone git://people.freedesktop.org/~danvet/drm
> Cloning into 'drm'...
> remote: Counting objects: 2728390, done.
> remote: Compressing objects: 100% (418606/418606), done.
> remote: Total 2728390 (delta 2293727), reused 2717443 (delta 2282880)
> Receiving objects: 100% (2728390/2728390), 637.95 MiB | 599 KiB/s, done.
> Resolving deltas: 100% (2293727/2293727), done.
> warning: remote HEAD refers to nonexistent ref, unable to checkout.
>
>
> so now I have to go on a witch hunt for 600MB's in my system.

$ git checkout origin/ilk-wa-pile

... and you have the right branch checked out. No need for pitchforks
and witch hunts ;-)
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
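
A small sketch of the same fix as a reusable step. After a clone that ends with "remote HEAD refers to nonexistent ref", the objects are already on disk and only a branch has to be checked out. The helper name checkout_remote_branch is hypothetical. Unlike the plain `git checkout origin/ilk-wa-pile` shown above, it creates a local tracking branch rather than a detached HEAD, which is a choice made here for convenience and not something the thread asks for.

```shell
#!/bin/sh
# Sketch: check out a fetched remote branch as a local tracking branch.
# checkout_remote_branch is a hypothetical helper name.

checkout_remote_branch() {
    repo="$1"                # path to the existing clone
    branch="$2"              # e.g. ilk-wa-pile
    remote="${3:-origin}"    # remote name, "origin" by default
    # Create a local branch that tracks the remote one.
    git -C "$repo" checkout -b "$branch" --track "$remote/$branch"
}

# Usage against the clone from the transcript above:
#   git -C drm branch -r                     # list the branches the clone fetched
#   checkout_remote_branch drm ilk-wa-pile
```

With a tracking branch in place, a later `git pull` in that tree picks up new workaround patches pushed to the same branch.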

2012-10-27 19:11:58

by Justin P. Mattock

Subject: Re: [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung

On 10/27/2012 06:56 AM, Daniel Vetter wrote:
> On Fri, Oct 26, 2012 at 10:57 PM, Justin P. Mattock
> <[email protected]> wrote:
>>
>> :~/drm> git clone git://people.freedesktop.org/~danvet/drm
>> Cloning into 'drm'...
>> remote: Counting objects: 2728390, done.
>> remote: Compressing objects: 100% (418606/418606), done.
>> remote: Total 2728390 (delta 2293727), reused 2717443 (delta 2282880)
>> Receiving objects: 100% (2728390/2728390), 637.95 MiB | 599 KiB/s, done.
>> Resolving deltas: 100% (2293727/2293727), done.
>> warning: remote HEAD refers to nonexistent ref, unable to checkout.
>>
>>
>> so now I have to go on a witch hunt for 600MB's in my system.
>
> $ git checkout origin/ilk-wa-pile

cool, thanks.. (not so good at git over here).
>
> ... and you have the right branch checked out. No need for pitchforks
> and witch hunts ;-)
> -Daniel
>


alright.. putting the pitchfork away for now.

Justin P. Mattock