2012-10-05 23:42:32

by Willy Tarreau

Subject: 3.5 regression on i915

Chris, Daniel,

Since version 3.5, my Asus EeePC 1005HA hits a kernel BUG during startx. I
didn't have the time to investigate until this evening.

I bisected and found that the following commit, merged in 3.5-rc1, is
responsible for the bug, which can be triggered reliably:

1b50247a8ddde4af5aaa0e6bc125615372ce6c16 is the first bad commit
commit 1b50247a8ddde4af5aaa0e6bc125615372ce6c16
Author: Chris Wilson <[email protected]>
Date: Tue Apr 24 15:47:30 2012 +0100

drm/i915: Remove the list of pinned inactive objects

Simplify object tracking by removing the inactive but pinned list. The
only place where this was used is for counting the available memory,
which is just as easy performed by checking all objects on the rare
occasions it is required (application startup). For ease of debugging,
we keep the reporting of pinned objects through the error-state and
debugfs.

Signed-off-by: Chris Wilson <[email protected]>
Signed-off-by: Daniel Vetter <[email protected]>

I tried to revert it from 3.5.6-rc1, but it does not revert cleanly at all,
and I'm too unfamiliar with this code to attempt anything sane at this
time of night.

The crash happens here in i915_gem_entervt_ioctl() :

3659 BUG_ON(!list_empty(&dev_priv->mm.active_list));
3660 BUG_ON(!list_empty(&dev_priv->mm.flushing_list));
-> 3661 BUG_ON(!list_empty(&dev_priv->mm.inactive_list));
3662 mutex_unlock(&dev->struct_mutex);

More info in the trace below :

------------[ cut here ]------------
kernel BUG at drivers/gpu/drm/i915/i915_gem.c:3661!
invalid opcode: 0000 [#1] SMP
Modules linked in: snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss uvcvideo videobuf2_core videodev videobuf2_vmalloc videobuf2_memops uhci_hcd ath9k mac80211 snd_hda_codec_realtek ath9k_common microcode ath9k_hw psmouse serio_raw sg ath cfg80211 atl1c lpc_ich mfd_core ehci_hcd snd_hda_intel snd_hda_codec snd_hwdep snd_pcm rtc_cmos snd_timer snd evdev eeepc_laptop snd_page_alloc sparse_keymap

Pid: 2866, comm: X Not tainted 3.5.6-rc1-eeepc #1 ASUSTeK Computer INC. 1005HA/1005HA
EIP: 0060:[<c12dc291>] EFLAGS: 00013297 CPU: 0
EIP is at i915_gem_entervt_ioctl+0xf1/0x110
EAX: f5941df4 EBX: f5940000 ECX: 00000000 EDX: 00020000
ESI: f5835400 EDI: 00000000 EBP: f51d7e38 ESP: f51d7e20
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 8005003b CR2: b760e0a0 CR3: 351b6000 CR4: 000007d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
Process X (pid: 2866, ti=f51d6000 task=f61af8d0 task.ti=f51d6000)
Stack:
00000001 00000000 f5835414 f51d7e84 f5835400 f54f85c0 f51d7f10 c12b530b
00000001 c151b139 c14751b6 c152e030 00000b32 00006459 00000059 0000e200
00000001 00000000 00006459 c159ddd0 c12dc1a0 ffffffea 00000000 00000000
Call Trace:
[<c12b530b>] drm_ioctl+0x2eb/0x440
[<c12dc1a0>] ? i915_gem_init+0xe0/0xe0
[<c1052b2b>] ? enqueue_hrtimer+0x1b/0x50
[<c1053321>] ? __hrtimer_start_range_ns+0x161/0x330
[<c10530b3>] ? lock_hrtimer_base+0x23/0x50
[<c1053163>] ? hrtimer_try_to_cancel+0x33/0x70
[<c12b5020>] ? drm_version+0x90/0x90
[<c10ca171>] vfs_ioctl+0x31/0x50
[<c10ca2e4>] do_vfs_ioctl+0x64/0x510
[<c10535de>] ? hrtimer_nanosleep+0x8e/0x100
[<c1052c20>] ? update_rmtp+0x80/0x80
[<c10ca7c9>] sys_ioctl+0x39/0x60
[<c1433949>] syscall_call+0x7/0xb
Code: 83 c4 0c 5b 5e 5f 5d c3 c7 44 24 04 2c 05 53 c1 c7 04 24 6f ef 47 c1 e8 6e e0 fd ff c7 83 38 1e 00 00 00 00 00 00 e9 3f ff ff ff <0f> 0b eb fe 0f 0b eb fe 8d b4 26 00 00 00 00 0f 0b eb fe 8d b6
EIP: [<c12dc291>] i915_gem_entervt_ioctl+0xf1/0x110 SS:ESP 0068:f51d7e20
---[ end trace dd332ec083cbd513 ]---

I have the full dmesg if that can help. I do not have KMS enabled and have
not tested 3.6-* yet.

$ grep I915 .config
CONFIG_DRM_I915=y
# CONFIG_DRM_I915_KMS is not set

Please tell me if you want more info (config, full dmesg, etc.; I didn't
want to pollute the list) or if you want me to test any patch. I'm willing
to help get this issue fixed.

Thanks,
Willy


2012-10-05 23:49:00

by Dave Airlie

Subject: Re: 3.5 regression on i915

On Sat, Oct 6, 2012 at 9:42 AM, Willy Tarreau <[email protected]> wrote:
> Chris, Daniel,
>
> since version 3.5, my Asus EeePC 1005HA bugs during startx. I didn't
> have the time to investigate until this evening.
>
> I could bisect the commits and found that the following one was merged
> in 3.5-rc1 and is responsible for these bugs that can reliably be
> triggered :
>
> 1b50247a8ddde4af5aaa0e6bc125615372ce6c16 is the first bad commit
> commit 1b50247a8ddde4af5aaa0e6bc125615372ce6c16
> Author: Chris Wilson <[email protected]>
> Date: Tue Apr 24 15:47:30 2012 +0100
>
> drm/i915: Remove the list of pinned inactive objects
>
> Simplify object tracking by removing the inactive but pinned list. The
> only place where this was used is for counting the available memory,
> which is just as easy performed by checking all objects on the rare
> occasions it is required (application startup). For ease of debugging,
> we keep the reporting of pinned objects through the error-state and
> debugfs.
>
> Signed-off-by: Chris Wilson <[email protected]>
> Signed-off-by: Daniel Vetter <[email protected]>
>
> I tried to revert it from 3.5.6-rc1 but it does not revert cleanly at all
> and I'm totall unfamiliar with this code to attempt anything sane at this
> time of the night.
>
> The crash happens here in i915_gem_entervt_ioctl() :
>
> 3659 BUG_ON(!list_empty(&dev_priv->mm.active_list));
> 3660 BUG_ON(!list_empty(&dev_priv->mm.flushing_list));
> -> 3661 BUG_ON(!list_empty(&dev_priv->mm.inactive_list));
> 3662 mutex_unlock(&dev->struct_mutex);
>
> More info in the trace below :
>
> ------------[ cut here ]------------
> kernel BUG at drivers/gpu/drm/i915/i915_gem.c:3661!
> invalid opcode: 0000 [#1] SMP
> Modules linked in: snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss uvcvideo videobuf2_core videodev videobuf2_vmalloc videobuf2_memops uhci_hcd ath9k mac80211 snd_hda_codec_realtek ath9k_common microcode ath9k_hw psmouse serio_raw sg ath cfg80211 atl1c lpc_ich mfd_core ehci_hcd snd_hda_intel snd_hda_codec snd_hwdep snd_pcm rtc_cmos snd_timer snd evdev eeepc_laptop snd_page_alloc sparse_keymap
>
> Pid: 2866, comm: X Not tainted 3.5.6-rc1-eeepc #1 ASUSTeK Computer INC. 1005HA/1005HA
> EIP: 0060:[<c12dc291>] EFLAGS: 00013297 CPU: 0
> EIP is at i915_gem_entervt_ioctl+0xf1/0x110
> EAX: f5941df4 EBX: f5940000 ECX: 00000000 EDX: 00020000
> ESI: f5835400 EDI: 00000000 EBP: f51d7e38 ESP: f51d7e20
> DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> CR0: 8005003b CR2: b760e0a0 CR3: 351b6000 CR4: 000007d0
> DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> DR6: ffff0ff0 DR7: 00000400
> Process X (pid: 2866, ti=f51d6000 task=f61af8d0 task.ti=f51d6000)
> Stack:
> 00000001 00000000 f5835414 f51d7e84 f5835400 f54f85c0 f51d7f10 c12b530b
> 00000001 c151b139 c14751b6 c152e030 00000b32 00006459 00000059 0000e200
> 00000001 00000000 00006459 c159ddd0 c12dc1a0 ffffffea 00000000 00000000
> Call Trace:
> [<c12b530b>] drm_ioctl+0x2eb/0x440
> [<c12dc1a0>] ? i915_gem_init+0xe0/0xe0
> [<c1052b2b>] ? enqueue_hrtimer+0x1b/0x50
> [<c1053321>] ? __hrtimer_start_range_ns+0x161/0x330
> [<c10530b3>] ? lock_hrtimer_base+0x23/0x50
> [<c1053163>] ? hrtimer_try_to_cancel+0x33/0x70
> [<c12b5020>] ? drm_version+0x90/0x90
> [<c10ca171>] vfs_ioctl+0x31/0x50
> [<c10ca2e4>] do_vfs_ioctl+0x64/0x510
> [<c10535de>] ? hrtimer_nanosleep+0x8e/0x100
> [<c1052c20>] ? update_rmtp+0x80/0x80
> [<c10ca7c9>] sys_ioctl+0x39/0x60
> [<c1433949>] syscall_call+0x7/0xb
> Code: 83 c4 0c 5b 5e 5f 5d c3 c7 44 24 04 2c 05 53 c1 c7 04 24 6f ef 47 c1 e8 6e e0 fd ff c7 83 38 1e 00 00 00 00 00 00 e9 3f ff ff ff <0f> 0b eb fe 0f 0b eb fe 8d b4 26 00 00 00 00 0f 0b eb fe 8d b6
> EIP: [<c12dc291>] i915_gem_entervt_ioctl+0xf1/0x110 SS:ESP 0068:f51d7e20
> ---[ end trace dd332ec083cbd513 ]---
>
> I have the full dmesg if that can help. I do not have KMS and have not
> tested 3.6-* yet.
>
> $ grep I915 .config
> CONFIG_DRM_I915=y
> # CONFIG_DRM_I915_KMS is not set

Any reason you don't have KMS? You'll keep hitting these non-KMS bugs,
since it has no users anymore really.

Granted, they'll get fixed, but I suspect it's a losing battle over time.

Dave.

2012-10-05 23:58:57

by Willy Tarreau

Subject: Re: 3.5 regression on i915

On Sat, Oct 06, 2012 at 09:48:57AM +1000, Dave Airlie wrote:
> Any reason you don't have KMS, you'll keep hitting these non-kms bugs
> since it has no users anymore really.
>
> Granted they'll get fixed, but I suspect its a losing battle over time.

Well, back in the old days, every time I tried to enable it I only ran into
problems, so I got used to disabling it, which made sense since it didn't
bring me any benefit.

I can retry with it if needed, but if we consider that it is necessary,
then we should probably not allow it to be disabled anymore, since my
working 3.4 config fails on 3.5 with "make oldconfig".

Willy

2012-10-06 07:27:19

by Willy Tarreau

Subject: Re: 3.5 regression on i915

On Sat, Oct 06, 2012 at 01:58:45AM +0200, Willy Tarreau wrote:
> On Sat, Oct 06, 2012 at 09:48:57AM +1000, Dave Airlie wrote:
> > Any reason you don't have KMS, you'll keep hitting these non-kms bugs
> > since it has no users anymore really.
> >
> > Granted they'll get fixed, but I suspect its a losing battle over time.
>
> Well, back in old times every time I tried to enable it, I only ran into
> problems so I got used to disable it, which made sense since it didn't
> bring me any benefit.

OK I found why in the end. When I enable KMS, my X server fails to start
and segfaults in intel_drv.so. So I think that the comment in Kconfig
below is quite appropriate :

config DRM_I915_KMS
	bool "Enable modesetting on intel by default"
	depends on DRM_I915
	help
	  Choose this option if you want kernel modesetting enabled by default,
	  and you have a new enough userspace to support this. Running old
	  userspaces with this enabled will cause pain. Note that this causes
	  the driver to bind to PCI devices, which precludes loading things
	  like intelfb.

Regards,
Willy

2012-10-06 07:43:58

by Dave Airlie

Subject: Re: 3.5 regression on i915

On Sat, Oct 6, 2012 at 3:27 AM, Willy Tarreau <[email protected]> wrote:
> On Sat, Oct 06, 2012 at 01:58:45AM +0200, Willy Tarreau wrote:
>> On Sat, Oct 06, 2012 at 09:48:57AM +1000, Dave Airlie wrote:
>> > Any reason you don't have KMS, you'll keep hitting these non-kms bugs
>> > since it has no users anymore really.
>> >
>> > Granted they'll get fixed, but I suspect its a losing battle over time.
>>
>> Well, back in old times every time I tried to enable it, I only ran into
>> problems so I got used to disable it, which made sense since it didn't
>> bring me any benefit.
>
> OK I found why in the end. When I enable KMS, my X server fails to start
> and segfaults in intel_drv.so. So I think that the comment in Kconfig
> below is quite appropriate :

Okay, are you running a really old userspace? Just wondering what could
cause this.

Dave.

2012-10-06 08:00:12

by Willy Tarreau

Subject: Re: 3.5 regression on i915

On Sat, Oct 06, 2012 at 03:43:56AM -0400, Dave Airlie wrote:
> On Sat, Oct 6, 2012 at 3:27 AM, Willy Tarreau <[email protected]> wrote:
> > On Sat, Oct 06, 2012 at 01:58:45AM +0200, Willy Tarreau wrote:
> >> On Sat, Oct 06, 2012 at 09:48:57AM +1000, Dave Airlie wrote:
> >> > Any reason you don't have KMS, you'll keep hitting these non-kms bugs
> >> > since it has no users anymore really.
> >> >
> >> > Granted they'll get fixed, but I suspect its a losing battle over time.
> >>
> >> Well, back in old times every time I tried to enable it, I only ran into
> >> problems so I got used to disable it, which made sense since it didn't
> >> bring me any benefit.
> >
> > OK I found why in the end. When I enable KMS, my X server fails to start
> > and segfaults in intel_drv.so. So I think that the comment in Kconfig
> > below is quite appropriate :
>
> Okay are you running a really old userspace? just wondering what could
> cause this.

Yes, my Xorg is 1.4.2 and xf86-video-intel is 2.7.1.

I have additional information: I retested 3.4.12 with and without KMS:

3.4.12, no KMS : X works fine
3.4.12, KMS    : X works fine, but some garbage follows the mouse pointer
                 on the external display
3.5.x, no KMS  : kernel BUG
3.5.x, KMS     : X server segfaults in intel driver

So in fact, both the KMS and non-KMS configurations have regressed in 3.5
on this setup.

It is possible that the commit which removed the list of pinned inactive
objects (1b50247a) has uncovered another long-time bug.

Regards,
Willy

2012-10-06 08:04:54

by Chris Wilson

Subject: Re: 3.5 regression on i915

On Sat, 6 Oct 2012 01:42:18 +0200, Willy Tarreau <[email protected]> wrote:
> Chris, Daniel,
>
> since version 3.5, my Asus EeePC 1005HA bugs during startx. I didn't
> have the time to investigate until this evening.
>
> I could bisect the commits and found that the following one was merged
> in 3.5-rc1 and is responsible for these bugs that can reliably be
> triggered :
>
> 1b50247a8ddde4af5aaa0e6bc125615372ce6c16 is the first bad commit
> commit 1b50247a8ddde4af5aaa0e6bc125615372ce6c16
> Author: Chris Wilson <[email protected]>
> Date: Tue Apr 24 15:47:30 2012 +0100
>
> drm/i915: Remove the list of pinned inactive objects
>
> Simplify object tracking by removing the inactive but pinned list. The
> only place where this was used is for counting the available memory,
> which is just as easy performed by checking all objects on the rare
> occasions it is required (application startup). For ease of debugging,
> we keep the reporting of pinned objects through the error-state and
> debugfs.
>
> Signed-off-by: Chris Wilson <[email protected]>
> Signed-off-by: Daniel Vetter <[email protected]>
>
> I tried to revert it from 3.5.6-rc1 but it does not revert cleanly at all
> and I'm totall unfamiliar with this code to attempt anything sane at this
> time of the night.
>
> The crash happens here in i915_gem_entervt_ioctl() :
>
> 3659 BUG_ON(!list_empty(&dev_priv->mm.active_list));
> 3660 BUG_ON(!list_empty(&dev_priv->mm.flushing_list));
> -> 3661 BUG_ON(!list_empty(&dev_priv->mm.inactive_list));
> 3662 mutex_unlock(&dev->struct_mutex);

That BUG_ON there is silly and can simply be removed. The check is to
verify that no batches were submitted to the kernel whilst the UMS/GEM
client was suspended - to which the BUG_ONs are a crude approximation.
Furthermore, the checks are too late, since it means we attempted to
program the hardware whilst it was in an invalid state, the BUG_ONs are
the least of your concerns at that point.
-Chris

--
Chris Wilson, Intel Open Source Technology Centre
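
The check Chris describes can be modelled outside the kernel. What follows is
a minimal userspace sketch in C, not the i915 code: struct list_head,
list_empty() and list_add() are re-implemented here in simplified form, and
struct mm_lists, mm_lists_init() and check_lists_on_entervt() are hypothetical
names introduced only for illustration. The sketch reports a busy list with an
error code (-EBUSY is an assumption) instead of calling BUG_ON(), and it leaves
out the inactive_list check, as suggested above.

```c
/* Userspace model only; none of this is taken from i915_gem.c. */
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified re-implementation of the kernel's circular list head. */
struct list_head {
	struct list_head *next, *prev;
};

void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* A circular list is empty when the head points back at itself. */
int list_empty(const struct list_head *head)
{
	return head->next == head;
}

/* Insert entry right after head. */
void list_add(struct list_head *entry, struct list_head *head)
{
	assert(entry != NULL && head != NULL);
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Hypothetical stand-in for the three lists named in the BUG_ON()s. */
struct mm_lists {
	struct list_head active_list;
	struct list_head flushing_list;
	struct list_head inactive_list;
};

void mm_lists_init(struct mm_lists *mm)
{
	init_list_head(&mm->active_list);
	init_list_head(&mm->flushing_list);
	init_list_head(&mm->inactive_list);
}

/*
 * Returns 0 when the active and flushing lists are empty, -EBUSY otherwise.
 * inactive_list is deliberately not checked: objects may legitimately
 * remain on it.
 */
int check_lists_on_entervt(const struct mm_lists *mm)
{
	if (!list_empty(&mm->active_list))
		return -EBUSY;
	if (!list_empty(&mm->flushing_list))
		return -EBUSY;
	return 0;
}
```

In this model, an object left on inactive_list does not change the result of
check_lists_on_entervt(), while an object on active_list makes it return
-EBUSY.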

2012-10-06 08:20:35

by Willy Tarreau

Subject: Re: 3.5 regression on i915

Hi Chris,

On Sat, Oct 06, 2012 at 09:04:34AM +0100, Chris Wilson wrote:
> > The crash happens here in i915_gem_entervt_ioctl() :
> >
> > 3659 BUG_ON(!list_empty(&dev_priv->mm.active_list));
> > 3660 BUG_ON(!list_empty(&dev_priv->mm.flushing_list));
> > -> 3661 BUG_ON(!list_empty(&dev_priv->mm.inactive_list));
> > 3662 mutex_unlock(&dev->struct_mutex);
>
> That BUG_ON there is silly and can simply be removed. The check is to
> verify that no batches were submitted to the kernel whilst the UMS/GEM
> client was suspended - to which the BUG_ONs are a crude approximation.
> Furthermore, the checks are too late, since it means we attempted to
> program the hardware whilst it was in an invalid state, the BUG_ONs are
> the least of your concerns at that point.

Excellent, that fixed it! X still segfaults when KMS is used, but
I expect that to be a pure user-space issue, since there is nothing
in dmesg.

Would one of you accept the following patch and tag it for -stable?

Thanks,
Willy

---

From 3450cb7b7bd0b8fe1eab59d09e6852c4e3b22001 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <[email protected]>
Date: Sat, 6 Oct 2012 10:09:00 +0200
Subject: drm/i915: remove useless BUG_ON which caused a regression in 3.5.

Starting an old X server causes a kernel BUG since commit 1b50247a8d:

------------[ cut here ]------------
kernel BUG at drivers/gpu/drm/i915/i915_gem.c:3661!
invalid opcode: 0000 [#1] SMP
Modules linked in: snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss uvcvideo videobuf2_core videodev videobuf2_vmalloc videobuf2_memops uhci_hcd ath9k mac80211 snd_hda_codec_realtek ath9k_common microcode ath9k_hw psmouse serio_raw sg ath cfg80211 atl1c lpc_ich mfd_core ehci_hcd snd_hda_intel snd_hda_codec snd_hwdep snd_pcm rtc_cmos snd_timer snd evdev eeepc_laptop snd_page_alloc sparse_keymap

Pid: 2866, comm: X Not tainted 3.5.6-rc1-eeepc #1 ASUSTeK Computer INC. 1005HA/1005HA
EIP: 0060:[<c12dc291>] EFLAGS: 00013297 CPU: 0
EIP is at i915_gem_entervt_ioctl+0xf1/0x110
EAX: f5941df4 EBX: f5940000 ECX: 00000000 EDX: 00020000
ESI: f5835400 EDI: 00000000 EBP: f51d7e38 ESP: f51d7e20
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 8005003b CR2: b760e0a0 CR3: 351b6000 CR4: 000007d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
Process X (pid: 2866, ti=f51d6000 task=f61af8d0 task.ti=f51d6000)
Stack:
00000001 00000000 f5835414 f51d7e84 f5835400 f54f85c0 f51d7f10 c12b530b
00000001 c151b139 c14751b6 c152e030 00000b32 00006459 00000059 0000e200
00000001 00000000 00006459 c159ddd0 c12dc1a0 ffffffea 00000000 00000000
Call Trace:
[<c12b530b>] drm_ioctl+0x2eb/0x440
[<c12dc1a0>] ? i915_gem_init+0xe0/0xe0
[<c1052b2b>] ? enqueue_hrtimer+0x1b/0x50
[<c1053321>] ? __hrtimer_start_range_ns+0x161/0x330
[<c10530b3>] ? lock_hrtimer_base+0x23/0x50
[<c1053163>] ? hrtimer_try_to_cancel+0x33/0x70
[<c12b5020>] ? drm_version+0x90/0x90
[<c10ca171>] vfs_ioctl+0x31/0x50
[<c10ca2e4>] do_vfs_ioctl+0x64/0x510
[<c10535de>] ? hrtimer_nanosleep+0x8e/0x100
[<c1052c20>] ? update_rmtp+0x80/0x80
[<c10ca7c9>] sys_ioctl+0x39/0x60
[<c1433949>] syscall_call+0x7/0xb
Code: 83 c4 0c 5b 5e 5f 5d c3 c7 44 24 04 2c 05 53 c1 c7 04 24 6f ef 47 c1 e8 6e e0 fd ff c7 83 38 1e 00 00 00 00 00 00 e9 3f ff ff ff <0f> 0b eb fe 0f 0b eb fe 8d b4 26 00 00 00 00 0f 0b eb fe 8d b6
EIP: [<c12dc291>] i915_gem_entervt_ioctl+0xf1/0x110 SS:ESP 0068:f51d7e20
---[ end trace dd332ec083cbd513 ]---

The crash happens here in i915_gem_entervt_ioctl() :

3659 BUG_ON(!list_empty(&dev_priv->mm.active_list));
3660 BUG_ON(!list_empty(&dev_priv->mm.flushing_list));
-> 3661 BUG_ON(!list_empty(&dev_priv->mm.inactive_list));
3662 mutex_unlock(&dev->struct_mutex);

Quoting Chris :
"That BUG_ON there is silly and can simply be removed. The check is to
verify that no batches were submitted to the kernel whilst the UMS/GEM
client was suspended - to which the BUG_ONs are a crude approximation.
Furthermore, the checks are too late, since it means we attempted to
program the hardware whilst it was in an invalid state, the BUG_ONs are
the least of your concerns at that point."

Cc: Chris Wilson <[email protected]>
Signed-off-by: Willy Tarreau <[email protected]>
---
drivers/gpu/drm/i915/i915_gem.c | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 35926ad..fc6683a 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3658,7 +3658,6 @@ i915_gem_entervt_ioctl(struct drm_device *dev, void *data,

 	BUG_ON(!list_empty(&dev_priv->mm.active_list));
 	BUG_ON(!list_empty(&dev_priv->mm.flushing_list));
-	BUG_ON(!list_empty(&dev_priv->mm.inactive_list));
 	mutex_unlock(&dev->struct_mutex);

 	ret = drm_irq_install(dev);
--
1.7.2.1.45.g54fbc

2012-10-06 08:24:45

by Chris Wilson

Subject: Re: 3.5 regression on i915

On Sat, 6 Oct 2012 09:59:56 +0200, Willy Tarreau <[email protected]> wrote:
> On Sat, Oct 06, 2012 at 03:43:56AM -0400, Dave Airlie wrote:
> > On Sat, Oct 6, 2012 at 3:27 AM, Willy Tarreau <[email protected]> wrote:
> > > On Sat, Oct 06, 2012 at 01:58:45AM +0200, Willy Tarreau wrote:
> > >> On Sat, Oct 06, 2012 at 09:48:57AM +1000, Dave Airlie wrote:
> > >> > Any reason you don't have KMS, you'll keep hitting these non-kms bugs
> > >> > since it has no users anymore really.
> > >> >
> > >> > Granted they'll get fixed, but I suspect its a losing battle over time.
> > >>
> > >> Well, back in old times every time I tried to enable it, I only ran into
> > >> problems so I got used to disable it, which made sense since it didn't
> > >> bring me any benefit.
> > >
> > > OK I found why in the end. When I enable KMS, my X server fails to start
> > > and segfaults in intel_drv.so. So I think that the comment in Kconfig
> > > below is quite appropriate :
> >
> > Okay are you running a really old userspace? just wondering what could
> > cause this.
>
> yes, my Xorg is 1.4.2 and xf86-video-intel is 2.7.1.
>
> I have additional information, I retested with 3.4.12 with and without KMS :
>
> 3.4.12, no KMS : X works fine
> 3.4.12, KMS : X works fine but some garbage follows the mouse pointer
> on the external display
> 3.5.x no KMS : kernel BUG
> 3.5.x KMS : X server segfaults in intel driver
>
> So in fact, both the KMS/non-KMS confs have regressed in 3.5 on this setup.
>
> It is possible that the commit which removed the list of pinned inactive
> objects (1b50247a) has uncovered another long-time bug.

More likely X is segfaulting for another reason altogether. Can you
please attach the stacktrace (with symbols!) and see if another
bisection is required?
-Chris

--
Chris Wilson, Intel Open Source Technology Centre

2012-10-06 08:43:17

by Willy Tarreau

Subject: Re: 3.5 regression on i915

On Sat, Oct 06, 2012 at 09:24:35AM +0100, Chris Wilson wrote:
> More likely X is segfaulting for another reason altogether. Can you
> please attach the stacktrace (with symbols!) and see if another
> bisection is required?

Yes, here it is.

_XSERVTransSocketOpenCOTSServer: Unable to open socket for inet6
_XSERVTransOpen: transport open failed for inet6/eeepc:0
_XSERVTransMakeAllCOTSServerListeners: failed to open listener for inet6

X.Org X Server 1.4.2
Release Date: 11 June 2008
X Protocol Version 11, Revision 0
Build Operating System: Slackware 12.1 Slackware Linux Project
Current Operating System: Linux eeepc 3.5.6-rc1+ #4 SMP Sat Oct 6 10:13:58 CEST 2012 i686
Build Date: 30 June 2008 11:35:29PM

Before reporting problems, check http://wiki.x.org
to make sure that you have the latest version.
Module Loader present
Markers: (--) probed, (**) from config file, (==) default setting,
(++) from command line, (!!) notice, (II) informational,
(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.0.log", Time: Sat Oct 6 10:42:43 2012
(==) Using config file: "/etc/X11/xorg.conf"
(II) Module "ramdac" already built-in
(II) Module "ddc" already built-in
(II) Module "i2c" already built-in
(EE) intel(0): Failed to initialize kernel memory manager

Backtrace:
0: X(xf86SigHandler+0x7e) [0x80d8b5e]
1: [0xffffe400]
2: /usr/lib/xorg/modules/drivers//intel_drv.so(i830_allocator_init+0x332) [0xb73756e2]
3: /usr/lib/xorg/modules/drivers//intel_drv.so [0xb736de51]
4: X(AddScreen+0x1fc) [0x806d42c]
5: X(InitOutput+0x21e) [0x80a1b7e]
6: X(main+0x296) [0x806dbb6]
7: /lib/libc.so.6(__libc_start_main+0xe0) [0xb7581390]
8: X(FontFileCompleteXLFD+0x20d) [0x806d121]

Fatal server error:
Caught signal 11. Server aborting

giving up.
xinit: Connection reset by peer (errno 104): unable to connect to X server
xinit: No such process (errno 3): Server error.

Willy

2012-10-06 08:43:28

by Chris Wilson

Subject: Re: 3.5 regression on i915

On Sat, 6 Oct 2012 10:20:16 +0200, Willy Tarreau <[email protected]> wrote:
> Hi Chris,
>
> On Sat, Oct 06, 2012 at 09:04:34AM +0100, Chris Wilson wrote:
> > > The crash happens here in i915_gem_entervt_ioctl() :
> > >
> > > 3659 BUG_ON(!list_empty(&dev_priv->mm.active_list));
> > > 3660 BUG_ON(!list_empty(&dev_priv->mm.flushing_list));
> > > -> 3661 BUG_ON(!list_empty(&dev_priv->mm.inactive_list));
> > > 3662 mutex_unlock(&dev->struct_mutex);
> >
> > That BUG_ON there is silly and can simply be removed. The check is to
> > verify that no batches were submitted to the kernel whilst the UMS/GEM
> > client was suspended - to which the BUG_ONs are a crude approximation.
> > Furthermore, the checks are too late, since it means we attempted to
> > program the hardware whilst it was in an invalid state, the BUG_ONs are
> > the least of your concerns at that point.
>
> Excellent, that fixed it ! X still segfaults when KMS is used, but
> I expect more of a pure user-space issue here since there is nothing
> in dmesg.
>
> Would some of you accept the following patch and tag it for -stable ?

Reviewed-by: Chris Wilson <[email protected]>
-Chris

--
Chris Wilson, Intel Open Source Technology Centre

2012-10-06 09:04:10

by Chris Wilson

Subject: Re: 3.5 regression on i915

On Sat, 6 Oct 2012 10:42:58 +0200, Willy Tarreau <[email protected]> wrote:
> On Sat, Oct 06, 2012 at 09:24:35AM +0100, Chris Wilson wrote:
> > More likely X is segfaulting for another reason altogether. Can you
> > please attach the stacktrace (with symbols!) and see if another
> > bisection is required?
>
> Yes, here it is.
>
> (EE) intel(0): Failed to initialize kernel memory manager
>
> Backtrace:
> 0: X(xf86SigHandler+0x7e) [0x80d8b5e]
> 1: [0xffffe400]
> 2: /usr/lib/xorg/modules/drivers//intel_drv.so(i830_allocator_init+0x332) [0xb73756e2]

Drat, that is:

commit 7bb6fb8dd958ae773ac205282e3c0b56c22e01ed
Author: Daniel Vetter <[email protected]>
Date: Tue Apr 24 08:22:52 2012 +0200

drm/i915: disallow gem ums init ioctl for kms

This ioctl used in a kms driver is only useful to create massive
havoc.

Can't see just why -intel crashes, but I presume it is during the
i830_free_memory along that path.

Anyway, looks like that patch needs to be reverted.
-Chris

--
Chris Wilson, Intel Open Source Technology Centre

2012-10-06 09:17:23

by Willy Tarreau

Subject: Re: 3.5 regression on i915

On Sat, Oct 06, 2012 at 10:04:00AM +0100, Chris Wilson wrote:
> On Sat, 6 Oct 2012 10:42:58 +0200, Willy Tarreau <[email protected]> wrote:
> > On Sat, Oct 06, 2012 at 09:24:35AM +0100, Chris Wilson wrote:
> > > More likely X is segfaulting for another reason altogether. Can you
> > > please attach the stacktrace (with symbols!) and see if another
> > > bisection is required?
> >
> > Yes, here it is.
> >
> > (EE) intel(0): Failed to initialize kernel memory manager
> >
> > Backtrace:
> > 0: X(xf86SigHandler+0x7e) [0x80d8b5e]
> > 1: [0xffffe400]
> > 2: /usr/lib/xorg/modules/drivers//intel_drv.so(i830_allocator_init+0x332) [0xb73756e2]
>
> Drat, that is:
>
> commit 7bb6fb8dd958ae773ac205282e3c0b56c22e01ed
> Author: Daniel Vetter <[email protected]>
> Date: Tue Apr 24 08:22:52 2012 +0200
>
> drm/i915: disallow gem ums init ioctl for kms
>
> This ioctl used in a kms driver is only useful to create massive
> havoc.
>
> Can't see just why -intel crashes, but I presume it is during the
> i830_free_memory along that path.
>
> Anyway, looks like that patch needs to be reverted.

Good catch, but now it dies later, after the screen goes black:

Backtrace:
0: X(xf86SigHandler+0x7e) [0x80d8b5e]
1: [0xffffe400]
2: /usr/lib/xorg/modules/drivers//intel_drv.so(I830Sync+0x4e) [0xb732eefe]
3: /usr/lib/xorg/modules/drivers//intel_drv.so [0xb733d916]
4: X(AbortDDX+0x8d) [0x80a146d]
5: X(AbortServer+0x28) [0x81b3ae8]
6: X(FatalError+0x66) [0x81b4066]
7: /usr/lib/xorg/modules/drivers//intel_drv.so [0xb733e626]
8: /usr/lib/xorg/modules/drivers//intel_drv.so [0xb733f962]
9: X(AddScreen+0x1fc) [0x806d42c]
10: X(InitOutput+0x21e) [0x80a1b7e]
11: X(main+0x296) [0x806dbb6]
12: /lib/libc.so.6(__libc_start_main+0xe0) [0xb7552390]
13: X(FontFileCompleteXLFD+0x20d) [0x806d121]

FatalError re-entered, aborting
Caught signal 11. Server aborting

I suspect the error handling in this version of intel_drv is incomplete,
resulting in segfaults instead of plain error reporting.

Willy
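
The suspicion above, that a failed initialisation is not handled and later
code then touches state that was never set up, can be illustrated with a small
userspace model in C. This is only a sketch under stated assumptions:
gem_init_ioctl_model(), allocator_init_model() and struct mem_manager are
hypothetical names, not functions or types from the kernel or from
xf86-video-intel, and the -ENODEV return value is assumed for illustration.
It shows a caller that propagates the error rather than carrying on.

```c
/* Userspace model only; all names here are hypothetical. */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for whatever state the driver sets up at init time. */
struct mem_manager {
	bool ready;
};

/* Models a UMS-only init call that refuses to run when KMS is active. */
int gem_init_ioctl_model(bool kms_enabled, struct mem_manager *mgr)
{
	assert(mgr != NULL);
	if (kms_enabled)
		return -ENODEV;	/* assumed error code */
	mgr->ready = true;
	return 0;
}

/*
 * Models an allocator init path. On failure it returns the error to its
 * caller and leaves mgr->ready false, so later code can test the flag
 * instead of dereferencing state that was never initialised.
 */
int allocator_init_model(bool kms_enabled, struct mem_manager *mgr)
{
	int ret;

	assert(mgr != NULL);
	mgr->ready = false;

	ret = gem_init_ioctl_model(kms_enabled, mgr);
	if (ret < 0)
		return ret;

	return 0;
}
```

A caller would check the return value of allocator_init_model() and stop
screen setup cleanly on a negative result, rather than continuing into code
that assumes the manager is ready.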

2012-10-06 16:11:37

by Chris Wilson

Subject: Re: 3.5 regression on i915

On Sat, 6 Oct 2012 11:17:08 +0200, Willy Tarreau <[email protected]> wrote:
> On Sat, Oct 06, 2012 at 10:04:00AM +0100, Chris Wilson wrote:
> > On Sat, 6 Oct 2012 10:42:58 +0200, Willy Tarreau <[email protected]> wrote:
> > > On Sat, Oct 06, 2012 at 09:24:35AM +0100, Chris Wilson wrote:
> > > > More likely X is segfaulting for another reason altogether. Can you
> > > > please attach the stacktrace (with symbols!) and see if another
> > > > bisection is required?
> > >
> > > Yes, here it is.
> > >
> > > (EE) intel(0): Failed to initialize kernel memory manager
> > >
> > > Backtrace:
> > > 0: X(xf86SigHandler+0x7e) [0x80d8b5e]
> > > 1: [0xffffe400]
> > > 2: /usr/lib/xorg/modules/drivers//intel_drv.so(i830_allocator_init+0x332) [0xb73756e2]

So far I have tested -intel-2.6.3 and -intel-2.7.1 with xorg-1.5.3 and
kernel-3.7 on a 965gm both using the EXA AccelMethod, and both are still
operational.

Can you send the complete Xorg.log for the current failure? Thanks,
-Chris

--
Chris Wilson, Intel Open Source Technology Centre

2012-10-06 20:44:07

by Willy Tarreau

Subject: Re: 3.5 regression on i915

On Sat, Oct 06, 2012 at 05:10:57PM +0100, Chris Wilson wrote:
> So far I have tested -intel-2.6.3 and -intel-2.7.1 with xorg-1.5.3 and
> kernel-3.7 on a 965gm both using the EXA AccelMethod, and both are still
> operational.
>
> Can you send the complete Xorg.log for the current failure? Thanks,

Yes, I'm sending this to you as well as my dmesg and kernel config off-list
to limit pollution. There's nothing private here so feel free to discuss it
on-list.

Thanks!
Willy

2012-10-07 20:59:37

by Daniel Vetter

Subject: Re: 3.5 regression on i915

On Sat, Oct 06, 2012 at 10:20:16AM +0200, Willy Tarreau wrote:
> Hi Chris,
>
> On Sat, Oct 06, 2012 at 09:04:34AM +0100, Chris Wilson wrote:
> > > The crash happens here in i915_gem_entervt_ioctl() :
> > >
> > > 3659 BUG_ON(!list_empty(&dev_priv->mm.active_list));
> > > 3660 BUG_ON(!list_empty(&dev_priv->mm.flushing_list));
> > > -> 3661 BUG_ON(!list_empty(&dev_priv->mm.inactive_list));
> > > 3662 mutex_unlock(&dev->struct_mutex);
> >
> > That BUG_ON there is silly and can simply be removed. The check is to
> > verify that no batches were submitted to the kernel whilst the UMS/GEM
> > client was suspended - to which the BUG_ONs are a crude approximation.
> > Furthermore, the checks are too late, since it means we attempted to
> > program the hardware whilst it was in an invalid state, the BUG_ONs are
> > the least of your concerns at that point.
>
> Excellent, that fixed it ! X still segfaults when KMS is used, but
> I expect more of a pure user-space issue here since there is nothing
> in dmesg.
>
> Would some of you accept the following patch and tag it for -stable ?
>
> Thanks,
> Willy
>
> ---
>
> From 3450cb7b7bd0b8fe1eab59d09e6852c4e3b22001 Mon Sep 17 00:00:00 2001
> From: Willy Tarreau <[email protected]>
> Date: Sat, 6 Oct 2012 10:09:00 +0200
> Subject: drm/i915: remove useless BUG_ON which caused a regression in 3.5.
>
> starting an old X server causes a kernel BUG since commit 1b50247a8d:
>
> ------------[ cut here ]------------
> kernel BUG at drivers/gpu/drm/i915/i915_gem.c:3661!
> invalid opcode: 0000 [#1] SMP
> Modules linked in: snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss uvcvideo videobuf2_core videodev videobuf2_vmalloc videobuf2_memops uhci_hcd ath9k mac80211 snd_hda_codec_realtek ath9k_common microcode ath9k_hw psmouse serio_raw sg ath cfg80211 atl1c lpc_ich mfd_core ehci_hcd snd_hda_intel snd_hda_codec snd_hwdep snd_pcm rtc_cmos snd_timer snd evdev eeepc_laptop snd_page_alloc sparse_keymap
>
> Pid: 2866, comm: X Not tainted 3.5.6-rc1-eeepc #1 ASUSTeK Computer INC. 1005HA/1005HA
> EIP: 0060:[<c12dc291>] EFLAGS: 00013297 CPU: 0
> EIP is at i915_gem_entervt_ioctl+0xf1/0x110
> EAX: f5941df4 EBX: f5940000 ECX: 00000000 EDX: 00020000
> ESI: f5835400 EDI: 00000000 EBP: f51d7e38 ESP: f51d7e20
> DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> CR0: 8005003b CR2: b760e0a0 CR3: 351b6000 CR4: 000007d0
> DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> DR6: ffff0ff0 DR7: 00000400
> Process X (pid: 2866, ti=f51d6000 task=f61af8d0 task.ti=f51d6000)
> Stack:
> 00000001 00000000 f5835414 f51d7e84 f5835400 f54f85c0 f51d7f10 c12b530b
> 00000001 c151b139 c14751b6 c152e030 00000b32 00006459 00000059 0000e200
> 00000001 00000000 00006459 c159ddd0 c12dc1a0 ffffffea 00000000 00000000
> Call Trace:
> [<c12b530b>] drm_ioctl+0x2eb/0x440
> [<c12dc1a0>] ? i915_gem_init+0xe0/0xe0
> [<c1052b2b>] ? enqueue_hrtimer+0x1b/0x50
> [<c1053321>] ? __hrtimer_start_range_ns+0x161/0x330
> [<c10530b3>] ? lock_hrtimer_base+0x23/0x50
> [<c1053163>] ? hrtimer_try_to_cancel+0x33/0x70
> [<c12b5020>] ? drm_version+0x90/0x90
> [<c10ca171>] vfs_ioctl+0x31/0x50
> [<c10ca2e4>] do_vfs_ioctl+0x64/0x510
> [<c10535de>] ? hrtimer_nanosleep+0x8e/0x100
> [<c1052c20>] ? update_rmtp+0x80/0x80
> [<c10ca7c9>] sys_ioctl+0x39/0x60
> [<c1433949>] syscall_call+0x7/0xb
> Code: 83 c4 0c 5b 5e 5f 5d c3 c7 44 24 04 2c 05 53 c1 c7 04 24 6f ef 47 c1 e8 6e e0 fd ff c7 83 38 1e 00 00 00 00 00 00 e9 3f ff ff ff <0f> 0b eb fe 0f 0b eb fe 8d b4 26 00 00 00 00 0f 0b eb fe 8d b6
> EIP: [<c12dc291>] i915_gem_entervt_ioctl+0xf1/0x110 SS:ESP 0068:f51d7e20
> ---[ end trace dd332ec083cbd513 ]---
>
> The crash happens here in i915_gem_entervt_ioctl() :
>
> 3659 BUG_ON(!list_empty(&dev_priv->mm.active_list));
> 3660 BUG_ON(!list_empty(&dev_priv->mm.flushing_list));
> -> 3661 BUG_ON(!list_empty(&dev_priv->mm.inactive_list));
> 3662 mutex_unlock(&dev->struct_mutex);
>
> Quoting Chris :
> "That BUG_ON there is silly and can simply be removed. The check is to
> verify that no batches were submitted to the kernel whilst the UMS/GEM
> client was suspended - to which the BUG_ONs are a crude approximation.
> Furthermore, the checks are too late, since it means we attempted to
> program the hardware whilst it was in an invalid state, the BUG_ONs are
> the least of your concerns at that point."
>
> Cc: Chris Wilson <[email protected]>
> Signed-off-by: Willy Tarreau <[email protected]>

Applied to -fixes, with cc: stable and a note mentioning the regressing
commit sha1 added.

Thanks, Daniel
> ---
> drivers/gpu/drm/i915/i915_gem.c | 1 -
> 1 files changed, 0 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 35926ad..fc6683a 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -3658,7 +3658,6 @@ i915_gem_entervt_ioctl(struct drm_device *dev, void *data,
>
> BUG_ON(!list_empty(&dev_priv->mm.active_list));
> BUG_ON(!list_empty(&dev_priv->mm.flushing_list));
> - BUG_ON(!list_empty(&dev_priv->mm.inactive_list));
> mutex_unlock(&dev->struct_mutex);
>
> ret = drm_irq_install(dev);
> --
> 1.7.2.1.45.g54fbc
>

--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2012-10-07 21:07:38

by Willy Tarreau

Subject: Re: 3.5 regression on i915

On Sun, Oct 07, 2012 at 11:00:31PM +0200, Daniel Vetter wrote:
> Applied to -fixes, with cc: stable and a note mentioning the regressing
> commit sha1 added.

Thanks to you and Chris for the quick diag & fix !

Willy