LinuxLists.cc - [REGRESSION, i915]: Periodic stalls with 2.6.36-rc2

2010-08-24 00:02:30

Subject: [REGRESSION, i915]: Periodic stalls with 2.6.36-rc2

Hi,

With 2.6.36-rc2 I see periodic stalls when running with a stock Ubuntu
10.04 userspace. These stalls were not present in 2.6.36-rc1 on an EeePC
900 with an i915.

Attempts to bisect the issue are not successful - most kernels in
between rc1 and rc2 just make the system come up with a black screen
which takes minutes between showing messages (rather than finishing in
two seconds). bc584c5107bfd97e2aa41c798e3b213bcdd4eae7 seems to be good
but 45d7f32c7a43cbb9592886d38190e379e2eb2226 is not.

Warnings like:

[ 64.227046] [drm:intel_calculate_wm] *ERROR* Insufficient FIFO for plane, expect flickering: entries required = 36, available = 31.
[ 82.953011] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU idle, missed IRQ.

appear frequently in the dmesg and the stalls are most visible when
playing fullscreen video. The first warning message (Insufficient
FIFO...) is present in 2.6.36-rc1 but not 2.6.35.

One additional quirk is that a more recent X userspace
(xserver-xorg-video-intel
2:2.12.0+git20100810.19c48d3b-0ubuntu0sarvatt2~lucid, libdrm-intel1
2.4.21+git20100804.431f7f00-0ubuntu0sarvatt~lucid,
xorg 1:7.5+6ubuntu1) now has major tearing (when watching video) and
refresh (in general at least when using compiz) issues which were not
seen with 2.6.36-rc1.

--
Sitsofe | http://sucs.org/~sits/

2010-08-24 00:12:30

by Chris Wilson

[permalink] [raw]

Subject: Re: [REGRESSION, i915]: Periodic stalls with 2.6.36-rc2

On Tue, 24 Aug 2010 00:35:51 +0100, Sitsofe Wheeler <[email protected]> wrote:
> Hi,
>
> With 2.6.36-rc2 I see periodic stalls when running with a stock Ubuntu
> 10.04 userspace. These stalls were not present in 2.6.36-rc1 on an EeePC
> 900 with an i915.

>From the error message, I'd suggest we'd tackle hangcheck - it simply
shouldn't be firing at all under normal circumstances. (Looking at it we
don't handle the introduction of the BSD ring correctly, but that is
irrelevant on the EeePC 900.)

Do the stalls and tearing go away with:

diff --git a/drivers/gpu/drm/i915/i915_irq.c
b/drivers/gpu/drm/i915/i915_irq.c
index e4d42a7..dc5fb4f 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1305,8 +1305,7 @@ void i915_hangcheck_elapsed(unsigned long data)
dev_priv->hangcheck_count = 0;

/* Issue a wake-up to catch stuck h/w. */
- if (dev_priv->render_ring.waiting_gem_seqno |
- dev_priv->bsd_ring.waiting_gem_seqno) {
+ if (0) {
DRM_ERROR("Hangcheck timer elapsed... GPU idle, missed IRQ.\n");
if (dev_priv->render_ring.waiting_gem_seqno)
DRM_WAKEUP(&dev_priv->render_ring.irq_queue);

--
Chris Wilson, Intel Open Source Technology Centre

2010-08-24 07:57:46

by Sitsofe Wheeler

[permalink] [raw]

Subject: Re: [REGRESSION, i915]: Periodic stalls with 2.6.36-rc2

On Tue, Aug 24, 2010 at 01:12:22AM +0100, Chris Wilson wrote:
>
> From the error message, I'd suggest we'd tackle hangcheck - it simply
> shouldn't be firing at all under normal circumstances. (Looking at it we
> don't handle the introduction of the BSD ring correctly, but that is
> irrelevant on the EeePC 900.)
>
> Do the stalls and tearing go away with:

The stalls became a bit better with the patch but there were still very
small pauses but the tearing with more recent X bits is definitely still
there. Additionally videos in totem would play as a tiny one pixel high
row about a quarter of the screen across at the top left of the screen.
It also seems that X has started having trouble refreshing itself so
sometimes new windows are invisible (especially with compiz).

I've also noticed a couple of other things. Sometimes (but without any
pattern) an extra display is detected even though no external display is
plugged in. Here's an example of the xrandr output:
Screen 0: minimum 320 x 200, current 2384 x 768, maximum 4096 x 4096
LVDS1 connected 1024x600+0+0 (normal left inverted right x axis y axis) 0mm x 0mm
1024x600 59.5*+
800x600 85.1 72.2 75.0 60.3 56.2
640x480 85.0 72.8 75.0 59.9
720x400 85.0
640x400 85.1
640x350 85.1
VGA1 disconnected (normal left inverted right x axis y axis)
TV1 disconnected (normal left inverted right x axis y axis)
1360x768 (0xbc) 84.8MHz
h: width 1360 start 1432 end 1568 total 1776 skew 0 clock 47.7KHz
v: height 768 start 771 end 781 total 798 clock 59.8Hz

Sometimes the output is TV1 sometimes it is VGA1 and sometimes these
extra outputs will disappear. Another issue is that switching from X to
a console for the first time causes a LCD "blooming" effect.

--
Sitsofe | http://sucs.org/~sits/

2010-08-24 08:16:56

by Chris Wilson

[permalink] [raw]

Subject: Re: [REGRESSION, i915]: Periodic stalls with 2.6.36-rc2

On Tue, 24 Aug 2010 08:57:41 +0100, Sitsofe Wheeler <[email protected]> wrote:
> The stalls became a bit better with the patch but there were still very
> small pauses but the tearing with more recent X bits is definitely still
> there. Additionally videos in totem would play as a tiny one pixel high
> row about a quarter of the screen across at the top left of the screen.
> It also seems that X has started having trouble refreshing itself so
> sometimes new windows are invisible (especially with compiz).

Ok, I'm a little happier that the hangcheck could be just another symptom
of the problem...

I think it is safe to assume that the bug is in i915, so restricting the
bisect to just drm seems plausible:

git bisect start drivers/gpu/drm
git bisect good 2.6.36-rc1
git bisect bad 2.6.36-rc2

And yes, it does apppear that the false detection rate for TV has changed.
The suspicion is that the change to intel_wait_for_vblank() is actually
causing issues here.

--
Chris Wilson, Intel Open Source Technology Centre

2010-08-24 08:49:06

by Sitsofe Wheeler

[permalink] [raw]

Subject: Re: [REGRESSION, i915]: Periodic stalls with 2.6.36-rc2

On Tue, Aug 24, 2010 at 09:16:50AM +0100, Chris Wilson wrote:
>
> Ok, I'm a little happier that the hangcheck could be just another symptom
> of the problem...
>
> I think it is safe to assume that the bug is in i915, so restricting the
> bisect to just drm seems plausible:
>
> git bisect start drivers/gpu/drm
> git bisect good 2.6.36-rc1
> git bisect bad 2.6.36-rc2

I should mention that I ran a similar bisect yesterday but it led to a dead
end:

# bad: [9ee47476d6734c9deb9ae9ab05d963302f6b6150] Merge branch 'radix-tree' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev
# good: [da5cabf80e2433131bf0ed8993abc0f7ea618c73] Linux 2.6.36-rc1
git bisect start 'HEAD' 'v2.6.36-rc1'
# good: [d1126ad907ce197ff45fbc2369fbeaf8ae6f75a8] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid
git bisect good d1126ad907ce197ff45fbc2369fbeaf8ae6f75a8
# bad: [4fefe435626758b14e6c05d2a5f8d71a997c0ad6] drm/i915,intel_agp: Add support for Sandybridge D0
git bisect bad 4fefe435626758b14e6c05d2a5f8d71a997c0ad6
# bad: [2e88e40bed136a7b7cb1c77d8dc6bd181d0d2732] drm/i915/sdvo: Markup a few constant strings.
git bisect bad 2e88e40bed136a7b7cb1c77d8dc6bd181d0d2732
# bad: [6146b3d61925116e3fecce36c2fd873665bd6614] drm/i915: i8xx also doesn't like multiple oustanding pageflips
git bisect bad 6146b3d61925116e3fecce36c2fd873665bd6614
# bad: [e044218a8ecb560b6fad65912a4e7e509db40414] drm/i915/sdvo: Add dot crawl property
git bisect bad e044218a8ecb560b6fad65912a4e7e509db40414
# bad: [45d7f32c7a43cbb9592886d38190e379e2eb2226] Merge git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile
git bisect bad 45d7f32c7a43cbb9592886d38190e379e2eb2226

All the bad kernels above boot EXTREMELY slowly and it's not clear why. Using
the results above to run your tests produced the following:

Bisecting: a merge base must be tested
[45d7f32c7a43cbb9592886d38190e379e2eb2226] Merge git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile
git bisect bad 45d7f32c7a43cbb9592886d38190e379e2eb2226
The merge base 45d7f32c7a43cbb9592886d38190e379e2eb2226 is bad.
This means the bug has been fixed between 45d7f32c7a43cbb9592886d38190e379e2eb2226 and [da5cabf80e2433131bf0ed8993abc0f7ea618c73].

As mentioned in the first mail, c584c5107bfd97e2aa41c798e3b213bcdd4eae7
seems to be good but 45d7f32c7a43cbb9592886d38190e379e2eb2226 does
not...

Any ideas?

> And yes, it does apppear that the false detection rate for TV has changed.
> The suspicion is that the change to intel_wait_for_vblank() is actually
> causing issues here.

OK.

--
Sitsofe | http://sucs.org/~sits/

2010-08-24 09:00:53

by Chris Wilson

[permalink] [raw]

Subject: Re: [REGRESSION, i915]: Periodic stalls with 2.6.36-rc2

On Tue, 24 Aug 2010 09:49:02 +0100, Sitsofe Wheeler <[email protected]> wrote:
> On Tue, Aug 24, 2010 at 09:16:50AM +0100, Chris Wilson wrote:
> >
> > Ok, I'm a little happier that the hangcheck could be just another symptom
> > of the problem...
> >
> > I think it is safe to assume that the bug is in i915, so restricting the
> > bisect to just drm seems plausible:
> >
> > git bisect start drivers/gpu/drm
> > git bisect good 2.6.36-rc1
> > git bisect bad 2.6.36-rc2
>
> I should mention that I ran a similar bisect yesterday but it led to a dead
> end:
[snip]
> All the bad kernels above boot EXTREMELY slowly and it's not clear why. Using
> the results above to run your tests produced the following:

I was hoping that git would be more intelligent than that. Is there a way
to simply bisect down one side of a merge?

The slow boot is probably fixed by 4936a3b90d79dd8775c6ac23c2cf2dcebe29abde.
A trivial patch you can apply on each step is:

diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 33dbcc4..88f3b6c 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -837,7 +837,7 @@ static int hpet_clocksource_register(void)
* cyc/sec = FSEC_PER_SEC/hpet_period(fsec/cyc)
* cyc/sec = (FSEC_PER_NSEC * NSEC_PER_SEC)/hpet_period
*/
- hpet_freq = FSEC_PER_NSEC * NSEC_PER_SEC;
+ hpet_freq = (u64) FSEC_PER_NSEC * NSEC_PER_SEC;
do_div(hpet_freq, hpet_period);
clocksource_register_hz(&clocksource_hpet, (u32)hpet_freq);

--
Chris Wilson, Intel Open Source Technology Centre

2010-08-24 09:55:38

by Sitsofe Wheeler

[permalink] [raw]

Subject: Re: [REGRESSION, BISECTED, i915]: Periodic stalls with 2.6.36-rc2

On Tue, Aug 24, 2010 at 10:00:47AM +0100, Chris Wilson wrote:
>
> I was hoping that git would be more intelligent than that. Is there a
> way to simply bisect down one side of a merge?

Seemingly not...

> The slow boot is probably fixed by 4936a3b90d79dd8775c6ac23c2cf2dcebe29abde.
> A trivial patch you can apply on each step is:

Thanks that patch got it booting at normal speeds. Bisecting has now
narrowed the flickering/corruption with a newer X userspace down to
this:

commit 9d0498a2bf7455159b317f19531a3e5db2ecc9c4
Author: Jesse Barnes <[email protected]>
Date: Wed Aug 18 13:20:54 2010 -0700

drm/i915: wait for actual vblank, not just 20ms

Waiting for a hard coded 20ms isn't always enough to make sure a vblank
period has actually occurred, so add code to make sure we really have
passed through a vblank period (or that the pipe is off when disabling).

This prevents problems with mode setting and link training, and seems to
fix a bug like https://bugs.freedesktop.org/show_bug.cgi?id=29278, but
on an HP 8440p instead. Hopefully also fixes
https://bugs.freedesktop.org/show_bug.cgi?id=29141.

Signed-off-by: Jesse Barnes <[email protected]>
Signed-off-by: Eric Anholt <[email protected]>

Reverting the above against HEAD seems to have fixed the misdetection of
displays and hangcheck warnings don't appear (although plenty of FIFO
warnings still). I notice that this commit also seemed to be the cause
of problems for Ivan and Pekka in http://lkml.org/lkml/2010/8/23/452 ...

--
Sitsofe | http://sucs.org/~sits/

2010-08-24 15:41:59

by Sitsofe Wheeler

[permalink] [raw]

Subject: [PATCH 1/2] drm/i915: Revert wait for vblank to prevent X refresh issues

In commit 9d0498a2bf7455159b317f19531a3e5db2ecc9c4 20ms waits were
converted into vblank waits. One of these caused tearing, mode detection
and redraw issues on an EeePC 900 with a more recent intel userspace (
http://lkml.org/lkml/2010/8/23/432 ). Restoring the 20ms wait resolves
the issue.
---
drivers/gpu/drm/i915/intel_display.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 23157e1..116d938 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -4539,7 +4539,7 @@ struct drm_crtc *intel_get_load_detect_pipe(struct intel_encoder *intel_encoder,
encoder_funcs->commit(encoder);
}
/* let the connector get through one full cycle before testing */
- intel_wait_for_vblank(dev, intel_crtc->pipe);
+ msleep(20);

return crtc;
}
--
1.7.1

2010-08-24 15:54:07

by Sitsofe Wheeler

[permalink] [raw]

Subject: [PATCH 1/2] drm/i915: Revert wait for vblank to prevent X refresh issues

In commit 9d0498a2bf7455159b317f19531a3e5db2ecc9c4 20ms waits were
converted into vblank waits. One of these caused tearing, mode detection
and redraw issues on an EeePC 900 with a more recent intel userspace (
http://lkml.org/lkml/2010/8/23/432 ). Restoring the 20ms wait resolves
the issue.

Signed-off-by: Sitsofe Wheeler <[email protected]>
---
drivers/gpu/drm/i915/intel_display.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 23157e1..116d938 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -4539,7 +4539,7 @@ struct drm_crtc *intel_get_load_detect_pipe(struct intel_encoder *intel_encoder,
encoder_funcs->commit(encoder);
}
/* let the connector get through one full cycle before testing */
- intel_wait_for_vblank(dev, intel_crtc->pipe);
+ msleep(20);

return crtc;
}
--
1.7.1

2010-08-24 15:56:22

by Sitsofe Wheeler

[permalink] [raw]

Subject: [PATCH 2/2] drm/i915: Revert extra intel_wait_for_vblank to prevent stalls.

With the extra intel_wait_for_vblank added in commit
9d0498a2bf7455159b317f19531a3e5db2ecc9c4 periodic stalls were being
triggered (which were detected by i915_hangcheck_elapsed). Partially
revert this change for now.

Signed-off-by: Sitsofe Wheeler <[email protected]>
---
drivers/gpu/drm/i915/intel_display.c | 6 ++++--
1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 116d938..534f1fa 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -2379,8 +2379,10 @@ static void i9xx_crtc_dpms(struct drm_crtc *crtc, int mode)
I915_READ(dspbase_reg);
}

- /* Wait for vblank for the disable to take effect */
- intel_wait_for_vblank_off(dev, pipe);
+ if (!IS_I9XX(dev)) {
+ /* Wait for vblank for the disable to take effect */
+ intel_wait_for_vblank_off(dev, pipe);
+ }

/* Don't disable pipe A or pipe A PLLs if needed */
if (pipeconf_reg == PIPEACONF &&
--
1.7.1

2010-08-24 16:35:11

by Jesse Barnes

[permalink] [raw]

Subject: Re: [PATCH 2/2] drm/i915: Revert extra intel_wait_for_vblank to prevent stalls.

On Tue, 24 Aug 2010 16:56:16 +0100
Sitsofe Wheeler <[email protected]> wrote:

> With the extra intel_wait_for_vblank added in commit
> 9d0498a2bf7455159b317f19531a3e5db2ecc9c4 periodic stalls were being
> triggered (which were detected by i915_hangcheck_elapsed). Partially
> revert this change for now.
>
> Signed-off-by: Sitsofe Wheeler <[email protected]>
> ---
> drivers/gpu/drm/i915/intel_display.c | 6 ++++--
> 1 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
> index 116d938..534f1fa 100644
> --- a/drivers/gpu/drm/i915/intel_display.c
> +++ b/drivers/gpu/drm/i915/intel_display.c
> @@ -2379,8 +2379,10 @@ static void i9xx_crtc_dpms(struct drm_crtc *crtc, int mode)
> I915_READ(dspbase_reg);
> }
>
> - /* Wait for vblank for the disable to take effect */
> - intel_wait_for_vblank_off(dev, pipe);
> + if (!IS_I9XX(dev)) {
> + /* Wait for vblank for the disable to take effect */
> + intel_wait_for_vblank_off(dev, pipe);
> + }
>
> /* Don't disable pipe A or pipe A PLLs if needed */
> if (pipeconf_reg == PIPEACONF &&

Hm why would we be triggering the hangcheck timer due to this code?
I'd rather figure that out and fix it before covering it up like this.

Wait for vblank off will spin until the display line reg stops
incrementing, so it's important that we flush any previous writes to
shut off the pipe before waiting. So adding a POSTING_READ() above it
might help? But that still doesn't explain why it would cause the
hangcheck timer to notice a hang...

--
Jesse Barnes, Intel Open Source Technology Center

2010-08-24 16:39:44

by Jesse Barnes

[permalink] [raw]

Subject: Re: [PATCH 1/2] drm/i915: Revert wait for vblank to prevent X refresh issues

On Tue, 24 Aug 2010 16:53:59 +0100
Sitsofe Wheeler <[email protected]> wrote:

> In commit 9d0498a2bf7455159b317f19531a3e5db2ecc9c4 20ms waits were
> converted into vblank waits. One of these caused tearing, mode detection
> and redraw issues on an EeePC 900 with a more recent intel userspace (
> http://lkml.org/lkml/2010/8/23/432 ). Restoring the 20ms wait resolves
> the issue.
>
> Signed-off-by: Sitsofe Wheeler <[email protected]>
> ---
> drivers/gpu/drm/i915/intel_display.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
> index 23157e1..116d938 100644
> --- a/drivers/gpu/drm/i915/intel_display.c
> +++ b/drivers/gpu/drm/i915/intel_display.c
> @@ -4539,7 +4539,7 @@ struct drm_crtc *intel_get_load_detect_pipe(struct intel_encoder *intel_encoder,
> encoder_funcs->commit(encoder);
> }
> /* let the connector get through one full cycle before testing */
> - intel_wait_for_vblank(dev, intel_crtc->pipe);
> + msleep(20);
>
> return crtc;
> }

Wow, tearing, mode detection and redraw problems all because of this
line? Maybe because we wait for a longer period here now? Can you
check and see if we're timing out in the wait_for_vblank function?

--
Jesse Barnes, Intel Open Source Technology Center