2011-02-19 05:58:14

by Indan Zupancic

[permalink] [raw]
Subject: [BUG] drm/i915 Screen corruption introduced by a00b10c360b35d6431a94cb

Hello,

Since 2.6.38-rc I get screen corruption (mostly horizontal grabage stripes on
the right side of the screen). After a long time bisecting the offending commit
ends up being:

commit a00b10c360b35d6431a94cbf130a4e162870d661
Author: Chris Wilson <[email protected]>
Date: Fri Sep 24 21:15:47 2010 +0100

drm/i915: Only enforce fence limits inside the GTT.

So long as we adhere to the fence registers rules for alignment and no
overlaps (including with unfenced accesses to linear memory) and account
for the tiled access in our size allocation, we do not have to allocate
the full fenced region for the object. This allows us to fight the bloat
tiling imposed on pre-i965 chipsets and frees up RAM for real use. [Inside
the GTT we still suffer the additional alignment constraints, so it doesn't
magic allow us to render larger scenes without stalls -- we need the
expanded GTT and fence pipelining to overcome those...]

Signed-off-by: Chris Wilson <[email protected]>

This commit caused other problems too, which Daniel tried to fix with commits:

5e78330126e23e00950 drm/i915: fix relaxed tiling for gen <= 3 && !g33
75e9e9158f38e5cb21e drm/i915: kill mappable/fenceable disdinction
818f2a3cc34b0673dcc drm/i915: revert pageflip/mappable related abi breakage

But those don't fix my screen corruption.

Unfortunately, it's a big commit and it doesn't revert cleanly, and its size
makes it unclear what the source of the problem is. Daniel's commits don't
revert cleanly either, so reverting all of them didn't work.

I'll start poking at it and see if I can find anything.

Greetings,

Indan


2011-02-19 18:34:06

by Daniel Vetter

[permalink] [raw]
Subject: Re: [BUG] drm/i915 Screen corruption introduced by a00b10c360b35d6431a94cb

Hi Indan,

Please provide the usual details about your system (especially what gpu
this is on). Also, screenshots of what typical corruptions look like can
help a lot in tracking down such things.

I've created two quick patches to check a few theories, please test them
(both patches independently and both together). Patches attached.

Yours, Daniel

On Sat, Feb 19, 2011 at 06:58:06AM +0100, Indan Zupancic wrote:
> Hello,
>
> Since 2.6.38-rc I get screen corruption (mostly horizontal grabage stripes on
> the right side of the screen). After a long time bisecting the offending commit
> ends up being:
>
> commit a00b10c360b35d6431a94cbf130a4e162870d661
> Author: Chris Wilson <[email protected]>
> Date: Fri Sep 24 21:15:47 2010 +0100
>
> drm/i915: Only enforce fence limits inside the GTT.
>
> So long as we adhere to the fence registers rules for alignment and no
> overlaps (including with unfenced accesses to linear memory) and account
> for the tiled access in our size allocation, we do not have to allocate
> the full fenced region for the object. This allows us to fight the bloat
> tiling imposed on pre-i965 chipsets and frees up RAM for real use. [Inside
> the GTT we still suffer the additional alignment constraints, so it doesn't
> magic allow us to render larger scenes without stalls -- we need the
> expanded GTT and fence pipelining to overcome those...]
>
> Signed-off-by: Chris Wilson <[email protected]>
>
> This commit caused other problems too, which Daniel tried to fix with commits:
>
> 5e78330126e23e00950 drm/i915: fix relaxed tiling for gen <= 3 && !g33
> 75e9e9158f38e5cb21e drm/i915: kill mappable/fenceable disdinction
> 818f2a3cc34b0673dcc drm/i915: revert pageflip/mappable related abi breakage
>
> But those don't fix my screen corruption.
>
> Unfortunately, it's a big commit and it doesn't revert cleanly, and its size
> makes it unclear what the source of the problem is. Daniel's commits don't
> revert cleanly either, so reverting all of them didn't work.
>
> I'll start poking at it and see if I can find anything.
>
> Greetings,
>
> Indan
>
>

--
Daniel Vetter
Mail: [email protected]
Mobile: +41 (0)79 365 57 48


Attachments:
(No filename) (2.18 kB)
for-indan-1.patch (452.00 B)
for-indan-2.patch (430.00 B)
Download all attachments

2011-02-20 02:20:20

by Indan Zupancic

[permalink] [raw]
Subject: Re: [BUG] drm/i915 Screen corruption introduced by a00b10c360b35d6431a94cb

Hi,

On Sat, February 19, 2011 19:25, Daniel Vetter wrote:
> Hi Indan,
>
> Please provide the usual details about your system (especially what gpu
> this is on). Also, screenshots of what typical corruptions look like can
> help a lot in tracking down such things.

Sorry, I plainly forgot to mention that. I'm on my Thinkpad X40 with gen 2
hardware, 855GM.

I uploaded a screenshot of Firefox to:
http://img406.imageshack.us/img406/9159/31827167.png

It happens a lot when scrolling in FF on pages with images, pageup or pagedown
don't exhibit it. Text gets corrupted as well, but in a less predictable way.
It must be something that FF does that triggers it easily, I haven't managed
to get it with other programs yet, though I do see some corruption in the
window decoration too sometimes (which is Fluxbox). Scrolling horizontally
doesn't show the same behaviour. The corruptions seems to be happening on
the right side of a surface/window.

Forcing a refresh makes it go away again (e.g. switching to another window or
opening another window on top of it. Moving a window doesn't though). I get
it with and without xcompmgr running.

Okay, interesting titbit: For text corruption, the stripes are one pixel high
for the default font size, no corruption for smaller sizes than the default,
two pixel lines for 1 ctrl+ bigger than the default, 3 pixels for +2 etc.
With bigger sizes (>=5) I get less corruption to the left of the text, but
more in the text itself. The distance between the lines continues to increase,
but the stripes thickness goes back to 1 pixel.

Perhaps an off-by-one error somewhere?

> I've created two quick patches to check a few theories, please test them
> (both patches independently and both together). Patches attached.

Tried with both applied, doesn't change anything.

Greetings,

Indan

2011-02-20 02:26:33

by Indan Zupancic

[permalink] [raw]
Subject: Re: [BUG] drm/i915 Screen corruption introduced by a00b10c360b35d6431a94cb

Huh, no idea why it didn't go to Daniel the first time, Squirrelmail never
did that before.

On Sun, February 20, 2011 03:20, Indan Zupancic wrote:
> Hi,
>
> On Sat, February 19, 2011 19:25, Daniel Vetter wrote:
>> Hi Indan,
>>
>> Please provide the usual details about your system (especially what gpu
>> this is on). Also, screenshots of what typical corruptions look like can
>> help a lot in tracking down such things.
>
> Sorry, I plainly forgot to mention that. I'm on my Thinkpad X40 with gen 2
> hardware, 855GM.
>
> I uploaded a screenshot of Firefox to:
> http://img406.imageshack.us/img406/9159/31827167.png
>
> It happens a lot when scrolling in FF on pages with images, pageup or pagedown
> don't exhibit it. Text gets corrupted as well, but in a less predictable way.
> It must be something that FF does that triggers it easily, I haven't managed
> to get it with other programs yet, though I do see some corruption in the
> window decoration too sometimes (which is Fluxbox). Scrolling horizontally
> doesn't show the same behaviour. The corruptions seems to be happening on
> the right side of a surface/window.
>
> Forcing a refresh makes it go away again (e.g. switching to another window or
> opening another window on top of it. Moving a window doesn't though). I get
> it with and without xcompmgr running.
>
> Okay, interesting titbit: For text corruption, the stripes are one pixel high
> for the default font size, no corruption for smaller sizes than the default,
> two pixel lines for 1 ctrl+ bigger than the default, 3 pixels for +2 etc.
> With bigger sizes (>=5) I get less corruption to the left of the text, but
> more in the text itself. The distance between the lines continues to increase,
> but the stripes thickness goes back to 1 pixel.
>
> Perhaps an off-by-one error somewhere?
>
>> I've created two quick patches to check a few theories, please test them
>> (both patches independently and both together). Patches attached.
>
> Tried with both applied, doesn't change anything.
>
> Greetings,
>
> Indan
>

2011-02-20 03:51:48

by Peter Stuge

[permalink] [raw]
Subject: Re: [BUG] drm/i915 Screen corruption introduced by a00b10c360b35d6431a94cb

Daniel Vetter wrote:
> Please provide the usual details about your system (especially what gpu
> this is on). Also, screenshots of what typical corruptions look like can
> help a lot in tracking down such things.

Confirm I also have this issue on my X40, but there are other bugs
that are much more significant so I haven't bothered mentioning this.

On the other hand my kernel is nearly two months old.

Please don't mistake this as a complaint. I'm overall rather happy
even with current state and it's issues, and especially I think the
situation is consistently getting better, with some bumps here and
there along the way.

Last time I used a projector I had some trouble, because the graphics
driver was too clever; it automatically set the correct resolution
for the projector, but I needed the resolution of my internal screen. :)


//Peter

2011-02-20 06:14:07

by Indan Zupancic

[permalink] [raw]
Subject: Re: [BUG] drm/i915 Screen corruption introduced by a00b10c360b35d6431a94cb

Hi Peter,

On Sun, February 20, 2011 04:51, Peter Stuge wrote:
> Confirm I also have this issue on my X40, but there are other bugs
> that are much more significant so I haven't bothered mentioning this.

What issues? If it's backlight related, try my patch at:
http://lkml.org/lkml/2011/2/16/447

Or if the screen is black after suspend/screen blank then just try a
newer kernel, that got recently fixed. Actually, a lot of bugs were
recently introduced and fixed, with two months ago you're probably
in the new-bugs-only period, so I can recommend trying 2.6.38-rc5.

This screen corruption is the only problem for me, but I don't do anything
fancy with my laptop. The ipw2200 wireless driver is quite crappy, but it
has always been as far as I know. (Though not too long ago the USB got
burnt out, which probably took one RAM module with it, so now I'm running
with 512MB instead of 1GB, which is really noticeable when doing kernel
git stuff or compiles with this very very slow HD.)

Good luck,

Indan

P.S. Anyone any idea where that "Mail-Followup-To" header comes from?
Or is everyone subscribed to dri-devel automatically excluded from
Mail-Followup-To? If that's the case, pardon for the dubplicate mails.

2011-02-20 09:20:35

by Daniel Vetter

[permalink] [raw]
Subject: Re: [BUG] drm/i915 Screen corruption introduced by a00b10c360b35d6431a94cb

Hi,

On Sun, Feb 20, 2011 at 03:20:15AM +0100, Indan Zupancic wrote:
> Sorry, I plainly forgot to mention that. I'm on my Thinkpad X40 with gen 2
> hardware, 855GM.

Ah, craptastic i855gm. Is this with the HIC poke patch?

<snip>

> > I've created two quick patches to check a few theories, please test them
> > (both patches independently and both together). Patches attached.
>
> Tried with both applied, doesn't change anything.

Well, these two patches completely undo the patch by Chris Wilson (and
I've re-reviewed it to check whether there are any other hidden bugs in
it). I suspect that your bisect went wrong because the range
a00b10c360b35d6431a94cb..5e78330126e23e00950 is known to be broken on your
hw. Could you recheck these two endpoints to confirm that breakage was
introduce in that range? If so, could you rebase this range and squash the
three patches by me you've mentioned in the original report into the patch
by Chris and re-run git bisect on this new branch?

Thanks, Daniel
--
Daniel Vetter
Mail: [email protected]
Mobile: +41 (0)79 365 57 48

2011-02-20 10:55:31

by Daniel Vetter

[permalink] [raw]
Subject: Re: [BUG] drm/i915 Screen corruption introduced by a00b10c360b35d6431a94cb

Hi,

On Sun, Feb 20, 2011 at 03:20:15AM +0100, Indan Zupancic wrote:
> > I've created two quick patches to check a few theories, please test them
> > (both patches independently and both together). Patches attached.
>
> Tried with both applied, doesn't change anything.

I've just noticed that one of the patches (the 2nd one) doesn't work as
advertised. Please retest with the attached one.

Yours, Daniel
--
Daniel Vetter
Mail: [email protected]
Mobile: +41 (0)79 365 57 48


Attachments:
(No filename) (476.00 B)
for-indan-2.patch (453.00 B)
Download all attachments

2011-02-20 11:03:04

by Indan Zupancic

[permalink] [raw]
Subject: Re: [BUG] drm/i915 Screen corruption introduced by a00b10c360b35d6431a94cb

Hi Daniel,

On Sun, February 20, 2011 10:20, Daniel Vetter wrote:
> On Sun, Feb 20, 2011 at 03:20:15AM +0100, Indan Zupancic wrote:
>> Sorry, I plainly forgot to mention that. I'm on my Thinkpad X40 with gen 2
>> hardware, 855GM.
>
> Ah, craptastic i855gm. Is this with the HIC poke patch?

Yeah, that one. That's what I first suspected, so I tried both with and
without it, but this is something new.

> Well, these two patches completely undo the patch by Chris Wilson (and
> I've re-reviewed it to check whether there are any other hidden bugs in
> it).

I can't find anything dodgy in it either, so it's either somewhere else
or really subtle. I certainly don't see much gen2 specific stuff.

> I suspect that your bisect went wrong because the range
> a00b10c360b35d6431a94cb..5e78330126e23e00950 is known to be broken on your
> hw.

It mostly worked after a bit of twiddling (after that I didn't need to skip
commits any more), and the kind of corruption was the same too.

Only way I can see that this would make a difference is when a00b10c3 created
the bug, the bug got fixed by your patches, but something else introduced it
between that range in a different way. This actually makes sense when it's
re-introduced by a bad merge. But then it should show up in the current code.
If not for a bad merge it seems unlikely though.

> Could you recheck these two endpoints to confirm that breakage was
> introduce in that range?

I'm pretty sure the breakage was introduced starting with a00b10c3, because
the commit just before it (746537) works fine. so yes, it's in that range.
I was getting a bit tired and less sharp though, so perhaps I made a mistake
somewhere. I'll recheck, probably tomorrow (getting a bit late here).

> If so, could you rebase this range and squash the
> three patches by me you've mentioned in the original report into the patch
> by Chris and re-run git bisect on this new branch?

As I have to apply patches anyway, it seems easier to just add your patches
to the bunch, if they apply...

Annoying bug, it seems I'm always hitting the tricky ones.

Looking at: gitk a00b10c360b35d6431a94cb..5e78330126e23e00950 \
drivers/char/agp/ drivers/gpu/drm/i915/

There were three merges around that area, so it seems a bit messy and something
might have gone wrong when merging there. But the right branch seems okay because
all later good bisections were from there.

---

I just saw your new email with a new patch to try, I'll try that one first and
report back, before digging further into all this.

Greetings,

Indan

---

First bisection info:

When I did the bisect I had to fix a compile error and an OOPS, patches I
applied were:

commit ff75b9bc489710ff223bc0d805bf3b862dc347ea
Author: Chris Wilson <[email protected]>
Date: Sat Oct 30 22:52:31 2010 +0100

drm/i915: Fix typo from e5281ccd in i915_gem_attach_phys_object()

Accessing the uninitialised obj->pages instead of the local page lead to
an OOPs.

Reported-by: Xavier Chantry <[email protected]>
Signed-off-by: Chris Wilson <[email protected]>

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 936ddd8..e3fc333 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -4925,7 +4925,7 @@ i915_gem_attach_phys_object(struct drm_device *dev,
if (IS_ERR(page))
return PTR_ERR(page);

- src = kmap_atomic(obj_priv->pages[i]);
+ src = kmap_atomic(page);
dst = obj_priv->phys_obj->handle->vaddr + (i * PAGE_SIZE);
memcpy(dst, src, PAGE_SIZE);
kunmap_atomic(src);

diff --git a/arch/x86/mm/iomap_32.c b/arch/x86/mm/iomap_32.c
index 75a3d7f..e1cc56a 100644
--- a/arch/x86/mm/iomap_32.c
+++ b/arch/x86/mm/iomap_32.c
@@ -61,7 +61,8 @@ void *kmap_atomic_prot_pfn(unsigned long pfn, pgprot_t prot)

pagefault_disable();

- type = kmap_atomic_idx_push();
+// type = kmap_atomic_idx_push();
+ type = 0;
idx = type + KM_TYPE_NR * smp_processor_id();
vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
set_pte(kmap_pte - idx, pfn_pte(pfn, prot));
@@ -98,7 +99,8 @@ iounmap_atomic(void __iomem *kvaddr)
vaddr <= __fix_to_virt(FIX_KMAP_BEGIN)) {
int idx, type;

- type = kmap_atomic_idx_pop();
+// type = kmap_atomic_idx_pop();
+ type = 0;
idx = type + KM_TYPE_NR * smp_processor_id();

#ifdef CONFIG_DEBUG_HIGHMEM

With these patches it worked for me and the only difference was that corruption
showing up or not, and the kind of corruption seemed the same too.

I ran bisect on the drm/i915 and char/agp paths. My bisect log was:

# bad: [d2478521afc20227658a10a8c5c2bf1a2aa615b3] char/ipmi: fix OOPS caused by
pnp_unregister_driver on unregistered driver
# good: [3c0eee3fe6a3a1c745379547c7e7c904aa64f6d5] Linux 2.6.37
git bisect start 'd2478521afc20227658a10a8c5c2bf1a2aa615b3' 'v2.6.37'
'drivers/gpu/drm/i915/' 'drivers/char/agp/'
# bad: [c6748e09eed072be077fe583976516b76daf42ec] drm/i915: Remove inactive LRU tracking
from set_domain_ioctl
git bisect bad c6748e09eed072be077fe583976516b76daf42ec
# bad: [e624ae8e0d4243e71daedce7570e91290438eaca] Merge branch 'drm-intel-fixes' into
drm-intel-next
git bisect bad e624ae8e0d4243e71daedce7570e91290438eaca
# good: [4ab0fbd3a29067e1540f05093ae4ed07645d18c8] Merge remote branch 'linus' into
drm-intel-fixes
git bisect good 4ab0fbd3a29067e1540f05093ae4ed07645d18c8
# good: [1bb95834bbcdc969e477a9284cf96c17a4c2616f] Merge remote branch
'airlied/drm-fixes' into drm-intel-fixes
git bisect good 1bb95834bbcdc969e477a9284cf96c17a4c2616f
# bad: [c94f28c383f58c9de74678e0f1624db9c5f8a8cb] Merge branch 'drm-intel-fixes' into
drm-intel-next
git bisect bad c94f28c383f58c9de74678e0f1624db9c5f8a8cb
# bad: [46168f39360f419e59952d58cd08a862886ec8cd] Merge branch 'drm-intel-fixes' into
drm-intel-next
git bisect bad 46168f39360f419e59952d58cd08a862886ec8cd
# good: [16a02cf08a2de0863daf7ebb91718d7c6bbe7f9c] agp/intel: fix cache control for
sandybridge
git bisect good 16a02cf08a2de0863daf7ebb91718d7c6bbe7f9c
# bad: [ff75b9bc489710ff223bc0d805bf3b862dc347ea] drm/i915: Fix typo from e5281ccd in
i915_gem_attach_phys_object()
git bisect bad ff75b9bc489710ff223bc0d805bf3b862dc347ea
# good: [dd2b379f071424f36f9f90ff83cb4ad058c7b6ed] drm/i915: Fix typo from "Enable
DisplayPort Audio"
git bisect good dd2b379f071424f36f9f90ff83cb4ad058c7b6ed
# good: [b4ce0f85159f77f208a62930f67b4e548576a5a3] drm/i915: Use pci_iomap for remapping
the MMIO registers.
git bisect good b4ce0f85159f77f208a62930f67b4e548576a5a3
# good: [39a01d1fb63cf8ebc1a8cf436f5c0ba9657b55c6] drm/i915: Remove mmap_offset
git bisect good 39a01d1fb63cf8ebc1a8cf436f5c0ba9657b55c6
# good: [e5281ccd2e0049e2b9e8ce82449630d25082372d] drm/i915: Eliminate nested get/put pages
git bisect good e5281ccd2e0049e2b9e8ce82449630d25082372d
# good: [e5281ccd2e0049e2b9e8ce82449630d25082372d] drm/i915: Eliminate nested get/put pages
git bisect good e5281ccd2e0049e2b9e8ce82449630d25082372d
# bad: [e380f60b22eddec7825224b8d788572c82b63161] agp/intel: Sandybridge doesn't require
GMCH enabling
git bisect bad e380f60b22eddec7825224b8d788572c82b63161
# bad: [e380f60b22eddec7825224b8d788572c82b63161] agp/intel: Sandybridge doesn't require
GMCH enabling
git bisect bad e380f60b22eddec7825224b8d788572c82b63161
# bad: [a00b10c360b35d6431a94cbf130a4e162870d661] drm/i915: Only enforce fence limits
inside the GTT.
git bisect bad a00b10c360b35d6431a94cbf130a4e162870d661
# good: [4a684a4117abd756291969336af454e8a958802f] drm/i915: Kill GTT mappings when
moving from GTT domain
git bisect good 4a684a4117abd756291969336af454e8a958802f
# good: [7465378fd7c681f6cf2b74b3494c4f0991d8c8ac] drm/i915: Convert BUG_ON(pin_count)
from an impossible condition
git bisect good 7465378fd7c681f6cf2b74b3494c4f0991d8c8ac

Or in condensed form:

- d24785 char/ipmi: fix OOPS caused by pnp_unregister_driver on unregistered driver
+ 3c0eee Linux 2.6.37
- c6748e drm/i915: Remove inactive LRU tracking from set_domain_ioctl
- e624ae Merge branch 'drm-intel-fixes' into drm-intel-next
+ 4ab0fb Merge remote branch 'linus' into drm-intel-fixes
+ 1bb958 Merge remote branch 'airlied/drm-fixes' into drm-intel-fixes
- c94f28 Merge branch 'drm-intel-fixes' into drm-intel-next
- 46168f Merge branch 'drm-intel-fixes' into drm-intel-next
+ 16a02c agp/intel: fix cache control for sandybridge
- ff75b9 drm/i915: Fix typo from e5281ccd in i915_gem_attach_phys_object()
+ dd2b37 drm/i915: Fix typo from "Enable DisplayPort Audio"
+ b4ce0f drm/i915: Use pci_iomap for remapping the MMIO registers.
+ 39a01d drm/i915: Remove mmap_offset
+ e5281c drm/i915: Eliminate nested get/put pages
- e380f6 agp/intel: Sandybridge doesn't require GMCH enabling
- a00b10 drm/i915: Only enforce fence limits inside the GTT.
+ 4a684a drm/i915: Kill GTT mappings when moving from GTT domain
+ 746537 drm/i915: Convert BUG_ON(pin_count) from an impossible condition

2011-02-20 11:19:19

by Indan Zupancic

[permalink] [raw]
Subject: Re: [BUG] drm/i915 Screen corruption introduced by a00b10c360b35d6431a94cb

Hi,

Good news:

On Sun, February 20, 2011 11:55, Daniel Vetter wrote:
>> > I've created two quick patches to check a few theories, please test them
>> > (both patches independently and both together). Patches attached.
>>
>> Tried with both applied, doesn't change anything.
>
> I've just noticed that one of the patches (the 2nd one) doesn't work as
> advertised. Please retest with the attached one.

This one fixes it!

So ignore my other email, you nailed it. Thanks a heap Daniel.

Tomorrow I'll test more and double check, but it seems fixed.

Of course, looking at the patch, I guess it isn't generally
applicable, so perhaps either more digging or a special
exception for older chipsets is needed. I'm happy to test
any further patches if that helps, triggering the corruption
is really easy for me.

Thanks,

Indan

2011-02-20 13:21:38

by Daniel Vetter

[permalink] [raw]
Subject: Re: [BUG] drm/i915 Screen corruption introduced by a00b10c360b35d6431a94cb

On Sun, Feb 20, 2011 at 12:19:14PM +0100, Indan Zupancic wrote:
> Hi,
>
> Good news:
>
> On Sun, February 20, 2011 11:55, Daniel Vetter wrote:
> >> > I've created two quick patches to check a few theories, please test them
> >> > (both patches independently and both together). Patches attached.
> >>
> >> Tried with both applied, doesn't change anything.
> >
> > I've just noticed that one of the patches (the 2nd one) doesn't work as
> > advertised. Please retest with the attached one.
>
> This one fixes it!

Well, don't start jumping around, yet. These patches are just to rule out
some theories. Now: Is it fixed with just the 2nd patch alone or do you
need both patches? This is very important, so please test extensively
whether there are really no corruptions with just the 2nd patch.

Yours, Daniel
--
Daniel Vetter
Mail: [email protected]
Mobile: +41 (0)79 365 57 48

2011-02-21 01:11:23

by Indan Zupancic

[permalink] [raw]
Subject: Re: [BUG] drm/i915 Screen corruption introduced by a00b10c360b35d6431a94cb

Hi,

On Sun, February 20, 2011 14:21, Daniel Vetter wrote:
> Well, don't start jumping around, yet. These patches are just to rule
> out some theories.

I know, that's why I said I don't mind testing other patches.

> Now: Is it fixed with just the 2nd patch alone or do you
> need both patches? This is very important, so please test extensively
> whether there are really no corruptions with just the 2nd patch.

I only applied the 2nd version of the 2nd patch, no other patches.
The horizontal garbage stripes are gone for sure. It's a lot easier
to prove that something doesn't work than to prove it does, but I'll
keep running it for a while and see if I can spot any other badness.
If it's hard to trigger it's also very hard to find out what causes it.
Without your 2nd patch I always get the garbage on certain webpages,
so I'm quite confident that the original source is located for this
particular problem.

Greetings,

Indan

2011-02-21 02:51:45

by Indan Zupancic

[permalink] [raw]
Subject: Re: [BUG] drm/i915 Screen corruption introduced by a00b10c360b35d6431a94cb

Hi,

Feel free to ignore this post, it's a it of rambling.

On Sun, February 20, 2011 11:55, Daniel Vetter wrote:
> I've just noticed that one of the patches (the 2nd one) doesn't work
> as advertised. Please retest with the attached one.

I was wondering why that was, so I looked a bit closer:

Old patch:

diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index 17bd766..fbc21e3 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -763,7 +763,7 @@ static int i915_getparam(struct drm_device *dev, void *data,
value = HAS_BLT(dev);
break;
case I915_PARAM_HAS_RELAXED_FENCING:
- value = 1;
+ value = 0;
break;
case I915_PARAM_HAS_COHERENT_RINGS:
value = 1;

New patch:

diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index 17bd766..d275c96 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -762,9 +762,6 @@ static int i915_getparam(struct drm_device *dev, void *data,
case I915_PARAM_HAS_BLT:
value = HAS_BLT(dev);
break;
- case I915_PARAM_HAS_RELAXED_FENCING:
- value = 1;
- break;
case I915_PARAM_HAS_COHERENT_RINGS:
value = 1;
break;

Looking at userspace intel-dri code it becomes clear why:

gp.param = I915_PARAM_HAS_EXECBUF2;
ret = drmIoctl(bufmgr_gem->fd, DRM_IOCTL_I915_GETPARAM, &gp);
if (!ret)
exec2 = 1;

gp.param = I915_PARAM_HAS_BSD;
ret = drmIoctl(bufmgr_gem->fd, DRM_IOCTL_I915_GETPARAM, &gp);
bufmgr_gem->has_bsd = ret == 0;

gp.param = I915_PARAM_HAS_BLT;
ret = drmIoctl(bufmgr_gem->fd, DRM_IOCTL_I915_GETPARAM, &gp);
bufmgr_gem->has_blt = ret == 0;

gp.param = I915_PARAM_HAS_RELAXED_FENCING;
ret = drmIoctl(bufmgr_gem->fd, DRM_IOCTL_I915_GETPARAM, &gp);
bufmgr_gem->has_relaxed_fencing = ret == 0;

The stupid code assumes that if an option is supported, the value is true too,
while the kernel return code just tells whether it knows the option, and the
value is copied to gp.value, but that's ignored by intel-dri!

And a quick look at the other code gives a strong impression of lots of code
dubplication with the kernel driver.

More and more I'm getting the impression that you guys haven't gotten the
interfaces between all bits right yet. All in all you seem to be somewhat
in the same mess as filesystems are with overly complex interaction between
the MM system, VFS and individual filesystems, all doing more or less the
same in slightly different ways, instead of abstracting things properly.
(Which is also just an impression, I haven't looked that close yet.)

To give you an idea of a driver subsystem that does get it right:

All the common libata code is 23k lines. (wc liba*)
All the individual sata drivers code combined are 19k lines. (wc sata_*)

DRM does it differently (only counting .c files):

Common drm code is 21k
i915 is 37k
nouveau is 47k
radeon is 68k

And then there are the userspace drivers:
xf86-video-intel/src/*.c is 16k
drm/intel/ is 4k
mesa/drivers/dri/i915/ is 22k

Of course graphics drivers are a lot more complex than sata drivers,
but because of that extra complexity it's a lot easier to make things
even more compex by getting the interfaces wrong. So what I'm saying is,
there seems a lot of room for improvements. The intel driver code I've
seen is mostly busy with infrastructure stuff and talking to other bits
and pieces, but it doesn't seem to do much actual work.

To come back to the I915_PARAM_HAS_RELAXED_FENCING thing:

Why is this interface there at all? It seems like a driver internal detail
thing. Either it should be handled entirely in the kernel driver, and this
interface wouldn't be there, or, if the userspace driver has to know about
it, it should be entirely handled by the userspace driver and not done by
the kernel driver at all.

Another slightly annoying thing: The code is littered with gen checks,
while very often the only difference is a register or size value. Why
not put those in a gen specific hardware description structure which is
used unconditionally, instead of having a lot of almost the same code?

The current design seems overly complex and fragile, and I think you guys
make it more difficult for yourself than necessary.

Greetings,

Indan

2011-02-21 04:10:18

by Peter Stuge

[permalink] [raw]
Subject: Re: [BUG] drm/i915 Screen corruption introduced by a00b10c360b35d6431a94cb

Indan Zupancic wrote:
> > Confirm I also have this issue on my X40, but there are other bugs
> > that are much more significant so I haven't bothered mentioning this.
>
> What issues?

The one I confirmed is corrupted graphics within Gecko. I haven't had
Xv working for a long time either. Not sure if I've tried it with
this kernel though.


> If it's backlight related, try my patch at:
> http://lkml.org/lkml/2011/2/16/447

Yeah also have backlight issue whenever the backlight level changes
by other means than Fn+Home/End. (Lid switch, screen blank)
I noticed the patch and that it solved the issue for someone. I'm not
too inconvenienced by this issue though.


> Or if the screen is black after suspend/screen blank

Oh I've stopped using suspend since using KMS. I get way too angry
about all the state that I lose if resume fails so I don't risk it.


> Actually, a lot of bugs were recently introduced and fixed, with
> two months ago you're probably in the new-bugs-only period, so I
> can recommend trying 2.6.38-rc5.

Yeah, I think it's time to pull Linus' git. I've been keeping an eye
on things i915 on the list for a good while already.


> This screen corruption is the only problem for me, but I don't do
> anything fancy with my laptop. The ipw2200 wireless driver is quite
> crappy, but it has always been as far as I know.

I've used ipw2200 with great success for many years, but these days
I'm having fun (no, not at all) with ath9k where there is some very
fundamental hardware issue between laptop and card. I'd need to hook
up logic analyzer to say anything concrete, but I have no end of
problems with internal ath9k in my machine. It's completely unusable.

The only other annoying issue I have is that as wine enumerates
available screen resolutions i915 goes out to the VGA connector,
which on 855 always means a 600ms timeout when nothing is connected,
but this is a bit tricky because the hardware just can not tell if
anything is connected.

I would be very happy if there was a knob for enabling/disabling the
VGA connector though.


> Good luck,

Thanks, you too!


//Peter

2011-02-21 05:26:12

by Indan Zupancic

[permalink] [raw]
Subject: Re: [BUG] drm/i915 Screen corruption introduced by a00b10c360b35d6431a94cb

Hi Peter,

The new corruption is "fixed" with daniel's patch from
https://lkml.org/lkml/2011/2/20/25

On Mon, February 21, 2011 05:10, Peter Stuge wrote:
> Indan Zupancic wrote:
>> > Confirm I also have this issue on my X40, but there are other bugs
>> > that are much more significant so I haven't bothered mentioning this.
>>
>> What issues?
>
> The one I confirmed is corrupted graphics within Gecko. I haven't had
> Xv working for a long time either. Not sure if I've tried it with
> this kernel though.

Hmm, I recommend upgrading all the graphics userspace too then,
current stuff works pretty well for me.

>
>> If it's backlight related, try my patch at:
>> http://lkml.org/lkml/2011/2/16/447
>
> Yeah also have backlight issue whenever the backlight level changes
> by other means than Fn+Home/End. (Lid switch, screen blank)
> I noticed the patch and that it solved the issue for someone. I'm not
> too inconvenienced by this issue though.

Well, it was for me, because I suspend all the time.

>
>> Or if the screen is black after suspend/screen blank
>
> Oh I've stopped using suspend since using KMS. I get way too angry
> about all the state that I lose if resume fails so I don't risk it.

I never had a failed resume with this laptop, not even once. But I
avoided the early days of KMS because of all the instability.

>
>> Actually, a lot of bugs were recently introduced and fixed, with
>> two months ago you're probably in the new-bugs-only period, so I
>> can recommend trying 2.6.38-rc5.
>
> Yeah, I think it's time to pull Linus' git. I've been keeping an eye
> on things i915 on the list for a good while already.

There were coherency bugs for 855, but those seem to be fixed in the
newer kernel too (2.6.37?), so another reason to upgrade.

>
>> This screen corruption is the only problem for me, but I don't do
>> anything fancy with my laptop. The ipw2200 wireless driver is quite
>> crappy, but it has always been as far as I know.
>
> I've used ipw2200 with great success for many years, but these days
> I'm having fun (no, not at all) with ath9k where there is some very
> fundamental hardware issue between laptop and card. I'd need to hook
> up logic analyzer to say anything concrete, but I have no end of
> problems with internal ath9k in my machine. It's completely unusable.

My impression is that ipw2200's firmware is crappy. Everything is fine
with a good signal, but with a bad one things don't work out too well
after a time.

> The only other annoying issue I have is that as wine enumerates
> available screen resolutions i915 goes out to the VGA connector,
> which on 855 always means a 600ms timeout when nothing is connected,
> but this is a bit tricky because the hardware just can not tell if
> anything is connected.

I have the impression that got better too, but I haven't followed the
development. Can't you just lower the timeout in the code, or let it
cache the VGA status? The latter is what I'd expect the code to do.

Running xrandr to see the modes takes 0.3s here.

> I would be very happy if there was a knob for enabling/disabling the
> VGA connector though.
>
>
>> Good luck,
>
> Thanks, you too!

I can recommend using PHC too, if you don't already: http://www.linux-phc.org
With it I almost never have the fan running and it uses a lot less power
when under load. (But don't use it till you got your system stable.)

Greetings,

Indan

2011-02-22 03:25:19

by Indan Zupancic

[permalink] [raw]
Subject: Re: [BUG] drm/i915 Screen corruption introduced by a00b10c360b35d6431a94cb

Hi,

On Sun, February 20, 2011 14:21, Daniel Vetter wrote:
> Well, don't start jumping around, yet. These patches are just to rule out
> some theories. Now: Is it fixed with just the 2nd patch alone or do you need both
> patches? This is very important, so please test extensively whether there are really
> no corruptions with just the 2nd patch.

I managed to create some corruption with an xterm above xpdf. It looks different
than the original corruption, so I think it's safe to say it's a different bug:

http://img593.imageshack.us/i/ss1298340823.png
http://img203.imageshack.us/i/ss1298340776.png

This was without xcompmgr running, I don't think I would have seen it otherwise.
Actually, it turns out it's really easy to trigger as well, so I can more easily
test patches now. Unfortunately I wasn't smart enough to store the bisecting kernels,
gah!

I tried with both your patches, as well as the HIC poking patch, but I still get it.

I'll try to pinpoint more exactly when this started to happen.

Greetings,

Indan

2011-02-22 03:39:58

by Indan Zupancic

[permalink] [raw]
Subject: Re: [BUG] drm/i915 Screen corruption introduced by a00b10c360b35d6431a94cb

On Tue, February 22, 2011 04:25, Indan Zupancic wrote:
> Hi,
>
> On Sun, February 20, 2011 14:21, Daniel Vetter wrote:
>> Well, don't start jumping around, yet. These patches are just to rule out
>> some theories. Now: Is it fixed with just the 2nd patch alone or do you need both
>> patches? This is very important, so please test extensively whether there are really
>> no corruptions with just the 2nd patch.
>
> I managed to create some corruption with an xterm above xpdf. It looks different
> than the original corruption, so I think it's safe to say it's a different bug:
>
> http://img593.imageshack.us/i/ss1298340823.png
> http://img203.imageshack.us/i/ss1298340776.png
>
> This was without xcompmgr running, I don't think I would have seen it otherwise.
> Actually, it turns out it's really easy to trigger as well, so I can more easily
> test patches now. Unfortunately I wasn't smart enough to store the bisecting kernels,
> gah!
>
> I tried with both your patches, as well as the HIC poking patch, but I still get it.
>
> I'll try to pinpoint more exactly when this started to happen.

Okay, I tried older kernels and I get it even with 2.6.34, so I think it's a bug
in userspace, like intel-dri, or perhaps the intel X driver (though I get the same
with an older version).

Greetings,

Indan