After upgrading one of my machines to 2.6.32, I saw hangs after one to thirty minutes after
booting, with random data written to parts of the frame buffer. I've bisected it down
to 620f37811d "drm: prune modes when output is disconnected.", which was merged
in 2.6.32-rc1. Connecting a serial console does not reveal any output at the time of
the crash.
The machine uses an Intel G45 chipset with the i915 kernel mode setting enabled.
I have no clue what that patch does or why reverting it fixes the problem but 2.6.32
with this revert applied has not shown these hangs yet.
Signed-off-by: Arnd Bergmann <[email protected]>
---
--- a/drivers/gpu/drm/drm_crtc_helper.c
+++ b/drivers/gpu/drm/drm_crtc_helper.c
@@ -104,7 +104,8 @@ int drm_helper_probe_single_connector_modes(struct drm_connector *connector,
if (connector->status == connector_status_disconnected) {
DRM_DEBUG_KMS("%s is disconnected\n",
drm_get_connector_name(connector));
- goto prune;
+ /* TODO set EDID to NULL */
+ return 0;
}
count = (*connector_funcs->get_modes)(connector);
@@ -132,7 +133,6 @@ int drm_helper_probe_single_connector_modes(struct drm_connector *connector,
mode);
}
-prune:
drm_mode_prune_invalid(dev, &connector->modes, true);
if (list_empty(&connector->modes))
On Monday 07 December 2009, Arnd Bergmann wrote:
> After upgrading one of my machines to 2.6.32, I saw hangs after one to thirty minutes after
> booting, with random data written to parts of the frame buffer. I've bisected it down
> to 620f37811d "drm: prune modes when output is disconnected.", which was merged
> in 2.6.32-rc1. Connecting a serial console does not reveal any output at the time of
> the crash.
>
> The machine uses an Intel G45 chipset with the i915 kernel mode setting enabled.
>
> I have no clue what that patch does or why reverting it fixes the problem but 2.6.32
> with this revert applied has not shown these hangs yet.
>
> Signed-off-by: Arnd Bergmann <[email protected]>
Scratch this, the machine crashed again in the same way, but only after
a some hours of uptime (normally it took a few minutes). It still looks like
this commit makes the bug much more likely, but it could also be
attributed to complete coincedence and bad timing. 2.6.31 certainly never
showed the problem and the bisection clearly pointed in the general direction
of i915.
I hate debugging sporadic hangs...
Sorry for blaming the wrong patch, especially if it already caused a lot
of work. I'll try blaming ec2a4c3fdc8 "drm/i915: get the bridge device
once" now, that is the next-best candidate that my bisection pointed to,
but I'll do more rigorous testing.
Arnd
On Tue, 2009-12-08 at 22:59 +0100, Arnd Bergmann wrote:
> On Monday 07 December 2009, Arnd Bergmann wrote:
> > After upgrading one of my machines to 2.6.32, I saw hangs after one to thirty minutes after
> > booting, with random data written to parts of the frame buffer. I've bisected it down
> > to 620f37811d "drm: prune modes when output is disconnected.", which was merged
> > in 2.6.32-rc1. Connecting a serial console does not reveal any output at the time of
> > the crash.
> >
> > The machine uses an Intel G45 chipset with the i915 kernel mode setting enabled.
> >
> > I have no clue what that patch does or why reverting it fixes the problem but 2.6.32
> > with this revert applied has not shown these hangs yet.
> >
> > Signed-off-by: Arnd Bergmann <[email protected]>
>
> Scratch this, the machine crashed again in the same way, but only after
> a some hours of uptime (normally it took a few minutes). It still looks like
> this commit makes the bug much more likely, but it could also be
> attributed to complete coincedence and bad timing. 2.6.31 certainly never
> showed the problem and the bisection clearly pointed in the general direction
> of i915.
>
> I hate debugging sporadic hangs...
>
> Sorry for blaming the wrong patch, especially if it already caused a lot
> of work. I'll try blaming ec2a4c3fdc8 "drm/i915: get the bridge device
> once" now, that is the next-best candidate that my bisection pointed to,
> but I'll do more rigorous testing.
>
Btw is this just a laptop with no monitor plugged in?
just want to rule out the patch you mentioned for any sort of memory
corruption.
Dave.
On Tuesday 08 December 2009, Dave Airlie wrote:
> >
> > Sorry for blaming the wrong patch, especially if it already caused a lot
> > of work. I'll try blaming ec2a4c3fdc8 "drm/i915: get the bridge device
> > once" now, that is the next-best candidate that my bisection pointed to,
> > but I'll do more rigorous testing.
It just crashed again with both 620f378 and ec2a4c3fdc8 reverted as well,
after about an hour of uptime.
> Btw is this just a laptop with no monitor plugged in?
>
> just want to rule out the patch you mentioned for any sort of memory
> corruption.
It's a desktop machine with three outputs, Asus P5Q-EM mainboard.
VGA is not connected at all, DVI has an open-ended cable on it, i.e. no
screen on it. HDMI is connected through a cheap passive HDMI multiswitch
to the DVI-D input of a Lenovo screen.
Arnd
On Wed, 2009-12-09 at 00:07 +0100, Arnd Bergmann wrote:
> On Tuesday 08 December 2009, Dave Airlie wrote:
> > >
> > > Sorry for blaming the wrong patch, especially if it already caused a lot
> > > of work. I'll try blaming ec2a4c3fdc8 "drm/i915: get the bridge device
> > > once" now, that is the next-best candidate that my bisection pointed to,
> > > but I'll do more rigorous testing.
>
> It just crashed again with both 620f378 and ec2a4c3fdc8 reverted as well,
> after about an hour of uptime.
>
> > Btw is this just a laptop with no monitor plugged in?
> >
> > just want to rule out the patch you mentioned for any sort of memory
> > corruption.
>
> It's a desktop machine with three outputs, Asus P5Q-EM mainboard.
> VGA is not connected at all, DVI has an open-ended cable on it, i.e. no
> screen on it. HDMI is connected through a cheap passive HDMI multiswitch
> to the DVI-D input of a Lenovo screen.
CC'ed the Intel ppl,
regression since 2.6.31 by the looks of it, bisection has found two
false positives so far.
Dave.
On Wednesday 09 December 2009 05:38:07 Dave Airlie wrote:
> On Wed, 2009-12-09 at 00:07 +0100, Arnd Bergmann wrote:
> > On Tuesday 08 December 2009, Dave Airlie wrote:
> > > >
> > > > Sorry for blaming the wrong patch, especially if it already caused a lot
> > > > of work. I'll try blaming ec2a4c3fdc8 "drm/i915: get the bridge device
> > > > once" now, that is the next-best candidate that my bisection pointed to,
> > > > but I'll do more rigorous testing.
> >
> > It just crashed again with both 620f378 and ec2a4c3fdc8 reverted as well,
> > after about an hour of uptime.
> >
> ...
>
> regression since 2.6.31 by the looks of it, bisection has found two
> false positives so far.
I did a more thorough bisection this time, which points to c05422d52ee6b
"drm/i915: remove open-coded drm_mode_object_find". The system crashed
within minutes with 2.6.32-git and with c05422d52ee6b, but stayed up
for 12 hours on identical kernels with this one reverted:
> From c05422d52ee6b4cff8b63eab1a7351780518fc5e Mon Sep 17 00:00:00 2001
> From: Daniel Vetter <[email protected]>
> Date: Tue, 11 Aug 2009 16:05:30 +0200
> Subject: [PATCH] drm/i915: remove open-coded drm_mode_object_find
>
> And clean up a small whitespace goof-up in the same function, while
> I was looking at it.
>
> Signed-off-by: Daniel Vetter <[email protected]>
> Signed-off-by: Eric Anholt <[email protected]>
Given that my last bisection went wrong and I can't prove why this
patch is bad, I'd of course like to see some insight about what
could go wrong here, but at least reverting it should not be able
to cause any harm.
---
[PATCH] drm/i915: revert c05422d52 "remove open-coded drm_mode_object_find"
The patch was meant as a cleanup, but apparently caused random hangs
on my machine.
Signed-off-by: Arnd Bergmann <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Eric Anholt <[email protected]>
diff --git b/drivers/gpu/drm/i915/intel_display.c a/drivers/gpu/drm/i915/intel_display.c
index f0f38f5..35e8619 100644
--- b/drivers/gpu/drm/i915/intel_display.c
+++ a/drivers/gpu/drm/i915/intel_display.c
@@ -3557,26 +3557,30 @@ int intel_get_pipe_from_crtc_id(struct drm_device *dev, void *data,
{
drm_i915_private_t *dev_priv = dev->dev_private;
struct drm_i915_get_pipe_from_crtc_id *pipe_from_crtc_id = data;
- struct drm_mode_object *drmmode_obj;
- struct intel_crtc *crtc;
+ struct drm_crtc *crtc = NULL;
+ int pipe = -1;
if (!dev_priv) {
DRM_ERROR("called with no initialization\n");
return -EINVAL;
}
- drmmode_obj = drm_mode_object_find(dev, pipe_from_crtc_id->crtc_id,
- DRM_MODE_OBJECT_CRTC);
+ list_for_each_entry(crtc, &dev->mode_config.crtc_list, head) {
+ struct intel_crtc *intel_crtc = to_intel_crtc(crtc);
+ if (crtc->base.id == pipe_from_crtc_id->crtc_id) {
+ pipe = intel_crtc->pipe;
+ break;
+ }
+ }
- if (!drmmode_obj) {
+ if (pipe == -1) {
DRM_ERROR("no such CRTC id\n");
return -EINVAL;
}
- crtc = to_intel_crtc(obj_to_crtc(drmmode_obj));
- pipe_from_crtc_id->pipe = crtc->pipe;
+ pipe_from_crtc_id->pipe = pipe;
- return 0;
+ return 0;
}
struct drm_crtc *intel_get_crtc_from_pipe(struct drm_device *dev, int pipe)
On Sunday 13 December 2009 12:18:24 Arnd Bergmann wrote:
> On Wednesday 09 December 2009 05:38:07 Dave Airlie wrote:
> > On Wed, 2009-12-09 at 00:07 +0100, Arnd Bergmann wrote:
> > > On Tuesday 08 December 2009, Dave Airlie wrote:
> > > > >
> > > > > Sorry for blaming the wrong patch, especially if it already caused a lot
> > > > > of work. I'll try blaming ec2a4c3fdc8 "drm/i915: get the bridge device
> > > > > once" now, that is the next-best candidate that my bisection pointed to,
> > > > > but I'll do more rigorous testing.
> > >
> > > It just crashed again with both 620f378 and ec2a4c3fdc8 reverted as well,
> > > after about an hour of uptime.
> > >
> > ...
> >
> > regression since 2.6.31 by the looks of it, bisection has found two
> > false positives so far.
>
> I did a more thorough bisection this time, which points to c05422d52ee6b
> "drm/i915: remove open-coded drm_mode_object_find". The system crashed
> within minutes with 2.6.32-git and with c05422d52ee6b, but stayed up
> for 12 hours on identical kernels with this one reverted:
>
> > From c05422d52ee6b4cff8b63eab1a7351780518fc5e Mon Sep 17 00:00:00 2001
> > From: Daniel Vetter <[email protected]>
> > Date: Tue, 11 Aug 2009 16:05:30 +0200
> > Subject: [PATCH] drm/i915: remove open-coded drm_mode_object_find
And now it's obvious that my computer hates me. 12 hours of uptime, one reboot
to check the old other version is broken, it crashes. I reboot into the
good version, send out the above email and the next minute it crashes again.
c05422d52ee6b is not the culprit. Sorry Daniel for blaming your patch.
Arnd
On Sun, Dec 13, 2009 at 12:30:25PM +0000, Arnd Bergmann wrote:
> And now it's obvious that my computer hates me. 12 hours of uptime, one reboot
> to check the old other version is broken, it crashes. I reboot into the
> good version, send out the above email and the next minute it crashes again.
> c05422d52ee6b is not the culprit. Sorry Daniel for blaming your patch.
No problem. Looks like your hunting a pretty ugly Heisenbug. There's quite
a interesting blog post by Paul McKenney, esp. the solution to "Quick Quiz 1"
might be usefull in your case:
http://paulmck.livejournal.com/14639.html
Happy Hunting, Daniel
--
Daniel Vetter
Mail: [email protected]
Mobile: +41 (0)79 365 57 48
On Sunday 13 December 2009 20:00:18 Daniel Vetter wrote:
> On Sun, Dec 13, 2009 at 12:30:25PM +0000, Arnd Bergmann wrote:
> > And now it's obvious that my computer hates me. 12 hours of uptime, one reboot
> > to check the old other version is broken, it crashes. I reboot into the
> > good version, send out the above email and the next minute it crashes again.
> > c05422d52ee6b is not the culprit. Sorry Daniel for blaming your patch.
>
> No problem. Looks like your hunting a pretty ugly Heisenbug. There's quite
> a interesting blog post by Paul McKenney, esp. the solution to "Quick Quiz 1"
> might be usefull in your case:
>
> http://paulmck.livejournal.com/14639.html
Thanks! In fact I've actually read that post on the kernel planet and decided
to do basically a linear search through the i915 patches merged into 2.6.32.
The current result is 67cf781bea5 "drm/i915: Make the downclocking debug code
be under DRM_DEBUG not DRM_ERROR." is known bad, while 043029655 "drm/i915:
Support IGD EOS" is probably good, pointing to Jesses 652c393a33 "drm/i915:
add dynamic clock frequency control" as the next best guess. Unfortunately,
that is a rather large change that is not easy to revert on current kernels.
I'll keep running my machine on the 043029655 level and tell you if it keeps
working a few more days.
Arnd
On Sun, 2009-12-13 at 21:31 +0000, Arnd Bergmann wrote:
> On Sunday 13 December 2009 20:00:18 Daniel Vetter wrote:
> > On Sun, Dec 13, 2009 at 12:30:25PM +0000, Arnd Bergmann wrote:
> > > And now it's obvious that my computer hates me. 12 hours of uptime, one reboot
> > > to check the old other version is broken, it crashes. I reboot into the
> > > good version, send out the above email and the next minute it crashes again.
> > > c05422d52ee6b is not the culprit. Sorry Daniel for blaming your patch.
> >
> > No problem. Looks like your hunting a pretty ugly Heisenbug. There's quite
> > a interesting blog post by Paul McKenney, esp. the solution to "Quick Quiz 1"
> > might be usefull in your case:
> >
> > http://paulmck.livejournal.com/14639.html
>
> Thanks! In fact I've actually read that post on the kernel planet and decided
> to do basically a linear search through the i915 patches merged into 2.6.32.
>
> The current result is 67cf781bea5 "drm/i915: Make the downclocking debug code
> be under DRM_DEBUG not DRM_ERROR." is known bad, while 043029655 "drm/i915:
> Support IGD EOS" is probably good, pointing to Jesses 652c393a33 "drm/i915:
> add dynamic clock frequency control" as the next best guess. Unfortunately,
> that is a rather large change that is not easy to revert on current kernels.
That seems the most likely, perhaps jbarnes can comment.
Dave.
On Mon, 14 Dec 2009 07:54:05 +1000
Dave Airlie <[email protected]> wrote:
> On Sun, 2009-12-13 at 21:31 +0000, Arnd Bergmann wrote:
> > On Sunday 13 December 2009 20:00:18 Daniel Vetter wrote:
> > > On Sun, Dec 13, 2009 at 12:30:25PM +0000, Arnd Bergmann wrote:
> > > > And now it's obvious that my computer hates me. 12 hours of
> > > > uptime, one reboot to check the old other version is broken, it
> > > > crashes. I reboot into the good version, send out the above
> > > > email and the next minute it crashes again. c05422d52ee6b is
> > > > not the culprit. Sorry Daniel for blaming your patch.
> > >
> > > No problem. Looks like your hunting a pretty ugly Heisenbug.
> > > There's quite a interesting blog post by Paul McKenney, esp. the
> > > solution to "Quick Quiz 1" might be usefull in your case:
> > >
> > > http://paulmck.livejournal.com/14639.html
> >
> > Thanks! In fact I've actually read that post on the kernel planet
> > and decided to do basically a linear search through the i915
> > patches merged into 2.6.32.
> >
> > The current result is 67cf781bea5 "drm/i915: Make the downclocking
> > debug code be under DRM_DEBUG not DRM_ERROR." is known bad, while
> > 043029655 "drm/i915: Support IGD EOS" is probably good, pointing to
> > Jesses 652c393a33 "drm/i915: add dynamic clock frequency control"
> > as the next best guess. Unfortunately, that is a rather large
> > change that is not easy to revert on current kernels.
>
> That seems the most likely, perhaps jbarnes can comment.
You can disable most of that code by loading i915 with 'powersave=0'.
If that patch really is at fault the powersave=0 should work around the
issue as well.
It's been implicated in another issue (some display flicker and
underruns) so I'm pretty sure there's something wrong with it in some
configurations at least...
--
Jesse Barnes, Intel Open Source Technology Center
On Monday 14 December 2009 18:20:15 Jesse Barnes wrote:
> You can disable most of that code by loading i915 with 'powersave=0'.
> If that patch really is at fault the powersave=0 should work around the
> issue as well.
Ok, I'll try that and let you know. Running the kernel before your
patch has not crashed yet after two days of uptime. Now running
with your patch but nothing else. When that crashes, I'll try
the latest mainline with powersave=0.
> It's been implicated in another issue (some display flicker and
> underruns) so I'm pretty sure there's something wrong with it in some
> configurations at least...
I haven't seen that yet. FWIW, the device in question is
00:02.0 VGA compatible controller: Intel Corporation 4 Series Chipset Integrated Graphics Controller (rev 03)
Subsystem: ASUSTeK Computer Inc. Device 8276
Flags: bus master, fast devsel, latency 0, IRQ 27
Memory at fe400000 (64-bit, non-prefetchable) [size=4M]
Memory at d0000000 (64-bit, prefetchable) [size=256M]
I/O ports at cc00 [size=8]
Capabilities: [90] Message Signalled Interrupts: Mask- 64bit- Queue=0/0 Enable+
Capabilities: [d0] Power Management version 2
Capabilities: [a4] PCIe advanced features <?>
Kernel driver in use: i915
00: 86 80 22 2e 07 04 90 00 03 00 00 03 00 00 80 00
10: 04 00 40 fe 00 00 00 00 0c 00 00 d0 00 00 00 00
20: 01 cc 00 00 00 00 00 00 00 00 00 00 43 10 76 82
30: 00 00 00 00 90 00 00 00 00 00 00 00 0a 01 00 00
00:02.1 Display controller: Intel Corporation 4 Series Chipset Integrated Graphics Controller (rev 03)
Subsystem: ASUSTeK Computer Inc. Device 8276
Flags: bus master, fast devsel, latency 0
Memory at fe800000 (64-bit, non-prefetchable) [size=1M]
Capabilities: [d0] Power Management version 2
00: 86 80 23 2e 07 00 90 00 03 00 80 03 00 00 80 00
10: 04 00 80 fe 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 43 10 76 82
30: 00 00 00 00 d0 00 00 00 00 00 00 00 00 00 00 00
Let me know if you have a patch you want me to test.
Thanks,
Arnd
On Mon, 14 Dec 2009 20:38:09 +0000
Arnd Bergmann <[email protected]> wrote:
> On Monday 14 December 2009 18:20:15 Jesse Barnes wrote:
> > You can disable most of that code by loading i915 with
> > 'powersave=0'. If that patch really is at fault the powersave=0
> > should work around the issue as well.
>
> Ok, I'll try that and let you know. Running the kernel before your
> patch has not crashed yet after two days of uptime. Now running
> with your patch but nothing else. When that crashes, I'll try
> the latest mainline with powersave=0.
Ok great.
> > It's been implicated in another issue (some display flicker and
> > underruns) so I'm pretty sure there's something wrong with it in
> > some configurations at least...
>
> I haven't seen that yet. FWIW, the device in question is
[snip 4 series pci info]
> Let me know if you have a patch you want me to test.
This patch removes the suspect portion of the dynamic clock control
code. Hopefully it'll be as stable as powersave=0 in your config
(assuming powersave=0 works :).
--
Jesse Barnes, Intel Open Source Technology Center
diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_d
index 279dc96..b8730de 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -3825,8 +3825,6 @@ void intel_decrease_renderclock(struct drm_device *dev)
/* Down to minimum... */
gcfgc &= ~GM45_GC_RENDER_CLOCK_MASK;
gcfgc |= GM45_GC_RENDER_CLOCK_266_MHZ;
-
- pci_write_config_word(dev->pdev, GCFGC, gcfgc);
} else if (IS_I965G(dev)) {
u16 gcfgc;
On Monday 14 December 2009, Jesse Barnes wrote:
> This patch removes the suspect portion of the dynamic clock control
> code. Hopefully it'll be as stable as powersave=0 in your config
> (assuming powersave=0 works :).
Ok, great! The machine is still running fine with powersave=0 so
far. I'll try your patch after some more uptime.
Arnd
> Jesse Barnes, Intel Open Source Technology Center
>
> diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_d
> index 279dc96..b8730de 100644
> --- a/drivers/gpu/drm/i915/intel_display.c
> +++ b/drivers/gpu/drm/i915/intel_display.c
> @@ -3825,8 +3825,6 @@ void intel_decrease_renderclock(struct drm_device *dev)
> /* Down to minimum... */
> gcfgc &= ~GM45_GC_RENDER_CLOCK_MASK;
> gcfgc |= GM45_GC_RENDER_CLOCK_266_MHZ;
> -
> - pci_write_config_word(dev->pdev, GCFGC, gcfgc);
> } else if (IS_I965G(dev)) {
> u16 gcfgc;
Hi,
> You can disable most of that code by loading i915 with 'powersave=0'.
> If that patch really is at fault the powersave=0 should work around the
> issue as well.
>
> It's been implicated in another issue (some display flicker and
> underruns) so I'm pretty sure there's something wrong with it in some
> configurations at least...
>
Could it be also that the powersave feature produces gpu hangs when
xserver is run the first time (after a cold boot). I have serious issues
the intel drm starting at v2.6.32-rc2 (I can not use the v2.6.32 series
on my laptop) while the same xorg stack works well with the v2.6.31
series. I try to bisect this problem and obtained this (sorry it is a
bit long):
git bisect start 'drivers/gpu' 'arch/x86_64/'
# bad: [053fe57ac249a9531c396175778160d9e9509399] Merge
git://git.kernel.org/pub
/scm/linux/kernel/git/davem/net-2.6
git bisect bad 053fe57ac249a9531c396175778160d9e9509399
# good: [74fca6a42863ffacaf7ba6f1936a9f228950f657] Linux 2.6.31
git bisect good 74fca6a42863ffacaf7ba6f1936a9f228950f657
# bad: [d8a2d0e00c0d5a0d55e14b884bff034205015e51] drm/i915: HDMI
hardware workar
ound for Ironlake
git bisect bad d8a2d0e00c0d5a0d55e14b884bff034205015e51
# bad: [d50ba256b5f1478e15accfcfda9b72fd7a661364] drm/kms: start adding
command
line interface using fb.
git bisect bad d50ba256b5f1478e15accfcfda9b72fd7a661364
# good: [551ebd837c75fc75df81811a18b7136c39cab487] drm/radeon/kms: add
rn50/r100
/r200 CS tracker.
git bisect good 551ebd837c75fc75df81811a18b7136c39cab487
# good: [fdd5cace733370ab7a518a98ef084e02aa76fdea] drm/radeon/kms: Don't
kzalloc
memory which is immediately overwritten.
git bisect good fdd5cace733370ab7a518a98ef084e02aa76fdea
# good: [698443d9ec1a33eff65b27b9514e06998bf57eb3] drm/radeon/kms:
disable VGA r
endering engine before taking over VRAM
git bisect good 698443d9ec1a33eff65b27b9514e06998bf57eb3
# good: [c214271563c00f2721c5111e27b53bf06dabc6e4] drm/radeon:
consolidate famil
y flags used in pciids.
git bisect good c214271563c00f2721c5111e27b53bf06dabc6e4
# good: [93dc6c2b0d97a55508144073838e041140b206cd] drm/edid: Detailed
standard t
iming blocks have six timings, not five.
git bisect good 93dc6c2b0d97a55508144073838e041140b206cd
# good: [4bbd4973703bf8a5f00f05eff30a99cd9814f37f] drm/radeon/kms:
enable r600 t
v outputs.
git bisect good 4bbd4973703bf8a5f00f05eff30a99cd9814f37f
# good: [513bcb4655e68706594e45dfa1d4b181500110ba] drm/radeon/kms: don't
require
up to 64k allocations. (v2)
git bisect good 513bcb4655e68706594e45dfa1d4b181500110ba
Regards
Mathieu
On Tuesday 15 December 2009, Arnd Bergmann wrote:
> On Monday 14 December 2009, Jesse Barnes wrote:
> > This patch removes the suspect portion of the dynamic clock control
> > code. Hopefully it'll be as stable as powersave=0 in your config
> > (assuming powersave=0 works :).
>
> Ok, great! The machine is still running fine with powersave=0 so
> far. I'll try your patch after some more uptime.
It's working fine so far, no more crashes, but I supposed this effectively
disables the power saving on my card again, right?
I should probably mention that this machine has had stability problems
before and showed single-bit memory corruption, but that problem
has disappeared after I installed bigger CPU and northbridge coolers
and increased the voltage slightly.
Arnd
On Wed, 16 Dec 2009 14:53:11 +0100
Arnd Bergmann <[email protected]> wrote:
> On Tuesday 15 December 2009, Arnd Bergmann wrote:
> > On Monday 14 December 2009, Jesse Barnes wrote:
> > > This patch removes the suspect portion of the dynamic clock
> > > control code. Hopefully it'll be as stable as powersave=0 in
> > > your config (assuming powersave=0 works :).
> >
> > Ok, great! The machine is still running fine with powersave=0 so
> > far. I'll try your patch after some more uptime.
>
> It's working fine so far, no more crashes, but I supposed this
> effectively disables the power saving on my card again, right?
The patch just disables one (probably ineffective) power saving
feature. So if things are working well for you with it I'll queue up a
revert patch. I'm working on a better version of dynamic clock control
anyway.
--
Jesse Barnes, Intel Open Source Technology Center
On Wednesday 16 December 2009 20:18:23 Jesse Barnes wrote:
> On Wed, 16 Dec 2009 14:53:11 +0100
> Arnd Bergmann <[email protected]> wrote:
>
> > It's working fine so far, no more crashes, but I supposed this
> > effectively disables the power saving on my card again, right?
>
> The patch just disables one (probably ineffective) power saving
> feature. So if things are working well for you with it I'll queue up a
> revert patch. I'm working on a better version of dynamic clock control
> anyway.
It just crashed again after a few hours of uptime with some unrelated
reboots in-between, with your patch applied.
The symptom was slightly different, now the whole screen was filled
with random patterns, not just parts of the screen. Aside from that, it
was just the same complete lockup as without the patch. Possibly less
frequent, but that's hard to tell after a single sample.
Arnd
On Wed, 16 Dec 2009 21:20:34 +0000
Arnd Bergmann <[email protected]> wrote:
> On Wednesday 16 December 2009 20:18:23 Jesse Barnes wrote:
> > On Wed, 16 Dec 2009 14:53:11 +0100
> > Arnd Bergmann <[email protected]> wrote:
> >
> > > It's working fine so far, no more crashes, but I supposed this
> > > effectively disables the power saving on my card again, right?
> >
> > The patch just disables one (probably ineffective) power saving
> > feature. So if things are working well for you with it I'll queue
> > up a revert patch. I'm working on a better version of dynamic
> > clock control anyway.
>
> It just crashed again after a few hours of uptime with some unrelated
> reboots in-between, with your patch applied.
>
> The symptom was slightly different, now the whole screen was filled
> with random patterns, not just parts of the screen. Aside from that,
> it was just the same complete lockup as without the patch. Possibly
> less frequent, but that's hard to tell after a single sample.
But you're sure powersave=0 was solid? Hmm...
--
Jesse Barnes, Intel Open Source Technology Center
On Wednesday 16 December 2009 21:30:05 Jesse Barnes wrote:
> But you're sure powersave=0 was solid? Hmm...
It's hard to be sure when it sometimes takes a day before a
broken version crashes. I can keep running this kernel with and
without powersave=0 some more and tell you if it stays reproducible.
Arnd
On Wednesday 16 December 2009 21:36:36 Arnd Bergmann wrote:
> On Wednesday 16 December 2009 21:30:05 Jesse Barnes wrote:
> > But you're sure powersave=0 was solid? Hmm...
>
> It's hard to be sure when it sometimes takes a day before a
> broken version crashes. I can keep running this kernel with and
> without powersave=0 some more and tell you if it stays reproducible.
Now it has crashed with i915.powersave=0 plus your patch as well
(latest 2.6.32 git), indicating that there is something else wrong
with the original 652c393a33 patch.
Arnd
On Wed, 16 Dec 2009 22:41:27 +0000
Arnd Bergmann <[email protected]> wrote:
> On Wednesday 16 December 2009 21:36:36 Arnd Bergmann wrote:
> > On Wednesday 16 December 2009 21:30:05 Jesse Barnes wrote:
> > > But you're sure powersave=0 was solid? Hmm...
> >
> > It's hard to be sure when it sometimes takes a day before a
> > broken version crashes. I can keep running this kernel with and
> > without powersave=0 some more and tell you if it stays reproducible.
>
> Now it has crashed with i915.powersave=0 plus your patch as well
> (latest 2.6.32 git), indicating that there is something else wrong
> with the original 652c393a33 patch.
It does very little else that should affect things. You're sure
reverting the commit makes things ok?
Other potential problems:
- clock gating (the call to intel_init_clock_gating)
- the actual mark_busy stuff itself (calls to intel_mark_busy)
- intel_idle_update (but powersave=0 should prevent that)
If you want to keep testing you could try removing those calls...
--
Jesse Barnes, Intel Open Source Technology Center
On Thursday 17 December 2009, Jesse Barnes wrote:
> It does very little else that should affect things. You're sure
> reverting the commit makes things ok?
No, not sure. But I've been running the kernel before that commit
for days without ever seeing a crash. 2.6.31 (also without that
commit but otherwise pretty similar) has been stable for weeks
weeks on the same box.
Once any given kernel version crashes, which can take up to
a day while I'm waiting for the bug to show up, it will typically
crash on that same kernel again within seconds after boot,
just to annoy me and prevent me from using that kernel for
other things.
> Other potential problems:
> - clock gating (the call to intel_init_clock_gating)
> - the actual mark_busy stuff itself (calls to intel_mark_busy)
> - intel_idle_update (but powersave=0 should prevent that)
>
> If you want to keep testing you could try removing those calls...
Ok, I'll try that. The problem is keeping me from doing any useful
upstream kernel work on my main machine, so I'm rather motivated ;-)
Arnd
On Thu, 17 Dec 2009 18:52:02 +0100
Arnd Bergmann <[email protected]> wrote:
> On Thursday 17 December 2009, Jesse Barnes wrote:
> > It does very little else that should affect things. You're sure
> > reverting the commit makes things ok?
>
> No, not sure. But I've been running the kernel before that commit
> for days without ever seeing a crash. 2.6.31 (also without that
> commit but otherwise pretty similar) has been stable for weeks
> weeks on the same box.
>
> Once any given kernel version crashes, which can take up to
> a day while I'm waiting for the bug to show up, it will typically
> crash on that same kernel again within seconds after boot,
> just to annoy me and prevent me from using that kernel for
> other things.
>
> > Other potential problems:
> > - clock gating (the call to intel_init_clock_gating)
> > - the actual mark_busy stuff itself (calls to intel_mark_busy)
> > - intel_idle_update (but powersave=0 should prevent that)
> >
> > If you want to keep testing you could try removing those calls...
>
> Ok, I'll try that. The problem is keeping me from doing any useful
> upstream kernel work on my main machine, so I'm rather motivated ;-)
Another patch to try...
diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 279dc96..b8ca398 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -3779,125 +3779,6 @@ static void intel_gpu_idle_timer(unsigned long arg)
queue_work(dev_priv->wq, &dev_priv->idle_work);
}
-void intel_increase_renderclock(struct drm_device *dev, bool schedule)
-{
- drm_i915_private_t *dev_priv = dev->dev_private;
-
- if (IS_IRONLAKE(dev))
- return;
-
- if (!dev_priv->render_reclock_avail) {
- DRM_DEBUG_DRIVER("not reclocking render clock\n");
- return;
- }
-
- /* Restore render clock frequency to original value */
- if (IS_G4X(dev) || IS_I9XX(dev))
- pci_write_config_word(dev->pdev, GCFGC, dev_priv->orig_clock);
- else if (IS_I85X(dev))
- pci_write_config_word(dev->pdev, HPLLCC, dev_priv->orig_clock);
- DRM_DEBUG_DRIVER("increasing render clock frequency\n");
-
- /* Schedule downclock */
- if (schedule)
- mod_timer(&dev_priv->idle_timer, jiffies +
- msecs_to_jiffies(GPU_IDLE_TIMEOUT));
-}
-
-void intel_decrease_renderclock(struct drm_device *dev)
-{
- drm_i915_private_t *dev_priv = dev->dev_private;
-
- if (IS_IRONLAKE(dev))
- return;
-
- if (!dev_priv->render_reclock_avail) {
- DRM_DEBUG_DRIVER("not reclocking render clock\n");
- return;
- }
-
- if (IS_G4X(dev)) {
- u16 gcfgc;
-
- /* Adjust render clock... */
- pci_read_config_word(dev->pdev, GCFGC, &gcfgc);
-
- /* Down to minimum... */
- gcfgc &= ~GM45_GC_RENDER_CLOCK_MASK;
- gcfgc |= GM45_GC_RENDER_CLOCK_266_MHZ;
-
- pci_write_config_word(dev->pdev, GCFGC, gcfgc);
- } else if (IS_I965G(dev)) {
- u16 gcfgc;
-
- /* Adjust render clock... */
- pci_read_config_word(dev->pdev, GCFGC, &gcfgc);
-
- /* Down to minimum... */
- gcfgc &= ~I965_GC_RENDER_CLOCK_MASK;
- gcfgc |= I965_GC_RENDER_CLOCK_267_MHZ;
-
- pci_write_config_word(dev->pdev, GCFGC, gcfgc);
- } else if (IS_I945G(dev) || IS_I945GM(dev)) {
- u16 gcfgc;
-
- /* Adjust render clock... */
- pci_read_config_word(dev->pdev, GCFGC, &gcfgc);
-
- /* Down to minimum... */
- gcfgc &= ~I945_GC_RENDER_CLOCK_MASK;
- gcfgc |= I945_GC_RENDER_CLOCK_166_MHZ;
-
- pci_write_config_word(dev->pdev, GCFGC, gcfgc);
- } else if (IS_I915G(dev)) {
- u16 gcfgc;
-
- /* Adjust render clock... */
- pci_read_config_word(dev->pdev, GCFGC, &gcfgc);
-
- /* Down to minimum... */
- gcfgc &= ~I915_GC_RENDER_CLOCK_MASK;
- gcfgc |= I915_GC_RENDER_CLOCK_166_MHZ;
-
- pci_write_config_word(dev->pdev, GCFGC, gcfgc);
- } else if (IS_I85X(dev)) {
- u16 hpllcc;
-
- /* Adjust render clock... */
- pci_read_config_word(dev->pdev, HPLLCC, &hpllcc);
-
- /* Up to maximum... */
- hpllcc &= ~GC_CLOCK_CONTROL_MASK;
- hpllcc |= GC_CLOCK_133_200;
-
- pci_write_config_word(dev->pdev, HPLLCC, hpllcc);
- }
- DRM_DEBUG_DRIVER("decreasing render clock frequency\n");
-}
-
-/* Note that no increase function is needed for this - increase_renderclock()
- * will also rewrite these bits
- */
-void intel_decrease_displayclock(struct drm_device *dev)
-{
- if (IS_IRONLAKE(dev))
- return;
-
- if (IS_I945G(dev) || IS_I945GM(dev) || IS_I915G(dev) ||
- IS_I915GM(dev)) {
- u16 gcfgc;
-
- /* Adjust render clock... */
- pci_read_config_word(dev->pdev, GCFGC, &gcfgc);
-
- /* Down to minimum... */
- gcfgc &= ~0xf0;
- gcfgc |= 0x80;
-
- pci_write_config_word(dev->pdev, GCFGC, gcfgc);
- }
-}
-
#define CRTC_IDLE_TIMEOUT 1000 /* ms */
static void intel_crtc_idle_timer(unsigned long arg)
@@ -4011,12 +3892,6 @@ static void intel_idle_update(struct work_struct *work)
mutex_lock(&dev->struct_mutex);
- /* GPU isn't processing, downclock it. */
- if (!dev_priv->busy) {
- intel_decrease_renderclock(dev);
- intel_decrease_displayclock(dev);
- }
-
list_for_each_entry(crtc, &dev->mode_config.crtc_list, head) {
/* Skip inactive CRTCs */
if (!crtc->fb)
@@ -4050,13 +3925,11 @@ void intel_mark_busy(struct drm_device *dev, struct drm_gem_object *obj)
if (!drm_core_check_feature(dev, DRIVER_MODESET))
return;
- if (!dev_priv->busy) {
+ if (!dev_priv->busy)
dev_priv->busy = true;
- intel_increase_renderclock(dev, true);
- } else {
+ else
mod_timer(&dev_priv->idle_timer, jiffies +
msecs_to_jiffies(GPU_IDLE_TIMEOUT));
- }
list_for_each_entry(crtc, &dev->mode_config.crtc_list, head) {
if (!crtc->fb)
@@ -4770,7 +4643,6 @@ void intel_modeset_cleanup(struct drm_device *dev)
del_timer_sync(&intel_crtc->idle_timer);
}
- intel_increase_renderclock(dev, false);
del_timer_sync(&dev_priv->idle_timer);
if (dev_priv->display.disable_fbc)
On Thursday 17 December 2009 19:13:59 Jesse Barnes wrote:
> > Ok, I'll try that. The problem is keeping me from doing any useful
> > upstream kernel work on my main machine, so I'm rather motivated ;-)
>
> Another patch to try...
Thanks!
It's running fine so far, uptime 90 minutes together with powersave=0.
I'll keep the machine running and tell you if it crashes again later.
Arnd
On Thursday 17 December 2009 19:13:59 Jesse Barnes wrote:
> Another patch to try...
The machine has been working fine for almost two days with this.
> -
> - if (IS_G4X(dev)) {
> - u16 gcfgc;
> -
> - /* Adjust render clock... */
> - pci_read_config_word(dev->pdev, GCFGC, &gcfgc);
> -
> - /* Down to minimum... */
> - gcfgc &= ~GM45_GC_RENDER_CLOCK_MASK;
> - gcfgc |= GM45_GC_RENDER_CLOCK_266_MHZ;
> -
> - pci_write_config_word(dev->pdev, GCFGC, gcfgc);
Interestingly, I could not even find GCFGC in the public documentation,
but it also doesn't seem to be what it was in older chips.
The value in there is 0x0025, which does not make any sense whatsoever
according to the definitions in the header file. Writing
GM45_GC_RENDER_CLOCK_266_MHZ in there manually does not seem to have an
effect either, the machine does not crash and the value still remains
at 0x0025.
I tried "while true ; do sudo setpci -s 00:02.1 f0.W=0028 ; done".
I'll try the patch below next, not touching the register at all.
> -/* Note that no increase function is needed for this - increase_renderclock()
> - * will also rewrite these bits
> - */
> -void intel_decrease_displayclock(struct drm_device *dev)
> -{
> - if (IS_IRONLAKE(dev))
> - return;
> -
> - if (IS_I945G(dev) || IS_I945GM(dev) || IS_I915G(dev) ||
> - IS_I915GM(dev)) {
> - u16 gcfgc;
> -
> - /* Adjust render clock... */
> - pci_read_config_word(dev->pdev, GCFGC, &gcfgc);
> -
> - /* Down to minimum... */
> - gcfgc &= ~0xf0;
> - gcfgc |= 0x80;
> -
> - pci_write_config_word(dev->pdev, GCFGC, gcfgc);
> - }
> -}
This does not do anything on G4X, right? I actually put this function
back in (still doesn't crash), but then I noticed that the registers
are not accessed at all on my machine in this function.
Arnd
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -3792,7 +3792,7 @@ void intel_increase_renderclock(struct drm_device *dev, bool schedule)
}
/* Restore render clock frequency to original value */
- if (IS_G4X(dev) || IS_I9XX(dev))
+ if (IS_I9XX(dev))
pci_write_config_word(dev->pdev, GCFGC, dev_priv->orig_clock);
else if (IS_I85X(dev))
pci_write_config_word(dev->pdev, HPLLCC, dev_priv->orig_clock);
@@ -3816,18 +3816,7 @@ void intel_decrease_renderclock(struct drm_device *dev)
return;
}
- if (IS_G4X(dev)) {
- u16 gcfgc;
-
- /* Adjust render clock... */
- pci_read_config_word(dev->pdev, GCFGC, &gcfgc);
-
- /* Down to minimum... */
- gcfgc &= ~GM45_GC_RENDER_CLOCK_MASK;
- gcfgc |= GM45_GC_RENDER_CLOCK_266_MHZ;
-
- pci_write_config_word(dev->pdev, GCFGC, gcfgc);
- } else if (IS_I965G(dev)) {
+ if (IS_I965G(dev)) {
u16 gcfgc;
/* Adjust render clock... */
On Monday 21 December 2009 16:06:43 Arnd Bergmann wrote:
>
> - if (IS_G4X(dev)) {
> - u16 gcfgc;
> -
> - /* Adjust render clock... */
> - pci_read_config_word(dev->pdev, GCFGC, &gcfgc);
> -
> - /* Down to minimum... */
> - gcfgc &= ~GM45_GC_RENDER_CLOCK_MASK;
> - gcfgc |= GM45_GC_RENDER_CLOCK_266_MHZ;
> -
> - pci_write_config_word(dev->pdev, GCFGC, gcfgc);
> - } else if (IS_I965G(dev)) {
> + if (IS_I965G(dev)) {
> u16 gcfgc;
This crashed again immediately, apparently because IS_I965G is a superset
of IS_G4X (that seems counterintuitive), but not it seems to run using
this modified version of the patch (half an hour so far).
Arnd
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -3783,7 +3783,7 @@ void intel_increase_renderclock(struct drm_device *dev, bool schedule)
{
drm_i915_private_t *dev_priv = dev->dev_private;
- if (IS_IRONLAKE(dev))
+ if (IS_IRONLAKE(dev) || IS_G4X(dev))
return;
if (!dev_priv->render_reclock_avail) {
@@ -3792,7 +3792,7 @@ void intel_increase_renderclock(struct drm_device *dev, bool schedule)
}
/* Restore render clock frequency to original value */
- if (IS_G4X(dev) || IS_I9XX(dev))
+ if (IS_I9XX(dev))
pci_write_config_word(dev->pdev, GCFGC, dev_priv->orig_clock);
else if (IS_I85X(dev))
pci_write_config_word(dev->pdev, HPLLCC, dev_priv->orig_clock);
@@ -3808,7 +3808,7 @@ void intel_decrease_renderclock(struct drm_device *dev)
{
drm_i915_private_t *dev_priv = dev->dev_private;
- if (IS_IRONLAKE(dev))
+ if (IS_IRONLAKE(dev) || IS_G4X(dev))
return;
if (!dev_priv->render_reclock_avail) {
@@ -3816,16 +3816,7 @@ void intel_decrease_renderclock(struct drm_device *dev)
return;
}
- if (IS_G4X(dev)) {
- u16 gcfgc;
-
- /* Adjust render clock... */
- pci_read_config_word(dev->pdev, GCFGC, &gcfgc);
-
- /* Down to minimum... */
- gcfgc &= ~GM45_GC_RENDER_CLOCK_MASK;
- gcfgc |= GM45_GC_RENDER_CLOCK_266_MHZ;
- } else if (IS_I965G(dev)) {
+ if (IS_I965G(dev)) {
u16 gcfgc;
/* Adjust render clock... */
On Mon, 21 Dec 2009 17:57:08 +0000
Arnd Bergmann <[email protected]> wrote:
> On Monday 21 December 2009 16:06:43 Arnd Bergmann wrote:
> >
> > - if (IS_G4X(dev)) {
> > - u16 gcfgc;
> > -
> > - /* Adjust render clock... */
> > - pci_read_config_word(dev->pdev, GCFGC, &gcfgc);
> > -
> > - /* Down to minimum... */
> > - gcfgc &= ~GM45_GC_RENDER_CLOCK_MASK;
> > - gcfgc |= GM45_GC_RENDER_CLOCK_266_MHZ;
> > -
> > - pci_write_config_word(dev->pdev, GCFGC, gcfgc);
> > - } else if (IS_I965G(dev)) {
> > + if (IS_I965G(dev)) {
> > u16 gcfgc;
>
> This crashed again immediately, apparently because IS_I965G is a
> superset of IS_G4X (that seems counterintuitive), but not it seems to
> run using this modified version of the patch (half an hour so far).
>
This code should be completely gone in Eric's drm-intel-next branch
(from
git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel.git), so
hopefully when that lands upstream (if it hasn't already) things will
be solid for you.
Thanks,
Jesse
On Monday 28 December 2009, Jesse Barnes wrote:
> This code should be completely gone in Eric's drm-intel-next branch
> (from
> git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel.git), so
> hopefully when that lands upstream (if it hasn't already) things will
> be solid for you.
Unfortunately, it seems it did not help. I could also reproduce the hang
with your patch that I reported working for two days before Christmas,
and with the version of that patch that I posted later. I'm running
04302965 again now, and try to apply other parts of the patch that
introduced the regression.
Arnd