I started wondering why 'top' was showing an otherwise idle system as
having a load average of 0.5+, and worker threads constantly using the
CPU.
So I did a system-wide profile, and got the attached output (look at
it in a really wide terminal).
There seems to be something _seriously_ wrong with i915 SDVO detect.
This is on an Apple Mac Mini (hey, your favorite problem child!), and
apparently it spends 20% of its non-idle CPU time just doing udelay's
for the i2c SDVO connection detection.
That sounds a bit wrong, doesn't it?
I don't know how recent this is - it might have been going on for some
time without me noticing. It's the wife's computer, and the same thing
doesn't seem to happen on my Core i5 desktop
Any ideas? Any information I can give about the machine?
Linus
On Sun, Aug 15, 2010 at 07:32:15PM -0700, Linus Torvalds wrote:
> I started wondering why 'top' was showing an otherwise idle system as
> having a load average of 0.5+, and worker threads constantly using the
> CPU.
>
> So I did a system-wide profile, and got the attached output (look at
> it in a really wide terminal).
>
> There seems to be something _seriously_ wrong with i915 SDVO detect.
> This is on an Apple Mac Mini (hey, your favorite problem child!), and
> apparently it spends 20% of its non-idle CPU time just doing udelay's
> for the i2c SDVO connection detection.
>
> That sounds a bit wrong, doesn't it?
>
> I don't know how recent this is - it might have been going on for some
> time without me noticing. It's the wife's computer, and the same thing
> doesn't seem to happen on my Core i5 desktop
>
> Any ideas? Any information I can give about the machine?
You're not alone. I'm seeing similar things on my machine (AMD,Radeon).
The monitor outputs are polled every 10 sec in drivers/gpu/drm/drm_crtc_helper.c.
There is a patch attached to the following bug report that disables
polling and solves the problem for me:
http://bugs.freedesktop.org/show_bug.cgi?id=29536
--
?A man who doesn't know he is in prison can never escape.?
William S. Burroughs
On Sun, 2010-08-15 at 19:32 -0700, Linus Torvalds wrote:
> I started wondering why 'top' was showing an otherwise idle system as
> having a load average of 0.5+, and worker threads constantly using the
> CPU.
>
> So I did a system-wide profile, and got the attached output (look at
> it in a really wide terminal).
>
> There seems to be something _seriously_ wrong with i915 SDVO detect.
> This is on an Apple Mac Mini (hey, your favorite problem child!), and
> apparently it spends 20% of its non-idle CPU time just doing udelay's
> for the i2c SDVO connection detection.
>
> That sounds a bit wrong, doesn't it?
At least we should replace mdelay with msleep in those functions.
Can you get a boot with drm.debug=4?
I wonder are we picking up a bad SDVO, the insane code retries 50 times
with a hard loop delay.
Dave.
On Sun, Aug 15, 2010 at 8:30 PM, Dave Airlie <[email protected]> wrote:
>
> At least we should replace mdelay with msleep in those functions.
How precise does the timing have to be? I think i2c is self-clocking,
so it's ok to see big skews? Becuase msleep() can be off by quite a
bit (mdelay can too, but it's _way_ more rare, and requires either a
preemptible kernel or serious interrupt activity).
> Can you get a boot with drm.debug=4?
Sure. That results in a truncated dmesg (with a 128k buffer). The
thing seems to spew out something every ten seconds:
...
[ 232.610044] [drm:intel_sdvo_debug_write], SDVOB: W: 0B
(SDVO_CMD_GET_ATTACHED_DISPLAYS)
[ 232.624504] [drm:intel_sdvo_debug_response], SDVOB: R: 01 00
(Success)
[ 232.624517] [drm:intel_sdvo_detect], SDVO response 1 0
[ 232.624524] [drm:intel_sdvo_debug_write], SDVOB: W: 7A 01
(SDVO_CMD_SET_CONTROL_BUS_SWITCH)
[ 242.672044] [drm:intel_sdvo_debug_write], SDVOB: W: 0B
(SDVO_CMD_GET_ATTACHED_DISPLAYS)
[ 242.686503] [drm:intel_sdvo_debug_response], SDVOB: R: 01 00
(Success)
[ 242.686516] [drm:intel_sdvo_detect], SDVO response 1 0
[ 242.686523] [drm:intel_sdvo_debug_write], SDVOB: W: 7A 01
(SDVO_CMD_SET_CONTROL_BUS_SWITCH)
[ 252.750044] [drm:intel_sdvo_debug_write], SDVOB: W: 0B
(SDVO_CMD_GET_ATTACHED_DISPLAYS)
[ 252.764526] [drm:intel_sdvo_debug_response], SDVOB: R: 01 00
(Success)
[ 252.764539] [drm:intel_sdvo_detect], SDVO response 1 0
...
It looks like it takes about 15 ms each time. But 15 ms each 10s
doesn't seem to be enough to account for the load average. Maybe it
gets synchronized with the timer tick or something, causing the load
average to look artificially inflated (it also doesn't match up with
kworker using 1%+ CPU time).
So maybe there is something else going on. Maybe the load average
thing comes from some interaction with the new workqueue thing.
I'll send the whole dmesg to you in a private message, I don't think
we want 128kB of crud on lkml.
> I wonder are we picking up a bad SDVO, the insane code retries 50 times
> with a hard loop delay.
Well, I'm not seeing any failures, but maybe those don't get printed out?
Linus
Linus Torvalds wrote:
> I started wondering why 'top' was showing an otherwise idle system as
> having a load average of 0.5+, and worker threads constantly using the
> CPU.
>
> So I did a system-wide profile, and got the attached output (look at
> it in a really wide terminal).
>
> There seems to be something _seriously_ wrong with i915 SDVO detect.
> This is on an Apple Mac Mini (hey, your favorite problem child!), and
> apparently it spends 20% of its non-idle CPU time just doing udelay's
> for the i2c SDVO connection detection.
You might be hitting the infamous hotplug storm [1]. The symptoms vary
by kernel version.
2.6.34 and before: udevadm --monitor shows craploads of events and, as
long as X is running, X keeps reprobing the outputs which (depending on
the particular bug) can suck cpu in the i2c code or cause more hotplug
events. It also makes X oddly laggy.
2.6.35 and newer: The kernel is smart enough to probe outputs itself
before telling X, so the events never hit userspace. But things still
can get a bit laggy.
Anyone know why merely *reading* /sys/class/drm/whatever/status causes
the output to get probed? (I see it in the code, but I have no idea why
that code's still there after most of the rest of the hotplug code got
cleaned up in 2.6.36).
Once I find some free time, I plan on trying to at least fix the issue
that causes this bug for me. (It's apparently quite nontrivial due to
silliness in the way dock/undock (!) works on some laptops.)
[1] for example:
http://www.mail-archive.com/[email protected]/msg00921.html
>
> That sounds a bit wrong, doesn't it?
>
> I don't know how recent this is - it might have been going on for some
> time without me noticing. It's the wife's computer, and the same thing
> doesn't seem to happen on my Core i5 desktop
>
> Any ideas? Any information I can give about the machine?
If I'm right, the outputs of intel_bios_dumper and intel_bios_reader
could be instructive (both are in intel-gpu-tools).
You could also try intel_reg_write 0x61110 0x0 and see if the problem
stops (at least until a suspend/resume cycle). That command turns off
output hotplug on the card, which has the side effect that the kernel
will stop acting on bogus interrupts.
--Andy
On Sun, Aug 15, 2010 at 9:06 PM, Andy Lutomirski <[email protected]> wrote:
>
> You might be hitting the infamous hotplug storm [1]. ?The symptoms vary by
> kernel version.
Hmm. I don't think it's a storm. The drm.debug=4 thing shows things
just every 10 seconds. That seems pretty controlled.
Of course, it seems to be several milliseconds worth of work then, so
it's not _too_ controlled. I must be missing some detail.
> If I'm right, the outputs of intel_bios_dumper and intel_bios_reader could
> be instructive (both are in intel-gpu-tools).
My version of intel-gpu-tools must be old (fedora 12). It has
intel_gpu_dump and intel_reg_write and some other apps, but not the
bios-dumper/reader.
Not that I could read the output of them anyway, I bet ;)
> You could also try intel_reg_write 0x61110 0x0 and see if the problem stops
> (at least until a suspend/resume cycle). ?That command turns off output
> hotplug on the card, which has the side effect that the kernel will stop
> acting on bogus interrupts.
# intel_reg_write 0x61110 0x0
Value before: 0x4000220
Value after: 0x0
but it doesn't seem to change any behavior. Still that sdvo probe
every 10 s, and still 1% of CPU for kworker in top. But that kworker
thing definitely has to be separate from the 10-second sdvo probe,
because it shows up all the time (ie top updates every second).
But right now the wife is using that machine, so I can't check the "it
should be totally idle, what the heck is going on" behavior at the
moment.
Linus
Linus
On Sun, 2010-08-15 at 21:01 -0700, Linus Torvalds wrote:
> On Sun, Aug 15, 2010 at 8:30 PM, Dave Airlie <[email protected]> wrote:
> >
> > At least we should replace mdelay with msleep in those functions.
>
> How precise does the timing have to be? I think i2c is self-clocking,
> so it's ok to see big skews? Becuase msleep() can be off by quite a
> bit (mdelay can too, but it's _way_ more rare, and requires either a
> preemptible kernel or serious interrupt activity).
In this case it definitely doesn't matter, I expect msleep to be a much
nicer to the system in general idea esp in routines we can all at
runtime from userspace.
>
> > Can you get a boot with drm.debug=4?
>
> Sure. That results in a truncated dmesg (with a 128k buffer). The
> thing seems to spew out something every ten seconds:
>
> ...
> [ 232.610044] [drm:intel_sdvo_debug_write], SDVOB: W: 0B
> (SDVO_CMD_GET_ATTACHED_DISPLAYS)
> [ 232.624504] [drm:intel_sdvo_debug_response], SDVOB: R: 01 00
> (Success)
> [ 232.624517] [drm:intel_sdvo_detect], SDVO response 1 0
> [ 232.624524] [drm:intel_sdvo_debug_write], SDVOB: W: 7A 01
> (SDVO_CMD_SET_CONTROL_BUS_SWITCH)
> [ 242.672044] [drm:intel_sdvo_debug_write], SDVOB: W: 0B
> (SDVO_CMD_GET_ATTACHED_DISPLAYS)
> [ 242.686503] [drm:intel_sdvo_debug_response], SDVOB: R: 01 00
> (Success)
> [ 242.686516] [drm:intel_sdvo_detect], SDVO response 1 0
> [ 242.686523] [drm:intel_sdvo_debug_write], SDVOB: W: 7A 01
> (SDVO_CMD_SET_CONTROL_BUS_SWITCH)
> [ 252.750044] [drm:intel_sdvo_debug_write], SDVOB: W: 0B
> (SDVO_CMD_GET_ATTACHED_DISPLAYS)
> [ 252.764526] [drm:intel_sdvo_debug_response], SDVOB: R: 01 00
> (Success)
> [ 252.764539] [drm:intel_sdvo_detect], SDVO response 1 0
> ...
>
> It looks like it takes about 15 ms each time. But 15 ms each 10s
> doesn't seem to be enough to account for the load average. Maybe it
> gets synchronized with the timer tick or something, causing the load
> average to look artificially inflated (it also doesn't match up with
> kworker using 1%+ CPU time).
>
> So maybe there is something else going on. Maybe the load average
> thing comes from some interaction with the new workqueue thing.
>
> I'll send the whole dmesg to you in a private message, I don't think
> we want 128kB of crud on lkml.
>
> > I wonder are we picking up a bad SDVO, the insane code retries 50 times
> > with a hard loop delay.
>
> Well, I'm not seeing any failures, but maybe those don't get printed out?
Oh wierd, so not where I thought it was, I expect then the SDVO HDMI
detection is completely insane,
intel_sdvo_hdmi_sink_detect looks to contain some really uninspiring
code. Might be worth adding some debug in there to see if it sinks a lot
of time.
Dave.
[cc: intel-gfx]
Linus Torvalds wrote:
> On Sun, Aug 15, 2010 at 9:06 PM, Andy Lutomirski <[email protected]> wrote:
>> You might be hitting the infamous hotplug storm [1]. The symptoms vary by
>> kernel version.
>
> Hmm. I don't think it's a storm. The drm.debug=4 thing shows things
> just every 10 seconds. That seems pretty controlled.
OK, probably not the hotplug storm. That's always been at least once per
second and sometimes several for me.
>
> Of course, it seems to be several milliseconds worth of work then, so
> it's not _too_ controlled. I must be missing some detail.
>
>> If I'm right, the outputs of intel_bios_dumper and intel_bios_reader could
>> be instructive (both are in intel-gpu-tools).
>
> My version of intel-gpu-tools must be old (fedora 12). It has
> intel_gpu_dump and intel_reg_write and some other apps, but not the
> bios-dumper/reader.
>
> Not that I could read the output of them anyway, I bet ;)
>
>> You could also try intel_reg_write 0x61110 0x0 and see if the problem stops
>> (at least until a suspend/resume cycle). That command turns off output
>> hotplug on the card, which has the side effect that the kernel will stop
>> acting on bogus interrupts.
>
> # intel_reg_write 0x61110 0x0
> Value before: 0x4000220
> Value after: 0x0
>
> but it doesn't seem to change any behavior. Still that sdvo probe
> every 10 s, and still 1% of CPU for kworker in top. But that kworker
> thing definitely has to be separate from the 10-second sdvo probe,
> because it shows up all the time (ie top updates every second).
Almost certainly not the hotplug storm bug.
The 10 seconds is probably the timer in output_poll_execute (drm_crtc_helper.c):
#define DRM_OUTPUT_POLL_PERIOD (10*HZ)
The code in output_poll_execute (which is in your perf output) looks wrong:
list_for_each_entry(connector, &dev->mode_config.connector_list, head) {
/* if this is HPD or polled don't check it -
TV out for instance */
if (!connector->polled)
continue;
else if (connector->polled & (DRM_CONNECTOR_POLL_CONNECT | DRM_CONNECTOR_POLL_DISCONNECT))
repoll = true;
[ so if we have DRM_CONNECTOR_POLL_CONNECT set we requeue the *entire*
work item. or if we have DRM_CONNECTOR_POLL_DISCONNECT and it's not
even connected, we still poll. ]
old_status = connector->status;
/* if we are connected and don't want to poll for disconnect
skip it */
if (old_status == connector_status_connected &&
!(connector->polled & DRM_CONNECTOR_POLL_DISCONNECT) &&
!(connector->polled & DRM_CONNECTOR_POLL_HPD))
continue;
[ but if the status isn't connector_status_connected, we poll it even if we're
only in here due to a *different* connector ]
status = connector->funcs->detect(connector);
[ ->detect is the expensive part ]
if (old_status != status)
changed = true;
}
But that's really the problem, because intel_sdvo_dvi_init contains:
connector->polled = DRM_CONNECTOR_POLL_CONNECT | DRM_CONNECTOR_POLL_DISCONNECT;
I don't know if SDVO is supposed to send hotplug interrupts because that section of the docs is incomplete.
--Andy
On Mon, 2010-08-16 at 10:27 -0400, Andy Lutomirski wrote:
> But that's really the problem, because intel_sdvo_dvi_init contains:
>
> connector->polled = DRM_CONNECTOR_POLL_CONNECT | DRM_CONNECTOR_POLL_DISCONNECT;
>
> I don't know if SDVO is supposed to send hotplug interrupts because
> that section of the docs is incomplete.
SDVO can generate interrupts, although a particular SDVO output type
might not. (The G45 docs are missing this section, but the 965 docs are
applicable.) We don't have that turned on, of course, because that
would be entirely too sensible.
You'd probably need to uncomment the #if 0'd intel_sdvo_set_hotplug() in
intel_sdvo.c, and call that from the appropriate bit of
intel_sdvo_init(). For pre-gen4 you'd probably also need to set
SDVO_INTERRUPT_ENABLE in the appropriate SDVOx register, but in gen4+
that bit is used for bpc selection.
Having done that you'd then need to wait the inevitable six months to
find out which SDVO devices actually have functioning hotplug and which
ones are just plain lying about it.
- ajax