2009-06-15 00:33:24

by Arjan van de Ven

[permalink] [raw]
Subject: kerneloops.org report for the week of June 14 2009

Hi,

First, a word of thanks to the kernel.org team for helping to host the kerneloops.org database
on a real machine rather than a small virtualized host. Performance and continuity will be a lot
better in the new setup!
(as a result of the transition and hosting issues on the original server, there was a downtime of
two days, meaning that the total number of reports is lower than typical)


New sightings this report include the XFI driver, the i915 gem code and the DMAR code.


This week, a total of 4026 oopses and warnings have been reported
for kernels 2.6.29 and later. Reports for kernels earlier than 2.6.29
have been omitted from this report.

Per file statistics
830 drivers/gpu/drm/i915/i915_gem_tiling.c
470 drivers/net/wireless/iwlwifi/iwl-sta.c
303 kernel/time.c
276 drivers/parport/procfs.c
269 include/linux/slob_def.h
187 include/linux/hrtimer.h
138 arch/x86/kernel/cpu/mtrr/generic.c
120 drivers/pci/dmar.c
65 net/mac80211/rc80211_minstrel.c
62 drivers/net/r8169.c
37 net/ipv4/tcp.c
34 fs/cifs/cifssmb.c


Rank 1: i915_gem_set_tiling (warning)
Reported 830 times (3700 total reports)
[gem] Failure in the tiling code
This warning was last seen in version 2.6.30, and first seen in 2.6.29-rc2.
More info: http://www.kerneloops.org/searchweek.php?search=i915_gem_set_tiling

Rank 2: iwl_set_dynamic_key (warning)
Reported 470 times (13682 total reports)
"no space for new kew" message.
This warning was last seen in version 2.6.30, and first seen in 2.6.27.
More info: http://www.kerneloops.org/searchweek.php?search=iwl_set_dynamic_key

Rank 3: getnstimeofday (warning)
Reported 309 times (2446 total reports)
[suspend resume] getnstimeofday() is called before timekeeping is resumed
This warning was last seen in version 2.6.29.4, and first seen in 2.6.24.
More info: http://www.kerneloops.org/searchweek.php?search=getnstimeofday

Rank 4: ct_vm_map (warning) - sleeping in atomic context
Reported 285 times (960 total reports)
Bug in the creative XFI driver (backported to Fedora)
This warning was last seen in version 2.6.29.4, and first seen in 2.6.29-rc4-git1.
More info: http://www.kerneloops.org/searchweek.php?search=ct_vm_map

Rank 5: parport_device_proc_unregister (warning)
Reported 278 times (1810 total reports)
Alan has a fix for this on LKML
This warning was last seen in version 2.6.29.4, and first seen in 2.6.27.
More info: http://www.kerneloops.org/searchweek.php?search=parport_device_proc_unregister

Rank 6: hres_timers_resume (warning)
Reported 188 times (1024 total reports)
[suspend resume] hres_timers_resume() is incorrectly called with interrupts on
This warning was last seen in version 2.6.30-rc6, and first seen in 2.6.24.7.
More info: http://www.kerneloops.org/searchweek.php?search=hres_timers_resume

Rank 7: generic_get_mtrr (warning)
Reported 139 times (1009 total reports)
BIOS bug where the MTRRs are not set up correctly
This warning was last seen in version 2.6.30, and first seen in 2.6.25.3.
More info: http://www.kerneloops.org/searchweek.php?search=generic_get_mtrr

Rank 8: dmar_table_init (warning)
Reported 121 times (507 total reports)
BIOS bug exposed via a WARN_ON
This warning was last seen in version 2.6.30-rc8, and first seen in 2.6.29.
More info: http://www.kerneloops.org/searchweek.php?search=dmar_table_init

Rank 9: minstrel_get_rate (warning)
Reported 64 times (384 total reports)
Issue with the ath5k driver
This warning was last seen in version 2.6.30-rc3, and first seen in 2.6.27.12.
More info: http://www.kerneloops.org/searchweek.php?search=minstrel_get_rate

Rank 10: dev_watchdog(r8169) (warning)
Reported 62 times (9116 total reports)
This warning was last seen in version 2.6.30-rc7, and first seen in 2.6.26.6.
More info: http://www.kerneloops.org/searchweek.php?search=dev_watchdog(r8169)

Rank 11: nv_tx_done_optimized (warning)
Reported 61 times (2974 total reports)
the NV ethernet driver is getting caught with the new DMA API checks
This warning was last seen in version 2.6.30, and first seen in 2.6.29-rc3-git7.
More info: http://www.kerneloops.org/searchweek.php?search=nv_tx_done_optimized

Rank 12: tcp_recvmsg (warning)
Reported 37 times (3797 total reports)
fixed in 2.6.30 (commit 775273131810caa41dfc7f9e552ea5d8508caf40)
This warning was last seen in version 2.6.30, and first seen in 2.6.25.
More info: http://www.kerneloops.org/searchweek.php?search=tcp_recvmsg


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org


2009-06-15 05:53:51

by Takashi Iwai

[permalink] [raw]
Subject: Re: [alsa-devel] kerneloops.org report for the week of June 14 2009

At Sun, 14 Jun 2009 17:33:31 -0700,
Arjan van de Ven wrote:
>
> Rank 4: ct_vm_map (warning) - sleeping in atomic context
> Reported 285 times (960 total reports)
> Bug in the creative XFI driver (backported to Fedora)
> This warning was last seen in version 2.6.29.4, and first seen in 2.6.29-rc4-git1.
> More info: http://www.kerneloops.org/searchweek.php?search=ct_vm_map

This should have been fixed in the latest code, at least, the one
in Linus tree should be OK.


thanks,

Takashi

2009-06-15 07:07:37

by Stefan Schmidt

[permalink] [raw]
Subject: Re: kerneloops.org report for the week of June 14 2009

Hello.

On Sun, 2009-06-14 at 17:33, Arjan van de Ven wrote:
>
> Rank 1: i915_gem_set_tiling (warning)
> Reported 830 times (3700 total reports)
> [gem] Failure in the tiling code
> This warning was last seen in version 2.6.30, and first seen in 2.6.29-rc2.
> More info: http://www.kerneloops.org/searchweek.php?search=i915_gem_set_tiling

There seem to be more then one issue. I'm running into one of them here on a
Lenovo Thinkpad X200s. It's 2.6.30 and I enabled KMS the first time here. So I
can't tell if it is a regression.

http://www.kerneloops.org/oops.php?number=207353

The result is that the EDID data could not be read and the wrong screen size
set:
[ 45.524589] i2c-adapter i2c-3: unable to read EDID block.
[ 45.588200] i915 0000:00:02.0: DVI-D-2: no EDID data
[ 45.777251] [drm] TMDS-14: set mode 1360x768 13
[ 45.777261] [drm] TMDS-16: set mode 1360x768 13 <= Should be 1280x900 here

Somehow the driver tries to use DVI1 and DVI2 while the display is connected to
LVDS1.

stefan@excalibur:~$ xrandr
Screen 0: minimum 320 x 200, current 1360 x 800, maximum 8192 x 8192
VGA1 disconnected (normal left inverted right x axis y axis)
LVDS1 connected 1280x800+0+0 (normal left inverted right x axis y axis) 261mm x
163mm
1280x800 60.0*+ 50.0
DVI1 connected 1360x768+0+0 (normal left inverted right x axis y axis) 0mm x 0mm
1360x768 59.8*
1152x864 60.0
1024x768 60.0
800x600 60.3
640x480 59.9
DVI2 connected 1360x768+0+0 (normal left inverted right x axis y axis) 0mm x 0mm
1360x768 59.8*
1152x864 60.0
1024x768 60.0
800x600 60.3
640x480 59.9

regards
Stefan Schmidt

2009-06-16 07:11:53

by Stefan Schmidt

[permalink] [raw]
Subject: Re: WARN_ON's in i915_gem_tiling.c

Hello.

Bringing Eric into the loop here as he is listed as author of the
i915_gem_tiling.c

On Mon, 2009-06-15 at 09:06, Stefan Schmidt wrote:
>
> On Sun, 2009-06-14 at 17:33, Arjan van de Ven wrote:
> >
> > Rank 1: i915_gem_set_tiling (warning)
> > Reported 830 times (3700 total reports)
> > [gem] Failure in the tiling code
> > This warning was last seen in version 2.6.30, and first seen in 2.6.29-rc2.
> > More info: http://www.kerneloops.org/searchweek.php?search=i915_gem_set_tiling
>
> There seem to be more then one issue. I'm running into one of them here on a
> Lenovo Thinkpad X200s. It's 2.6.30 and I enabled KMS the first time here. So I
> can't tell if it is a regression.
>
> http://www.kerneloops.org/oops.php?number=207353

http://www.kerneloops.org/raw.php?rawid=453362&msgid=

and

http://www.kerneloops.org/oops.php?number=453361

Are looking similar to #207353 to me. That's the one I can easily reproduce by
booting into an KMS enabled kernel. Without modsetting eveything is fine. Let me
know if there are patches I should give a try.

> The result is that the EDID data could not be read and the wrong screen size
> set:
> [ 45.524589] i2c-adapter i2c-3: unable to read EDID block.
> [ 45.588200] i915 0000:00:02.0: DVI-D-2: no EDID data
> [ 45.777251] [drm] TMDS-14: set mode 1360x768 13
> [ 45.777261] [drm] TMDS-16: set mode 1360x768 13 <= Should be 1280x900 here
>
> Somehow the driver tries to use DVI1 and DVI2 while the display is connected to
> LVDS1.
>
> stefan@excalibur:~$ xrandr
> Screen 0: minimum 320 x 200, current 1360 x 800, maximum 8192 x 8192
> VGA1 disconnected (normal left inverted right x axis y axis)
> LVDS1 connected 1280x800+0+0 (normal left inverted right x axis y axis) 261mm x
> 163mm
> 1280x800 60.0*+ 50.0
> DVI1 connected 1360x768+0+0 (normal left inverted right x axis y axis) 0mm x 0mm
> 1360x768 59.8*
> 1152x864 60.0
> 1024x768 60.0
> 800x600 60.3
> 640x480 59.9
> DVI2 connected 1360x768+0+0 (normal left inverted right x axis y axis) 0mm x 0mm
> 1360x768 59.8*
> 1152x864 60.0
> 1024x768 60.0
> 800x600 60.3
> 640x480 59.9

regards
Stefan Schmidt

2009-06-18 17:14:44

by Bob Copeland

[permalink] [raw]
Subject: Re: kerneloops.org report for the week of June 14 2009

On Sun, Jun 14, 2009 at 8:33 PM, Arjan van de Ven<[email protected]> wrote:
> Rank 9: minstrel_get_rate (warning)
> ? ? ? ?Reported 64 times (384 total reports)
> ? ? ? ?Issue with the ath5k driver
> ? ? ? ?This warning was last seen in version 2.6.30-rc3, and first seen in 2.6.27.12.
> ? ? ? ?More info: http://www.kerneloops.org/searchweek.php?search=minstrel_get_rate

I believe this is fixed by http://lkml.org/lkml/2009/6/5/269, now
in pre-2.6.31. Hmm, odd though, we should be seeing reports from
other wireless drivers since it's a general memory corruption
unrelated to ath5k (but not all use the mac80211 rate controllers).

(2.6.27.12 must be a backport, minstrel didn't exist until .29)

--
Bob Copeland %% http://www.bobcopeland.com

2009-06-19 05:32:20

by Arjan van de Ven

[permalink] [raw]
Subject: Re: kerneloops.org report for the week of June 14 2009

On Thu, 18 Jun 2009 13:08:39 -0400
Bob Copeland <[email protected]> wrote:

> On Sun, Jun 14, 2009 at 8:33 PM, Arjan van de
> Ven<[email protected]> wrote:
> > Rank 9: minstrel_get_rate (warning)
> >        Reported 64 times (384 total reports)
> >        Issue with the ath5k driver
> >        This warning was last seen in version 2.6.30-rc3, and first
> > seen in 2.6.27.12. More info:
> > http://www.kerneloops.org/searchweek.php?search=minstrel_get_rate
>
> I believe this is fixed by http://lkml.org/lkml/2009/6/5/269, now
> in pre-2.6.31. Hmm, odd though, we should be seeing reports from
> other wireless drivers since it's a general memory corruption
> unrelated to ath5k (but not all use the mac80211 rate controllers).
>

well.. out of the 534 reports or so that I have by now... 525 have
ath5k loaded.....

it's not 100% but it'd pretty close.


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-06-23 10:07:34

by Thomas Gleixner

[permalink] [raw]
Subject: Re: kerneloops.org report for the week of June 14 2009

On Sun, 14 Jun 2009, Arjan van de Ven wrote:
> Rank 3: getnstimeofday (warning)
> Reported 309 times (2446 total reports)
> [suspend resume] getnstimeofday() is called before timekeeping is
resumed

> Rank 6: hres_timers_resume (warning)
> Reported 188 times (1024 total reports)
> [suspend resume] hres_timers_resume() is incorrectly called with
> interrupts on

Both have the same root cause. Something enables interrupts in the
early resume path. IIRC, there was a culprit identified
recently. Rafael ?

Thanks,

tglx

2009-06-23 11:55:42

by Ingo Molnar

[permalink] [raw]
Subject: Re: kerneloops.org report for the week of June 14 2009


* Thomas Gleixner <[email protected]> wrote:

> On Sun, 14 Jun 2009, Arjan van de Ven wrote:
> > Rank 3: getnstimeofday (warning)
> > Reported 309 times (2446 total reports)
> > [suspend resume] getnstimeofday() is called before timekeeping is
> resumed
>
> > Rank 6: hres_timers_resume (warning)
> > Reported 188 times (1024 total reports)
> > [suspend resume] hres_timers_resume() is incorrectly called with
> > interrupts on
>
> Both have the same root cause. Something enables interrupts in the
> early resume path. IIRC, there was a culprit identified recently.
> Rafael ?

This can be debugged automatically today, using lockdep, by using a
'helper lock':

static DEFINE_PER_CPU(struct lockdep_map, helper_lock);

Then mark the lock irq-safe by doing something like:

static void mark_lock_irqsafe(void)
{
unsigned long flags;
int cpu;

local_irq_save(flags);
irq_enter(0);

for_each_online_cpu(cpu) {
lock_acquire(&per_cpu(helper_lock, cpu), 0, 0, 0, 0, NULL, 0);
lock_release(&per_cpu(helper_lock, cpu), 0, 0, 0, 0, NULL, 0);
}

irq_exit(0);
local_irq_restore(flags);
}

Then, the resume path, when it disables irqs, you can disallow
irq-enable via:

local_irq_disable();
lock_acquire(&__get_cpu_var(helper_lock), 0, 0, 0, 0, NULL, 0);
...
<extensive suspend or resume codepaths, callbacks>
...
lock_release(&__get_cpu_var(helper_lock), 0, 0, 0, 0, NULL, 0);
local_irq_enable();

And lockdep will warn if any function inbetween enables IRQs, by
emitting a splat about incorrectly enabled hardirqs. It will warn
about the specific place and will emit a relevant backtrace, - not
just the handler in general.

This should work just fine with current lockdep facilities.

Rafael?

Ingo

2009-06-23 14:50:24

by Arjan van de Ven

[permalink] [raw]
Subject: Re: kerneloops.org report for the week of June 14 2009

On Tue, 23 Jun 2009 13:55:10 +0200
Ingo Molnar <[email protected]> wrote:

>
> * Thomas Gleixner <[email protected]> wrote:
>
> > On Sun, 14 Jun 2009, Arjan van de Ven wrote:
> > > Rank 3: getnstimeofday (warning)
> > > Reported 309 times (2446 total reports)
> > > [suspend resume] getnstimeofday() is called before
> > > timekeeping is
> > resumed
> >
> > > Rank 6: hres_timers_resume (warning)
> > > Reported 188 times (1024 total reports)
> > > [suspend resume] hres_timers_resume() is incorrectly
> > > called with interrupts on
> >
> > Both have the same root cause. Something enables interrupts in the
> > early resume path. IIRC, there was a culprit identified recently.
> > Rafael ?
>
> This can be debugged automatically today, using lockdep, by using a
> 'helper lock':
>
> static DEFINE_PER_CPU(struct lockdep_map, helper_lock);
>
> Then mark the lock irq-safe by doing something like:
>
> static void mark_lock_irqsafe(void)
> {
> unsigned long flags;
> int cpu;
>
> local_irq_save(flags);
> irq_enter(0);
>
> for_each_online_cpu(cpu) {
> lock_acquire(&per_cpu(helper_lock, cpu), 0, 0, 0, 0,
> NULL, 0); lock_release(&per_cpu(helper_lock, cpu), 0, 0, 0, 0, NULL,
> 0); }
>
> irq_exit(0);
> local_irq_restore(flags);
> }
>
> Then, the resume path, when it disables irqs, you can disallow
> irq-enable via:
>
> local_irq_disable();
> lock_acquire(&__get_cpu_var(helper_lock), 0, 0, 0, 0, NULL,
> 0); ...
> <extensive suspend or resume codepaths, callbacks>
> ...
> lock_release(&__get_cpu_var(helper_lock), 0, 0, 0, 0, NULL,
> 0); local_irq_enable();
>
> And lockdep will warn if any function inbetween enables IRQs, by
> emitting a splat about incorrectly enabled hardirqs. It will warn
> about the specific place and will emit a relevant backtrace, - not
> just the handler in general.

looks like an interesting approach; it'll pinpoint the bad guy exactly
where he's enabling interrupts.. (assuming he's using kernel facilities
of course)



--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-06-23 16:56:33

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: kerneloops.org report for the week of June 14 2009

On Tuesday 23 June 2009, Ingo Molnar wrote:
>
> * Thomas Gleixner <[email protected]> wrote:
>
> > On Sun, 14 Jun 2009, Arjan van de Ven wrote:
> > > Rank 3: getnstimeofday (warning)
> > > Reported 309 times (2446 total reports)
> > > [suspend resume] getnstimeofday() is called before timekeeping is
> > resumed
> >
> > > Rank 6: hres_timers_resume (warning)
> > > Reported 188 times (1024 total reports)
> > > [suspend resume] hres_timers_resume() is incorrectly called with
> > > interrupts on
> >
> > Both have the same root cause. Something enables interrupts in the
> > early resume path. IIRC, there was a culprit identified recently.
> > Rafael ?

Apparently, we have smp_call_function_single() called from cpufreq_suspend
via acpi_cpufreq somehow, but I'm still to figure out how this happens.

> This can be debugged automatically today, using lockdep, by using a
> 'helper lock':
>
> static DEFINE_PER_CPU(struct lockdep_map, helper_lock);
>
> Then mark the lock irq-safe by doing something like:
>
> static void mark_lock_irqsafe(void)
> {
> unsigned long flags;
> int cpu;
>
> local_irq_save(flags);
> irq_enter(0);
>
> for_each_online_cpu(cpu) {
> lock_acquire(&per_cpu(helper_lock, cpu), 0, 0, 0, 0, NULL, 0);
> lock_release(&per_cpu(helper_lock, cpu), 0, 0, 0, 0, NULL, 0);
> }
>
> irq_exit(0);
> local_irq_restore(flags);
> }
>
> Then, the resume path, when it disables irqs, you can disallow
> irq-enable via:
>
> local_irq_disable();
> lock_acquire(&__get_cpu_var(helper_lock), 0, 0, 0, 0, NULL, 0);
> ...
> <extensive suspend or resume codepaths, callbacks>
> ...
> lock_release(&__get_cpu_var(helper_lock), 0, 0, 0, 0, NULL, 0);
> local_irq_enable();
>
> And lockdep will warn if any function inbetween enables IRQs, by
> emitting a splat about incorrectly enabled hardirqs. It will warn
> about the specific place and will emit a relevant backtrace, - not
> just the handler in general.
>
> This should work just fine with current lockdep facilities.
>
> Rafael?

We have some debug code for checking interrupts disabled in sysdev_suspend
and sysdev_resume already and these reports are from 2.6.29 where that code
was not present.

The long term solution for the issue at hand is to clean up the suspend-resume
support in cpufreq so that it doesn't do stupid things like calling
smp_call_function_single() with interrupts disabled, but that requires someone
(I can do it, but I need to dig through the cpufreq code for this purpose) to
figure out how to fix it.

I'm not quite sure if there's an acceptable short term solution, though.

In principle we can do

local_irq_save()
...
local_irq_restore()

around each sysdevs ->susend() and ->resume() in addition to checking the
status of interrupts. Would that work?

Rafael

2009-06-23 17:01:07

by Thomas Gleixner

[permalink] [raw]
Subject: Re: kerneloops.org report for the week of June 14 2009

On Tue, 23 Jun 2009, Rafael J. Wysocki wrote:
> The long term solution for the issue at hand is to clean up the suspend-resume
> support in cpufreq so that it doesn't do stupid things like calling
> smp_call_function_single() with interrupts disabled, but that requires someone
> (I can do it, but I need to dig through the cpufreq code for this purpose) to
> figure out how to fix it.
>
> I'm not quite sure if there's an acceptable short term solution, though.
>
> In principle we can do
>
> local_irq_save()
> ...
> local_irq_restore()
>
> around each sysdevs ->susend() and ->resume() in addition to checking the
> status of interrupts. Would that work?

Well not really, if the function enables interrupts you run into the
same issue (interrupt service routine calls ktime_get()) again.

Thanks,

tglx

2009-06-25 13:14:26

by Pádraig Brady

[permalink] [raw]
Subject: Re: kerneloops.org report for the week of June 14 2009

Thomas Gleixner wrote:
> On Sun, 14 Jun 2009, Arjan van de Ven wrote:
>> Rank 3: getnstimeofday (warning)
>> Reported 309 times (2446 total reports)
>> [suspend resume] getnstimeofday() is called before timekeeping is
> resumed
>
>> Rank 6: hres_timers_resume (warning)
>> Reported 188 times (1024 total reports)
>> [suspend resume] hres_timers_resume() is incorrectly called with
>> interrupts on
>
> Both have the same root cause. Something enables interrupts in the
> early resume path. IIRC, there was a culprit identified
> recently. Rafael ?

Lots of reports I guess from Fedora 11 users trying to suspend/resume:
https://bugzilla.redhat.com/show_bug.cgi?id=499651

cheers,
P?draig.

2009-06-28 09:12:05

by Bob Copeland

[permalink] [raw]
Subject: Re: kerneloops.org report for the week of June 14 2009

On Fri, Jun 19, 2009 at 1:32 AM, Arjan van de Ven<[email protected]> wrote:
> On Thu, 18 Jun 2009 13:08:39 -0400
> Bob Copeland <[email protected]> wrote:
>> > Rank 9: minstrel_get_rate (warning)
>>
>> I believe this is fixed by http://lkml.org/lkml/2009/6/5/269, now
>> in pre-2.6.31. ?Hmm, odd though, we should be seeing reports from
>> other wireless drivers since it's a general memory corruption
>> unrelated to ath5k (but not all use the mac80211 rate controllers).
>
> well.. out of the 534 reports or so that I have by now... 525 have
> ath5k loaded.....
>
> it's not 100% but it'd pretty close.

Indeed, I can replicate this now, looks like some kind of race condition
since it's looking for a rate in the 5 ghz band while I'm communicating
with a 2 ghz AP (during a scan perhaps?). The warning is harmless but
there is some underlying issue to be fixed.

--
Bob Copeland %% http://www.bobcopeland.com