2011-06-15 20:25:11

by Martin

Subject: Re: Kernel panic on HT machine - maybe i915 related?

wzab wrote:

> Hi,
> I tried to find the source of the problem I experience, supposing, that
> it may affect also other users of machine with Hyper-Threaded CPU.
> [...]

Hi wzab et al,

Only today I started using a spare machine at work with a hyperthreaded
CPU (P4 3.2GHz). I also experience random kernel panics with 2.6.39.1 and
SMP/HT enabled. The machine usually runs for a couple of hours before the
panic occurs.

> Crash seems to happen in random places:
> [...]

Same here, although my panics look different from yours. In both cases I had a
lot of hardware interrupts on the stack trace but no kmemcheck. I only
remembered to take a photo with my mobile phone the second time. The stack
trace contains handle_*irq*, tg3, ata_bmdma, __ata_sff, i915, drm_vblank_put,
intel_thermal, try_preempt, resched*, drm_vblank_put, do_invalid_op, oops_end,
do_bounds, panic. The EIP is drm_vblank_put+0x13/0x50.
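
For readers unfamiliar with the notation: a location such as
drm_vblank_put+0x13/0x50 in oops output reads as symbol+offset/size, i.e. the
reported address lies 0x13 bytes into a function that is 0x50 bytes long. The
following is only a minimal sketch for pulling those fields out of a
hand-transcribed trace (for example, one copied from a photo). The helper name
parse_oops_location is hypothetical and not part of any kernel tool.

```python
import re

# Matches the "symbol+0xOFFSET/0xSIZE" notation seen in kernel oops output.
# The pattern is an assumption that covers simple symbol names only.
_LOCATION = re.compile(
    r"^(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-fA-F]+)/0x(?P<size>[0-9a-fA-F]+)$"
)


def parse_oops_location(text):
    """Split 'symbol+0xOFF/0xSIZE' into (symbol, offset, size) with ints."""
    match = _LOCATION.match(text.strip())
    if match is None:
        raise ValueError(f"not in symbol+offset/size form: {text!r}")
    return (
        match.group("symbol"),
        int(match.group("offset"), 16),  # bytes from the start of the function
        int(match.group("size"), 16),    # total length of the function in bytes
    )


if __name__ == "__main__":
    symbol, offset, size = parse_oops_location("drm_vblank_put+0x13/0x50")
    print(f"{symbol}: offset {offset} of {size} bytes")
```

With the offset in hand, the matching instruction can be looked up in a
disassembly of the same kernel build.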

Could this be an i915 issue with 2.6.39.1? At home I have two other machines
on i915 but with different chipsets that run fine with the same kernel.

> As crash occurs only with HT on and doesn't happen on another machine
> with 2 cores, it seems that maybe the problem is associated with
> incorrect allocation of resources or locking for HT enabled CPU...

I do not observe the issue with the regular non-SMP 2.6.38.7 distro kernel, so
it might be HT related. Truth be told I need to investigate further since my
2.6.39.1 kernel is patched. In the meantime, if someone recognizes the issue
please come forward.

Thanks and regards,

Martin


2011-06-16 11:41:23

by Wojciech Zabołotny

Subject: Re: Kernel panic on HT machine - maybe i915 related?

On 15.06.2011 22:24, Martin wrote:
> Same here, although my panics look different from yours. In both cases I had a
> lot of hardware interrupts on the stack trace but no kmemcheck. I only
> remembered to take a photo with my mobile phone the second time. The stack
> trace contains handle_*irq*, tg3, ata_bmdma, __ata_sff, i915, drm_vblank_put,
> intel_thermal, try_preempt, resched*, drm_vblank_put, do_invalid_op, oops_end,
> do_bounds, panic. The EIP is drm_vblank_put+0x13/0x50.
>
You probably have not enabled kmemcheck in your kernel configuration?
> Could this be an i915 issue with 2.6.39.1? At home I have two other machines
> on i915 but with different chipsets that run fine with the same kernel.
>

The i915 driver has behaved strangely with the 82865G chipset for quite a long time (since 2.6.37?).
The machine that causes the problem uses this chipset.
Another one, which works well with 2.6.39.1 (I am writing this message on it), uses: Intel Corporation Mobile 945GM/GMS, 943/940GML Express Integrated Graphics Controller (rev 03)

>> As crash occurs only with HT on and doesn't happen on another machine
>> with 2 cores, it seems that maybe the problem is associated with
>> incorrect allocation of resources or locking for HT enabled CPU...
> I do not observe the issue with the regular non-SMP 2.6.38.7 distro kernel, so
> it might be HT related. Truth be told I need to investigate further since my
> 2.6.39.1 kernel is patched. In the meantime, if someone recognizes the issue
> please come forward.
>
In my last message
( https://groups.google.com/group/kernelarchive/browse_thread/thread/adaeb363d2eadf24/559b1fd44876e4a6 )
I reported some problems even with HT switched off,
but those may be related to the problems with using kmemcheck and perf together, as reported e.g. here: https://lkml.org/lkml/2011/4/26/83
--
Regards,
Wojtek

2011-06-16 21:45:51

by Martin

Subject: Re: Kernel panic on HT machine - maybe i915 related?

On Thursday 16 June 2011 13:41:04 Wojciech Zabołotny wrote:
> On 15.06.2011 22:24, Martin wrote:
> > Same here, although my panics look different from yours. In both cases I
> > had a lot of hardware interrupts on the stack trace but no kmemcheck. I
> > only remembered to take a photo with my mobile phone the second time.
> > The stack trace contains handle_*irq*, tg3, ata_bmdma, __ata_sff, i915,
> > drm_vblank_put, intel_thermal, try_preempt, resched*, drm_vblank_put,
> > do_invalid_op, oops_end, do_bounds, panic. The EIP is
> > drm_vblank_put+0x13/0x50.
>
> Probably you have not enabled kmemcheck in the kernel configuration?

Good point. ;)

>
> > Could this be an i915 issue with 2.6.39.1? At home I have two other
> > machines on i915 but with different chipsets that run fine with the same
> > kernel.
>
> The i915 driver with 82865G chipset works strange for quite a long time
> (since 2.6.37???) The machine which causes problem uses this chipset.
> Another one which works good with 2.6.39.1 (I write this message on it)
> uses: Intel Corporation Mobile 945GM/GMS, 943/940GML Express Integrated
> Graphics Controller (rev 03)
>

I have had i915 weirdness with different chipsets at different times. Apart
from the oops happening in drm_vblank_put, it occurred to me that a random X
screensaver must have been active when it happened.

I didn't manage to trigger the panic overnight with the problematic kernel in
runlevel 3 (console), nor did it occur with a vanilla 2.6.39.1 during the day.
I will have to continue testing.

Martin

2011-06-17 13:21:09

by Martin

Subject: Re: Kernel panic on HT machine - maybe i915 related?

On Thursday 16 June 2011 23:45:16 Martin wrote:
> On Thursday 16 June 2011 13:41:04 Wojciech Zabołotny wrote:
> > On 15.06.2011 22:24, Martin wrote:
> > > Same here, although my panics look different from yours. In both cases
> > > I had a lot of hardware interrupts on the stack trace but no
> > > kmemcheck. I only remembered to take a photo with my mobile phone the
> > > second time. The stack trace contains handle_*irq*, tg3, ata_bmdma,
> > > __ata_sff, i915, drm_vblank_put, intel_thermal, try_preempt, resched*,
> > > drm_vblank_put, do_invalid_op, oops_end, do_bounds, panic. The EIP is
> > > drm_vblank_put+0x13/0x50.
[...]

I managed to catch a couple of kernel panics with the vanilla 2.6.39.1 kernel
today. Screenshots:

http://www.wupload.com/file/21411414/panic_screenshots.tar

I created a kernel bugzilla entry for the issue:

https://bugzilla.kernel.org/show_bug.cgi?id=37752

Martin