2006-05-23 18:29:33

by Carlos Martín Nieto

Subject: A couple of oops.

Hi,

I've nailed this down to something that happened in 2.6.17-rc4. The
system locks up with either a NULL pointer dereference or an "unable to
handle kernel paging request" oops. The stack trace shows this:

paging request              NULL dereference

_raw_spin_trylock+12        _raw_spin_trylock+20
__spin_lock+22
main_timer_handler+22
timer_interrupt+18
handle_IRQ_event+41
__do_IRQ+156
do_IRQ+51
default_idle+0
_spin_unlock_irq+43
thread_return+187
generic_unplug_device+0
default_idle+45
dev_idle+95 (I can't read the func clearly in this handwriting)
start_secondary+1129

I'm guessing this is the same problem, which manifests itself sometimes
as one and sometimes as the other. The problem is in the call to
write_seqlock(&xtime_lock) from main_timer_handler().
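
For readers unfamiliar with seqlocks, the write side can be modelled in
user space roughly as below. This is a minimal sketch, not the kernel's
code: the names demo_seqlock, demo_seqlock_init, demo_write_seqlock and
demo_write_sequnlock are hypothetical, and a C11 atomic_flag stands in
for the kernel spinlock. It only illustrates that taking the write lock
touches two fields in turn, first the lock and then the sequence counter.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical user-space model of a seqlock; not the kernel's definition. */
struct demo_seqlock {
    atomic_uint sequence;   /* odd while a writer holds the lock */
    atomic_flag lock;       /* stand-in for the kernel spinlock */
};

void demo_seqlock_init(struct demo_seqlock *sl)
{
    atomic_init(&sl->sequence, 0);
    atomic_flag_clear(&sl->lock);
}

/* Write side: take the lock first, then bump the sequence counter. */
void demo_write_seqlock(struct demo_seqlock *sl)
{
    while (atomic_flag_test_and_set(&sl->lock))
        ;   /* spin until the flag was previously clear */
    atomic_fetch_add(&sl->sequence, 1);
}

/* Release: bump the counter back to an even value, then drop the lock. */
void demo_write_sequnlock(struct demo_seqlock *sl)
{
    assert(atomic_load(&sl->sequence) & 1u);   /* a writer must be active */
    atomic_fetch_add(&sl->sequence, 1);
    atomic_flag_clear(&sl->lock);
}
```

In this model a fault on either field would show up inside the same
write-lock call, which fits the guess that both oopses share one cause.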

I've not been able to determine what patch has caused this to happen,
but it is between 2.6.17-rc3 and -rc4. I'm bisecting, but if anybody has
a good candidate, it'd probably be faster than doing a complete bisect.

cmn
--
Carlos Martín Nieto | http://www.cmartin.tk
Hobbyist programmer |



2006-05-24 00:41:08

by Andrew Morton

Subject: Re: A couple of oops.

Carlos Martín <[email protected]> wrote:
>
> Hi,
>
> I've nailed this down to something that happened in 2.6.17-rc4. The
> system locks up with either a NULL pointer dereference or an "unable to
> handle kernel paging request" oops. The stack trace shows this:
>
> paging request              NULL dereference
>
> _raw_spin_trylock+12        _raw_spin_trylock+20
> __spin_lock+22
> main_timer_handler+22
> timer_interrupt+18
> handle_IRQ_event+41
> __do_IRQ+156
> do_IRQ+51
> default_idle+0
> _spin_unlock_irq+43
> thread_return+187
> generic_unplug_device+0
> default_idle+45
> dev_idle+95 (I can't read the func clearly in this handwriting)
> start_secondary+1129
>
> I'm guessing this is the same problem, which manifests itself sometimes
> as one and sometimes as the other. The problem is in the call to
> write_seqlock(&xtime_lock) from main_timer_handler().
>
> I've not been able to determine what patch has caused this to happen,
> but it is between 2.6.17-rc3 and -rc4. I'm bisecting, but if anybody has
> a good candidate, it'd probably be faster than doing a complete bisect.

hm, so an attempt to access xtime_lock.lock results in a null-pointer deref?

x86_64 does novel things, putting xtime_lock into a linker section all of
its own. At a guess I'd say that your compiler/assembler/linker toolchain
got confused and generated the incorrect address for xtime_lock. But if
that was the case you'd get oopses 100% of the time - it wouldn't be
intermittent, as your description seems to imply (although it's quite
unclear?).

You could do:

gdb vmlinux
(gdb) p &xtime_lock
(gdb) x/40i main_timer_handler
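
To illustrate what "a linker section all of its own" means, here is a
small user-space sketch. It is not the kernel's declaration: the type
demo_lock, the object demo_lock_obj, the section name demo_lock_section
and the helper demo_lock_address_ok are all made up for this example,
and it relies on the GCC/Clang section and aligned attributes on an ELF
target. The check mirrors the intent of the `p &xtime_lock` suggestion
above: confirm that the toolchain assigned a sane address.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical lock object; the type and section name are illustrative only. */
struct demo_lock {
    unsigned int sequence;
    unsigned int locked;
};

static_assert(sizeof(struct demo_lock) <= 16, "demo_lock fits one 16-byte slot");

/* GCC/Clang extension (ELF targets): emit the object into its own named
 * section, 16-byte aligned, so a linker script could place it explicitly. */
struct demo_lock demo_lock_obj
    __attribute__((section("demo_lock_section"), aligned(16))) = { 0, 0 };

/* The address the linker assigned must be non-null and must honour the
 * requested alignment. */
int demo_lock_address_ok(void)
{
    uintptr_t addr = (uintptr_t)&demo_lock_obj;
    return addr != 0 && (addr % 16) == 0;
}
```

If the toolchain had mis-resolved such a symbol, a check like this would
fail on every run rather than intermittently, which is the same reasoning
as in the paragraph above.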

2006-05-24 14:27:49

by Carlos Martín Nieto

Subject: Re: A couple of oops.

On Tue, 2006-05-23 at 17:40 -0700, Andrew Morton wrote:
> Carlos Martín <[email protected]> wrote:
> >
> > Hi,
> >
> > I've nailed this down to something that happened in 2.6.17-rc4. The
> > system locks up with either a NULL pointer dereference or an "unable to
> > handle kernel paging request" oops. The stack trace shows this:
> >
> > paging request              NULL dereference
> >
> > _raw_spin_trylock+12        _raw_spin_trylock+20
> > __spin_lock+22
> > main_timer_handler+22
> > timer_interrupt+18
> > handle_IRQ_event+41
> > __do_IRQ+156
> > do_IRQ+51
> > default_idle+0
> > _spin_unlock_irq+43
> > thread_return+187
> > generic_unplug_device+0
> > default_idle+45
> > dev_idle+95 (I can't read the func clearly in this handwriting)
> > start_secondary+1129
> >
> > I'm guessing this is the same problem, which manifests itself sometimes
> > as one and sometimes as the other. The problem is in the call to
> > write_seqlock(&xtime_lock) from main_timer_handler().
> >
> > I've not been able to determine what patch has caused this to happen,
> > but it is between 2.6.17-rc3 and -rc4. I'm bisecting, but if anybody has
> > a good candidate, it'd probably be faster than doing a complete bisect.
>
> hm, so an attempt to access xtime_lock.lock results in a null-pointer deref?

It seems to be xtime_lock.sequence:
main_timer_handler:
pushq %r13 #
movq %rdi, %r13 # regs, regs
movq $xtime_lock+8, %rdi #,
pushq %r12 #
pushq %rbp #
pushq %rbx #
pushq %rbp #
call _spin_lock #
OOPS--> incl xtime_lock(%rip) # xtime_lock.sequence
xorl %r12d, %r12d # offset
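
The two addresses in the listing (xtime_lock+8 passed to _spin_lock, and
xtime_lock itself for the incl) are consistent with a layout where the
sequence counter is the first member and the lock follows at the next
aligned offset. A rough sketch of that layout, assuming a debug-style
spinlock that contains a pointer member; the struct definitions and the
helper demo_lock_offset are hypothetical, not copied from the kernel
headers:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical debug-style spinlock: the pointer member gives the struct
 * pointer alignment (8 bytes on x86_64). This is an assumption made for
 * illustration, not the kernel's actual definition. */
struct demo_spinlock {
    unsigned int slock;
    void *owner;
};

struct demo_seqlock_layout {
    unsigned int sequence;       /* offset 0: what the incl would touch */
    struct demo_spinlock lock;   /* next aligned offset: +8 on x86_64 */
};

static_assert(offsetof(struct demo_seqlock_layout, sequence) == 0,
              "sequence is the first member");

/* Offset of the lock member inside the seqlock model. */
size_t demo_lock_offset(void)
{
    return offsetof(struct demo_seqlock_layout, lock);
}
```

On an LP64 target demo_lock_offset() returns 8, matching the +8 in the
listing under the assumption stated above.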

>
> x86_64 does novel things, putting xtime_lock into a linker section all of
> its own. At a guess I'd say that your compiler/assembler/linker toolchain
> got confused and generated the incorrect address for xtime_lock. But if
> that was the case you'd get oopses 100% of the time - it wouldn't be
> intermittent, as your description seems to imply (although it's quite
> unclear?).

I've seen the NULL dereference once; the rest have been paging request
errors. Most of the time I'm in X, so I don't actually see the output,
but the ones I have seen have been like that.

It starts somewhere between rc3 and rc4.

The bisect has now left me with a bunch of SCSI patches, which don't
seem to be related to this and which touch code I don't use (except for
usb-storage).

>
> You could do:
>
> gdb vmlinux
> (gdb) p &xtime_lock
> (gdb) x/40i main_timer_handler
>
--
Carlos Martín Nieto | http://www.cmartin.tk
Hobbyist programmer |



2006-05-26 16:31:58

by Carlos Martín Nieto

Subject: Re: A couple of oops.

This seems to have gone as quickly as it appeared. I'm currently running
rc5 with three hours of uptime, which is much longer than I ever managed
with the broken kernel.

It looks like it was due to some miscompilation issue, though it
happened with a clean tree (make mrproper) so it seemed to be a problem
with the kernel.

cmn
--
Carlos Martín Nieto | http://www.cmartin.tk
Hobbyist programmer |

