2005-02-15 14:56:27

by Ralf Hildebrandt

Subject: Oops in 2.6.10-ac12 in kjournald (journal_commit_transaction)

Today our mailserver froze after just one day of uptime. I was able to
capture the Oops on the screen using my digital camera:

http://www.stahl.bau.tu-bs.de/~hildeb/bugreport/

Keywords: EIP is at journal_commit_transaction, process kjournald

# mount
/dev/cciss/c0d0p6 on / type ext3 (rw,errors=remount-ro)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
/dev/cciss/c0d0p5 on /boot type ext3 (rw)
/dev/shm on /var/amavis type tmpfs (rw,noatime,size=200m,mode=770,uid=104,gid=108)

--
Ralf Hildebrandt (i.A. des IT-Zentrum) [email protected]
Charite - Universitätsmedizin Berlin Tel. +49 (0)30-450 570-155
Gemeinsame Einrichtung von FU- und HU-Berlin Fax. +49 (0)30-450 570-962
IT-Zentrum Standort CBF send no mail to [email protected]


2005-02-16 15:33:45

by Jan Kara

Subject: Re: Oops in 2.6.10-ac12 in kjournald (journal_commit_transaction)

Hello,

> Today our mailserver froze after just one day of uptime. I was able to
> capture the Oops on the screen using my digital camera:
>
> http://www.stahl.bau.tu-bs.de/~hildeb/bugreport/
>
> Keywords: EIP is at journal_commit_transaction, process kjournald

I guess the system is SMP... Sadly a few lines in the beginning of the
report are missing (probably scrolled off the screen) but it seems
similar to several other oopses I've seen reported recently. Is this
the first time you hit this bug?

> # mount
> /dev/cciss/c0d0p6 on / type ext3 (rw,errors=remount-ro)
> proc on /proc type proc (rw)
> sysfs on /sys type sysfs (rw)
> devpts on /dev/pts type devpts (rw,gid=5,mode=620)
> tmpfs on /dev/shm type tmpfs (rw)
> /dev/cciss/c0d0p5 on /boot type ext3 (rw)
> /dev/shm on /var/amavis type tmpfs (rw,noatime,size=200m,mode=770,uid=104,gid=108)

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2005-02-16 20:05:09

by Ralf Hildebrandt

Subject: Re: Oops in 2.6.10-ac12 in kjournald (journal_commit_transaction)

* Jan Kara <[email protected]>:

> I guess the system is SMP...

Indeed it is. Dual Xeon with SMP.

> Sadly a few lines in the beginning of the
> report are missing (probably scrolled off the screen)

Yes, this sucks. I rebooted with vesafb active, so now I have 50 lines :)

> but it seems similar to several other oopses I've seen reported
> recently. Is this the first time you hit this bug?

It's actually the second time. The first time it hit the SAME box but
with kernel-2.6.10 (vanilla) after 30 days of uptime. Nobody had a
camera at hand, so I couldn't take a photo.

Any suggestions are welcome. One difference between the 2.6.10 and
2.6.10-ac12 setups: 2.6.10 had no in-kernel IRQ balancing, while in
2.6.10-ac12 I activated it.


2005-02-16 21:54:53

by Dale Blount

Subject: Re: Oops in 2.6.10-ac12 in kjournald (journal_commit_transaction)

On Wed, 2005-02-16 at 21:04 +0100, Ralf Hildebrandt wrote:
> * Jan Kara <[email protected]>:
>
> > I guess the system is SMP...
>
> Indeed it is. Dual Xeon with SMP.
>

This looks very similar (at least to me) to an OOPS I posted with 2.6.9
on 12/03/2004.
http://marc.theaimsgroup.com/?l=linux-kernel&m=110210705504716&w=2

My system is also a dual Xeon using SMP and Hyperthreading
(/proc/cpuinfo shows 4 cpus).
Mine, like Ralf's, is also a mail server running postfix using ext3 for
the spool directory.

> > but it seems similar to several other oopses I've seen reported
> > recently. Is this the first time you hit this bug?
>
> It's actually the second time. The first time it hit the SAME box but
> with kernel-2.6.10 (vanilla) after 30 days of uptime. Nobody had a
> camera at hand, so I couldn't take a photo.
>

I've actually hit this bug (assuming it's the same) with 2.6.10 also. I
had to power cycle remotely and unfortunately didn't have the serial
console logging enabled when it happened with 2.6.10. I upgraded from
2.4.23 to 2.6.8.1 and crashed within a week, and continued to crash at
least monthly after that. It had been running 2.4.23 for 200+ days with
no problems.

Hope this helps trace it back.

Dale

2005-02-16 22:02:42

by Ralf Hildebrandt

Subject: Re: Oops in 2.6.10-ac12 in kjournald (journal_commit_transaction)

* Dale Blount <[email protected]>:

> This looks very similar (at least to me) to an OOPS I posted with 2.6.9
> on 12/03/2004.
> http://marc.theaimsgroup.com/?l=linux-kernel&m=110210705504716&w=2

Could be.

> My system is also a dual Xeon using SMP and Hyperthreading
> (/proc/cpuinfo shows 4 cpus).

Same system here.

> Mine, like Ralf's, is also a mail server running postfix using ext3 for
> the spool directory.

Same here.

> I've actually hit this bug (assuming it's the same) with 2.6.10 also. I
> had to power cycle remotely and unfortunately didn't have the serial
> console logging enabled when it happened with 2.6.10. I upgraded from
> 2.4.23 to 2.6.8.1 and crashed within a week, and continued to crash at
> least monthly after that. It had been running 2.4.23 for 200+ days with
> no problems.
>
> Hope this helps trace it back.

Me too



2005-02-16 22:50:56

by Andrew Morton

Subject: Re: Oops in 2.6.10-ac12 in kjournald (journal_commit_transaction)

Dale Blount <[email protected]> wrote:
>
> This looks very similar (at least to me) to an OOPS I posted with 2.6.9
> on 12/03/2004.
> http://marc.theaimsgroup.com/?l=linux-kernel&m=110210705504716&w=2

There have been a handful of reports - there's surely a race in there.

Unfortunately I've yet to see a report from which we can identify the
offending line in the very large journal_commit_transaction() function.

The best way to do that is to ensure that the kernel was built with
CONFIG_DEBUG_INFO, note the offending EIP value, then do

# gdb vmlinux
(gdb) l *0xc0<whatever>
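
As a rough illustration of the procedure above, the following shell sketch
pulls the faulting address out of a saved oops transcript and hands it to
gdb. It assumes the oops contains a line of the form
"EIP:    0060:[<address>]", as i386 2.6 kernels usually print. The function
names and the file names oops.txt and vmlinux are placeholders for
illustration, not something taken from the reports in this thread.

```sh
#!/bin/sh
# Sketch only: assumes an i386-style oops with an "EIP:    0060:[<c0......>]" line.

# Print the first EIP address (hex digits, no 0x prefix) found on stdin.
eip_from_oops() {
    sed -n 's/^EIP:[[:space:]]*[0-9a-fA-F]*:\[<\([0-9a-fA-F]*\)>].*/\1/p' | head -n 1
}

# List the source around that address.
# Needs a vmlinux built with CONFIG_DEBUG_INFO=y.
# Usage (hypothetical file names): list_source_at_eip oops.txt vmlinux
list_source_at_eip() {
    addr=$(eip_from_oops < "$1")
    [ -n "$addr" ] || { echo "no EIP line found in $1" >&2; return 1; }
    gdb -batch -ex "list *0x$addr" "$2"
}
```

Calling list_source_at_eip oops.txt vmlinux would then run the same
"l *0x..." query in gdb's batch mode. When the oops only exists as a
photo, the address still has to be typed in by hand.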

2005-02-17 10:58:27

by Ralf Hildebrandt

Subject: Re: Oops in 2.6.10-ac12 in kjournald (journal_commit_transaction)

* Andrew Morton <[email protected]>:

> There have been a handful of reports - there's surely a race in there.
>
> Unfortunately I've yet to see a report from which we can identify the
> offending line in the very large journal_commit_transaction() function.

:(

>
> The best way to do that is to ensure that the kernel was built with
> CONFIG_DEBUG_INFO, note the offending EIP value, then do
>
> # gdb vmlinux
> (gdb) l *0xc0<whatever>

I'm rebuilding the ac12 kernel which crashed on me after just one day
and will reboot it today.


2005-02-17 13:21:51

by Ralf Hildebrandt

Subject: Re: Oops in 2.6.10-ac12 in kjournald (journal_commit_transaction)

* Ralf Hildebrandt <[email protected]>:

> > The best way to do that is to ensure that the kernel was built with
> > CONFIG_DEBUG_INFO, note the offending EIP value, then do
> >
> > # gdb vmlinux
> > (gdb) l *0xc0<whatever>
>
> I'm rebuilding the ac12 kernel which crashed on me after just one day
> and will reboot it today.

Is it normal that the kernel with debugging enabled is not larger than
the normal kernel?

2005-02-17 15:51:28

by Randy.Dunlap

Subject: Re: Oops in 2.6.10-ac12 in kjournald (journal_commit_transaction)

Ralf Hildebrandt wrote:
> * Ralf Hildebrandt <[email protected]>:
>
>
>>>The best way to do that is to ensure that the kernel was built with
>>>CONFIG_DEBUG_INFO, note the offending EIP value, then do
>>>
>>># gdb vmlinux
>>>(gdb) l *0xc0<whatever>
>>
>>I'm rebuilding the ac12 kernel which crashed on me after just one day
>>and will reboot it today.
>
>
> Is it normal that the kernel with debugging enabled is not larger than
> the normal kernel?

No, it should be much larger. Recheck the .config file
for CONFIG_DEBUG_INFO=y. Maybe you need to do 'make clean'
first.
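
A minimal sketch of those checks, assuming readelf from binutils is
installed and that the commands are run from the top of the kernel build
tree. The function names are made up for this example. Note that the
compressed boot image that gets installed is normally stripped of debug
sections, so the size difference should show up in the vmlinux file in the
build tree rather than in the installed image.

```sh
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Succeed if the given kernel config file enables CONFIG_DEBUG_INFO.
config_has_debug_info() {
    grep -q '^CONFIG_DEBUG_INFO=y$' "$1"
}

# Succeed if the given ELF image carries a .debug_info section.
image_has_debug_info() {
    readelf -S "$1" | grep -q '\.debug_info'
}
```

For example, "config_has_debug_info .config && image_has_debug_info vmlinux"
returns success only when the option is set and the debug sections really
ended up in the uncompressed image.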

--
~Randy

2005-02-17 16:00:18

by Ralf Hildebrandt

Subject: Re: Oops in 2.6.10-ac12 in kjournald (journal_commit_transaction)

* Randy.Dunlap <[email protected]>:

> >Is it normal that the kernel with debugging enabled is not larger than
> >the normal kernel?
>
> No, it should be much larger. Recheck the .config file
> for CONFIG_DEBUG_INFO=y. Maybe you need to do 'make clean'
> first.

CONFIG_DEBUG_KERNEL=y
CONFIG_MAGIC_SYSRQ=y
# CONFIG_SCHEDSTATS is not set
# CONFIG_DEBUG_SLAB is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_SPINLOCK_SLEEP is not set
# CONFIG_DEBUG_KOBJECT is not set
# CONFIG_DEBUG_HIGHMEM is not set
CONFIG_DEBUG_INFO=y
# CONFIG_FRAME_POINTER is not set
CONFIG_EARLY_PRINTK=y

I built that using "make-kpkg"

make-kpkg clean
CONCURRENCY_LEVEL=4 MAKEFLAGS="CC=gcc-3.4" make-kpkg --revision=20050217 kernel_image
