LinuxLists.cc - Kernel hang on MX server

2004-04-14 00:06:50

Subject: Kernel hang on MX server

I have 2 IBM x305's which hang every few days in identical ways. Both are
running postfix (with amavis-new, spamassassin and clamav from
backports.org) and bind9. They are running debian stable.

The IDE controler is: ServerWorks CSB5 IDE
The Kernel is 2.6.4
I have tried 2.4 kernels with ext3. I have also tried 2.6.4 kernels with
ext3, jfs, and xfs, but all three FS show the same errors.
Currently trying ext2 on one of the machines.

The systems is running software raid1 between both hard drives.
/dev/hda1 and /dev/hdc1 make up md0 which is mount on /
/dev/hda3 and /dev/hdc3 make up md1 which used LVM to make /usr /home /var
and /var/spool/postfix

I have also tried mount --bind /postfix /var/spool/postfix to take LVM out
of the picture. That did not help. I have not tried to break the mirror
and only use 1 IDE drive.

Please see the attached files for better formating, and all of the
running commands.

The following processes are in uninterruptible sleep (usually IO) via ps:
root@ares:~# ps axln | grep D
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
1 0 103 1 15 0 0 0 11e11e D ? 0:07
[kjournald]
1 0 248 1 15 0 0 0 11e11e D ? 0:04
[kjournald]
5 0 332 1 15 0 1432 672 86e7cf Ds ? 0:24
/sbin/syslogd
4 102 9574 9572 16 0 2596 1176 109ea5 D ? 0:24 qmgr -l
-t fifo -u -c
4 102 26473 9572 17 0 2544 1112 86e7cf D ? 0:00 cleanup
-z -n pre-cleanup -t unix -u -c -o virtual_alias_maps -o canonical_maps
-o sender_canonical_maps -o recipient_canonical_maps -o masquerade_domains
4 102 26509 9572 16 0 2464 1020 109ea5 D ? 0:00 showq
-t unix -u -c

Decoding the WCHAN:
11e11e ->
c011e0f4 T yield
c011e110 T io_schedule
c011e128 T io_schedule_timeout
c011e144 T sys_sched_get_priority_max

86e7cf ->
not in System.map, but PS gave: log_wait_commit

109ea5 ->
c0109df4 t __check_printsym_format
c0109e00 T __up
c0109e18 T __down
c0109f24 T __down_interruptible

Where PS was run as:
ps -o pid,nwchan,wchan=WIDE-WCHAN-COLUMN -o comm -o state axf

I'll happily provide more information, but not sure what to try next.
Looks like it may be a driver issue, blocking IO writes to the drives.

Later, JOSH

Attachments:

ares-20040413-1617.txt (19.12 kB)

2004-04-14 00:42:08

by Alexander Hoogerhuis

[permalink] [raw]

Subject: Re: Kernel hang on MX server

I have two HP DL360G3's running a very identical load on Gentoo, both
go into doorstop-emulation mode on 2.6.5-mm1 and higher -mm-kernels
withing 24 of running with very low load. I use hardware RAID, and
ha've SCSI, but otherwise its ServerWorks chipsets and ext3.

I've only caught them a few times in seeing the actual oops; at other
times it's happened with a blanked console, but the few I've seen have
been dying while handling an interrupt, and also, the tg3-driver have
figured prominently in the stack trace.

If anyone have have a good idea on how to capture an oops that
instantly freezes the machine, then I'll go through and capture it
again; but for now I'm down on 2.6.3-mm3 and running solidly.

mvh,
A

Josh Logan <[email protected]> writes:

> I have 2 IBM x305's which hang every few days in identical ways. Both are
> running postfix (with amavis-new, spamassassin and clamav from
> backports.org) and bind9. They are running debian stable.
>
> The IDE controler is: ServerWorks CSB5 IDE
> The Kernel is 2.6.4
> I have tried 2.4 kernels with ext3. I have also tried 2.6.4 kernels with
> ext3, jfs, and xfs, but all three FS show the same errors.
> Currently trying ext2 on one of the machines.
>
> The systems is running software raid1 between both hard drives.
> /dev/hda1 and /dev/hdc1 make up md0 which is mount on /
> /dev/hda3 and /dev/hdc3 make up md1 which used LVM to make /usr /home /var
> and /var/spool/postfix
>
> I have also tried mount --bind /postfix /var/spool/postfix to take LVM out
> of the picture. That did not help. I have not tried to break the mirror
> and only use 1 IDE drive.
>
> Please see the attached files for better formating, and all of the
> running commands.
>
> The following processes are in uninterruptible sleep (usually IO) via ps:
> root@ares:~# ps axln | grep D
> F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
> 1 0 103 1 15 0 0 0 11e11e D ? 0:07
> [kjournald]
> 1 0 248 1 15 0 0 0 11e11e D ? 0:04
> [kjournald]
> 5 0 332 1 15 0 1432 672 86e7cf Ds ? 0:24
> /sbin/syslogd
> 4 102 9574 9572 16 0 2596 1176 109ea5 D ? 0:24 qmgr -l
> -t fifo -u -c
> 4 102 26473 9572 17 0 2544 1112 86e7cf D ? 0:00 cleanup
> -z -n pre-cleanup -t unix -u -c -o virtual_alias_maps -o canonical_maps
> -o sender_canonical_maps -o recipient_canonical_maps -o masquerade_domains
> 4 102 26509 9572 16 0 2464 1020 109ea5 D ? 0:00 showq
> -t unix -u -c
>
> Decoding the WCHAN:
> 11e11e ->
> c011e0f4 T yield
> c011e110 T io_schedule
> c011e128 T io_schedule_timeout
> c011e144 T sys_sched_get_priority_max
>
>
> 86e7cf ->
> not in System.map, but PS gave: log_wait_commit
>
>
> 109ea5 ->
> c0109df4 t __check_printsym_format
> c0109e00 T __up
> c0109e18 T __down
> c0109f24 T __down_interruptible
>
> Where PS was run as:
> ps -o pid,nwchan,wchan=WIDE-WCHAN-COLUMN -o comm -o state axf
>
>
> I'll happily provide more information, but not sure what to try next.
> Looks like it may be a driver issue, blocking IO writes to the drives.
>
> Later, JOSH
>

--
Alexander Hoogerhuis | [email protected]
CCNP - CCDP - MCNE - CCSE | +47 908 21 485
"You have zero privacy anyway. Get over it." --Scott McNealy

2004-04-14 03:09:09

by Andrew Morton

[permalink] [raw]

Subject: Re: Kernel hang on MX server

Josh Logan <[email protected]> wrote:
>
> I have 2 IBM x305's which hang every few days in identical ways.

Can you capture a sysrq-T trace? Wait for a hang, do alt-sysrq-T?

Yo may need to arrange for klogd to write to a different disk, or
use a serial console.