2004-01-31 16:57:33

by Sergey S. Kostyliov

Subject: 2.6.1 IO lockup on SMP systems

Hello all,

I have experienced lockups on three of my servers running 2.6.1. It doesn't
look like a deadlock: the boxes are still pingable, and all TCP ports which
were in the listen state before a lockup remain in the listen state, but I
can't get any data from these ports. According to sar(1), the systems had not
been overloaded right before a lockup. And there are no log entries in any of
the user services' logs for almost 10 hours after a lockup.

So I think this is an IO lockup. On the other hand, it doesn't look like a bug
in a particular controller driver, because the controllers are different on
each box. And finally, it doesn't look like a bug in a particular io-scheduler,
because two of the boxes were run with "deadline" and one with "as". Of course,
all these assumptions are valid only if all the lockups I have seen have the
same nature.

All three boxes are SMP. Unfortunately, all are remote and aren't attached
to a serial console yet (this is planned in the next couple of weeks).

1) ope
01:02.1 RAID bus controller: Mylex Corporation: Unknown device 0050 (rev 02)
elevator=deadline
.config: http://sysadminday.org.ru/2.6.1-io_lockup/ope/.config
lspci: http://sysadminday.org.ru/2.6.1-io_lockup/ope/lspci
lspci -vvn: http://sysadminday.org.ru/2.6.1-io_lockup/ope/lspci_-vvn

2) white
02:04.0 RAID bus controller: American Megatrends Inc. MegaRAID (rev 02)
elevator=deadline
.config: http://sysadminday.org.ru/2.6.1-io_lockup/white/.config
lspci: http://sysadminday.org.ru/2.6.1-io_lockup/white/lspci
lspci -vvn: http://sysadminday.org.ru/2.6.1-io_lockup/white/lspci_-vvn

3) tiny
02:00.0 Unknown mass storage controller: Compaq Computer Corporation Smart-2/P RAID Controller (rev 03)
03:00.0 Unknown mass storage controller: Compaq Computer Corporation Smart-2/P RAID Controller (rev 03)
elevator=as
.config: http://sysadminday.org.ru/2.6.1-io_lockup/tiny/.config
lspci: http://sysadminday.org.ru/2.6.1-io_lockup/tiny/lspci
lspci -vvn: http://sysadminday.org.ru/2.6.1-io_lockup/tiny/lspci_-vvn

Any hints will be appreciated.

--
Best regards,
Sergey S. Kostyliov <[email protected]>
Public PGP key: http://sysadminday.org.ru/rathamahata.asc


2004-02-01 00:16:33

by Andrew Morton

Subject: Re: 2.6.1 IO lockup on SMP systems

"Sergey S. Kostyliov" <[email protected]> wrote:
>
> I have experienced lockups on three of my servers running 2.6.1. It doesn't
> look like a deadlock: the boxes are still pingable, and all TCP ports which
> were in the listen state before a lockup remain in the listen state, but I
> can't get any data from these ports. According to sar(1), the systems had not
> been overloaded right before a lockup. And there are no log entries in any of
> the user services' logs for almost 10 hours after a lockup.

Please ensure that CONFIG_KALLSYMS is enabled, then generate an all-tasks
backtrace on a locked machine with sysrq-T or `echo t >
/proc/sysrq-trigger'. Then send us the resulting trace.

You may need a serial console to be able to capture all the output.
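
The steps above could be sketched roughly as follows. This is only an
illustration: the helper names has_kallsyms and dump_all_tasks, the .config
path argument and the output file name are all assumptions, not something
taken from this thread. It also assumes root access and a kernel built with
CONFIG_MAGIC_SYSRQ.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given kernel .config enables
# CONFIG_KALLSYMS. The caller supplies the path to the .config that the
# running kernel was built from (an assumption of this sketch).
has_kallsyms() {
    grep -q '^CONFIG_KALLSYMS=y$' "$1"
}

# Hypothetical helper: request an all-tasks backtrace through
# /proc/sysrq-trigger, then copy the kernel log buffer to a file.
# Must run as root. The default file name is illustrative only.
dump_all_tasks() {
    echo t > /proc/sysrq-trigger && dmesg > "${1:-sysrq-t.txt}"
}
```

A call such as `has_kallsyms /usr/src/linux/.config && dump_all_tasks` would
tie the two together (the path is again just an example). Note that dmesg only
shows what still fits in the kernel log buffer, and writing a file may itself
hang if IO is stuck, so a serial console is likely the more reliable way to
capture the full trace.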

Also, it would be useful to know what sort of load the machines are under,
and what filesystems are in use.

Thanks.