2006-01-30 08:51:26

by Brice Figureau

Subject: 2.6.15.1: kernel BUG at drivers/ide/ide-io.c:63!

Hi,

I'm stumbling across the following kernel bug (panic and reboot). This
is on a dual-Xeon server with 2 IDE drives in a RAID 1 MD mirror, running
vanilla 2.6.15.1.

The oops seems very reproducible (it has occurred at least twice, at
exactly the same time) and it seems to be triggered while the disk is
doing a SMART extended self-test (which is scheduled every Saturday at
3AM on this machine).

The previous kernel this server was running was 2.6.14, and it ran
fine; the server began to crash two weeks ago, after I installed
2.6.15.1.

Jan 28 03:22:17 games hda: lost interrupt
Jan 28 03:22:17 games hda: failed barrier write: sector=e6126c4(good=0/bad=8)
Jan 28 03:22:17 games ------------[ cut here ]------------
Jan 28 03:22:17 games kernel BUG at drivers/ide/ide-io.c:63!
Jan 28 03:22:17 games invalid operand: 0000 [#1]
Jan 28 03:22:17 games SMP
Jan 28 03:22:17 games
Jan 28 03:22:17 games Modules linked in: netconsole i2c_i801 i2c_core ipt_LOG ipt_limit ipt_REJECT ipt_state iptable_filter ip_conntrack ip_tables
Jan 28 03:22:17 games
Jan 28 03:22:17 games CPU: 0
Jan 28 03:22:17 games EIP: 0060:[<c02ae894>] Not tainted VLI
Jan 28 03:22:17 games EFLAGS: 00010002 (2.6.15.1-zen)
Jan 28 03:22:17 games EIP is at __ide_end_request+0x10c/0x119
Jan 28 03:22:17 games eax: c04eb594 ebx: e7f26084 ecx: 00000008 edx: 00000086
Jan 28 03:22:17 games esi: 00000000 edi: c04eb594 ebp: c0483de8 esp: c0483dcc
Jan 28 03:22:17 games ds: 007b es: 007b ss: 0068
Jan 28 03:22:17 games Process swapper (pid: 0, threadinfo=c0482000 task=c0419b20)
Jan 28 03:22:17 games
Jan 28 03:22:17 games Stack: 00000000 e7f26084 f7db670c 00000001 00000000 e7f26084 00000008 c0483e1c
Jan 28 03:22:17 games        c02ba1c5 c04eb594 e7f26084 00000000 00000008 00000000

This was captured with the help of netconsole. Unfortunately the oops
seems to be truncated (there is no call trace, as there usually is),
maybe because the machine rebooted due to the command-line argument
panic=60.
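
As an aside, the sector in the "failed barrier write" line is printed in
hex. Below is a minimal Python sketch for decoding that line. The
function name decode_barrier_failure is only illustrative, the 512-byte
sector size is an assumption (the log does not state it), and I am
assuming that good/bad are sector counts.

```python
import re

# Log line copied from the oops above (syslog prefix dropped).
LINE = "hda: failed barrier write: sector=e6126c4(good=0/bad=8)"

# Assumption: 512-byte sectors; the log itself does not say.
SECTOR_SIZE = 512

PATTERN = re.compile(
    r"(\w+): failed barrier write: sector=([0-9a-f]+)\(good=(\d+)/bad=(\d+)\)"
)


def decode_barrier_failure(line):
    """Return (device, sector, good, bad) from a 'failed barrier write' line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a 'failed barrier write' line")
    device, sector_hex, good, bad = match.groups()
    # The sector is printed in hexadecimal, the two counts in decimal.
    return device, int(sector_hex, 16), int(good), int(bad)


device, sector, good, bad = decode_barrier_failure(LINE)
offset = sector * SECTOR_SIZE  # byte offset, valid only under the 512-byte assumption
print(f"{device}: sector {sector} (byte offset {offset}), good={good}, bad={bad}")
```

This only makes the numbers easier to compare with the partition
layout; it says nothing about why the write failed.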

What's also interesting is that after rebooting it crashed again when
the second drive began its SMART extended self-test (scheduled one hour
later). I don't have this other trace because the machine doesn't load
netconsole at reboot, but I suspect it is the same one.

This sounds like what is described at:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0512.2/1844.html
and it seems to be related to write barriers.
What I don't really understand is why a SMART self-test could affect
the kernel, since only the disk is doing the work; the kernel is not
involved in the test.

I haven't had the opportunity to run with the barrier=0 mount option; I
simply disabled the SMART test scheduling to be sure it won't crash
anymore.

As this is a production server, I can't reboot it whenever I want, so I
didn't want to do a binary search to find which kernel versions are
affected.

If someone needs more information, just let me know.

Please CC: me as I'm not subscribed to the list.

Regards,
--
Brice Figureau