In linux 2.5.70, with no patches applied, we've had one BUG of the form:
kernel BUG at include/linux/blkdev.h:407!
This is running a database workload, on an 8-way x86 machine with 4gig
of memory. we've seen this only once, after 6 hours of run time.
Ironically, the BUG occurs during a part of the test that is not
particularly disk I/O intensive. The disk I/O that IS going on at this
time is predominantly sequential writes to a logging device.
no file systems are involved.
------------[ cut here ]------------
kernel BUG at include/linux/blkdev.h:407!
invalid operand: 0000 [#1]
CPU: 4
EIP: 0060:[<c01f3565>] Not tainted
EFLAGS: 00010046
EIP is at DAC960_ProcessRequest+0xb5/0x170
eax: 00000080 ebx: f78f3360 ecx: f59eeb98 edx: f59eeb98
esi: f7fa8174 edi: f05fbc58 ebp: f7fa8000 esp: f05fbc0c
ds: 007b es: 007b ss: 0068
Process kernel (pid: 8281, threadinfo=f05fa000 task=f0737940)
Stack: 01000082 00000001 f7fa8174 f05fbc58 00000002 c01f36d6 f7fa8000
00000001
f7fa8174 c032c8a0 c01e10c2 f7fa8174 00000040 00000000 00209008
f7656e00
f05fbc58 c01e1252 f7fa8174 f05fbc58 f05fbc58 f7656e00 f05fa000
00000000
Call Trace:
[<c01f36d6>] DAC960_RequestFunction+0x26/0x30
[<c01e10c2>] generic_unplug_device+0x52/0x70
[<c01e1252>] blk_run_queues+0x72/0xa0
[<c016a676>] dio_await_one+0x56/0xa0
[<c016a7ad>] dio_await_completion+0x1d/0x40
[<c016b2ee>] direct_io_worker+0x2ce/0x340
[<c016b499>] blockdev_direct_IO+0x139/0x14c
[<c0152a90>] blkdev_get_blocks+0x0/0x90
[<c0152b61>] blkdev_direct_IO+0x41/0x50
[<c0152a90>] blkdev_get_blocks+0x0/0x90
[<c013575c>] generic_file_direct_IO+0x5c/0x80
[<c0134f04>] generic_file_aio_write_nolock+0x3f4/0x9a0
[<c01a9f24>] sys_semtimedop+0x564/0x5b0
[<c0152b61>] blkdev_direct_IO+0x41/0x50
[<c0152a90>] blkdev_get_blocks+0x0/0x90
[<c013575c>] generic_file_direct_IO+0x5c/0x80
[<c01631cd>] update_atime+0x6d/0xc0
[<c0133e71>] __generic_file_aio_read+0x111/0x1e0
[<c013551f>] generic_file_write_nolock+0x6f/0x90
[<c0121edb>] do_softirq+0x6b/0xd0
[<c01169e9>] smp_apic_timer_interrupt+0x149/0x160
[<c011181d>] restore_i387_fxsave+0x5d/0x70
[<c01cdf47>] raw_file_write+0x27/0x30
[<c014c07a>] vfs_write+0xaa/0xe0
[<c014bb50>] default_llseek+0x0/0xd0
[<c014bd65>] sys_llseek+0xb5/0xe0
[<c014c12f>] sys_write+0x2f/0x50
[<c010a953>] syscall_call+0x7/0xb
Code: 0f 0b 97 01 04 93 2a c0 8b 41 04 89 42 04 89 10 89 09 89 49
--------------------------------------------------------------------
The BUG is occuring in DAC960_ProcessRequest, which essentially does:
The DAC960_ProcessRequest
{
elv_next_request
blkdev_dequeue_request
}
It looks as though there was a request on the request queue
when elv_next_request was called, but it was removed before
blkdev_dequeue_request is called. The only scenarios I can
imagine this happening would be if there were some flaw in the
locking code around accesses to the request queue, or if there were
some other kind of memory corruption going on.
Regarding locking,
Called from DAC960_RequestFunction, which is entered and exited
with queue_lock held.
Called from DAC960_BA_InterruptHandler(), with queue_lock held.
I'm looking for some hints on how to track down what's happening, so that
when it happens again, we can get more information.
I'm thinking of starting by stashing a copy of the request structure found
on the request queue, and dumping it out at BUG() time, to see if
examining the data there gives any hint what happened to it.
Dave Olien wrote:
>
> In linux 2.5.70, with no patches applied, we've had one BUG of the form:
>
> kernel BUG at include/linux/blkdev.h:407!
>
The below should fix it.
diff -puN drivers/block/deadline-iosched.c~deadline-hash-removal-fix drivers/block/deadline-iosched.c
--- 25/drivers/block/deadline-iosched.c~deadline-hash-removal-fix 2003-06-04 00:50:36.000000000 -0700
+++ 25-akpm/drivers/block/deadline-iosched.c 2003-06-04 00:50:36.000000000 -0700
@@ -121,6 +121,15 @@ static inline void deadline_del_drq_hash
__deadline_del_drq_hash(drq);
}
+static void
+deadline_remove_merge_hints(request_queue_t *q, struct deadline_rq *drq)
+{
+ deadline_del_drq_hash(drq);
+
+ if (q->last_merge == &drq->request->queuelist)
+ q->last_merge = NULL;
+}
+
static inline void
deadline_add_drq_hash(struct deadline_data *dd, struct deadline_rq *drq)
{
@@ -310,7 +319,7 @@ static void deadline_remove_request(requ
struct deadline_data *dd = q->elevator.elevator_data;
list_del_init(&drq->fifo);
- deadline_del_drq_hash(drq);
+ deadline_remove_merge_hints(q, drq);
deadline_del_drq_rb(dd, drq);
}
}
_
Mary,
How's this for a response to Andrew and lkml?
------------------------------------------------------------------------
Andrew,
After running several iterations of the database workload, it seems
the patch below did indeed eliminate the BUG() we were hitting.
In addition, the patched kernel seems to perform somwhat better.
The workload performance is improved by about 2%, and the standard
deviation in performance between iterations of the work load was reduced
from 137 to 48.
A performance results comparison can be found at
http://www.osdl.org/projects/dbt2dev/results/8way/70AKPM/results.html
Thanks to Mary Meredith <[email protected]> for the performance
measurements and comparisons, and for hitting the BUG() in the first place.
Andrew Morton wrote:
>
> Dave Olien wrote:
> >
> > In linux 2.5.70, with no patches applied, we've had one BUG of the form:
> >
> > kernel BUG at include/linux/blkdev.h:407!
>
>
>
> The below should fix it.
>
>
> diff -puN drivers/block/deadline-iosched.c~deadline-hash-removal-fix drivers/block/deadline-iosched.c
> --- 25/drivers/block/deadline-iosched.c~deadline-hash-removal-fix 2003-06-04 00:50:36.000000000 -0700
> +++ 25-akpm/drivers/block/deadline-iosched.c 2003-06-04 00:50:36.000000000 -0700
> @@ -121,6 +121,15 @@ static inline void deadline_del_drq_hash
> __deadline_del_drq_hash(drq);
> }
>
> +static void
> +deadline_remove_merge_hints(request_queue_t *q, struct deadline_rq *drq)
> +{
> + deadline_del_drq_hash(drq);
> +
> + if (q->last_merge == &drq->request->queuelist)
> + q->last_merge = NULL;
> +}
> +
> static inline void
> deadline_add_drq_hash(struct deadline_data *dd, struct deadline_rq *drq)
> {
> @@ -310,7 +319,7 @@ static void deadline_remove_request(requ
> struct deadline_data *dd = q->elevator.elevator_data;
>
> list_del_init(&drq->fifo);
> - deadline_del_drq_hash(drq);
> + deadline_remove_merge_hints(q, drq);
> deadline_del_drq_rb(dd, drq);
> }
> }
>
> _
>
> ----- End forwarded message -----
>
> --=-OFfR5BGtLWFFgyRX7TNt--