2008-01-29 17:54:56

by Andrew Vasquez

[permalink] [raw]
Subject: kernel BUG at drivers/block/cciss.c:1260! (with recent linux-2.6 tree)

Hitting a consistent BUG() with recent Linus' linux-2.6.git:

[ 12.941428] ------------[ cut here ]------------
[ 12.944874] kernel BUG at drivers/block/cciss.c:1260!
[ 12.944874] invalid opcode: 0000 [1] SMP
[ 12.944874] CPU 0
[ 12.944874] Modules linked in:
[ 12.944874] Pid: 0, comm: swapper Not tainted 2.6.24 #43
[ 12.944874] RIP: 0010:[<ffffffff8039e43d>] [<ffffffff8039e43d>] cciss_softirq_done+0xbc/0x1bf
[ 12.944874] RSP: 0018:ffffffff8063aed0 EFLAGS: 00010202
[ 12.944874] RAX: 0000000000000001 RBX: ffff8100cf800010 RCX: ffff81042f1253b0
[ 12.944874] RDX: ffff81042de398f0 RSI: ffff81042de398f0 RDI: 0000000000000001
[ 12.944874] RBP: ffff81042daa0000 R08: ffff81042f1253b0 R09: 0000000000000001
[ 12.944874] R10: 00000000000000fe R11: 0000000000000000 R12: 0000000000000002
[ 12.944874] R13: 0000000000000001 R14: ffff8100cf800000 R15: ffff81042de398f0
[ 12.944874] FS: 0000000000000000(0000) GS:ffffffff805bb000(0000) knlGS:0000000000000000
[ 12.944874] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 12.944874] CR2: 00002afed7eea340 CR3: 000000042dbba000 CR4: 00000000000006e0
[ 12.944874] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 12.944874] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 12.944874] Process swapper (pid: 0, threadinfo ffffffff805f4000, task ffffffff805624a0)
[ 12.944874] Stack: 0000000000000000 ffffffff8063af10 0000000000000001 ffffffff80632d60
[ 12.944874] 0000000000000000 000000000000000a ffffffff805bb900 ffffffff8032038f
[ 12.944874] ffffffff8063af10 ffffffff8063af10 ffffffff805bb940 ffffffff802346b4
[ 12.944874] Call Trace:
[ 12.944874] <IRQ> [<ffffffff8032038f>] blk_done_softirq+0x69/0x78
[ 12.944874] [<ffffffff802346b4>] __do_softirq+0x6f/0xd8
[ 12.944874] [<ffffffff8020c45c>] call_softirq+0x1c/0x30
[ 12.944874] [<ffffffff8020e347>] do_softirq+0x30/0x80
[ 12.944874] [<ffffffff8020e409>] do_IRQ+0x72/0xd9
[ 12.944874] [<ffffffff8020a50a>] mwait_idle+0x0/0x46
[ 12.944874] [<ffffffff8020a3da>] default_idle+0x0/0x3d
[ 12.944874] [<ffffffff8020b7e1>] ret_from_intr+0x0/0xa
[ 12.944874] <EOI> [<ffffffff8020a54c>] mwait_idle+0x42/0x46
[ 12.944874] [<ffffffff8020a481>] cpu_idle+0x6a/0xae
[ 12.944874]
[ 12.944874]
[ 12.944874] Code: 0f 0b eb fe 48 8d 85 d8 c0 00 00 48 89 04 24 48 89 c7 e8 e5
[ 12.944874] RIP [<ffffffff8039e43d>] cciss_softirq_done+0xbc/0x1bf
[ 12.944874] RSP <ffffffff8063aed0>
[ 12.944903] ---[ end trace e9c631603f90d22f ]---

code in question is in drivers/block/cciss.c:cciss_softirq_done():

...
if (blk_end_request(rq, (rq->errors == 0) ? 0 : -EIO, blk_rq_bytes(rq)))
BUG();

And appears to be a result of a recent merge:

commit f0f0052069989b80d2a3e50c9cd2f2a650bc1aea
Refs: v2.6.24-1949-gf0f0052
Merge: 68fbda7... a65b586...
Author: Linus Torvalds <[email protected]>
Date: Tue Jan 29 08:51:32 2008 +1100

Merge branch 'blk-end-request' of git://git.kernel.dk/linux-2.6-block

Here's the commit which added the blk_end_request() BUG() on:

commit 3daeea29f9348263e0dda89a565074390475bdf8
Refs: v2.6.24-1743-g3daeea2
Author: Kiyoshi Ueda <[email protected]>
Date: Tue Dec 11 17:50:03 2007 -0500

blk_end_request: changing cciss (take 4)

This patch converts cciss to use blk_end_request interfaces.
Related 'uptodate' arguments are converted to 'error'.

cciss is a little bit different from "normal" drivers.
cciss directly calls bio_endio() and disk_stat_add()
when completing request. But those can be replaced with
__end_that_request_first().
After the replacement, request completion procedures of
those drivers become like the following:
o end_that_request_first()
o add_disk_randomness()
o end_that_request_last()
This can be converted to blk_end_request() by following
the rule (a) mentioned in the patch subject
"[PATCH 01/30] blk_end_request: add new request completion interface".

Cc: Mike Miller <[email protected]>
Signed-off-by: Kiyoshi Ueda <[email protected]>
Signed-off-by: Jun'ichi Nomura <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

--
av


2008-01-29 18:02:55

by Jens Axboe

[permalink] [raw]
Subject: Re: kernel BUG at drivers/block/cciss.c:1260! (with recent linux-2.6 tree)

On Tue, Jan 29 2008, Andrew Vasquez wrote:
> Hitting a consistent BUG() with recent Linus' linux-2.6.git:
>
> [ 12.941428] ------------[ cut here ]------------
> [ 12.944874] kernel BUG at drivers/block/cciss.c:1260!
> [ 12.944874] invalid opcode: 0000 [1] SMP
> [ 12.944874] CPU 0
> [ 12.944874] Modules linked in:
> [ 12.944874] Pid: 0, comm: swapper Not tainted 2.6.24 #43
> [ 12.944874] RIP: 0010:[<ffffffff8039e43d>] [<ffffffff8039e43d>] cciss_softirq_done+0xbc/0x1bf
> [ 12.944874] RSP: 0018:ffffffff8063aed0 EFLAGS: 00010202
> [ 12.944874] RAX: 0000000000000001 RBX: ffff8100cf800010 RCX: ffff81042f1253b0
> [ 12.944874] RDX: ffff81042de398f0 RSI: ffff81042de398f0 RDI: 0000000000000001
> [ 12.944874] RBP: ffff81042daa0000 R08: ffff81042f1253b0 R09: 0000000000000001
> [ 12.944874] R10: 00000000000000fe R11: 0000000000000000 R12: 0000000000000002
> [ 12.944874] R13: 0000000000000001 R14: ffff8100cf800000 R15: ffff81042de398f0
> [ 12.944874] FS: 0000000000000000(0000) GS:ffffffff805bb000(0000) knlGS:0000000000000000
> [ 12.944874] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> [ 12.944874] CR2: 00002afed7eea340 CR3: 000000042dbba000 CR4: 00000000000006e0
> [ 12.944874] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 12.944874] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 12.944874] Process swapper (pid: 0, threadinfo ffffffff805f4000, task ffffffff805624a0)
> [ 12.944874] Stack: 0000000000000000 ffffffff8063af10 0000000000000001 ffffffff80632d60
> [ 12.944874] 0000000000000000 000000000000000a ffffffff805bb900 ffffffff8032038f
> [ 12.944874] ffffffff8063af10 ffffffff8063af10 ffffffff805bb940 ffffffff802346b4
> [ 12.944874] Call Trace:
> [ 12.944874] <IRQ> [<ffffffff8032038f>] blk_done_softirq+0x69/0x78
> [ 12.944874] [<ffffffff802346b4>] __do_softirq+0x6f/0xd8
> [ 12.944874] [<ffffffff8020c45c>] call_softirq+0x1c/0x30
> [ 12.944874] [<ffffffff8020e347>] do_softirq+0x30/0x80
> [ 12.944874] [<ffffffff8020e409>] do_IRQ+0x72/0xd9
> [ 12.944874] [<ffffffff8020a50a>] mwait_idle+0x0/0x46
> [ 12.944874] [<ffffffff8020a3da>] default_idle+0x0/0x3d
> [ 12.944874] [<ffffffff8020b7e1>] ret_from_intr+0x0/0xa
> [ 12.944874] <EOI> [<ffffffff8020a54c>] mwait_idle+0x42/0x46
> [ 12.944874] [<ffffffff8020a481>] cpu_idle+0x6a/0xae
> [ 12.944874]
> [ 12.944874]
> [ 12.944874] Code: 0f 0b eb fe 48 8d 85 d8 c0 00 00 48 89 04 24 48 89 c7 e8 e5
> [ 12.944874] RIP [<ffffffff8039e43d>] cciss_softirq_done+0xbc/0x1bf
> [ 12.944874] RSP <ffffffff8063aed0>
> [ 12.944903] ---[ end trace e9c631603f90d22f ]---

Thanks, will take a look.

--
Jens Axboe

2008-01-29 18:05:52

by Jens Axboe

[permalink] [raw]
Subject: Re: kernel BUG at drivers/block/cciss.c:1260! (with recent linux-2.6 tree)

On Tue, Jan 29 2008, Jens Axboe wrote:
> On Tue, Jan 29 2008, Andrew Vasquez wrote:
> > Hitting a consistent BUG() with recent Linus' linux-2.6.git:
> >
> > [ 12.941428] ------------[ cut here ]------------
> > [ 12.944874] kernel BUG at drivers/block/cciss.c:1260!
> > [ 12.944874] invalid opcode: 0000 [1] SMP
> > [ 12.944874] CPU 0
> > [ 12.944874] Modules linked in:
> > [ 12.944874] Pid: 0, comm: swapper Not tainted 2.6.24 #43
> > [ 12.944874] RIP: 0010:[<ffffffff8039e43d>] [<ffffffff8039e43d>] cciss_softirq_done+0xbc/0x1bf
> > [ 12.944874] RSP: 0018:ffffffff8063aed0 EFLAGS: 00010202
> > [ 12.944874] RAX: 0000000000000001 RBX: ffff8100cf800010 RCX: ffff81042f1253b0
> > [ 12.944874] RDX: ffff81042de398f0 RSI: ffff81042de398f0 RDI: 0000000000000001
> > [ 12.944874] RBP: ffff81042daa0000 R08: ffff81042f1253b0 R09: 0000000000000001
> > [ 12.944874] R10: 00000000000000fe R11: 0000000000000000 R12: 0000000000000002
> > [ 12.944874] R13: 0000000000000001 R14: ffff8100cf800000 R15: ffff81042de398f0
> > [ 12.944874] FS: 0000000000000000(0000) GS:ffffffff805bb000(0000) knlGS:0000000000000000
> > [ 12.944874] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> > [ 12.944874] CR2: 00002afed7eea340 CR3: 000000042dbba000 CR4: 00000000000006e0
> > [ 12.944874] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 12.944874] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > [ 12.944874] Process swapper (pid: 0, threadinfo ffffffff805f4000, task ffffffff805624a0)
> > [ 12.944874] Stack: 0000000000000000 ffffffff8063af10 0000000000000001 ffffffff80632d60
> > [ 12.944874] 0000000000000000 000000000000000a ffffffff805bb900 ffffffff8032038f
> > [ 12.944874] ffffffff8063af10 ffffffff8063af10 ffffffff805bb940 ffffffff802346b4
> > [ 12.944874] Call Trace:
> > [ 12.944874] <IRQ> [<ffffffff8032038f>] blk_done_softirq+0x69/0x78
> > [ 12.944874] [<ffffffff802346b4>] __do_softirq+0x6f/0xd8
> > [ 12.944874] [<ffffffff8020c45c>] call_softirq+0x1c/0x30
> > [ 12.944874] [<ffffffff8020e347>] do_softirq+0x30/0x80
> > [ 12.944874] [<ffffffff8020e409>] do_IRQ+0x72/0xd9
> > [ 12.944874] [<ffffffff8020a50a>] mwait_idle+0x0/0x46
> > [ 12.944874] [<ffffffff8020a3da>] default_idle+0x0/0x3d
> > [ 12.944874] [<ffffffff8020b7e1>] ret_from_intr+0x0/0xa
> > [ 12.944874] <EOI> [<ffffffff8020a54c>] mwait_idle+0x42/0x46
> > [ 12.944874] [<ffffffff8020a481>] cpu_idle+0x6a/0xae
> > [ 12.944874]
> > [ 12.944874]
> > [ 12.944874] Code: 0f 0b eb fe 48 8d 85 d8 c0 00 00 48 89 04 24 48 89 c7 e8 e5
> > [ 12.944874] RIP [<ffffffff8039e43d>] cciss_softirq_done+0xbc/0x1bf
> > [ 12.944874] RSP <ffffffff8063aed0>
> > [ 12.944903] ---[ end trace e9c631603f90d22f ]---
>
> Thanks, will take a look.

Andrew, can you try with this applied?

diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
index ef50068..bd7b352 100644
--- a/drivers/block/cciss.c
+++ b/drivers/block/cciss.c
@@ -1257,7 +1257,7 @@ static void cciss_softirq_done(struct request *rq)
#endif /* CCISS_DEBUG */

if (blk_end_request(rq, (rq->errors == 0) ? 0 : -EIO, blk_rq_bytes(rq)))
- BUG();
+ blk_dump_rq_flags(rq, "cciss rq");

spin_lock_irqsave(&h->lock, flags);
cmd_free(h, cmd, 1);

--
Jens Axboe

2008-01-29 18:22:38

by Andrew Vasquez

[permalink] [raw]
Subject: Re: kernel BUG at drivers/block/cciss.c:1260! (with recent linux-2.6 tree)

On Tue, 29 Jan 2008, Jens Axboe wrote:

> Andrew, can you try with this applied?
>
> diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> index ef50068..bd7b352 100644
> --- a/drivers/block/cciss.c
> +++ b/drivers/block/cciss.c
> @@ -1257,7 +1257,7 @@ static void cciss_softirq_done(struct request *rq)
> #endif /* CCISS_DEBUG */
>
> if (blk_end_request(rq, (rq->errors == 0) ? 0 : -EIO, blk_rq_bytes(rq)))
> - BUG();
> + blk_dump_rq_flags(rq, "cciss rq");
>
> spin_lock_irqsave(&h->lock, flags);
> cmd_free(h, cmd, 1);

Here the final snippet that was logged:

[ 12.724997] input: USB HID v1.01 Mouse [HP Virtual Keyboard] on usb-0000:01:04.4-1
[ 12.728971] usbcore: registered new interface driver usbhid
[ 12.732866] drivers/hid/usbhid/hid-core.c: v2.6:USB HID core driver
[ 12.741172] TCP cubic registered
[ 12.744506] NET: Registered protocol family 1
[ 12.744884] NET: Registered protocol family 17
[ 12.749217] Freeing unused kernel memory: 228k freed
[ 12.885823] cciss rq: dev cciss/c0d0: type=2, flags=104c8
[ 12.888929]
[ 12.888930] sector 6510615555426900570, nr/cnr 0/0
[ 12.892895] bio ffff81042f130730, biotail ffff81042f130730, buffer 0000000000000000, data 0000000000000000, len 0
[ 12.896895] cdb: 12 00 00 00 fe 00 00 00 00 00 00 00 00 00 00 00

2008-01-29 18:28:47

by Jens Axboe

[permalink] [raw]
Subject: Re: kernel BUG at drivers/block/cciss.c:1260! (with recent linux-2.6 tree)

On Tue, Jan 29 2008, Andrew Vasquez wrote:
> On Tue, 29 Jan 2008, Jens Axboe wrote:
>
> > Andrew, can you try with this applied?
> >
> > diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> > index ef50068..bd7b352 100644
> > --- a/drivers/block/cciss.c
> > +++ b/drivers/block/cciss.c
> > @@ -1257,7 +1257,7 @@ static void cciss_softirq_done(struct request *rq)
> > #endif /* CCISS_DEBUG */
> >
> > if (blk_end_request(rq, (rq->errors == 0) ? 0 : -EIO, blk_rq_bytes(rq)))
> > - BUG();
> > + blk_dump_rq_flags(rq, "cciss rq");
> >
> > spin_lock_irqsave(&h->lock, flags);
> > cmd_free(h, cmd, 1);
>
> Here the final snippet that was logged:
>
> [ 12.724997] input: USB HID v1.01 Mouse [HP Virtual Keyboard] on usb-0000:01:04.4-1
> [ 12.728971] usbcore: registered new interface driver usbhid
> [ 12.732866] drivers/hid/usbhid/hid-core.c: v2.6:USB HID core driver
> [ 12.741172] TCP cubic registered
> [ 12.744506] NET: Registered protocol family 1
> [ 12.744884] NET: Registered protocol family 17
> [ 12.749217] Freeing unused kernel memory: 228k freed
> [ 12.885823] cciss rq: dev cciss/c0d0: type=2, flags=104c8
> [ 12.888929]
> [ 12.888930] sector 6510615555426900570, nr/cnr 0/0
> [ 12.892895] bio ffff81042f130730, biotail ffff81042f130730, buffer 0000000000000000, data 0000000000000000, len 0
> [ 12.896895] cdb: 12 00 00 00 fe 00 00 00 00 00 00 00 00 00 00 00

Ah ok, I see the problem... cciss is overriding the data_len for
BLOCK_PC requests, hence it does not complete them properly. Hmm. Does
this work?

diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
index ef50068..b6fa52e 100644
--- a/drivers/block/cciss.c
+++ b/drivers/block/cciss.c
@@ -2524,7 +2524,6 @@ after_error_processing:
resend_cciss_cmd(h, cmd);
return;
}
- cmd->rq->data_len = 0;
cmd->rq->completion_data = cmd;
blk_complete_request(cmd->rq);
}

--
Jens Axboe

2008-01-29 18:38:40

by Andrew Vasquez

[permalink] [raw]
Subject: Re: kernel BUG at drivers/block/cciss.c:1260! (with recent linux-2.6 tree)

On Tue, 29 Jan 2008, Jens Axboe wrote:

> > Here the final snippet that was logged:
> >
> > [ 12.724997] input: USB HID v1.01 Mouse [HP Virtual Keyboard] on usb-0000:01:04.4-1
> > [ 12.728971] usbcore: registered new interface driver usbhid
> > [ 12.732866] drivers/hid/usbhid/hid-core.c: v2.6:USB HID core driver
> > [ 12.741172] TCP cubic registered
> > [ 12.744506] NET: Registered protocol family 1
> > [ 12.744884] NET: Registered protocol family 17
> > [ 12.749217] Freeing unused kernel memory: 228k freed
> > [ 12.885823] cciss rq: dev cciss/c0d0: type=2, flags=104c8
> > [ 12.888929]
> > [ 12.888930] sector 6510615555426900570, nr/cnr 0/0
> > [ 12.892895] bio ffff81042f130730, biotail ffff81042f130730, buffer 0000000000000000, data 0000000000000000, len 0
> > [ 12.896895] cdb: 12 00 00 00 fe 00 00 00 00 00 00 00 00 00 00 00
>
> Ah ok, I see the problem... cciss is overriding the data_len for
> BLOCK_PC requests, hence it does not complete them properly. Hmm. Does
> this work?
>
> diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> index ef50068..b6fa52e 100644
> --- a/drivers/block/cciss.c
> +++ b/drivers/block/cciss.c
> @@ -2524,7 +2524,6 @@ after_error_processing:
> resend_cciss_cmd(h, cmd);
> return;
> }
> - cmd->rq->data_len = 0;
> cmd->rq->completion_data = cmd;
> blk_complete_request(cmd->rq);
> }


Things look good so far -- with the patch above I can finally boot the
machine.

Thanks, av

2008-01-29 18:44:56

by Jens Axboe

[permalink] [raw]
Subject: Re: kernel BUG at drivers/block/cciss.c:1260! (with recent linux-2.6 tree)

On Tue, Jan 29 2008, Andrew Vasquez wrote:
> On Tue, 29 Jan 2008, Jens Axboe wrote:
>
> > > Here the final snippet that was logged:
> > >
> > > [ 12.724997] input: USB HID v1.01 Mouse [HP Virtual Keyboard] on usb-0000:01:04.4-1
> > > [ 12.728971] usbcore: registered new interface driver usbhid
> > > [ 12.732866] drivers/hid/usbhid/hid-core.c: v2.6:USB HID core driver
> > > [ 12.741172] TCP cubic registered
> > > [ 12.744506] NET: Registered protocol family 1
> > > [ 12.744884] NET: Registered protocol family 17
> > > [ 12.749217] Freeing unused kernel memory: 228k freed
> > > [ 12.885823] cciss rq: dev cciss/c0d0: type=2, flags=104c8
> > > [ 12.888929]
> > > [ 12.888930] sector 6510615555426900570, nr/cnr 0/0
> > > [ 12.892895] bio ffff81042f130730, biotail ffff81042f130730, buffer 0000000000000000, data 0000000000000000, len 0
> > > [ 12.896895] cdb: 12 00 00 00 fe 00 00 00 00 00 00 00 00 00 00 00
> >
> > Ah ok, I see the problem... cciss is overriding the data_len for
> > BLOCK_PC requests, hence it does not complete them properly. Hmm. Does
> > this work?
> >
> > diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> > index ef50068..b6fa52e 100644
> > --- a/drivers/block/cciss.c
> > +++ b/drivers/block/cciss.c
> > @@ -2524,7 +2524,6 @@ after_error_processing:
> > resend_cciss_cmd(h, cmd);
> > return;
> > }
> > - cmd->rq->data_len = 0;
> > cmd->rq->completion_data = cmd;
> > blk_complete_request(cmd->rq);
> > }
>
>
> Things look good so far -- with the patch above I can finally boot the
> machine.

Cool, sorry about that. Will get that applied asap. So after this patch
was applied, you didn't see any debug messages from blk_dump_rq_flags()
anymore, right?

--
Jens Axboe

2008-01-29 18:49:50

by Andrew Vasquez

[permalink] [raw]
Subject: Re: kernel BUG at drivers/block/cciss.c:1260! (with recent linux-2.6 tree)

On Tue, 29 Jan 2008, Jens Axboe wrote:

> On Tue, Jan 29 2008, Andrew Vasquez wrote:
> > On Tue, 29 Jan 2008, Jens Axboe wrote:
> >
> > > > Here the final snippet that was logged:
> > > >
> > > > [ 12.724997] input: USB HID v1.01 Mouse [HP Virtual Keyboard] on usb-0000:01:04.4-1
> > > > [ 12.728971] usbcore: registered new interface driver usbhid
> > > > [ 12.732866] drivers/hid/usbhid/hid-core.c: v2.6:USB HID core driver
> > > > [ 12.741172] TCP cubic registered
> > > > [ 12.744506] NET: Registered protocol family 1
> > > > [ 12.744884] NET: Registered protocol family 17
> > > > [ 12.749217] Freeing unused kernel memory: 228k freed
> > > > [ 12.885823] cciss rq: dev cciss/c0d0: type=2, flags=104c8
> > > > [ 12.888929]
> > > > [ 12.888930] sector 6510615555426900570, nr/cnr 0/0
> > > > [ 12.892895] bio ffff81042f130730, biotail ffff81042f130730, buffer 0000000000000000, data 0000000000000000, len 0
> > > > [ 12.896895] cdb: 12 00 00 00 fe 00 00 00 00 00 00 00 00 00 00 00
> > >
> > > Ah ok, I see the problem... cciss is overriding the data_len for
> > > BLOCK_PC requests, hence it does not complete them properly. Hmm. Does
> > > this work?
> > >
> > > diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> > > index ef50068..b6fa52e 100644
> > > --- a/drivers/block/cciss.c
> > > +++ b/drivers/block/cciss.c
> > > @@ -2524,7 +2524,6 @@ after_error_processing:
> > > resend_cciss_cmd(h, cmd);
> > > return;
> > > }
> > > - cmd->rq->data_len = 0;
> > > cmd->rq->completion_data = cmd;
> > > blk_complete_request(cmd->rq);
> > > }
> >
> >
> > Things look good so far -- with the patch above I can finally boot the
> > machine.
>
> Cool, sorry about that. Will get that applied asap. So after this patch
> was applied, you didn't see any debug messages from blk_dump_rq_flags()
> anymore, right?

That's correct. I've yet to see any additional debug-messages from
blk_dump_rq_flags().

--
av

2008-01-29 18:54:18

by Jens Axboe

[permalink] [raw]
Subject: Re: kernel BUG at drivers/block/cciss.c:1260! (with recent linux-2.6 tree)

On Tue, Jan 29 2008, Andrew Vasquez wrote:
> On Tue, 29 Jan 2008, Jens Axboe wrote:
>
> > On Tue, Jan 29 2008, Andrew Vasquez wrote:
> > > On Tue, 29 Jan 2008, Jens Axboe wrote:
> > >
> > > > > Here the final snippet that was logged:
> > > > >
> > > > > [ 12.724997] input: USB HID v1.01 Mouse [HP Virtual Keyboard] on usb-0000:01:04.4-1
> > > > > [ 12.728971] usbcore: registered new interface driver usbhid
> > > > > [ 12.732866] drivers/hid/usbhid/hid-core.c: v2.6:USB HID core driver
> > > > > [ 12.741172] TCP cubic registered
> > > > > [ 12.744506] NET: Registered protocol family 1
> > > > > [ 12.744884] NET: Registered protocol family 17
> > > > > [ 12.749217] Freeing unused kernel memory: 228k freed
> > > > > [ 12.885823] cciss rq: dev cciss/c0d0: type=2, flags=104c8
> > > > > [ 12.888929]
> > > > > [ 12.888930] sector 6510615555426900570, nr/cnr 0/0
> > > > > [ 12.892895] bio ffff81042f130730, biotail ffff81042f130730, buffer 0000000000000000, data 0000000000000000, len 0
> > > > > [ 12.896895] cdb: 12 00 00 00 fe 00 00 00 00 00 00 00 00 00 00 00
> > > >
> > > > Ah ok, I see the problem... cciss is overriding the data_len for
> > > > BLOCK_PC requests, hence it does not complete them properly. Hmm. Does
> > > > this work?
> > > >
> > > > diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> > > > index ef50068..b6fa52e 100644
> > > > --- a/drivers/block/cciss.c
> > > > +++ b/drivers/block/cciss.c
> > > > @@ -2524,7 +2524,6 @@ after_error_processing:
> > > > resend_cciss_cmd(h, cmd);
> > > > return;
> > > > }
> > > > - cmd->rq->data_len = 0;
> > > > cmd->rq->completion_data = cmd;
> > > > blk_complete_request(cmd->rq);
> > > > }
> > >
> > >
> > > Things look good so far -- with the patch above I can finally boot the
> > > machine.
> >
> > Cool, sorry about that. Will get that applied asap. So after this patch
> > was applied, you didn't see any debug messages from blk_dump_rq_flags()
> > anymore, right?
>
> That's correct. I've yet to see any additional debug-messages from
> blk_dump_rq_flags().

Great, thanks for confirming. It does look like a clear bug in cciss, it
just got exposed now that it uses proper end request handling. We never
need to clear ->data_len, since for blk_fs_request() it will be cleared
on init. So just setting a residual count there for blk_fs_request()
like cciss does is fine.

Anyway, it's in my pending queue for Linus.

--
Jens Axboe

2008-01-29 19:04:44

by Mike Miller

[permalink] [raw]
Subject: RE: kernel BUG at drivers/block/cciss.c:1260! (with recent linux-2.6 tree)

Jens wrote:

> -----Original Message-----
> From: Jens Axboe [mailto:[email protected]]
> Sent: Tuesday, January 29, 2008 12:54 PM
> To: Andrew Vasquez
> Cc: Linux Kernel Mailing List; Miller, Mike (OS Dev);
> [email protected]; [email protected]
> Subject: Re: kernel BUG at drivers/block/cciss.c:1260! (with
> recent linux-2.6 tree)
>
> On Tue, Jan 29 2008, Andrew Vasquez wrote:
> > On Tue, 29 Jan 2008, Jens Axboe wrote:
> >
> > > On Tue, Jan 29 2008, Andrew Vasquez wrote:
> > > > On Tue, 29 Jan 2008, Jens Axboe wrote:
> > > >
> > > > > > Here the final snippet that was logged:
> > > > > >
> > > > > > [ 12.724997] input: USB HID v1.01 Mouse [HP
> Virtual Keyboard] on usb-0000:01:04.4-1
> > > > > > [ 12.728971] usbcore: registered new interface
> driver usbhid
> > > > > > [ 12.732866] drivers/hid/usbhid/hid-core.c:
> v2.6:USB HID core driver
> > > > > > [ 12.741172] TCP cubic registered
> > > > > > [ 12.744506] NET: Registered protocol family 1
> > > > > > [ 12.744884] NET: Registered protocol family 17
> > > > > > [ 12.749217] Freeing unused kernel memory: 228k freed
> > > > > > [ 12.885823] cciss rq: dev cciss/c0d0: type=2, flags=104c8
> > > > > > [ 12.888929]
> > > > > > [ 12.888930] sector 6510615555426900570, nr/cnr 0/0
> > > > > > [ 12.892895] bio ffff81042f130730, biotail
> ffff81042f130730, buffer 0000000000000000, data
> 0000000000000000, len 0
> > > > > > [ 12.896895] cdb: 12 00 00 00 fe 00 00 00 00 00
> 00 00 00 00 00 00
> > > > >
> > > > > Ah ok, I see the problem... cciss is overriding the
> data_len for
> > > > > BLOCK_PC requests, hence it does not complete them properly.
> > > > > Hmm. Does this work?
> > > > >
> > > > > diff --git a/drivers/block/cciss.c
> b/drivers/block/cciss.c index
> > > > > ef50068..b6fa52e 100644
> > > > > --- a/drivers/block/cciss.c
> > > > > +++ b/drivers/block/cciss.c
> > > > > @@ -2524,7 +2524,6 @@ after_error_processing:
> > > > > resend_cciss_cmd(h, cmd);
> > > > > return;
> > > > > }
> > > > > - cmd->rq->data_len = 0;
> > > > > cmd->rq->completion_data = cmd;
> > > > > blk_complete_request(cmd->rq); }
> > > >
> > > >
> > > > Things look good so far -- with the patch above I can
> finally boot
> > > > the machine.
> > >
> > > Cool, sorry about that. Will get that applied asap. So after this
> > > patch was applied, you didn't see any debug messages from
> > > blk_dump_rq_flags() anymore, right?
> >
> > That's correct. I've yet to see any additional debug-messages from
> > blk_dump_rq_flags().
>
> Great, thanks for confirming. It does look like a clear bug
> in cciss, it just got exposed now that it uses proper end
> request handling. We never need to clear ->data_len, since
> for blk_fs_request() it will be cleared on init. So just
> setting a residual count there for blk_fs_request() like
> cciss does is fine.

Just so I'm clear: just removing the one line is enough to resolve the problem?

-- mikem


>
> Anyway, it's in my pending queue for Linus.
>
> --
> Jens Axboe
>
>

2008-01-29 19:07:54

by Jens Axboe

[permalink] [raw]
Subject: Re: kernel BUG at drivers/block/cciss.c:1260! (with recent linux-2.6 tree)

On Tue, Jan 29 2008, Miller, Mike (OS Dev) wrote:
> Jens wrote:
>
> > -----Original Message-----
> > From: Jens Axboe [mailto:[email protected]]
> > Sent: Tuesday, January 29, 2008 12:54 PM
> > To: Andrew Vasquez
> > Cc: Linux Kernel Mailing List; Miller, Mike (OS Dev);
> > [email protected]; [email protected]
> > Subject: Re: kernel BUG at drivers/block/cciss.c:1260! (with
> > recent linux-2.6 tree)
> >
> > On Tue, Jan 29 2008, Andrew Vasquez wrote:
> > > On Tue, 29 Jan 2008, Jens Axboe wrote:
> > >
> > > > On Tue, Jan 29 2008, Andrew Vasquez wrote:
> > > > > On Tue, 29 Jan 2008, Jens Axboe wrote:
> > > > >
> > > > > > > Here the final snippet that was logged:
> > > > > > >
> > > > > > > [ 12.724997] input: USB HID v1.01 Mouse [HP
> > Virtual Keyboard] on usb-0000:01:04.4-1
> > > > > > > [ 12.728971] usbcore: registered new interface
> > driver usbhid
> > > > > > > [ 12.732866] drivers/hid/usbhid/hid-core.c:
> > v2.6:USB HID core driver
> > > > > > > [ 12.741172] TCP cubic registered
> > > > > > > [ 12.744506] NET: Registered protocol family 1
> > > > > > > [ 12.744884] NET: Registered protocol family 17
> > > > > > > [ 12.749217] Freeing unused kernel memory: 228k freed
> > > > > > > [ 12.885823] cciss rq: dev cciss/c0d0: type=2, flags=104c8
> > > > > > > [ 12.888929]
> > > > > > > [ 12.888930] sector 6510615555426900570, nr/cnr 0/0
> > > > > > > [ 12.892895] bio ffff81042f130730, biotail
> > ffff81042f130730, buffer 0000000000000000, data
> > 0000000000000000, len 0
> > > > > > > [ 12.896895] cdb: 12 00 00 00 fe 00 00 00 00 00
> > 00 00 00 00 00 00
> > > > > >
> > > > > > Ah ok, I see the problem... cciss is overriding the
> > data_len for
> > > > > > BLOCK_PC requests, hence it does not complete them properly.
> > > > > > Hmm. Does this work?
> > > > > >
> > > > > > diff --git a/drivers/block/cciss.c
> > b/drivers/block/cciss.c index
> > > > > > ef50068..b6fa52e 100644
> > > > > > --- a/drivers/block/cciss.c
> > > > > > +++ b/drivers/block/cciss.c
> > > > > > @@ -2524,7 +2524,6 @@ after_error_processing:
> > > > > > resend_cciss_cmd(h, cmd);
> > > > > > return;
> > > > > > }
> > > > > > - cmd->rq->data_len = 0;
> > > > > > cmd->rq->completion_data = cmd;
> > > > > > blk_complete_request(cmd->rq); }
> > > > >
> > > > >
> > > > > Things look good so far -- with the patch above I can
> > finally boot
> > > > > the machine.
> > > >
> > > > Cool, sorry about that. Will get that applied asap. So after this
> > > > patch was applied, you didn't see any debug messages from
> > > > blk_dump_rq_flags() anymore, right?
> > >
> > > That's correct. I've yet to see any additional debug-messages from
> > > blk_dump_rq_flags().
> >
> > Great, thanks for confirming. It does look like a clear bug
> > in cciss, it just got exposed now that it uses proper end
> > request handling. We never need to clear ->data_len, since
> > for blk_fs_request() it will be cleared on init. So just
> > setting a residual count there for blk_fs_request() like
> > cciss does is fine.
>
> Just so I'm clear: just removing the one line is enough to resolve the
> problem?

Yep, that's it. The line should never have been there (it accomplishes
nothing for the old end io handling) and with the newer end io handling
it breaks when it uses the block completion functions instead of the
home grown in cciss.

--
Jens Axboe

2008-01-29 19:11:29

by Andrew Vasquez

[permalink] [raw]
Subject: Re: kernel BUG at drivers/block/cciss.c:1260! (with recent linux-2.6 tree)

On Tue, 29 Jan 2008, Miller, Mike (OS Dev) wrote:

> Jens wrote:
>
> > -----Original Message-----
> > From: Jens Axboe [mailto:[email protected]]
> > Sent: Tuesday, January 29, 2008 12:54 PM
> > To: Andrew Vasquez
> > Cc: Linux Kernel Mailing List; Miller, Mike (OS Dev);
> > [email protected]; [email protected]
> > Subject: Re: kernel BUG at drivers/block/cciss.c:1260! (with
> > recent linux-2.6 tree)
> >
> > On Tue, Jan 29 2008, Andrew Vasquez wrote:
> > > On Tue, 29 Jan 2008, Jens Axboe wrote:
> > >
> > > > On Tue, Jan 29 2008, Andrew Vasquez wrote:
> > > > > On Tue, 29 Jan 2008, Jens Axboe wrote:
> > > > >
> > > > > > > Here the final snippet that was logged:
> > > > > > >
> > > > > > > [ 12.724997] input: USB HID v1.01 Mouse [HP
> > Virtual Keyboard] on usb-0000:01:04.4-1
> > > > > > > [ 12.728971] usbcore: registered new interface
> > driver usbhid
> > > > > > > [ 12.732866] drivers/hid/usbhid/hid-core.c:
> > v2.6:USB HID core driver
> > > > > > > [ 12.741172] TCP cubic registered
> > > > > > > [ 12.744506] NET: Registered protocol family 1
> > > > > > > [ 12.744884] NET: Registered protocol family 17
> > > > > > > [ 12.749217] Freeing unused kernel memory: 228k freed
> > > > > > > [ 12.885823] cciss rq: dev cciss/c0d0: type=2, flags=104c8
> > > > > > > [ 12.888929]
> > > > > > > [ 12.888930] sector 6510615555426900570, nr/cnr 0/0
> > > > > > > [ 12.892895] bio ffff81042f130730, biotail
> > ffff81042f130730, buffer 0000000000000000, data
> > 0000000000000000, len 0
> > > > > > > [ 12.896895] cdb: 12 00 00 00 fe 00 00 00 00 00
> > 00 00 00 00 00 00
> > > > > >
> > > > > > Ah ok, I see the problem... cciss is overriding the
> > data_len for
> > > > > > BLOCK_PC requests, hence it does not complete them properly.
> > > > > > Hmm. Does this work?
> > > > > >
> > > > > > diff --git a/drivers/block/cciss.c
> > b/drivers/block/cciss.c index
> > > > > > ef50068..b6fa52e 100644
> > > > > > --- a/drivers/block/cciss.c
> > > > > > +++ b/drivers/block/cciss.c
> > > > > > @@ -2524,7 +2524,6 @@ after_error_processing:
> > > > > > resend_cciss_cmd(h, cmd);
> > > > > > return;
> > > > > > }
> > > > > > - cmd->rq->data_len = 0;
> > > > > > cmd->rq->completion_data = cmd;
> > > > > > blk_complete_request(cmd->rq); }
> > > > >
> > > > >
> > > > > Things look good so far -- with the patch above I can
> > finally boot
> > > > > the machine.
> > > >
> > > > Cool, sorry about that. Will get that applied asap. So after this
> > > > patch was applied, you didn't see any debug messages from
> > > > blk_dump_rq_flags() anymore, right?
> > >
> > > That's correct. I've yet to see any additional debug-messages from
> > > blk_dump_rq_flags().
> >
> > Great, thanks for confirming. It does look like a clear bug
> > in cciss, it just got exposed now that it uses proper end
> > request handling. We never need to clear ->data_len, since
> > for blk_fs_request() it will be cleared on init. So just
> > setting a residual count there for blk_fs_request() like
> > cciss does is fine.
>
> Just so I'm clear: just removing the one line is enough to resolve the problem?

That's correct. The only other change to cciss.c in my tree is where
the BUG() call was replaced with a call to blk_dump_rq_flags():

@@ -1257,7 +1257,8 @@ static void cciss_softirq_done(struct request *rq)
#endif /* CCISS_DEBUG */

if (blk_end_request(rq, (rq->errors == 0) ? 0 : -EIO, blk_rq_bytes(rq)))
- BUG();
+ blk_dump_rq_flags(rq, "cciss rq");
+// BUG();

spin_lock_irqsave(&h->lock, flags);
cmd_free(h, cmd, 1);

2008-01-29 21:08:59

by Andrew Vasquez

[permalink] [raw]
Subject: Re: kernel BUG at drivers/block/cciss.c:1260! (with recent linux-2.6 tree)

On Tue, 29 Jan 2008, Jens Axboe wrote:

> Great, thanks for confirming. It does look like a clear bug in cciss, it
> just got exposed now that it uses proper end request handling. We never
> need to clear ->data_len, since for blk_fs_request() it will be cleared
> on init. So just setting a residual count there for blk_fs_request()
> like cciss does is fine.
>
> Anyway, it's in my pending queue for Linus.
>


Hmm, probably not related to the block changes in your tree, but I'm
seeing yet another problem after working (compile jobs) the machine:

[ 61.423922] BUG: spinlock recursion on CPU#2, kjournald/2317
[ 61.427843] lock: ffff81042c5a4988, .magic: dead4ead, .owner: kjournald/2317, .owner_cpu: 2
[ 61.427843] Pid: 2317, comm: kjournald Not tainted 2.6.24 #45
[ 61.427843]
[ 61.427843] Call Trace:
[ 61.427843] [<ffffffff803332e1>] _raw_spin_lock+0xe9/0x12a
[ 61.427843] [<ffffffff80324ccf>] as_merged_requests+0xfe/0x115
[ 61.427843] [<ffffffff8031b558>] elv_merge_requests+0x1f/0x45
[ 61.427843] [<ffffffff8031e6f7>] attempt_merge+0x281/0x347
[ 61.427843] [<ffffffff8031f153>] __make_request+0x1e6/0x598
[ 61.427843] [<ffffffff8031d6ea>] generic_make_request+0x1c8/0x276
[ 61.427843] [<ffffffff8031d7f9>] submit_bio+0x61/0xdb
[ 61.427843] [<ffffffff8029b0d2>] submit_bh+0xe2/0x118
[ 61.427843] [<ffffffff802f69f3>] journal_do_submit_data+0x28/0x39
[ 61.427843] [<ffffffff802f77da>] journal_commit_transaction+0xdbe/0x1394
[ 61.427843] [<ffffffff802381a8>] lock_timer_base+0x26/0x4e
[ 61.427843] [<ffffffff802fb85f>] kjournald+0x104/0x373
[ 61.427843] [<ffffffff80242087>] autoremove_wake_function+0x0/0x2e
[ 61.427843] [<ffffffff802fb75b>] kjournald+0x0/0x373
[ 61.427843] [<ffffffff80241cd4>] kthread+0x3d/0x61
[ 61.427843] [<ffffffff8020c0e8>] child_rip+0xa/0x12
[ 61.427843] [<ffffffff80241c97>] kthread+0x0/0x61
[ 61.427843] [<ffffffff8020c0de>] child_rip+0x0/0x12
[ 61.427843]
[ 124.555789] BUG: soft lockup - CPU#6 stuck for 61s! [as:7191]
[ 124.555789] CPU 6:
[ 124.555789] Modules linked in:
[ 124.555789] Pid: 7191, comm: as Not tainted 2.6.24 #45
[ 124.555789] RIP: 0010:[<ffffffff8033329d>] [<ffffffff8033329d>] _raw_spin_lock+0xa5/0x12a
[ 124.555789] RSP: 0018:ffff81042b50be18 EFLAGS: 00000246
[ 124.555789] RAX: 0000000000000000 RBX: 000000000ef44415 RCX: 000000002b543897
[ 124.555789] RDX: 000000000000009c RSI: ffff81042c87b868 RDI: 0000000000000001
[ 124.555789] RBP: 00002b0273879000 R08: 00007fff38107000 R09: ffffffff805739c4
[ 124.555789] R10: ffff81042f33ed78 R11: ffff81042f33ed78 R12: ffff81042b50f3c0
[ 124.555789] R13: 0000001000000000 R14: 0000000000008000 R15: 0000000000000000
[ 124.555789] FS: 00002b02736f3ef0(0000) GS:ffff81042f98a560(0000) knlGS:0000000000000000
[ 124.555789] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 124.555789] CR2: 000000000085d018 CR3: 000000042cb0e000 CR4: 00000000000006e0
[ 124.555789] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 124.555789] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 124.555789]
[ 124.555789] Call Trace:
[ 124.555789] [<ffffffff803332ab>] _raw_spin_lock+0xb3/0x12a
[ 124.555789] [<ffffffff802171a6>] flush_tlb_others+0x4b/0xa8
[ 124.555789] [<ffffffff80217531>] flush_tlb_mm+0x4a/0x99
[ 124.555789] [<ffffffff802688fe>] unmap_region+0x10a/0x141
[ 124.555789] [<ffffffff80269655>] do_munmap+0x1fd/0x2b9
[ 124.555789] [<ffffffff8047d1d1>] __down_write_nested+0xa0/0xb0
[ 124.555789] [<ffffffff8026974c>] sys_munmap+0x3b/0x57
[ 124.555789] [<ffffffff8020b2ce>] system_call+0x7e/0x83
[ 124.555789]
[ 124.555789] BUG: soft lockup - CPU#14 stuck for 61s! [cc1:7190]
[ 124.555789] CPU 14:
[ 124.555789] Modules linked in:
[ 124.555789] Pid: 7190, comm: cc1 Not tainted 2.6.24 #45
[ 124.555789] RIP: 0010:[<ffffffff802171d0>] [<ffffffff802171d0>] flush_tlb_others+0x75/0xa8
[ 124.555789] RSP: 0018:ffff81042b965e48 EFLAGS: 00000202
[ 124.555789] RAX: 0000000000000010 RBX: 0000000000000006 RCX: 0000000000000003
[ 124.555789] RDX: 0000000000000010 RSI: ffff81042b965df8 RDI: 0000000000000002
[ 124.555789] RBP: ffff810011cc2658 R08: 00002aea0f72d000 R09: ffffffff80573dc4
[ 124.555789] R10: ffff81042e65b7b8 R11: ffff81042e65b7b8 R12: ffffffff80630640
[ 124.555789] R13: 0000000000000000 R14: ffffffff80264d07 R15: ffff81042e0dd960
[ 124.555789] FS: 00002aea0fafb6f0(0000) GS:ffff81042fb01cc0(0000) knlGS:0000000000000000
[ 124.555789] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 124.555789] CR2: 00002aea117fe000 CR3: 000000042b537000 CR4: 00000000000006e0
[ 124.555789] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 124.555789] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 124.555789]
[ 124.555789] Call Trace:
[ 124.555789] [<ffffffff802171c4>] flush_tlb_others+0x69/0xa8
[ 124.555789] [<ffffffff80217531>] flush_tlb_mm+0x4a/0x99
[ 124.555789] [<ffffffff802688fe>] unmap_region+0x10a/0x141
[ 124.555789] [<ffffffff80269655>] do_munmap+0x1fd/0x2b9
[ 124.555789] [<ffffffff8047d1d1>] __down_write_nested+0xa0/0xb0
[ 124.555789] [<ffffffff8026974c>] sys_munmap+0x3b/0x57

I'll try to get more info when I return...

2008-01-29 21:16:48

by Jens Axboe

[permalink] [raw]
Subject: Re: kernel BUG at drivers/block/cciss.c:1260! (with recent linux-2.6 tree)

On Tue, Jan 29 2008, Andrew Vasquez wrote:
> On Tue, 29 Jan 2008, Jens Axboe wrote:
>
> > Great, thanks for confirming. It does look like a clear bug in cciss, it
> > just got exposed now that it uses proper end request handling. We never
> > need to clear ->data_len, since for blk_fs_request() it will be cleared
> > on init. So just setting a residual count there for blk_fs_request()
> > like cciss does is fine.
> >
> > Anyway, it's in my pending queue for Linus.
> >
>
>
> Hmm, probably not related to the block changes in your tree, but I'm
> seeing yet another problem after working (compile jobs) the machine:
>
> [ 61.423922] BUG: spinlock recursion on CPU#2, kjournald/2317
> [ 61.427843] lock: ffff81042c5a4988, .magic: dead4ead, .owner: kjournald/2317, .owner_cpu: 2
> [ 61.427843] Pid: 2317, comm: kjournald Not tainted 2.6.24 #45
> [ 61.427843]
> [ 61.427843] Call Trace:
> [ 61.427843] [<ffffffff803332e1>] _raw_spin_lock+0xe9/0x12a
> [ 61.427843] [<ffffffff80324ccf>] as_merged_requests+0xfe/0x115
> [ 61.427843] [<ffffffff8031b558>] elv_merge_requests+0x1f/0x45
> [ 61.427843] [<ffffffff8031e6f7>] attempt_merge+0x281/0x347
> [ 61.427843] [<ffffffff8031f153>] __make_request+0x1e6/0x598
> [ 61.427843] [<ffffffff8031d6ea>] generic_make_request+0x1c8/0x276
> [ 61.427843] [<ffffffff8031d7f9>] submit_bio+0x61/0xdb
> [ 61.427843] [<ffffffff8029b0d2>] submit_bh+0xe2/0x118
> [ 61.427843] [<ffffffff802f69f3>] journal_do_submit_data+0x28/0x39
> [ 61.427843] [<ffffffff802f77da>] journal_commit_transaction+0xdbe/0x1394
> [ 61.427843] [<ffffffff802381a8>] lock_timer_base+0x26/0x4e
> [ 61.427843] [<ffffffff802fb85f>] kjournald+0x104/0x373
> [ 61.427843] [<ffffffff80242087>] autoremove_wake_function+0x0/0x2e
> [ 61.427843] [<ffffffff802fb75b>] kjournald+0x0/0x373
> [ 61.427843] [<ffffffff80241cd4>] kthread+0x3d/0x61
> [ 61.427843] [<ffffffff8020c0e8>] child_rip+0xa/0x12
> [ 61.427843] [<ffffffff80241c97>] kthread+0x0/0x61
> [ 61.427843] [<ffffffff8020c0de>] child_rip+0x0/0x12

Ah crap, I see the problem, nioc is most often equal to rioc. Dang.
Please try this bandaid, will push a real fix now.

diff --git a/block/as-iosched.c b/block/as-iosched.c
index b201d16..5ca55b0 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -1275,9 +1275,15 @@ static void as_merged_requests(struct request_queue *q, struct request *req,
* Don't copy here but swap, because when anext is
* removed below, it must contain the unused context
*/
- double_spin_lock(&rioc->lock, &nioc->lock, rioc < nioc);
+ if (rioc != nioc)
+ double_spin_lock(&rioc->lock, &nioc->lock, rioc < nioc);
+ else
+ spin_lock(&rioc->lock);
swap_io_context(&rioc, &nioc);
- double_spin_unlock(&rioc->lock, &nioc->lock, rioc < nioc);
+ if (rioc != nioc)
+ double_spin_unlock(&rioc->lock, &nioc->lock, rioc < nioc);
+ else
+ spin_unlock(&rioc->lock);
}
}


--
Jens Axboe

2008-01-29 21:24:50

by Jens Axboe

[permalink] [raw]
Subject: Re: kernel BUG at drivers/block/cciss.c:1260! (with recent linux-2.6 tree)

On Tue, Jan 29 2008, Jens Axboe wrote:
> On Tue, Jan 29 2008, Andrew Vasquez wrote:
> > On Tue, 29 Jan 2008, Jens Axboe wrote:
> >
> > > Great, thanks for confirming. It does look like a clear bug in cciss, it
> > > just got exposed now that it uses proper end request handling. We never
> > > need to clear ->data_len, since for blk_fs_request() it will be cleared
> > > on init. So just setting a residual count there for blk_fs_request()
> > > like cciss does is fine.
> > >
> > > Anyway, it's in my pending queue for Linus.
> > >
> >
> >
> > Hmm, probably not related to the block changes in your tree, but I'm
> > seeing yet another problem after working (compile jobs) the machine:
> >
> > [ 61.423922] BUG: spinlock recursion on CPU#2, kjournald/2317
> > [ 61.427843] lock: ffff81042c5a4988, .magic: dead4ead, .owner: kjournald/2317, .owner_cpu: 2
> > [ 61.427843] Pid: 2317, comm: kjournald Not tainted 2.6.24 #45
> > [ 61.427843]
> > [ 61.427843] Call Trace:
> > [ 61.427843] [<ffffffff803332e1>] _raw_spin_lock+0xe9/0x12a
> > [ 61.427843] [<ffffffff80324ccf>] as_merged_requests+0xfe/0x115
> > [ 61.427843] [<ffffffff8031b558>] elv_merge_requests+0x1f/0x45
> > [ 61.427843] [<ffffffff8031e6f7>] attempt_merge+0x281/0x347
> > [ 61.427843] [<ffffffff8031f153>] __make_request+0x1e6/0x598
> > [ 61.427843] [<ffffffff8031d6ea>] generic_make_request+0x1c8/0x276
> > [ 61.427843] [<ffffffff8031d7f9>] submit_bio+0x61/0xdb
> > [ 61.427843] [<ffffffff8029b0d2>] submit_bh+0xe2/0x118
> > [ 61.427843] [<ffffffff802f69f3>] journal_do_submit_data+0x28/0x39
> > [ 61.427843] [<ffffffff802f77da>] journal_commit_transaction+0xdbe/0x1394
> > [ 61.427843] [<ffffffff802381a8>] lock_timer_base+0x26/0x4e
> > [ 61.427843] [<ffffffff802fb85f>] kjournald+0x104/0x373
> > [ 61.427843] [<ffffffff80242087>] autoremove_wake_function+0x0/0x2e
> > [ 61.427843] [<ffffffff802fb75b>] kjournald+0x0/0x373
> > [ 61.427843] [<ffffffff80241cd4>] kthread+0x3d/0x61
> > [ 61.427843] [<ffffffff8020c0e8>] child_rip+0xa/0x12
> > [ 61.427843] [<ffffffff80241c97>] kthread+0x0/0x61
> > [ 61.427843] [<ffffffff8020c0de>] child_rip+0x0/0x12
>
> Ah crap, I see the problem, nioc is most often equal to rioc. Dang.
> Please try this bandaid, will push a real fix now.

This is way cleaner.

diff --git a/block/as-iosched.c b/block/as-iosched.c
index b201d16..9603684 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -1275,9 +1275,13 @@ static void as_merged_requests(struct request_queue *q, struct request *req,
* Don't copy here but swap, because when anext is
* removed below, it must contain the unused context
*/
- double_spin_lock(&rioc->lock, &nioc->lock, rioc < nioc);
- swap_io_context(&rioc, &nioc);
- double_spin_unlock(&rioc->lock, &nioc->lock, rioc < nioc);
+ if (rioc != nioc) {
+ double_spin_lock(&rioc->lock, &nioc->lock,
+ rioc < nioc);
+ swap_io_context(&rioc, &nioc);
+ double_spin_unlock(&rioc->lock, &nioc->lock,
+ rioc < nioc);
+ }
}
}


--
Jens Axboe

2008-01-29 21:33:00

by Luck, Tony

[permalink] [raw]
Subject: RE: kernel BUG at drivers/block/cciss.c:1260! (with recent linux-2.6 tree)

>> Ah crap, I see the problem, nioc is most often equal to rioc. Dang.
>> Please try this bandaid, will push a real fix now.

Patches arriving faster than I can test :-) ... The bandaid appears to
be working (no hang yet).

> This is way cleaner.

Will try this next.

-Tony

2008-01-29 21:35:40

by Jens Axboe

[permalink] [raw]
Subject: Re: kernel BUG at drivers/block/cciss.c:1260! (with recent linux-2.6 tree)

On Tue, Jan 29 2008, Luck, Tony wrote:
> >> Ah crap, I see the problem, nioc is most often equal to rioc. Dang.
> >> Please try this bandaid, will push a real fix now.
>
> Patches arriving faster than I can test :-) ... The bandaid appears to
> be working (no hang yet).

OK good!

> > This is way cleaner.
>
> Will try this next.

Should work even better since it avoids a lock and copy, but please do
test if you have the time.

--
Jens Axboe

2008-01-29 21:55:34

by Luck, Tony

[permalink] [raw]
Subject: RE: kernel BUG at drivers/block/cciss.c:1260! (with recent linux-2.6 tree)

> > Will try this next.
>
> Should work even better since it avoids a lock and copy, but please do
> test if you have the time.

That one works too (survived two full builds at "make -j32" on the 16-way
system).

Thanks for the quick turnaround.

-Tony

2008-01-29 22:02:56

by Andrew Vasquez

[permalink] [raw]
Subject: Re: kernel BUG at drivers/block/cciss.c:1260! (with recent linux-2.6 tree)

On Tue, 29 Jan 2008, Luck, Tony wrote:

> > > Will try this next.
> >
> > Should work even better since it avoids a lock and copy, but please do
> > test if you have the time.
>
> That one works too (survived two full builds at "make -j32" on the 16-way
> system).
>
> Thanks for the quick turnaround.

Jens, this patch appears to work well on my 16-way box... Compile
tests appear to pass (quickly). I'll be sure to let you know if
anything else 'block' related pops-up...

thanks, av