2005-12-08 04:51:05

by Dave Jones

Subject: 2.6.15rc5git1 cfq related spinlock bad magic

Just as I shutdown my desktop, this popped out..

[311578.273186] BUG: spinlock bad magic on CPU#1, pdflush/30788 (Not tainted)
[311578.293858] general protection fault: 0000 [1] SMP
[311578.308773] CPU 1
[311578.315046] Modules linked in: loop vfat fat radeon drm nfsd exportfs lockd nfs_acl ipv6 lp autofs4 rfcomm l2cap bluetooth suDec 6 03:05:39 nwo kernel: [311578.480129] Pid: 30788, comm: pdflush Not tainted 2.6.14-1.1735_FC5 #1
[311578.499972] RIP: 0010:[<ffffffff8021f8bd>] <ffffffff8021f8bd>{spin_bug+138}
[311578.520605] RSP: 0018:ffff8100307f1e68 EFLAGS: 00010082
[311578.537316] RAX: 00000000c4be8326 RBX: e6491e0e0968bacd RCX: ffffffff804661d8
[311578.558988] RDX: 0000000000000001 RSI: 0000000000000092 RDI: ffffffff804661c0
[311578.580658] RBP: ffff810039fc2000 R08: 0000000000000002 R09: 0000000000000000
[311578.602317] R10: 0000000000000000 R11: ffff810002aec808 R12: ffffffff8039e204
[311578.623985] R13: ffff81003b2e7c48 R14: 0000000000000000 R15: ffffffff80152c90
[311578.645643] FS: 0000000000000000(0000) GS:ffffffff805fd080(0000) knlGS:0000000000000000
[311578.670184] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[311578.687672] CR2: 00002aaaaabe5093 CR3: 000000003d3f7000 CR4: 00000000000006e0
[311578.709338] Process pdflush (pid: 30788, threadinfo ffff8100307f0000, task ffff81003dbc6780)
[311578.734914] Stack: ffff810039fc2000 ffff81003edaa8b0 ffff8100031a00d0 ffffffff8021fbdb
[311578.758699] ffff81003c480630 ffffffff802174a4 ffff81003edaa8b0 ffff810004764928
[311578.783015] 0000000000000286 ffffffff80217527
[311578.798449] Call Trace:<ffffffff8021fbdb>{_raw_spin_lock+25} <ffffffff802174a4>{cfq_exit_single_io_context+85}
[311578.828782] <ffffffff80217527>{cfq_exit_io_context+33} <ffffffff8020d07d>{exit_io_context+137}
[311578.856762] <ffffffff8013f937>{do_exit+183} <ffffffff80152c90>{keventd_create_kthread+0}
[311578.883192] <ffffffff80110c25>{child_rip+15} <ffffffff80152c90>{keventd_create_kthread+0}
[311578.909852] <ffffffff80152d86>{kthread+0} <ffffffff80110c16>{child_rip+0}
[311578.932353]
[311578.939431]
[311578.939432] Code: 44 8b 83 1c 01 00 00 48 8d 8b f8 02 00 00 8b 55 04 41 89 c1


Haven't managed to reproduce it since, but this came up a few weeks
ago, just before we released Fedora Core 5 test1 (we defaulted to
a different elevator for that test release just in case it blew up
during installation). Since flipping it back on, it has behaved, until now.

Dave


2005-12-09 10:40:31

by Jens Axboe

Subject: Re: 2.6.15rc5git1 cfq related spinlock bad magic

On Wed, Dec 07 2005, Dave Jones wrote:
> Just as I shutdown my desktop, this popped out..
>
> [311578.273186] BUG: spinlock bad magic on CPU#1, pdflush/30788 (Not tainted)
> [311578.293858] general protection fault: 0000 [1] SMP
> [311578.308773] CPU 1
> [311578.315046] Modules linked in: loop vfat fat radeon drm nfsd exportfs lockd nfs_acl ipv6 lp autofs4 rfcomm l2cap bluetooth suDec 6 03:05:39 nwo kernel: [311578.480129] Pid: 30788, comm: pdflush Not tainted 2.6.14-1.1735_FC5 #1
> [311578.499972] RIP: 0010:[<ffffffff8021f8bd>] <ffffffff8021f8bd>{spin_bug+138}
> [311578.520605] RSP: 0018:ffff8100307f1e68 EFLAGS: 00010082
> [311578.537316] RAX: 00000000c4be8326 RBX: e6491e0e0968bacd RCX: ffffffff804661d8
> [311578.558988] RDX: 0000000000000001 RSI: 0000000000000092 RDI: ffffffff804661c0
> [311578.580658] RBP: ffff810039fc2000 R08: 0000000000000002 R09: 0000000000000000
> [311578.602317] R10: 0000000000000000 R11: ffff810002aec808 R12: ffffffff8039e204
> [311578.623985] R13: ffff81003b2e7c48 R14: 0000000000000000 R15: ffffffff80152c90
> [311578.645643] FS: 0000000000000000(0000) GS:ffffffff805fd080(0000) knlGS:0000000000000000
> [311578.670184] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> [311578.687672] CR2: 00002aaaaabe5093 CR3: 000000003d3f7000 CR4: 00000000000006e0
> [311578.709338] Process pdflush (pid: 30788, threadinfo ffff8100307f0000, task ffff81003dbc6780)
> [311578.734914] Stack: ffff810039fc2000 ffff81003edaa8b0 ffff8100031a00d0 ffffffff8021fbdb
> [311578.758699] ffff81003c480630 ffffffff802174a4 ffff81003edaa8b0 ffff810004764928
> [311578.783015] 0000000000000286 ffffffff80217527
> [311578.798449] Call Trace:<ffffffff8021fbdb>{_raw_spin_lock+25} <ffffffff802174a4>{cfq_exit_single_io_context+85}
> [311578.828782] <ffffffff80217527>{cfq_exit_io_context+33} <ffffffff8020d07d>{exit_io_context+137}
> [311578.856762] <ffffffff8013f937>{do_exit+183} <ffffffff80152c90>{keventd_create_kthread+0}
> [311578.883192] <ffffffff80110c25>{child_rip+15} <ffffffff80152c90>{keventd_create_kthread+0}
> [311578.909852] <ffffffff80152d86>{kthread+0} <ffffffff80110c16>{child_rip+0}
> [311578.932353]
> [311578.939431]
> [311578.939432] Code: 44 8b 83 1c 01 00 00 48 8d 8b f8 02 00 00 8b 55 04 41 89 c1
>
>
> Haven't managed to reproduce it since, but this came up a few weeks
> ago, just before we released Fedora Core 5 test1 (we defaulted to
> a different elevator for that test release just in case it blew up
> during installation). Since flipping it back on, it has behaved, until now.

Hmm, nasty. The queue was already gone when the task exited; this should
not happen, since the task's io context holds a reference to it. We had
this bug before, because SCSI didn't honor the lifetime rules of the
lock inside the queue, but I fixed that long ago.
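
For reference, the path the trace shows is, schematically (paraphrased,
not the verbatim 2.6.15 source):

  do_exit()
    exit_io_context()
      cfq_exit_io_context(ioc)
        cfq_exit_single_io_context(cic)
          spin_lock(q->queue_lock);    <- the bad magic check trips here
          ...
          spin_unlock(q->queue_lock);

so the queue, and whatever its queue_lock pointer refers to, has to stay
valid for as long as any task that did io to it is still around.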

What types of devices do you use on this machine? Not just statically,
but also any devices you may have inserted/removed during the uptime of
that machine.

--
Jens Axboe

2005-12-09 18:55:30

by Dave Jones

Subject: Re: 2.6.15rc5git1 cfq related spinlock bad magic

On Fri, Dec 09, 2005 at 11:41:50AM +0100, Jens Axboe wrote:
> On Wed, Dec 07 2005, Dave Jones wrote:
> > Just as I shutdown my desktop, this popped out..
> >
> > [311578.273186] BUG: spinlock bad magic on CPU#1, pdflush/30788 (Not tainted)
> > [311578.293858] general protection fault: 0000 [1] SMP
> > [311578.308773] CPU 1
> > [311578.315046] Modules linked in: loop vfat fat radeon drm nfsd exportfs lockd nfs_acl ipv6 lp autofs4 rfcomm l2cap bluetooth suDec 6 03:05:39 nwo kernel: [311578.480129] Pid: 30788, comm: pdflush Not tainted 2.6.14-1.1735_FC5 #1
> > [311578.499972] RIP: 0010:[<ffffffff8021f8bd>] <ffffffff8021f8bd>{spin_bug+138}
> > [311578.520605] RSP: 0018:ffff8100307f1e68 EFLAGS: 00010082
> > [311578.537316] RAX: 00000000c4be8326 RBX: e6491e0e0968bacd RCX: ffffffff804661d8
> > [311578.558988] RDX: 0000000000000001 RSI: 0000000000000092 RDI: ffffffff804661c0
> > [311578.580658] RBP: ffff810039fc2000 R08: 0000000000000002 R09: 0000000000000000
> > [311578.602317] R10: 0000000000000000 R11: ffff810002aec808 R12: ffffffff8039e204
> > [311578.623985] R13: ffff81003b2e7c48 R14: 0000000000000000 R15: ffffffff80152c90
> > [311578.645643] FS: 0000000000000000(0000) GS:ffffffff805fd080(0000) knlGS:0000000000000000
> > [311578.670184] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> > [311578.687672] CR2: 00002aaaaabe5093 CR3: 000000003d3f7000 CR4: 00000000000006e0
> > [311578.709338] Process pdflush (pid: 30788, threadinfo ffff8100307f0000, task ffff81003dbc6780)
> > [311578.734914] Stack: ffff810039fc2000 ffff81003edaa8b0 ffff8100031a00d0 ffffffff8021fbdb
> > [311578.758699] ffff81003c480630 ffffffff802174a4 ffff81003edaa8b0 ffff810004764928
> > [311578.783015] 0000000000000286 ffffffff80217527
> > [311578.798449] Call Trace:<ffffffff8021fbdb>{_raw_spin_lock+25} <ffffffff802174a4>{cfq_exit_single_io_context+85}
> > [311578.828782] <ffffffff80217527>{cfq_exit_io_context+33} <ffffffff8020d07d>{exit_io_context+137}
> > [311578.856762] <ffffffff8013f937>{do_exit+183} <ffffffff80152c90>{keventd_create_kthread+0}
> > [311578.883192] <ffffffff80110c25>{child_rip+15} <ffffffff80152c90>{keventd_create_kthread+0}
> > [311578.909852] <ffffffff80152d86>{kthread+0} <ffffffff80110c16>{child_rip+0}
> > [311578.932353]
> > [311578.939431]
> > [311578.939432] Code: 44 8b 83 1c 01 00 00 48 8d 8b f8 02 00 00 8b 55 04 41 89 c1
> >
> >
> > Haven't managed to reproduce it since, but this came up a few weeks
> > ago, just before we released Fedora Core 5 test1 (we defaulted to
> > a different elevator for that test release just in case it blew up
> > during installation). Since flipping it back on, it has behaved, until now.
>
> Hmm, nasty. The queue was already gone when the task exited; this should
> not happen, since the task's io context holds a reference to it. We had
> this bug before, because SCSI didn't honor the lifetime rules of the
> lock inside the queue, but I fixed that long ago.

> What types of devices do you use on this machine? Not just statically,
> but also any devices you may have inserted/removed during the uptime of
> that machine.

There was likely a usb memory key and a 5-in-1 card reader plugged in
at the time. Possibly a usb hard disk too.

Hmm, I may have also been experimenting at the time with Pete Zaitcev's
ub driver. Pete, could ub have been doing something bad here?
Actually, looking at the Fedora kernel changelogs, the cfq oopses we
saw started happening shortly after we enabled ub there too.

Dave

2005-12-26 22:52:57

by Pete Zaitcev

Subject: Re: 2.6.15rc5git1 cfq related spinlock bad magic

On Fri, 9 Dec 2005 13:55:21 -0500, Dave Jones <[email protected]> wrote:

> > > [311578.273186] BUG: spinlock bad magic on CPU#1, pdflush/30788 (Not tainted)
>[...]
> > > [311578.499972] RIP: 0010:[<ffffffff8021f8bd>] <ffffffff8021f8bd>{spin_bug+138}
> > > [311578.798449] Call Trace:<ffffffff8021fbdb>{_raw_spin_lock+25} <ffffffff802174a4>{cfq_exit_single_io_context+85}
> > > [311578.828782] <ffffffff80217527>{cfq_exit_io_context+33} <ffffffff8020d07d>{exit_io_context+137}
> > > [311578.856762] <ffffffff8013f937>{do_exit+183} <ffffffff80152c90>{keventd_create_kthread+0}
> > > [311578.883192] <ffffffff80110c25>{child_rip+15} <ffffffff80152c90>{keventd_create_kthread+0}
> > > [311578.909852] <ffffffff80152d86>{kthread+0} <ffffffff80110c16>{child_rip+0}

> Hmm, I may have also been experimenting at the time with Pete Zaitcev's
> ub driver. Pete, could ub have been doing something bad here?

Yes, this is ub's fault. I thought that blk_cleanup_queue frees the queue,
but this is not the case. In recent kernels, it only decrements its refcount.
If CFQ is around, it keeps the queue pinned and uses the queue's spinlock.
But when ub calls blk_init_queue(), it passes a spinlock located in its data
structure (ub_dev), which corresponds to a device. The ub_dev is refcounted
and freed when the device is disconnected or closed. As you can see, this
leaves the queue's spinlock pointer dangling.
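
Schematically, the broken pattern is something like this (illustrative
names only; my_dev and friends are not the actual ub code):

#include <linux/blkdev.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct my_dev {
        spinlock_t lock;                /* lives and dies with my_dev */
        request_queue_t *queue;
        /* ... refcounted, freed on disconnect/close ... */
};

static void my_request_fn(request_queue_t *q)
{
        /* dispatch requests; not relevant to the lifetime problem */
}

static int my_probe(struct my_dev *dev)
{
        spin_lock_init(&dev->lock);
        /* the queue's queue_lock pointer now points into my_dev */
        dev->queue = blk_init_queue(my_request_fn, &dev->lock);
        if (dev->queue == NULL)
                return -ENOMEM;
        return 0;
}

static void my_release(struct my_dev *dev)
{
        /*
         * Only drops a reference; if CFQ still pins the queue through
         * some task's io context, the queue itself stays around ...
         */
        blk_cleanup_queue(dev->queue);

        /*
         * ... but the spinlock embedded in my_dev dies here, so
         * queue->queue_lock now dangles.  The next task exit that goes
         * through cfq_exit_single_io_context() takes a lock in freed
         * memory; hence the bad magic.
         */
        kfree(dev);
}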

The code was taken from Carmel, and it used to work fine for a long time.
I suspect now that Carmel is vulnerable, if it's hot-removed while open.
Maybe Jeff wants to look into it.

The usb-storage is immune to this problem, because SCSI passes NULL to
blk_init_queue.

Schedulers other than CFQ use their own spinlocks, so they do not hit
this problem.

The attached patch works around this issue by using spinlocks which are
static to the ub module. Thus, it places ub into the same group as floppy.
This is not ideal, in case someone manages to remove the module yet have
queues remaining... But I am reluctant to copy what scsi_request_fn is
doing. After all, ub is supposed to be simpler.

Any comments before I send this to Greg?

With Christmas cheers,
-- Pete

--- linux-2.6.15-rc7/drivers/block/ub.c 2005-12-26 01:25:03.000000000 -0800
+++ linux-2.6.15-rc7-locktest/drivers/block/ub.c 2005-12-26 13:56:09.000000000 -0800
@@ -356,7 +356,7 @@ struct ub_lun {
* The USB device instance.
*/
struct ub_dev {
- spinlock_t lock;
+ spinlock_t *lock;
atomic_t poison; /* The USB device is disconnected */
int openc; /* protected by ub_lock! */
/* kref is too implicit for our taste */
@@ -440,6 +440,10 @@ MODULE_DEVICE_TABLE(usb, ub_usb_ids);
#define UB_MAX_HOSTS 26
static char ub_hostv[UB_MAX_HOSTS];

+#define UB_QLOCK_NUM 5
+static spinlock_t ub_qlockv[UB_QLOCK_NUM];
+static int ub_qlock_next = 0;
+
static DEFINE_SPINLOCK(ub_lock); /* Locks globals and ->openc */

/*
@@ -519,7 +523,7 @@ static ssize_t ub_diag_show(struct devic
return 0;

cnt = 0;
- spin_lock_irqsave(&sc->lock, flags);
+ spin_lock_irqsave(sc->lock, flags);

cnt += sprintf(page + cnt,
"qlen %d qmax %d\n",
@@ -564,7 +568,7 @@ static ssize_t ub_diag_show(struct devic
if (++nc == SCMD_TRACE_SZ) nc = 0;
}

- spin_unlock_irqrestore(&sc->lock, flags);
+ spin_unlock_irqrestore(sc->lock, flags);
return cnt;
}

@@ -612,6 +616,24 @@ static void ub_id_put(int id)
}

/*
+ * This is necessitated by the fact that blk_cleanup_queue does not
+ * necessarily destroy the queue. Instead, it may merely decrease q->refcnt.
+ * Since our blk_init_queue() passes a spinlock common with ub_dev,
+ * we have life time issues when ub_cleanup frees ub_dev.
+ */
+static spinlock_t *ub_next_lock(void)
+{
+ unsigned long flags;
+ spinlock_t *ret;
+
+ spin_lock_irqsave(&ub_lock, flags);
+ ret = &ub_qlockv[ub_qlock_next];
+ ub_qlock_next = (ub_qlock_next + 1) % UB_QLOCK_NUM;
+ spin_unlock_irqrestore(&ub_lock, flags);
+ return ret;
+}
+
+/*
* Downcount for deallocation. This rides on two assumptions:
* - once something is poisoned, its refcount cannot grow
* - opens cannot happen at this time (del_gendisk was done)
@@ -1039,9 +1061,9 @@ static void ub_urb_timeout(unsigned long
struct ub_dev *sc = (struct ub_dev *) arg;
unsigned long flags;

- spin_lock_irqsave(&sc->lock, flags);
+ spin_lock_irqsave(sc->lock, flags);
usb_unlink_urb(&sc->work_urb);
- spin_unlock_irqrestore(&sc->lock, flags);
+ spin_unlock_irqrestore(sc->lock, flags);
}

/*
@@ -1064,10 +1086,10 @@ static void ub_scsi_action(unsigned long
struct ub_dev *sc = (struct ub_dev *) _dev;
unsigned long flags;

- spin_lock_irqsave(&sc->lock, flags);
+ spin_lock_irqsave(sc->lock, flags);
del_timer(&sc->work_timer);
ub_scsi_dispatch(sc);
- spin_unlock_irqrestore(&sc->lock, flags);
+ spin_unlock_irqrestore(sc->lock, flags);
}

static void ub_scsi_dispatch(struct ub_dev *sc)
@@ -1836,11 +1858,11 @@ static int ub_sync_tur(struct ub_dev *sc
cmd->done = ub_probe_done;
cmd->back = &compl;

- spin_lock_irqsave(&sc->lock, flags);
+ spin_lock_irqsave(sc->lock, flags);
cmd->tag = sc->tagcnt++;

rc = ub_submit_scsi(sc, cmd);
- spin_unlock_irqrestore(&sc->lock, flags);
+ spin_unlock_irqrestore(sc->lock, flags);

if (rc != 0) {
printk("ub: testing ready: submit error (%d)\n", rc); /* P3 */
@@ -1898,11 +1920,11 @@ static int ub_sync_read_cap(struct ub_de
cmd->done = ub_probe_done;
cmd->back = &compl;

- spin_lock_irqsave(&sc->lock, flags);
+ spin_lock_irqsave(sc->lock, flags);
cmd->tag = sc->tagcnt++;

rc = ub_submit_scsi(sc, cmd);
- spin_unlock_irqrestore(&sc->lock, flags);
+ spin_unlock_irqrestore(sc->lock, flags);

if (rc != 0) {
printk("ub: reading capacity: submit error (%d)\n", rc); /* P3 */
@@ -2176,7 +2198,7 @@ static int ub_probe(struct usb_interface
if ((sc = kmalloc(sizeof(struct ub_dev), GFP_KERNEL)) == NULL)
goto err_core;
memset(sc, 0, sizeof(struct ub_dev));
- spin_lock_init(&sc->lock);
+ sc->lock = ub_next_lock();
INIT_LIST_HEAD(&sc->luns);
usb_init_urb(&sc->work_urb);
tasklet_init(&sc->tasklet, ub_scsi_action, (unsigned long)sc);
@@ -2322,7 +2344,7 @@ static int ub_probe_lun(struct ub_dev *s
disk->driverfs_dev = &sc->intf->dev;

rc = -ENOMEM;
- if ((q = blk_init_queue(ub_request_fn, &sc->lock)) == NULL)
+ if ((q = blk_init_queue(ub_request_fn, sc->lock)) == NULL)
goto err_blkqinit;

disk->queue = q;
@@ -2388,7 +2410,7 @@ static void ub_disconnect(struct usb_int
* and the whole queue drains. So, we just use this code to
* print warnings.
*/
- spin_lock_irqsave(&sc->lock, flags);
+ spin_lock_irqsave(sc->lock, flags);
{
struct ub_scsi_cmd *cmd;
int cnt = 0;
@@ -2405,7 +2427,7 @@ static void ub_disconnect(struct usb_int
"%d was queued after shutdown\n", sc->name, cnt);
}
}
- spin_unlock_irqrestore(&sc->lock, flags);
+ spin_unlock_irqrestore(sc->lock, flags);

/*
* Unregister the upper layer.
@@ -2424,19 +2446,15 @@ static void ub_disconnect(struct usb_int
}

/*
- * Taking a lock on a structure which is about to be freed
- * is very nonsensual. Here it is largely a way to do a debug freeze,
- * and a bracket which shows where the nonsensual code segment ends.
- *
* Testing for -EINPROGRESS is always a bug, so we are bending
* the rules a little.
*/
- spin_lock_irqsave(&sc->lock, flags);
+ spin_lock_irqsave(sc->lock, flags);
if (sc->work_urb.status == -EINPROGRESS) { /* janitors: ignore */
printk(KERN_WARNING "%s: "
"URB is active after disconnect\n", sc->name);
}
- spin_unlock_irqrestore(&sc->lock, flags);
+ spin_unlock_irqrestore(sc->lock, flags);

/*
* There is virtually no chance that other CPU runs times so long
@@ -2471,6 +2489,10 @@ static struct usb_driver ub_driver = {
static int __init ub_init(void)
{
int rc;
+ int i;
+
+ for (i = 0; i < UB_QLOCK_NUM; i++)
+ spin_lock_init(&ub_qlockv[i]);

if ((rc = register_blkdev(UB_MAJOR, DRV_NAME)) != 0)
goto err_regblkdev;

2005-12-28 09:25:13

by Jens Axboe

Subject: Re: 2.6.15rc5git1 cfq related spinlock bad magic

On Mon, Dec 26 2005, Pete Zaitcev wrote:
> On Fri, 9 Dec 2005 13:55:21 -0500, Dave Jones <[email protected]> wrote:
>
> > > > [311578.273186] BUG: spinlock bad magic on CPU#1, pdflush/30788 (Not tainted)
> >[...]
> > > > [311578.499972] RIP: 0010:[<ffffffff8021f8bd>] <ffffffff8021f8bd>{spin_bug+138}
> > > > [311578.798449] Call Trace:<ffffffff8021fbdb>{_raw_spin_lock+25} <ffffffff802174a4>{cfq_exit_single_io_context+85}
> > > > [311578.828782] <ffffffff80217527>{cfq_exit_io_context+33} <ffffffff8020d07d>{exit_io_context+137}
> > > > [311578.856762] <ffffffff8013f937>{do_exit+183} <ffffffff80152c90>{keventd_create_kthread+0}
> > > > [311578.883192] <ffffffff80110c25>{child_rip+15} <ffffffff80152c90>{keventd_create_kthread+0}
> > > > [311578.909852] <ffffffff80152d86>{kthread+0} <ffffffff80110c16>{child_rip+0}
>
> > Hmm, I may have also been experimenting at the time with Pete Zaitcev's
> > ub driver. Pete, could ub have been doing something bad here?
>
> Yes, this is ub's fault. I thought that blk_cleanup_queue frees the
> queue, but this is not the case. In recent kernels, it only decrements
> its refcount. If CFQ is around, it keeps the queue pinned and uses
> the queue's spinlock. But when ub calls blk_init_queue(), it passes a
> spinlock located in its data structure (ub_dev), which corresponds to
> a device. The ub_dev is refcounted and freed when the device is
> disconnected or closed. As you can see, this leaves the queue's
> spinlock pointer dangling.

That indeed looks to be it. We've had the reference-counted queue
for a long time. CFQ is the only io scheduler that is affected by such a
bug, although the core block layer could trigger it as well; the window
there is much smaller (with CFQ the window is very large).

> The code was taken from Carmel, and it used to work fine for a long time.
> I suspect now that Carmel is vulnerable, if it's hot-removed while open.
> Maybe Jeff wants to look into it.

Indeed.

> The usb-storage is immune to this problem, because SCSI passes NULL to
> blk_init_queue.

SCSI did have the same problem recently; it just does the assignment
differently.

> Schedulers other than CFQ use their own spinlocks, so they do not hit
> this problem.
>
> The attached patch works around this issue by using spinlocks which are
> static to the ub module. Thus, it places ub into the same group as floppy.
> This is not ideal, in case someone manages to remove the module yet have
> queues remaining... But I am reluctant to copy what scsi_request_fn is
> doing. After all, ub is supposed to be simpler.
>
> Any comments before I send this to Greg?

It's probably the easiest way to fix it in a straightforward manner.
I'll see if I can add some debug code to detect this issue...

--
Jens Axboe