I am proposing two patches to protect the request queue and the io
elevator from being released prematurely.
I've been testing mmc driver robustness against rapid card
removal/reinsert cycles, and it has been rather easy to get kernel
oopses (on 2.6.28) because of insufficient locking.
When the MMC driver notices that a card has been removed, it removes
the card and the block device. The call proceeds to
blk_cleanup_queue(), which marks the request queue dead, calls
elevator exit to release the elevator, and puts the request queue.
This always releases the elevator and may also release the request
queue.
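For reference, the relevant part of blk_cleanup_queue() on kernels of
this generation is roughly the following (paraphrased from my reading
of block/blk-core.c, not a verbatim copy):

void blk_cleanup_queue(struct request_queue *q)
{
	/* wait for pending block-layer work on this device */
	blk_sync_queue(q);

	mutex_lock(&q->sysfs_lock);
	queue_flag_set_unlocked(QUEUE_FLAG_DEAD, q);	/* mark the queue dead */
	mutex_unlock(&q->sysfs_lock);

	if (q->elevator)
		elevator_exit(q->elevator);	/* elevator is freed unconditionally */

	blk_put_queue(q);	/* drops a reference; may free the queue itself */
}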
If a file system is still mounted and still using the disk after
blk_cleanup_queue() has been called, io requests will access
already-freed structures, and oopses and BUG_ON()s follow.
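The patches below pin the queue across open/release with
blk_get_queue()/blk_put_queue(). For reference, these helpers are
essentially thin wrappers around the queue's kobject in this kernel
generation (again a paraphrased sketch, not verbatim):

int blk_get_queue(struct request_queue *q)
{
	if (likely(!test_bit(QUEUE_FLAG_DEAD, &q->queue_flags))) {
		kobject_get(&q->kobj);	/* take a reference on the queue */
		return 0;		/* success */
	}
	return 1;			/* queue is already marked dead */
}

void blk_put_queue(struct request_queue *q)
{
	kobject_put(&q->kobj);		/* last put ends up in blk_release_queue() */
}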
With the proposed patches applied, I am no longer able to reproduce
the oopses in the io scheduler or elevator. There is still another
oops in sysfs, which tries to add a duplicate file, but I think that
is unrelated.
Cheers
Jarkko Lavinen
From 9559377f3166345649c3406427f410cd51472944 Mon Sep 17 00:00:00 2001
From: Jarkko Lavinen <[email protected]>
Date: Wed, 21 Oct 2009 18:48:18 +0300
Subject: [PATCH 1/2] block: Avoid removing a dead request queue too early
If a disk is removed while a file system is still mounted, the disk
removal can release the dead request queue too early, while the file
system is still trying to submit requests.
Signed-off-by: Jarkko Lavinen <[email protected]>
---
fs/block_dev.c | 24 ++++++++++++++++++++----
1 files changed, 20 insertions(+), 4 deletions(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 9cf4b92..91f2fc3 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1165,6 +1165,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
int ret;
int partno;
int perm = 0;
+ int q_ref = 0;
+ struct module *owner;
if (mode & FMODE_READ)
perm |= MAY_READ;
@@ -1187,6 +1189,11 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
if (!disk)
goto out_unlock_kernel;
+ if (blk_get_queue(disk->queue))
+ goto out_unlock_kernel;
+ else
+ q_ref = 1;
+
mutex_lock_nested(&bdev->bd_mutex, for_part);
if (!bdev->bd_openers) {
bdev->bd_disk = disk;
@@ -1248,8 +1255,10 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
bd_set_size(bdev, (loff_t)bdev->bd_part->nr_sects << 9);
}
} else {
+ owner = disk->fops->owner;
+ blk_put_queue(disk->queue);
put_disk(disk);
- module_put(disk->fops->owner);
+ module_put(owner);
disk = NULL;
if (bdev->bd_contains == bdev) {
if (bdev->bd_disk->fops->open) {
@@ -1281,9 +1290,15 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
out_unlock_kernel:
unlock_kernel();
- if (disk)
- module_put(disk->fops->owner);
- put_disk(disk);
+ if (disk) {
+ owner = disk->fops->owner;
+ if (q_ref)
+ blk_put_queue(disk->queue);
+
+ put_disk(disk);
+ module_put(owner);
+ }
+
bdput(bdev);
return ret;
@@ -1360,6 +1375,7 @@ static int __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
if (!bdev->bd_openers) {
struct module *owner = disk->fops->owner;
+ blk_put_queue(disk->queue);
put_disk(disk);
module_put(owner);
disk_put_part(bdev->bd_part);
--
1.6.3.3
From ff822af94f86dbe559b7649b7f71c464260ef353 Mon Sep 17 00:00:00 2001
From: Jarkko Lavinen <[email protected]>
Date: Wed, 21 Oct 2009 18:53:20 +0300
Subject: [PATCH 2/2] block: Protect the elevator from being released too early
When a disk is being removed, blk_cleanup_queue() calls the elevator
exit function, which releases the elevator while it may still be in use.
Signed-off-by: Jarkko Lavinen <[email protected]>
---
block/blk-core.c | 9 +++++++++
1 files changed, 9 insertions(+), 0 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index ac0fa10..da56586 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1173,6 +1173,13 @@ static int __make_request(struct request_queue *q, struct bio *bio)
*/
blk_queue_bounce(q, &bio);
+ mutex_lock(&q->elevator->sysfs_lock);
+ if (test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)) {
+ mutex_unlock(&q->elevator->sysfs_lock);
+ bio_endio(bio, -EIO);
+ return 0;
+ }
+
spin_lock_irq(q->queue_lock);
if (unlikely(bio_rw_flagged(bio, BIO_RW_BARRIER)) || elv_queue_empty(q))
@@ -1275,6 +1282,8 @@ out:
if (unplug || !queue_should_plug(q))
__generic_unplug_device(q);
spin_unlock_irq(q->queue_lock);
+ mutex_unlock(&q->elevator->sysfs_lock);
+
return 0;
}
--
1.6.3.3
Jarkko Lavinen wrote:
> When the MMC driver notices that a card has been removed, it removes
> the card and the block device. The call proceeds to
> blk_cleanup_queue(), which marks the request queue dead, calls
> elevator exit to release the elevator, and puts the request queue.
> This always releases the elevator and may also release the request
> queue.
>
> If a file system is still mounted and still using the disk after
> blk_cleanup_queue() has been called, io requests will access
> already-freed structures, and oopses and BUG_ON()s follow.
[...]
> fs/block_dev.c | 24 ++++++++++++++++++++----
> 1 files changed, 20 insertions(+), 4 deletions(-)
[...]
And in patch 2/2:
> block/blk-core.c | 9 +++++++++
> 1 files changed, 9 insertions(+), 0 deletions(-)
This doesn't look right. I suspect you need to fix the MMC subsystem.
It has to reference-count its objects so that they are not freed as long
as they are used by upper layers, notably by the block subsystem.
OK, actually I am not 100% sure whether the issue is in the lower layers
alone --- I'm only (reasonably) familiar with drivers below the SCSI
subsystem, not with those directly below the block subsystem.
In the case of SCSI, the low-level driver requests the removal of
scsi_device or Scsi_Host instances when appropriate, and the
respective removal functions in the SCSI core make sure that upper
layers will not use a scsi_device after scsi_remove_device (or other
flavors of it) anymore; likewise, the core guarantees that a Scsi_Host
will not be used by upper layers anymore once scsi_remove_host returns
to the low-level driver.
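For illustration, a low-level SCSI driver's remove path typically has
this shape (hypothetical driver and function name, only to show the
model):

#include <linux/pci.h>
#include <scsi/scsi_host.h>

static void mydrv_remove(struct pci_dev *pdev)
{
	struct Scsi_Host *shost = pci_get_drvdata(pdev);

	scsi_remove_host(shost);	/* upper layers are done with the host
					 * by the time this returns */

	/* ... quiesce the hardware, free driver-private resources ... */

	scsi_host_put(shost);		/* drop our reference; the host is freed
					 * once the last reference is put */
}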
So, while I don't know what the block subsystem offers in this regard,
the fix cannot be that the block layer somehow notices that the lower
layers went away and then goes quiet; instead, the way to go is that
the lower layers make sure the block request queue is destroyed before
the MMC device objects are destroyed (or in other words: that the MMC
device objects stay around at least as long as the block request queue
exists).
[Quoting in different order than posted:]
> I've been testing mmc driver robustness against rapid card
> removal/reinsert cycles, and it has been rather easy to get kernel
> oopses (on 2.6.28) because of insufficient locking.
I dare say without knowing the MMC and block system: It is not a
locking issue, it is a reference counting issue.
Make sure to always "get" an object before you hand a pointer to it over
to some store or to another context (and of course "put" it when that
reference is removed from that store or, respectively, no longer used
by that other context). And make sure that an object is destroyed only
when the reference count goes to zero. The kref API does this, and so
do the APIs which are wrapped around it (like get_device, put_device).
So, make sure that the reference to the MMC device which the block layer
needs is counted for as long as the request queue exists.
PS: The kref API is based on an atomic counter so that there is
fundamentally never a need to use locking (mutual exclusion) for proper
reference counting and proper deferral of object destruction. The whole
secret is not to forget to count _all_ references.
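For illustration, the pattern looks like this for some hypothetical
object (not MMC code, just the shape of it):

#include <linux/kref.h>
#include <linux/slab.h>

struct my_obj {
	struct kref kref;
	/* ... payload ... */
};

static void my_obj_release(struct kref *kref)
{
	struct my_obj *obj = container_of(kref, struct my_obj, kref);

	kfree(obj);			/* runs only when the last reference is put */
}

static struct my_obj *my_obj_create(void)
{
	struct my_obj *obj = kzalloc(sizeof(*obj), GFP_KERNEL);

	if (obj)
		kref_init(&obj->kref);	/* reference count starts at 1 */
	return obj;
}

/* take a reference whenever a pointer is handed to another context/store */
static inline void my_obj_get(struct my_obj *obj)
{
	kref_get(&obj->kref);
}

/* drop it when that context/store is done with the pointer */
static inline void my_obj_put(struct my_obj *obj)
{
	kref_put(&obj->kref, my_obj_release);
}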
--
Stefan Richter
-=====-==--= =-=- =-=-=
http://arcgraph.de/sr/
Stefan Richter wrote:
> I dare say without knowing the MMC and block system: It is not a
> locking issue, it is a reference counting issue.
[...]
> The whole secret is not to forget to count _all_ references.
PS, in case it might be needed: Sometimes a low level wants to wait
until its upper-level users have gone away. Reference counting is then
coupled with constructs like wait_for_completion.
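For illustration (hypothetical driver object; kref_init() and
init_completion() are assumed to have been done at creation time):

#include <linux/completion.h>
#include <linux/kref.h>
#include <linux/slab.h>

struct my_dev {
	struct kref kref;
	struct completion released;
};

static void my_dev_release(struct kref *kref)
{
	struct my_dev *dev = container_of(kref, struct my_dev, kref);

	complete(&dev->released);	/* the last user just went away */
}

static void my_dev_remove(struct my_dev *dev)
{
	kref_put(&dev->kref, my_dev_release);	/* drop the driver's own reference */
	wait_for_completion(&dev->released);	/* wait until upper layers put theirs */
	kfree(dev);				/* now it is safe to free */
}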
--
Stefan Richter
-=====-==--= =-=- =-=-=
http://arcgraph.de/sr/
Hi Stefan
Sorry for the late reply.
> It has to reference-count its objects so that they are not freed as long
> as they are used by upper layers,
The block layer and device removal seem to be designed with a
top-down approach. Although the disk is referenced from
__blkdev_get(), the disk's request queue is not. Also,
blk_cleanup_queue() calls elevator_exit() without checking whether
anyone still uses the elevator.
A remedy would be to take a reference on the request queue in
__blkdev_get() and move elevator_exit() from blk_cleanup_queue()
to blk_release_queue().
I've tried this and it works fine. I am unable to cause an oops within
the elevator or anywhere else with hot card removal. This is how
it should be: first the queue is marked dead so no new requests
are added to the queue, old requests are flushed by elevator_exit()
before the elevator is released, and the dead flag ensures new
requests never even reach the request issue function.
The caveat is that when the request queue is initialized with
blk_init_queue(), the caller provides a pointer to a spinlock. With
elevator_exit() deferred like this, drivers may release that lock
before elevator_exit() has finished. In the mmc driver the struct
containing the queue lock is freed immediately after
blk_cleanup_queue() returns. Later, when the request queue is
finally released, elevator_exit() tries to use the freed spinlock.
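To illustrate the problem (hypothetical driver, but the mmc block
driver follows the same shape):

#include <linux/blkdev.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct my_queue {
	spinlock_t		lock;		/* handed to the block layer */
	struct request_queue	*queue;
};

static void my_request_fn(struct request_queue *q)
{
	/* ... fetch and dispatch requests ... */
}

static int my_init_queue(struct my_queue *mq)
{
	spin_lock_init(&mq->lock);
	/* the block layer keeps a pointer to the driver's spinlock */
	mq->queue = blk_init_queue(my_request_fn, &mq->lock);
	return mq->queue ? 0 : -ENOMEM;
}

static void my_teardown(struct my_queue *mq)
{
	blk_cleanup_queue(mq->queue);
	/* frees mq->lock; with elevator_exit() deferred to queue release,
	 * q->queue_lock would still point into freed memory */
	kfree(mq);
}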
A somewhat hacky workaround, which the patch below uses, is to reset
the queue_lock pointer to the request queue's internal lock in
blk_cleanup_queue().
Jarkko Lavinen
---
From 8ccee6132f376d78bb6e355016ecc06c9ca47b9f Mon Sep 17 00:00:00 2001
From: Jarkko Lavinen <[email protected]>
Date: Tue, 3 Nov 2009 09:48:00 +0200
Subject: [PATCH] block: Move elevator exit from blk_cleanup_queue() to blk_release_queue()
blk_cleanup_queue() should not remove the elevator if it is still
in use. Better to call elevator exit in blk_release_queue() when
all references to the queue are gone. Since drivers could use queue
locks which are freed after blk_cleanup_queue() returns, switch the
queue lock to the internal one there.
Signed-off-by: Jarkko Lavinen <[email protected]>
---
block/blk-core.c | 8 ++++++--
block/blk-sysfs.c | 3 +++
2 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 9daf621..8d7d5aa 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -449,6 +449,8 @@ void blk_put_queue(struct request_queue *q)
void blk_cleanup_queue(struct request_queue *q)
{
+ spinlock_t *oldlock;
+
/*
* We know we have process context here, so we can be a little
* cautious and ensure that pending block actions on this device
@@ -461,8 +463,10 @@ void blk_cleanup_queue(struct request_queue *q)
queue_flag_set_unlocked(QUEUE_FLAG_DEAD, q);
mutex_unlock(&q->sysfs_lock);
- if (q->elevator)
- elevator_exit(q->elevator);
+ oldlock = q->queue_lock;
+ spin_lock_irq(oldlock);
+ q->queue_lock = &q->__queue_lock;
+ spin_unlock_irq(oldlock);
blk_put_queue(q);
}
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 21e275d..8ac3e5b 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -304,6 +304,9 @@ static void blk_release_queue(struct kobject *kobj)
container_of(kobj, struct request_queue, kobj);
struct request_list *rl = &q->rq;
+ if (q->elevator)
+ elevator_exit(q->elevator);
+
blk_sync_queue(q);
if (rl->rq_pool)
--
1.6.5
Jarkko Lavinen wrote:
> Hi Stefan
>
> Sorry for the late reply.
>
>> It has to reference-count its objects so that they are not freed as long
>> as they are used by upper layers,
>
> The block layer and device removal seem to be designed with a
> top-down approach. Although the disk is referenced from
> __blkdev_get(), the disk's request queue is not. Also,
> blk_cleanup_queue() calls elevator_exit() without checking whether
> anyone still uses the elevator.
[...]
I still don't understand how there can be a problem here. Shouldn't the
sequence be:
1. low-level determines that a device went away
2. low-level takes note that from now on no new requests must be
enqueued anymore
3. low-level calls blk_cleanup_queue
4. blk_cleanup_queue waits until remaining requests are done
(it calls blk_sync_queue)
5. blk_cleanup_queue cleans up block layer data
6. low-level can now clean up/ free its own data
Does the MMC layer miss step 2? Because without this, step 4 would be
in vain.
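To make step 2 concrete, this is roughly what I would expect in the
lower level (hypothetical names, using blk_fetch_request() and
__blk_end_request_all() as on 2.6.31+; not actual MMC code):

#include <linux/blkdev.h>
#include <linux/slab.h>

struct my_disk {
	struct request_queue	*queue;
	int			removed;	/* set once the card is gone */
};

/* the request_fn is called with q->queue_lock held */
static void my_disk_request_fn(struct request_queue *q)
{
	struct my_disk *md = q->queuedata;
	struct request *req;

	if (!md || md->removed) {
		/* step 2 in action: fail everything instead of queueing it */
		while ((req = blk_fetch_request(q)) != NULL)
			__blk_end_request_all(req, -EIO);
		return;
	}

	/* ... normal request handling ... */
}

static void my_disk_remove(struct my_disk *md)
{
	unsigned long flags;

	spin_lock_irqsave(md->queue->queue_lock, flags);
	md->removed = 1;			/* step 2: refuse new work */
	spin_unlock_irqrestore(md->queue->queue_lock, flags);

	blk_cleanup_queue(md->queue);		/* steps 3-5 */
	kfree(md);				/* step 6 */
}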
--
Stefan Richter
-=====-==--= =-== ---==
http://arcgraph.de/sr/