LinuxLists.cc - [PATCH v7 0/8] blktrace: fix debugfs use after free

2020-06-20 04:44:24

Subject: [PATCH v7 0/8] blktrace: fix debugfs use after free

Its been a fun ride, but all patch series come to an end. My hope is
that this is it. The simplification of the fix is considerable now,
with only a few lines of code and with no data structure changes.

We were only creating the debugfs_dir upon initialization only if
you had CONFIG_BLK_DEBUG_FS for for make_request block drivers
(multiqueue). That's where the UAF bug could happen. Folks liked
the idea of open coding the debugfs initialization even if
CONFIG_BLK_DEBUG_FS was disabled, given that debugfs code will
simply ignore that code if debugfs is disabled, but to make
the fix easier to backport, that shift is done now in another
patch. Likewise, although we were only creating the debugfs_dir
only for make_request block drivers (multiqueue), the same new
additional patch also creates the debugfs_dir for request-based
block drivers. That *begged* us to just rename the mutex to
clarify its for the debugfs_dir, blktrace then just becomes
its biggest user.

The only patches changed here is the last one from the last series,
which actually fixed the UAF oops, and that one is now split in 3
patches, which makes a secondary fix much clearer.

I've waited a while to post these, to let 0-day give me its blessings,
both for Linus' tree and linux-next. No issues have been found. I've
also taken time to run blktests prior and after this series and I have
found no regressions. In fact, I think I should just extend blktests
with the break-blktrace tests, I'll do that later.

Luis Chamberlain (8):
block: add docs for gendisk / request_queue refcount helpers
block: clarify context for refcount increment helpers
block: revert back to synchronous request_queue removal
blktrace: annotate required lock on do_blk_trace_setup()
loop: be paranoid on exit and prevent new additions / removals
blktrace: fix debugfs use after free
blktrace: ensure our debugfs dir exists
block: create the request_queue debugfs_dir on registration

block/blk-core.c | 31 +++++++++++++----
block/blk-mq-debugfs.c | 5 ---
block/blk-sysfs.c | 52 ++++++++++++++++------------
block/blk.h | 2 --
block/genhd.c | 73 ++++++++++++++++++++++++++++++++++++++-
drivers/block/loop.c | 4 +++
include/linux/blkdev.h | 7 ++--
kernel/trace/blktrace.c | 76 ++++++++++++++++++++++++-----------------
8 files changed, 179 insertions(+), 71 deletions(-)

--
2.26.2

2020-06-20 04:44:43

by Luis Chamberlain

[permalink] [raw]

Subject: [PATCH v7 3/8] block: revert back to synchronous request_queue removal

Commit dc9edc44de6c ("block: Fix a blk_exit_rl() regression") merged on
v4.12 moved the work behind blk_release_queue() into a workqueue after a
splat floated around which indicated some work on blk_release_queue()
could sleep in blk_exit_rl(). This splat would be possible when a driver
called blk_put_queue() or blk_cleanup_queue() (which calls blk_put_queue()
as its final call) from an atomic context.

blk_put_queue() decrements the refcount for the request_queue kobject, and
upon reaching 0 blk_release_queue() is called. Although blk_exit_rl() is
now removed through commit db6d99523560 ("block: remove request_list code")
on v5.0, we reserve the right to be able to sleep within
blk_release_queue() context.

The last reference for the request_queue must not be called from atomic
context. *When* the last reference to the request_queue reaches 0 varies,
and so let's take the opportunity to document when that is expected to
happen and also document the context of the related calls as best as
possible so we can avoid future issues, and with the hopes that the
synchronous request_queue removal sticks.

We revert back to synchronous request_queue removal because asynchronous
removal creates a regression with expected userspace interaction with
several drivers. An example is when removing the loopback driver, one
uses ioctls from userspace to do so, but upon return and if successful,
one expects the device to be removed. Likewise if one races to add another
device the new one may not be added as it is still being removed. This was
expected behavior before and it now fails as the device is still present
and busy still. Moving to asynchronous request_queue removal could have
broken many scripts which relied on the removal to have been completed if
there was no error. Document this expectation as well so that this
doesn't regress userspace again.

Using asynchronous request_queue removal however has helped us find
other bugs. In the future we can test what could break with this
arrangement by enabling CONFIG_DEBUG_KOBJECT_RELEASE.

While at it, update the docs with the context expectations for the
request_queue / gendisk refcount decrement, and make these
expectations explicit by using might_sleep().

Cc: Bart Van Assche <[email protected]>
Cc: Omar Sandoval <[email protected]>
Cc: Hannes Reinecke <[email protected]>
Cc: Nicolai Stange <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: yu kuai <[email protected]>
Suggested-by: Nicolai Stange <[email protected]>
Fixes: dc9edc44de6c ("block: Fix a blk_exit_rl() regression")
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Signed-off-by: Luis Chamberlain <[email protected]>
---
block/blk-core.c | 8 ++++++++
block/blk-sysfs.c | 43 +++++++++++++++++++++---------------------
block/genhd.c | 17 +++++++++++++++++
include/linux/blkdev.h | 2 --
4 files changed, 47 insertions(+), 23 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 14c09daf55f3..a5126c0be777 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -327,6 +327,9 @@ EXPORT_SYMBOL_GPL(blk_clear_pm_only);
*
* Decrements the refcount of the request_queue kobject. When this reaches 0
* we'll have blk_release_queue() called.
+ *
+ * Context: Any context, but the last reference must not be dropped from
+ * atomic context.
*/
void blk_put_queue(struct request_queue *q)
{
@@ -359,9 +362,14 @@ EXPORT_SYMBOL_GPL(blk_set_queue_dying);
*
* Mark @q DYING, drain all pending requests, mark @q DEAD, destroy and
* put it. All future requests will be failed immediately with -ENODEV.
+ *
+ * Context: can sleep
*/
void blk_cleanup_queue(struct request_queue *q)
{
+ /* cannot be called from atomic context */
+ might_sleep();
+
WARN_ON_ONCE(blk_queue_registered(q));

/* mark @q DYING, no new request or merges will be allowed afterwards */
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 02643e149d5e..561624d4cc4e 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -873,22 +873,32 @@ static void blk_exit_queue(struct request_queue *q)
bdi_put(q->backing_dev_info);
}

-
/**
- * __blk_release_queue - release a request queue
- * @work: pointer to the release_work member of the request queue to be released
+ * blk_release_queue - releases all allocated resources of the request_queue
+ * @kobj: pointer to a kobject, whose container is a request_queue
+ *
+ * This function releases all allocated resources of the request queue.
+ *
+ * The struct request_queue refcount is incremented with blk_get_queue() and
+ * decremented with blk_put_queue(). Once the refcount reaches 0 this function
+ * is called.
+ *
+ * For drivers that have a request_queue on a gendisk and added with
+ * __device_add_disk() the refcount to request_queue will reach 0 with
+ * the last put_disk() called by the driver. For drivers which don't use
+ * __device_add_disk() this happens with blk_cleanup_queue().
*
- * Description:
- * This function is called when a block device is being unregistered. The
- * process of releasing a request queue starts with blk_cleanup_queue, which
- * set the appropriate flags and then calls blk_put_queue, that decrements
- * the reference counter of the request queue. Once the reference counter
- * of the request queue reaches zero, blk_release_queue is called to release
- * all allocated resources of the request queue.
+ * Drivers exist which depend on the release of the request_queue to be
+ * synchronous, it should not be deferred.
+ *
+ * Context: can sleep
*/
-static void __blk_release_queue(struct work_struct *work)
+static void blk_release_queue(struct kobject *kobj)
{
- struct request_queue *q = container_of(work, typeof(*q), release_work);
+ struct request_queue *q =
+ container_of(kobj, struct request_queue, kobj);
+
+ might_sleep();

if (test_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags))
blk_stat_remove_callback(q, q->poll_cb);
@@ -917,15 +927,6 @@ static void __blk_release_queue(struct work_struct *work)
call_rcu(&q->rcu_head, blk_free_queue_rcu);
}

-static void blk_release_queue(struct kobject *kobj)
-{
- struct request_queue *q =
- container_of(kobj, struct request_queue, kobj);
-
- INIT_WORK(&q->release_work, __blk_release_queue);
- schedule_work(&q->release_work);
-}
-
static const struct sysfs_ops queue_sysfs_ops = {
.show = queue_attr_show,
.store = queue_attr_store,
diff --git a/block/genhd.c b/block/genhd.c
index 1be86b1f43ec..60ae4e1b4d38 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -889,12 +889,19 @@ static void invalidate_partition(struct gendisk *disk, int partno)
* The final removal of the struct gendisk happens when its refcount reaches 0
* with put_disk(), which should be called after del_gendisk(), if
* __device_add_disk() was used.
+ *
+ * Drivers exist which depend on the release of the gendisk to be synchronous,
+ * it should not be deferred.
+ *
+ * Context: can sleep
*/
void del_gendisk(struct gendisk *disk)
{
struct disk_part_iter piter;
struct hd_struct *part;

+ might_sleep();
+
blk_integrity_del(disk);
disk_del_events(disk);

@@ -1548,11 +1555,15 @@ int disk_expand_part_tbl(struct gendisk *disk, int partno)
* drivers we also call blk_put_queue() for them, and we expect the
* request_queue refcount to reach 0 at this point, and so the request_queue
* will also be freed prior to the disk.
+ *
+ * Context: can sleep
*/
static void disk_release(struct device *dev)
{
struct gendisk *disk = dev_to_disk(dev);

+ might_sleep();
+
blk_free_devt(dev->devt);
disk_release_events(disk);
kfree(disk->random);
@@ -1797,6 +1808,9 @@ EXPORT_SYMBOL(get_disk_and_module);
*
* This decrements the refcount for the struct gendisk. When this reaches 0
* we'll have disk_release() called.
+ *
+ * Context: Any context, but the last reference must not be dropped from
+ * atomic context.
*/
void put_disk(struct gendisk *disk)
{
@@ -1811,6 +1825,9 @@ EXPORT_SYMBOL(put_disk);
*
* This is a counterpart of get_disk_and_module() and thus also of
* get_gendisk().
+ *
+ * Context: Any context, but the last reference must not be dropped from
+ * atomic context.
*/
void put_disk_and_module(struct gendisk *disk)
{
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6e067dca94cf..780d6fa94746 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -584,8 +584,6 @@ struct request_queue {

size_t cmd_size;

- struct work_struct release_work;
-
#define BLK_MAX_WRITE_HINTS 5
u64 write_hints[BLK_MAX_WRITE_HINTS];
};
--
2.26.2

2020-06-20 04:45:43

by Luis Chamberlain

[permalink] [raw]

Subject: [PATCH v7 1/8] block: add docs for gendisk / request_queue refcount helpers

This adds documentation for the gendisk / request_queue refcount
helpers.

Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Signed-off-by: Luis Chamberlain <[email protected]>
---
block/blk-core.c | 13 +++++++++++++
block/genhd.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 62a4904db921..a0760aac110a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -321,6 +321,13 @@ void blk_clear_pm_only(struct request_queue *q)
}
EXPORT_SYMBOL_GPL(blk_clear_pm_only);

+/**
+ * blk_put_queue - decrement the request_queue refcount
+ * @q: the request_queue structure to decrement the refcount for
+ *
+ * Decrements the refcount of the request_queue kobject. When this reaches 0
+ * we'll have blk_release_queue() called.
+ */
void blk_put_queue(struct request_queue *q)
{
kobject_put(&q->kobj);
@@ -598,6 +605,12 @@ struct request_queue *blk_alloc_queue(make_request_fn make_request, int node_id)
}
EXPORT_SYMBOL(blk_alloc_queue);

+/**
+ * blk_get_queue - increment the request_queue refcount
+ * @q: the request_queue structure to increment the refcount for
+ *
+ * Increment the refcount of the request_queue kobject.
+ */
bool blk_get_queue(struct request_queue *q)
{
if (likely(!blk_queue_dying(q))) {
diff --git a/block/genhd.c b/block/genhd.c
index 1a7659327664..f741613d731f 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -876,6 +876,20 @@ static void invalidate_partition(struct gendisk *disk, int partno)
bdput(bdev);
}

+/**
+ * del_gendisk - remove the gendisk
+ * @disk: the struct gendisk to remove
+ *
+ * Removes the gendisk and all its associated resources. This deletes the
+ * partitions associated with the gendisk, and unregisters the associated
+ * request_queue.
+ *
+ * This is the counter to the respective __device_add_disk() call.
+ *
+ * The final removal of the struct gendisk happens when its refcount reaches 0
+ * with put_disk(), which should be called after del_gendisk(), if
+ * __device_add_disk() was used.
+ */
void del_gendisk(struct gendisk *disk)
{
struct disk_part_iter piter;
@@ -1514,6 +1528,23 @@ int disk_expand_part_tbl(struct gendisk *disk, int partno)
return 0;
}

+/**
+ * disk_release - releases all allocated resources of the gendisk
+ * @dev: the device representing this disk
+ *
+ * This function releases all allocated resources of the gendisk.
+ *
+ * The struct gendisk refcount is incremented with get_gendisk() or
+ * get_disk_and_module(), and its refcount is decremented with
+ * put_disk_and_module() or put_disk(). Once the refcount reaches 0 this
+ * function is called.
+ *
+ * Drivers which used __device_add_disk() have a gendisk with a request_queue
+ * assigned. Since the request_queue sits on top of the gendisk for these
+ * drivers we also call blk_put_queue() for them, and we expect the
+ * request_queue refcount to reach 0 at this point, and so the request_queue
+ * will also be freed prior to the disk.
+ */
static void disk_release(struct device *dev)
{
struct gendisk *disk = dev_to_disk(dev);
@@ -1727,6 +1758,13 @@ struct gendisk *__alloc_disk_node(int minors, int node_id)
}
EXPORT_SYMBOL(__alloc_disk_node);

+/**
+ * get_disk_and_module - increments the gendisk and gendisk fops module refcount
+ * @disk: the struct gendisk to to increment the refcount for
+ *
+ * This increments the refcount for the struct gendisk, and the gendisk's
+ * fops module owner.
+ */
struct kobject *get_disk_and_module(struct gendisk *disk)
{
struct module *owner;
@@ -1747,6 +1785,13 @@ struct kobject *get_disk_and_module(struct gendisk *disk)
}
EXPORT_SYMBOL(get_disk_and_module);

+/**
+ * put_disk - decrements the gendisk refcount
+ * @disk: the struct gendisk to to decrement the refcount for
+ *
+ * This decrements the refcount for the struct gendisk. When this reaches 0
+ * we'll have disk_release() called.
+ */
void put_disk(struct gendisk *disk)
{
if (disk)
@@ -1754,7 +1799,10 @@ void put_disk(struct gendisk *disk)
}
EXPORT_SYMBOL(put_disk);

-/*
+/**
+ * put_disk_and_module - decrements the module and gendisk refcount
+ * @disk: the struct gendisk to to decrement the refcount for
+ *
* This is a counterpart of get_disk_and_module() and thus also of
* get_gendisk().
*/
--
2.26.2

2020-06-20 04:45:49

by Luis Chamberlain

[permalink] [raw]

Subject: [PATCH v7 2/8] block: clarify context for refcount increment helpers

Let us clarify the context under which the helpers to increment the
refcount for the gendisk and request_queue can be called under. We
make this explicit on the places where we may sleep with might_sleep().

We don't address the decrement context yet, as that needs some extra
work and fixes, but will be addressed in the next patch.

Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Signed-off-by: Luis Chamberlain <[email protected]>
---
block/blk-core.c | 2 ++
block/genhd.c | 6 ++++++
2 files changed, 8 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index a0760aac110a..14c09daf55f3 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -610,6 +610,8 @@ EXPORT_SYMBOL(blk_alloc_queue);
* @q: the request_queue structure to increment the refcount for
*
* Increment the refcount of the request_queue kobject.
+ *
+ * Context: Any context.
*/
bool blk_get_queue(struct request_queue *q)
{
diff --git a/block/genhd.c b/block/genhd.c
index f741613d731f..1be86b1f43ec 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -985,11 +985,15 @@ static ssize_t disk_badblocks_store(struct device *dev,
*
* This function gets the structure containing partitioning
* information for the given device @devt.
+ *
+ * Context: can sleep
*/
struct gendisk *get_gendisk(dev_t devt, int *partno)
{
struct gendisk *disk = NULL;

+ might_sleep();
+
if (MAJOR(devt) != BLOCK_EXT_MAJOR) {
struct kobject *kobj;

@@ -1764,6 +1768,8 @@ EXPORT_SYMBOL(__alloc_disk_node);
*
* This increments the refcount for the struct gendisk, and the gendisk's
* fops module owner.
+ *
+ * Context: Any context.
*/
struct kobject *get_disk_and_module(struct gendisk *disk)
{
--
2.26.2

2020-06-20 04:46:08

by Luis Chamberlain

[permalink] [raw]

Subject: [PATCH v7 6/8] blktrace: fix debugfs use after free

On commit 6ac93117ab00 ("blktrace: use existing disk debugfs directory")
merged on v4.12 Omar fixed the original blktrace code for request-based
drivers (multiqueue). This however left in place a possible crash, if you
happen to abuse blktrace while racing to remove / add a device.

We used to use asynchronous removal of the request_queue, and with that
the issue was easier to reproduce. Now that we have reverted to
synchronous removal of the request_queue, the issue is still possible to
reproduce, its however just a bit more difficult.

We essentially run two instances of break-blktrace which add/remove
a loop device, and setup a blktrace and just never tear the blktrace
down. We do this twice in parallel. This is easily reproduced with the
script run_0004.sh from break-blktrace [0].

We can end up with two types of panics each reflecting where we
race, one a failed blktrace setup:

[ 252.426751] debugfs: Directory 'loop0' with parent 'block' already present!
[ 252.432265] BUG: kernel NULL pointer dereference, address: 00000000000000a0
[ 252.436592] #PF: supervisor write access in kernel mode
[ 252.439822] #PF: error_code(0x0002) - not-present page
[ 252.442967] PGD 0 P4D 0
[ 252.444656] Oops: 0002 [#1] SMP NOPTI
[ 252.446972] CPU: 10 PID: 1153 Comm: break-blktrace Tainted: G E 5.7.0-rc2-next-20200420+ #164
[ 252.452673] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
[ 252.456343] RIP: 0010:down_write+0x15/0x40
[ 252.458146] Code: eb ca e8 ae 22 8d ff cc cc cc cc cc cc cc cc cc cc cc cc
cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00
00 00 <f0> 48 0f b1 55 00 75 0f 48 8b 04 25 c0 8b 01 00 48 89
45 08 5d
[ 252.463638] RSP: 0018:ffffa626415abcc8 EFLAGS: 00010246
[ 252.464950] RAX: 0000000000000000 RBX: ffff958c25f0f5c0 RCX: ffffff8100000000
[ 252.466727] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0
[ 252.468482] RBP: 00000000000000a0 R08: 0000000000000000 R09: 0000000000000001
[ 252.470014] R10: 0000000000000000 R11: ffff958d1f9227ff R12: 0000000000000000
[ 252.471473] R13: ffff958c25ea5380 R14: ffffffff8cce15f1 R15: 00000000000000a0
[ 252.473346] FS: 00007f2e69dee540(0000) GS:ffff958c2fc80000(0000) knlGS:0000000000000000
[ 252.475225] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 252.476267] CR2: 00000000000000a0 CR3: 0000000427d10004 CR4: 0000000000360ee0
[ 252.477526] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 252.478776] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 252.479866] Call Trace:
[ 252.480322] simple_recursive_removal+0x4e/0x2e0
[ 252.481078] ? debugfs_remove+0x60/0x60
[ 252.481725] ? relay_destroy_buf+0x77/0xb0
[ 252.482662] debugfs_remove+0x40/0x60
[ 252.483518] blk_remove_buf_file_callback+0x5/0x10
[ 252.484328] relay_close_buf+0x2e/0x60
[ 252.484930] relay_open+0x1ce/0x2c0
[ 252.485520] do_blk_trace_setup+0x14f/0x2b0
[ 252.486187] __blk_trace_setup+0x54/0xb0
[ 252.486803] blk_trace_ioctl+0x90/0x140
[ 252.487423] ? do_sys_openat2+0x1ab/0x2d0
[ 252.488053] blkdev_ioctl+0x4d/0x260
[ 252.488636] block_ioctl+0x39/0x40
[ 252.489139] ksys_ioctl+0x87/0xc0
[ 252.489675] __x64_sys_ioctl+0x16/0x20
[ 252.490380] do_syscall_64+0x52/0x180
[ 252.491032] entry_SYSCALL_64_after_hwframe+0x44/0xa9

And the other on the device removal:

[ 128.528940] debugfs: Directory 'loop0' with parent 'block' already present!
[ 128.615325] BUG: kernel NULL pointer dereference, address: 00000000000000a0
[ 128.619537] #PF: supervisor write access in kernel mode
[ 128.622700] #PF: error_code(0x0002) - not-present page
[ 128.625842] PGD 0 P4D 0
[ 128.627585] Oops: 0002 [#1] SMP NOPTI
[ 128.629871] CPU: 12 PID: 544 Comm: break-blktrace Tainted: G E 5.7.0-rc2-next-20200420+ #164
[ 128.635595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
[ 128.640471] RIP: 0010:down_write+0x15/0x40
[ 128.643041] Code: eb ca e8 ae 22 8d ff cc cc cc cc cc cc cc cc cc cc cc cc
cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00
00 00 <f0> 48 0f b1 55 00 75 0f 65 48 8b 04 25 c0 8b 01 00 48 89
45 08 5d
[ 128.650180] RSP: 0018:ffffa9c3c05ebd78 EFLAGS: 00010246
[ 128.651820] RAX: 0000000000000000 RBX: ffff8ae9a6370240 RCX: ffffff8100000000
[ 128.653942] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0
[ 128.655720] RBP: 00000000000000a0 R08: 0000000000000002 R09: ffff8ae9afd2d3d0
[ 128.657400] R10: 0000000000000056 R11: 0000000000000000 R12: 0000000000000000
[ 128.659099] R13: 0000000000000000 R14: 0000000000000003 R15: 00000000000000a0
[ 128.660500] FS: 00007febfd995540(0000) GS:ffff8ae9afd00000(0000) knlGS:0000000000000000
[ 128.662204] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 128.663426] CR2: 00000000000000a0 CR3: 0000000420042003 CR4: 0000000000360ee0
[ 128.664776] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 128.666022] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 128.667282] Call Trace:
[ 128.667801] simple_recursive_removal+0x4e/0x2e0
[ 128.668663] ? debugfs_remove+0x60/0x60
[ 128.669368] debugfs_remove+0x40/0x60
[ 128.669985] blk_trace_free+0xd/0x50
[ 128.670593] __blk_trace_remove+0x27/0x40
[ 128.671274] blk_trace_shutdown+0x30/0x40
[ 128.671935] blk_release_queue+0x95/0xf0
[ 128.672589] kobject_put+0xa5/0x1b0
[ 128.673188] disk_release+0xa2/0xc0
[ 128.673786] device_release+0x28/0x80
[ 128.674376] kobject_put+0xa5/0x1b0
[ 128.674915] loop_remove+0x39/0x50 [loop]
[ 128.675511] loop_control_ioctl+0x113/0x130 [loop]
[ 128.676199] ksys_ioctl+0x87/0xc0
[ 128.676708] __x64_sys_ioctl+0x16/0x20
[ 128.677274] do_syscall_64+0x52/0x180
[ 128.677823] entry_SYSCALL_64_after_hwframe+0x44/0xa9

The common theme here is:

debugfs: Directory 'loop0' with parent 'block' already present

This crash happens because of how blktrace uses the debugfs directory
where it places its files. Upon init we always create the same directory
which would be needed by blktrace but we only do this for make_request
drivers (multiqueue) block drivers. When you race a removal of these
devices with a blktrace setup you end up in a situation where the
make_request recursive debugfs removal will sweep away the blktrace
files and then later blktrace will also try to remove individual
dentries which are already NULL. The inverse is also possible and hence
the two types of use after frees.

We don't create the block debugfs directory on init for these types of
block devices:

* request-based block driver block devices
* every possible partition
* scsi-generic

And so, this race should in theory only be possible with make_request
drivers.

We can fix the UAF by simply re-using the debugfs directory for
make_request drivers (multiqueue) and only creating the ephemeral
directory for the other type of block devices. The new clarifications
on relying on the q->blk_trace_mutex *and* also checking for q->blk_trace
*prior* to processing a blktrace ensures the debugfs directories are
only created if no possible directory name clashes are possible.

This goes tested with:

o nvme partitions
o ISCSI with tgt, and blktracing against scsi-generic with:
o block
o tape
o cdrom
o media changer
o blktests

This patch is part of the work which disputes the severity of
CVE-2019-19770 which shows this issue is not a core debugfs issue, but
a misuse of debugfs within blktace.

Cc: Bart Van Assche <[email protected]>
Cc: Omar Sandoval <[email protected]>
Cc: Hannes Reinecke <[email protected]>
Cc: Nicolai Stange <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: yu kuai <[email protected]>
Reported-by: [email protected]
Fixes: 6ac93117ab00 ("blktrace: use existing disk debugfs directory")
Signed-off-by: Luis Chamberlain <[email protected]>
---
kernel/trace/blktrace.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 7ff2ea5cd05e..e6e2d25fdbd6 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -524,10 +524,18 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
if (!bt->msg_data)
goto err;

- ret = -ENOENT;
-
- dir = debugfs_lookup(buts->name, blk_debugfs_root);
- if (!dir)
+#ifdef CONFIG_BLK_DEBUG_FS
+ /*
+ * When tracing whole make_request drivers (multiqueue) block devices,
+ * reuse the existing debugfs directory created by the block layer on
+ * init. For request-based block devices, all partitions block devices,
+ * and scsi-generic block devices we create a temporary new debugfs
+ * directory that will be removed once the trace ends.
+ */
+ if (queue_is_mq(q) && bdev && bdev == bdev->bd_contains)
+ dir = q->debugfs_dir;
+ else
+#endif
bt->dir = dir = debugfs_create_dir(buts->name, blk_debugfs_root);

bt->dev = dev;
@@ -565,8 +573,6 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,

ret = 0;
err:
- if (dir && !bt->dir)
- dput(dir);
if (ret)
blk_trace_free(bt);
return ret;
--
2.26.2

2020-06-20 07:32:16

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [PATCH v7 6/8] blktrace: fix debugfs use after free

Still not a fan of the wall of text in the commit log, but the changes
look good:

Reviewed-by: Christoph Hellwig <[email protected]>

2020-06-20 17:34:54

by Bart Van Assche

[permalink] [raw]

Subject: Re: [PATCH v7 6/8] blktrace: fix debugfs use after free

On 2020-06-19 13:47, Luis Chamberlain wrote:
> This goes tested with:
^^^^
got?

> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> index 7ff2ea5cd05e..e6e2d25fdbd6 100644
> --- a/kernel/trace/blktrace.c
> +++ b/kernel/trace/blktrace.c
> @@ -524,10 +524,18 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
> if (!bt->msg_data)
> goto err;
>
> - ret = -ENOENT;
> -
> - dir = debugfs_lookup(buts->name, blk_debugfs_root);
> - if (!dir)
> +#ifdef CONFIG_BLK_DEBUG_FS
> + /*
> + * When tracing whole make_request drivers (multiqueue) block devices,
> + * reuse the existing debugfs directory created by the block layer on
> + * init. For request-based block devices, all partitions block devices,
^^^^^^^^^^^^^^^^^^^^^
It seems like a word is missing from the comment? Or did you perhaps
want to refer to "all partition block devices"?

> + * and scsi-generic block devices we create a temporary new debugfs
> + * directory that will be removed once the trace ends.
> + */
> + if (queue_is_mq(q) && bdev && bdev == bdev->bd_contains)
> + dir = q->debugfs_dir;
> + else
> +#endif
> bt->dir = dir = debugfs_create_dir(buts->name, blk_debugfs_root);

Can it happen that two different threads each try to set up block
tracing and hence that debugfs_create_dir() fails because a directory
with name buts->name already exists?

> bt->dev = dev;
> @@ -565,8 +573,6 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
>
> ret = 0;
> err:
> - if (dir && !bt->dir)
> - dput(dir);
> if (ret)
> blk_trace_free(bt);
> return ret;

Shouldn't bt->dir be removed in this error path for make_request drivers?

Thanks,

Bart.

2020-06-20 21:21:34

by Jens Axboe

[permalink] [raw]

Subject: Re: [PATCH v7 0/8] blktrace: fix debugfs use after free

On 6/19/20 2:47 PM, Luis Chamberlain wrote:
> Its been a fun ride, but all patch series come to an end. My hope is
> that this is it. The simplification of the fix is considerable now,
> with only a few lines of code and with no data structure changes.
>
> We were only creating the debugfs_dir upon initialization only if
> you had CONFIG_BLK_DEBUG_FS for for make_request block drivers
> (multiqueue). That's where the UAF bug could happen. Folks liked
> the idea of open coding the debugfs initialization even if
> CONFIG_BLK_DEBUG_FS was disabled, given that debugfs code will
> simply ignore that code if debugfs is disabled, but to make
> the fix easier to backport, that shift is done now in another
> patch. Likewise, although we were only creating the debugfs_dir
> only for make_request block drivers (multiqueue), the same new
> additional patch also creates the debugfs_dir for request-based
> block drivers. That *begged* us to just rename the mutex to
> clarify its for the debugfs_dir, blktrace then just becomes
> its biggest user.
>
> The only patches changed here is the last one from the last series,
> which actually fixed the UAF oops, and that one is now split in 3
> patches, which makes a secondary fix much clearer.
>
> I've waited a while to post these, to let 0-day give me its blessings,
> both for Linus' tree and linux-next. No issues have been found. I've
> also taken time to run blktests prior and after this series and I have
> found no regressions. In fact, I think I should just extend blktests
> with the break-blktrace tests, I'll do that later.
>
> Luis Chamberlain (8):
> block: add docs for gendisk / request_queue refcount helpers
> block: clarify context for refcount increment helpers
> block: revert back to synchronous request_queue removal
> blktrace: annotate required lock on do_blk_trace_setup()
> loop: be paranoid on exit and prevent new additions / removals
> blktrace: fix debugfs use after free
> blktrace: ensure our debugfs dir exists
> block: create the request_queue debugfs_dir on registration

Applied for 5.9, thanks.

--
Jens Axboe

2020-06-22 12:41:07

by Luis Chamberlain

[permalink] [raw]

Subject: Re: [PATCH v7 6/8] blktrace: fix debugfs use after free

On Sat, Jun 20, 2020 at 10:31:51AM -0700, Bart Van Assche wrote:
> On 2020-06-19 13:47, Luis Chamberlain wrote:
> > This goes tested with:
> ^^^^
> got?
>
> > diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> > index 7ff2ea5cd05e..e6e2d25fdbd6 100644
> > --- a/kernel/trace/blktrace.c
> > +++ b/kernel/trace/blktrace.c
> > @@ -524,10 +524,18 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
> > if (!bt->msg_data)
> > goto err;
> >
> > - ret = -ENOENT;
> > -
> > - dir = debugfs_lookup(buts->name, blk_debugfs_root);
> > - if (!dir)
> > +#ifdef CONFIG_BLK_DEBUG_FS
> > + /*
> > + * When tracing whole make_request drivers (multiqueue) block devices,
> > + * reuse the existing debugfs directory created by the block layer on
> > + * init. For request-based block devices, all partitions block devices,
> ^^^^^^^^^^^^^^^^^^^^^
> It seems like a word is missing from the comment? Or did you perhaps
> want to refer to "all partition block devices"?

Yes, the later.

> > + * and scsi-generic block devices we create a temporary new debugfs
> > + * directory that will be removed once the trace ends.
> > + */
> > + if (queue_is_mq(q) && bdev && bdev == bdev->bd_contains)
> > + dir = q->debugfs_dir;
> > + else
> > +#endif
> > bt->dir = dir = debugfs_create_dir(buts->name, blk_debugfs_root);
>
> Can it happen that two different threads each try to set up block
> tracing and hence that debugfs_create_dir() fails because a directory
> with name buts->name already exists?

Great question, the answer is no. The reason is that we first use the
mutex and then we check for q->blk_trace. If you hold the lock *and*
you have checked for q->blk_trace and its NULL, you are sure you should
not have a duplicate.

Its why the commit log mentioned:

The new clarifications on relying on the q->blk_trace_mutex *and* also
checking for q->blk_trace *prior* to processing a blktrace ensures the
debugfs directories are only created if no possible directory name
clashes are possible.

These clarifications were prompted through discussions with Jan Kara
on the patches he posted which you CC'd me on. I agreed with his
patch *but* I suggested it would hold true only if check for the
q->blk_trace first, and this is why my patch titled "blktrace: break
out of blktrace setup on concurrent calls" got merged prior to Jan
Kara's "blktrace: Avoid sparse warnings when assigning q->blk_trace".

> > bt->dev = dev;
> > @@ -565,8 +573,6 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
> >
> > ret = 0;
> > err:
> > - if (dir && !bt->dir)
> > - dput(dir);
> > if (ret)
> > blk_trace_free(bt);
> > return ret;
>
> Shouldn't bt->dir be removed in this error path for make_request drivers?

If there is an error, bt->dir will be removed still, as I never modified
the removal of bt->dir in this patch.

Luis