This series aims to tackle the various UAF reports, like:
[0] https://lore.kernel.org/linux-block/[email protected]/
[1] https://lore.kernel.org/linux-block/[email protected]/T/#m6c1ac11540522716f645d004e2a5a13c9f218908
[2] https://lore.kernel.org/linux-block/[email protected]/
[3] https://lore.kernel.org/linux-block/[email protected]/
Details are in the commit messages.
The issue addressed in patch 1/3 is pretty easy to reproduce, 2+3/3 not so
much, and I had to add mdelays in the iters functions to recreate in
sane timeframes.
As regards patch 1/3, if 2+3/3 are adopted, then it can be simplified to
simply clear the tagset request pointers without using any atomic
operations. However, this patch on its own seems to solve problem [3]
above. So the other two patches are really for extreme scenarios which may
never be seen in practice. As such, it could be considered to just accept
patch 1/3 now.
Differences to v2:
- Add patch 2+3/3
- Drop patch to lockout blk_mq_queue_tag_busy_iter() when exiting elevator
John Garry (3):
blk-mq: Clean up references to old requests when freeing rqs
blk-mq: Freeze and quiesce all queues for tagset in elevator_exit()
blk-mq: Lockout tagset iterator when exiting elevator
block/blk-mq-sched.c | 2 +-
block/blk-mq-tag.c | 7 ++++++-
block/blk-mq.c | 21 +++++++++++++++++++--
block/blk-mq.h | 2 ++
block/blk.h | 23 +++++++++++++++++++++++
include/linux/blk-mq.h | 1 +
6 files changed, 52 insertions(+), 4 deletions(-)
--
2.26.2
A use-after-free may occur if blk_mq_queue_tag_busy_iter() is run on a
queue when another queue associated with the same tagset is switching IO
scheduler:
BUG: KASAN: use-after-free in bt_iter+0xa0/0x120
Read of size 8 at addr ffff0410285e7e00 by task fio/2302
CPU: 24 PID: 2302 Comm: fio Not tainted 5.12.0-rc1-11925-g29a317e228d9 #747
Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon D05 IT21 Nemo 2.0 RC0 04/18/2018
Call trace:
dump_backtrace+0x0/0x2d8
show_stack+0x18/0x68
dump_stack+0x124/0x1a0
print_address_description.constprop.13+0x68/0x30c
kasan_report+0x1e8/0x258
__asan_load8+0x9c/0xd8
bt_iter+0xa0/0x120
blk_mq_queue_tag_busy_iter+0x348/0x5d8
blk_mq_in_flight+0x80/0xb8
part_stat_show+0xcc/0x210
dev_attr_show+0x44/0x90
sysfs_kf_seq_show+0x120/0x1c0
kernfs_seq_show+0x9c/0xb8
seq_read_iter+0x214/0x668
kernfs_fop_read_iter+0x204/0x2c0
new_sync_read+0x1ec/0x2d0
vfs_read+0x18c/0x248
ksys_read+0xc8/0x178
__arm64_sys_read+0x44/0x58
el0_svc_common.constprop.1+0xc8/0x1a8
do_el0_svc+0x90/0xa0
el0_svc+0x24/0x38
el0_sync_handler+0x90/0xb8
el0_sync+0x154/0x180
Indeed, blk_mq_queue_tag_busy_iter() already does take a reference to its
queue usage counter when called, and the queue cannot be frozen to switch
IO scheduler until all refs are dropped. This ensures no stale references
to IO scheduler requests will be seen by blk_mq_queue_tag_busy_iter().
However, there is nothing to stop blk_mq_queue_tag_busy_iter() being
run for another queue associated with the same tagset, and it seeing
a stale IO scheduler request from the other queue after they are freed.
To stop this happening, freeze and quiesce all queues associated with the
tagset as the elevator is exited.
Signed-off-by: John Garry <[email protected]>
---
I think that this patch is what Bart suggested:
https://lore.kernel.org/linux-block/[email protected]/
block/blk.h | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/block/blk.h b/block/blk.h
index 3b53e44b967e..1a948bfd91e4 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -201,10 +201,29 @@ void elv_unregister_queue(struct request_queue *q);
static inline void elevator_exit(struct request_queue *q,
struct elevator_queue *e)
{
+ struct blk_mq_tag_set *set = q->tag_set;
+ struct request_queue *tmp;
+
lockdep_assert_held(&q->sysfs_lock);
+ mutex_lock(&set->tag_list_lock);
+ list_for_each_entry(tmp, &set->tag_list, tag_set_list) {
+ if (tmp == q)
+ continue;
+ blk_mq_freeze_queue(tmp);
+ blk_mq_quiesce_queue(tmp);
+ }
+
blk_mq_sched_free_requests(q);
__elevator_exit(q, e);
+
+ list_for_each_entry(tmp, &set->tag_list, tag_set_list) {
+ if (tmp == q)
+ continue;
+ blk_mq_unquiesce_queue(tmp);
+ blk_mq_unfreeze_queue(tmp);
+ }
+ mutex_unlock(&set->tag_list_lock);
}
ssize_t part_size_show(struct device *dev, struct device_attribute *attr,
--
2.26.2
All queues associated with a tagset are frozen when one queue is exiting
an elevator. This is to ensure that one queue running
blk_mq_queue_tag_busy_iter() cannot hold a stale request reference for
the queue that is exiting the elevator.
However, there is nothing to stop blk_mq_all_tag_iter() from being run for
the tagset and, again, getting hold of a stale request reference. A KASAN
UAF can be triggered for this scenario:
BUG: KASAN: use-after-free in bt_tags_iter+0xe0/0x128
Read of size 4 at addr ffff001085330fcc by task more/3038
CPU: 1 PID: 3038 Comm: more Not tainted 5.12.0-rc1-11926-g7359e4a1604d-dirty #750
Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon D05 IT21 Nemo 2.0 RC0 04/18/2018
Call trace:
dump_backtrace+0x0/0x2d0
show_stack+0x18/0x68
dump_stack+0x100/0x16c
print_address_description.constprop.13+0x68/0x30c
kasan_report+0x1d8/0x240
__asan_load4+0x9c/0xd8
bt_tags_iter+0xe0/0x128
__blk_mq_all_tag_iter+0x320/0x3a8
blk_mq_tagset_busy_iter+0x84/0xb8
scsi_host_busy+0x88/0xb8
show_host_busy+0x1c/0x48
dev_attr_show+0x44/0x90
sysfs_kf_seq_show+0x128/0x1c8
kernfs_seq_show+0xa0/0xb8
seq_read_iter+0x210/0x660
kernfs_fop_read_iter+0x208/0x2b0
new_sync_read+0x1ec/0x2d0
vfs_read+0x188/0x248
ksys_read+0xc8/0x178
__arm64_sys_read+0x44/0x58
el0_svc_common.constprop.1+0xc4/0x190
do_el0_svc+0x90/0xa0
el0_svc+0x24/0x38
el0_sync_handler+0x90/0xb8
el0_sync+0x154/0x180
To avoid this, reject the tagset iterators while a queue is exiting
the elevator.
This should not break any semantics of blk_mq_all_tag_iter(): since
all queues are frozen, there should be no active tags to iterate.
Signed-off-by: John Garry <[email protected]>
---
block/blk-mq-tag.c | 5 +++++
block/blk-mq.c | 1 +
block/blk.h | 4 ++++
include/linux/blk-mq.h | 1 +
4 files changed, 11 insertions(+)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 7ff1b20d58e7..5950fee490e8 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -358,11 +358,16 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
{
int i;
+ if (!atomic_inc_not_zero(&tagset->iter_usage_counter))
+ return;
+
for (i = 0; i < tagset->nr_hw_queues; i++) {
if (tagset->tags && tagset->tags[i])
__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
BT_TAG_ITER_STARTED);
}
+
+ atomic_dec(&tagset->iter_usage_counter);
}
EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9cb60bf7ac24..326e1b0e5b83 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3493,6 +3493,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
goto out_free_mq_rq_maps;
}
}
+ atomic_set(&set->iter_usage_counter, 1);
mutex_init(&set->tag_list_lock);
INIT_LIST_HEAD(&set->tag_list);
diff --git a/block/blk.h b/block/blk.h
index 1a948bfd91e4..461e5b54eb5f 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -214,9 +214,13 @@ static inline void elevator_exit(struct request_queue *q,
blk_mq_quiesce_queue(tmp);
}
+ while (atomic_cmpxchg(&set->iter_usage_counter, 1, 0) != 1);
+
blk_mq_sched_free_requests(q);
__elevator_exit(q, e);
+ atomic_set(&set->iter_usage_counter, 1);
+
list_for_each_entry(tmp, &set->tag_list, tag_set_list) {
if (tmp == q)
continue;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2c473c9b8990..30a21335767b 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -263,6 +263,7 @@ struct blk_mq_tag_set {
struct mutex tag_list_lock;
struct list_head tag_list;
+ atomic_t iter_usage_counter;
};
/**
--
2.26.2
It has been reported many times that a use-after-free can be intermittently
found when iterating busy requests:
- https://lore.kernel.org/linux-block/[email protected]/
- https://lore.kernel.org/linux-block/[email protected]/T/#m6c1ac11540522716f645d004e2a5a13c9f218908
- https://lore.kernel.org/linux-block/[email protected]/
The issue is that when we switch scheduler or change queue depth, there may
be references in the driver tagset to the stale requests.
As a solution, clean up any references to those requests in the driver
tagset. This is done with a cmpxchg to make safe any race with setting the
driver tagset request from another queue.
Signed-off-by: John Garry <[email protected]>
---
block/blk-mq-sched.c | 2 +-
block/blk-mq-tag.c | 2 +-
block/blk-mq.c | 20 ++++++++++++++++++--
block/blk-mq.h | 2 ++
4 files changed, 22 insertions(+), 4 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index ddb65e9e6fd9..bc19bd8f8c7b 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -615,7 +615,7 @@ void blk_mq_sched_free_requests(struct request_queue *q)
queue_for_each_hw_ctx(q, hctx, i) {
if (hctx->sched_tags)
- blk_mq_free_rqs(q->tag_set, hctx->sched_tags, i);
+ blk_mq_free_rqs_ext(q->tag_set, hctx->sched_tags, i, hctx->tags);
}
}
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index ce813b909339..7ff1b20d58e7 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -580,7 +580,7 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
return -ENOMEM;
}
- blk_mq_free_rqs(set, *tagsptr, hctx->queue_num);
+ blk_mq_free_rqs_ext(set, *tagsptr, hctx->queue_num, hctx->tags);
blk_mq_free_rq_map(*tagsptr, flags);
*tagsptr = new;
} else {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d4d7c1caa439..9cb60bf7ac24 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2286,8 +2286,8 @@ blk_qc_t blk_mq_submit_bio(struct bio *bio)
return BLK_QC_T_NONE;
}
-void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
- unsigned int hctx_idx)
+void __blk_mq_free_rqs_ext(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
+ unsigned int hctx_idx, struct blk_mq_tags *ref_tags)
{
struct page *page;
@@ -2296,10 +2296,14 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
for (i = 0; i < tags->nr_tags; i++) {
struct request *rq = tags->static_rqs[i];
+ int j;
if (!rq)
continue;
set->ops->exit_request(set, rq, hctx_idx);
+ /* clean up any references which occur in @ref_tags */
+ for (j = 0; ref_tags && j < ref_tags->nr_tags; j++)
+ cmpxchg(&ref_tags->rqs[j], rq, 0);
tags->static_rqs[i] = NULL;
}
}
@@ -2316,6 +2320,18 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
}
}
+void blk_mq_free_rqs_ext(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
+ unsigned int hctx_idx, struct blk_mq_tags *ref_tags)
+{
+ __blk_mq_free_rqs_ext(set, tags, hctx_idx, ref_tags);
+}
+
+void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
+ unsigned int hctx_idx)
+{
+ __blk_mq_free_rqs_ext(set, tags, hctx_idx, NULL);
+}
+
void blk_mq_free_rq_map(struct blk_mq_tags *tags, unsigned int flags)
{
kfree(tags->rqs);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 3616453ca28c..031e29f74926 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -53,6 +53,8 @@ struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
*/
void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
unsigned int hctx_idx);
+void blk_mq_free_rqs_ext(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
+ unsigned int hctx_idx, struct blk_mq_tags *references);
void blk_mq_free_rq_map(struct blk_mq_tags *tags, unsigned int flags);
struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
unsigned int hctx_idx,
--
2.26.2
On Fri, Mar 5, 2021 at 7:20 AM John Garry <[email protected]> wrote:
>
> It has been reported many times that a use-after-free can be intermittently
> found when iterating busy requests:
>
> - https://lore.kernel.org/linux-block/[email protected]/
> - https://lore.kernel.org/linux-block/[email protected]/T/#m6c1ac11540522716f645d004e2a5a13c9f218908
> - https://lore.kernel.org/linux-block/[email protected]/
>
> The issue is that when we switch scheduler or change queue depth, there may
> be references in the driver tagset to the stale requests.
>
> As a solution, clean up any references to those requests in the driver
> tagset. This is done with a cmpxchg to make safe any race with setting the
> driver tagset request from another queue.
I noticed this crash recently when running blktests on a "debug"
config on a 4.15 based kernel (it would always crash), and backporting
this change fixes it. (testing on linus's latest tree also confirmed
the fix, with the same config). I realize I'm late to the
conversation, but appreciate the investigation and fixes :)
On 3/5/21 7:14 AM, John Garry wrote:
> diff --git a/block/blk.h b/block/blk.h
> index 3b53e44b967e..1a948bfd91e4 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -201,10 +201,29 @@ void elv_unregister_queue(struct request_queue *q);
> static inline void elevator_exit(struct request_queue *q,
> struct elevator_queue *e)
> {
> + struct blk_mq_tag_set *set = q->tag_set;
> + struct request_queue *tmp;
> +
> lockdep_assert_held(&q->sysfs_lock);
>
> + mutex_lock(&set->tag_list_lock);
> + list_for_each_entry(tmp, &set->tag_list, tag_set_list) {
> + if (tmp == q)
> + continue;
> + blk_mq_freeze_queue(tmp);
> + blk_mq_quiesce_queue(tmp);
> + }
> +
> blk_mq_sched_free_requests(q);
> __elevator_exit(q, e);
> +
> + list_for_each_entry(tmp, &set->tag_list, tag_set_list) {
> + if (tmp == q)
> + continue;
> + blk_mq_unquiesce_queue(tmp);
> + blk_mq_unfreeze_queue(tmp);
> + }
> + mutex_unlock(&set->tag_list_lock);
> }
This patch introduces nesting of tag_list_lock inside sysfs_lock. The
latter is per request queue while the former can be shared across
multiple request queues. Has it been analyzed whether this is safe?
Thanks,
Bart.
On 3/5/21 7:14 AM, John Garry wrote:
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 7ff1b20d58e7..5950fee490e8 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -358,11 +358,16 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
> {
> int i;
>
> + if (!atomic_inc_not_zero(&tagset->iter_usage_counter))
> + return;
> +
> for (i = 0; i < tagset->nr_hw_queues; i++) {
> if (tagset->tags && tagset->tags[i])
> __blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
> BT_TAG_ITER_STARTED);
> }
> +
> + atomic_dec(&tagset->iter_usage_counter);
> }
> EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
This changes the behavior of blk_mq_tagset_busy_iter(). What will e.g.
happen if the mtip driver calls blk_mq_tagset_busy_iter(&dd->tags,
mtip_abort_cmd, dd) concurrently with another blk_mq_tagset_busy_iter()
call and if that causes all mtip_abort_cmd() calls to be skipped?
> + while (atomic_cmpxchg(&set->iter_usage_counter, 1, 0) != 1);
Isn't it recommended to call cpu_relax() inside busy-waiting loops?
> blk_mq_sched_free_requests(q);
> __elevator_exit(q, e);
>
> + atomic_set(&set->iter_usage_counter, 1);
Can it happen that the above atomic_set() call happens while a
blk_mq_tagset_busy_iter() call is in progress? Should that atomic_set()
call perhaps be changed into an atomic_inc() call?
Thanks,
Bart.
On 3/5/21 7:14 AM, John Garry wrote:
> @@ -2296,10 +2296,14 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>
> for (i = 0; i < tags->nr_tags; i++) {
> struct request *rq = tags->static_rqs[i];
> + int j;
>
> if (!rq)
> continue;
> set->ops->exit_request(set, rq, hctx_idx);
> + /* clean up any references which occur in @ref_tags */
> + for (j = 0; ref_tags && j < ref_tags->nr_tags; j++)
> + cmpxchg(&ref_tags->rqs[j], rq, 0);
> tags->static_rqs[i] = NULL;
> }
> }
What prevents blk_mq_tagset_busy_iter() from reading hctx->tags[...]
before the cmpxcg() call and dereferencing it after blk_mq_free_rqs()
has called __free_pages()?
Thanks,
Bart.
On 06/03/2021 18:13, Bart Van Assche wrote:
> On 3/5/21 7:14 AM, John Garry wrote:
>> @@ -2296,10 +2296,14 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>>
>> for (i = 0; i < tags->nr_tags; i++) {
>> struct request *rq = tags->static_rqs[i];
>> + int j;
>>
>> if (!rq)
>> continue;
>> set->ops->exit_request(set, rq, hctx_idx);
>> + /* clean up any references which occur in @ref_tags */
>> + for (j = 0; ref_tags && j < ref_tags->nr_tags; j++)
>> + cmpxchg(&ref_tags->rqs[j], rq, 0);
>> tags->static_rqs[i] = NULL;
>> }
>> }
Hi Bart,
> What prevents blk_mq_tagset_busy_iter() from reading hctx->tags[...]
> before the cmpxcg() call and dereferencing it after blk_mq_free_rqs()
> has called __free_pages()?
>
So there is nothing in this patch to stop that. But it's pretty
unlikely, as the window between reading hctx->tags[...] and actually
dereferencing it is generally very narrow. However, something like
that should be made safe by patch 2/3.
Thanks,
John
On 06/03/2021 04:32, Bart Van Assche wrote:
> On 3/5/21 7:14 AM, John Garry wrote:
>> diff --git a/block/blk.h b/block/blk.h
>> index 3b53e44b967e..1a948bfd91e4 100644
>> --- a/block/blk.h
>> +++ b/block/blk.h
>> @@ -201,10 +201,29 @@ void elv_unregister_queue(struct request_queue *q);
>> static inline void elevator_exit(struct request_queue *q,
>> struct elevator_queue *e)
>> {
>> + struct blk_mq_tag_set *set = q->tag_set;
>> + struct request_queue *tmp;
>> +
>> lockdep_assert_held(&q->sysfs_lock);
>>
>> + mutex_lock(&set->tag_list_lock);
>> + list_for_each_entry(tmp, &set->tag_list, tag_set_list) {
>> + if (tmp == q)
>> + continue;
>> + blk_mq_freeze_queue(tmp);
>> + blk_mq_quiesce_queue(tmp);
>> + }
>> +
>> blk_mq_sched_free_requests(q);
>> __elevator_exit(q, e);
>> +
>> + list_for_each_entry(tmp, &set->tag_list, tag_set_list) {
>> + if (tmp == q)
>> + continue;
>> + blk_mq_unquiesce_queue(tmp);
>> + blk_mq_unfreeze_queue(tmp);
>> + }
>> + mutex_unlock(&set->tag_list_lock);
>> }
Hi Bart,
> This patch introduces nesting of tag_list_lock inside sysfs_lock. The
> latter is per request queue while the former can be shared across
> multiple request queues. Has it been analyzed whether this is safe?
Firstly - ignoring implementation details for a moment - this patch is
intended to check that the concept is consistent with your suggestion and
whether it is sound.
As for the nested locking, I can analyze it more, but I did assume that we
don't care about locking out sysfs intervention during this time. And it
seems pretty difficult to avoid nesting the locks.
And further to this, I see that
https://lore.kernel.org/linux-block/[email protected]/T/#mc3e3175642660578c0ae2a6c32185b1e34ec4b8a
has a new interface for tagset quiesce, which could make this process
more efficient.
Please let me know further thoughts.
Thanks,
John
On 06/03/2021 02:52, Khazhy Kumykov wrote:
> On Fri, Mar 5, 2021 at 7:20 AM John Garry <[email protected]> wrote:
>>
>> It has been reported many times that a use-after-free can be intermittently
>> found when iterating busy requests:
>>
>> - https://lore.kernel.org/linux-block/[email protected]/
>> - https://lore.kernel.org/linux-block/[email protected]/T/#m6c1ac11540522716f645d004e2a5a13c9f218908
>> - https://lore.kernel.org/linux-block/[email protected]/
>>
>> The issue is that when we switch scheduler or change queue depth, there may
>> be references in the driver tagset to the stale requests.
>>
>> As a solution, clean up any references to those requests in the driver
>> tagset. This is done with a cmpxchg to make safe any race with setting the
>> driver tagset request from another queue.
>
> I noticed this crash recently when running blktests on a "debug"
> config on a 4.15 based kernel (it would always crash), and backporting
> this change fixes it. (testing on linus's latest tree also confirmed
> the fix, with the same config). I realize I'm late to the
> conversation, but appreciate the investigation and fixes :)
Good to know. I'll explicitly cc you on further versions.
Thanks,
John
On 06/03/2021 04:43, Bart Van Assche wrote:
> On 3/5/21 7:14 AM, John Garry wrote:
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index 7ff1b20d58e7..5950fee490e8 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -358,11 +358,16 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>> {
>> int i;
>>
>> + if (!atomic_inc_not_zero(&tagset->iter_usage_counter))
>> + return;
>> +
>> for (i = 0; i < tagset->nr_hw_queues; i++) {
>> if (tagset->tags && tagset->tags[i])
>> __blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
>> BT_TAG_ITER_STARTED);
>> }
>> +
>> + atomic_dec(&tagset->iter_usage_counter);
>> }
>> EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
Hi Bart,
> This changes the behavior of blk_mq_tagset_busy_iter(). What will e.g.
> happen if the mtip driver calls blk_mq_tagset_busy_iter(&dd->tags,
> mtip_abort_cmd, dd) concurrently with another blk_mq_tagset_busy_iter()
> call and if that causes all mtip_abort_cmd() calls to be skipped?
I'm not sure that I understand this problem you describe. So if
blk_mq_tagset_busy_iter(&dd->tags, mtip_abort_cmd, dd) is called, either
can happen:
a. normal operation, iter_usage_counter initially holds >= 1, and then
iter_usage_counter is incremented in blk_mq_tagset_busy_iter() and we
iter the busy tags. Any parallel call to blk_mq_tagset_busy_iter() will
also increase iter_usage_counter.
b. we're switching IO scheduler. In this scenario, first we quiesce all
queues. After that, there should be no active requests. At that point,
we ensure any calls to blk_mq_tagset_busy_iter() are finished and block
(or discard may be a better term) any more calls. Blocking any more
calls should be safe as there are no requests to iter. atomic_cmpxchg()
is used to set iter_usage_counter to 0, blocking any more calls.
>
>> + while (atomic_cmpxchg(&set->iter_usage_counter, 1, 0) != 1);
> Isn't it recommended to call cpu_relax() inside busy-waiting loops?
Maybe, but I am considering changing this patch to use percpu_refcnt() -
I need to check it further.
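Roughly what I have in mind is the sketch below. It is untested, and the
field and helper names (iters_ref, iters_drained, blk_mq_tagset_iter_*) are
made up for illustration - the percpu_ref and completion would replace
iter_usage_counter in struct blk_mq_tag_set:

#include <linux/percpu-refcount.h>
#include <linux/completion.h>

/*
 * percpu_ref_init(&set->iters_ref, blk_mq_tagset_iters_released, 0,
 * GFP_KERNEL) and init_completion(&set->iters_drained) would be done in
 * blk_mq_alloc_tag_set().
 */

/* Called once the last iterator reference is dropped after a kill. */
static void blk_mq_tagset_iters_released(struct percpu_ref *ref)
{
	struct blk_mq_tag_set *set =
		container_of(ref, struct blk_mq_tag_set, iters_ref);

	complete(&set->iters_drained);
}

/* Iterator side, replacing the atomic_inc_not_zero()/atomic_dec() pair: */
static bool blk_mq_tagset_iter_start(struct blk_mq_tag_set *set)
{
	return percpu_ref_tryget(&set->iters_ref);
}

static void blk_mq_tagset_iter_end(struct blk_mq_tag_set *set)
{
	percpu_ref_put(&set->iters_ref);
}

/* Elevator-switch side, replacing the atomic_cmpxchg() busy-wait: */
static void blk_mq_tagset_iter_quiesce(struct blk_mq_tag_set *set)
{
	reinit_completion(&set->iters_drained);
	percpu_ref_kill(&set->iters_ref);
	wait_for_completion(&set->iters_drained);
}

static void blk_mq_tagset_iter_resume(struct blk_mq_tag_set *set)
{
	percpu_ref_resurrect(&set->iters_ref);
}

That would also get rid of the busy-wait (and the cpu_relax() question)
completely, as the switch side just sleeps until the last iterator drops
its reference.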
>
>> blk_mq_sched_free_requests(q);
>> __elevator_exit(q, e);
>>
>> + atomic_set(&set->iter_usage_counter, 1);
> Can it happen that the above atomic_set() call happens while a
> blk_mq_tagset_busy_iter() call is in progress?
No, as at this point it is ensured that iter_usage_counter holds 0 from
the atomic_cmpxchg(), so there should be no active processes in the
blk_mq_tagset_busy_iter() sensitive region. Calls to
blk_mq_tagset_busy_iter() are blocked while iter_usage_counter holds 0.
> Should that atomic_set()
> call perhaps be changed into an atomic_inc() call?
They have the same effect in practice, but we use atomic_set() in
blk_mq_alloc_tag_set(), so at least it is consistent.
Thanks,
John
On 3/8/21 2:50 AM, John Garry wrote:
> Please let me know further thoughts.
Hi John,
My guess is that it is safe to nest these two locks. I was asking
because I had not found any information about the nesting in the patch
description.
Bart.
On 3/8/21 3:17 AM, John Garry wrote:
> On 06/03/2021 04:43, Bart Van Assche wrote:
>> On 3/5/21 7:14 AM, John Garry wrote:
>>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>>> index 7ff1b20d58e7..5950fee490e8 100644
>>> --- a/block/blk-mq-tag.c
>>> +++ b/block/blk-mq-tag.c
>>> @@ -358,11 +358,16 @@ void blk_mq_tagset_busy_iter(struct
>>> blk_mq_tag_set *tagset,
>>> {
>>> int i;
>>> + if (!atomic_inc_not_zero(&tagset->iter_usage_counter))
>>> + return;
>>> +
>>> for (i = 0; i < tagset->nr_hw_queues; i++) {
>>> if (tagset->tags && tagset->tags[i])
>>> __blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
>>> BT_TAG_ITER_STARTED);
>>> }
>>> +
>>> + atomic_dec(&tagset->iter_usage_counter);
>>> }
>>> EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
>
> Hi Bart,
>
>> This changes the behavior of blk_mq_tagset_busy_iter(). What will e.g.
>> happen if the mtip driver calls blk_mq_tagset_busy_iter(&dd->tags,
>> mtip_abort_cmd, dd) concurrently with another blk_mq_tagset_busy_iter()
>> call and if that causes all mtip_abort_cmd() calls to be skipped?
>
> I'm not sure that I understand this problem you describe. So if
> blk_mq_tagset_busy_iter(&dd->tags, mtip_abort_cmd, dd) is called, either
> can happen:
> a. normal operation, iter_usage_counter initially holds >= 1, and then
> iter_usage_counter is incremented in blk_mq_tagset_busy_iter() and we
> iter the busy tags. Any parallel call to blk_mq_tagset_busy_iter() will
> also increase iter_usage_counter.
> b. we're switching IO scheduler. In this scenario, first we quiesce all
> queues. After that, there should be no active requests. At that point,
> we ensure any calls to blk_mq_tagset_busy_iter() are finished and block
> (or discard may be a better term) any more calls. Blocking any more
> calls should be safe as there are no requests to iter. atomic_cmpxchg()
> is used to set iter_usage_counter to 0, blocking any more calls.
Hi John,
My concern is about the insertion of the early return statement in
blk_mq_tagset_busy_iter(). Although most blk_mq_tagset_busy_iter()
callers can handle skipping certain blk_mq_tagset_busy_iter() calls
(e.g. when gathering statistics), I'm not sure this is safe for all
blk_mq_tagset_busy_iter() callers. The example I cited is an example of
a blk_mq_tagset_busy_iter() call with side effects.
The mtip driver allocates one tag set per request queue so quiescing
queues should be sufficient to address my concern for the mtip driver.
The NVMe core and SCSI core however share a single tag set across
multiple namespaces / LUNs. In the error path of nvme_rdma_setup_ctrl()
I found a call to nvme_cancel_tagset(). nvme_cancel_tagset() calls
blk_mq_tagset_busy_iter(ctrl->tagset, nvme_cancel_request, ctrl). I'm
not sure it is safe to skip the nvme_cancel_request() calls if the I/O
scheduler for another NVMe namespace is being modified.
Thanks,
Bart.
On 08/03/2021 19:59, Bart Van Assche wrote:
>>> This changes the behavior of blk_mq_tagset_busy_iter(). What will e.g.
>>> happen if the mtip driver calls blk_mq_tagset_busy_iter(&dd->tags,
>>> mtip_abort_cmd, dd) concurrently with another blk_mq_tagset_busy_iter()
>>> call and if that causes all mtip_abort_cmd() calls to be skipped?
>>
>> I'm not sure that I understand this problem you describe. So if
>> blk_mq_tagset_busy_iter(&dd->tags, mtip_abort_cmd, dd) is called,
>> either can happen:
>> a. normal operation, iter_usage_counter initially holds >= 1, and then
>> iter_usage_counter is incremented in blk_mq_tagset_busy_iter() and we
>> iter the busy tags. Any parallel call to blk_mq_tagset_busy_iter()
>> will also increase iter_usage_counter.
>> b. we're switching IO scheduler. In this scenario, first we quiesce
>> all queues. After that, there should be no active requests. At that
>> point, we ensure any calls to blk_mq_tagset_busy_iter() are finished
>> and block (or discard may be a better term) any more calls. Blocking
>> any more calls should be safe as there are no requests to iter.
>> atomic_cmpxchg() is used to set iter_usage_counter to 0, blocking any
>> more calls.
>
Hi Bart,
> My concern is about the insertion of the early return statement in
> blk_mq_tagset_busy_iter().
So I take this approach as I don't see any way to use a mutual exclusion
waiting mechanism to block calls to blk_mq_tagset_busy_iter() while the
IO scheduler is being switched.
The reason is that blk_mq_tagset_busy_iter() can be called from any
context, including hardirq.
> Although most blk_mq_tagset_busy_iter()
> callers can handle skipping certain blk_mq_tagset_busy_iter() calls
> (e.g. when gathering statistics), I'm not sure this is safe for all
> blk_mq_tagset_busy_iter() callers. The example I cited is an example of
> a blk_mq_tagset_busy_iter() call with side effects.
I don't like to think that we're skipping it, which may imply that there
are some active requests to iter and we're just ignoring them.
It's more like: we know that there are no requests active, so don't
bother trying to iterate.
>
> The mtip driver allocates one tag set per request queue so quiescing
> queues should be sufficient to address my concern for the mtip driver.
>
> The NVMe core and SCSI core however share a single tag set across
> multiple namespaces / LUNs. In the error path of nvme_rdma_setup_ctrl()
> I found a call to nvme_cancel_tagset(). nvme_cancel_tagset() calls
> blk_mq_tagset_busy_iter(ctrl->tagset, nvme_cancel_request, ctrl). I'm
> not sure it is safe to skip the nvme_cancel_request() calls if the I/O
> scheduler for another NVMe namespace is being modified.
Again, I would be relying on all request_queues associated with that
tagset being quiesced when switching IO scheduler at the point that
blk_mq_tagset_busy_iter() is called and returns early.
Now if there were active requests, I am relying on the request queue
quiescing to flush them. So blk_mq_tagset_busy_iter() could be called
during that quiescing period, and would continue to iter the requests.
This does fall over if some tags are allocated without associated
request queue, which I do not know exists.
Thanks,
John
On 3/9/21 9:47 AM, John Garry wrote:
> This does fall over if some tags are allocated without associated
> request queue, which I do not know exists.
The only tag allocation mechanism I know of is blk_mq_get_tag(). The
only blk_mq_get_tag() callers I know of are __blk_mq_alloc_request() and
blk_mq_alloc_request_hctx(). So I think all allocated tags are
associated with a request queue.
Regarding this patch series, I have shared the feedback I wanted to
share so I would appreciate it if someone else could also take a look.
Thanks,
Bart.
On 09/03/2021 19:21, Bart Van Assche wrote:
> On 3/9/21 9:47 AM, John Garry wrote:
>> This does fall over if some tags are allocated without associated
>> request queue, which I do not know exists.
>
Hi Bart,
> The only tag allocation mechanism I know of is blk_mq_get_tag(). The
> only blk_mq_get_tag() callers I know of are __blk_mq_alloc_request() and
> blk_mq_alloc_request_hctx(). So I think all allocated tags are
> associated with a request queue.
>
ok, good.
> Regarding this patch series, I have shared the feedback I wanted to
> share so I would appreciate it if someone else could also take a look.
>
So I can incorporate any changes and suggestions so far and send a
non-RFC version - that may get more attention if none extra comes.
As mentioned on the cover letter, if patch 2+3/3 are accepted, then
patch 1/3 could be simplified. But I plan to leave as is.
BTW, any issue with putting your suggested-by on patch 2/3?
Thanks,
John
On 3/5/21 7:14 AM, John Garry wrote:
> diff --git a/block/blk.h b/block/blk.h
> index 3b53e44b967e..1a948bfd91e4 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -201,10 +201,29 @@ void elv_unregister_queue(struct request_queue *q);
> static inline void elevator_exit(struct request_queue *q,
> struct elevator_queue *e)
> {
> + struct blk_mq_tag_set *set = q->tag_set;
> + struct request_queue *tmp;
> +
> lockdep_assert_held(&q->sysfs_lock);
>
> + mutex_lock(&set->tag_list_lock);
> + list_for_each_entry(tmp, &set->tag_list, tag_set_list) {
> + if (tmp == q)
> + continue;
> + blk_mq_freeze_queue(tmp);
> + blk_mq_quiesce_queue(tmp);
> + }
> +
> blk_mq_sched_free_requests(q);
> __elevator_exit(q, e);
> +
> + list_for_each_entry(tmp, &set->tag_list, tag_set_list) {
> + if (tmp == q)
> + continue;
> + blk_mq_unquiesce_queue(tmp);
> + blk_mq_unfreeze_queue(tmp);
> + }
> + mutex_unlock(&set->tag_list_lock);
> }
Reviewed-by: Bart Van Assche <[email protected]>
On 3/10/21 12:52 AM, John Garry wrote:
> On 09/03/2021 19:21, Bart Van Assche wrote:
>> Regarding this patch series, I have shared the feedback I wanted to
>> share so I would appreciate it if someone else could also take a look.
>
> So I can incorporate any changes and suggestions so far and send a
> non-RFC version - that may get more attention if none extra comes.
>
> As mentioned on the cover letter, if patch 2+3/3 are accepted, then
> patch 1/3 could be simplified. But I plan to leave as is.
>
> BTW, any issue with putting your suggested-by on patch 2/3?
Hi John,
I have added my Reviewed-by to patch 2/3.
Regarding the other two patches in this series: I do not agree with
patch 3/3. As I have explained, I am concerned that that patch breaks
existing block drivers.
Are patches 1/3 and 3/3 necessary? Or in other words, is patch 2/3
sufficient to fix the use-after-free?
Thanks,
Bart.
On 10/03/2021 16:00, Bart Van Assche wrote:
>> So I can incorporate any changes and suggestions so far and send a
>> non-RFC version - that may get more attention if none extra comes.
>>
>> As mentioned on the cover letter, if patch 2+3/3 are accepted, then
>> patch 1/3 could be simplified. But I plan to leave as is.
>>
>> BTW, any issue with putting your suggested-by on patch 2/3?
>
Hi Bart,
>
> I have added my Reviewed-by to patch 2/3.
>
OK, thanks.
Please note that I still want to check further whether some of Ming's
series "blk-mq: implement queue quiesce via percpu_ref for
BLK_MQ_F_BLOCKING" can be used.
> Regarding the other two patches in this series: I do not agree with
> patch 3/3. As I have explained, I am concerned that that patch breaks
> existing block drivers.
Understood. I need to check your concern further to allay any fears.
So I could probably change that patch to drop the early return.
Instead we just need to ensure that we complete any existing calls to
blk_mq_tagset_busy_iter() prior to freeing the IO scheduler requests.
Then we don't need to return early and can iter as before - but, as I
said previously, there should be no active tags to iter.
>
> Are patches 1/3 and 3/3 necessary? Or in other words, is patch 2/3
> sufficient to fix the use-after-free?
No, we need them all in some form.
So far, reports are that 1/3 solves the most commonly seen UAF. It is
pretty easy to trigger.
But the scenarios associated with 2/3 and 3/3 are much harder to
trigger, and I needed to add delays in the code just to trigger them.
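For reference, the debug change I used was along these lines - a
simplified bt_tags_iter() with an artificial delay injected between
loading the request pointer and using it, which widens the race window
enough to hit the UAF in a sane timeframe (illustrative only, not the
exact hack, and the reserved-tag/started handling is omitted):

#include <linux/delay.h>

static bool bt_tags_iter_delayed(struct sbitmap *bitmap, unsigned int bitnr,
				 void *data)
{
	struct bt_tags_iter_data *iter_data = data;
	struct blk_mq_tags *tags = iter_data->tags;
	struct request *rq;

	rq = tags->rqs[bitnr];

	/* Debug only: give another CPU time to free the sched requests. */
	mdelay(100);

	if (!rq)
		return true;
	return iter_data->fn(rq, iter_data->data,
			     iter_data->flags & BT_TAG_ITER_RESERVED);
}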
Thanks,
John
On Fri, Mar 05, 2021 at 11:14:53PM +0800, John Garry wrote:
> A use-after-free may occur if blk_mq_queue_tag_busy_iter() is run on a
> queue when another queue associated with the same tagset is switching IO
> scheduler:
>
> BUG: KASAN: use-after-free in bt_iter+0xa0/0x120
> Read of size 8 at addr ffff0410285e7e00 by task fio/2302
>
> CPU: 24 PID: 2302 Comm: fio Not tainted 5.12.0-rc1-11925-g29a317e228d9 #747
> Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon D05 IT21 Nemo 2.0 RC0 04/18/2018
> Call trace:
> dump_backtrace+0x0/0x2d8
> show_stack+0x18/0x68
> dump_stack+0x124/0x1a0
> print_address_description.constprop.13+0x68/0x30c
> kasan_report+0x1e8/0x258
> __asan_load8+0x9c/0xd8
> bt_iter+0xa0/0x120
> blk_mq_queue_tag_busy_iter+0x348/0x5d8
> blk_mq_in_flight+0x80/0xb8
> part_stat_show+0xcc/0x210
> dev_attr_show+0x44/0x90
> sysfs_kf_seq_show+0x120/0x1c0
> kernfs_seq_show+0x9c/0xb8
> seq_read_iter+0x214/0x668
> kernfs_fop_read_iter+0x204/0x2c0
> new_sync_read+0x1ec/0x2d0
> vfs_read+0x18c/0x248
> ksys_read+0xc8/0x178
> __arm64_sys_read+0x44/0x58
> el0_svc_common.constprop.1+0xc8/0x1a8
> do_el0_svc+0x90/0xa0
> el0_svc+0x24/0x38
> el0_sync_handler+0x90/0xb8
> el0_sync+0x154/0x180
>
> Indeed, blk_mq_queue_tag_busy_iter() already does take a reference to its
> queue usage counter when called, and the queue cannot be frozen to switch
> IO scheduler until all refs are dropped. This ensures no stale references
> to IO scheduler requests will be seen by blk_mq_queue_tag_busy_iter().
>
> However, there is nothing to stop blk_mq_queue_tag_busy_iter() being
> run for another queue associated with the same tagset, and it seeing
> a stale IO scheduler request from the other queue after they are freed.
>
> To stop this happening, freeze and quiesce all queues associated with the
> tagset as the elevator is exited.
I think this way can't be accepted since switching one queue's scheduler
is nothing to do with other request queues attached to same HBA.
This patch will cause performance regression because userspace may
switch scheduler according to medium or workloads, at that time other
LUNs will be affected by this patch.
--
Ming
On 11/03/2021 00:58, Ming Lei wrote:
>> Indeed, blk_mq_queue_tag_busy_iter() already does take a reference to its
>> queue usage counter when called, and the queue cannot be frozen to switch
>> IO scheduler until all refs are dropped. This ensures no stale references
>> to IO scheduler requests will be seen by blk_mq_queue_tag_busy_iter().
>>
>> However, there is nothing to stop blk_mq_queue_tag_busy_iter() being
>> run for another queue associated with the same tagset, and it seeing
>> a stale IO scheduler request from the other queue after they are freed.
>>
>> To stop this happening, freeze and quiesce all queues associated with the
>> tagset as the elevator is exited.
> I think this way can't be accepted since switching one queue's scheduler
> is nothing to do with other request queues attached to same HBA.
>
> This patch will cause performance regression because userspace may
> switch scheduler according to medium or workloads, at that time other
> LUNs will be affected by this patch.
Hmmm..that was my concern also. Do you think that it may cause a big
impact? Depends totally on the workload, I suppose.
FWIW, it is useful though for solving both iterator problems.
Thanks,
John
On 3/11/21 12:21 AM, John Garry wrote:
> On 11/03/2021 00:58, Ming Lei wrote:
>> I think this way can't be accepted since switching one queue's scheduler
>> is nothing to do with other request queues attached to same HBA.
>>
>> This patch will cause performance regression because userspace may
>> switch scheduler according to medium or workloads, at that time other
>> LUNs will be affected by this patch.
>
> Hmmm..that was my concern also. Do you think that it may cause a big
> impact? Depends totally on the workload, I suppose.
>
> FWIW, it is useful though for solving both iterator problems.
Hi John,
How about replacing the entire patch series with the patch below? The
patch below has the following advantages:
- Instead of making the race window smaller, the race is fixed
completely.
- No new atomic instructions are added to the block layer code.
- No early return is inserted in blk_mq_tagset_busy_iter().
Thanks,
Bart.
From a0e534012a766bd6e53cdd466eec0a811164c12a Mon Sep 17 00:00:00 2001
From: Bart Van Assche <[email protected]>
Date: Wed, 10 Mar 2021 19:11:47 -0800
Subject: [PATCH] blk-mq: Fix races between iterating over requests and freeing
requests
Multiple users have reported use-after-free complaints similar to the
following (see also https://lore.kernel.org/linux-block/[email protected]/):
BUG: KASAN: use-after-free in bt_iter+0x86/0xf0
Read of size 8 at addr ffff88803b335240 by task fio/21412
CPU: 0 PID: 21412 Comm: fio Tainted: G W 4.20.0-rc6-dbg+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Call Trace:
dump_stack+0x86/0xca
print_address_description+0x71/0x239
kasan_report.cold.5+0x242/0x301
__asan_load8+0x54/0x90
bt_iter+0x86/0xf0
blk_mq_queue_tag_busy_iter+0x373/0x5e0
blk_mq_in_flight+0x96/0xb0
part_in_flight+0x40/0x140
part_round_stats+0x18e/0x370
blk_account_io_start+0x3d7/0x670
blk_mq_bio_to_request+0x19c/0x3a0
blk_mq_make_request+0x7a9/0xcb0
generic_make_request+0x41d/0x960
submit_bio+0x9b/0x250
do_blockdev_direct_IO+0x435c/0x4c70
__blockdev_direct_IO+0x79/0x88
ext4_direct_IO+0x46c/0xc00
generic_file_direct_write+0x119/0x210
__generic_file_write_iter+0x11c/0x280
ext4_file_write_iter+0x1b8/0x6f0
aio_write+0x204/0x310
io_submit_one+0x9d3/0xe80
__x64_sys_io_submit+0x115/0x340
do_syscall_64+0x71/0x210
When multiple request queues share a tag set and when switching the I/O
scheduler for one of the request queues that uses this tag set, the
following race can happen:
* blk_mq_tagset_busy_iter() calls bt_tags_iter() and bt_tags_iter() assigns
a pointer to a scheduler request to the local variable 'rq'.
* blk_mq_sched_free_requests() is called to free hctx->sched_tags.
* blk_mq_tagset_busy_iter() dereferences 'rq' and triggers a use-after-free.
Fix this race as follows:
* Use rcu_assign_pointer() and rcu_dereference() to access hctx->tags->rqs[].
* Protect hctx->tags->rqs[] reads with an RCU read-side lock.
* No new rcu_barrier() call has been added because clearing the request
pointer in hctx->tags->rqs[] happens before blk_queue_exit() and the
blk_freeze_queue() call in blk_cleanup_queue() triggers an RCU barrier
after all scheduler request pointers associated with a request queue
have been removed from hctx->tags->rqs[] and before these scheduler
requests are freed.
Signed-off-by: Bart Van Assche <[email protected]>
---
block/blk-mq-tag.c | 27 +++++++++++++++++----------
block/blk-mq-tag.h | 2 +-
block/blk-mq.c | 10 ++++++----
block/blk-mq.h | 1 +
4 files changed, 25 insertions(+), 15 deletions(-)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 9c92053e704d..8351c3f2fe2d 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -206,18 +206,23 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
struct blk_mq_tags *tags = hctx->tags;
bool reserved = iter_data->reserved;
struct request *rq;
+ bool res = true;
if (!reserved)
bitnr += tags->nr_reserved_tags;
- rq = tags->rqs[bitnr];
+
+ rcu_read_lock();
+ rq = rcu_dereference(tags->rqs[bitnr]);
/*
* We can hit rq == NULL here, because the tagging functions
* test and set the bit before assigning ->rqs[].
*/
if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx)
- return iter_data->fn(hctx, rq, iter_data->data, reserved);
- return true;
+ res = iter_data->fn(hctx, rq, iter_data->data, reserved);
+ rcu_read_unlock();
+
+ return res;
}
/**
@@ -264,10 +269,12 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
struct blk_mq_tags *tags = iter_data->tags;
bool reserved = iter_data->flags & BT_TAG_ITER_RESERVED;
struct request *rq;
+ bool res = true;
if (!reserved)
bitnr += tags->nr_reserved_tags;
+ rcu_read_lock();
/*
* We can hit rq == NULL here, because the tagging functions
* test and set the bit before assigning ->rqs[].
@@ -275,13 +282,13 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
if (iter_data->flags & BT_TAG_ITER_STATIC_RQS)
rq = tags->static_rqs[bitnr];
else
- rq = tags->rqs[bitnr];
- if (!rq)
- return true;
- if ((iter_data->flags & BT_TAG_ITER_STARTED) &&
- !blk_mq_request_started(rq))
- return true;
- return iter_data->fn(rq, iter_data->data, reserved);
+ rq = rcu_dereference(tags->rqs[bitnr]);
+ if (rq && (!(iter_data->flags & BT_TAG_ITER_STARTED) ||
+ blk_mq_request_started(rq)))
+ res = iter_data->fn(rq, iter_data->data, reserved);
+ rcu_read_unlock();
+
+ return res;
}
/**
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 7d3e6b333a4a..7a6d04733261 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -17,7 +17,7 @@ struct blk_mq_tags {
struct sbitmap_queue __bitmap_tags;
struct sbitmap_queue __breserved_tags;
- struct request **rqs;
+ struct request __rcu **rqs;
struct request **static_rqs;
struct list_head page_list;
};
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d4d7c1caa439..594bf7f4ed9a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -495,8 +495,10 @@ static void __blk_mq_free_request(struct request *rq)
blk_crypto_free_request(rq);
blk_pm_mark_last_busy(rq);
rq->mq_hctx = NULL;
- if (rq->tag != BLK_MQ_NO_TAG)
+ if (rq->tag != BLK_MQ_NO_TAG) {
blk_mq_put_tag(hctx->tags, ctx, rq->tag);
+ rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
+ }
if (sched_tag != BLK_MQ_NO_TAG)
blk_mq_put_tag(hctx->sched_tags, ctx, sched_tag);
blk_mq_sched_restart(hctx);
@@ -839,8 +841,8 @@ EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list);
struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
{
if (tag < tags->nr_tags) {
- prefetch(tags->rqs[tag]);
- return tags->rqs[tag];
+ prefetch((__force void *)tags->rqs[tag]);
+ return rcu_dereference_check(tags->rqs[tag], true);
}
return NULL;
@@ -1111,7 +1113,7 @@ static bool blk_mq_get_driver_tag(struct request *rq)
rq->rq_flags |= RQF_MQ_INFLIGHT;
__blk_mq_inc_active_requests(hctx);
}
- hctx->tags->rqs[rq->tag] = rq;
+ rcu_assign_pointer(hctx->tags->rqs[rq->tag], rq);
return true;
}
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 3616453ca28c..9ccb1818303b 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -226,6 +226,7 @@ static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
struct request *rq)
{
blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
+ rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
rq->tag = BLK_MQ_NO_TAG;
if (rq->rq_flags & RQF_MQ_INFLIGHT) {
Hi Bart,
I'll have a look at this ASAP - a bit busy.
But a quick scan and I notice this:
> @@ -226,6 +226,7 @@ static inline void __blk_mq_put_driver_tag(struct
blk_mq_hw_ctx *hctx,
> struct request *rq)
> {
> blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
> + rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
Wasn't a requirement to not touch the fastpath at all, including even if
only NULLifying a pointer?
IIRC, Kashyap some time ago had a patch like above (but without RCU
usage), but the request from Jens was to not touch the fastpath.
Maybe I'm mistaken - I will try to dig up the thread.
Thanks,
John
> How about replacing the entire patch series with the patch below? The
> patch below has the following advantages:
> - Instead of making the race window smaller, the race is fixed
> completely.
> - No new atomic instructions are added to the block layer code.
> - No early return is inserted in blk_mq_tagset_busy_iter().
>
> Thanks,
>
> Bart.
>
> From a0e534012a766bd6e53cdd466eec0a811164c12a Mon Sep 17 00:00:00 2001
> From: Bart Van Assche<[email protected]>
> Date: Wed, 10 Mar 2021 19:11:47 -0800
> Subject: [PATCH] blk-mq: Fix races between iterating over requests and freeing
> requests
>
> Multiple users have reported use-after-free complaints similar to the
> following (see alsohttps://lore.kernel.org/linux-block/[email protected]/):
>
> BUG: KASAN: use-after-free in bt_iter+0x86/0xf0
> Read of size 8 at addr ffff88803b335240 by task fio/21412
>
> CPU: 0 PID: 21412 Comm: fio Tainted: G W 4.20.0-rc6-dbg+ #3
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> Call Trace:
> dump_stack+0x86/0xca
> print_address_description+0x71/0x239
> kasan_report.cold.5+0x242/0x301
> __asan_load8+0x54/0x90
> bt_iter+0x86/0xf0
> blk_mq_queue_tag_busy_iter+0x373/0x5e0
> blk_mq_in_flight+0x96/0xb0
> part_in_flight+0x40/0x140
> part_round_stats+0x18e/0x370
> blk_account_io_start+0x3d7/0x670
> blk_mq_bio_to_request+0x19c/0x3a0
> blk_mq_make_request+0x7a9/0xcb0
> generic_make_request+0x41d/0x960
> submit_bio+0x9b/0x250
> do_blockdev_direct_IO+0x435c/0x4c70
> __blockdev_direct_IO+0x79/0x88
> ext4_direct_IO+0x46c/0xc00
> generic_file_direct_write+0x119/0x210
> __generic_file_write_iter+0x11c/0x280
> ext4_file_write_iter+0x1b8/0x6f0
> aio_write+0x204/0x310
> io_submit_one+0x9d3/0xe80
> __x64_sys_io_submit+0x115/0x340
> do_syscall_64+0x71/0x210
>
> When multiple request queues share a tag set and when switching the I/O
> scheduler for one of the request queues that uses this tag set, the
> following race can happen:
> * blk_mq_tagset_busy_iter() calls bt_tags_iter() and bt_tags_iter() assigns
> a pointer to a scheduler request to the local variable 'rq'.
> * blk_mq_sched_free_requests() is called to free hctx->sched_tags.
> * blk_mq_tagset_busy_iter() dereferences 'rq' and triggers a use-after-free.
>
> Fix this race as follows:
> * Use rcu_assign_pointer() and rcu_dereference() to access hctx->tags->rqs[].
> * Protect hctx->tags->rqs[] reads with an RCU read-side lock.
> * No new rcu_barrier() call has been added because clearing the request
> pointer in hctx->tags->rqs[] happens before blk_queue_exit() and the
> blk_freeze_queue() call in blk_cleanup_queue() triggers an RCU barrier
> after all scheduler request pointers associated with a request queue
> have been removed from hctx->tags->rqs[] and before these scheduler
> requests are freed.
>
> Signed-off-by: Bart Van Assche<[email protected]>
> ---
> block/blk-mq-tag.c | 27 +++++++++++++++++----------
> block/blk-mq-tag.h | 2 +-
> block/blk-mq.c | 10 ++++++----
> block/blk-mq.h | 1 +
> 4 files changed, 25 insertions(+), 15 deletions(-)
>
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 9c92053e704d..8351c3f2fe2d 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -206,18 +206,23 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
> struct blk_mq_tags *tags = hctx->tags;
> bool reserved = iter_data->reserved;
> struct request *rq;
> + bool res = true;
>
> if (!reserved)
> bitnr += tags->nr_reserved_tags;
> - rq = tags->rqs[bitnr];
> +
> + rcu_read_lock();
> + rq = rcu_dereference(tags->rqs[bitnr]);
>
> /*
> * We can hit rq == NULL here, because the tagging functions
> * test and set the bit before assigning ->rqs[].
> */
> if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx)
> - return iter_data->fn(hctx, rq, iter_data->data, reserved);
> - return true;
> + res = iter_data->fn(hctx, rq, iter_data->data, reserved);
> + rcu_read_unlock();
> +
> + return res;
> }
>
> /**
> @@ -264,10 +269,12 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
> struct blk_mq_tags *tags = iter_data->tags;
> bool reserved = iter_data->flags & BT_TAG_ITER_RESERVED;
> struct request *rq;
> + bool res = true;
>
> if (!reserved)
> bitnr += tags->nr_reserved_tags;
>
> + rcu_read_lock();
> /*
> * We can hit rq == NULL here, because the tagging functions
> * test and set the bit before assigning ->rqs[].
> @@ -275,13 +282,13 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
> if (iter_data->flags & BT_TAG_ITER_STATIC_RQS)
> rq = tags->static_rqs[bitnr];
> else
> - rq = tags->rqs[bitnr];
> - if (!rq)
> - return true;
> - if ((iter_data->flags & BT_TAG_ITER_STARTED) &&
> - !blk_mq_request_started(rq))
> - return true;
> - return iter_data->fn(rq, iter_data->data, reserved);
> + rq = rcu_dereference(tags->rqs[bitnr]);
> + if (rq && (!(iter_data->flags & BT_TAG_ITER_STARTED) ||
> + blk_mq_request_started(rq)))
> + res = iter_data->fn(rq, iter_data->data, reserved);
> + rcu_read_unlock();
> +
> + return res;
> }
>
> /**
> diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
> index 7d3e6b333a4a..7a6d04733261 100644
> --- a/block/blk-mq-tag.h
> +++ b/block/blk-mq-tag.h
> @@ -17,7 +17,7 @@ struct blk_mq_tags {
> struct sbitmap_queue __bitmap_tags;
> struct sbitmap_queue __breserved_tags;
>
> - struct request **rqs;
> + struct request __rcu **rqs;
> struct request **static_rqs;
> struct list_head page_list;
> };
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d4d7c1caa439..594bf7f4ed9a 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -495,8 +495,10 @@ static void __blk_mq_free_request(struct request *rq)
> blk_crypto_free_request(rq);
> blk_pm_mark_last_busy(rq);
> rq->mq_hctx = NULL;
> - if (rq->tag != BLK_MQ_NO_TAG)
> + if (rq->tag != BLK_MQ_NO_TAG) {
> blk_mq_put_tag(hctx->tags, ctx, rq->tag);
> + rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
> + }
> if (sched_tag != BLK_MQ_NO_TAG)
> blk_mq_put_tag(hctx->sched_tags, ctx, sched_tag);
> blk_mq_sched_restart(hctx);
> @@ -839,8 +841,8 @@ EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list);
> struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
> {
> if (tag < tags->nr_tags) {
> - prefetch(tags->rqs[tag]);
> - return tags->rqs[tag];
> + prefetch((__force void *)tags->rqs[tag]);
> + return rcu_dereference_check(tags->rqs[tag], true);
> }
>
> return NULL;
> @@ -1111,7 +1113,7 @@ static bool blk_mq_get_driver_tag(struct request *rq)
> rq->rq_flags |= RQF_MQ_INFLIGHT;
> __blk_mq_inc_active_requests(hctx);
> }
> - hctx->tags->rqs[rq->tag] = rq;
> + rcu_assign_pointer(hctx->tags->rqs[rq->tag], rq);
> return true;
> }
>
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 3616453ca28c..9ccb1818303b 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -226,6 +226,7 @@ static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
> struct request *rq)
> {
> blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
> + rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
> rq->tag = BLK_MQ_NO_TAG;
>
> if (rq->rq_flags & RQF_MQ_INFLIGHT) {
On 3/16/21 9:15 AM, John Garry wrote:
> I'll have a look at this ASAP - a bit busy.
>
> But a quick scan and I notice this:
>
> > @@ -226,6 +226,7 @@ static inline void __blk_mq_put_driver_tag(struct
> blk_mq_hw_ctx *hctx,
> > struct request *rq)
> > {
> > blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
> > + rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
>
> Wasn't a requirement to not touch the fastpath at all, including even if
> only NULLifying a pointer?
>
> IIRC, Kashyap some time ago had a patch like above (but without RCU
> usage), but the request from Jens was to not touch the fastpath.
>
> Maybe I'm mistaken - I will try to dig up the thread.
Hi John,
I agree that Jens asked at the end of 2018 not to touch the fast path to
fix this use-after-free (maybe that request has been repeated more
recently). If Jens or anyone else feels strongly about not clearing
hctx->tags->rqs[rq->tag] from the fast path then I will make that
change. My motivation for clearing these pointers from the fast path is
as follows:
- This results in code that is easier to read and easier to maintain.
- Every modern CPU pipelines store instructions so the performance
impact of adding an additional store should be small.
- Since the block layer has a tendency to reuse tags that have been
freed recently, it is likely that hctx->tags->rqs[rq->tag] will be used
for a next request and hence that it will have to be loaded into the CPU
cache anyway.
Bart.
On 3/16/21 10:43 AM, John Garry wrote:
> On 16/03/2021 17:00, Bart Van Assche wrote:
>> I agree that Jens asked at the end of 2018 not to touch the fast path
>> to fix this use-after-free (maybe that request has been repeated more
>> recently). If Jens or anyone else feels strongly about not clearing
>> hctx->tags->rqs[rq->tag] from the fast path then I will make that change.
>
> Is that possible for this same approach? I need to check the code more..
If the fast path should not be modified, I'm considering to borrow patch
1/3 from your patch series and to add an rcu_barrier() between the code
that clears the request pointers and that frees the scheduler requests.
> And don't we still have the problem that some iter callbacks may
> sleep/block, which is not allowed in an RCU read-side critical section?
Thanks for having brought this up. Since none of the functions that
iterate over requests should be called from the hot path of a block
driver, I think that we can use srcu_read_(un|)lock() inside bt_iter()
and bt_tags_iter() instead of rcu_read_(un|)lock().
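To make that concrete, a rough sketch of the read side is below. The
"tags_srcu" domain and the helper are placeholders - where the srcu_struct
should actually live (per tag set, global, ...) still needs to be decided,
tags->rqs[] would be annotated __srcu instead of __rcu, and the free side
would call synchronize_srcu() on the same domain before freeing the
scheduler requests:

#include <linux/srcu.h>

/* Placeholder SRCU domain protecting tags->rqs[] lookups. */
DEFINE_STATIC_SRCU(tags_srcu);

static void iterate_one_tag_sketch(struct blk_mq_tags *tags, unsigned int bitnr,
				   bool (*fn)(struct request *rq, void *data),
				   void *data)
{
	struct request *rq;
	int idx;

	idx = srcu_read_lock(&tags_srcu);
	rq = srcu_dereference(tags->rqs[bitnr], &tags_srcu);
	if (rq) {
		/*
		 * Unlike an rcu_read_lock() section, an SRCU read-side
		 * section may sleep, so iteration callbacks that block
		 * remain allowed here.
		 */
		fn(rq, data);
	}
	srcu_read_unlock(&tags_srcu, idx);
}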
Bart.
On 16/03/2021 17:00, Bart Van Assche wrote:
> On 3/16/21 9:15 AM, John Garry wrote:
>> I'll have a look at this ASAP - a bit busy.
>>
>> But a quick scan and I notice this:
>>
>> > @@ -226,6 +226,7 @@ static inline void
>> __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
>> > struct request *rq)
>> > {
>> > blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
>> > + rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
>>
>> Wasn't a requirement to not touch the fastpath at all, including even
>> if only NULLifying a pointer?
>>
>> IIRC, Kashyap some time ago had a patch like above (but without RCU
>> usage), but the request from Jens was to not touch the fastpath.
>>
>> Maybe I'm mistaken - I will try to dig up the thread.
>
Hi Bart,
>
> I agree that Jens asked at the end of 2018 not to touch the fast path to
> fix this use-after-free (maybe that request has been repeated more
> recently). If Jens or anyone else feels strongly about not clearing
> hctx->tags->rqs[rq->tag] from the fast path then I will make that
> change.
Is that possible for this same approach? I need to check the code more..
And don't we still have the problem that some iter callbacks may
sleep/block, which is not allowed in an RCU read-side critical section?
> My motivation for clearing these pointers from the fast path is
> as follows:
> - This results in code that is easier to read and easier to maintain.
> - Every modern CPU pipelines store instructions so the performance
> impact of adding an additional store should be small.
> - Since the block layer has a tendency to reuse tags that have been
> freed recently, it is likely that hctx->tags->rqs[rq->tag] will be used
> for a next request and hence that it will have to be loaded into the CPU
> cache anyway.
>
Those points make sense to me, but obviously it's the maintainers call.
Thanks,
john
On Mar 05, 2021 / 23:14, John Garry wrote:
> This series aims to tackle the various UAF reports, like:
> [0] https://lore.kernel.org/linux-block/[email protected]/
> [1] https://lore.kernel.org/linux-block/[email protected]/T/#m6c1ac11540522716f645d004e2a5a13c9f218908
> [2] https://lore.kernel.org/linux-block/[email protected]/
> [3] https://lore.kernel.org/linux-block/[email protected]/
>
> Details are in the commit messages.
>
> The issue addressed in patch 1/3 is pretty easy to reproduce, 2+3/3 not so
> much, and I had to add mdelays in the iters functions to recreate in
> sane timeframes.
I also observe the KASAN UAF in blk_mq_queue_tag_busy_iter during blktests runs
with kernel versions 5.12-rc2 and 5.12-rc3. When the test case block/005 is run
for HDDs behind a SAS HBA (Broadcom 9400), the UAF message is always reported and
it makes the test case fail. This failure was not observed with kernel v5.11. I
suppose the failure was rare until v5.11, but changes between 5.11 and 5.12-rcX
made this failure happen more frequently.
I tried patch 1/3 by John, and saw that it avoids the UAF message and the
block/005 failure. I also tried the patch Bart suggested in this discussion
thread [1], and confirmed that it also avoids the UAF message. I appreciate
this fix work and the discussion.
[1] https://marc.info/?l=linux-kernel&m=161559032606201&w=2
--
Best Regards,
Shin'ichiro Kawasaki
On 16/03/2021 19:59, Bart Van Assche wrote:
> On 3/16/21 10:43 AM, John Garry wrote:
>> On 16/03/2021 17:00, Bart Van Assche wrote:
>>> I agree that Jens asked at the end of 2018 not to touch the fast path
>>> to fix this use-after-free (maybe that request has been repeated more
>>> recently). If Jens or anyone else feels strongly about not clearing
>>> hctx->tags->rqs[rq->tag] from the fast path then I will make that change.
Hi Bart,
>> Is that possible for this same approach? I need to check the code more..
> If the fast path should not be modified, I'm considering to borrow patch
> 1/3 from your patch series
Fine
> and to add an rcu_barrier() between the code
> that clears the request pointers and that frees the scheduler requests.
>
>> And don't we still have the problem that some iter callbacks may
>> sleep/block, which is not allowed in an RCU read-side critical section?
> Thanks for having brought this up. Since none of the functions that
> iterate over requests should be called from the hot path of a block
> driver, I think that we can use srcu_read_(un|)lock() inside bt_iter()
> and bt_tags_iter() instead of rcu_read_(un|)lock().
OK, but TBH, I am not so familiar with srcu - where were you going to try this?
Thanks,
John
On 3/19/21 11:19 AM, John Garry wrote:
> OK, but TBH, I am not so familiar with srcu - where were you going to try this?
Hi John,
Have you received the following patch: "[PATCH] blk-mq: Fix races
between iterating over requests and freeing requests"
(https://lore.kernel.org/linux-block/[email protected]/)?
Thanks,
Bart.