From: Ming Lei <[email protected]>
Hi,
The 1st patch fixes a race between the timeout handler and a dying queue.
The 2nd patch adds comments on races with the timeout handler.
The 3rd patch improves the handling of a dying queue.
V1:
- add comments on races related to the timeout handler
- add Tested-by & Reviewed-by tags
thanks,
Ming
Ming Lei (3):
blk-mq: don't complete un-started request in timeout handler
blk-mq: comment on races related timeout handler
blk-mq: start to freeze queue just after setting dying
block/blk-core.c | 7 +++++--
block/blk-mq.c | 33 +++++++++++++++++++++++----------
2 files changed, 28 insertions(+), 12 deletions(-)
--
2.9.3
Before commit 780db2071a ("blk-mq: decouble blk-mq freezing
from generic bypassing"), the dying flag was checked before
entering the queue; Tejun converted that check into a check of
.mq_freeze_depth, assuming the counter is increased just after
the dying flag is set. Unfortunately we don't do that in
blk_set_queue_dying().

This patch calls blk_mq_freeze_queue_start() for blk-mq in
blk_set_queue_dying(), so that new I/O is blocked once the
queue is set as dying.

Given blk_set_queue_dying() is always called in the remove path
of a block device, and the queue will be cleaned up later, we
don't need to worry about undoing the counter.
Cc: Bart Van Assche <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
block/blk-core.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index d772c221cc17..62d4967c369f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -500,9 +500,12 @@ void blk_set_queue_dying(struct request_queue *q)
queue_flag_set(QUEUE_FLAG_DYING, q);
spin_unlock_irq(q->queue_lock);
- if (q->mq_ops)
+ if (q->mq_ops) {
blk_mq_wake_waiters(q);
- else {
+
+ /* block new I/O coming */
+ blk_mq_freeze_queue_start(q);
+ } else {
struct request_list *rl;
spin_lock_irq(q->queue_lock);
--
2.9.3
When iterating busy requests in the timeout handler, if the
STARTED flag of a request isn't set, that means the request is
still being processed in the block layer or driver and hasn't
been dispatched to hardware yet.

In the current implementation of blk_mq_check_expired(), if the
request queue becomes dying, un-started requests are completed/freed
immediately. This is wrong and can easily cause use-after-free
issues[1][2] when doing I/O while removing & resetting an NVMe
device at the same time.

This patch fixes several issues reported by Yi Zhang.
[1]. oops log 1
[ 581.789754] ------------[ cut here ]------------
[ 581.789758] kernel BUG at block/blk-mq.c:374!
[ 581.789760] invalid opcode: 0000 [#1] SMP
[ 581.789761] Modules linked in: vfat fat ipmi_ssif intel_rapl sb_edac
edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm nvme
irqbypass crct10dif_pclmul nvme_core crc32_pclmul ghash_clmulni_intel
intel_cstate ipmi_si mei_me ipmi_devintf intel_uncore sg ipmi_msghandler
intel_rapl_perf iTCO_wdt mei iTCO_vendor_support mxm_wmi lpc_ich dcdbas shpchp
pcspkr acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd dm_multipath grace
sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper
syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci
crc32c_intel tg3 libata megaraid_sas i2c_core ptp fjes pps_core dm_mirror
dm_region_hash dm_log dm_mod
[ 581.789796] CPU: 1 PID: 1617 Comm: kworker/1:1H Not tainted 4.10.0.bz1420297+ #4
[ 581.789797] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
[ 581.789804] Workqueue: kblockd blk_mq_timeout_work
[ 581.789806] task: ffff8804721c8000 task.stack: ffffc90006ee4000
[ 581.789809] RIP: 0010:blk_mq_end_request+0x58/0x70
[ 581.789810] RSP: 0018:ffffc90006ee7d50 EFLAGS: 00010202
[ 581.789811] RAX: 0000000000000001 RBX: ffff8802e4195340 RCX: ffff88028e2f4b88
[ 581.789812] RDX: 0000000000001000 RSI: 0000000000001000 RDI: 0000000000000000
[ 581.789813] RBP: ffffc90006ee7d60 R08: 0000000000000003 R09: ffff88028e2f4b00
[ 581.789814] R10: 0000000000001000 R11: 0000000000000001 R12: 00000000fffffffb
[ 581.789815] R13: ffff88042abe5780 R14: 000000000000002d R15: ffff88046fbdff80
[ 581.789817] FS: 0000000000000000(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000
[ 581.789818] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 581.789819] CR2: 00007f64f403a008 CR3: 000000014d078000 CR4: 00000000001406e0
[ 581.789820] Call Trace:
[ 581.789825] blk_mq_check_expired+0x76/0x80
[ 581.789828] bt_iter+0x45/0x50
[ 581.789830] blk_mq_queue_tag_busy_iter+0xdd/0x1f0
[ 581.789832] ? blk_mq_rq_timed_out+0x70/0x70
[ 581.789833] ? blk_mq_rq_timed_out+0x70/0x70
[ 581.789840] ? __switch_to+0x140/0x450
[ 581.789841] blk_mq_timeout_work+0x88/0x170
[ 581.789845] process_one_work+0x165/0x410
[ 581.789847] worker_thread+0x137/0x4c0
[ 581.789851] kthread+0x101/0x140
[ 581.789853] ? rescuer_thread+0x3b0/0x3b0
[ 581.789855] ? kthread_park+0x90/0x90
[ 581.789860] ret_from_fork+0x2c/0x40
[ 581.789861] Code: 48 85 c0 74 0d 44 89 e6 48 89 df ff d0 5b 41 5c 5d c3 48
8b bb 70 01 00 00 48 85 ff 75 0f 48 89 df e8 7d f0 ff ff 5b 41 5c 5d c3 <0f>
0b e8 71 f0 ff ff 90 eb e9 0f 1f 40 00 66 2e 0f 1f 84 00 00
[ 581.789882] RIP: blk_mq_end_request+0x58/0x70 RSP: ffffc90006ee7d50
[ 581.789889] ---[ end trace bcaf03d9a14a0a70 ]---
[2]. oops log 2
[ 6984.857362] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[ 6984.857372] IP: nvme_queue_rq+0x6e6/0x8cd [nvme]
[ 6984.857373] PGD 0
[ 6984.857374]
[ 6984.857376] Oops: 0000 [#1] SMP
[ 6984.857379] Modules linked in: ipmi_ssif vfat fat intel_rapl sb_edac
edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_si iTCO_wdt
iTCO_vendor_support mxm_wmi ipmi_devintf intel_cstate sg dcdbas intel_uncore
mei_me intel_rapl_perf mei pcspkr lpc_ich ipmi_msghandler shpchp
acpi_power_meter wmi nfsd auth_rpcgss dm_multipath nfs_acl lockd grace sunrpc
ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea
sysfillrect crc32c_intel sysimgblt fb_sys_fops ttm nvme drm nvme_core ahci
libahci i2c_core tg3 libata ptp megaraid_sas pps_core fjes dm_mirror
dm_region_hash dm_log dm_mod
[ 6984.857416] CPU: 7 PID: 1635 Comm: kworker/7:1H Not tainted
4.10.0-2.el7.bz1420297.x86_64 #1
[ 6984.857417] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
[ 6984.857427] Workqueue: kblockd blk_mq_run_work_fn
[ 6984.857429] task: ffff880476e3da00 task.stack: ffffc90002e90000
[ 6984.857432] RIP: 0010:nvme_queue_rq+0x6e6/0x8cd [nvme]
[ 6984.857433] RSP: 0018:ffffc90002e93c50 EFLAGS: 00010246
[ 6984.857434] RAX: 0000000000000000 RBX: ffff880275646600 RCX: 0000000000001000
[ 6984.857435] RDX: 0000000000000fff RSI: 00000002fba2a000 RDI: ffff8804734e6950
[ 6984.857436] RBP: ffffc90002e93d30 R08: 0000000000002000 R09: 0000000000001000
[ 6984.857437] R10: 0000000000001000 R11: 0000000000000000 R12: ffff8804741d8000
[ 6984.857438] R13: 0000000000000040 R14: ffff880475649f80 R15: ffff8804734e6780
[ 6984.857439] FS: 0000000000000000(0000) GS:ffff88047fcc0000(0000) knlGS:0000000000000000
[ 6984.857440] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6984.857442] CR2: 0000000000000010 CR3: 0000000001c09000 CR4: 00000000001406e0
[ 6984.857443] Call Trace:
[ 6984.857451] ? mempool_free+0x2b/0x80
[ 6984.857455] ? bio_free+0x4e/0x60
[ 6984.857459] blk_mq_dispatch_rq_list+0xf5/0x230
[ 6984.857462] blk_mq_process_rq_list+0x133/0x170
[ 6984.857465] __blk_mq_run_hw_queue+0x8c/0xa0
[ 6984.857467] blk_mq_run_work_fn+0x12/0x20
[ 6984.857473] process_one_work+0x165/0x410
[ 6984.857475] worker_thread+0x137/0x4c0
[ 6984.857478] kthread+0x101/0x140
[ 6984.857480] ? rescuer_thread+0x3b0/0x3b0
[ 6984.857481] ? kthread_park+0x90/0x90
[ 6984.857489] ret_from_fork+0x2c/0x40
[ 6984.857490] Code: 8b bd 70 ff ff ff 89 95 50 ff ff ff 89 8d 58 ff ff ff 44
89 95 60 ff ff ff e8 b7 dd 12 e1 8b 95 50 ff ff ff 48 89 85 68 ff ff ff <4c>
8b 48 10 44 8b 58 18 8b 8d 58 ff ff ff 44 8b 95 60 ff ff ff
[ 6984.857511] RIP: nvme_queue_rq+0x6e6/0x8cd [nvme] RSP: ffffc90002e93c50
[ 6984.857512] CR2: 0000000000000010
[ 6984.895359] ---[ end trace 2d7ceb528432bf83 ]---
Cc: [email protected]
Reported-by: Yi Zhang <[email protected]>
Tested-by: Yi Zhang <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
block/blk-mq.c | 11 +----------
1 file changed, 1 insertion(+), 10 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index a4546f060e80..08a49c69738b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -697,17 +697,8 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
{
struct blk_mq_timeout_data *data = priv;
- if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
- /*
- * If a request wasn't started before the queue was
- * marked dying, kill it here or it'll go unnoticed.
- */
- if (unlikely(blk_queue_dying(rq->q))) {
- rq->errors = -EIO;
- blk_mq_end_request(rq, rq->errors);
- }
+ if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
return;
- }
if (time_after_eq(jiffies, rq->deadline)) {
if (!blk_mark_rq_complete(rq))
--
2.9.3
This patch adds comments on two races related to the timeout
handler:

- requeue from queue busy vs. timeout
- rq free & reallocation vs. timeout

Neither the races themselves nor the current solution is
explicit enough, so add comments on them.
Cc: Bart Van Assche <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
block/blk-mq.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 08a49c69738b..7068779d3bac 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -527,6 +527,15 @@ void blk_mq_start_request(struct request *rq)
}
EXPORT_SYMBOL(blk_mq_start_request);
+/*
+ * When we reach here because the queue is busy, the REQ_ATOM_COMPLETE
+ * flag isn't set yet, so there may be a race with the timeout handler,
+ * but given rq->deadline has just been set in .queue_rq() in this
+ * situation, the race isn't possible in reality because rq->timeout
+ * should be big enough to cover the window between blk_mq_start_request()
+ * being called from .queue_rq() and REQ_ATOM_STARTED being cleared here.
+ */
static void __blk_mq_requeue_request(struct request *rq)
{
struct request_queue *q = rq->q;
@@ -700,6 +709,19 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
return;
+ /*
+ * The rq being checked may already have been freed and reallocated;
+ * we avoid that race by checking rq->deadline and the
+ * REQ_ATOM_COMPLETE flag together:
+ *
+ * - if rq->deadline is observed as the new value because of
+ *   reuse, the rq won't be timed out, for timing reasons.
+ * - if rq->deadline is observed as the previous value, the
+ *   REQ_ATOM_COMPLETE flag won't have been cleared in the reuse
+ *   path, because there is a barrier between setting rq->deadline
+ *   and clearing the flag in blk_mq_start_request(), so this rq
+ *   won't be timed out either.
+ */
if (time_after_eq(jiffies, rq->deadline)) {
if (!blk_mark_rq_complete(rq))
blk_mq_rq_timed_out(rq, reserved);
--
2.9.3
On Fri, 2017-03-17 at 17:57 +0800, Ming Lei wrote:
> Before commit 780db2071a(blk-mq: decouble blk-mq freezing
> from generic bypassing), the dying flag is checked before
> entering queue, and Tejun converts the checking into .mq_freeze_depth,
> and assumes the counter is increased just after dying flag
> is set. Unfortunately we doesn't do that in blk_set_queue_dying().
>
> This patch calls blk_mq_freeze_queue_start() for blk-mq in
> blk_set_queue_dying(), so that we can block new I/O coming
> once the queue is set as dying.
>
> Given blk_set_queue_dying() is always called in remove path
> of block device, and queue will be cleaned up later, we don't
> need to worry about undoing the counter.
>
> Cc: Bart Van Assche <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Signed-off-by: Ming Lei <[email protected]>
> ---
> block/blk-core.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index d772c221cc17..62d4967c369f 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -500,9 +500,12 @@ void blk_set_queue_dying(struct request_queue *q)
> queue_flag_set(QUEUE_FLAG_DYING, q);
> spin_unlock_irq(q->queue_lock);
>
> - if (q->mq_ops)
> + if (q->mq_ops) {
> blk_mq_wake_waiters(q);
> - else {
> +
> + /* block new I/O coming */
> + blk_mq_freeze_queue_start(q);
> + } else {
> struct request_list *rl;
>
> spin_lock_irq(q->queue_lock);
Hello Ming,
I think we need the queue freezing not only for blk-mq but also for blk-sq.
Since the queue flags and the mq_freeze_depth are stored in different
variables, we need to prevent the CPU from reordering the stores to these
variables. The comment about blk_mq_freeze_queue_start() should also be
clearer. How about something like the patch below?
[PATCH] blk-mq: Force block layer users to check the "dying" flag after it has been set
Commit 780db2071ac4 removed the blk_queue_dying() check from the
hot path of blk_mq_queue_enter() although that check is necessary
when cleaning up a queue. Hence make sure that blk_queue_enter()
and blk_mq_queue_enter() check the dying flag after it has been set.
Because blk_set_queue_dying() is only called from the remove path
of a block device we don't need to worry about unfreezing the queue.
Fixes: commit 780db2071ac4 ("blk-mq: decouble blk-mq freezing from generic bypassing")
---
 block/blk-core.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)
diff --git a/block/blk-core.c b/block/blk-core.c
index d772c221cc17..730f715b72ff 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -500,6 +500,19 @@ void blk_set_queue_dying(struct request_queue *q)
 	queue_flag_set(QUEUE_FLAG_DYING, q);
 	spin_unlock_irq(q->queue_lock);
 
+	/*
+	 * Avoid that the updates of the queue flags and q_usage_counter
+	 * are reordered.
+	 */
+	smp_wmb();
+
+	/*
+	 * Force blk_queue_enter() and blk_mq_queue_enter() to check the
+	 * "dying" flag. Despite its name, blk_mq_freeze_queue_start()
+	 * affects blk-sq and blk-mq queues.
+	 */
+	blk_mq_freeze_queue_start(q);
+
 	if (q->mq_ops)
 		blk_mq_wake_waiters(q);
 	else {
Thanks,
Bart.
On Fri, 2017-03-17 at 17:57 +0800, Ming Lei wrote:
> +/*
> + * When we reach here because queue is busy, REQ_ATOM_COMPLETE
> + * flag isn't set yet, so there may be race with timeout hanlder,
> + * but given rq->deadline is just set in .queue_rq() under
> + * this sitation, the race won't be possible in reality because
> + * rq->timeout should be set as big enough to cover the window
> + * between blk_mq_start_request() called from .queue_rq() and
> + * clearing REQ_ATOM_STARTED here.
> + */
> static void __blk_mq_requeue_request(struct request *rq)
> {
> struct request_queue *q = rq->q;
> @@ -700,6 +709,19 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
> if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
> return;
>
> + /*
> + * The rq being checked may have been freed and reallocated
> + * out already here, we avoid this race by checking rq->deadline
> + * and REQ_ATOM_COMPLETE flag together:
> + *
> + * - if rq->deadline is observed as new value because of
> + * reusing, the rq won't be timed out because of timing.
> + * - if rq->deadline is observed as previous value,
> + * REQ_ATOM_COMPLETE flag won't be cleared in reuse path
> + * because we put a barrier between setting rq->deadline
> + * and clearing the flag in blk_mq_start_request(), so
> + * this rq won't be timed out too.
> + */
> if (time_after_eq(jiffies, rq->deadline)) {
> if (!blk_mark_rq_complete(rq))
> blk_mq_rq_timed_out(rq, reserved);
Since this explanation applies to the same race addressed by patch 1/3,
please consider squashing this patch into patch 1/3.
Thanks,
Bart.
On Sat, Mar 18, 2017 at 1:39 AM, Bart Van Assche
<[email protected]> wrote:
> On Fri, 2017-03-17 at 17:57 +0800, Ming Lei wrote:
>> +/*
>> + * When we reach here because queue is busy, REQ_ATOM_COMPLETE
>> + * flag isn't set yet, so there may be race with timeout hanlder,
>> + * but given rq->deadline is just set in .queue_rq() under
>> + * this sitation, the race won't be possible in reality because
>> + * rq->timeout should be set as big enough to cover the window
>> + * between blk_mq_start_request() called from .queue_rq() and
>> + * clearing REQ_ATOM_STARTED here.
>> + */
>> static void __blk_mq_requeue_request(struct request *rq)
>> {
>> struct request_queue *q = rq->q;
>> @@ -700,6 +709,19 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
>> if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
>> return;
>>
>> + /*
>> + * The rq being checked may have been freed and reallocated
>> + * out already here, we avoid this race by checking rq->deadline
>> + * and REQ_ATOM_COMPLETE flag together:
>> + *
>> + * - if rq->deadline is observed as new value because of
>> + * reusing, the rq won't be timed out because of timing.
>> + * - if rq->deadline is observed as previous value,
>> + * REQ_ATOM_COMPLETE flag won't be cleared in reuse path
>> + * because we put a barrier between setting rq->deadline
>> + * and clearing the flag in blk_mq_start_request(), so
>> + * this rq won't be timed out too.
>> + */
>> if (time_after_eq(jiffies, rq->deadline)) {
>> if (!blk_mark_rq_complete(rq))
>> blk_mq_rq_timed_out(rq, reserved);
>
> Since this explanation applies to the same race addressed by patch 1/3,
First, this comment explains how we deal with the race of reuse vs.
timeout, while 1/3 fixes a different race/rq corruption. Did you see the
STARTED flag mentioned anywhere in the comment above?

In the case of 1/3, the rq to be dispatched can simply be destroyed by the
blk_mq_end_request() issued from the timeout path. Even if it survives, the
same rq can be allocated to another I/O path, which is a different
situation from reuse vs. timeout. I can't see how this comment would help
explain 1/3's issue, can you? Maybe I need to mention rq corruption in 1/3
explicitly.

Second, folding this comment into 1/3 just creates unnecessary backporting
burden, since 1/3 has to go into -stable.
> please consider squashing this patch into patch 1/3.
So please do not consider that.
Thanks,
Ming Lei
On Sat, Mar 18, 2017 at 1:32 AM, Bart Van Assche
<[email protected]> wrote:
> On Fri, 2017-03-17 at 17:57 +0800, Ming Lei wrote:
>> Before commit 780db2071a(blk-mq: decouble blk-mq freezing
>> from generic bypassing), the dying flag is checked before
>> entering queue, and Tejun converts the checking into .mq_freeze_depth,
>> and assumes the counter is increased just after dying flag
>> is set. Unfortunately we doesn't do that in blk_set_queue_dying().
>>
>> This patch calls blk_mq_freeze_queue_start() for blk-mq in
>> blk_set_queue_dying(), so that we can block new I/O coming
>> once the queue is set as dying.
>>
>> Given blk_set_queue_dying() is always called in remove path
>> of block device, and queue will be cleaned up later, we don't
>> need to worry about undoing the counter.
>>
>> Cc: Bart Van Assche <[email protected]>
>> Cc: Tejun Heo <[email protected]>
>> Signed-off-by: Ming Lei <[email protected]>
>> ---
>> block/blk-core.c | 7 +++++--
>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index d772c221cc17..62d4967c369f 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -500,9 +500,12 @@ void blk_set_queue_dying(struct request_queue *q)
>> queue_flag_set(QUEUE_FLAG_DYING, q);
>> spin_unlock_irq(q->queue_lock);
>>
>> - if (q->mq_ops)
>> + if (q->mq_ops) {
>> blk_mq_wake_waiters(q);
>> - else {
>> +
>> + /* block new I/O coming */
>> + blk_mq_freeze_queue_start(q);
>> + } else {
>> struct request_list *rl;
>>
>> spin_lock_irq(q->queue_lock);
>
> Hello Ming,
>
> I think we need the queue freezing not only for blk-mq but also for blk-sq.
Yes, we can, but it may not be a big deal for blk-sq, since get_request() does
check the dying flag.
> Since the queue flags and the mq_freeze_depth are stored in different
> variables we need to prevent that the CPU reorders the stores to these
Not needed, see below.
> variables. The comment about blk_mq_freeze_queue_start() should be more
> clear. How about something like the patch below?
>
>
> [PATCH] blk-mq: Force block layer users to check the "dying" flag after it has been set
>
> Commit 780db2071ac4 removed the blk_queue_dying() check from the
> hot path of blk_mq_queue_enter() although that check is necessary
> when cleaning up a queue. Hence make sure that blk_queue_enter()
> and blk_mq_queue_enter() check the dying flag after it has been set.
>
> Because blk_set_queue_dying() is only called from the remove path
> of a block device we don't need to worry about unfreezing the queue.
>
> Fixes: commit 780db2071ac4 ("blk-mq: decouble blk-mq freezing from generic bypassing")
> ---
> block/blk-core.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index d772c221cc17..730f715b72ff 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -500,6 +500,19 @@ void blk_set_queue_dying(struct request_queue *q)
> queue_flag_set(QUEUE_FLAG_DYING, q);
> spin_unlock_irq(q->queue_lock);
>
> + /*
> + * Avoid that the updates of the queue flags and q_usage_counter
> + * are reordered.
> + */
> + smp_wmb();
atomic_inc_return() in blk_mq_freeze_queue_start() does imply a
barrier(smp_mb()).
> +
> + /*
> + * Force blk_queue_enter() and blk_mq_queue_enter() to check the
> + * "dying" flag. Despite its name, blk_mq_freeze_queue_start()
> + * affects blk-sq and blk-mq queues.
> + */
> + blk_mq_freeze_queue_start(q);
We should rename it to blk_freeze_queue_start(), since it is quite
confusing to call a _mq function for both mq and legacy queues.

I will make it cover both in v2, if no one objects.
Thanks,
Ming Lei
On Sat, 2017-03-18 at 02:32 +0800, Ming Lei wrote:
> On Sat, Mar 18, 2017 at 1:32 AM, Bart Van Assche wrote:
> > + /*
> > + * Avoid that the updates of the queue flags and q_usage_counter
> > + * are reordered.
> > + */
> > + smp_wmb();
>
> atomic_inc_return() in blk_mq_freeze_queue_start() does imply a
> barrier(smp_mb()).
Hello Ming,
It's probably a good idea to mention that in a comment, because the
implementation of blk_mq_freeze_queue_start() could be changed in the
future to use another atomic operation that doesn't implicitly perform
smp_mb().
Thanks,
Bart.
On Fri, 2017-03-17 at 17:57 +0800, Ming Lei wrote:
> Given blk_set_queue_dying() is always called in remove path
> of block device, and queue will be cleaned up later, we don't
> need to worry about undoing the counter.
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index d772c221cc17..62d4967c369f 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -500,9 +500,12 @@ void blk_set_queue_dying(struct request_queue *q)
> queue_flag_set(QUEUE_FLAG_DYING, q);
> spin_unlock_irq(q->queue_lock);
>
> - if (q->mq_ops)
> + if (q->mq_ops) {
> blk_mq_wake_waiters(q);
> - else {
> +
> + /* block new I/O coming */
> + blk_mq_freeze_queue_start(q);
> + } else {
> struct request_list *rl;
>
> spin_lock_irq(q->queue_lock);
Hello Ming,
The blk_freeze_queue() call in blk_cleanup_queue() waits until q_usage_counter
drops to zero. Since the above blk_mq_freeze_queue_start() call increases that
counter by one, how is blk_freeze_queue() expected to finish ever?
Bart.
On Fri, Mar 17, 2017 at 11:26:26PM +0000, Bart Van Assche wrote:
> On Fri, 2017-03-17 at 17:57 +0800, Ming Lei wrote:
> > Given blk_set_queue_dying() is always called in remove path
> > of block device, and queue will be cleaned up later, we don't
> > need to worry about undoing the counter.
> >
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index d772c221cc17..62d4967c369f 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -500,9 +500,12 @@ void blk_set_queue_dying(struct request_queue *q)
> > queue_flag_set(QUEUE_FLAG_DYING, q);
> > spin_unlock_irq(q->queue_lock);
> >
> > - if (q->mq_ops)
> > + if (q->mq_ops) {
> > blk_mq_wake_waiters(q);
> > - else {
> > +
> > + /* block new I/O coming */
> > + blk_mq_freeze_queue_start(q);
> > + } else {
> > struct request_list *rl;
> >
> > spin_lock_irq(q->queue_lock);
>
> Hello Ming,
>
> The blk_freeze_queue() call in blk_cleanup_queue() waits until q_usage_counter
> drops to zero. Since the above blk_mq_freeze_queue_start() call increases that
> counter by one, how is blk_freeze_queue() expected to finish ever?
It is q->mq_freeze_depth that is increased by blk_mq_freeze_queue_start(),
not q->q_usage_counter; otherwise blk_freeze_queue() would indeed never
return. :-)
Thanks,
Ming
On 03/17/2017 10:57 AM, Ming Lei wrote:
> When iterating busy requests in timeout handler,
> if the STARTED flag of one request isn't set, that means
> the request is being processed in block layer or driver, and
> isn't dispatch to hardware yet.
>
> In current implementation of blk_mq_check_expired(),
> if the request queue becomes dying, un-started requests are
> handled as being completed/freed immediately. This way is
> wrong, and can cause use-after-free issue easily[1][2], when
> doing I/O and removing&resetting NVMe device at the sametime.
>
> This patch fixes several issues reported by Yi Zhang.
>
> [1]. oops log 1
> [ ... oops log snipped ... ]
>
> [2]. oops log2
> [ ... oops log snipped ... ]
>
> Cc: [email protected]
> Reported-by: Yi Zhang <[email protected]>
> Tested-by: Yi Zhang <[email protected]>
> Reviewed-by: Bart Van Assche <[email protected]>
> Signed-off-by: Ming Lei <[email protected]>
> ---
> block/blk-mq.c | 11 +----------
> 1 file changed, 1 insertion(+), 10 deletions(-)
Reviewed-by: Hannes Reinecke <[email protected]>
Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
On 03/17/2017 10:57 AM, Ming Lei wrote:
> This patch adds comment on two races related with
> timeout handler:
>
> - requeue from queue busy vs. timeout
> - rq free & reallocation vs. timeout
>
> Both the races themselves and current solution aren't
> explicit enough, so add comments on them.
>
> Cc: Bart Van Assche <[email protected]>
> Signed-off-by: Ming Lei <[email protected]>
> ---
> block/blk-mq.c | 22 ++++++++++++++++++++++
> 1 file changed, 22 insertions(+)
>
Reviewed-by: Hannes Reinecke <[email protected]>
Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
On 03/17/2017 10:57 AM, Ming Lei wrote:
> Before commit 780db2071a(blk-mq: decouble blk-mq freezing
> from generic bypassing), the dying flag is checked before
> entering queue, and Tejun converts the checking into .mq_freeze_depth,
> and assumes the counter is increased just after dying flag
> is set. Unfortunately we doesn't do that in blk_set_queue_dying().
>
> This patch calls blk_mq_freeze_queue_start() for blk-mq in
> blk_set_queue_dying(), so that we can block new I/O coming
> once the queue is set as dying.
>
> Given blk_set_queue_dying() is always called in remove path
> of block device, and queue will be cleaned up later, we don't
> need to worry about undoing the counter.
>
> Cc: Bart Van Assche <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Signed-off-by: Ming Lei <[email protected]>
> ---
> block/blk-core.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
Reviewed-by: Hannes Reinecke <[email protected]>
Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)