2017-03-09 13:05:03

by Ming Lei

[permalink] [raw]
Subject: [PATCH 0/2] blk-mq: dying queue fix & improvement

Hi,

The 1st patch fixes a race between the timeout handler and a dying queue.

The 2nd patch improves the handling of a dying queue.

thanks,
Ming


Ming Lei (2):
blk-mq: don't complete un-started request in timeout handler
blk-mq: start to freeze queue just after setting dying

block/blk-core.c | 7 +++++--
block/blk-mq.c | 11 +----------
2 files changed, 6 insertions(+), 12 deletions(-)

--
2.7.4


2017-03-09 13:11:49

by Ming Lei

[permalink] [raw]
Subject: [PATCH 1/2] blk-mq: don't complete un-started request in timeout handler

When iterating over busy requests in the timeout handler,
if the STARTED flag of a request isn't set, that means
the request is still being processed inside the block layer and
hasn't been dispatched to the low-level driver yet.

In the current implementation of blk_mq_check_expired(),
when the request queue becomes dying, such un-started
requests are completed immediately. This can cause a
use-after-free issue[1][2] when doing I/O while removing &
resetting an NVMe device.

This patch fixes several issues reported by Yi Zhang.
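
For illustration, a rough userspace model of the race (free() stands in
for blk_mq_end_request(); struct rq_model and the other names are made up
for this sketch, which is not the kernel code):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct rq_model {
	atomic_bool started;            /* models REQ_ATOM_STARTED */
	int         payload;
};

/* Timeout worker before this patch: an un-started request on a dying
 * queue is ended (and may be freed) right here, while the dispatch path
 * may still be about to issue this very request to the driver. */
static void check_expired_old(struct rq_model *rq, bool queue_dying)
{
	if (!atomic_load(&rq->started)) {
		if (queue_dying)
			free(rq);       /* stands in for blk_mq_end_request() */
		return;
	}
	/* ... normal deadline handling ... */
}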

[1]. oops log 1
[ 581.789754] ------------[ cut here ]------------
[ 581.789758] kernel BUG at block/blk-mq.c:374!
[ 581.789760] invalid opcode: 0000 [#1] SMP
[ 581.789761] Modules linked in: vfat fat ipmi_ssif intel_rapl sb_edac
edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm nvme
irqbypass crct10dif_pclmul nvme_core crc32_pclmul ghash_clmulni_intel
intel_cstate ipmi_si mei_me ipmi_devintf intel_uncore sg ipmi_msghandler
intel_rapl_perf iTCO_wdt mei iTCO_vendor_support mxm_wmi lpc_ich dcdbas shpchp
pcspkr acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd dm_multipath grace
sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper
syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci
crc32c_intel tg3 libata megaraid_sas i2c_core ptp fjes pps_core dm_mirror
dm_region_hash dm_log dm_mod
[ 581.789796] CPU: 1 PID: 1617 Comm: kworker/1:1H Not tainted 4.10.0.bz1420297+ #4
[ 581.789797] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
[ 581.789804] Workqueue: kblockd blk_mq_timeout_work
[ 581.789806] task: ffff8804721c8000 task.stack: ffffc90006ee4000
[ 581.789809] RIP: 0010:blk_mq_end_request+0x58/0x70
[ 581.789810] RSP: 0018:ffffc90006ee7d50 EFLAGS: 00010202
[ 581.789811] RAX: 0000000000000001 RBX: ffff8802e4195340 RCX: ffff88028e2f4b88
[ 581.789812] RDX: 0000000000001000 RSI: 0000000000001000 RDI: 0000000000000000
[ 581.789813] RBP: ffffc90006ee7d60 R08: 0000000000000003 R09: ffff88028e2f4b00
[ 581.789814] R10: 0000000000001000 R11: 0000000000000001 R12: 00000000fffffffb
[ 581.789815] R13: ffff88042abe5780 R14: 000000000000002d R15: ffff88046fbdff80
[ 581.789817] FS: 0000000000000000(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000
[ 581.789818] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 581.789819] CR2: 00007f64f403a008 CR3: 000000014d078000 CR4: 00000000001406e0
[ 581.789820] Call Trace:
[ 581.789825] blk_mq_check_expired+0x76/0x80
[ 581.789828] bt_iter+0x45/0x50
[ 581.789830] blk_mq_queue_tag_busy_iter+0xdd/0x1f0
[ 581.789832] ? blk_mq_rq_timed_out+0x70/0x70
[ 581.789833] ? blk_mq_rq_timed_out+0x70/0x70
[ 581.789840] ? __switch_to+0x140/0x450
[ 581.789841] blk_mq_timeout_work+0x88/0x170
[ 581.789845] process_one_work+0x165/0x410
[ 581.789847] worker_thread+0x137/0x4c0
[ 581.789851] kthread+0x101/0x140
[ 581.789853] ? rescuer_thread+0x3b0/0x3b0
[ 581.789855] ? kthread_park+0x90/0x90
[ 581.789860] ret_from_fork+0x2c/0x40
[ 581.789861] Code: 48 85 c0 74 0d 44 89 e6 48 89 df ff d0 5b 41 5c 5d c3 48
8b bb 70 01 00 00 48 85 ff 75 0f 48 89 df e8 7d f0 ff ff 5b 41 5c 5d c3 <0f>
0b e8 71 f0 ff ff 90 eb e9 0f 1f 40 00 66 2e 0f 1f 84 00 00
[ 581.789882] RIP: blk_mq_end_request+0x58/0x70 RSP: ffffc90006ee7d50
[ 581.789889] ---[ end trace bcaf03d9a14a0a70 ]---

[2]. oops log2
[ 6984.857362] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[ 6984.857372] IP: nvme_queue_rq+0x6e6/0x8cd [nvme]
[ 6984.857373] PGD 0
[ 6984.857374]
[ 6984.857376] Oops: 0000 [#1] SMP
[ 6984.857379] Modules linked in: ipmi_ssif vfat fat intel_rapl sb_edac
edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_si iTCO_wdt
iTCO_vendor_support mxm_wmi ipmi_devintf intel_cstate sg dcdbas intel_uncore
mei_me intel_rapl_perf mei pcspkr lpc_ich ipmi_msghandler shpchp
acpi_power_meter wmi nfsd auth_rpcgss dm_multipath nfs_acl lockd grace sunrpc
ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea
sysfillrect crc32c_intel sysimgblt fb_sys_fops ttm nvme drm nvme_core ahci
libahci i2c_core tg3 libata ptp megaraid_sas pps_core fjes dm_mirror
dm_region_hash dm_log dm_mod
[ 6984.857416] CPU: 7 PID: 1635 Comm: kworker/7:1H Not tainted
4.10.0-2.el7.bz1420297.x86_64 #1
[ 6984.857417] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
[ 6984.857427] Workqueue: kblockd blk_mq_run_work_fn
[ 6984.857429] task: ffff880476e3da00 task.stack: ffffc90002e90000
[ 6984.857432] RIP: 0010:nvme_queue_rq+0x6e6/0x8cd [nvme]
[ 6984.857433] RSP: 0018:ffffc90002e93c50 EFLAGS: 00010246
[ 6984.857434] RAX: 0000000000000000 RBX: ffff880275646600 RCX: 0000000000001000
[ 6984.857435] RDX: 0000000000000fff RSI: 00000002fba2a000 RDI: ffff8804734e6950
[ 6984.857436] RBP: ffffc90002e93d30 R08: 0000000000002000 R09: 0000000000001000
[ 6984.857437] R10: 0000000000001000 R11: 0000000000000000 R12: ffff8804741d8000
[ 6984.857438] R13: 0000000000000040 R14: ffff880475649f80 R15: ffff8804734e6780
[ 6984.857439] FS: 0000000000000000(0000) GS:ffff88047fcc0000(0000) knlGS:0000000000000000
[ 6984.857440] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6984.857442] CR2: 0000000000000010 CR3: 0000000001c09000 CR4: 00000000001406e0
[ 6984.857443] Call Trace:
[ 6984.857451] ? mempool_free+0x2b/0x80
[ 6984.857455] ? bio_free+0x4e/0x60
[ 6984.857459] blk_mq_dispatch_rq_list+0xf5/0x230
[ 6984.857462] blk_mq_process_rq_list+0x133/0x170
[ 6984.857465] __blk_mq_run_hw_queue+0x8c/0xa0
[ 6984.857467] blk_mq_run_work_fn+0x12/0x20
[ 6984.857473] process_one_work+0x165/0x410
[ 6984.857475] worker_thread+0x137/0x4c0
[ 6984.857478] kthread+0x101/0x140
[ 6984.857480] ? rescuer_thread+0x3b0/0x3b0
[ 6984.857481] ? kthread_park+0x90/0x90
[ 6984.857489] ret_from_fork+0x2c/0x40
[ 6984.857490] Code: 8b bd 70 ff ff ff 89 95 50 ff ff ff 89 8d 58 ff ff ff 44
89 95 60 ff ff ff e8 b7 dd 12 e1 8b 95 50 ff ff ff 48 89 85 68 ff ff ff <4c>
8b 48 10 44 8b 58 18 8b 8d 58 ff ff ff 44 8b 95 60 ff ff ff
[ 6984.857511] RIP: nvme_queue_rq+0x6e6/0x8cd [nvme] RSP: ffffc90002e93c50
[ 6984.857512] CR2: 0000000000000010
[ 6984.895359] ---[ end trace 2d7ceb528432bf83 ]---

Cc: [email protected]
Reported-by: Yi Zhang <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
block/blk-mq.c | 11 +----------
1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 159187a28d66..0aff380099d5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -697,17 +697,8 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 {
 	struct blk_mq_timeout_data *data = priv;
 
-	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
-		/*
-		 * If a request wasn't started before the queue was
-		 * marked dying, kill it here or it'll go unnoticed.
-		 */
-		if (unlikely(blk_queue_dying(rq->q))) {
-			rq->errors = -EIO;
-			blk_mq_end_request(rq, rq->errors);
-		}
+	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
 		return;
-	}
 
 	if (time_after_eq(jiffies, rq->deadline)) {
 		if (!blk_mark_rq_complete(rq))
--
2.7.4

2017-03-09 13:12:08

by Ming Lei

[permalink] [raw]
Subject: [PATCH 2/2] blk-mq: start to freeze queue just after setting dying

Before commit 780db2071a ("blk-mq: decouble blk-mq freezing
from generic bypassing"), the dying flag was checked before
entering the queue. Tejun converted that check into one on
.mq_freeze_depth, assuming the counter is increased just after
the dying flag is set. Unfortunately we don't do that in
blk_set_queue_dying().

This patch calls blk_mq_freeze_queue_start() for blk-mq in
blk_set_queue_dying(), so that new I/O is blocked once
the queue is set as dying.

Given blk_set_queue_dying() is always called in the remove path
of a block device, and the queue will be cleaned up later, we don't
need to worry about undoing the counter.
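
As an illustration of the gate this relies on, a minimal userspace model
in C11 atomics (names like queue_model are illustrative; this is not the
kernel implementation): once the freeze counter is raised, new submitters
fail to enter the queue and can then observe the dying flag.

#include <stdatomic.h>
#include <stdbool.h>

struct queue_model {
	atomic_int  mq_freeze_depth;    /* models q->mq_freeze_depth */
	atomic_bool dying;              /* models QUEUE_FLAG_DYING */
};

/* models blk_queue_enter(): new submitters are rejected while frozen */
static bool queue_enter(struct queue_model *q)
{
	return atomic_load(&q->mq_freeze_depth) == 0;
}

/* models blk_set_queue_dying() after this patch */
static void set_queue_dying(struct queue_model *q)
{
	atomic_store(&q->dying, true);
	/* raise the freeze counter right away so queue_enter()
	 * starts failing new I/O as soon as the flag is set */
	atomic_fetch_add(&q->mq_freeze_depth, 1);
}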

Cc: Tejun Heo <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
block/blk-core.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 0eeb99ef654f..559487e58296 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -500,9 +500,12 @@ void blk_set_queue_dying(struct request_queue *q)
 	queue_flag_set(QUEUE_FLAG_DYING, q);
 	spin_unlock_irq(q->queue_lock);
 
-	if (q->mq_ops)
+	if (q->mq_ops) {
 		blk_mq_wake_waiters(q);
-	else {
+
+		/* block new I/O coming */
+		blk_mq_freeze_queue_start(q);
+	} else {
 		struct request_list *rl;
 
 		spin_lock_irq(q->queue_lock);
--
2.7.4


2017-03-09 16:59:18

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH 2/2] blk-mq: start to freeze queue just after setting dying

On Thu, 2017-03-09 at 21:02 +0800, Ming Lei wrote:
> Before commit 780db2071a(blk-mq: decouble blk-mq freezing
> from generic bypassing), the dying flag is checked before
> entering queue, and Tejun converts the checking into .mq_freeze_depth,
> and assumes the counter is increased just after dying flag
> is set. Unfortunately we doesn't do that in blk_set_queue_dying().
>
> This patch calls blk_mq_freeze_queue_start() for blk-mq in
> blk_set_queue_dying(), so that we can block new I/O coming
> once the queue is set as dying.
>
> Given blk_set_queue_dying() is always called in remove path
> of block device, and queue will be cleaned up later, we don't
> need to worry about undo of the counter.
>
> Cc: Tejun Heo <[email protected]>
> Signed-off-by: Ming Lei <[email protected]>
> ---
> block/blk-core.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 0eeb99ef654f..559487e58296 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -500,9 +500,12 @@ void blk_set_queue_dying(struct request_queue *q)
> queue_flag_set(QUEUE_FLAG_DYING, q);
> spin_unlock_irq(q->queue_lock);
>
> - if (q->mq_ops)
> + if (q->mq_ops) {
> blk_mq_wake_waiters(q);
> - else {
> +
> + /* block new I/O coming */
> + blk_mq_freeze_queue_start(q);
> + } else {
> struct request_list *rl;
>
> spin_lock_irq(q->queue_lock);

The comment above blk_mq_freeze_queue_start() should explain more clearly
why that call is needed. Additionally, I think this patch makes the
blk_freeze_queue() call in blk_cleanup_queue() superfluous. How about the
(entirely untested) patch below?

diff --git a/block/blk-core.c b/block/blk-core.c
index 1086dac8724c..3ce48f2d65cf 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -500,6 +500,12 @@ void blk_set_queue_dying(struct request_queue *q)
 	queue_flag_set(QUEUE_FLAG_DYING, q);
 	spin_unlock_irq(q->queue_lock);
 
+	/*
+	 * Force blk_queue_enter() and blk_mq_queue_enter() to check the
+	 * "dying" flag.
+	 */
+	blk_freeze_queue(q);
+
 	if (q->mq_ops)
 		blk_mq_wake_waiters(q);
 	else {
@@ -555,7 +561,7 @@ void blk_cleanup_queue(struct request_queue *q)
 	 * Drain all requests queued before DYING marking. Set DEAD flag to
 	 * prevent that q->request_fn() gets invoked after draining finished.
 	 */
-	blk_freeze_queue(q);
+	WARN_ON_ONCE(!atomic_read(&q->mq_freeze_depth));
 	spin_lock_irq(lock);
 	if (!q->mq_ops)
 		__blk_drain_queue(q, true);

Thanks,

Bart.

2017-03-10 02:16:13

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH 2/2] blk-mq: start to freeze queue just after setting dying

On Fri, Mar 10, 2017 at 12:58 AM, Bart Van Assche
<[email protected]> wrote:
> On Thu, 2017-03-09 at 21:02 +0800, Ming Lei wrote:
>> Before commit 780db2071a(blk-mq: decouble blk-mq freezing
>> from generic bypassing), the dying flag is checked before
>> entering queue, and Tejun converts the checking into .mq_freeze_depth,
>> and assumes the counter is increased just after dying flag
>> is set. Unfortunately we doesn't do that in blk_set_queue_dying().
>>
>> This patch calls blk_mq_freeze_queue_start() for blk-mq in
>> blk_set_queue_dying(), so that we can block new I/O coming
>> once the queue is set as dying.
>>
>> Given blk_set_queue_dying() is always called in remove path
>> of block device, and queue will be cleaned up later, we don't
>> need to worry about undo of the counter.
>>
>> Cc: Tejun Heo <[email protected]>
>> Signed-off-by: Ming Lei <[email protected]>
>> ---
>> block/blk-core.c | 7 +++++--
>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 0eeb99ef654f..559487e58296 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -500,9 +500,12 @@ void blk_set_queue_dying(struct request_queue *q)
>> queue_flag_set(QUEUE_FLAG_DYING, q);
>> spin_unlock_irq(q->queue_lock);
>>
>> - if (q->mq_ops)
>> + if (q->mq_ops) {
>> blk_mq_wake_waiters(q);
>> - else {
>> +
>> + /* block new I/O coming */
>> + blk_mq_freeze_queue_start(q);
>> + } else {
>> struct request_list *rl;
>>
>> spin_lock_irq(q->queue_lock);
>
> The comment above blk_mq_freeze_queue_start() should explain more clearly
> why that call is needed. Additionally, I think this patch makes the

The comment "block new I/O coming" has been added; let me know what
else is needed, :-)

> blk_freeze_queue() call in blk_cleanup_queue() superfluous. How about the
> (entirely untested) patch below?

I don't think we need to wait in blk_set_queue_dying(); the purpose
of this patch is to block new I/O once dying is set, as pointed out in the
comment. The change in blk_cleanup_queue() isn't necessary either, since
that is exactly where we should drain the queue.
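
To make the distinction concrete, a hedged userspace sketch of the two
freeze primitives being debated (illustrative only; the real
blk_freeze_queue() sleeps on a wait queue instead of spinning, and the
struct here is made up for the sketch):

#include <stdatomic.h>

struct queue_model {
	atomic_int mq_freeze_depth;     /* blocks new entrants when > 0 */
	atomic_int in_flight;           /* users currently inside the queue */
};

/* models blk_mq_freeze_queue_start(): bump the counter and return;
 * new I/O is blocked, but nobody waits here */
static void freeze_queue_start(struct queue_model *q)
{
	atomic_fetch_add(&q->mq_freeze_depth, 1);
}

/* models blk_freeze_queue(): additionally wait until in-flight
 * users have drained */
static void freeze_queue(struct queue_model *q)
{
	freeze_queue_start(q);
	while (atomic_load(&q->in_flight) > 0)
		;                       /* busy-wait only in this model */
}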


Thanks,
Ming Lei

2017-03-15 00:08:03

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH 1/2] blk-mq: don't complete un-started request in timeout handler

On Thu, 2017-03-09 at 21:02 +0800, Ming Lei wrote:
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 159187a28d66..0aff380099d5 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -697,17 +697,8 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
> {
> struct blk_mq_timeout_data *data = priv;
>
> - if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
> - /*
> - * If a request wasn't started before the queue was
> - * marked dying, kill it here or it'll go unnoticed.
> - */
> - if (unlikely(blk_queue_dying(rq->q))) {
> - rq->errors = -EIO;
> - blk_mq_end_request(rq, rq->errors);
> - }
> + if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
> return;
> - }
>
> if (time_after_eq(jiffies, rq->deadline)) {
> if (!blk_mark_rq_complete(rq))

Both the old and the new check look racy to me. The REQ_ATOM_STARTED bit can
be changed concurrently by blk_mq_start_request(), __blk_mq_finish_request()
or __blk_mq_requeue_request(). Another issue with this function is that the
request passed to this function can be reinitialized concurrently. Sorry but
I'm not sure what the best way is to address these issues.

Bart.

2017-03-15 12:20:00

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH 1/2] blk-mq: don't complete un-started request in timeout handler

On Wed, Mar 15, 2017 at 12:07:37AM +0000, Bart Van Assche wrote:
> On Thu, 2017-03-09 at 21:02 +0800, Ming Lei wrote:
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 159187a28d66..0aff380099d5 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -697,17 +697,8 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
> > {
> > struct blk_mq_timeout_data *data = priv;
> >
> > - if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
> > - /*
> > - * If a request wasn't started before the queue was
> > - * marked dying, kill it here or it'll go unnoticed.
> > - */
> > - if (unlikely(blk_queue_dying(rq->q))) {
> > - rq->errors = -EIO;
> > - blk_mq_end_request(rq, rq->errors);
> > - }
> > + if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
> > return;
> > - }
> >
> > if (time_after_eq(jiffies, rq->deadline)) {
> > if (!blk_mark_rq_complete(rq))
>
> Both the old and the new check look racy to me. The REQ_ATOM_STARTED bit can
> be changed concurrently by blk_mq_start_request(), __blk_mq_finish_request()

blk_mq_start_request() and __blk_mq_finish_request() won't be run concurrently.

From the view of __blk_mq_finish_request():

- if it is run from the merge-queue I/O path (blk_mq_merge_queue_io()),
blk_mq_start_request() can't be run at all, and the COMPLETE flag
is kept at its previous value (zero)

- if it is run from the normal completion path, the COMPLETE flag is cleared
before the req/tag is released to the tag set.

so there isn't a race between blk_mq_start_request() and __blk_mq_finish_request()
wrt. timeout.

> or __blk_mq_requeue_request(). Another issue with this function is that the

__blk_mq_requeue_request() can be run from two paths:

- dispatch failure, in which case the req/tag isn't released to the tag set

- the I/O completion path, in which the COMPLETE flag is cleared before requeue.

so I can't see a race with timeout in the case of starting a rq vs. requeuing it.

> request passed to this function can be reinitialized concurrently. Sorry but

Yes, that is possible, but rq->atomic_flags is kept, and in that case,
when we handle the timeout, COMPLETE is cleared before releasing the rq/tag to
the tag set via blk_mark_rq_complete(), so we won't complete that req twice.
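
The "grab" protocol described above can be modeled with C11 atomics (an
illustrative sketch, not the kernel code): COMPLETE acts as the arbiter,
so whichever of the timeout handler or the completion path sets it first
owns the request.

#include <stdatomic.h>
#include <stdbool.h>

enum { ATOM_COMPLETE = 0, ATOM_STARTED = 1 };

/* models blk_mark_rq_complete(): returns true if COMPLETE was already
 * set, i.e. the other side won the race */
static bool mark_rq_complete(atomic_ulong *flags)
{
	return atomic_fetch_or(flags, 1UL << ATOM_COMPLETE) &
		(1UL << ATOM_COMPLETE);
}

/* timeout side: un-started requests are skipped entirely; otherwise
 * only the winner of the COMPLETE race runs the timeout handler */
static void timeout_check(atomic_ulong *flags, bool expired)
{
	if (!(atomic_load(flags) & (1UL << ATOM_STARTED)))
		return;
	if (expired && !mark_rq_complete(flags)) {
		/* we own the request now: time it out exactly once */
	}
}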



Thanks,
Ming

2017-03-15 12:41:55

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH 1/2] blk-mq: don't complete un-started request in timeout handler

On Wed, Mar 15, 2017 at 08:18:53PM +0800, Ming Lei wrote:
> On Wed, Mar 15, 2017 at 12:07:37AM +0000, Bart Van Assche wrote:
>
> > or __blk_mq_requeue_request(). Another issue with this function is that the
>
> __blk_mq_requeue_request() can be run from two pathes:
>
> - dispatch failure, in which case the req/tag isn't released to tag set
>
> - IO completion path, in which COMPLETE flag is cleared before requeue.
>
> so I can't see races with timeout in case of start rq vs. requeue rq.

Actually the rq/tag won't be released to the tag set if it is going to be
requeued, so the timeout race has nothing to do with requeue.

Thanks,
Ming

2017-03-15 14:11:55

by Yi Zhang

[permalink] [raw]
Subject: Re: [PATCH 1/2] blk-mq: don't complete un-started request in timeout handler

Thanks Ming.

Tested-by: Yi Zhang <[email protected]>

Best Regards,
Yi Zhang


----- Original Message -----
From: "Ming Lei" <[email protected]>
To: "Jens Axboe" <[email protected]>, [email protected], [email protected], "Christoph Hellwig" <[email protected]>
Cc: "Yi Zhang" <[email protected]>, "Ming Lei" <[email protected]>, [email protected]
Sent: Thursday, March 9, 2017 9:02:57 PM
Subject: [PATCH 1/2] blk-mq: don't complete un-started request in timeout handler


2017-03-15 15:37:02

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH 1/2] blk-mq: don't complete un-started request in timeout handler

On Wed, 2017-03-15 at 20:40 +0800, Ming Lei wrote:
> On Wed, Mar 15, 2017 at 08:18:53PM +0800, Ming Lei wrote:
> > On Wed, Mar 15, 2017 at 12:07:37AM +0000, Bart Van Assche wrote:
> >
> > > or __blk_mq_requeue_request(). Another issue with this function is that the
> >
> > __blk_mq_requeue_request() can be run from two pathes:
> >
> > - dispatch failure, in which case the req/tag isn't released to tag set
> >
> > - IO completion path, in which COMPLETE flag is cleared before requeue.
> >
> > so I can't see races with timeout in case of start rq vs. requeue rq.
>
> Actually rq/tag won't be released to tag set if it will be requeued, so
> the timeout race is nothing to do with requeue.

Hello Ming,

Please have another look at __blk_mq_requeue_request(). In that function
the following code occurs: if (test_and_clear_bit(REQ_ATOM_STARTED,
&rq->atomic_flags)) { ... }

I think the REQ_ATOM_STARTED check in blk_mq_check_expired() races with the
test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags) call in
__blk_mq_requeue_request().

Bart.

2017-03-15 16:23:45

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH 1/2] blk-mq: don't complete un-started request in timeout handler

On Wed, Mar 15, 2017 at 03:36:31PM +0000, Bart Van Assche wrote:
> On Wed, 2017-03-15 at 20:40 +0800, Ming Lei wrote:
> > On Wed, Mar 15, 2017 at 08:18:53PM +0800, Ming Lei wrote:
> > > On Wed, Mar 15, 2017 at 12:07:37AM +0000, Bart Van Assche wrote:
> > >
> > > > or __blk_mq_requeue_request(). Another issue with this function is that the
> > >
> > > __blk_mq_requeue_request() can be run from two pathes:
> > >
> > > - dispatch failure, in which case the req/tag isn't released to tag set
> > >
> > > - IO completion path, in which COMPLETE flag is cleared before requeue.
> > >
> > > so I can't see races with timeout in case of start rq vs. requeue rq.
> >
> > Actually rq/tag won't be released to tag set if it will be requeued, so
> > the timeout race is nothing to do with requeue.
>
> Hello Ming,
>
> Please have another look at __blk_mq_requeue_request(). In that function
> the following code occurs: if (test_and_clear_bit(REQ_ATOM_STARTED,
> &rq->atomic_flags)) { ... }
>
> I think the REQ_ATOM_STARTED check in blk_mq_check_expired() races with the
> test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags) call in
> __blk_mq_requeue_request().

OK, this race should only exist when the requeue happens after a dispatch
busy, because the COMPLETE flag isn't set. If the requeue is from I/O
completion, there is no such race because the COMPLETE flag is set.

One solution I thought of is to call blk_mark_rq_complete() before requeuing
when a dispatch busy happens, but that looks a bit silly. Another way is to
set the STARTED flag just after .queue_rq returns BLK_MQ_RQ_QUEUE_OK, which
looks reasonable too. Any comments on the 2nd solution?


Thanks,
Ming

2017-03-15 16:46:46

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH 1/2] blk-mq: don't complete un-started request in timeout handler

On Thu, Mar 16, 2017 at 12:22:01AM +0800, Ming Lei wrote:
> On Wed, Mar 15, 2017 at 03:36:31PM +0000, Bart Van Assche wrote:
>
> OK, this race should only exist in case that the requeue happens after dispatch
> busy, because COMPLETE flag isn't set. And if the requeue is from io completion,
> no such race because COMPLETE flag is set.
>
> One solution I thought of is to call blk_mark_rq_complete() before requeuing
> when dispatch busy happened, but that looks a bit silly. Another way is to
> set STARTED flag just after .queue_rq returns BLK_MQ_RQ_QUEUE_OK, which looks
> reasonable too. Any comments on the 2nd solution?

Actually this can't happen, because rq->deadline has just been set
in blk_mq_start_request(), called from .queue_rq, so it won't trigger
timeout handling even if STARTED is observed as true in blk_mq_check_expired(),
because the timeout period is usually set big enough. So it is still safe, isn't it?

But this situation should have been commented on.

Thanks,
Ming

2017-03-15 21:35:04

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH 1/2] blk-mq: don't complete un-started request in timeout handler

On Wed, 2017-03-15 at 20:18 +0800, Ming Lei wrote:
> On Wed, Mar 15, 2017 at 12:07:37AM +0000, Bart Van Assche wrote:
> > Both the old and the new check look racy to me. The REQ_ATOM_STARTED bit can
> > be changed concurrently by blk_mq_start_request(), __blk_mq_finish_request()
>
> blk_mq_start_request() and __blk_mq_finish_request() won't be run concurrently.
>
> From view of __blk_mq_finish_request():
>
> - if it is run from merge queue io path(blk_mq_merge_queue_io()),
> blk_mq_start_request() can't be run at all, and the COMPLETE flag
> is kept as previous value(zero)
>
> - if it is run from normal complete path, COMPLETE flag is cleared
> before the req/tag is released to tag set.
>
> so there isn't race in blk_mq_start_request() vs. __blk_mq_finish_request()
> wrt. timeout.
>
> > or __blk_mq_requeue_request(). Another issue with this function is that the
>
> __blk_mq_requeue_request() can be run from two pathes:
>
> - dispatch failure, in which case the req/tag isn't released to tag set
>
> - IO completion path, in which COMPLETE flag is cleared before requeue.
>
> so I can't see races with timeout in case of start rq vs. requeue rq.
>
> > request passed to this function can be reinitialized concurrently.

Hello Ming,

You misinterpreted what I wrote. I was referring to the manipulation of
REQ_ATOM_STARTED from different contexts, not to what you explained.

Bart.

2017-03-15 21:35:31

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH 1/2] blk-mq: don't complete un-started request in timeout handler

On Thu, 2017-03-16 at 00:22 +0800, Ming Lei wrote:
> On Wed, Mar 15, 2017 at 03:36:31PM +0000, Bart Van Assche wrote:
> > Please have another look at __blk_mq_requeue_request(). In that function
> > the following code occurs: if (test_and_clear_bit(REQ_ATOM_STARTED,
> > &rq->atomic_flags)) { ... }
> >
> > I think the REQ_ATOM_STARTED check in blk_mq_check_expired() races with the
> > test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags) call in
> > __blk_mq_requeue_request().
>
> OK, this race should only exist in case that the requeue happens after dispatch
> busy, because COMPLETE flag isn't set. And if the requeue is from io completion,
> no such race because COMPLETE flag is set.
>
> One solution I thought of is to call blk_mark_rq_complete() before requeuing
> when dispatch busy happened, but that looks a bit silly. Another way is to
> set STARTED flag just after .queue_rq returns BLK_MQ_RQ_QUEUE_OK, which looks
> reasonable too. Any comments on the 2nd solution?

Sorry but I don't think that would be sufficient. There are several other
scenarios that have not been mentioned above, e.g. that a tag gets freed and
reused after the blk_mq_check_expired() call started and before that function
has had the chance to examine any members of struct request. What is needed in
blk_mq_check_expired() is the following as a single atomic operation:
* Check whether REQ_ATOM_STARTED has been set.
* Check whether REQ_ATOM_COMPLETE has not yet been set.
* If both conditions have been met, set REQ_ATOM_COMPLETE.
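
In C11-atomics terms, those three steps collapse into a single
compare-and-swap; a minimal sketch under that assumption (illustrative,
not the kernel patch that follows below):

#include <stdatomic.h>
#include <stdbool.h>

enum rq_state { REQ_NOT_STARTED, REQ_STARTED, REQ_COMPLETE };

/* Succeeds, and transitions the request to "complete", only if the
 * request is currently "started"; any other state is left untouched. */
static bool mark_complete_if_started(_Atomic enum rq_state *state)
{
	enum rq_state expected = REQ_STARTED;
	return atomic_compare_exchange_strong(state, &expected, REQ_COMPLETE);
}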

I don't think there is another solution than using a single state variable to
represent the REQ_ATOM_STARTED and REQ_ATOM_COMPLETE states instead of two
independent bits. How about the patch below?

Thanks,

Bart.


[PATCH] blk-mq: Fix request state manipulation races

Use a single state variable instead of two separate bits
REQ_ATOM_STARTED and REQ_ATOM_COMPLETE. For blk-mq, make
__blk_mq_finish_request() perform the transition from "complete" to
"not complete" instead of blk_mq_start_request(). Make sure that
blk_mq_check_expired() uses a single atomic operation to test the
"started" and "complete" states and to set the "complete" state.
---
 block/blk-core.c           |  6 +++--
 block/blk-mq.c             | 67 +++++++++++++++++-----------------------------
 block/blk-timeout.c        |  2 +-
 block/blk.h                | 21 ++++++++++-----
 drivers/scsi/virtio_scsi.c |  2 +-
 include/linux/blkdev.h     |  1 +
 6 files changed, 47 insertions(+), 52 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 0eeb99ef654f..dff857d2b86b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1305,7 +1305,7 @@ EXPORT_SYMBOL(blk_get_request);
 void blk_requeue_request(struct request_queue *q, struct request *rq)
 {
 	blk_delete_timer(rq);
-	blk_clear_rq_complete(rq);
+	atomic_set(&rq->state, REQ_NOT_STARTED);
 	trace_block_rq_requeue(q, rq);
 	wbt_requeue(q->rq_wb, &rq->issue_stat);
 
@@ -2477,7 +2477,9 @@ void blk_start_request(struct request *req)
 		wbt_issue(req->q->rq_wb, &req->issue_stat);
 	}
 
-	BUG_ON(test_bit(REQ_ATOM_COMPLETE, &req->atomic_flags));
+	WARN_ONCE(atomic_read(&req->state) != REQ_NOT_STARTED,
+		  "unexpected request state %d != %d\n",
+		  atomic_read(&req->state), REQ_NOT_STARTED);
 	blk_add_timer(req);
 }
 EXPORT_SYMBOL(blk_start_request);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 159187a28d66..fe73d5a1ffc3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -343,7 +343,7 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 	wbt_done(q->rq_wb, &rq->issue_stat);
 	rq->rq_flags = 0;
 
-	clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
+	atomic_set(&rq->state, REQ_NOT_STARTED);
 	clear_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags);
 	if (rq->tag != -1)
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
@@ -479,7 +479,7 @@ EXPORT_SYMBOL(blk_mq_complete_request);
 
 int blk_mq_request_started(struct request *rq)
 {
-	return test_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
+	return atomic_read(&rq->state) == REQ_STARTED;
 }
 EXPORT_SYMBOL_GPL(blk_mq_request_started);
 
@@ -505,16 +505,10 @@ void blk_mq_start_request(struct request *rq)
 	 */
 	smp_mb__before_atomic();
 
-	/*
-	 * Mark us as started and clear complete. Complete might have been
-	 * set if requeue raced with timeout, which then marked it as
-	 * complete. So be sure to clear complete again when we start
-	 * the request, otherwise we'll ignore the completion event.
-	 */
-	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
-		set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
-	if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
-		clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
+	WARN_ONCE(atomic_read(&rq->state) != REQ_NOT_STARTED,
+		  "unexpected request state %d != %d\n",
+		  atomic_read(&rq->state), REQ_NOT_STARTED);
+	atomic_set(&rq->state, REQ_STARTED);
 
 	if (q->dma_drain_size && blk_rq_bytes(rq)) {
 		/*
@@ -530,12 +524,14 @@ EXPORT_SYMBOL(blk_mq_start_request);
 static void __blk_mq_requeue_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
+	enum rq_state prev;
 
 	trace_block_rq_requeue(q, rq);
 	wbt_requeue(q->rq_wb, &rq->issue_stat);
 	blk_mq_sched_requeue_request(rq);
 
-	if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
+	prev = atomic_xchg(&rq->state, REQ_NOT_STARTED);
+	if (prev != REQ_NOT_STARTED) {
 		if (q->dma_drain_size && blk_rq_bytes(rq))
 			rq->nr_phys_segments--;
 	}
@@ -661,17 +657,7 @@ void blk_mq_rq_timed_out(struct request *req, bool reserved)
 	const struct blk_mq_ops *ops = req->q->mq_ops;
 	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
 
-	/*
-	 * We know that complete is set at this point. If STARTED isn't set
-	 * anymore, then the request isn't active and the "timeout" should
-	 * just be ignored. This can happen due to the bitflag ordering.
-	 * Timeout first checks if STARTED is set, and if it is, assumes
-	 * the request is active. But if we race with completion, then
-	 * we both flags will get cleared. So check here again, and ignore
-	 * a timeout event with a request that isn't active.
-	 */
-	if (!test_bit(REQ_ATOM_STARTED, &req->atomic_flags))
-		return;
+	WARN_ON_ONCE(atomic_read(&req->state) != REQ_COMPLETE);
 
 	if (ops->timeout)
 		ret = ops->timeout(req, reserved);
@@ -682,7 +668,7 @@ void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		break;
 	case BLK_EH_RESET_TIMER:
 		blk_add_timer(req);
-		blk_clear_rq_complete(req);
+		atomic_set(&req->state, REQ_STARTED);
 		break;
 	case BLK_EH_NOT_HANDLED:
 		break;
@@ -692,27 +678,24 @@ void blk_mq_rq_timed_out(struct request *req, bool reserved)
 	}
 }
 
+/*
+ * Check whether or not a request has expired. This function can execute
+ * concurrently with other functions that change the request state. This can
+ * result in returning a deadline (blk_mq_timeout_data.next) that occurs
+ * before a request times out. However, this is harmless because the next
+ * call of blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data) will
+ * yield the correct timeout, unless the same race occurs again.
+ */
 static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
 	struct blk_mq_timeout_data *data = priv;
 
-	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
-		/*
-		 * If a request wasn't started before the queue was
-		 * marked dying, kill it here or it'll go unnoticed.
-		 */
-		if (unlikely(blk_queue_dying(rq->q))) {
-			rq->errors = -EIO;
-			blk_mq_end_request(rq, rq->errors);
-		}
-		return;
-	}
-
-	if (time_after_eq(jiffies, rq->deadline)) {
-		if (!blk_mark_rq_complete(rq))
-			blk_mq_rq_timed_out(rq, reserved);
-	} else if (!data->next_set || time_after(data->next, rq->deadline)) {
+	if (time_after_eq(jiffies, rq->deadline) &&
+	    !blk_mark_rq_complete_if_started(rq)) {
+		blk_mq_rq_timed_out(rq, reserved);
+	} else if ((!data->next_set || time_after(data->next, rq->deadline)) &&
+		   blk_mq_request_started(rq)) {
 		data->next = rq->deadline;
 		data->next_set = 1;
 	}
@@ -2821,7 +2804,7 @@ static bool blk_mq_poll_hybrid_sleep(struct request_queue *q,
 
 	hrtimer_init_sleeper(&hs, current);
 	do {
-		if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
+		if (atomic_read(&rq->state) == REQ_COMPLETE)
 			break;
 		set_current_state(TASK_UNINTERRUPTIBLE);
 		hrtimer_start_expires(&hs.timer, mode);
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index a30441a200c0..9a8b44ebfb99 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -94,7 +94,7 @@ static void blk_rq_timed_out(struct request *req)
 		break;
 	case BLK_EH_RESET_TIMER:
 		blk_add_timer(req);
-		blk_clear_rq_complete(req);
+		atomic_set(&req->state, REQ_NOT_STARTED);
 		break;
 	case BLK_EH_NOT_HANDLED:
 		/*
diff --git a/block/blk.h b/block/blk.h
index d1ea4bd9b9a3..8af5fe21e85f 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -115,23 +115,32 @@ void blk_account_io_done(struct request *req);
  * Internal atomic flags for request handling
  */
 enum rq_atomic_flags {
-	REQ_ATOM_COMPLETE = 0,
-	REQ_ATOM_STARTED,
 	REQ_ATOM_POLL_SLEPT,
 };
 
 /*
+ * Request states. Note: REQ_STARTED is only used by the blk-mq code.
+ */
+enum rq_state {
+	REQ_NOT_STARTED,
+	REQ_STARTED,
+	REQ_COMPLETE,
+};
+
+/*
  * EH timer and IO completion will both attempt to 'grab' the request, make
- * sure that only one of them succeeds
+ * sure that only one of them succeeds. The return value 0 means that this
+ * function changed the request state from "not complete" into "complete".
  */
 static inline int blk_mark_rq_complete(struct request *rq)
 {
-	return test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
+	return atomic_xchg(&rq->state, REQ_COMPLETE) == REQ_COMPLETE;
 }
 
-static inline void blk_clear_rq_complete(struct request *rq)
+static inline int blk_mark_rq_complete_if_started(struct request *rq)
 {
-	clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
+	return atomic_cmpxchg(&rq->state, REQ_STARTED, REQ_COMPLETE) !=
+		REQ_STARTED;
 }
 
 /*
diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 939c47df73fa..136379097131 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -672,7 +672,7 @@ static int virtscsi_tmf(struct virtio_scsi *vscsi, struct virtio_scsi_cmd *cmd)
 	 *
 	 * In the abort case, sc->scsi_done will do nothing, because
 	 * the block layer must have detected a timeout and as a result
-	 * REQ_ATOM_COMPLETE has been set.
+	 * rq->state == REQ_COMPLETED.
 	 */
 	virtscsi_poll_requests(vscsi);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5a7da607ca04..b286887b095e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -142,6 +142,7 @@ struct request {
 
 	int internal_tag;
 
+	atomic_t state;
 	unsigned long atomic_flags;
 
 	/* the following two fields are internal, NEVER access directly */
--
2.12.0

2017-03-15 23:41:56

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH 1/2] blk-mq: don't complete un-started request in timeout handler

On Wed, Mar 15, 2017 at 09:34:50PM +0000, Bart Van Assche wrote:
> On Wed, 2017-03-15 at 20:18 +0800, Ming Lei wrote:
> > On Wed, Mar 15, 2017 at 12:07:37AM +0000, Bart Van Assche wrote:
> > > Both the old and the new check look racy to me. The REQ_ATOM_STARTED bit can
> > > be changed concurrently by blk_mq_start_request(), __blk_mq_finish_request()
> >
> > blk_mq_start_request() and __blk_mq_finish_request() won't be run concurrently.
> >
> > From view of __blk_mq_finish_request():
> >
> > - if it is run from merge queue io path(blk_mq_merge_queue_io()),
> > blk_mq_start_request() can't be run at all, and the COMPLETE flag
> > is kept as previous value(zero)
> >
> > - if it is run from normal complete path, COMPLETE flag is cleared
> > before the req/tag is released to tag set.
> >
> > so there isn't race in blk_mq_start_request() vs. __blk_mq_finish_request()
> > wrt. timeout.
> >
> > > or __blk_mq_requeue_request(). Another issue with this function is that the
> >
> > __blk_mq_requeue_request() can be run from two pathes:
> >
> > - dispatch failure, in which case the req/tag isn't released to tag set
> >
> > - IO completion path, in which COMPLETE flag is cleared before requeue.
> >
> > so I can't see races with timeout in case of start rq vs. requeue rq.
> >
> > > request passed to this function can be reinitialized concurrently.
>
> Hello Ming,
>
> You misinterpret what I wrote. I was referring to manipulation of
> REQ_ATOM_STARTED from different contexts and not to what you explained.

This patch addresses one race between the timeout handler and pre-queuing I/O
in the block layer (before .queue_rq); please focus on this patch. That part
is definitely correct.

Also, I am happy to discuss these kinds of races, but maybe we can do that in
another thread if you can describe the issue clearly.


--
Ming

2017-03-16 00:08:00

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH 1/2] blk-mq: don't complete un-started request in timeout handler

On Wed, Mar 15, 2017 at 09:35:03PM +0000, Bart Van Assche wrote:
> On Thu, 2017-03-16 at 00:22 +0800, Ming Lei wrote:
> > On Wed, Mar 15, 2017 at 03:36:31PM +0000, Bart Van Assche wrote:
> > > Please have another look at __blk_mq_requeue_request(). In that function
> > > the following code occurs: if (test_and_clear_bit(REQ_ATOM_STARTED,
> > > &rq->atomic_flags)) { ... }
> > >
> > > I think the REQ_ATOM_STARTED check in blk_mq_check_expired() races with the
> > > test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags) call in
> > > __blk_mq_requeue_request().
> >
> > OK, this race should only exist in case that the requeue happens after dispatch
> > busy, because COMPLETE flag isn't set. And if the requeue is from io completion,
> > no such race because COMPLETE flag is set.
> >
> > One solution I thought of is to call blk_mark_rq_complete() before requeuing
> > when dispatch busy happened, but that looks a bit silly. Another way is to
> > set STARTED flag just after .queue_rq returns BLK_MQ_RQ_QUEUE_OK, which looks
> > reasonable too. Any comments on the 2nd solution?
>
> Sorry but I don't think that would be sufficient. There are several other
> scenarios that have not been mentioned above, e.g. that a tag gets freed and
> reused after the blk_mq_check_expired() call started and before that function
> has had the chance to examine any members of struct request. What is needed in
> blk_mq_check_expired() is the following as a single atomic operation:

We have dealt with this by checking COMPLETE & rq->deadline together in
blk_mq_check_expired() already:

- if the new rq->deadline (set in the reuse path) has been observed by the
later check in blk_mq_check_expired(), it won't be timed out because of the timing.

- if the new rq->deadline (set in the reuse path) hasn't been observed by the
later check in blk_mq_check_expired(), that means the COMPLETE flag isn't set
yet in the reuse path, because we have a barrier to enforce the ordering in
blk_mq_start_request(), so it won't be timed out either (a sketch of this
ordering follows below).

So let me know what the real race is between freeing/reusing vs. timeout.
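
The ordering being relied on can be modeled with release/acquire
semantics (a hedged userspace sketch; the kernel uses
smp_mb__before_atomic() in blk_mq_start_request() rather than C11
atomics, and struct rq_model is made up for the sketch):

#include <stdatomic.h>
#include <stdbool.h>

struct rq_model {
	atomic_long deadline;           /* jiffies when the request expires */
	atomic_bool started;            /* models REQ_ATOM_STARTED */
};

/* reuse path: publish the fresh deadline before STARTED becomes visible */
static void start_request(struct rq_model *rq, long new_deadline)
{
	atomic_store_explicit(&rq->deadline, new_deadline,
			      memory_order_relaxed);
	atomic_store_explicit(&rq->started, true, memory_order_release);
}

/* timeout path: observing STARTED (acquire) guarantees the fresh
 * deadline is also observed, and a fresh deadline can't look expired */
static bool maybe_expired(struct rq_model *rq, long now)
{
	if (!atomic_load_explicit(&rq->started, memory_order_acquire))
		return false;
	return now >= atomic_load_explicit(&rq->deadline,
					   memory_order_relaxed);
}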

> * Check whether REQ_ATOM_STARTED has been set.
> * Check whether REQ_ATOM_COMPLETE has not yet been set.
> * If both conditions have been met, set REQ_ATOM_COMPLETE.
>
> I don't think there is another solution than using a single state variable to
> represent the REQ_ATOM_STARTED and REQ_ATOM_COMPLETE states instead of two
> independent bits. How about the patch below?

I would review it if you can confirm to me that it is a real issue, :-)

Thanks,
Ming

2017-03-16 21:36:19

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH 1/2] blk-mq: don't complete un-started request in timeout handler

On Thu, 2017-03-16 at 08:07 +0800, Ming Lei wrote:
> > * Check whether REQ_ATOM_STARTED has been set.
> > * Check whether REQ_ATOM_COMPLETE has not yet been set.
> > * If both conditions have been met, set REQ_ATOM_COMPLETE.
> >
> > I don't think there is another solution than using a single state variable to
> > represent the REQ_ATOM_STARTED and REQ_ATOM_COMPLETE states instead of two
> > independent bits. How about the patch below?
>
> I would review it if you can confirm me that it is a real issue, :-)

Hello Ming,

I was chasing a v4.11 regression in the SCSI stack. Since my tests of today
revealed that it's probably not a block layer issue, let's proceed with your
patch.

Bart.

2017-03-16 21:38:30

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH 1/2] blk-mq: don't complete un-started request in timeout handler

On Thu, 2017-03-09 at 21:02 +0800, Ming Lei wrote:
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 159187a28d66..0aff380099d5 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -697,17 +697,8 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
> {
> struct blk_mq_timeout_data *data = priv;
>
> - if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
> - /*
> - * If a request wasn't started before the queue was
> - * marked dying, kill it here or it'll go unnoticed.
> - */
> - if (unlikely(blk_queue_dying(rq->q))) {
> - rq->errors = -EIO;
> - blk_mq_end_request(rq, rq->errors);
> - }
> + if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
> return;
> - }
>
> if (time_after_eq(jiffies, rq->deadline)) {
> if (!blk_mark_rq_complete(rq))

Reviewed-by: Bart Van Assche <[email protected]>

2017-03-17 00:07:46

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH 1/2] blk-mq: don't complete un-started request in timeout handler

On Fri, Mar 17, 2017 at 5:35 AM, Bart Van Assche
<[email protected]> wrote:
> On Thu, 2017-03-16 at 08:07 +0800, Ming Lei wrote:
>> > * Check whether REQ_ATOM_STARTED has been set.
>> > * Check whether REQ_ATOM_COMPLETE has not yet been set.
>> > * If both conditions have been met, set REQ_ATOM_COMPLETE.
>> >
>> > I don't think there is another solution than using a single state variable to
>> > represent the REQ_ATOM_STARTED and REQ_ATOM_COMPLETE states instead of two
>> > independent bits. How about the patch below?
>>
>> I would review it if you can confirm me that it is a real issue, :-)
>
> Hello Ming,
>
> I was chasing a v4.11 regression in the SCSI stack. Since my tests of today
> revealed that it's probably not a block layer issue, let's proceed with your
> patch.

Yeah, it shouldn't be related to the blk-mq timeout handler, which hasn't
changed much recently. Thanks for your review!



Thanks,
Ming Lei