2018-01-18 02:42:29

by Ming Lei

Subject: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

BLK_STS_RESOURCE can be returned from a driver when any resource
runs out, and the resource may not be related to tags, such as a
failed kmalloc(GFP_ATOMIC). When the queue is idle under this kind
of BLK_STS_RESOURCE, restart can't work any more, so an IO hang may
be caused.

Most drivers may call kmalloc(GFP_ATOMIC) in the IO path, and almost
all of them return BLK_STS_RESOURCE when it fails. For dm-mpath it
may be triggered a bit more easily, since the request pool of the
underlying queue can be exhausted much more easily. In reality it is
still not easy to trigger: I ran all kinds of tests on
dm-mpath/scsi-debug with all kinds of scsi_debug parameters and
couldn't trigger this issue at all. It was finally triggered by
Bart's SRP test, which seems made by genius, :-)

This patch deals with this situation by running the queue again when
the queue is found idle in the timeout handler.

Signed-off-by: Ming Lei <[email protected]>
---

Another approach is to do the check after BLK_STS_RESOURCE is returned
from .queue_rq() and BLK_MQ_S_SCHED_RESTART is set. That way may
introduce a bit of cost in the hot path, and it was actually V1 of this
patch; please see it at the following link:

https://github.com/ming1/linux/commit/68a66900f3647ea6751aab2848b1e5eef508feaa

Or other better ways?

block/blk-mq.c | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 82 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6e3f77829dcc..4d4af8d712da 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -896,6 +896,85 @@ static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
blk_mq_rq_timed_out(rq, reserved);
}

+struct hctx_busy_data {
+ struct blk_mq_hw_ctx *hctx;
+ bool reserved;
+ bool busy;
+};
+
+static bool check_busy_hctx(struct sbitmap *sb, unsigned int bitnr, void *data)
+{
+ struct hctx_busy_data *busy_data = data;
+ struct blk_mq_hw_ctx *hctx = busy_data->hctx;
+ struct request *rq;
+
+ if (busy_data->reserved)
+ bitnr += hctx->tags->nr_reserved_tags;
+
+ rq = hctx->tags->static_rqs[bitnr];
+ if (blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT) {
+ busy_data->busy = true;
+ return false;
+ }
+ return true;
+}
+
+/* Check if there is any in-flight request */
+static bool blk_mq_hctx_is_busy(struct blk_mq_hw_ctx *hctx)
+{
+ struct hctx_busy_data data = {
+ .hctx = hctx,
+ .busy = false,
+ .reserved = true,
+ };
+
+ sbitmap_for_each_set(&hctx->tags->breserved_tags.sb,
+ check_busy_hctx, &data);
+ if (data.busy)
+ return true;
+
+ data.reserved = false;
+ sbitmap_for_each_set(&hctx->tags->bitmap_tags.sb,
+ check_busy_hctx, &data);
+ if (data.busy)
+ return true;
+
+ return false;
+}
+
+static void blk_mq_fixup_restart(struct blk_mq_hw_ctx *hctx)
+{
+ if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
+ bool busy;
+
+ /*
+ * If this hctx is still marked as RESTART, and there
+ * aren't any in-flight requests, we have to run the
+ * queue here to prevent IO from hanging.
+ *
+ * BLK_STS_RESOURCE can be returned from the driver when any
+ * resource runs out, and the resource may not be related
+ * to tags, such as kmalloc(GFP_ATOMIC). When the queue is
+ * idle under this kind of BLK_STS_RESOURCE, restart can't
+ * work any more, so an IO hang may be caused.
+ *
+ * The counterpart of the following barrier is the one
+ * in blk_mq_put_driver_tag() after returning BLK_STS_RESOURCE
+ * from ->queue_rq().
+ */
+ smp_mb();
+
+ busy = blk_mq_hctx_is_busy(hctx);
+ if (!busy) {
+ printk(KERN_WARNING "blk-mq: fixup RESTART\n");
+ printk(KERN_WARNING "\t If this message is shown"
+ " a bit often, please report the issue to"
+ " [email protected]\n");
+ blk_mq_run_hw_queue(hctx, true);
+ }
+ }
+}
+
static void blk_mq_timeout_work(struct work_struct *work)
{
struct request_queue *q =
@@ -966,8 +1045,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
*/
queue_for_each_hw_ctx(q, hctx, i) {
/* the hctx may be unmapped, so check it here */
- if (blk_mq_hw_queue_mapped(hctx))
+ if (blk_mq_hw_queue_mapped(hctx)) {
blk_mq_tag_idle(hctx);
+ blk_mq_fixup_restart(hctx);
+ }
}
}
blk_queue_exit(q);
--
2.9.5



2018-01-18 16:52:05

by Bart Van Assche

Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On 01/17/18 18:41, Ming Lei wrote:
> BLK_STS_RESOURCE can be returned from a driver when any resource
> runs out, and the resource may not be related to tags, such as a
> failed kmalloc(GFP_ATOMIC). When the queue is idle under this kind
> of BLK_STS_RESOURCE, restart can't work any more, so an IO hang may
> be caused.
>
> Most drivers may call kmalloc(GFP_ATOMIC) in the IO path, and almost
> all of them return BLK_STS_RESOURCE when it fails. For dm-mpath it
> may be triggered a bit more easily, since the request pool of the
> underlying queue can be exhausted much more easily. In reality it is
> still not easy to trigger: I ran all kinds of tests on
> dm-mpath/scsi-debug with all kinds of scsi_debug parameters and
> couldn't trigger this issue at all. It was finally triggered by
> Bart's SRP test, which seems made by genius, :-)
>
> [ ... ]
>
> static void blk_mq_timeout_work(struct work_struct *work)
> {
> struct request_queue *q =
> @@ -966,8 +1045,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
> */
> queue_for_each_hw_ctx(q, hctx, i) {
> /* the hctx may be unmapped, so check it here */
> - if (blk_mq_hw_queue_mapped(hctx))
> + if (blk_mq_hw_queue_mapped(hctx)) {
> blk_mq_tag_idle(hctx);
> + blk_mq_fixup_restart(hctx);
> + }
> }
> }
> blk_queue_exit(q);

Hello Ming,

My comments about the above are as follows:
- It can take up to q->rq_timeout jiffies after a .queue_rq()
implementation returned BLK_STS_RESOURCE before blk_mq_timeout_work()
gets called. However, it can happen that the condition that caused
.queue_rq() to return BLK_STS_RESOURCE gets cleared only a few
milliseconds after .queue_rq() returned. So the above approach can
result in long delays during which it will seem like the queue got
stuck. Additionally, I think that the block driver should decide how
long it takes before a queue is rerun and not the block layer core.
- The lockup that I reported only occurs with the dm driver but not any
other block driver. So why modify the block layer core since this
can be fixed by modifying the dm driver?
- A much simpler fix and a fix that is known to work exists, namely
inserting a blk_mq_delay_run_hw_queue() call in the dm driver.
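
For reference, a minimal sketch of what that looks like, assuming it sits
at the end of dm_mq_queue_rq() in drivers/md/dm-rq.c; the 100 ms value is
purely illustrative, not a tuned number:

	/*
	 * Sketch only: if mapping the request failed because a resource
	 * is temporarily unavailable, schedule a future queue run before
	 * returning BLK_STS_RESOURCE. Without this, nothing is guaranteed
	 * to rerun the queue when no IO is in flight on this device.
	 */
	if (map_request(tio) == DM_MAPIO_REQUEUE) {
		/* Undo dm_start_request() before requeuing */
		rq_end_stats(md, rq);
		rq_completed(md, rq_data_dir(rq), false);
		blk_mq_delay_run_hw_queue(hctx, 100 /* ms, illustrative */);
		return BLK_STS_RESOURCE;
	}

Mike's patch later in this thread shows a refined variant of the same call.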

Bart.

2018-01-18 17:05:23

by Mike Snitzer

Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Thu, Jan 18 2018 at 11:50am -0500,
Bart Van Assche <[email protected]> wrote:

> On 01/17/18 18:41, Ming Lei wrote:
> >BLK_STS_RESOURCE can be returned from a driver when any resource
> >runs out, and the resource may not be related to tags, such as a
> >failed kmalloc(GFP_ATOMIC). When the queue is idle under this kind
> >of BLK_STS_RESOURCE, restart can't work any more, so an IO hang may
> >be caused.
> >
> >Most drivers may call kmalloc(GFP_ATOMIC) in the IO path, and almost
> >all of them return BLK_STS_RESOURCE when it fails. For dm-mpath it
> >may be triggered a bit more easily, since the request pool of the
> >underlying queue can be exhausted much more easily. In reality it is
> >still not easy to trigger: I ran all kinds of tests on
> >dm-mpath/scsi-debug with all kinds of scsi_debug parameters and
> >couldn't trigger this issue at all. It was finally triggered by
> >Bart's SRP test, which seems made by genius, :-)
> >
> >[ ... ]
> >
> > static void blk_mq_timeout_work(struct work_struct *work)
> > {
> > struct request_queue *q =
> >@@ -966,8 +1045,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
> > */
> > queue_for_each_hw_ctx(q, hctx, i) {
> > /* the hctx may be unmapped, so check it here */
> >- if (blk_mq_hw_queue_mapped(hctx))
> >+ if (blk_mq_hw_queue_mapped(hctx)) {
> > blk_mq_tag_idle(hctx);
> >+ blk_mq_fixup_restart(hctx);
> >+ }
> > }
> > }
> > blk_queue_exit(q);
>
> Hello Ming,
>
> My comments about the above are as follows:
> - It can take up to q->rq_timeout jiffies after a .queue_rq()
> implementation returned BLK_STS_RESOURCE before blk_mq_timeout_work()
> gets called. However, it can happen that the condition that caused
> .queue_rq() to return BLK_STS_RESOURCE gets cleared only a few
> milliseconds after .queue_rq() returned. So the above approach can
> result in long delays during which it will seem like the queue got
> stuck. Additionally, I think that the block driver should decide how
> long it takes before a queue is rerun and not the block layer core.

So configure q->rq_timeout to be shorter? It is configurable through
blk_mq_tag_set's 'timeout' member and apparently defaults to 30 * HZ.

That is the problem with timeouts, there is generally no one size fits
all.
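
For reference, a minimal sketch of where that knob lives; the 5 * HZ
value and my_mq_ops are illustrative only:

	struct blk_mq_tag_set set = { };

	set.ops          = &my_mq_ops;	/* hypothetical driver ops */
	set.nr_hw_queues = 1;
	set.queue_depth  = 64;
	set.timeout      = 5 * HZ;	/* 0 means the 30 * HZ default */
	ret = blk_mq_alloc_tag_set(&set);

q->rq_timeout is then derived from set.timeout when the request queue is
created.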

> - The lockup that I reported only occurs with the dm driver but not any
> other block driver. So why modify the block layer core since this
> can be fixed by modifying the dm driver?

Hard to know whether it is only DM's blk-mq that is impacted. That is the
only blk-mq driver that you're testing like this (and that is also able to
handle faults, etc).

> - A much simpler fix and a fix that is known to work exists, namely
> inserting a blk_mq_delay_run_hw_queue() call in the dm driver.

Because your "much simpler" fix actively hurts performance, as is
detailed in this header:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=ec3eaf9a673106f66606896aed6ddd20180b02ec

I'm not going to take your bandaid fix given it very much seems to be
papering over a real blk-mq issue.

Mike

2018-01-18 17:23:06

by Bart Van Assche

Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Thu, 2018-01-18 at 12:03 -0500, Mike Snitzer wrote:
> On Thu, Jan 18 2018 at 11:50am -0500,
> Bart Van Assche <[email protected]> wrote:
> > My comments about the above are as follows:
> > - It can take up to q->rq_timeout jiffies after a .queue_rq()
> > implementation returned BLK_STS_RESOURCE before blk_mq_timeout_work()
> > gets called. However, it can happen that the condition that caused
> > .queue_rq() to return BLK_STS_RESOURCE gets cleared only a few
> > milliseconds after .queue_rq() returned. So the above approach can
> > result in long delays during which it will seem like the queue got
> > stuck. Additionally, I think that the block driver should decide how
> > long it takes before a queue is rerun and not the block layer core.
>
> So configure q->rq_timeout to be shorter? It is configurable through
> blk_mq_tag_set's 'timeout' member and apparently defaults to 30 * HZ.
>
> That is the problem with timeouts, there is generally no one size fits
> all.

Sorry but I think that would be wrong. The delay after which a queue is rerun
should not be coupled to the request timeout. These two should be independent.

> > - The lockup that I reported only occurs with the dm driver but not any
> > other block driver. So why modify the block layer core since this
> > can be fixed by modifying the dm driver?
>
> Hard to know whether it is only DM's blk-mq that is impacted. That is the
> only blk-mq driver that you're testing like this (and that is also able to
> handle faults, etc).

That's not correct. I'm also testing the SCSI core, which is one of the most
complicated block drivers.

> > - A much simpler fix and a fix that is known to work exists, namely
> > inserting a blk_mq_delay_run_hw_queue() call in the dm driver.
>
> Because your "much simpler" fix actively hurts performance, as is
> detailed in this header:
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=ec3eaf9a673106f66606896aed6ddd20180b02ec

We are close to the start of the merge window so I think it's better to fall
back to an old approach that is known to work than to keep a new approach
that is known not to work. Additionally, the performance issue you referred
to only affects IOPS and bandwidth by more than 1% with the lpfc driver, and that
is because the queue depth it supports is much lower than for other SCSI HBAs,
namely 3 instead of 64.

Thanks,

Bart.

2018-01-18 18:31:37

by Mike Snitzer

Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Thu, Jan 18 2018 at 12:20pm -0500,
Bart Van Assche <[email protected]> wrote:

> On Thu, 2018-01-18 at 12:03 -0500, Mike Snitzer wrote:
> > On Thu, Jan 18 2018 at 11:50am -0500,
> > Bart Van Assche <[email protected]> wrote:
> > > My comments about the above are as follows:
> > > - It can take up to q->rq_timeout jiffies after a .queue_rq()
> > > implementation returned BLK_STS_RESOURCE before blk_mq_timeout_work()
> > > gets called. However, it can happen that the condition that caused
> > > .queue_rq() to return BLK_STS_RESOURCE gets cleared only a few
> > > milliseconds after .queue_rq() returned. So the above approach can
> > > result in long delays during which it will seem like the queue got
> > > stuck. Additionally, I think that the block driver should decide how
> > > long it takes before a queue is rerun and not the block layer core.
> >
> > So configure q->rq_timeout to be shorter? It is configurable through
> > blk_mq_tag_set's 'timeout' member and apparently defaults to 30 * HZ.
> >
> > That is the problem with timeouts, there is generally no one size fits
> > all.
>
> Sorry but I think that would be wrong. The delay after which a queue is rerun
> should not be coupled to the request timeout. These two should be independent.

That's fair. Not saying I think that is a fix anyway.

> > > - The lockup that I reported only occurs with the dm driver but not any
> > > other block driver. So why modify the block layer core since this
> > > can be fixed by modifying the dm driver?
> >
> > Hard to know whether it is only DM's blk-mq that is impacted. That is the
> > only blk-mq driver that you're testing like this (and that is also able to
> > handle faults, etc).
>
> That's not correct. I'm also testing the SCSI core, which is one of the most
> complicated block drivers.

OK, but SCSI mq is part of the problem here. It is a snowflake that
has more exotic reasons for returning BLK_STS_RESOURCE.

> > > - A much simpler fix and a fix that is known to work exists, namely
> > > inserting a blk_mq_delay_run_hw_queue() call in the dm driver.
> >
> > Because your "much simpler" fix actively hurts performance, as is
> > detailed in this header:
> > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=ec3eaf9a673106f66606896aed6ddd20180b02ec
>
> We are close to the start of the merge window so I think it's better to fall
> back to an old approach that is known to work than to keep a new approach
> that is known not to work. Additionally, the performance issue you referred
> to only affects IOPS and bandwidth by more than 1% with the lpfc driver, and that
> is because the queue depth it supports is much lower than for other SCSI HBAs,
> namely 3 instead of 64.

1%!? Where are you getting that number? Ming has detailed more
significant performance gains than 1%.. and not just on lpfc (though you
keep seizing on lpfc because of the low queue_depth of 3).

This is all very tiresome. I'm _really_ not interested in this debate
any more. The specific case that causes the stall needs to be identified
and a real fix needs to be developed. Ming is doing a lot of that hard
work. Please contribute or at least stop pleading for your hack to be
reintroduced.

If at the end of the 4.16 release we still don't have a handle on the
stall you're seeing I'll revisit this and likely revert to blindly
kicking the queue after an arbitrary delay. But I'm willing to let this
issue get more time without papering over it.

Mike

2018-01-18 18:50:00

by Bart Van Assche

Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Thu, 2018-01-18 at 13:30 -0500, Mike Snitzer wrote:
> 1%!? Where are you getting that number? Ming has detailed more
> significant performance gains than 1%.. and not just on lpfc (though you
> keep seizing on lpfc because of the low queue_depth of 3).

That's what I derived from the numbers you posted for null_blk. If Ming has
posted performance results for other drivers than lpfc, please let me know
where I can find these. I have not yet seen these numbers.

> This is all very tiresome.

Yes, this is tiresome. It is very annoying to me that others keep introducing
so many regressions in such important parts of the kernel. It is also annoying
to me that I get blamed if I report a regression instead of seeing that the
regression gets fixed.

Bart.

2018-01-18 20:11:53

by Jens Axboe

Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On 1/18/18 11:47 AM, Bart Van Assche wrote:
>> This is all very tiresome.
>
> Yes, this is tiresome. It is very annoying to me that others keep
> introducing so many regressions in such important parts of the kernel.
> It is also annoying to me that I get blamed if I report a regression
> instead of seeing that the regression gets fixed.

I agree, it sucks that any change there introduces the regression. I'm
fine with doing the delay insert again until a new patch is proven to be
better.

From the original topic of this email, we have conditions that can cause
the driver to not be able to submit an IO. A set of those conditions can
only happen if IO is in flight, and those cases we have covered just
fine. Another set can potentially trigger without IO being in flight.
These are cases where a non-device resource is unavailable at the time
of submission. This might be iommu running out of space, for instance,
or it might be a memory allocation of some sort. For these cases, we
don't get any notification when the shortage clears. All we can do is
ensure that we restart operations at some point in the future. We're SOL
at that point, but we have to ensure that we make forward progress.

That last set of conditions better not be a common occurrence, since
performance is down the toilet at that point. I don't want to introduce
hot path code to rectify it. Have the driver return if that happens in a
way that is DIFFERENT from needing a normal restart. The driver knows if
this is a resource that will become available when IO completes on this
device or not. If we get that return, we have a generic run-again delay.

This basically becomes the same as doing the delay queue thing from DM,
but just in a generic fashion.
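
To sketch that with invented names (nothing below exists in the tree; it
is just the shape of the proposal, with locking elided):

	/*
	 * BLK_STS_NO_DEV_RESOURCE is an invented status a driver would
	 * return when it fails on a resource that is NOT tied to this
	 * device, so no IO completion will ever restart the queue for us.
	 */
	ret = q->mq_ops->queue_rq(hctx, &bd);
	if (ret == BLK_STS_RESOURCE || ret == BLK_STS_NO_DEV_RESOURCE) {
		/* Either way, park the request on the dispatch list
		 * (locking elided); only the rerun guarantee differs. */
		list_add(&rq->queuelist, &hctx->dispatch);
		if (ret == BLK_STS_NO_DEV_RESOURCE)
			/* No completion will ever kick this queue, so
			 * arm a blind run-again after an arbitrary delay;
			 * plain BLK_STS_RESOURCE keeps relying on RESTART. */
			blk_mq_delay_run_hw_queue(hctx, 10 /* ms, arbitrary */);
	}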

--
Jens Axboe


2018-01-18 20:50:05

by Mike Snitzer

Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Thu, Jan 18 2018 at 3:11pm -0500,
Jens Axboe <[email protected]> wrote:

> On 1/18/18 11:47 AM, Bart Van Assche wrote:
> >> This is all very tiresome.
> >
> > Yes, this is tiresome. It is very annoying to me that others keep
> > introducing so many regressions in such important parts of the kernel.
> > It is also annoying to me that I get blamed if I report a regression
> > instead of seeing that the regression gets fixed.
>
> I agree, it sucks that any change there introduces the regression. I'm
> fine with doing the delay insert again until a new patch is proven to be
> better.
>
> From the original topic of this email, we have conditions that can cause
> the driver to not be able to submit an IO. A set of those conditions can
> only happen if IO is in flight, and those cases we have covered just
> fine. Another set can potentially trigger without IO being in flight.
> These are cases where a non-device resource is unavailable at the time
> of submission. This might be iommu running out of space, for instance,
> or it might be a memory allocation of some sort. For these cases, we
> don't get any notification when the shortage clears. All we can do is
> ensure that we restart operations at some point in the future. We're SOL
> at that point, but we have to ensure that we make forward progress.
>
> That last set of conditions better not be a common occurrence, since
> performance is down the toilet at that point. I don't want to introduce
> hot path code to rectify it. Have the driver return if that happens in a
> way that is DIFFERENT from needing a normal restart. The driver knows if
> this is a resource that will become available when IO completes on this
> device or not. If we get that return, we have a generic run-again delay.
>
> This basically becomes the same as doing the delay queue thing from DM,
> but just in a generic fashion.

This is a bit confusing for me (as I see it we have 2 blk-mq drivers
trying to collaborate, so your referring to "driver" lacks precision; but
I could just be missing something)...

For Bart's test the underlying scsi-mq driver is what is regularly
hitting this case in __blk_mq_try_issue_directly():

if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))

It certainly better not be the norm (Bart's test hammering on this aside).

For starters, it'd be very useful to know if Bart is hitting the
blk_mq_hctx_stopped() or blk_queue_quiesced() for this case that is
triggering the use of blk_mq_sched_insert_request() -- I'd wager it is
due to blk_queue_quiesced() but Bart _please_ try to figure it out.

Anyway, in response to this case Bart would like the upper layer dm-mq
driver to call blk_mq_delay_run_hw_queue(). That is certainly quite the hammer.

But that hammer aside, in general for this case, I'm concerned about: is
it really correct to allow an already stopped/quiesced underlying queue
to retain responsibility for processing the request? Or would the
upper-layer dm-mq benefit from being able to retry the request on its
terms (via a "DIFFERENT" return from blk-mq core)?

Like this? The (ab)use of BLK_STS_DM_REQUEUE certainly seems fitting in
this case but...

(Bart please note that this patch applies on linux-dm.git's 'for-next';
which is just a merge of Jens' 4.16 tree and dm-4.16)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 74a4f237ba91..371a1b97bf56 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1781,16 +1781,11 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
struct request_queue *q = rq->q;
bool run_queue = true;

- /*
- * RCU or SRCU read lock is needed before checking quiesced flag.
- *
- * When queue is stopped or quiesced, ignore 'bypass_insert' from
- * blk_mq_request_direct_issue(), and return BLK_STS_OK to caller,
- * and avoid driver to try to dispatch again.
- */
+ /* RCU or SRCU read lock is needed before checking quiesced flag */
if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) {
run_queue = false;
- bypass_insert = false;
+ if (bypass_insert)
+ return BLK_STS_DM_REQUEUE;
goto insert;
}

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index d8519ddd7e1a..2f554ea485c3 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -408,7 +408,7 @@ static blk_status_t dm_dispatch_clone_request(struct request *clone, struct request *rq)

clone->start_time = jiffies;
r = blk_insert_cloned_request(clone->q, clone);
- if (r != BLK_STS_OK && r != BLK_STS_RESOURCE)
+ if (r != BLK_STS_OK && r != BLK_STS_RESOURCE && r != BLK_STS_DM_REQUEUE)
/* must complete clone in terms of original request */
dm_complete_request(rq, r);
return r;
@@ -472,6 +472,7 @@ static void init_tio(struct dm_rq_target_io *tio, struct request *rq,
* Returns:
* DM_MAPIO_* : the request has been processed as indicated
* DM_MAPIO_REQUEUE : the original request needs to be immediately requeued
+ * DM_MAPIO_DELAY_REQUEUE : the original request needs to be requeued after delay
* < 0 : the request was completed due to failure
*/
static int map_request(struct dm_rq_target_io *tio)
@@ -500,11 +501,11 @@ static int map_request(struct dm_rq_target_io *tio)
trace_block_rq_remap(clone->q, clone, disk_devt(dm_disk(md)),
blk_rq_pos(rq));
ret = dm_dispatch_clone_request(clone, rq);
- if (ret == BLK_STS_RESOURCE) {
+ if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DM_REQUEUE) {
blk_rq_unprep_clone(clone);
tio->ti->type->release_clone_rq(clone);
tio->clone = NULL;
- if (!rq->q->mq_ops)
+ if (ret == BLK_STS_DM_REQUEUE || !rq->q->mq_ops)
r = DM_MAPIO_DELAY_REQUEUE;
else
r = DM_MAPIO_REQUEUE;
@@ -741,6 +742,7 @@ static int dm_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
const struct blk_mq_queue_data *bd)
{
+ int r;
struct request *rq = bd->rq;
struct dm_rq_target_io *tio = blk_mq_rq_to_pdu(rq);
struct mapped_device *md = tio->md;
@@ -768,10 +770,13 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
tio->ti = ti;

/* Direct call is fine since .queue_rq allows allocations */
- if (map_request(tio) == DM_MAPIO_REQUEUE) {
+ r = map_request(tio);
+ if (r == DM_MAPIO_REQUEUE || r == DM_MAPIO_DELAY_REQUEUE) {
/* Undo dm_start_request() before requeuing */
rq_end_stats(md, rq);
rq_completed(md, rq_data_dir(rq), false);
+ if (r == DM_MAPIO_DELAY_REQUEUE)
+ blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
return BLK_STS_RESOURCE;
}


2018-01-18 20:59:59

by Bart Van Assche

Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote:
> For Bart's test the underlying scsi-mq driver is what is regularly
> hitting this case in __blk_mq_try_issue_directly():
>
> if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))

Hello Mike,

That code path is not the code path that triggered the lockups that I reported
during the past days. These lockups were all triggered by incorrect handling of
.queue_rq() returning BLK_STS_RESOURCE.

Bart.

2018-01-18 21:26:48

by Mike Snitzer

Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Thu, Jan 18 2018 at 3:58P -0500,
Bart Van Assche <[email protected]> wrote:

> On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote:
> > For Bart's test the underlying scsi-mq driver is what is regularly
> > hitting this case in __blk_mq_try_issue_directly():
> >
> > if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))
>
> Hello Mike,
>
> That code path is not the code path that triggered the lockups that I reported
> during the past days.

If you're hitting blk_mq_sched_insert_request() then you most certainly
are hitting that code path.

If you aren't then what was your earlier email going on about?
https://www.redhat.com/archives/dm-devel/2018-January/msg00372.html

If you were just focusing on that as one possible reason, that isn't
very helpful. By this point you really should _know_ what is triggering
the stall based on the code paths taken. Please use ftrace's
function_graph tracer if need be.

> These lockups were all triggered by incorrect handling of
> .queue_rq() returning BLK_STS_RESOURCE.

Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE?
"Incorrect" because it no longer runs blk_mq_delay_run_hw_queue()?

Please try to do more work analyzing the test case that only you can
easily run (due to srp_test being a PITA). And less time lobbying for
a change that you don't understand to _really_ be correct.

We have time to get this right, please stop hyperventilating about
"regressions".

Thanks,
Mike

2018-01-18 21:39:52

by Laurence Oberman

Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Thu, 2018-01-18 at 16:23 -0500, Mike Snitzer wrote:
> On Thu, Jan 18 2018 at  3:58P -0500,
> Bart Van Assche <[email protected]> wrote:
>
> > On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote:
> > > For Bart's test the underlying scsi-mq driver is what is
> > > regularly
> > > hitting this case in __blk_mq_try_issue_directly():
> > >
> > >         if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))
> >
> > Hello Mike,
> >
> > That code path is not the code path that triggered the lockups that
> > I reported
> > during the past days.
>
> If you're hitting blk_mq_sched_insert_request() then you most
> certainly
> are hitting that code path.
>
> If you aren't then what was your earlier email going on about?
> https://www.redhat.com/archives/dm-devel/2018-January/msg00372.html
>
> If you were just focusing on that as one possible reason, that isn't
> very helpful.  By this point you really should _know_ what is
> triggering
> the stall based on the code paths taken.  Please use ftrace's
> function_graph tracer if need be.
>
> > These lockups were all triggered by incorrect handling of
> > .queue_rq() returning BLK_STS_RESOURCE.
>
> Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE?
> "Incorrect" because it no longer runs blk_mq_delay_run_hw_queue()?
>
> Please try to do more work analyzing the test case that only you can
> easily run (due to srp_test being a PITA).  And less time lobbying
> for
> a change that you don't understand to _really_ be correct.
>
> We have time to get this right, please stop hyperventilating about
> "regressions".
>
> Thanks,
> Mike

Hello Bart
I have run a good few loops of 02-mq and it's stable for me on your
tree.
I am not running the entire disconnect/re-connect loops and un-mounts
etc. for good reason.
I have 35 LUNS so it's very impactful to lose them and have them come
back all the time.

Anyway
I am very happy to try to reproduce this in-house so Mike and Ming can
focus on it, but I need to know if all I need to do is loop over 02-mq
over and over.

Also please let me know what debugfs and sysfs data to capture, and I am
happy to try to help move this along.

Regards
Laurence



2018-01-18 21:40:54

by Bart Van Assche

Subject: Re: [dm-devel] [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Thu, 2018-01-18 at 16:23 -0500, Mike Snitzer wrote:
> On Thu, Jan 18 2018 at 3:58P -0500,
> Bart Van Assche <[email protected]> wrote:
>
> > On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote:
> > > For Bart's test the underlying scsi-mq driver is what is regularly
> > > hitting this case in __blk_mq_try_issue_directly():
> > >
> > > if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))
> >
> > These lockups were all triggered by incorrect handling of
> > .queue_rq() returning BLK_STS_RESOURCE.
>
> Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE?
> "Incorrect" because it no longer runs blk_mq_delay_run_hw_queue()?

In what I wrote I was referring to both dm_mq_queue_rq() and scsi_queue_rq().
With "incorrect" I meant that queue lockups are introduced that make user
space processes unkillable. That's a severe bug.

> Please try to do more work analyzing the test case that only you can
> easily run (due to srp_test being a PITA).

It is not correct that I'm the only one who is able to run that software.
Anyone who is willing to merge the latest SRP initiator and target driver
patches in his or her tree can run that software in
any VM. I'm working hard
on getting the patches upstream that make it possible to run the srp-test
software on a setup that is not equipped with InfiniBand hardware.

> We have time to get this right, please stop hyperventilating about
> "regressions".

Sorry Mike but that's something I consider as an unfair comment. If Ming and
you work on patches together, it's your job to make sure that no regressions
are introduced. Instead of blaming me because I report these regressions you
should be grateful that I take the time and effort to report these regressions
early. And since you are employed by a large organization that sells Linux
support services, your employer should invest in developing test cases that
reach a higher coverage of the dm, SCSI and block layer code. I don't think
that it's normal that my tests discovered several issues that were not
discovered by Red Hat's internal test suite. That's something Red Hat has to
address.

Bart.

2018-01-18 22:04:34

by Mike Snitzer

Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Thu, Jan 18 2018 at 4:39pm -0500,
Bart Van Assche <[email protected]> wrote:

> On Thu, 2018-01-18 at 16:23 -0500, Mike Snitzer wrote:
> > On Thu, Jan 18 2018 at 3:58P -0500,
> > Bart Van Assche <[email protected]> wrote:
> >
> > > On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote:
> > > > For Bart's test the underlying scsi-mq driver is what is regularly
> > > > hitting this case in __blk_mq_try_issue_directly():
> > > >
> > > > if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))
> > >
> > > These lockups were all triggered by incorrect handling of
> > > .queue_rq() returning BLK_STS_RESOURCE.
> >
> > Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE?
> > "Incorrect" because it no longer runs blk_mq_delay_run_hw_queue()?
>
> In what I wrote I was referring to both dm_mq_queue_rq() and scsi_queue_rq().
> With "incorrect" I meant that queue lockups are introduced that make user
> space processes unkillable. That's a severe bug.

And yet Laurence cannot reproduce any such lockups with your test...

Are you absolutely certain this patch doesn't help you?
https://patchwork.kernel.org/patch/10174037/

If it doesn't then that is actually very useful to know.

> > We have time to get this right, please stop hyperventilating about
> > "regressions".
>
> Sorry Mike but that's something I consider as an unfair comment. If Ming and
> you work on patches together, it's your job to make sure that no regressions
> are introduced. Instead of blaming me because I report these regressions you
> should be grateful that I take the time and effort to report these regressions
> early. And since you are employed by a large organization that sells Linux
> support services, your employer should invest in developing test cases that
> reach a higher coverage of the dm, SCSI and block layer code. I don't think
> that it's normal that my tests discovered several issues that were not
> discovered by Red Hat's internal test suite. That's something Red Hat has to
> address.

You have no self-awareness of just how myopic you're being about this.

I'm not ignoring or blaming you for your test no longer passing. Far
from it. I very much want to fix this. But I want it fixed in a way
that doesn't paper over the real bug(s) while also introducing blind
queue runs that compromise the blk-mq RESTART code's ability to work as
intended.

I'd have thought you could appreciate this. We need a root cause on
this, not hand-waving justifications on why problematic delayed queue
runs are correct.

Please just focus on helping Laurence get his very capable testbed to
reproduce this issue. Once we can reproduce these "unkillable" "stalls"
in-house it'll be _much_ easier to analyze and fix.

Thanks,
Mike

2018-01-18 22:20:16

by Laurence Oberman

Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Thu, 2018-01-18 at 17:01 -0500, Mike Snitzer wrote:
> On Thu, Jan 18 2018 at  4:39pm -0500,
> Bart Van Assche <[email protected]> wrote:
>
> > On Thu, 2018-01-18 at 16:23 -0500, Mike Snitzer wrote:
> > > On Thu, Jan 18 2018 at  3:58P -0500,
> > > Bart Van Assche <[email protected]> wrote:
> > >
> > > > On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote:
> > > > > For Bart's test the underlying scsi-mq driver is what is
> > > > > regularly
> > > > > hitting this case in __blk_mq_try_issue_directly():
> > > > >
> > > > >         if (blk_mq_hctx_stopped(hctx) ||
> > > > > blk_queue_quiesced(q))
> > > >
> > > > These lockups were all triggered by incorrect handling of
> > > > .queue_rq() returning BLK_STS_RESOURCE.
> > >
> > > Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE?
> > > "Incorrect" because it no longer runs
> > > blk_mq_delay_run_hw_queue()?
> >
> > In what I wrote I was referring to both dm_mq_queue_rq() and
> > scsi_queue_rq().
> > With "incorrect" I meant that queue lockups are introduced that
> > make user
> > space processes unkillable. That's a severe bug.
>
> And yet Laurence cannot reproduce any such lockups with your test...
>
> Are you absolutely certain this patch doesn't help you?
> https://patchwork.kernel.org/patch/10174037/
>
> If it doesn't then that is actually very useful to know.
>
> > > We have time to get this right, please stop hyperventilating
> > > about
> > > "regressions".
> >
> > Sorry Mike but that's something I consider as an unfair comment. If
> > Ming and
> > you work on patches together, it's your job to make sure that no
> > regressions
> > are introduced. Instead of blaming me because I report these
> > regressions you
> > should be grateful that I take the time and effort to report these
> > regressions
> > early. And since you are employed by a large organization that
> > sells Linux
> > support services, your employer should invest in developing test
> > cases that
> > reach a higher coverage of the dm, SCSI and block layer code. I
> > don't think
> > that it's normal that my tests discovered several issues that were
> > not
> > discovered by Red Hat's internal test suite. That's something Red
> > Hat has to
> > address.
>
> You have no self-awareness of just how myopic you're being about
> this.
>
> I'm not ignoring or blaming you for your test no longer passing.  Far
> from it.  I very much want to fix this.  But I want it fixed in a way
> that doesn't paper over the real bug(s) while also introducing blind
> queue runs that compromise the blk-mq RESTART code's ability to work
> as
> intended.
>
> I'd have thought you could appreciate this.  We need a root cause on
> this, not hand-waving justifications on why problematic delayed queue
> runs are correct.
>
> Please just focus on helping Laurence get his very capable testbed to
> reproduce this issue.  Once we can reproduce these "unkillable"
> "stalls"
> in-house it'll be _much_ easier to analyze and fix.
>
> Thanks,
> Mike

Hello Bart

OK, I ran 5 at once on 5 separate mount points.
I am using 4k block sizes.
It's solid and consistent for me. No stalls, no gaps.

[global]
name=02-mq
filename=fio-output-02-mq.txt
rw=randwrite
verify=md5
;rwmixread=60
;rwmixwrite=40
bs=4K
;direct=1
;numjobs=4
;time_based=1
runtime=120

[file1]
size=3G
ioengine=libaio
iodepth=16

I watch I/O and I see it ramp up but fio still runs and it kind of
shuts down.

#!/bin/bash
for i in 1 2 3 4 5
do
cd /data-$i 
fio /root/srp-test.lobe/fio_tests/02-mq 1>/data-$i.out 2>&1 &
done


#Time     cpu sys inter  ctxsw KBRead  Reads KBWrit Writes 

17:16:09   34  12 34431  10393      0      0 249872  36094 
17:16:10   31  10 13001   2679      0      0  57652   7980 
17:16:11   32  11 19473   4416      0      0 143248  17362 
17:16:12   31   9  8068   1606      0      0  20088   2026 
17:16:13   31   9  7718   1518      0      0  15908   1354 
17:16:14   36  14 41132   9710      0      0 686216  46661 
17:16:15   39  18 63622  21033      0      0  1246K  75108 
17:16:16   38  16 76114   9013      0      0  1146K  82338 
17:16:17   33  11 31941   2735      0      0 237340  25034 
17:16:18   36  14 23900   4982      0      1  1567K  43244 
17:16:19   37  15 55884   4974      0      4  1003K  67253 
17:16:20   28  12  7542   4975      0      0  1630K   3266 
17:16:21    8   6 14650  34721      0      0  2535K  21206 
17:16:22    2   2  6655  15839      0      0  2319K   9897 
17:16:23   11  11 37305 134030      0      0  1275K  87010 
17:16:24   23  22 59266 119350   6560   1640  1102K  78034 
17:16:25   21  17 65699 144462 148052  37013 159900  22120 
17:16:26   14   9 80700 164034 216588  54147      4      1 
#         <----CPU[HYPER]-----><----------Disks----------->
#Time     cpu sys inter  ctxsw KBRead  Reads KBWrit Writes 
17:16:27   14   9 78699 162095 214412  53603      0      0 
17:16:28   14   9 75895 155352 204932  51233      0      0 
17:16:29   14   9 75377 161871 214136  53534      0      0 
17:16:30   16  11 79298 155967 206164  51541      0      0 
17:16:31   14   8 82281 165266 218652  54662      0      0 
17:16:32   14   9 88635 163302 215876  53970      0      0 
17:16:33   14   8 88755 168958 223568  55892      0      0 
17:16:34   15   9 82073 154321 203936  50984      0      0 
17:16:35   17  10 87004 161467 213308  53327      0      0 
17:16:36   15   9 86198 161247 212940  53235      4      1 
17:16:37   15   9 83429 162890 215276  53819      0      0 
17:16:38   16  10 83272 160611 212264  53066     20      2 
17:16:39   16  10 93850 168179 222376  55594      0      0 
17:16:40   15   9 86801 167507 221372  55343      0      0 
17:16:41   14   8 82688 171249 226560  56640      0      0 
17:16:42   14   8 86202 159171 210344  52586      0      0 
17:16:43   16  10 86063 156009 206328  51582      0      0 
17:16:44   16  10 90927 153961 203584  50896      0      0 
17:16:45   15   9 95084 156370 206844  51710    132     13 
17:16:46   14   8 85572 158469 209300  52326      4      1 
17:16:47   14   8 92075 158469 209600  52400      0      0 
17:16:48   16  10 95469 154772 204176  51044      0      0 
#         <----CPU[HYPER]-----><----------Disks----------->
#Time     cpu sys inter  ctxsw KBRead  Reads KBWrit Writes 
17:16:49   15  10 93856 154182 203568  50897      5      5 
17:16:50   15  10 92308 151964 200680  50170      0      0 
17:16:51   16  10 94035 155076 204868  51217      0      0 
17:16:52   15   9 93949 164969 217980  54495      0      0 
17:16:53   14   9 85744 149553 197428  49357      0      0 
17:16:54   14   9 83064 155788 205684  51421      0      0 
17:16:55   16  10 83579 166084 218480  54620      0      0 
17:16:56   15   9 74766 158728 208556  52139      4      1 
17:16:57   14   9 82944 162720 215116  53779      0      0 
17:16:58   12   7 88529 163814 216440  54110      0      0 
17:16:59   14   7 85936 166723 220372  55093      4      1 
17:17:00   13   8 86701 153728 203052  50763      0      0 
17:17:01   14   9 92859 155963 205972  51493      0      0 
17:17:02   13   8 85624 143746 189756  47439      0      0 
17:17:03   13   8 85029 142540 188044  47010      0      0 
17:17:04   14   8 87291 144790 191148  47788      0      0 
17:17:05   14   8 81922 144342 190544  47636      0      0 
17:17:06   14   8 81587 161043 212576  53144      8      2 
17:17:07   15   8 85080 142285 187688  46922      0      0 
17:17:08   13   8 86980 150246 197852  49463      0      0 
17:17:09   12   8 81388 144335 190312  47578      4      1 
17:17:10   13   8 88436 146957 194040  48510      0      0 
#         <----CPU[HYPER]-----><----------Disks----------->
#Time     cpu sys inter  ctxsw KBRead  Reads KBWrit Writes 
17:17:11   11   7 78997 137847 181828  45457      0      0 
17:17:12   12   8 85709 148847 196708  49177      0      0 
17:17:13   17  10 97228 170646 225528  56382      0      0 
17:17:14   14   9 96032 173474 229400  57350      0      0 
17:17:15   15   9 95340 167951 221848  55462      0      0 
17:17:16   16  10 87392 157489 207748  51937     38      5 
17:17:17   16  10 84756 152753 201716  50429      0      0 
17:17:18   17  10 85109 153000 202080  50520      0      0 
17:17:19   16  10 92109 164514 217396  54349     80      5 
17:17:20   15   9 91342 155674 205804  51451      0      0 
17:17:21   12   7 73921 126577 167060  41765      0      0 
17:17:22   13   8 76916 126701 167176  41794      0      0 
17:17:23   13   8 85187 151074 199768  49942      0      0 
17:17:24   15   9 85737 153664 203168  50792      0      0 
17:17:25   15   9 87113 147636 195084  48771      0      0 
17:17:26   14   8 89440 157433 208116  52029      4      1 
17:17:27   14   9 95290 172224 227036  56759      0      0 
17:17:28   15  10 93918 167611 220164  55041      0      0 
17:17:29   13   8 82404 149274 197312  49328      0      0 
17:17:30   10   7 75961 138276 182668  45667      0      0 
17:17:31   12   8 84899 151829 200764  50191      0      0 
17:17:32   13   8 81386 141091 186444  46611      0      0 
17:17:33   15   9  100K 167489 221640  55410      0      0 
17:17:34   14   9 89601 158765 210032  52508      0      0 
17:17:35   14   9 92440 155287 205216  51304      0      0 
17:17:36   15  10 90042 155934 206168  51542      4      1 
17:17:37   14   9 91670 164456 217684  54421      0      0 
17:17:38   14   9 78612 167368 221348  55337      0      0 
17:17:39   15   9 91325 162599 214616  53654      0      0 
17:17:40   14   9 79112 151595 199984  50000     21      8 
17:17:41    9   3 32565  60936  79832  19964     32     12 



2018-01-18 22:22:01

by Bart Van Assche

Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Thu, 2018-01-18 at 17:01 -0500, Mike Snitzer wrote:
> And yet Laurence cannot reproduce any such lockups with your test...

Hmm ... maybe I misunderstood Laurence but I don't think that Laurence has
already succeeded at running an unmodified version of my tests. In one of the
e-mails Laurence sent me this morning I read that he modified these scripts
to get past a kernel module unload failure that was reported while starting
these tests. So the next step is to check which changes were made to the test
scripts and also whether the test results are still valid.

> Are you absolutely certain this patch doesn't help you?
> https://patchwork.kernel.org/patch/10174037/
>
> If it doesn't then that is actually very useful to know.

The first thing I tried this morning was to run the srp-test software against a merge
of Jens' for-next branch and your dm-4.16 branch. Since I noticed that the dm
queue locked up I reinserted a blk_mq_delay_run_hw_queue() call in the dm code.
Since even that was not sufficient I tried to kick the queues via debugfs (for
s in /sys/kernel/debug/block/*/state; do echo kick >$s; done). Since that was
not sufficient to resolve the queue stall I reverted the following three patches
that are in Jens' tree:
* "blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback"
* "blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request"
* "blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue is busy"

Only after I had done this the srp-test software ran again without triggering
dm queue lockups. Sorry but I have not yet had the time to test patch "[RFC]
blk-mq: fixup RESTART when queue becomes idle".

> Please just focus on helping Laurence get his very capable testbed to
> reproduce this issue. Once we can reproduce these "unkillable" "stalls"
> in-house it'll be _much_ easier to analyze and fix.

OK, I will work with Laurence on this. Maybe Laurence and I should work on this
before analyzing the lockup that was mentioned above further?

Bart.

2018-01-18 22:25:30

by Bart Van Assche

Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Thu, 2018-01-18 at 17:18 -0500, Laurence Oberman wrote:
> OK, I ran 5 at once on 5 separate mount points.
> I am using 4k block sizes.
> It's solid and consistent for me. No stalls, no gaps.

Hi Laurence,

That's great news and thank you for having shared this information but I think
it should be mentioned that you have been running my tree in which some recent
block and dm patches were reverted
(https://github.com/bvanassche/linux/tree/block-scsi-for-next)

>
> [global]
> name=02-mq
> filename=fio-output-02-mq.txt
> rw=randwrite
> verify=md5
> ;rwmixread=60
> ;rwmixwrite=40
> bs=4K
> ;direct=1
> ;numjobs=4
> ;time_based=1
> runtime=120
>
> [file1]
> size=3G
> ioengine=libaio
> iodepth=16
>
> I watch I/O and I see it ramp up but fio still runs and it kind of
> shuts down.

In test "file1" I see an I/O size of 3G. Does that mean that the patch that
should fix the sgl_alloc() issue is working?

Thanks,

Bart.

2018-01-18 22:38:06

by Laurence Oberman

Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Thu, 2018-01-18 at 22:24 +0000, Bart Van Assche wrote:
> On Thu, 2018-01-18 at 17:18 -0500, Laurence Oberman wrote:
> > OK, I ran 5 at once on 5 separate mount points.
> > I am using 4k block sizes.
> > It's solid and consistent for me. No stalls, no gaps.
>
> Hi Laurence,
>
> That's great news and thank you for having shared this information
> but I think
> it should be mentioned that you have been running my tree in which
> some recent
> block and dm patches were reverted
> (https://github.com/bvanassche/linux/tree/block-scsi-for-next)
>
> >
> > [global]
> > name=02-mq
> > filename=fio-output-02-mq.txt
> > rw=randwrite
> > verify=md5
> > ;rwmixread=60
> > ;rwmixwrite=40
> > bs=4K
> > ;direct=1
> > ;numjobs=4
> > ;time_based=1
> > runtime=120
> >
> > [file1]
> > size=3G
> > ioengine=libaio
> > iodepth=16
> >
> > I watch I/O and I see it ramp up but fio still runs and it kind of
> > shuts down.
>
> In test "file1" I see an I/O size of 3G. Does that mean that the
> patch that
> should fix the sgl_alloc() issue is working?
>
> Thanks,
>
> Bart.

Hello Bart, thank you.

OK, so booting into Mike's tree now, and then hopefully I get the lockups.
Can you give me some idea of what to look for?
I assume I/O just stops.

I want to get this happening in-house so we have an avenue to fix this.
Following getting the stall I will attend to the sgl patch test for the
SRPT side.

Please note I am not running the latest SRPT right now, as you know.

Regards
Back later with results


2018-01-18 22:41:10

by Jens Axboe

Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On 1/18/18 3:35 PM, Laurence Oberman wrote:
> On Thu, 2018-01-18 at 22:24 +0000, Bart Van Assche wrote:
>> On Thu, 2018-01-18 at 17:18 -0500, Laurence Oberman wrote:
>>> OK, I ran 5 at once on 5 separate mount points.
>>> I am using 4k block sizes.
>>> It's solid and consistent for me. No stalls, no gaps.
>>
>> Hi Laurence,
>>
>> That's great news and thank you for having shared this information
>> but I think
>> it should be mentioned that you have been running my tree in which
>> some recent
>> block and dm patches were reverted
>> (https://github.com/bvanassche/linux/tree/block-scsi-for-next)
>>
>>>
>>> [global]
>>> name=02-mq
>>> filename=fio-output-02-mq.txt
>>> rw=randwrite
>>> verify=md5
>>> ;rwmixread=60
>>> ;rwmixwrite=40
>>> bs=4K
>>> ;direct=1
>>> ;numjobs=4
>>> ;time_based=1
>>> runtime=120
>>>
>>> [file1]
>>> size=3G
>>> ioengine=libaio
>>> iodepth=16
>>>
>>> I watch I/O and I see it ramp up but fio still runs and it kind of
>>> shuts down.
>>
>> In test "file1" I see an I/O size of 3G. Does that mean that the
>> patch that
>> should fix the sgl_alloc() issue is working?
>>
>> Thanks,
>>
>> Bart.
>
> Hello Bart, thank you.
>
> OK, so booting into Mike's tree now, and then hopefully I get the lockups.
> Can you give me some idea of what to look for?
> I assume I/O just stops.
>
> I want to get this happening in-house so we have an avenue to fix this.
> Following getting the stall I will attend to the sgl patch test for the
> SRPT side.
>
> Please note I am not running the latest SRPT right now, as you know.

When you do have a solid test case, please please submit a blktests
test case for it! This needs to be something we can run regularly in
testing.

--
Jens Axboe


2018-01-18 22:56:10

by Bart Van Assche

Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Thu, 2018-01-18 at 15:39 -0700, Jens Axboe wrote:
> When you do have a solid test case, please please submit a blktests
> test case for it! This needs to be something we can run regularly in
> testing.

Hello Jens,

That sounds like a good idea to me. BTW, I think the reason why so far I
can reproduce these queue stalls more easily than others is that I modified
the SRP initiator to make it easy to cause the .get_budget() call to succeed
and scsi_queue_rq() to return BLK_STS_BUSY. A possible next step is to apply
a similar change to the scsi_debug driver. The change I made is as follows:

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 0c887ebfbc64..7f3c4a197425 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -3135,6 +3135,16 @@ static int srp_reset_host(struct scsi_cmnd *scmnd)
return srp_reconnect_rport(target->rport) == 0 ? SUCCESS : FAILED;
}

+static int srp_target_alloc(struct scsi_target *starget)
+{
+ struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
+ struct srp_target_port *target = host_to_target(shost);
+
+ if (target->target_can_queue)
+ starget->can_queue = target->target_can_queue;
+ return 0;
+}
+
static int srp_slave_alloc(struct scsi_device *sdev)
{
struct Scsi_Host *shost = sdev->host;
@@ -3348,6 +3358,7 @@ static struct scsi_host_template srp_template = {
.module = THIS_MODULE,
.name = "InfiniBand SRP initiator",
.proc_name = DRV_NAME,
+ .target_alloc = srp_target_alloc,
.slave_alloc = srp_slave_alloc,
.slave_configure = srp_slave_configure,
.info = srp_target_info,
@@ -3515,6 +3526,7 @@ enum {
SRP_OPT_QUEUE_SIZE = 1 << 14,
SRP_OPT_IP_SRC = 1 << 15,
SRP_OPT_IP_DEST = 1 << 16,
+ SRP_OPT_TARGET_CAN_QUEUE= 1 << 17,
};

static unsigned int srp_opt_mandatory[] = {
@@ -3536,6 +3548,7 @@ static const match_table_t srp_opt_tokens = {
{ SRP_OPT_SERVICE_ID, "service_id=%s" },
{ SRP_OPT_MAX_SECT, "max_sect=%d" },
{ SRP_OPT_MAX_CMD_PER_LUN, "max_cmd_per_lun=%d" },
+ { SRP_OPT_TARGET_CAN_QUEUE, "target_can_queue=%d" },
{ SRP_OPT_IO_CLASS, "io_class=%x" },
{ SRP_OPT_INITIATOR_EXT, "initiator_ext=%s" },
{ SRP_OPT_CMD_SG_ENTRIES, "cmd_sg_entries=%u" },
@@ -3724,6 +3737,15 @@ static int srp_parse_options(struct net *net, const char *buf,
target->scsi_host->cmd_per_lun = token;
break;

+ case SRP_OPT_TARGET_CAN_QUEUE:
+ if (match_int(args, &token) || token < 1) {
+ pr_warn("bad max target_can_queue parameter '%s'\n",
+ p);
+ goto out;
+ }
+ target->target_can_queue = token;
+ break;
+
case SRP_OPT_IO_CLASS:
if (match_hex(args, &token)) {
pr_warn("bad IO class parameter '%s'\n", p);
diff --git a/drivers/infiniband/ulp/srp/ib_srp.h b/drivers/infiniband/ulp/srp/ib_srp.h
index d66c9057d5ea..70334fa3de8e 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.h
+++ b/drivers/infiniband/ulp/srp/ib_srp.h
@@ -216,6 +216,7 @@ struct srp_target_port {
char target_name[32];
unsigned int scsi_id;
unsigned int sg_tablesize;
+ unsigned int target_can_queue;
int mr_pool_size;
int mr_per_cmd;
int queue_size;

Bart.

2018-01-19 02:34:45

by Ming Lei

Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Thu, Jan 18, 2018 at 01:11:01PM -0700, Jens Axboe wrote:
> On 1/18/18 11:47 AM, Bart Van Assche wrote:
> >> This is all very tiresome.
> >
> > Yes, this is tiresome. It is very annoying to me that others keep
> > introducing so many regressions in such important parts of the kernel.
> > It is also annoying to me that I get blamed if I report a regression
> > instead of seeing that the regression gets fixed.
>
> I agree, it sucks that any change there introduces the regression. I'm
> fine with doing the delay insert again until a new patch is proven to be
> better.

That way is still buggy, as I explained, since rerunning the queue before
adding the request to hctx->dispatch_list isn't correct. Who can make sure
the request is visible when __blk_mq_run_hw_queue() is called?

Not to mention this way will cause a performance regression again.
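
To spell out the window being described (my reading; the interleaving is
illustrative):

	/*
	 *   CPU0: submission path           CPU1: delayed queue run
	 *   ---------------------           -----------------------
	 *   blk_mq_delay_run_hw_queue()
	 *                                   __blk_mq_run_hw_queue()
	 *                                     dispatch list looks
	 *                                     empty -> does nothing
	 *   list_add(&rq->queuelist,
	 *            &hctx->dispatch);
	 *
	 * The request now sits on ->dispatch with no further run
	 * scheduled. Adding the request to the dispatch list before
	 * kicking the queue (or ordering the two with barriers) closes
	 * the window.
	 */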

>
> From the original topic of this email, we have conditions that can cause
> the driver to not be able to submit an IO. A set of those conditions can
> only happen if IO is in flight, and those cases we have covered just
> fine. Another set can potentially trigger without IO being in flight.
> These are cases where a non-device resource is unavailable at the time
> of submission. This might be iommu running out of space, for instance,
> or it might be a memory allocation of some sort. For these cases, we
> don't get any notification when the shortage clears. All we can do is
> ensure that we restart operations at some point in the future. We're SOL
> at that point, but we have to ensure that we make forward progress.

Right, it is a generic issue, not a DM-specific one; almost all drivers
call kmalloc(GFP_ATOMIC) in the IO path.

IMO, there is enough time for figuring out a generic solution before
4.16 release.

>
> That last set of conditions better not be a common occurrence, since
> performance is down the toilet at that point. I don't want to introduce
> hot path code to rectify it. Have the driver return if that happens in a
> way that is DIFFERENT from needing a normal restart. The driver knows if
> this is a resource that will become available when IO completes on this
> device or not. If we get that return, we have a generic run-again delay.

Now most of the time both NVMe and SCSI won't return BLK_STS_RESOURCE,
and it should be DM only which returns BLK_STS_RESOURCE so often.

>
> This basically becomes the same as doing the delay queue thing from DM,
> but just in a generic fashion.

Yeah, it is right.

--
Ming

2018-01-19 04:04:56

by Jens Axboe

[permalink] [raw]
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On 1/18/18 7:32 PM, Ming Lei wrote:
> On Thu, Jan 18, 2018 at 01:11:01PM -0700, Jens Axboe wrote:
>> On 1/18/18 11:47 AM, Bart Van Assche wrote:
>>>> This is all very tiresome.
>>>
>>> Yes, this is tiresome. It is very annoying to me that others keep
>>> introducing so many regressions in such important parts of the kernel.
>>> It is also annoying to me that I get blamed if I report a regression
>>> instead of seeing that the regression gets fixed.
>>
>> I agree, it sucks that any change there introduces the regression. I'm
>> fine with doing the delay insert again until a new patch is proven to be
>> better.
>
> That way is still buggy, as I explained: rerunning the queue before adding
> the request to hctx->dispatch_list isn't correct. Who can make sure the
> request is visible when __blk_mq_run_hw_queue() is called?

That race basically doesn't exist for a 10ms gap.

> Not to mention that this way will cause a performance regression again.

How so? It's _exactly_ the same as what you are proposing, except mine
will potentially run the queue when it need not do so. But given that
these are random 10ms queue kicks because we are screwed, it should not
matter. The key point is that it only should be if we have NO better
options. If it's a frequently occurring event that we have to return
BLK_STS_RESOURCE, then we need to get a way to register an event for
when that condition clears. That event will then kick the necessary
queue(s).

>> From the original topic of this email, we have conditions that can cause
>> the driver to not be able to submit an IO. A set of those conditions can
>> only happen if IO is in flight, and those cases we have covered just
>> fine. Another set can potentially trigger without IO being in flight.
>> These are cases where a non-device resource is unavailable at the time
>> of submission. This might be iommu running out of space, for instance,
>> or it might be a memory allocation of some sort. For these cases, we
>> don't get any notification when the shortage clears. All we can do is
>> ensure that we restart operations at some point in the future. We're SOL
>> at that point, but we have to ensure that we make forward progress.
>
> Right, it is a generic issue, not a DM-specific one; almost all drivers
> call kmalloc(GFP_ATOMIC) in the IO path.

GFP_ATOMIC basically never fails, unless we are out of memory. The
exception is higher order allocations. If a driver has a higher order
atomic allocation in its IO path, the device driver writer needs to be
taken out behind the barn and shot. Simple as that. It will NEVER work
well in a production environment. Witness the disaster that so many NIC
driver writers have learned.

This is NOT the case we care about here. It's resources that are more
readily depleted because other devices are using them. If it's a high
frequency or generally occurring event, then we simply must have a
callback to restart the queue from that. The condition then becomes
identical to device private starvation, the only difference being from
where we restart the queue.

> IMO, there is enough time to figure out a generic solution before the
> 4.16 release.

I would hope so, but the proposed solutions have not filled me with
a lot of confidence in the end result so far.

>> That last set of conditions had better not be a common occurrence, since
>> performance is down the toilet at that point. I don't want to introduce
>> hot path code to rectify it. Have the driver return if that happens in a
>> way that is DIFFERENT from needing a normal restart. The driver knows if
>> this is a resource that will become available when IO completes on this
>> device or not. If we get that return, we have a generic run-again delay.
>
> Most of the time neither NVMe nor SCSI returns BLK_STS_RESOURCE; it
> should be only DM that returns STS_RESOURCE so often.

Where does the dm STS_RESOURCE error usually come from - what exact
resource are we running out of?

--
Jens Axboe


2018-01-19 05:12:08

by Bart Van Assche

[permalink] [raw]
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Fri, 2018-01-19 at 10:32 +0800, Ming Lei wrote:
> Most of the time neither NVMe nor SCSI returns BLK_STS_RESOURCE; it
> should be only DM that returns STS_RESOURCE so often.

That's wrong at least for SCSI. See also https://marc.info/?l=linux-block&m=151578329417076.

Bart.

2018-01-19 07:29:10

by Ming Lei

[permalink] [raw]
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Thu, Jan 18, 2018 at 09:02:45PM -0700, Jens Axboe wrote:
> On 1/18/18 7:32 PM, Ming Lei wrote:
> > On Thu, Jan 18, 2018 at 01:11:01PM -0700, Jens Axboe wrote:
> >> On 1/18/18 11:47 AM, Bart Van Assche wrote:
> >>>> This is all very tiresome.
> >>>
> >>> Yes, this is tiresome. It is very annoying to me that others keep
> >>> introducing so many regressions in such important parts of the kernel.
> >>> It is also annoying to me that I get blamed if I report a regression
> >>> instead of seeing that the regression gets fixed.
> >>
> >> I agree, it sucks that any change there introduces the regression. I'm
> >> fine with doing the delay insert again until a new patch is proven to be
> >> better.
> >
> > That way is still buggy, as I explained: rerunning the queue before adding
> > the request to hctx->dispatch_list isn't correct. Who can make sure the
> > request is visible when __blk_mq_run_hw_queue() is called?
>
> That race basically doesn't exist for a 10ms gap.
>
> > Not to mention that this way will cause a performance regression again.
>
> How so? It's _exactly_ the same as what you are proposing, except mine
> will potentially run the queue when it need not do so. But given that
> these are random 10ms queue kicks because we are screwed, it should not
> matter. The key point is that it only should be if we have NO better
> options. If it's a frequently occurring event that we have to return
> BLK_STS_RESOURCE, then we need to get a way to register an event for
> when that condition clears. That event will then kick the necessary
> queue(s).

Please see queue_delayed_work_on(): hctx->run_work is shared by all
scheduling, so once blk_mq_delay_run_hw_queue(100ms) returns, no new
scheduling can make progress during those 100ms.
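To see why, a minimal sketch of those semantics (illustrative only;
kblockd_workqueue is actually private to blk-core and is named here just
for the example):

static void kick_now_vs_later(int cpu, struct blk_mq_hw_ctx *hctx)
{
	/* First call queues a run of the hw queue 100ms from now. */
	queue_delayed_work_on(cpu, kblockd_workqueue, &hctx->run_work,
			      msecs_to_jiffies(100));

	/*
	 * Second call is a no-op because the work item is already
	 * pending: the "run now" request has to wait out the full
	 * 100ms.  mod_delayed_work_on() would instead re-arm the
	 * pending timer to fire immediately.
	 */
	queue_delayed_work_on(cpu, kblockd_workqueue, &hctx->run_work, 0);
}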

>
> >> From the original topic of this email, we have conditions that can cause
> >> the driver to not be able to submit an IO. A set of those conditions can
> >> only happen if IO is in flight, and those cases we have covered just
> >> fine. Another set can potentially trigger without IO being in flight.
> >> These are cases where a non-device resource is unavailable at the time
> >> of submission. This might be iommu running out of space, for instance,
> >> or it might be a memory allocation of some sort. For these cases, we
> >> don't get any notification when the shortage clears. All we can do is
> >> ensure that we restart operations at some point in the future. We're SOL
> >> at that point, but we have to ensure that we make forward progress.
> >
> > Right, it is a generic issue, not a DM-specific one; almost all drivers
> > call kmalloc(GFP_ATOMIC) in the IO path.
>
> GFP_ATOMIC basically never fails, unless we are out of memory. The

I guess GFP_KERNEL may never fail, but a GFP_ATOMIC failure is possible;
it is mentioned[1] that there is code like the following in the mm
allocation path, and OOM can happen too.

	if ((some randomly generated condition) && (request is atomic))
		return NULL;

[1] https://lwn.net/Articles/276731/

> exception is higher order allocations. If a driver has a higher order
> atomic allocation in its IO path, the device driver writer needs to be
> taken out behind the barn and shot. Simple as that. It will NEVER work
> well in a production environment. Witness the disaster that so many NIC
> driver writers have learned.
>
> This is NOT the case we care about here. It's resources that are more
> readily depleted because other devices are using them. If it's a high
> frequency or generally occurring event, then we simply must have a
> callback to restart the queue from that. The condition then becomes
> identical to device private starvation, the only difference being from
> where we restart the queue.
>
> > IMO, there is enough time to figure out a generic solution before the
> > 4.16 release.
>
> I would hope so, but the proposed solutions have not filled me with
> a lot of confidence in the end result so far.
>
> >> That last set of conditions had better not be a common occurrence, since
> >> performance is down the toilet at that point. I don't want to introduce
> >> hot path code to rectify it. Have the driver return if that happens in a
> >> way that is DIFFERENT from needing a normal restart. The driver knows if
> >> this is a resource that will become available when IO completes on this
> >> device or not. If we get that return, we have a generic run-again delay.
> >
> > Most of the time neither NVMe nor SCSI returns BLK_STS_RESOURCE; it
> > should be only DM that returns STS_RESOURCE so often.
>
> Where does the dm STS_RESOURCE error usually come from - what exact
> resource are we running out of?

It is from blk_get_request(underlying queue), see multipath_clone_and_map().

Thanks,
Ming

2018-01-19 07:35:43

by Ming Lei

[permalink] [raw]
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Fri, Jan 19, 2018 at 05:09:46AM +0000, Bart Van Assche wrote:
> On Fri, 2018-01-19 at 10:32 +0800, Ming Lei wrote:
> > Most of the time neither NVMe nor SCSI returns BLK_STS_RESOURCE; it
> > should be only DM that returns STS_RESOURCE so often.
>
> That's wrong at least for SCSI. See also https://marc.info/?l=linux-block&m=151578329417076.
>

> For other scenarios, e.g. if a SCSI initiator submits a
> SCSI request over a fabric and the SCSI target replies with "BUSY" then the

Could you explain a bit when a SCSI target replies with BUSY very often?

Inside the initiator, we already limit the max per-LUN and per-host
requests before calling .queue_rq().
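As a rough sketch of that throttling (loosely modelled on
scsi_dev_queue_ready() and scsi_host_queue_ready(); the busy accounting
here is simplified and is not the real code):

static bool scsi_can_issue(struct scsi_device *sdev, struct Scsi_Host *shost)
{
	/* Bound the number of outstanding commands per LUN... */
	if (atomic_inc_return(&sdev->device_busy) > sdev->queue_depth) {
		atomic_dec(&sdev->device_busy);
		return false;		/* surfaces as BLK_STS_RESOURCE */
	}

	/* ...and per host. */
	if (atomic_inc_return(&shost->host_busy) > shost->can_queue) {
		atomic_dec(&shost->host_busy);
		atomic_dec(&sdev->device_busy);
		return false;
	}

	return true;
}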

> SCSI core will end the I/O request with status BLK_STS_RESOURCE after the
> maximum number of retries has been reached (see also scsi_io_completion()).
> In that last case, if a SCSI target sends a "BUSY" reply over the wire back
> to the initiator, there is no other approach for the SCSI initiator to
> figure out whether it can queue another request than to resubmit the
> request. The worst possible strategy is to resubmit a request immediately
> because that will cause a significant fraction of the fabric bandwidth to
> be used just for replying "BUSY" to requests that can't be processed
> immediately.


--
Ming

2018-01-19 15:21:49

by Bart Van Assche

[permalink] [raw]
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Fri, 2018-01-19 at 15:26 +0800, Ming Lei wrote:
> Please see queue_delayed_work_on(): hctx->run_work is shared by all
> scheduling, so once blk_mq_delay_run_hw_queue(100ms) returns, no new
> scheduling can make progress during those 100ms.

How about addressing that as follows:

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f7515dd95a36..57f8379a476d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1403,9 +1403,9 @@ static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
put_cpu();
}

- kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
- &hctx->run_work,
- msecs_to_jiffies(msecs));
+ kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
+ &hctx->run_work,
+ msecs_to_jiffies(msecs));
}

void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)

Bart.

2018-01-19 15:24:58

by Jens Axboe

[permalink] [raw]
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On 1/19/18 12:26 AM, Ming Lei wrote:
> On Thu, Jan 18, 2018 at 09:02:45PM -0700, Jens Axboe wrote:
>> On 1/18/18 7:32 PM, Ming Lei wrote:
>>> On Thu, Jan 18, 2018 at 01:11:01PM -0700, Jens Axboe wrote:
>>>> On 1/18/18 11:47 AM, Bart Van Assche wrote:
>>>>>> This is all very tiresome.
>>>>>
>>>>> Yes, this is tiresome. It is very annoying to me that others keep
>>>>> introducing so many regressions in such important parts of the kernel.
>>>>> It is also annoying to me that I get blamed if I report a regression
>>>>> instead of seeing that the regression gets fixed.
>>>>
>>>> I agree, it sucks that any change there introduces the regression. I'm
>>>> fine with doing the delay insert again until a new patch is proven to be
>>>> better.
>>>
>>> That way is still buggy, as I explained: rerunning the queue before adding
>>> the request to hctx->dispatch_list isn't correct. Who can make sure the
>>> request is visible when __blk_mq_run_hw_queue() is called?
>>
>> That race basically doesn't exist for a 10ms gap.
>>
>>> Not to mention that this way will cause a performance regression again.
>>
>> How so? It's _exactly_ the same as what you are proposing, except mine
>> will potentially run the queue when it need not do so. But given that
>> these are random 10ms queue kicks because we are screwed, it should not
>> matter. The key point is that it only should be if we have NO better
>> options. If it's a frequently occurring event that we have to return
>> BLK_STS_RESOURCE, then we need to get a way to register an event for
>> when that condition clears. That event will then kick the necessary
>> queue(s).
>
> Please see queue_delayed_work_on(): hctx->run_work is shared by all
> scheduling, so once blk_mq_delay_run_hw_queue(100ms) returns, no new
> scheduling can make progress during those 100ms.

That's a bug, plain and simple. If someone does "run this queue in
100ms" and someone else comes in and says "run this queue now", the
correct outcome is running this queue now.

>>>> From the original topic of this email, we have conditions that can cause
>>>> the driver to not be able to submit an IO. A set of those conditions can
>>>> only happen if IO is in flight, and those cases we have covered just
>>>> fine. Another set can potentially trigger without IO being in flight.
>>>> These are cases where a non-device resource is unavailable at the time
>>>> of submission. This might be iommu running out of space, for instance,
>>>> or it might be a memory allocation of some sort. For these cases, we
>>>> don't get any notification when the shortage clears. All we can do is
>>>> ensure that we restart operations at some point in the future. We're SOL
>>>> at that point, but we have to ensure that we make forward progress.
>>>
>>> Right, it is a generic issue, not a DM-specific one; almost all drivers
>>> call kmalloc(GFP_ATOMIC) in the IO path.
>>
>> GFP_ATOMIC basically never fails, unless we are out of memory. The
>
> I guess GFP_KERNEL may never fail, but a GFP_ATOMIC failure is possible;
> it is mentioned[1] that there is code like the following in the mm
> allocation path, and OOM can happen too.
>
> 	if ((some randomly generated condition) && (request is atomic))
> 		return NULL;
>
> [1] https://lwn.net/Articles/276731/

That article is 10 years old. Once you run large scale production, you
see what the real failures are. Fact is, for zero order allocation, if
the atomic alloc fails the shit has really hit the fan. In that case, a
delay of 10ms is not your main issue. It's a total red herring when you
compare to the frequency of what Bart is seeing. It's noise, and
irrelevant here. For an atomic zero order allocation failure, doing a
short random sleep is perfectly fine.

>> exception is higher order allocations. If a driver has a higher order
>> atomic allocation in its IO path, the device driver writer needs to be
>> taken out behind the barn and shot. Simple as that. It will NEVER work
>> well in a production environment. Witness the disaster that so many NIC
>> driver writers have learned.
>>
>> This is NOT the case we care about here. It's resources that are more
>> readily depleted because other devices are using them. If it's a high
>> frequency or generally occurring event, then we simply must have a
>> callback to restart the queue from that. The condition then becomes
>> identical to device private starvation, the only difference being from
>> where we restart the queue.
>>
>>> IMO, there is enough time to figure out a generic solution before the
>>> 4.16 release.
>>
>> I would hope so, but the proposed solutions have not filled me with
>> a lot of confidence in the end result so far.
>>
>>>> That last set of conditions had better not be a common occurrence, since
>>>> performance is down the toilet at that point. I don't want to introduce
>>>> hot path code to rectify it. Have the driver return if that happens in a
>>>> way that is DIFFERENT from needing a normal restart. The driver knows if
>>>> this is a resource that will become available when IO completes on this
>>>> device or not. If we get that return, we have a generic run-again delay.
>>>
>>> Most of the time neither NVMe nor SCSI returns BLK_STS_RESOURCE; it
>>> should be only DM that returns STS_RESOURCE so often.
>>
>> Where does the dm STS_RESOURCE error usually come from - what exact
>> resource are we running out of?
>
> It is from blk_get_request(underlying queue), see
> multipath_clone_and_map().

That's what I thought. So for a low queue depth underlying queue, it's
quite possible that this situation can happen. Two potential solutions
I see:

1) As described earlier in this thread, having a mechanism for being
notified when the scarce resource becomes available. It would not
be hard to tap into the existing sbitmap wait queue for that.

2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
allocation. I haven't read the dm code to know if this is a
possibility or not.

I'd probably prefer #1. It's a classic case of trying to get the
request, and if it fails, add ourselves to the sbitmap tag wait
queue head, retry, and bail if that also fails. Connecting the
scarce resource and the consumer is the only way to really fix
this, without bogus arbitrary delays.
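A rough sketch of what option #1 could look like from dm-mpath's side
(editor's illustration only; the sbq_wait_ptr() plumbing and the
GFP_ATOMIC blk_get_request() call are assumptions made for the example):

static struct request *get_clone_or_register(struct request_queue *q,
					     struct blk_mq_hw_ctx *hctx,
					     unsigned int op)
{
	struct sbq_wait_state *ws;
	DEFINE_WAIT(wait);
	struct request *rq;

	rq = blk_get_request(q, op, GFP_ATOMIC);
	if (!IS_ERR(rq))
		return rq;

	/*
	 * Hook onto the tag sbitmap's waitqueue, then retry once: a
	 * tag may have been freed between the failed allocation and
	 * prepare_to_wait().
	 */
	ws = sbq_wait_ptr(&hctx->tags->bitmap_tags, &hctx->wait_index);
	prepare_to_wait(&ws->wait, &wait, TASK_UNINTERRUPTIBLE);
	rq = blk_get_request(q, op, GFP_ATOMIC);
	finish_wait(&ws->wait, &wait);

	/*
	 * If rq is still an error, the caller bails with
	 * BLK_STS_RESOURCE.  A production version would park a
	 * longer-lived wait entry (as blk-mq does for its dispatch
	 * wait) so the wakeup on tag free reruns the queue, instead
	 * of dropping off the waitqueue like this sketch does.
	 */
	return rq;
}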

--
Jens Axboe


2018-01-19 15:27:24

by Jens Axboe

[permalink] [raw]
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On 1/19/18 8:20 AM, Bart Van Assche wrote:
> On Fri, 2018-01-19 at 15:26 +0800, Ming Lei wrote:
>> Please see queue_delayed_work_on(): hctx->run_work is shared by all
>> scheduling, so once blk_mq_delay_run_hw_queue(100ms) returns, no new
>> scheduling can make progress during those 100ms.
>
> How about addressing that as follows:
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f7515dd95a36..57f8379a476d 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1403,9 +1403,9 @@ static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
> put_cpu();
> }
>
> - kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
> - &hctx->run_work,
> - msecs_to_jiffies(msecs));
> + kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
> + &hctx->run_work,
> + msecs_to_jiffies(msecs));
> }

Exactly. That's why I said it was just a bug in my previous email, not
honoring a newer run is just stupid. Only other thing you have to be
careful with here is the STOPPED bit.

--
Jens Axboe


2018-01-19 15:35:48

by Ming Lei

[permalink] [raw]
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Fri, Jan 19, 2018 at 03:20:13PM +0000, Bart Van Assche wrote:
> On Fri, 2018-01-19 at 15:26 +0800, Ming Lei wrote:
> > Please see queue_delayed_work_on(): hctx->run_work is shared by all
> > scheduling, so once blk_mq_delay_run_hw_queue(100ms) returns, no new
> > scheduling can make progress during those 100ms.
>
> How about addressing that as follows:
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f7515dd95a36..57f8379a476d 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1403,9 +1403,9 @@ static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
> put_cpu();
> }
>
> - kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
> - &hctx->run_work,
> - msecs_to_jiffies(msecs));
> + kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
> + &hctx->run_work,
> + msecs_to_jiffies(msecs));
> }
>
> void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
>
> Bart.

Yes, this one together with Jens's suggestion of returning
BLK_STS_NO_DEV_RESOURCE should fix this issue.

Could you cook a fix for this issue? Otherwise I am happy to do
that.

--
Ming

2018-01-19 15:42:06

by Ming Lei

[permalink] [raw]
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Fri, Jan 19, 2018 at 08:24:06AM -0700, Jens Axboe wrote:
> On 1/19/18 12:26 AM, Ming Lei wrote:
> > On Thu, Jan 18, 2018 at 09:02:45PM -0700, Jens Axboe wrote:
> >> On 1/18/18 7:32 PM, Ming Lei wrote:
> >>> On Thu, Jan 18, 2018 at 01:11:01PM -0700, Jens Axboe wrote:
> >>>> On 1/18/18 11:47 AM, Bart Van Assche wrote:
> >>>>>> This is all very tiresome.
> >>>>>
> >>>>> Yes, this is tiresome. It is very annoying to me that others keep
> >>>>> introducing so many regressions in such important parts of the kernel.
> >>>>> It is also annoying to me that I get blamed if I report a regression
> >>>>> instead of seeing that the regression gets fixed.
> >>>>
> >>>> I agree, it sucks that any change there introduces the regression. I'm
> >>>> fine with doing the delay insert again until a new patch is proven to be
> >>>> better.
> >>>
> >>> That way is still buggy, as I explained: rerunning the queue before adding
> >>> the request to hctx->dispatch_list isn't correct. Who can make sure the
> >>> request is visible when __blk_mq_run_hw_queue() is called?
> >>
> >> That race basically doesn't exist for a 10ms gap.
> >>
> >>> Not to mention that this way will cause a performance regression again.
> >>
> >> How so? It's _exactly_ the same as what you are proposing, except mine
> >> will potentially run the queue when it need not do so. But given that
> >> these are random 10ms queue kicks because we are screwed, it should not
> >> matter. The key point is that it only should be if we have NO better
> >> options. If it's a frequently occurring event that we have to return
> >> BLK_STS_RESOURCE, then we need to get a way to register an event for
> >> when that condition clears. That event will then kick the necessary
> >> queue(s).
> >
> > Please see queue_delayed_work_on(): hctx->run_work is shared by all
> > scheduling, so once blk_mq_delay_run_hw_queue(100ms) returns, no new
> > scheduling can make progress during those 100ms.
>
> That's a bug, plain and simple. If someone does "run this queue in
> 100ms" and someone else comes in and says "run this queue now", the
> correct outcome is running this queue now.
>
> >>>> From the original topic of this email, we have conditions that can cause
> >>>> the driver to not be able to submit an IO. A set of those conditions can
> >>>> only happen if IO is in flight, and those cases we have covered just
> >>>> fine. Another set can potentially trigger without IO being in flight.
> >>>> These are cases where a non-device resource is unavailable at the time
> >>>> of submission. This might be iommu running out of space, for instance,
> >>>> or it might be a memory allocation of some sort. For these cases, we
> >>>> don't get any notification when the shortage clears. All we can do is
> >>>> ensure that we restart operations at some point in the future. We're SOL
> >>>> at that point, but we have to ensure that we make forward progress.
> >>>
> >>> Right, it is a generic issue, not a DM-specific one; almost all drivers
> >>> call kmalloc(GFP_ATOMIC) in the IO path.
> >>
> >> GFP_ATOMIC basically never fails, unless we are out of memory. The
> >
> > I guess GFP_KERNEL may never fail, but a GFP_ATOMIC failure is possible;
> > it is mentioned[1] that there is code like the following in the mm
> > allocation path, and OOM can happen too.
> >
> > 	if ((some randomly generated condition) && (request is atomic))
> > 		return NULL;
> >
> > [1] https://lwn.net/Articles/276731/
>
> That article is 10 years old. Once you run large scale production, you
> see what the real failures are. Fact is, for zero order allocation, if
> the atomic alloc fails the shit has really hit the fan. In that case, a
> delay of 10ms is not your main issue. It's a total red herring when you
> compare to the frequency of what Bart is seeing. It's noise, and
> irrelevant here. For an atomic zero order allocation failure, doing a
> short random sleep is perfectly fine.
>
> >> exception is higher order allocations. If a driver has a higher order
> >> atomic allocation in its IO path, the device driver writer needs to be
> >> taken out behind the barn and shot. Simple as that. It will NEVER work
> >> well in a production environment. Witness the disaster that so many NIC
> >> driver writers have learned.
> >>
> >> This is NOT the case we care about here. It's resources that are more
> >> readily depleted because other devices are using them. If it's a high
> >> frequency or generally occurring event, then we simply must have a
> >> callback to restart the queue from that. The condition then becomes
> >> identical to device private starvation, the only difference being from
> >> where we restart the queue.
> >>
> >>> IMO, there is enough time to figure out a generic solution before the
> >>> 4.16 release.
> >>
> >> I would hope so, but the proposed solutions have not filled me with
> >> a lot of confidence in the end result so far.
> >>
> >>>> That last set of conditions had better not be a common occurrence, since
> >>>> performance is down the toilet at that point. I don't want to introduce
> >>>> hot path code to rectify it. Have the driver return if that happens in a
> >>>> way that is DIFFERENT from needing a normal restart. The driver knows if
> >>>> this is a resource that will become available when IO completes on this
> >>>> device or not. If we get that return, we have a generic run-again delay.
> >>>
> >>> Most of the time neither NVMe nor SCSI returns BLK_STS_RESOURCE; it
> >>> should be only DM that returns STS_RESOURCE so often.
> >>
> >> Where does the dm STS_RESOURCE error usually come from - what exact
> >> resource are we running out of?
> >
> > It is from blk_get_request(underlying queue), see
> > multipath_clone_and_map().
>
> That's what I thought. So for a low queue depth underlying queue, it's
> quite possible that this situation can happen. Two potential solutions
> I see:
>
> 1) As described earlier in this thread, having a mechanism for being
> notified when the scarce resource becomes available. It would not
> be hard to tap into the existing sbitmap wait queue for that.
>
> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
> allocation. I haven't read the dm code to know if this is a
> possibility or not.
>
> I'd probably prefer #1. It's a classic case of trying to get the
> request, and if it fails, add ourselves to the sbitmap tag wait
> queue head, retry, and bail if that also fails. Connecting the
> scarce resource and the consumer is the only way to really fix
> this, without bogus arbitrary delays.

Right, as I replied to Bart, using mod_delayed_work_on() together with
returning BLK_STS_NO_DEV_RESOURCE (or some such name) for the scarce
resource should fix this issue.
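On the dispatch side that could look roughly like this (editor's sketch;
BLK_STS_NO_DEV_RESOURCE is the speculative name from this thread and is
defined here only to make the sketch complete):

#define BLK_STS_NO_DEV_RESOURCE	((blk_status_t)13)	/* hypothetical */

static void handle_queue_rq_status(struct blk_mq_hw_ctx *hctx,
				   blk_status_t ret)
{
	switch (ret) {
	case BLK_STS_RESOURCE:
		/*
		 * Device-tied shortage: a completion on this device
		 * will clear SCHED_RESTART and rerun the queue.
		 */
		blk_mq_sched_mark_restart_hctx(hctx);
		break;
	case BLK_STS_NO_DEV_RESOURCE:
		/*
		 * Nothing in flight will ever kick this queue, so
		 * blk-mq itself schedules a rerun after a small,
		 * arbitrary delay.
		 */
		blk_mq_delay_run_hw_queue(hctx, 3);
		break;
	default:
		break;
	}
}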

--
Ming

2018-01-19 15:50:52

by Jens Axboe

[permalink] [raw]
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On 1/19/18 8:40 AM, Ming Lei wrote:
>>>> Where does the dm STS_RESOURCE error usually come from - what's exact
>>>> resource are we running out of?
>>>
>>> It is from blk_get_request(underlying queue), see
>>> multipath_clone_and_map().
>>
>> That's what I thought. So for a low queue depth underlying queue, it's
>> quite possible that this situation can happen. Two potential solutions
>> I see:
>>
>> 1) As described earlier in this thread, having a mechanism for being
>> notified when the scarce resource becomes available. It would not
>> be hard to tap into the existing sbitmap wait queue for that.
>>
>> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
>> allocation. I haven't read the dm code to know if this is a
>> possibility or not.
>>
>> I'd probably prefer #1. It's a classic case of trying to get the
>> request, and if it fails, add ourselves to the sbitmap tag wait
>> queue head, retry, and bail if that also fails. Connecting the
>> scarce resource and the consumer is the only way to really fix
>> this, without bogus arbitrary delays.
>
> Right, as I replied to Bart, using mod_delayed_work_on() together with
> returning BLK_STS_NO_DEV_RESOURCE (or some such name) for the scarce
> resource should fix this issue.

It'll fix the forever stall, but it won't really fix it, as we'll slow
down the dm device by some random amount.

A simple test case would be to have a null_blk device with a queue depth
of one, and dm on top of that. Start a fio job that runs two jobs: one
that does IO to the underlying device, and one that does IO to the dm
device. If the job on the dm device runs substantially slower than the
one to the underlying device, then the problem isn't really fixed.

That said, I'm fine with ensuring that we make forward progress always
first, and then we can come up with a proper solution to the issue. The
forward progress guarantee will be needed for the more rare failure
cases, like allocation failures. nvme needs that too, for instance, for
the discard range struct allocation.
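The nvme case has exactly this shape; a loose paraphrase of that
allocation (editor's sketch modelled on nvme_setup_discard(), with the
surrounding setup elided):

static blk_status_t foo_setup_discard(struct request *rq,
				      unsigned short segments,
				      struct nvme_dsm_range **range_out)
{
	struct nvme_dsm_range *range;

	/*
	 * An atomic allocation in the IO path: if it fails, no
	 * completion on this device is tied to the shortage, so only
	 * a delayed rerun guarantees forward progress.
	 */
	range = kmalloc_array(segments, sizeof(*range), GFP_ATOMIC);
	if (!range)
		return BLK_STS_RESOURCE;

	*range_out = range;
	return BLK_STS_OK;
}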

--
Jens Axboe


2018-01-19 16:07:06

by Bart Van Assche

[permalink] [raw]
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Fri, 2018-01-19 at 23:33 +0800, Ming Lei wrote:
> On Fri, Jan 19, 2018 at 03:20:13PM +0000, Bart Van Assche wrote:
> > On Fri, 2018-01-19 at 15:26 +0800, Ming Lei wrote:
> > > Please see queue_delayed_work_on(): hctx->run_work is shared by all
> > > scheduling, so once blk_mq_delay_run_hw_queue(100ms) returns, no new
> > > scheduling can make progress during those 100ms.
> >
> > How about addressing that as follows:
> >
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index f7515dd95a36..57f8379a476d 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1403,9 +1403,9 @@ static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
> > put_cpu();
> > }
> >
> > - kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
> > - &hctx->run_work,
> > - msecs_to_jiffies(msecs));
> > + kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
> > + &hctx->run_work,
> > + msecs_to_jiffies(msecs));
> > }
> >
> > void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
> >
> > Bart.
>
> Yes, this one together with Jens's suggestion of returning
> BLK_STS_NO_DEV_RESOURCE should fix this issue.
>
> Could you cook a fix for this issue? Otherwise I am happy to do
> that.

Hello Ming,

I will look further into this.

Bart.

2018-01-19 16:07:21

by Ming Lei

[permalink] [raw]
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote:
> On 1/19/18 8:40 AM, Ming Lei wrote:
> >>>> Where does the dm STS_RESOURCE error usually come from - what exact
> >>>> resource are we running out of?
> >>>
> >>> It is from blk_get_request(underlying queue), see
> >>> multipath_clone_and_map().
> >>
> >> That's what I thought. So for a low queue depth underlying queue, it's
> >> quite possible that this situation can happen. Two potential solutions
> >> I see:
> >>
> >> 1) As described earlier in this thread, having a mechanism for being
> >> notified when the scarce resource becomes available. It would not
> >> be hard to tap into the existing sbitmap wait queue for that.
> >>
> >> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
> >> allocation. I haven't read the dm code to know if this is a
> >> possibility or not.
> >>
> >> I'd probably prefer #1. It's a classic case of trying to get the
> >> request, and if it fails, add ourselves to the sbitmap tag wait
> >> queue head, retry, and bail if that also fails. Connecting the
> >> scarce resource and the consumer is the only way to really fix
> >> this, without bogus arbitrary delays.
> >
> > Right, as I replied to Bart, using mod_delayed_work_on() together with
> > returning BLK_STS_NO_DEV_RESOURCE (or some such name) for the scarce
> > resource should fix this issue.
>
> It'll fix the forever stall, but it won't really fix it, as we'll slow
> down the dm device by some random amount.
>
> A simple test case would be to have a null_blk device with a queue depth
> of one, and dm on top of that. Start a fio job that runs two jobs: one
> that does IO to the underlying device, and one that does IO to the dm
> device. If the job on the dm device runs substantially slower than the
> one to the underlying device, then the problem isn't really fixed.

I remember trying this test on scsi-debug & dm-mpath over scsi-debug and
not observing this issue. Could you explain a bit why IO over dm-mpath
may be slower? Both IO contexts call the same get_request(), and in
theory dm-mpath should be a bit quicker since it uses direct issue for
the underlying queue, without an io scheduler involved.

--
Ming

2018-01-19 16:15:49

by Mike Snitzer

[permalink] [raw]
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Fri, Jan 19 2018 at 10:48am -0500,
Jens Axboe <[email protected]> wrote:

> On 1/19/18 8:40 AM, Ming Lei wrote:
> >>>> Where does the dm STS_RESOURCE error usually come from - what exact
> >>>> resource are we running out of?
> >>>
> >>> It is from blk_get_request(underlying queue), see
> >>> multipath_clone_and_map().
> >>
> >> That's what I thought. So for a low queue depth underlying queue, it's
> >> quite possible that this situation can happen. Two potential solutions
> >> I see:
> >>
> >> 1) As described earlier in this thread, having a mechanism for being
> >> notified when the scarce resource becomes available. It would not
> >> be hard to tap into the existing sbitmap wait queue for that.
> >>
> >> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
> >> allocation. I haven't read the dm code to know if this is a
> >> possibility or not.

Right, #2 is _not_ the way forward. Historically request-based DM used
its own mempool for requests; this was to be able to have some measure
of control and resiliency in the face of low memory conditions that
might be affecting the broader system.

Then Christoph switched over to adding per-request data, which ushered
in the use of blk_get_request using ATOMIC allocations. I like the
result of that line of development. But taking the next step of setting
BLK_MQ_F_BLOCKING is highly unfortunate (especially in that this
dm-mpath.c code is common to the old .request_fn and blk-mq, at least
the call to blk_get_request is). Ultimately dm-mpath would like to
avoid blocking for a request, because for this dm-mpath device we have
multiple queues to allocate from if need be (provided we have an
active-active storage network topology).

> >> I'd probably prefer #1. It's a classic case of trying to get the
> >> request, and if it fails, add ourselves to the sbitmap tag wait
> >> queue head, retry, and bail if that also fails. Connecting the
> >> scarce resource and the consumer is the only way to really fix
> >> this, without bogus arbitrary delays.
> >
> > Right, as I replied to Bart, using mod_delayed_work_on() together with
> > returning BLK_STS_NO_DEV_RESOURCE (or some such name) for the scarce
> > resource should fix this issue.
>
> It'll fix the forever stall, but it won't really fix it, as we'll slow
> down the dm device by some random amount.

Agreed.

> A simple test case would be to have a null_blk device with a queue depth
> of one, and dm on top of that. Start a fio job that runs two jobs: one
> that does IO to the underlying device, and one that does IO to the dm
> device. If the job on the dm device runs substantially slower than the
> one to the underlying device, then the problem isn't really fixed.

Not sure DM will allow the underlying device to be opened (due to
master/slave ownership that is part of loading a DM table)?

> That said, I'm fine with ensuring that we make forward progress always
> first, and then we can come up with a proper solution to the issue. The
> forward progress guarantee will be needed for the more rare failure
> cases, like allocation failures. nvme needs that too, for instance, for
> the discard range struct allocation.

Yeap, I'd be OK with that too. We'd be better off revisiting this and
then having some time to develop the ultimate robust fix (#1, the
callback from above).

Mike

2018-01-19 16:21:41

by Jens Axboe

[permalink] [raw]
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On 1/19/18 9:05 AM, Ming Lei wrote:
> On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote:
>> On 1/19/18 8:40 AM, Ming Lei wrote:
>>>>>> Where does the dm STS_RESOURCE error usually come from - what exact
>>>>>> resource are we running out of?
>>>>>
>>>>> It is from blk_get_request(underlying queue), see
>>>>> multipath_clone_and_map().
>>>>
>>>> That's what I thought. So for a low queue depth underlying queue, it's
>>>> quite possible that this situation can happen. Two potential solutions
>>>> I see:
>>>>
>>>> 1) As described earlier in this thread, having a mechanism for being
>>>> notified when the scarce resource becomes available. It would not
>>>> be hard to tap into the existing sbitmap wait queue for that.
>>>>
>>>> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
>>>> allocation. I haven't read the dm code to know if this is a
>>>> possibility or not.
>>>>
>>>> I'd probably prefer #1. It's a classic case of trying to get the
>>>> request, and if it fails, add ourselves to the sbitmap tag wait
>>>> queue head, retry, and bail if that also fails. Connecting the
>>>> scarce resource and the consumer is the only way to really fix
>>>> this, without bogus arbitrary delays.
>>>
>>> Right, as I replied to Bart, using mod_delayed_work_on() together with
>>> returning BLK_STS_NO_DEV_RESOURCE (or some such name) for the scarce
>>> resource should fix this issue.
>>
>> It'll fix the forever stall, but it won't really fix it, as we'll slow
>> down the dm device by some random amount.
>>
>> A simple test case would be to have a null_blk device with a queue depth
>> of one, and dm on top of that. Start a fio job that runs two jobs: one
>> that does IO to the underlying device, and one that does IO to the dm
>> device. If the job on the dm device runs substantially slower than the
>> one to the underlying device, then the problem isn't really fixed.
>
> I remember trying this test on scsi-debug & dm-mpath over scsi-debug and
> not observing this issue. Could you explain a bit why IO over dm-mpath
> may be slower? Both IO contexts call the same get_request(), and in
> theory dm-mpath should be a bit quicker since it uses direct issue for
> the underlying queue, without an io scheduler involved.

Because if you lose the race for getting the request, you'll have some
arbitrary delay before trying again, potentially. Compared to the direct
user of the underlying device, who will simply sleep on the resource and
get woken the instant it's available.

--
Jens Axboe


2018-01-19 16:24:39

by Jens Axboe

[permalink] [raw]
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On 1/19/18 9:13 AM, Mike Snitzer wrote:
> On Fri, Jan 19 2018 at 10:48am -0500,
> Jens Axboe <[email protected]> wrote:
>
>> On 1/19/18 8:40 AM, Ming Lei wrote:
>>>>>> Where does the dm STS_RESOURCE error usually come from - what exact
>>>>>> resource are we running out of?
>>>>>
>>>>> It is from blk_get_request(underlying queue), see
>>>>> multipath_clone_and_map().
>>>>
>>>> That's what I thought. So for a low queue depth underlying queue, it's
>>>> quite possible that this situation can happen. Two potential solutions
>>>> I see:
>>>>
>>>> 1) As described earlier in this thread, having a mechanism for being
>>>> notified when the scarce resource becomes available. It would not
>>>> be hard to tap into the existing sbitmap wait queue for that.
>>>>
>>>> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
>>>> allocation. I haven't read the dm code to know if this is a
>>>> possibility or not.
>
> Right, #2 is _not_ the way forward. Historically request-based DM used
> its own mempool for requests; this was to be able to have some measure
> of control and resiliency in the face of low memory conditions that
> might be affecting the broader system.
>
> Then Christoph switched over to adding per-request data, which ushered
> in the use of blk_get_request using ATOMIC allocations. I like the
> result of that line of development. But taking the next step of setting
> BLK_MQ_F_BLOCKING is highly unfortunate (especially in that this
> dm-mpath.c code is common to the old .request_fn and blk-mq, at least
> the call to blk_get_request is). Ultimately dm-mpath would like to
> avoid blocking for a request, because for this dm-mpath device we have
> multiple queues to allocate from if need be (provided we have an
> active-active storage network topology).

If you can go to multiple devices, obviously it should not block on a
single device. That's only true for the case where you can only go to
one device, blocking at that point would probably be fine. Or if all
your paths are busy, then blocking would also be OK.

But it's a much larger change, and would entail changing more than just
the actual call to blk_get_request().

>> A simple test case would be to have a null_blk device with a queue depth
>> of one, and dm on top of that. Start a fio job that runs two jobs: one
>> that does IO to the underlying device, and one that does IO to the dm
>> device. If the job on the dm device runs substantially slower than the
>> one to the underlying device, then the problem isn't really fixed.
>
> Not sure DM will allow the underlying device to be opened (due to
> master/slave ownership that is part of loading a DM table)?

There are many ways it could be setup - just partition the underlying
device then, and have one partition be part of the dm setup and the
other used directly.

>> That said, I'm fine with ensuring that we make forward progress always
>> first, and then we can come up with a proper solution to the issue. The
>> forward progress guarantee will be needed for the more rare failure
>> cases, like allocation failures. nvme needs that too, for instance, for
>> the discard range struct allocation.
>
> Yeap, I'd be OK with that too. We'd be better off revisiting this and
> then having some time to develop the ultimate robust fix (#1, the
> callback from above).

Yeah, we need the quick and dirty sooner, which just brings us back to
what we had before, essentially.

--
Jens Axboe


2018-01-19 16:28:04

by Ming Lei

[permalink] [raw]
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote:
> On 1/19/18 9:05 AM, Ming Lei wrote:
> > On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote:
> >> On 1/19/18 8:40 AM, Ming Lei wrote:
> >>>>>> Where does the dm STS_RESOURCE error usually come from - what exact
> >>>>>> resource are we running out of?
> >>>>>
> >>>>> It is from blk_get_request(underlying queue), see
> >>>>> multipath_clone_and_map().
> >>>>
> >>>> That's what I thought. So for a low queue depth underlying queue, it's
> >>>> quite possible that this situation can happen. Two potential solutions
> >>>> I see:
> >>>>
> >>>> 1) As described earlier in this thread, having a mechanism for being
> >>>> notified when the scarce resource becomes available. It would not
> >>>> be hard to tap into the existing sbitmap wait queue for that.
> >>>>
> >>>> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
> >>>> allocation. I haven't read the dm code to know if this is a
> >>>> possibility or not.
> >>>>
> >>>> I'd probably prefer #1. It's a classic case of trying to get the
> >>>> request, and if it fails, add ourselves to the sbitmap tag wait
> >>>> queue head, retry, and bail if that also fails. Connecting the
> >>>> scarce resource and the consumer is the only way to really fix
> >>>> this, without bogus arbitrary delays.
> >>>
> >>> Right, as I replied to Bart, using mod_delayed_work_on() together with
> >>> returning BLK_STS_NO_DEV_RESOURCE (or some such name) for the scarce
> >>> resource should fix this issue.
> >>
> >> It'll fix the forever stall, but it won't really fix it, as we'll slow
> >> down the dm device by some random amount.
> >>
> >> A simple test case would be to have a null_blk device with a queue depth
> >> of one, and dm on top of that. Start a fio job that runs two jobs: one
> >> that does IO to the underlying device, and one that does IO to the dm
> >> device. If the job on the dm device runs substantially slower than the
> >> one to the underlying device, then the problem isn't really fixed.
> >
> > I remember trying this test on scsi-debug & dm-mpath over scsi-debug and
> > not observing this issue. Could you explain a bit why IO over dm-mpath
> > may be slower? Both IO contexts call the same get_request(), and in
> > theory dm-mpath should be a bit quicker since it uses direct issue for
> > the underlying queue, without an io scheduler involved.
>
> Because if you lose the race for getting the request, you'll have some
> arbitrary delay before trying again, potentially. Compared to the direct

But the restart still works: once one request is completed, the queue
is rerun immediately because we use mod_delayed_work_on(0), so there
looks to be no such issue.


--
Ming

2018-01-19 16:29:00

by Jens Axboe

[permalink] [raw]
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On 1/19/18 9:26 AM, Ming Lei wrote:
> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote:
>> On 1/19/18 9:05 AM, Ming Lei wrote:
>>> On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote:
>>>> On 1/19/18 8:40 AM, Ming Lei wrote:
>>>>>>>> Where does the dm STS_RESOURCE error usually come from - what exact
>>>>>>>> resource are we running out of?
>>>>>>>
>>>>>>> It is from blk_get_request(underlying queue), see
>>>>>>> multipath_clone_and_map().
>>>>>>
>>>>>> That's what I thought. So for a low queue depth underlying queue, it's
>>>>>> quite possible that this situation can happen. Two potential solutions
>>>>>> I see:
>>>>>>
>>>>>> 1) As described earlier in this thread, having a mechanism for being
>>>>>> notified when the scarce resource becomes available. It would not
>>>>>> be hard to tap into the existing sbitmap wait queue for that.
>>>>>>
>>>>>> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
>>>>>> allocation. I haven't read the dm code to know if this is a
>>>>>> possibility or not.
>>>>>>
>>>>>> I'd probably prefer #1. It's a classic case of trying to get the
>>>>>> request, and if it fails, add ourselves to the sbitmap tag wait
>>>>>> queue head, retry, and bail if that also fails. Connecting the
>>>>>> scarce resource and the consumer is the only way to really fix
>>>>>> this, without bogus arbitrary delays.
>>>>>
>>>>> Right, as I replied to Bart, using mod_delayed_work_on() together with
>>>>> returning BLK_STS_NO_DEV_RESOURCE (or some such name) for the scarce
>>>>> resource should fix this issue.
>>>>
>>>> It'll fix the forever stall, but it won't really fix it, as we'll slow
>>>> down the dm device by some random amount.
>>>>
>>>> A simple test case would be to have a null_blk device with a queue depth
>>>> of one, and dm on top of that. Start a fio job that runs two jobs: one
>>>> that does IO to the underlying device, and one that does IO to the dm
>>>> device. If the job on the dm device runs substantially slower than the
>>>> one to the underlying device, then the problem isn't really fixed.
>>>
>>> I remember trying this test on scsi-debug & dm-mpath over scsi-debug and
>>> not observing this issue. Could you explain a bit why IO over dm-mpath
>>> may be slower? Both IO contexts call the same get_request(), and in
>>> theory dm-mpath should be a bit quicker since it uses direct issue for
>>> the underlying queue, without an io scheduler involved.
>>
>> Because if you lose the race for getting the request, you'll have some
>> arbitrary delay before trying again, potentially. Compared to the direct
>
> But the restart still works: once one request is completed, the queue
> is rerun immediately because we use mod_delayed_work_on(0), so there
> looks to be no such issue.

There are no pending requests for this case, nothing to restart the
queue. When you fail that blk_get_request(), you are idle, nothing
is pending.

--
Jens Axboe


2018-01-19 16:38:41

by Ming Lei

[permalink] [raw]
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote:
> On 1/19/18 9:26 AM, Ming Lei wrote:
> > On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote:
> >> On 1/19/18 9:05 AM, Ming Lei wrote:
> >>> On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote:
> >>>> On 1/19/18 8:40 AM, Ming Lei wrote:
> >>>>>>>> Where does the dm STS_RESOURCE error usually come from - what exact
> >>>>>>>> resource are we running out of?
> >>>>>>>
> >>>>>>> It is from blk_get_request(underlying queue), see
> >>>>>>> multipath_clone_and_map().
> >>>>>>
> >>>>>> That's what I thought. So for a low queue depth underlying queue, it's
> >>>>>> quite possible that this situation can happen. Two potential solutions
> >>>>>> I see:
> >>>>>>
> >>>>>> 1) As described earlier in this thread, having a mechanism for being
> >>>>>> notified when the scarce resource becomes available. It would not
> >>>>>> be hard to tap into the existing sbitmap wait queue for that.
> >>>>>>
> >>>>>> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
> >>>>>> allocation. I haven't read the dm code to know if this is a
> >>>>>> possibility or not.
> >>>>>>
> >>>>>> I'd probably prefer #1. It's a classic case of trying to get the
> >>>>>> request, and if it fails, add ourselves to the sbitmap tag wait
> >>>>>> queue head, retry, and bail if that also fails. Connecting the
> >>>>>> scarce resource and the consumer is the only way to really fix
> >>>>>> this, without bogus arbitrary delays.
> >>>>>
> >>>>> Right, as I replied to Bart, using mod_delayed_work_on() together with
> >>>>> returning BLK_STS_NO_DEV_RESOURCE (or some such name) for the scarce
> >>>>> resource should fix this issue.
> >>>>
> >>>> It'll fix the forever stall, but it won't really fix it, as we'll slow
> >>>> down the dm device by some random amount.
> >>>>
> >>>> A simple test case would be to have a null_blk device with a queue depth
> >>>> of one, and dm on top of that. Start a fio job that runs two jobs: one
> >>>> that does IO to the underlying device, and one that does IO to the dm
> >>>> device. If the job on the dm device runs substantially slower than the
> >>>> one to the underlying device, then the problem isn't really fixed.
> >>>
> >>> I remember trying this test on scsi-debug & dm-mpath over scsi-debug and
> >>> not observing this issue. Could you explain a bit why IO over dm-mpath
> >>> may be slower? Both IO contexts call the same get_request(), and in
> >>> theory dm-mpath should be a bit quicker since it uses direct issue for
> >>> the underlying queue, without an io scheduler involved.
> >>
> >> Because if you lose the race for getting the request, you'll have some
> >> arbitrary delay before trying again, potentially. Compared to the direct
> >
> > But the restart still works: once one request is completed, the queue
> > is rerun immediately because we use mod_delayed_work_on(0), so there
> > looks to be no such issue.
>
> There are no pending requests for this case, nothing to restart the
> queue. When you fail that blk_get_request(), you are idle, nothing
> is pending.

I think we needn't worry about that: once a device is attached to
dm-rq it can't be mounted any more, and usually users don't use the
device directly and via dm-mpath at the same time.

--
Ming

2018-01-19 16:43:53

by Jens Axboe

[permalink] [raw]
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On 1/19/18 9:37 AM, Ming Lei wrote:
> On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote:
>> On 1/19/18 9:26 AM, Ming Lei wrote:
>>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote:
>>>> On 1/19/18 9:05 AM, Ming Lei wrote:
>>>>> On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote:
>>>>>> On 1/19/18 8:40 AM, Ming Lei wrote:
>>>>>>>>>> Where does the dm STS_RESOURCE error usually come from - what exact
>>>>>>>>>> resource are we running out of?
>>>>>>>>>
>>>>>>>>> It is from blk_get_request(underlying queue), see
>>>>>>>>> multipath_clone_and_map().
>>>>>>>>
>>>>>>>> That's what I thought. So for a low queue depth underlying queue, it's
>>>>>>>> quite possible that this situation can happen. Two potential solutions
>>>>>>>> I see:
>>>>>>>>
>>>>>>>> 1) As described earlier in this thread, having a mechanism for being
>>>>>>>> notified when the scarce resource becomes available. It would not
>>>>>>>> be hard to tap into the existing sbitmap wait queue for that.
>>>>>>>>
>>>>>>>> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
>>>>>>>> allocation. I haven't read the dm code to know if this is a
>>>>>>>> possibility or not.
>>>>>>>>
>>>>>>>> I'd probably prefer #1. It's a classic case of trying to get the
>>>>>>>> request, and if it fails, add ourselves to the sbitmap tag wait
>>>>>>>> queue head, retry, and bail if that also fails. Connecting the
>>>>>>>> scarce resource and the consumer is the only way to really fix
>>>>>>>> this, without bogus arbitrary delays.
>>>>>>>
>>>>>>> Right, as I replied to Bart, using mod_delayed_work_on() together with
>>>>>>> returning BLK_STS_NO_DEV_RESOURCE (or some such name) for the scarce
>>>>>>> resource should fix this issue.
>>>>>>
>>>>>> It'll fix the forever stall, but it won't really fix it, as we'll slow
>>>>>> down the dm device by some random amount.
>>>>>>
>>>>>> A simple test case would be to have a null_blk device with a queue depth
>>>>>> of one, and dm on top of that. Start a fio job that runs two jobs: one
>>>>>> that does IO to the underlying device, and one that does IO to the dm
>>>>>> device. If the job on the dm device runs substantially slower than the
>>>>>> one to the underlying device, then the problem isn't really fixed.
>>>>>
>>>>> I remember trying this test on scsi-debug & dm-mpath over scsi-debug and
>>>>> not observing this issue. Could you explain a bit why IO over dm-mpath
>>>>> may be slower? Both IO contexts call the same get_request(), and in
>>>>> theory dm-mpath should be a bit quicker since it uses direct issue for
>>>>> the underlying queue, without an io scheduler involved.
>>>>
>>>> Because if you lose the race for getting the request, you'll have some
>>>> arbitrary delay before trying again, potentially. Compared to the direct
>>>
>>> But the restart still works: once one request is completed, the queue
>>> is rerun immediately because we use mod_delayed_work_on(0), so there
>>> looks to be no such issue.
>>
>> There are no pending requests for this case, nothing to restart the
>> queue. When you fail that blk_get_request(), you are idle, nothing
>> is pending.
>
> I think we needn't worry about that: once a device is attached to
> dm-rq it can't be mounted any more, and usually users don't use the
> device directly and via dm-mpath at the same time.

Even if it doesn't happen for a normal dm setup, it is a case that
needs to be handled. The request allocation is just one example of
a wider scope resource that can be unavailable. If the driver returns
NO_DEV_RESOURCE (or whatever name), it will be a possibility that
the device itself is currently idle.

--
Jens Axboe


2018-01-19 16:48:20

by Mike Snitzer

[permalink] [raw]
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Fri, Jan 19 2018 at 11:41am -0500,
Jens Axboe <[email protected]> wrote:

> On 1/19/18 9:37 AM, Ming Lei wrote:
> > On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote:
> >> On 1/19/18 9:26 AM, Ming Lei wrote:
> >>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote:
> >>
> >> There are no pending requests for this case, nothing to restart the
> >> queue. When you fail that blk_get_request(), you are idle, nothing
> >> is pending.
> >
> > I think we needn't worry about that: once a device is attached to
> > dm-rq it can't be mounted any more, and usually users don't use the
> > device directly and via dm-mpath at the same time.
>
> Even if it doesn't happen for a normal dm setup, it is a case that
> needs to be handled. The request allocation is just one example of
> a wider scope resource that can be unavailable. If the driver returns
> NO_DEV_RESOURCE (or whatever name), it will be a possibility that
> the device itself is currently idle.

How could a driver's resources be exhausted while the device is idle (so
that it can't benefit from RESTART)?

2018-01-19 16:53:50

by Jens Axboe

[permalink] [raw]
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On 1/19/18 9:47 AM, Mike Snitzer wrote:
> On Fri, Jan 19 2018 at 11:41am -0500,
> Jens Axboe <[email protected]> wrote:
>
>> On 1/19/18 9:37 AM, Ming Lei wrote:
>>> On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote:
>>>> On 1/19/18 9:26 AM, Ming Lei wrote:
>>>>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote:
>>>>
>>>> There are no pending requests for this case, nothing to restart the
>>>> queue. When you fail that blk_get_request(), you are idle, nothing
>>>> is pending.
>>>
>>> I think we needn't worry about that, once a device is attached to
>>> dm-rq, it can't be mounted any more, and usually user don't use the device
>>> directly and by dm-mpath at the same time.
>>
>> Even if it doesn't happen for a normal dm setup, it is a case that
>> needs to be handled. The request allocation is just one example of
>> a wider scope resource that can be unavailable. If the driver returns
>> NO_DEV_RESOURCE (or whatever name), it will be a possibility that
>> the device itself is currently idle.
>
> How would a driver's resources be exhausted yet the device is idle (so
> as not to be able to benefit from RESTART)?

I've outlined a number of these examples already. Another case might be:

1) Device is idle
2) Device gets request
3) Device attempts to DMA map
4) DMA map fails because the IOMMU is out of space (nic is using it all)
5) Device returns STS_RESOURCE
6) Queue is marked as needing a restart

All's well, except there is no IO on this device that will notice the
restart bit and retry the operation.

Replace the IOMMU failure with any other resource that the driver might
need for an IO but which isn't tied to the device itself.
blk_get_request() on dm is an example, as is any allocation failure
occurring in the queue IO path for the driver.
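
To make that concrete, here's a minimal sketch of such a ->queue_rq();
everything device-specific in it (struct example_dev, edev_submit) is
made up purely for illustration:

#include <linux/blk-mq.h>
#include <linux/dma-mapping.h>

struct example_dev {
	struct device *dev;		/* DMA-capable parent device */
};

/* Hypothetical hardware submission, declared only for this sketch. */
static void edev_submit(struct example_dev *edev, struct request *rq,
			dma_addr_t addr);

static blk_status_t example_queue_rq(struct blk_mq_hw_ctx *hctx,
				     const struct blk_mq_queue_data *bd)
{
	struct request *rq = bd->rq;
	struct example_dev *edev = hctx->queue->queuedata;
	dma_addr_t addr;

	blk_mq_start_request(rq);

	addr = dma_map_single(edev->dev, bio_data(rq->bio),
			      blk_rq_bytes(rq), DMA_TO_DEVICE);
	if (dma_mapping_error(edev->dev, addr)) {
		/*
		 * Steps 4-6 above: the queue gets marked for RESTART,
		 * but with zero requests in flight on this device no
		 * completion will ever notice the restart bit, so the
		 * IO hangs until something else runs the queue.
		 */
		return BLK_STS_RESOURCE;
	}

	edev_submit(edev, rq, addr);
	return BLK_STS_OK;
}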

--
Jens Axboe


2018-01-19 17:07:29

by Ming Lei

[permalink] [raw]
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Fri, Jan 19, 2018 at 09:52:32AM -0700, Jens Axboe wrote:
> On 1/19/18 9:47 AM, Mike Snitzer wrote:
> > On Fri, Jan 19 2018 at 11:41am -0500,
> > Jens Axboe <[email protected]> wrote:
> >
> >> On 1/19/18 9:37 AM, Ming Lei wrote:
> >>> On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote:
> >>>> On 1/19/18 9:26 AM, Ming Lei wrote:
> >>>>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote:
> >>>>
> >>>> There are no pending requests for this case, nothing to restart the
> >>>> queue. When you fail that blk_get_request(), you are idle, nothing
> >>>> is pending.
> >>>
> >>> I think we needn't worry about that, once a device is attached to
> >>> dm-rq, it can't be mounted any more, and usually user don't use the device
> >>> directly and by dm-mpath at the same time.
> >>
> >> Even if it doesn't happen for a normal dm setup, it is a case that
> >> needs to be handled. The request allocation is just one example of
> >> a wider scope resource that can be unavailable. If the driver returns
> >> NO_DEV_RESOURCE (or whatever name), it will be a possibility that
> >> the device itself is currently idle.
> >
> > How would a driver's resources be exhausted yet the device is idle (so
> > as not to be able to benefit from RESTART)?
>
> I've outlined a number of these examples already. Another case might be:
>
> 1) Device is idle
> 2) Device gets request
> 3) Device attempts to DMA map
> 4) DMA map fails because the IOMMU is out of space (nic is using it all)
> 5) Device returns STS_RESOURCE
> 6) Queue is marked as needing a restart
>
> All's well, except there is no IO on this device that will notice the
> restart bit and retry the operation.
>
> Replace IOMMU failure with any other resource that the driver might need
> for an IO, which isn't tied to a device specific resource.
> blk_get_request() on dm is an example, as is any allocation failure
> occurring in the queue IO path for the driver.

Yeah, if we decide to take the approach of introducing NO_DEV_RESOURCE, all
the current STS_RESOURCE returns for non-device resource allocations (kmalloc,
dma map, get_request, ...) should be converted to NO_DEV_RESOURCE.

And it is a generic issue, which needs a generic solution.

Running the queue again after an arbitrary delay seems to be the only way
we have thought of for this case; are there other solutions?

If the decision is made, let's make/review the patch, :-)

--
Ming

2018-01-19 17:11:13

by Jens Axboe

[permalink] [raw]
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On 1/19/18 10:05 AM, Ming Lei wrote:
> On Fri, Jan 19, 2018 at 09:52:32AM -0700, Jens Axboe wrote:
>> On 1/19/18 9:47 AM, Mike Snitzer wrote:
>>> On Fri, Jan 19 2018 at 11:41am -0500,
>>> Jens Axboe <[email protected]> wrote:
>>>
>>>> On 1/19/18 9:37 AM, Ming Lei wrote:
>>>>> On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote:
>>>>>> On 1/19/18 9:26 AM, Ming Lei wrote:
>>>>>>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote:
>>>>>>
>>>>>> There are no pending requests for this case, nothing to restart the
>>>>>> queue. When you fail that blk_get_request(), you are idle, nothing
>>>>>> is pending.
>>>>>
>>>>> I think we needn't worry about that, once a device is attached to
>>>>> dm-rq, it can't be mounted any more, and usually user don't use the device
>>>>> directly and by dm-mpath at the same time.
>>>>
>>>> Even if it doesn't happen for a normal dm setup, it is a case that
>>>> needs to be handled. The request allocation is just one example of
>>>> a wider scope resource that can be unavailable. If the driver returns
>>>> NO_DEV_RESOURCE (or whatever name), it will be a possibility that
>>>> the device itself is currently idle.
>>>
>>> How would a driver's resources be exhausted yet the device is idle (so
>>> as not to be able to benefit from RESTART)?
>>
>> I've outlined a number of these examples already. Another case might be:
>>
>> 1) Device is idle
>> 2) Device gets request
>> 3) Device attempts to DMA map
>> 4) DMA map fails because the IOMMU is out of space (nic is using it all)
>> 5) Device returns STS_RESOURCE
>> 6) Queue is marked as needing a restart
>>
>> All's well, except there is no IO on this device that will notice the
>> restart bit and retry the operation.
>>
>> Replace IOMMU failure with any other resource that the driver might need
>> for an IO, which isn't tied to a device specific resource.
>> blk_get_request() on dm is an example, as is any allocation failure
>> occurring in the queue IO path for the driver.
>
> Yeah, if we decide to take the approach of introducing NO_DEV_RESOURCE, all
> the current STS_RESOURCE for non-device resource allocation(kmalloc, dma
> map, get_request, ...) should be converted to NO_DEV_RESOURCE.
>
> And it is a generic issue, which need generic solution.

Precisely.

> Seems running queue after arbitrary in this case is the only way we
> thought of, or other solutions?

I think that is the only solution. If it's a frequent enough occurrence
to cause performance issues, then it's likely down to a specific
resource shortage, and we can tackle that independently (we need to,
since each of those will need a specialized solution).

> If the decision is made, let's make/review the patch, :-)

Let 'er rip.
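
As a reference point, here is a sketch of that fallback as a helper in
the dispatch path; the status name BLK_STS_NO_DEV_RESOURCE and the 3 ms
delay are placeholders from this discussion, not merged code:

#include <linux/blk-mq.h>

#define EXAMPLE_RESOURCE_DELAY_MS	3	/* arbitrary placeholder */

/*
 * Conceptual only: how the dispatch path could react when a driver
 * reports that a non-device resource ran out.  Since no completion on
 * this device is guaranteed to arrive, RESTART cannot be relied on;
 * put the request back and rerun the queue after a small delay.
 */
static void example_handle_no_dev_resource(struct blk_mq_hw_ctx *hctx,
					   struct request *rq)
{
	blk_mq_requeue_request(rq, false);
	blk_mq_delay_run_hw_queue(hctx, EXAMPLE_RESOURCE_DELAY_MS);
}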

--
Jens Axboe


2018-01-19 17:21:41

by Ming Lei

[permalink] [raw]
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Fri, Jan 19, 2018 at 10:09:11AM -0700, Jens Axboe wrote:
> On 1/19/18 10:05 AM, Ming Lei wrote:
> > On Fri, Jan 19, 2018 at 09:52:32AM -0700, Jens Axboe wrote:
> >> On 1/19/18 9:47 AM, Mike Snitzer wrote:
> >>> On Fri, Jan 19 2018 at 11:41am -0500,
> >>> Jens Axboe <[email protected]> wrote:
> >>>
> >>>> On 1/19/18 9:37 AM, Ming Lei wrote:
> >>>>> On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote:
> >>>>>> On 1/19/18 9:26 AM, Ming Lei wrote:
> >>>>>>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote:
> >>>>>>
> >>>>>> There are no pending requests for this case, nothing to restart the
> >>>>>> queue. When you fail that blk_get_request(), you are idle, nothing
> >>>>>> is pending.
> >>>>>
> >>>>> I think we needn't worry about that, once a device is attached to
> >>>>> dm-rq, it can't be mounted any more, and usually user don't use the device
> >>>>> directly and by dm-mpath at the same time.
> >>>>
> >>>> Even if it doesn't happen for a normal dm setup, it is a case that
> >>>> needs to be handled. The request allocation is just one example of
> >>>> a wider scope resource that can be unavailable. If the driver returns
> >>>> NO_DEV_RESOURCE (or whatever name), it will be a possibility that
> >>>> the device itself is currently idle.
> >>>
> >>> How would a driver's resources be exhausted yet the device is idle (so
> >>> as not to be able to benefit from RESTART)?
> >>
> >> I've outlined a number of these examples already. Another case might be:
> >>
> >> 1) Device is idle
> >> 2) Device gets request
> >> 3) Device attempts to DMA map
> >> 4) DMA map fails because the IOMMU is out of space (nic is using it all)
> >> 5) Device returns STS_RESOURCE
> >> 6) Queue is marked as needing a restart
> >>
> >> All's well, except there is no IO on this device that will notice the
> >> restart bit and retry the operation.
> >>
> >> Replace IOMMU failure with any other resource that the driver might need
> >> for an IO, which isn't tied to a device specific resource.
> >> blk_get_request() on dm is an example, as is any allocation failure
> >> occurring in the queue IO path for the driver.
> >
> > Yeah, if we decide to take the approach of introducing NO_DEV_RESOURCE, all
> > the current STS_RESOURCE for non-device resource allocation(kmalloc, dma
> > map, get_request, ...) should be converted to NO_DEV_RESOURCE.

Or simply introduce BLK_STS_DEV_RESOURCE and convert running out of a real
device resource into it; that way may be much easier to do.
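
Roughly, the split would look like this (the status value is defined
here only for the sketch; the real definition may differ):

#include <linux/blk_types.h>

/* Defined only for this sketch; not in the tree at this point. */
#define BLK_STS_DEV_RESOURCE	((__force blk_status_t)13)

/*
 * A driver would pick the status based on what actually ran out:
 *
 *   BLK_STS_DEV_RESOURCE - a device-private resource (tags, host or
 *	LUN queue depth): a completion on this device is guaranteed,
 *	so the existing RESTART mechanism works.
 *
 *   BLK_STS_RESOURCE - anything else (kmalloc(GFP_ATOMIC), DMA map,
 *	blk_get_request() on an underlying queue): the block layer
 *	must rerun the queue itself after a delay.
 */
static blk_status_t example_classify_shortage(bool device_private)
{
	return device_private ? BLK_STS_DEV_RESOURCE : BLK_STS_RESOURCE;
}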

--
Ming

2018-01-19 17:40:53

by Jens Axboe

[permalink] [raw]
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On 1/19/18 9:37 AM, Ming Lei wrote:
> On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote:
>> On 1/19/18 9:26 AM, Ming Lei wrote:
>>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote:
>>>> On 1/19/18 9:05 AM, Ming Lei wrote:
>>>>> On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote:
>>>>>> On 1/19/18 8:40 AM, Ming Lei wrote:
>>>>>>>>>> Where does the dm STS_RESOURCE error usually come from - what's exact
>>>>>>>>>> resource are we running out of?
>>>>>>>>>
>>>>>>>>> It is from blk_get_request(underlying queue), see
>>>>>>>>> multipath_clone_and_map().
>>>>>>>>
>>>>>>>> That's what I thought. So for a low queue depth underlying queue, it's
>>>>>>>> quite possible that this situation can happen. Two potential solutions
>>>>>>>> I see:
>>>>>>>>
>>>>>>>> 1) As described earlier in this thread, having a mechanism for being
>>>>>>>> notified when the scarce resource becomes available. It would not
>>>>>>>> be hard to tap into the existing sbitmap wait queue for that.
>>>>>>>>
>>>>>>>> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
>>>>>>>> allocation. I haven't read the dm code to know if this is a
>>>>>>>> possibility or not.
>>>>>>>>
>>>>>>>> I'd probably prefer #1. It's a classic case of trying to get the
>>>>>>>> request, and if it fails, add ourselves to the sbitmap tag wait
>>>>>>>> queue head, retry, and bail if that also fails. Connecting the
>>>>>>>> scarce resource and the consumer is the only way to really fix
>>>>>>>> this, without bogus arbitrary delays.
>>>>>>>
>>>>>>> Right, as I have replied to Bart, using mod_delayed_work_on() with
>>>>>>> returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce
>>>>>>> resource should fix this issue.
>>>>>>
>>>>>> It'll fix the forever stall, but it won't really fix it, as we'll slow
>>>>>> down the dm device by some random amount.
>>>>>>
>>>>>> A simple test case would be to have a null_blk device with a queue depth
>>>>>> of one, and dm on top of that. Start a fio job that runs two jobs: one
>>>>>> that does IO to the underlying device, and one that does IO to the dm
>>>>>> device. If the job on the dm device runs substantially slower than the
>>>>>> one to the underlying device, then the problem isn't really fixed.
>>>>>
>>>>> I remembered that I tried this test on scsi-debug & dm-mpath over scsi-debug,
>>>>> seems not observed this issue, could you explain a bit why IO over dm-mpath
>>>>> may be slower? Because both two IO contexts call same get_request(), and
>>>>> in theory dm-mpath should be a bit quicker since it uses direct issue for
>>>>> underlying queue, without io scheduler involved.
>>>>
>>>> Because if you lose the race for getting the request, you'll have some
>>>> arbitrary delay before trying again, potentially. Compared to the direct
>>>
>>> But the restart still works, one request is completed, then the queue
>>> is return immediately because we use mod_delayed_work_on(0), so looks
>>> no such issue.
>>
>> There are no pending requests for this case, nothing to restart the
>> queue. When you fail that blk_get_request(), you are idle, nothing
>> is pending.
>
> I think we needn't worry about that, once a device is attached to
> dm-rq, it can't be mounted any more, and usually user don't use the device
> directly and by dm-mpath at the same time.

Here's an example of that, using my current block tree (merged into
master). The setup is dm-mpath on top of null_blk, the latter having
just a single request. Both are mq devices.

Fio direct 4k random reads on dm_mq: ~250K iops

Start dd on underlying device (or partition on same device), just doing
sequential reads.

Fio direct 4k random reads on dm_mq with dd running: 9 iops

No schedulers involved.

https://i.imgur.com/WTDnnwE.gif

--
Jens Axboe


2018-01-19 18:25:40

by Ming Lei

[permalink] [raw]
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Fri, Jan 19, 2018 at 10:38:41AM -0700, Jens Axboe wrote:
> On 1/19/18 9:37 AM, Ming Lei wrote:
> > On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote:
> >> On 1/19/18 9:26 AM, Ming Lei wrote:
> >>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote:
> >>>> On 1/19/18 9:05 AM, Ming Lei wrote:
> >>>>> On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote:
> >>>>>> On 1/19/18 8:40 AM, Ming Lei wrote:
> >>>>>>>>>> Where does the dm STS_RESOURCE error usually come from - what's exact
> >>>>>>>>>> resource are we running out of?
> >>>>>>>>>
> >>>>>>>>> It is from blk_get_request(underlying queue), see
> >>>>>>>>> multipath_clone_and_map().
> >>>>>>>>
> >>>>>>>> That's what I thought. So for a low queue depth underlying queue, it's
> >>>>>>>> quite possible that this situation can happen. Two potential solutions
> >>>>>>>> I see:
> >>>>>>>>
> >>>>>>>> 1) As described earlier in this thread, having a mechanism for being
> >>>>>>>> notified when the scarce resource becomes available. It would not
> >>>>>>>> be hard to tap into the existing sbitmap wait queue for that.
> >>>>>>>>
> >>>>>>>> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
> >>>>>>>> allocation. I haven't read the dm code to know if this is a
> >>>>>>>> possibility or not.
> >>>>>>>>
> >>>>>>>> I'd probably prefer #1. It's a classic case of trying to get the
> >>>>>>>> request, and if it fails, add ourselves to the sbitmap tag wait
> >>>>>>>> queue head, retry, and bail if that also fails. Connecting the
> >>>>>>>> scarce resource and the consumer is the only way to really fix
> >>>>>>>> this, without bogus arbitrary delays.
> >>>>>>>
> >>>>>>> Right, as I have replied to Bart, using mod_delayed_work_on() with
> >>>>>>> returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce
> >>>>>>> resource should fix this issue.
> >>>>>>
> >>>>>> It'll fix the forever stall, but it won't really fix it, as we'll slow
> >>>>>> down the dm device by some random amount.
> >>>>>>
> >>>>>> A simple test case would be to have a null_blk device with a queue depth
> >>>>>> of one, and dm on top of that. Start a fio job that runs two jobs: one
> >>>>>> that does IO to the underlying device, and one that does IO to the dm
> >>>>>> device. If the job on the dm device runs substantially slower than the
> >>>>>> one to the underlying device, then the problem isn't really fixed.
> >>>>>
> >>>>> I remembered that I tried this test on scsi-debug & dm-mpath over scsi-debug,
> >>>>> seems not observed this issue, could you explain a bit why IO over dm-mpath
> >>>>> may be slower? Because both two IO contexts call same get_request(), and
> >>>>> in theory dm-mpath should be a bit quicker since it uses direct issue for
> >>>>> underlying queue, without io scheduler involved.
> >>>>
> >>>> Because if you lose the race for getting the request, you'll have some
> >>>> arbitrary delay before trying again, potentially. Compared to the direct
> >>>
> >>> But the restart still works, one request is completed, then the queue
> >>> is return immediately because we use mod_delayed_work_on(0), so looks
> >>> no such issue.
> >>
> >> There are no pending requests for this case, nothing to restart the
> >> queue. When you fail that blk_get_request(), you are idle, nothing
> >> is pending.
> >
> > I think we needn't worry about that, once a device is attached to
> > dm-rq, it can't be mounted any more, and usually user don't use the device
> > directly and by dm-mpath at the same time.
>
> Here's an example of that, using my current block tree (merged into
> master). The setup is dm-mpath on top of null_blk, the latter having
> just a single request. Both are mq devices.
>
> Fio direct 4k random reads on dm_mq: ~250K iops
>
> Start dd on underlying device (or partition on same device), just doing
> sequential reads.
>
> Fio direct 4k random reads on dm_mq with dd running: 9 iops
>
> No schedulers involved.
>
> https://i.imgur.com/WTDnnwE.gif

This DM-specific issue might be addressed by applying a notifier_chain
(or similar mechanism) between the two queues; I will think about the
details tomorrow.
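
Roughly, with every name invented for illustration, the hookup could
look like:

#include <linux/blk-mq.h>
#include <linux/notifier.h>

/*
 * Everything below is invented for illustration.  Idea: the underlying
 * queue fires a notifier chain whenever a request is freed, and the
 * stacked dm-mpath queue registers a callback that kicks its own hctx,
 * turning the arbitrary requeue delay into an event-driven rerun.
 */
struct example_stacked_ctx {
	struct notifier_block nb;
	struct blk_mq_hw_ctx *hctx;	/* upper (dm-mpath) hctx to kick */
};

static int example_rq_freed(struct notifier_block *nb, unsigned long action,
			    void *data)
{
	struct example_stacked_ctx *ctx =
		container_of(nb, struct example_stacked_ctx, nb);

	/* A request below us was freed: rerun the upper queue now. */
	blk_mq_run_hw_queue(ctx->hctx, true);
	return NOTIFY_OK;
}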


--
Ming

2018-01-19 18:34:43

by Mike Snitzer

[permalink] [raw]
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Fri, Jan 19 2018 at 12:38pm -0500,
Jens Axboe <[email protected]> wrote:

> On 1/19/18 9:37 AM, Ming Lei wrote:
> > On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote:
> >>
> >> There are no pending requests for this case, nothing to restart the
> >> queue. When you fail that blk_get_request(), you are idle, nothing
> >> is pending.
> >
> > I think we needn't worry about that, once a device is attached to
> > dm-rq, it can't be mounted any more, and usually user don't use the device
> > directly and by dm-mpath at the same time.
>
> Here's an example of that, using my current block tree (merged into
> master). The setup is dm-mpath on top of null_blk, the latter having
> just a single request. Both are mq devices.
>
> Fio direct 4k random reads on dm_mq: ~250K iops
>
> Start dd on underlying device (or partition on same device), just doing
> sequential reads.
>
> Fio direct 4k random reads on dm_mq with dd running: 9 iops
>
> No schedulers involved.
>
> https://i.imgur.com/WTDnnwE.gif

FYI, your tree doesn't have these dm-4.16 changes (which are needed to
make Ming's blk-mq changes that are in your tree, commit 396eaf21e et
al, have Ming's desired effect on DM):

https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=050af08ffb1b62af69196d61c22a0755f9a3cdbd
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=459b54019cfeb7330ed4863ad40f78489e0ff23d
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=ec3eaf9a673106f66606896aed6ddd20180b02ec

The fact that you're seeing such shit results without dm-4.16 commit
ec3eaf9a67 (which reverts older commit 6077c2d706) means: yep, this
is really awful, let's fix it! But it is a different flavor of awful,
because dm-rq.c:map_request() will handle the DM_MAPIO_DELAY_REQUEUE
result from null_blk's blk_get_request() failure using
dm_mq_delay_requeue_request() against the dm-mq mpath device:

blk_mq_requeue_request(rq, false);
__dm_mq_kick_requeue_list(rq->q, msecs);

So it begs the question: why are we stalling, _exactly_? (You may have it
all figured out, as your gif implies.. but I'm not there yet.)

Might be interesting to see how your same test behaves without all of
the churn we've staged for 4.16, e.g. against v4.15-rc8

Mike

2018-01-19 19:48:37

by Bart Van Assche

[permalink] [raw]
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Fri, 2018-01-19 at 15:34 +0800, Ming Lei wrote:
> Could you explain a bit when SCSI target replies with BUSY very often?
>
> Inside initiator, we have limited the max per-LUN requests and per-host
> requests already before calling .queue_rq().

That's correct. However, when a SCSI initiator and target system are
communicating with each other there is no guarantee that initiator and target
queue depth have been tuned properly. If the initiator queue depth and the
number of requests that can be in flight according to the network protocol
are both larger than the target queue depth and if the target system uses
relatively slow storage (e.g. harddisks) then it can happen that the target
replies with BUSY often.

The Linux iSCSI initiator limits MaxOutstandingR2T (the number of requests
an initiator may send without having received an answer from the target) to
one, so I don't think this can happen when using iSCSI/TCP.

With the SRP initiator however the maximum requests that can be in flight
between initiator and target depends on the number of credits that were
negotiated during login between initiator and target. Some time ago I modified
the SRP initiator such that it limits the initiator queue depth to the number
of SRP credits minus one (for task management). That resulted in a performance
improvement due to fewer BUSY conditions at the initiator side (see also commit
7ade400aba9a ("IB/srp: Reduce number of BUSY conditions")). Another patch for
the SCST SRP target driver limited the number of SRP credits to the queue depth
of the block device at the target side. I'm referring to the following code:
ch->rq_size = min(MAX_SRPT_RQ_SIZE, scst_get_max_lun_commands(NULL, 0)) (I have
not yet had the time to port this change to LIO).

Without such tuning across initiator and target it can happen often that the
target system sends the reply "BUSY" back to the initiator. I think that's why
there is code in the SCSI core to automatically adjust the initiator queue
depth if the "BUSY" condition is encountered frequently. See also
scsi_track_queue_full().
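
For reference, that adjustment is driven from a driver's completion
handling; here is a minimal sketch (the surrounding driver context is
invented):

#include <scsi/scsi.h>
#include <scsi/scsi_device.h>

/*
 * Sketch only; the "example" names are invented.  When the target
 * answers with QUEUE FULL, the SCSI core can be asked to ramp the
 * per-device queue depth down to what was actually outstanding when
 * the condition hit.
 */
static void example_handle_status(struct scsi_device *sdev, u8 status,
				  int outstanding)
{
	if (status == SAM_STAT_TASK_SET_FULL)
		scsi_track_queue_full(sdev, outstanding - 1);
}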

Bart.

2018-01-19 23:54:16

by Ming Lei

[permalink] [raw]
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Fri, Jan 19, 2018 at 10:38:41AM -0700, Jens Axboe wrote:
> On 1/19/18 9:37 AM, Ming Lei wrote:
> > On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote:
> >> On 1/19/18 9:26 AM, Ming Lei wrote:
> >>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote:
> >>>> On 1/19/18 9:05 AM, Ming Lei wrote:
> >>>>> On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote:
> >>>>>> On 1/19/18 8:40 AM, Ming Lei wrote:
> >>>>>>>>>> Where does the dm STS_RESOURCE error usually come from - what's exact
> >>>>>>>>>> resource are we running out of?
> >>>>>>>>>
> >>>>>>>>> It is from blk_get_request(underlying queue), see
> >>>>>>>>> multipath_clone_and_map().
> >>>>>>>>
> >>>>>>>> That's what I thought. So for a low queue depth underlying queue, it's
> >>>>>>>> quite possible that this situation can happen. Two potential solutions
> >>>>>>>> I see:
> >>>>>>>>
> >>>>>>>> 1) As described earlier in this thread, having a mechanism for being
> >>>>>>>> notified when the scarce resource becomes available. It would not
> >>>>>>>> be hard to tap into the existing sbitmap wait queue for that.
> >>>>>>>>
> >>>>>>>> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
> >>>>>>>> allocation. I haven't read the dm code to know if this is a
> >>>>>>>> possibility or not.
> >>>>>>>>
> >>>>>>>> I'd probably prefer #1. It's a classic case of trying to get the
> >>>>>>>> request, and if it fails, add ourselves to the sbitmap tag wait
> >>>>>>>> queue head, retry, and bail if that also fails. Connecting the
> >>>>>>>> scarce resource and the consumer is the only way to really fix
> >>>>>>>> this, without bogus arbitrary delays.
> >>>>>>>
> >>>>>>> Right, as I have replied to Bart, using mod_delayed_work_on() with
> >>>>>>> returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce
> >>>>>>> resource should fix this issue.
> >>>>>>
> >>>>>> It'll fix the forever stall, but it won't really fix it, as we'll slow
> >>>>>> down the dm device by some random amount.
> >>>>>>
> >>>>>> A simple test case would be to have a null_blk device with a queue depth
> >>>>>> of one, and dm on top of that. Start a fio job that runs two jobs: one
> >>>>>> that does IO to the underlying device, and one that does IO to the dm
> >>>>>> device. If the job on the dm device runs substantially slower than the
> >>>>>> one to the underlying device, then the problem isn't really fixed.
> >>>>>
> >>>>> I remembered that I tried this test on scsi-debug & dm-mpath over scsi-debug,
> >>>>> seems not observed this issue, could you explain a bit why IO over dm-mpath
> >>>>> may be slower? Because both two IO contexts call same get_request(), and
> >>>>> in theory dm-mpath should be a bit quicker since it uses direct issue for
> >>>>> underlying queue, without io scheduler involved.
> >>>>
> >>>> Because if you lose the race for getting the request, you'll have some
> >>>> arbitrary delay before trying again, potentially. Compared to the direct
> >>>
> >>> But the restart still works, one request is completed, then the queue
> >>> is return immediately because we use mod_delayed_work_on(0), so looks
> >>> no such issue.
> >>
> >> There are no pending requests for this case, nothing to restart the
> >> queue. When you fail that blk_get_request(), you are idle, nothing
> >> is pending.
> >
> > I think we needn't worry about that, once a device is attached to
> > dm-rq, it can't be mounted any more, and usually user don't use the device
> > directly and by dm-mpath at the same time.
>
> Here's an example of that, using my current block tree (merged into
> master). The setup is dm-mpath on top of null_blk, the latter having
> just a single request. Both are mq devices.
>
> Fio direct 4k random reads on dm_mq: ~250K iops
>
> Start dd on underlying device (or partition on same device), just doing
> sequential reads.
>
> Fio direct 4k random reads on dm_mq with dd running: 9 iops
>
> No schedulers involved.
>
> https://i.imgur.com/WTDnnwE.gif

If null_blk's timer mode is used with a bit of delay introduced, I guess
the effect of direct access to the underlying queue shouldn't be so
serious. But it still won't be as good as direct access.

Another way may be to introduce a variant of blk_get_request(), such as
blk_get_request_with_notify(), then pass the current dm-rq hctx to
it, and use the tag's waitqueue to handle that. But the change can be
a bit big.
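
Something like the following, where everything except blk_get_request()
itself is hypothetical:

#include <linux/blk-mq.h>
#include <linux/err.h>

/* Invented helper: park notify_hctx on q's tag waitqueue so that a
 * freed tag ends up calling blk_mq_run_hw_queue(notify_hctx, true). */
static void example_register_tag_waiter(struct request_queue *q,
					struct blk_mq_hw_ctx *notify_hctx);

/*
 * Hypothetical variant: try the normal atomic allocation, and on
 * failure arrange for the caller's (dm-rq) hctx to be kicked once a
 * tag is released, instead of leaving it idle with nothing pending.
 */
static struct request *
blk_get_request_with_notify(struct request_queue *q, unsigned int op,
			    struct blk_mq_hw_ctx *notify_hctx)
{
	struct request *rq = blk_get_request(q, op, GFP_ATOMIC);

	if (!IS_ERR(rq))
		return rq;

	example_register_tag_waiter(q, notify_hctx);
	return ERR_PTR(-EAGAIN);
}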

--
Ming

2018-01-19 23:59:07

by Ming Lei

[permalink] [raw]
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On Fri, Jan 19, 2018 at 09:23:35AM -0700, Jens Axboe wrote:
> On 1/19/18 9:13 AM, Mike Snitzer wrote:
> > On Fri, Jan 19 2018 at 10:48am -0500,
> > Jens Axboe <[email protected]> wrote:
> >
> >> On 1/19/18 8:40 AM, Ming Lei wrote:
> >>>>>> Where does the dm STS_RESOURCE error usually come from - what's exact
> >>>>>> resource are we running out of?
> >>>>>
> >>>>> It is from blk_get_request(underlying queue), see
> >>>>> multipath_clone_and_map().
> >>>>
> >>>> That's what I thought. So for a low queue depth underlying queue, it's
> >>>> quite possible that this situation can happen. Two potential solutions
> >>>> I see:
> >>>>
> >>>> 1) As described earlier in this thread, having a mechanism for being
> >>>> notified when the scarce resource becomes available. It would not
> >>>> be hard to tap into the existing sbitmap wait queue for that.
> >>>>
> >>>> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
> >>>> allocation. I haven't read the dm code to know if this is a
> >>>> possibility or not.
> >
> > Right, #2 is _not_ the way forward. Historically request-based DM used
> > its own mempool for requests, this was to be able to have some measure
> > of control and resiliency in the face of low memory conditions that
> > might be affecting the broader system.
> >
> > Then Christoph switched over to adding per-request-data; which ushered
> > in the use of blk_get_request using ATOMIC allocations. I like the
> > result of that line of development. But taking the next step of setting
> > BLK_MQ_F_BLOCKING is highly unfortunate (especially in that this
> > dm-mpath.c code is common to old .request_fn and blk-mq, at least the
> > call to blk_get_request is). Ultimately dm-mpath like to avoid blocking
> > for a request because for this dm-mpath device we have multiple queues
> > to allocate from if need be (provided we have an active-active storage
> > network topology).
>
> If you can go to multiple devices, obviously it should not block on a
> single device. That's only true for the case where you can only go to
> one device, blocking at that point would probably be fine. Or if all
> your paths are busy, then blocking would also be OK.

Introducing one extra blocking point will hurt AIO performance, where
there are usually far fewer jobs/processes submitting IO.
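
For clarity, the blocking point in question would come from something
like this (names hypothetical):

#include <linux/blk-mq.h>

/*
 * Sketch of the rejected option #2.  With BLK_MQ_F_BLOCKING set,
 * ->queue_rq() runs in a context that may sleep, so the clone
 * allocation could use GFP_NOIO and block instead of failing fast
 * with GFP_ATOMIC.  That sleep is the extra blocking point, and it
 * penalizes AIO, where few submitters are around to keep the queue
 * busy while one of them sleeps.
 */
static int example_init_tag_set(struct blk_mq_tag_set *set)
{
	set->flags |= BLK_MQ_F_BLOCKING;	/* ->queue_rq() may sleep */
	return blk_mq_alloc_tag_set(set);
}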

--
Ming

2018-01-20 04:29:36

by Jens Axboe

[permalink] [raw]
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On 1/19/18 4:52 PM, Ming Lei wrote:
> On Fri, Jan 19, 2018 at 10:38:41AM -0700, Jens Axboe wrote:
>> On 1/19/18 9:37 AM, Ming Lei wrote:
>>> On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote:
>>>> On 1/19/18 9:26 AM, Ming Lei wrote:
>>>>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote:
>>>>>> On 1/19/18 9:05 AM, Ming Lei wrote:
>>>>>>> On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote:
>>>>>>>> On 1/19/18 8:40 AM, Ming Lei wrote:
>>>>>>>>>>>> Where does the dm STS_RESOURCE error usually come from - what's exact
>>>>>>>>>>>> resource are we running out of?
>>>>>>>>>>>
>>>>>>>>>>> It is from blk_get_request(underlying queue), see
>>>>>>>>>>> multipath_clone_and_map().
>>>>>>>>>>
>>>>>>>>>> That's what I thought. So for a low queue depth underlying queue, it's
>>>>>>>>>> quite possible that this situation can happen. Two potential solutions
>>>>>>>>>> I see:
>>>>>>>>>>
>>>>>>>>>> 1) As described earlier in this thread, having a mechanism for being
>>>>>>>>>> notified when the scarce resource becomes available. It would not
>>>>>>>>>> be hard to tap into the existing sbitmap wait queue for that.
>>>>>>>>>>
>>>>>>>>>> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
>>>>>>>>>> allocation. I haven't read the dm code to know if this is a
>>>>>>>>>> possibility or not.
>>>>>>>>>>
>>>>>>>>>> I'd probably prefer #1. It's a classic case of trying to get the
>>>>>>>>>> request, and if it fails, add ourselves to the sbitmap tag wait
>>>>>>>>>> queue head, retry, and bail if that also fails. Connecting the
>>>>>>>>>> scarce resource and the consumer is the only way to really fix
>>>>>>>>>> this, without bogus arbitrary delays.
>>>>>>>>>
>>>>>>>>> Right, as I have replied to Bart, using mod_delayed_work_on() with
>>>>>>>>> returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce
>>>>>>>>> resource should fix this issue.
>>>>>>>>
>>>>>>>> It'll fix the forever stall, but it won't really fix it, as we'll slow
>>>>>>>> down the dm device by some random amount.
>>>>>>>>
>>>>>>>> A simple test case would be to have a null_blk device with a queue depth
>>>>>>>> of one, and dm on top of that. Start a fio job that runs two jobs: one
>>>>>>>> that does IO to the underlying device, and one that does IO to the dm
>>>>>>>> device. If the job on the dm device runs substantially slower than the
>>>>>>>> one to the underlying device, then the problem isn't really fixed.
>>>>>>>
>>>>>>> I remembered that I tried this test on scsi-debug & dm-mpath over scsi-debug,
>>>>>>> seems not observed this issue, could you explain a bit why IO over dm-mpath
>>>>>>> may be slower? Because both two IO contexts call same get_request(), and
>>>>>>> in theory dm-mpath should be a bit quicker since it uses direct issue for
>>>>>>> underlying queue, without io scheduler involved.
>>>>>>
>>>>>> Because if you lose the race for getting the request, you'll have some
>>>>>> arbitrary delay before trying again, potentially. Compared to the direct
>>>>>
>>>>> But the restart still works, one request is completed, then the queue
>>>>> is return immediately because we use mod_delayed_work_on(0), so looks
>>>>> no such issue.
>>>>
>>>> There are no pending requests for this case, nothing to restart the
>>>> queue. When you fail that blk_get_request(), you are idle, nothing
>>>> is pending.
>>>
>>> I think we needn't worry about that, once a device is attached to
>>> dm-rq, it can't be mounted any more, and usually user don't use the device
>>> directly and by dm-mpath at the same time.
>>
>> Here's an example of that, using my current block tree (merged into
>> master). The setup is dm-mpath on top of null_blk, the latter having
>> just a single request. Both are mq devices.
>>
>> Fio direct 4k random reads on dm_mq: ~250K iops
>>
>> Start dd on underlying device (or partition on same device), just doing
>> sequential reads.
>>
>> Fio direct 4k random reads on dm_mq with dd running: 9 iops
>>
>> No schedulers involved.
>>
>> https://i.imgur.com/WTDnnwE.gif
>
> If null_blk's timer mode is used with a bit delay introduced, I guess
> the effect from direct access to underlying queue shouldn't be so
> serious. But it still won't be good as direct access.

Doesn't matter if it's used at the default of 10usec completion
latency, or inline complete. Same result, I ran both.

> Another way may be to introduce a variants blk_get_request(), such as
> blk_get_request_with_notify(), then pass the current dm-rq's hctx to
> it, and use the tag's waitqueue to handle that. But the change can be
> a bit big.

Yes, that's exactly the solution I suggested both yesterday and today.

--
Jens Axboe


2018-01-23 09:22:58

by Mike Snitzer

[permalink] [raw]
Subject: [PATCH] block: neutralize blk_insert_cloned_request IO stall regression (was: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle)

On Thu, Jan 18 2018 at 5:20pm -0500,
Bart Van Assche <[email protected]> wrote:

> On Thu, 2018-01-18 at 17:01 -0500, Mike Snitzer wrote:
> > And yet Laurence cannot reproduce any such lockups with your test...
>
> Hmm ... maybe I misunderstood Laurence but I don't think that Laurence has
> already succeeded at running an unmodified version of my tests. In one of the
> e-mails Laurence sent me this morning I read that he modified these scripts
> to get past a kernel module unload failure that was reported while starting
> these tests. So the next step is to check which changes were made to the test
> scripts and also whether the test results are still valid.
>
> > Are you absolutely certain this patch doesn't help you?
> > https://patchwork.kernel.org/patch/10174037/
> >
> > If it doesn't then that is actually very useful to know.
>
> The first I tried this morning is to run the srp-test software against a merge
> of Jens' for-next branch and your dm-4.16 branch. Since I noticed that the dm
> queue locked up I reinserted a blk_mq_delay_run_hw_queue() call in the dm code.
> Since even that was not sufficient I tried to kick the queues via debugfs (for
> s in /sys/kernel/debug/block/*/state; do echo kick >$s; done). Since that was
> not sufficient to resolve the queue stall I reverted the following tree patches
> that are in Jens' tree:
> * "blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback"
> * "blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request"
> * "blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue is busy"
>
> Only after I had done this the srp-test software ran again without triggering
> dm queue lockups.

Given that Ming's notifier-based patchset needs more development time I
think we're unfortunately past the point where we can comfortably wait
for that to be ready.

So we need to explore alternative ways to fix this IO stall regression.
Rather than attempt the above block reverts (which is an incomplete
listing given newer changes): might we develop a more targeted code
change to neutralize commit 396eaf21ee ("blk-mq: improve DM's blk-mq IO
merging via blk_insert_cloned_request feedback")? -- which, given Bart's
findings above, seems to be the most problematic block commit.

To that end, assuming I drop this commit from dm-4.16:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=316a795ad388e0c3ca613454851a28079d917a92

Here is my proposal for putting this regression behind us for 4.16
(Ming's line of development would continue and hopefully be included in
4.17):

From: Mike Snitzer <[email protected]>
Date: Tue, 23 Jan 2018 09:40:22 +0100
Subject: [PATCH] block: neutralize blk_insert_cloned_request IO stall regression

The series of blk-mq changes was intended to improve sequential IO
performance (through improved merging with dm-mpath blk-mq stacked on an
underlying blk-mq device). Unfortunately these changes have caused
dm-mpath blk-mq IO stalls when blk_mq_request_issue_directly()'s call to
q->mq_ops->queue_rq() fails (due to device-specific resource
unavailability).

Fix this by reverting back to how blk_insert_cloned_request() functioned
prior to commit 396eaf21ee -- by using blk_mq_request_bypass_insert()
instead of blk_mq_request_issue_directly().

In the future, this commit should be reverted as the first change in a
followup series of changes that implements a comprehensive solution to
allowing an underlying blk-mq queue's resource limitation to trigger the
upper blk-mq queue to run once that underlying limited resource is
replenished.

Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
Signed-off-by: Mike Snitzer <[email protected]>
---
block/blk-core.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index cdae69be68e9..a224f282b4a6 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2520,7 +2520,8 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
* bypass a potential scheduler on the bottom device for
* insert.
*/
- return blk_mq_request_issue_directly(rq);
+ blk_mq_request_bypass_insert(rq, true);
+ return BLK_STS_OK;
}

spin_lock_irqsave(q->queue_lock, flags);
--
2.15.0


2018-01-23 10:54:54

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH] block: neutralize blk_insert_cloned_request IO stall regression (was: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle)

Hi Mike,

On Tue, Jan 23, 2018 at 10:22:04AM +0100, Mike Snitzer wrote:
> On Thu, Jan 18 2018 at 5:20pm -0500,
> Bart Van Assche <[email protected]> wrote:
>
> > On Thu, 2018-01-18 at 17:01 -0500, Mike Snitzer wrote:
> > > And yet Laurence cannot reproduce any such lockups with your test...
> >
> > Hmm ... maybe I misunderstood Laurence but I don't think that Laurence has
> > already succeeded at running an unmodified version of my tests. In one of the
> > e-mails Laurence sent me this morning I read that he modified these scripts
> > to get past a kernel module unload failure that was reported while starting
> > these tests. So the next step is to check which changes were made to the test
> > scripts and also whether the test results are still valid.
> >
> > > Are you absolutely certain this patch doesn't help you?
> > > https://patchwork.kernel.org/patch/10174037/
> > >
> > > If it doesn't then that is actually very useful to know.
> >
> > The first I tried this morning is to run the srp-test software against a merge
> > of Jens' for-next branch and your dm-4.16 branch. Since I noticed that the dm
> > queue locked up I reinserted a blk_mq_delay_run_hw_queue() call in the dm code.
> > Since even that was not sufficient I tried to kick the queues via debugfs (for
> > s in /sys/kernel/debug/block/*/state; do echo kick >$s; done). Since that was
> > not sufficient to resolve the queue stall I reverted the following tree patches
> > that are in Jens' tree:
> > * "blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback"
> > * "blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request"
> > * "blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue is busy"
> >
> > Only after I had done this the srp-test software ran again without triggering
> > dm queue lockups.
>
> Given that Ming's notifier-based patchset needs more development time I
> think we're unfortunately past the point where we can comfortably wait
> for that to be ready.
>
> So we need to explore alternatives to fixing this IO stall regression.

The fix for the IO stall doesn't need the notifier-based patchset; only
the 1st patch is enough to fix the IO stall. And it is a generic
issue, which needs a generic solution; that is the conclusion made by
Jens and me.

https://marc.info/?l=linux-kernel&m=151638176727612&w=2

And the notifier-based patchset is for solving the performance issue
reported by Jens:

- run IO on dm-mpath
- run background IO on low depth underlying queue
- then IO performance on dm-mpath is extremely slow

I will send out the 1st patch, 'blk-mq: introduce BLK_STS_DEV_RESOURCE',
soon, but the notifier-based patchset shouldn't be very urgent, since
the above test case isn't common in reality.

> Rather than attempt the above block reverts (which is an incomplete
> listing given newer changes): might we develop a more targeted code
> change to neutralize commit 396eaf21ee ("blk-mq: improve DM's blk-mq IO
> merging via blk_insert_cloned_request feedback")? -- which, given Bart's
> findings above, seems to be the most problematic block commit.

The stall isn't related to commit 396eaf21ee either.

>
> To that end, assuming I drop this commit from dm-4.16:
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=316a795ad388e0c3ca613454851a28079d917a92
>
> Here is my proposal for putting this regression behind us for 4.16
> (Ming's line of development would continue and hopefully be included in
> 4.17):

Actually the notifier-based approach is ready, and even the cache for
clones is ready too, but the test result isn't good enough for random IO
in Jens's case above, while sequential IO is much better with both the
clone cache and notifier-based allocation (much better than non-mq).
Here is the tree, if anyone is interested:

https://github.com/ming1/linux/commits/v4.15-rc-block-dm-mpath

Now it looks like there is still one issue: the notifier can come early,
just before the request is added to hctx->dispatch_list, and performance
still gets hurt, especially for random IO in Jens's case. But the queue
won't stall, :-)

>
> From: Mike Snitzer <[email protected]>
> Date: Tue, 23 Jan 2018 09:40:22 +0100
> Subject: [PATCH] block: neutralize blk_insert_cloned_request IO stall regression
>
> The series of blk-mq changes intended to improve sequential IO
> performace (through improved merging with dm-mapth blk-mq stacked on
> underlying blk-mq device). Unfortunately these changes have caused
> dm-mpath blk-mq IO stalls when blk_mq_request_issue_directly()'s call to
> q->mq_ops->queue_rq() fails (due to device-specific resource
> unavailability).
>
> Fix this by reverting back to how blk_insert_cloned_request() functioned
> prior to commit 396eaf21ee -- by using blk_mq_request_bypass_insert()
> instead of blk_mq_request_issue_directly().
>
> In the future, this commit should be reverted as the first change in a
> followup series of changes that implements a comprehensive solution to
> allowing an underlying blk-mq queue's resource limitation to trigger the
> upper blk-mq queue to run once that underlying limited resource is
> replenished.
>
> Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
> Signed-off-by: Mike Snitzer <[email protected]>
> ---
> block/blk-core.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index cdae69be68e9..a224f282b4a6 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2520,7 +2520,8 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
> * bypass a potential scheduler on the bottom device for
> * insert.
> */
> - return blk_mq_request_issue_directly(rq);
> + blk_mq_request_bypass_insert(rq, true);
> + return BLK_STS_OK;
> }

If this patch is meant to fix the IO stall on DM, it isn't needed, and
actually it can't fix the IO stall issue.

--
Ming

2018-01-23 12:16:07

by Mike Snitzer

[permalink] [raw]
Subject: Re: block: neutralize blk_insert_cloned_request IO stall regression (was: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle)

On Tue, Jan 23 2018 at 5:53am -0500,
Ming Lei <[email protected]> wrote:

> Hi Mike,
>
> On Tue, Jan 23, 2018 at 10:22:04AM +0100, Mike Snitzer wrote:
> > On Thu, Jan 18 2018 at 5:20pm -0500,
> > Bart Van Assche <[email protected]> wrote:
> >
> > > On Thu, 2018-01-18 at 17:01 -0500, Mike Snitzer wrote:
> > > > And yet Laurence cannot reproduce any such lockups with your test...
> > >
> > > Hmm ... maybe I misunderstood Laurence but I don't think that Laurence has
> > > already succeeded at running an unmodified version of my tests. In one of the
> > > e-mails Laurence sent me this morning I read that he modified these scripts
> > > to get past a kernel module unload failure that was reported while starting
> > > these tests. So the next step is to check which changes were made to the test
> > > scripts and also whether the test results are still valid.
> > >
> > > > Are you absolutely certain this patch doesn't help you?
> > > > https://patchwork.kernel.org/patch/10174037/
> > > >
> > > > If it doesn't then that is actually very useful to know.
> > >
> > > The first I tried this morning is to run the srp-test software against a merge
> > > of Jens' for-next branch and your dm-4.16 branch. Since I noticed that the dm
> > > queue locked up I reinserted a blk_mq_delay_run_hw_queue() call in the dm code.
> > > Since even that was not sufficient I tried to kick the queues via debugfs (for
> > > s in /sys/kernel/debug/block/*/state; do echo kick >$s; done). Since that was
> > > not sufficient to resolve the queue stall I reverted the following tree patches
> > > that are in Jens' tree:
> > > * "blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback"
> > > * "blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request"
> > > * "blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue is busy"
> > >
> > > Only after I had done this the srp-test software ran again without triggering
> > > dm queue lockups.
> >
> > Given that Ming's notifier-based patchset needs more development time I
> > think we're unfortunately past the point where we can comfortably wait
> > for that to be ready.
> >
> > So we need to explore alternatives to fixing this IO stall regression.
>
> The fix for IO stall doesn't need the notifier-based patchset, and only
> the 1st patch is enough for fixing the IO stall. And it is a generic
> issue, which need generic solution, that is the conclusion made by
> Jens and me.
>
> https://marc.info/?l=linux-kernel&m=151638176727612&w=2

That's fine if Bart verifies it.

> And the notifier-based patchset is for solving the performance issue
> reported by Jens:
>
> - run IO on dm-mpath
> - run background IO on low depth underlying queue
> - then IO performance on dm-mpath is extremely slow
>
> I will send out the 1st patch of 'blk-mq: introduce BLK_STS_DEV_RESOURCE'
> soon, but the notifier-based patchset shouldn't be very urgent, since
> the above test case isn't usual in reality.
>
> > Rather than attempt the above block reverts (which is an incomplete
> > listing given newer changes): might we develop a more targeted code
> > change to neutralize commit 396eaf21ee ("blk-mq: improve DM's blk-mq IO
> > merging via blk_insert_cloned_request feedback")? -- which, given Bart's
> > findings above, seems to be the most problematic block commit.
>
> The stall isn't related with commit 396eaf21ee too.
>
> >
> > To that end, assuming I drop this commit from dm-4.16:
> > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=316a795ad388e0c3ca613454851a28079d917a92
> >
> > Here is my proposal for putting this regression behind us for 4.16
> > (Ming's line of development would continue and hopefully be included in
> > 4.17):
>
> Actually notifier based approach is ready, even cache for clone is ready
> too, but the test result isn't good enough on random IO on Jens's above
> case, and sequential IO is much better with both cache clone and
> notifier based allocation(much better than non-mq). And follows the tree
> if anyone is interested:
>
> https://github.com/ming1/linux/commits/v4.15-rc-block-dm-mpath
>
> Now looks there is still one issue: the notifier can come early, just
> before the request is added to hctx->dispatch_list, and performance
> still gets hurt, especially on random IO in Jens's case. But queue
> won't stall, :-)
>
> >
> > From: Mike Snitzer <[email protected]>
> > Date: Tue, 23 Jan 2018 09:40:22 +0100
> > Subject: [PATCH] block: neutralize blk_insert_cloned_request IO stall regression
> >
> > The series of blk-mq changes intended to improve sequential IO
> > performace (through improved merging with dm-mapth blk-mq stacked on
> > underlying blk-mq device). Unfortunately these changes have caused
> > dm-mpath blk-mq IO stalls when blk_mq_request_issue_directly()'s call to
> > q->mq_ops->queue_rq() fails (due to device-specific resource
> > unavailability).
> >
> > Fix this by reverting back to how blk_insert_cloned_request() functioned
> > prior to commit 396eaf21ee -- by using blk_mq_request_bypass_insert()
> > instead of blk_mq_request_issue_directly().
> >
> > In the future, this commit should be reverted as the first change in a
> > followup series of changes that implements a comprehensive solution to
> > allowing an underlying blk-mq queue's resource limitation to trigger the
> > upper blk-mq queue to run once that underlying limited resource is
> > replenished.
> >
> > Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
> > Signed-off-by: Mike Snitzer <[email protected]>
> > ---
> > block/blk-core.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index cdae69be68e9..a224f282b4a6 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -2520,7 +2520,8 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
> > * bypass a potential scheduler on the bottom device for
> > * insert.
> > */
> > - return blk_mq_request_issue_directly(rq);
> > + blk_mq_request_bypass_insert(rq, true);
> > + return BLK_STS_OK;
> > }
>
> If this patch is for fixing IO stall on DM, it isn't needed, and actually
> it can't fix the IO stall issue.

It should "fix it" in conjunction with this:

https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=87b7e73546b55f4a3a37d4726daedd9a17a20509

(Bart already said as much, I just effectively enabled the equivalent of
his reverts)

2018-01-23 12:18:46

by Ming Lei

[permalink] [raw]
Subject: Re: block: neutralize blk_insert_cloned_request IO stall regression (was: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle)

On Tue, Jan 23, 2018 at 8:15 PM, Mike Snitzer <[email protected]> wrote:
> On Tue, Jan 23 2018 at 5:53am -0500,
> Ming Lei <[email protected]> wrote:
>
>> Hi Mike,
>>
>> On Tue, Jan 23, 2018 at 10:22:04AM +0100, Mike Snitzer wrote:
>> > On Thu, Jan 18 2018 at 5:20pm -0500,
>> > Bart Van Assche <[email protected]> wrote:
>> >
>> > > On Thu, 2018-01-18 at 17:01 -0500, Mike Snitzer wrote:
>> > > > And yet Laurence cannot reproduce any such lockups with your test...
>> > >
>> > > Hmm ... maybe I misunderstood Laurence but I don't think that Laurence has
>> > > already succeeded at running an unmodified version of my tests. In one of the
>> > > e-mails Laurence sent me this morning I read that he modified these scripts
>> > > to get past a kernel module unload failure that was reported while starting
>> > > these tests. So the next step is to check which changes were made to the test
>> > > scripts and also whether the test results are still valid.
>> > >
>> > > > Are you absolutely certain this patch doesn't help you?
>> > > > https://patchwork.kernel.org/patch/10174037/
>> > > >
>> > > > If it doesn't then that is actually very useful to know.
>> > >
>> > > The first I tried this morning is to run the srp-test software against a merge
>> > > of Jens' for-next branch and your dm-4.16 branch. Since I noticed that the dm
>> > > queue locked up I reinserted a blk_mq_delay_run_hw_queue() call in the dm code.
>> > > Since even that was not sufficient I tried to kick the queues via debugfs (for
>> > > s in /sys/kernel/debug/block/*/state; do echo kick >$s; done). Since that was
>> > > not sufficient to resolve the queue stall I reverted the following tree patches
>> > > that are in Jens' tree:
>> > > * "blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback"
>> > > * "blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request"
>> > > * "blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue is busy"
>> > >
>> > > Only after I had done this the srp-test software ran again without triggering
>> > > dm queue lockups.
>> >
>> > Given that Ming's notifier-based patchset needs more development time I
>> > think we're unfortunately past the point where we can comfortably wait
>> > for that to be ready.
>> >
>> > So we need to explore alternatives to fixing this IO stall regression.
>>
>> The fix for IO stall doesn't need the notifier-based patchset, and only
>> the 1st patch is enough for fixing the IO stall. And it is a generic
>> issue, which need generic solution, that is the conclusion made by
>> Jens and me.
>>
>> https://marc.info/?l=linux-kernel&m=151638176727612&w=2
>
> That's fine if Bart verifies it.
>
>> And the notifier-based patchset is for solving the performance issue
>> reported by Jens:
>>
>> - run IO on dm-mpath
>> - run background IO on low depth underlying queue
>> - then IO performance on dm-mpath is extremely slow
>>
>> I will send out the 1st patch of 'blk-mq: introduce BLK_STS_DEV_RESOURCE'
>> soon, but the notifier-based patchset shouldn't be very urgent, since
>> the above test case isn't usual in reality.
>>
>> > Rather than attempt the above block reverts (which is an incomplete
>> > listing given newer changes): might we develop a more targeted code
>> > change to neutralize commit 396eaf21ee ("blk-mq: improve DM's blk-mq IO
>> > merging via blk_insert_cloned_request feedback")? -- which, given Bart's
>> > findings above, seems to be the most problematic block commit.
>>
>> The stall isn't related with commit 396eaf21ee too.
>>
>> >
>> > To that end, assuming I drop this commit from dm-4.16:
>> > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=316a795ad388e0c3ca613454851a28079d917a92
>> >
>> > Here is my proposal for putting this regression behind us for 4.16
>> > (Ming's line of development would continue and hopefully be included in
>> > 4.17):
>>
>> Actually notifier based approach is ready, even cache for clone is ready
>> too, but the test result isn't good enough on random IO on Jens's above
>> case, and sequential IO is much better with both cache clone and
>> notifier based allocation(much better than non-mq). And follows the tree
>> if anyone is interested:
>>
>> https://github.com/ming1/linux/commits/v4.15-rc-block-dm-mpath
>>
>> Now it looks like there is still one issue: the notifier can fire
>> early, just before the request is added to hctx->dispatch_list, and
>> performance still gets hurt, especially for random IO in Jens's case.
>> But the queue won't stall, :-)
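>>
>> To make that window concrete, the dispatch path does, in effect, the
>> following (a simplified sketch, not the code from the tree above):
>>
>> static void demo_dispatch(struct request_queue *q,
>>                           struct blk_mq_hw_ctx *hctx, struct request *rq)
>> {
>>         struct blk_mq_queue_data bd = { .rq = rq };
>>
>>         if (q->mq_ops->queue_rq(hctx, &bd) == BLK_STS_RESOURCE) {
>>                 /*
>>                  * Window: if the underlying queue replenishes the
>>                  * resource and the notifier runs blk_mq_run_hw_queue()
>>                  * right here, it sees an empty hctx->dispatch and
>>                  * dispatches nothing ...
>>                  */
>>                 spin_lock(&hctx->lock);
>>                 list_add(&rq->queuelist, &hctx->dispatch);
>>                 spin_unlock(&hctx->lock);
>>                 /*
>>                  * ... so rq now waits for the next notification or
>>                  * queue run: no stall, but extra latency, which shows
>>                  * up most on random IO.
>>                  */
>>         }
>> }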
>>
>> >
>> > From: Mike Snitzer <[email protected]>
>> > Date: Tue, 23 Jan 2018 09:40:22 +0100
>> > Subject: [PATCH] block: neutralize blk_insert_cloned_request IO stall regression
>> >
>> > The recent series of blk-mq changes was intended to improve sequential
>> > IO performance (through improved merging with dm-mpath blk-mq stacked
>> > on an underlying blk-mq device). Unfortunately these changes have caused
>> > dm-mpath blk-mq IO stalls when blk_mq_request_issue_directly()'s call to
>> > q->mq_ops->queue_rq() fails (due to device-specific resource
>> > unavailability).
>> >
>> > Fix this by reverting back to how blk_insert_cloned_request() functioned
>> > prior to commit 396eaf21ee -- by using blk_mq_request_bypass_insert()
>> > instead of blk_mq_request_issue_directly().
>> >
>> > In the future, this commit should be reverted as the first change in a
>> > followup series of changes that implements a comprehensive solution,
>> > one that allows an underlying blk-mq queue's resource limitation to
>> > trigger the upper blk-mq queue to run once that underlying limited
>> > resource is replenished.
>> >
>> > Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
>> > Signed-off-by: Mike Snitzer <[email protected]>
>> > ---
>> > block/blk-core.c | 3 ++-
>> > 1 file changed, 2 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/block/blk-core.c b/block/blk-core.c
>> > index cdae69be68e9..a224f282b4a6 100644
>> > --- a/block/blk-core.c
>> > +++ b/block/blk-core.c
>> > @@ -2520,7 +2520,8 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
>> >                  * bypass a potential scheduler on the bottom device for
>> >                  * insert.
>> >                  */
>> > -               return blk_mq_request_issue_directly(rq);
>> > +               blk_mq_request_bypass_insert(rq, true);
>> > +               return BLK_STS_OK;
>> >         }
>>
>> If this patch is meant to fix the IO stall on DM, it isn't needed, and
>> actually it can't fix the IO stall issue.
>
> It should "fix it" in conjunction with this:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=87b7e73546b55f4a3a37d4726daedd9a17a20509
>
> (Bart already said as much, I just effectively enabled the equivalent of
> his reverts)

If the generic solution is taken, Bart's revert isn't needed.


--
Ming Lei

2018-01-23 12:45:37

by Mike Snitzer

[permalink] [raw]
Subject: Re: block: neutralize blk_insert_cloned_request IO stall regression (was: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle)

On Tue, Jan 23 2018 at 7:17am -0500,
Ming Lei <[email protected]> wrote:

> On Tue, Jan 23, 2018 at 8:15 PM, Mike Snitzer <[email protected]> wrote:
> > On Tue, Jan 23 2018 at 5:53am -0500,
> > Ming Lei <[email protected]> wrote:
> >
> >> Hi Mike,
> >>
> >> On Tue, Jan 23, 2018 at 10:22:04AM +0100, Mike Snitzer wrote:
> >> >
> >> > From: Mike Snitzer <[email protected]>
> >> > Date: Tue, 23 Jan 2018 09:40:22 +0100
> >> > Subject: [PATCH] block: neutralize blk_insert_cloned_request IO stall regression
> >> >
> >> > The recent series of blk-mq changes was intended to improve sequential
> >> > IO performance (through improved merging with dm-mpath blk-mq stacked
> >> > on an underlying blk-mq device). Unfortunately these changes have caused
> >> > dm-mpath blk-mq IO stalls when blk_mq_request_issue_directly()'s call to
> >> > q->mq_ops->queue_rq() fails (due to device-specific resource
> >> > unavailability).
> >> >
> >> > Fix this by reverting back to how blk_insert_cloned_request() functioned
> >> > prior to commit 396eaf21ee -- by using blk_mq_request_bypass_insert()
> >> > instead of blk_mq_request_issue_directly().
> >> >
> >> > In the future, this commit should be reverted as the first change in a
> >> > followup series of changes that implements a comprehensive solution,
> >> > one that allows an underlying blk-mq queue's resource limitation to
> >> > trigger the upper blk-mq queue to run once that underlying limited
> >> > resource is replenished.
> >> >
> >> > Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
> >> > Signed-off-by: Mike Snitzer <[email protected]>
> >> > ---
> >> > block/blk-core.c | 3 ++-
> >> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >> >
> >> > diff --git a/block/blk-core.c b/block/blk-core.c
> >> > index cdae69be68e9..a224f282b4a6 100644
> >> > --- a/block/blk-core.c
> >> > +++ b/block/blk-core.c
> >> > @@ -2520,7 +2520,8 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
> >> >                  * bypass a potential scheduler on the bottom device for
> >> >                  * insert.
> >> >                  */
> >> > -               return blk_mq_request_issue_directly(rq);
> >> > +               blk_mq_request_bypass_insert(rq, true);
> >> > +               return BLK_STS_OK;
> >> >         }
> >>
> >> If this patch is meant to fix the IO stall on DM, it isn't needed, and
> >> actually it can't fix the IO stall issue.
> >
> > It should "fix it" in conjunction with this:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=87b7e73546b55f4a3a37d4726daedd9a17a20509
> >
> > (Bart already said as much, I just effectively enabled the equivalent of
> > his reverts)
>
> If the generic solution is taken, Bart's revert isn't needed.

Yes, we need Bart to verify your v2 patch:
https://patchwork.kernel.org/patch/10179945/

Bart, please test this tree:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=for-next

Please report back with your results (and include details on any changes
you make to the tree, hopefully no changes are needed).

Thanks,
Mike

2018-01-23 16:44:58

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH] block: neutralize blk_insert_cloned_request IO stall regression (was: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle)

On Tue, 2018-01-23 at 10:22 +0100, Mike Snitzer wrote:
> On Thu, Jan 18 2018 at 5:20pm -0500,
> Bart Van Assche <[email protected]> wrote:
>
> > On Thu, 2018-01-18 at 17:01 -0500, Mike Snitzer wrote:
> > > And yet Laurence cannot reproduce any such lockups with your test...
> >
> > Hmm ... maybe I misunderstood Laurence, but I don't think that Laurence has
> > succeeded yet at running an unmodified version of my tests. In one of the
> > e-mails Laurence sent me this morning I read that he modified these scripts
> > to get past a kernel module unload failure that was reported while starting
> > these tests. So the next step is to check which changes were made to the test
> > scripts and also whether the test results are still valid.
> >
> > > Are you absolutely certain this patch doesn't help you?
> > > https://patchwork.kernel.org/patch/10174037/
> > >
> > > If it doesn't then that is actually very useful to know.
> >
> > The first thing I tried this morning was to run the srp-test software against a merge
> > of Jens' for-next branch and your dm-4.16 branch. Since I noticed that the dm
> > queue locked up I reinserted a blk_mq_delay_run_hw_queue() call in the dm code.
> > Since even that was not sufficient, I tried to kick the queues via debugfs (for
> > s in /sys/kernel/debug/block/*/state; do echo kick >$s; done). Since that was
> > not sufficient to resolve the queue stall, I reverted the following three patches
> > that are in Jens' tree:
> > * "blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback"
> > * "blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request"
> > * "blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue is busy"
> >
> > Only after I had done this did the srp-test software run again without triggering
> > dm queue lockups.

Hello Mike,

I want to take back what I wrote about the reverts. I have not yet tried to
analyze exactly which change made blk_insert_cloned_request() work reliably
for me, but with Jens' latest for-next branch and your for-4.16 branch merged,
all I need to avoid queue stalls is the following:

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index f16096af879a..c59c59cfd2a5 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
                 /* Undo dm_start_request() before requeuing */
                 rq_end_stats(md, rq);
                 rq_completed(md, rq_data_dir(rq), false);
+                blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
                 return BLK_STS_RESOURCE;
         }
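
(Note: blk_mq_delay_run_hw_queue() here simply schedules a rerun of the
hardware queue 100 ms later. It papers over the missed restart with a
fixed delay rather than connecting the freed resource to the queue run,
so it avoids the stall at the cost of an arbitrary latency.)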

Bart.

2018-01-29 22:37:44

by Bart Van Assche

[permalink] [raw]
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

On 01/19/18 07:24, Jens Axboe wrote:
> That's what I thought. So for a low queue depth underlying queue, it's
> quite possible that this situation can happen. Two potential solutions
> I see:
>
> 1) As described earlier in this thread, having a mechanism for being
> notified when the scarce resource becomes available. It would not
> be hard to tap into the existing sbitmap wait queue for that.
>
> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
> allocation. I haven't read the dm code to know if this is a
> possibility or not.
>
> I'd probably prefer #1. It's a classic case of trying to get the
> request, and if it fails, add ourselves to the sbitmap tag wait
> queue head, retry, and bail if that also fails. Connecting the
> scarce resource and the consumer is the only way to really fix
> this, without bogus arbitrary delays.

(replying to an e-mail from ten days ago)

Implementing a notification mechanism for all cases in which
blk_insert_cloned_request() returns BLK_STS_RESOURCE today would require
a lot of work. If e.g. a SCSI LLD returns one of the SCSI_MLQUEUE_*_BUSY
return codes from its .queuecommand() implementation then the SCSI core
will translate that return code into BLK_STS_RESOURCE. From scsi_queue_rq():

        reason = scsi_dispatch_cmd(cmd);
        if (reason) {
                scsi_set_blocked(cmd, reason);
                ret = BLK_STS_RESOURCE;
                goto out_dec_host_busy;
        }

In other words, implementing a notification mechanism for all cases in
which blk_insert_cloned_request() can return BLK_STS_RESOURCE would
require modifying all SCSI LLDs.

Bart.
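
For reference, the retry-then-wait pattern Jens described above is
roughly the following, loosely modeled on how blk-mq already waits for
driver tags (blk_mq_mark_tag_wait()). This is only a sketch: the demo_*
names are invented, hctx->dispatch_wait is assumed to have been set up
with init_waitqueue_func_entry(&hctx->dispatch_wait, demo_dispatch_wake),
and the locking the real tag-wait path needs is elided.

static int demo_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
                              int flags, void *key)
{
        struct blk_mq_hw_ctx *hctx;

        hctx = container_of(wait, struct blk_mq_hw_ctx, dispatch_wait);
        list_del_init(&wait->entry);
        blk_mq_run_hw_queue(hctx, true);        /* resource freed: rerun */
        return 1;
}

static bool demo_get_unit_or_wait(struct blk_mq_hw_ctx *hctx,
                                  struct sbitmap_queue *sbq, int *unit)
{
        struct sbq_wait_state *ws;

        *unit = __sbitmap_queue_get(sbq);
        if (*unit >= 0)
                return true;                    /* fast path */

        /* Queue ourselves so that freeing a unit calls demo_dispatch_wake(). */
        ws = sbq_wait_ptr(sbq, &hctx->wait_index);
        add_wait_queue(&ws->wait, &hctx->dispatch_wait);

        /* Retry once: a unit may have been freed before we were queued. */
        *unit = __sbitmap_queue_get(sbq);
        if (*unit >= 0) {
                remove_wait_queue(&ws->wait, &hctx->dispatch_wait);
                return true;
        }

        return false;                   /* bail; the wake will rerun the queue */
}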