2020-02-03 21:02:28

by Salman Qazi

Subject: [PATCH] block: Limit number of items taken from the I/O scheduler in one go

We observed that it is possible for a flush to bypass the I/O
scheduler and get added to hctx->dispatch in blk_mq_sched_bypass_insert.
This can happen while a kworker is running a blk_mq_do_dispatch_sched call
in blk_mq_sched_dispatch_requests.

However, the blk_mq_do_dispatch_sched call is not guaranteed to end in bounded time.
As a result, the flush can sit there indefinitely, as the I/O scheduler
feeds an arbitrary number of requests to the hardware.

The solution is to periodically poll hctx->dispatch in
blk_mq_do_dispatch_sched, to put a bound on the latency of the commands
sitting there.

Signed-off-by: Salman Qazi <[email protected]>
---
block/blk-mq-sched.c | 6 ++++++
block/blk-mq.c | 4 ++++
block/blk-sysfs.c | 33 +++++++++++++++++++++++++++++++++
include/linux/blkdev.h | 2 ++
4 files changed, 45 insertions(+)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index ca22afd47b3d..75cdec64b9c7 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -90,6 +90,7 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
struct request_queue *q = hctx->queue;
struct elevator_queue *e = q->elevator;
LIST_HEAD(rq_list);
+ int count = 0;

do {
struct request *rq;
@@ -97,6 +98,10 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
break;

+ if (count > 0 && count % q->max_sched_batch == 0 &&
+ !list_empty_careful(&hctx->dispatch))
+ break;
+
if (!blk_mq_get_dispatch_budget(hctx))
break;

@@ -112,6 +117,7 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
* in blk_mq_dispatch_rq_list().
*/
list_add(&rq->queuelist, &rq_list);
+ count++;
} while (blk_mq_dispatch_rq_list(q, &rq_list, true));
}

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a12b1763508d..7cb13aa72a94 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -40,6 +40,8 @@
#include "blk-mq-sched.h"
#include "blk-rq-qos.h"

+#define BLK_MQ_DEFAULT_MAX_SCHED_BATCH 100
+
static void blk_mq_poll_stats_start(struct request_queue *q);
static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);

@@ -2934,6 +2936,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
*/
q->poll_nsec = BLK_MQ_POLL_CLASSIC;

+ q->max_sched_batch = BLK_MQ_DEFAULT_MAX_SCHED_BATCH;
+
blk_mq_init_cpu_queues(q, set->nr_hw_queues);
blk_mq_add_queue_tag_set(set, q);
blk_mq_map_swqueue(q);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index fca9b158f4a0..dd7b58a1bd35 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -390,6 +390,32 @@ static ssize_t queue_poll_delay_store(struct request_queue *q, const char *page,
return count;
}

+static ssize_t queue_max_sched_batch_show(struct request_queue *q, char *page)
+{
+ return sprintf(page, "%d\n", q->max_sched_batch);
+}
+
+static ssize_t queue_max_sched_batch_store(struct request_queue *q,
+ const char *page,
+ size_t count)
+{
+ int err, val;
+
+ if (!q->mq_ops)
+ return -EINVAL;
+
+ err = kstrtoint(page, 10, &val);
+ if (err < 0)
+ return err;
+
+ if (val <= 0)
+ return -EINVAL;
+
+ q->max_sched_batch = val;
+
+ return count;
+}
+
static ssize_t queue_poll_show(struct request_queue *q, char *page)
{
return queue_var_show(test_bit(QUEUE_FLAG_POLL, &q->queue_flags), page);
@@ -691,6 +717,12 @@ static struct queue_sysfs_entry queue_poll_delay_entry = {
.store = queue_poll_delay_store,
};

+static struct queue_sysfs_entry queue_max_sched_batch_entry = {
+ .attr = {.name = "max_sched_batch", .mode = 0644 },
+ .show = queue_max_sched_batch_show,
+ .store = queue_max_sched_batch_store,
+};
+
static struct queue_sysfs_entry queue_wc_entry = {
.attr = {.name = "write_cache", .mode = 0644 },
.show = queue_wc_show,
@@ -763,6 +795,7 @@ static struct attribute *queue_attrs[] = {
&queue_wb_lat_entry.attr,
&queue_poll_delay_entry.attr,
&queue_io_timeout_entry.attr,
+ &queue_max_sched_batch_entry.attr,
#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
&throtl_sample_time_entry.attr,
#endif
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 053ea4b51988..68e7d29d4dd4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -477,6 +477,8 @@ struct request_queue {
unsigned int rq_timeout;
int poll_nsec;

+ int max_sched_batch;
+
struct blk_stat_callback *poll_cb;
struct blk_rq_stat poll_stat[BLK_MQ_POLL_STATS_BKTS];

--
2.25.0.341.g760bfbb309-goog


2020-02-04 03:48:08

by Bart Van Assche

Subject: Re: [PATCH] block: Limit number of items taken from the I/O scheduler in one go

On 2020-02-03 12:59, Salman Qazi wrote:
> We observed that it is possible for a flush to bypass the I/O
> scheduler and get added to hctx->dispatch in blk_mq_sched_bypass_insert.
> This can happen while a kworker is running blk_mq_do_dispatch_sched call
> in blk_mq_sched_dispatch_requests.
>
> However, the blk_mq_do_dispatch_sched call doesn't end in bounded time.
> As a result, the flush can sit there indefinitely, as the I/O scheduler
> feeds an arbitrary number of requests to the hardware.
>
> The solution is to periodically poll hctx->dispatch in
> blk_mq_do_dispatch_sched, to put a bound on the latency of the commands
> sitting there.

(added Christoph, Ming and Hannes to the Cc-list)

Thank you for having posted a patch; that really helps.

I see that my name occurs first in the "To:" list. Since Jens is the
block layer maintainer I think Jens should have been mentioned first.

In version v4.20 of the Linux kernel I found the following in the legacy
block layer code:
* From blk_insert_flush():
list_add_tail(&rq->queuelist, &q->queue_head);
* From elv_next_request():
list_for_each_entry(rq, &q->queue_head, queuelist)

I think this means that the legacy block layer sent flush requests to
the scheduler instead of directly to the block driver. How about
modifying the blk-mq code such that it mimics that approach? I'm asking
this because this patch, although the code looks clean, doesn't seem like
the best solution to me.
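
In other words, the legacy retrieval path interleaved the two, roughly like
this (a from-memory sketch, not the literal v4.20 source; the function names
below are stand-ins):

/* Sketch of the legacy ordering: the dispatch list is always scanned
 * before the elevator is asked for more work. */
static struct request *legacy_next_request(struct request_queue *q)
{
	while (1) {
		/* A flush added to q->queue_head is seen here on the very
		 * next call, ahead of further elevator requests. */
		if (!list_empty(&q->queue_head))
			return list_first_entry(&q->queue_head,
						struct request, queuelist);

		/* Stand-in for the elevator dispatch hook, which moves
		 * more requests onto q->queue_head. */
		if (!legacy_elevator_dispatch(q))
			return NULL;
	}
}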

> + if (count > 0 && count % q->max_sched_batch == 0 &&
> + !list_empty_careful(&hctx->dispatch))
> + break;

A modulo operation in the hot path? Please don't do that.
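
For example (untested), a local countdown reloaded from q->max_sched_batch
would give roughly the same periodic check without a division, with
'int batch = q->max_sched_batch;' declared before the loop:

/* Check hctx->dispatch only every max_sched_batch requests,
 * without a modulo in the hot path. */
if (--batch <= 0) {
	if (!list_empty_careful(&hctx->dispatch))
		break;
	batch = q->max_sched_batch;
}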

> +static ssize_t queue_max_sched_batch_store(struct request_queue *q,
> + const char *page,
> + size_t count)
> +{
> + int err, val;
> +
> + if (!q->mq_ops)
> + return -EINVAL;
> +
> + err = kstrtoint(page, 10, &val);
> + if (err < 0)
> + return err;
> +
> + if (val <= 0)
> + return -EINVAL;
> +
> + q->max_sched_batch = val;
> +
> + return count;
> +}

Has it been considered to use kstrtouint() instead of checking whether
the value parsed by kstrtoint() is positive?

> + int max_sched_batch;

unsigned int?
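
I.e., something like this (untested sketch, assuming max_sched_batch
becomes an unsigned int):

static ssize_t queue_max_sched_batch_store(struct request_queue *q,
					   const char *page,
					   size_t count)
{
	unsigned int val;
	int err;

	if (!q->mq_ops)
		return -EINVAL;

	/* kstrtouint() already rejects negative input, so only the
	 * zero case needs an explicit check. */
	err = kstrtouint(page, 10, &val);
	if (err < 0)
		return err;

	if (!val)
		return -EINVAL;

	q->max_sched_batch = val;

	return count;
}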

Thanks,

Bart.

2020-02-04 09:22:28

by Ming Lei

Subject: Re: [PATCH] block: Limit number of items taken from the I/O scheduler in one go

On Mon, Feb 03, 2020 at 12:59:50PM -0800, Salman Qazi wrote:
> We observed that it is possible for a flush to bypass the I/O
> scheduler and get added to hctx->dispatch in blk_mq_sched_bypass_insert.

We always bypass the I/O scheduler for flush requests.

> This can happen while a kworker is running blk_mq_do_dispatch_sched call
> in blk_mq_sched_dispatch_requests.
>
> However, the blk_mq_do_dispatch_sched call doesn't end in bounded time.
> As a result, the flush can sit there indefinitely, as the I/O scheduler
> feeds an arbitrary number of requests to the hardware.

blk-mq is supposed to handle requests in hctx->dispatch with higher
priority; see blk_mq_sched_dispatch_requests().

However, there is a single hctx->run_work, so the async queue run for
dispatching the flush request can only run after the other async queue run,
which is dispatching from the scheduler, is done.

>
> The solution is to periodically poll hctx->dispatch in
> blk_mq_do_dispatch_sched, to put a bound on the latency of the commands
> sitting there.
>
> Signed-off-by: Salman Qazi <[email protected]>
> ---
> block/blk-mq-sched.c | 6 ++++++
> block/blk-mq.c | 4 ++++
> block/blk-sysfs.c | 33 +++++++++++++++++++++++++++++++++
> include/linux/blkdev.h | 2 ++
> 4 files changed, 45 insertions(+)
>
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index ca22afd47b3d..75cdec64b9c7 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -90,6 +90,7 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> struct request_queue *q = hctx->queue;
> struct elevator_queue *e = q->elevator;
> LIST_HEAD(rq_list);
> + int count = 0;
>
> do {
> struct request *rq;
> @@ -97,6 +98,10 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
> break;
>
> + if (count > 0 && count % q->max_sched_batch == 0 &&
> + !list_empty_careful(&hctx->dispatch))
> + break;

q->max_sched_batch may not be needed; it is reasonable to always dispatch
requests in hctx->dispatch first.
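
I.e., simply checking it on every iteration of the dispatch loop (untested):

	if (!list_empty_careful(&hctx->dispatch))
		break;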

Also, the same check is missing in the dispatch from the sw queue.


Thanks,
Ming

2020-02-04 18:27:49

by Salman Qazi

Subject: Re: [PATCH] block: Limit number of items taken from the I/O scheduler in one go

On Tue, Feb 4, 2020 at 1:20 AM Ming Lei <[email protected]> wrote:
>
> On Mon, Feb 03, 2020 at 12:59:50PM -0800, Salman Qazi wrote:
> > We observed that it is possible for a flush to bypass the I/O
> > scheduler and get added to hctx->dispatch in blk_mq_sched_bypass_insert.
>
> We always bypass io scheduler for flush request.
>
> > This can happen while a kworker is running blk_mq_do_dispatch_sched call
> > in blk_mq_sched_dispatch_requests.
> >
> > However, the blk_mq_do_dispatch_sched call doesn't end in bounded time.
> > As a result, the flush can sit there indefinitely, as the I/O scheduler
> > feeds an arbitrary number of requests to the hardware.
>
> blk-mq supposes to handle requests in hctx->dispatch with higher
> priority, see blk_mq_sched_dispatch_requests().
>
> However, there is single hctx->run_work, so async run queue for dispatching
> flush request can only be run until another async run queue from scheduler
> is done.
>
> >
> > The solution is to periodically poll hctx->dispatch in
> > blk_mq_do_dispatch_sched, to put a bound on the latency of the commands
> > sitting there.
> >
> > Signed-off-by: Salman Qazi <[email protected]>
> > ---
> > block/blk-mq-sched.c | 6 ++++++
> > block/blk-mq.c | 4 ++++
> > block/blk-sysfs.c | 33 +++++++++++++++++++++++++++++++++
> > include/linux/blkdev.h | 2 ++
> > 4 files changed, 45 insertions(+)
> >
> > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > index ca22afd47b3d..75cdec64b9c7 100644
> > --- a/block/blk-mq-sched.c
> > +++ b/block/blk-mq-sched.c
> > @@ -90,6 +90,7 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> > struct request_queue *q = hctx->queue;
> > struct elevator_queue *e = q->elevator;
> > LIST_HEAD(rq_list);
> > + int count = 0;
> >
> > do {
> > struct request *rq;
> > @@ -97,6 +98,10 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> > if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
> > break;
> >
> > + if (count > 0 && count % q->max_sched_batch == 0 &&
> > + !list_empty_careful(&hctx->dispatch))
> > + break;
>
> q->max_sched_batch may not be needed, and it is reasonable to always
> disaptch requests in hctx->dispatch first.
>
> Also such trick is missed in dispatch from sw queue.

I will update the commit message, drop max_sched_batch, turn it into a
simple check, and add the same thing to the dispatch from the sw queue.

>
>
> Thanks,
> Ming
>

2020-02-04 19:39:28

by Salman Qazi

Subject: [PATCH] block: Limit number of items taken from the I/O scheduler in one go

Flushes bypass the I/O scheduler and get added to hctx->dispatch
in blk_mq_sched_bypass_insert. This can happen while a kworker is running
the hctx->run_work work item and is past the point in
blk_mq_sched_dispatch_requests where hctx->dispatch is checked.

The blk_mq_do_dispatch_sched call is not guaranteed to end in bounded time,
because the I/O scheduler can feed an arbitrary number of commands.

Since we have only one hctx->run_work, the commands waiting in
hctx->dispatch will wait an arbitrary length of time for run_work to be
rerun.

A similar phenomenon exists with dispatches from the software queue.

The solution is to poll hctx->dispatch in blk_mq_do_dispatch_sched and
blk_mq_do_dispatch_ctx and return from the run_work handler and let it
rerun.

Signed-off-by: Salman Qazi <[email protected]>
---
block/blk-mq-sched.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index ca22afd47b3d..d1b8b31bc3d4 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -97,6 +97,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
break;

+ if (!list_empty_careful(&hctx->dispatch))
+ break;
+
if (!blk_mq_get_dispatch_budget(hctx))
break;

@@ -140,6 +143,9 @@ static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
do {
struct request *rq;

+ if (!list_empty_careful(&hctx->dispatch))
+ break;
+
if (!sbitmap_any_bit_set(&hctx->ctx_map))
break;

--
2.25.0.341.g760bfbb309-goog

2020-02-05 04:57:11

by Ming Lei

Subject: Re: [PATCH] block: Limit number of items taken from the I/O scheduler in one go

On Tue, Feb 04, 2020 at 11:37:11AM -0800, Salman Qazi wrote:
> Flushes bypass the I/O scheduler and get added to hctx->dispatch
> in blk_mq_sched_bypass_insert. This can happen while a kworker is running
> hctx->run_work work item and is past the point in
> blk_mq_sched_dispatch_requests where hctx->dispatch is checked.
>
> The blk_mq_do_dispatch_sched call is not guaranteed to end in bounded time,
> because the I/O scheduler can feed an arbitrary number of commands.
>
> Since we have only one hctx->run_work, the commands waiting in
> hctx->dispatch will wait an arbitrary length of time for run_work to be
> rerun.
>
> A similar phenomenon exists with dispatches from the software queue.
>
> The solution is to poll hctx->dispatch in blk_mq_do_dispatch_sched and
> blk_mq_do_dispatch_ctx and return from the run_work handler and let it
> rerun.
>
> Signed-off-by: Salman Qazi <[email protected]>
> ---
> block/blk-mq-sched.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index ca22afd47b3d..d1b8b31bc3d4 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -97,6 +97,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
> break;
>
> + if (!list_empty_careful(&hctx->dispatch))
> + break;
> +
> if (!blk_mq_get_dispatch_budget(hctx))
> break;
>
> @@ -140,6 +143,9 @@ static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
> do {
> struct request *rq;
>
> + if (!list_empty_careful(&hctx->dispatch))
> + break;
> +
> if (!sbitmap_any_bit_set(&hctx->ctx_map))
> break;

This approach looks good, given that we actually retrieved requests this
way in the legacy I/O request path; see __elv_next_request().

However, blk_mq_request_bypass_insert() may run at the same time, so this
patch may cause requests to stall in the scheduler queue.

How about returning whether a request is available in hctx->dispatch from
the two helpers, and then re-dispatching requests in
blk_mq_sched_dispatch_requests() if so?

Thanks,
Ming

2020-02-05 19:58:32

by Salman Qazi

Subject: [PATCH] block: Limit number of items taken from the I/O scheduler in one go

Flushes bypass the I/O scheduler and get added to hctx->dispatch
in blk_mq_sched_bypass_insert. This can happen while a kworker is running
the hctx->run_work work item and is past the point in
blk_mq_sched_dispatch_requests where hctx->dispatch is checked.

The blk_mq_do_dispatch_sched call is not guaranteed to end in bounded time,
because the I/O scheduler can feed an arbitrary number of commands.

Since we have only one hctx->run_work, the commands waiting in
hctx->dispatch will wait an arbitrary length of time for run_work to be
rerun.

A similar phenomenon exists with dispatches from the software queue.

The solution is to poll hctx->dispatch in blk_mq_do_dispatch_sched and
blk_mq_do_dispatch_ctx and return from the run_work handler and let it
rerun.

Signed-off-by: Salman Qazi <[email protected]>
---
block/blk-mq-sched.c | 37 +++++++++++++++++++++++++++++++------
1 file changed, 31 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index ca22afd47b3d..52249fddeb66 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -84,12 +84,16 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
* Only SCSI implements .get_budget and .put_budget, and SCSI restarts
* its queue by itself in its completion handler, so we don't need to
* restart queue if .get_budget() returns BLK_STS_NO_RESOURCE.
+ *
+ * Returns true if hctx->dispatch was found non-empty and
+ * run_work has to be run again.
*/
-static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
+static bool blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
{
struct request_queue *q = hctx->queue;
struct elevator_queue *e = q->elevator;
LIST_HEAD(rq_list);
+ bool ret = false;

do {
struct request *rq;
@@ -97,6 +101,11 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
break;

+ if (!list_empty_careful(&hctx->dispatch)) {
+ ret = true;
+ break;
+ }
+
if (!blk_mq_get_dispatch_budget(hctx))
break;

@@ -113,6 +122,8 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
*/
list_add(&rq->queuelist, &rq_list);
} while (blk_mq_dispatch_rq_list(q, &rq_list, true));
+
+ return ret;
}

static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
@@ -130,16 +141,25 @@ static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
* Only SCSI implements .get_budget and .put_budget, and SCSI restarts
* its queue by itself in its completion handler, so we don't need to
* restart queue if .get_budget() returns BLK_STS_NO_RESOURCE.
+ *
+ * Returns true if hctx->dispatch was found non-empty and
+ * run_work has to be run again.
*/
-static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
+static bool blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
{
struct request_queue *q = hctx->queue;
LIST_HEAD(rq_list);
struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
+ bool ret = false;

do {
struct request *rq;

+ if (!list_empty_careful(&hctx->dispatch)) {
+ ret = true;
+ break;
+ }
+
if (!sbitmap_any_bit_set(&hctx->ctx_map))
break;

@@ -165,6 +185,7 @@ static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
} while (blk_mq_dispatch_rq_list(q, &rq_list, true));

WRITE_ONCE(hctx->dispatch_from, ctx);
+ return ret;
}

void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
@@ -172,6 +193,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
struct request_queue *q = hctx->queue;
struct elevator_queue *e = q->elevator;
const bool has_sched_dispatch = e && e->type->ops.dispatch_request;
+ bool run_again = false;
LIST_HEAD(rq_list);

/* RCU or SRCU read lock is needed before checking quiesced flag */
@@ -208,19 +230,22 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
blk_mq_sched_mark_restart_hctx(hctx);
if (blk_mq_dispatch_rq_list(q, &rq_list, false)) {
if (has_sched_dispatch)
- blk_mq_do_dispatch_sched(hctx);
+ run_again = blk_mq_do_dispatch_sched(hctx);
else
- blk_mq_do_dispatch_ctx(hctx);
+ run_again = blk_mq_do_dispatch_ctx(hctx);
}
} else if (has_sched_dispatch) {
- blk_mq_do_dispatch_sched(hctx);
+ run_again = blk_mq_do_dispatch_sched(hctx);
} else if (hctx->dispatch_busy) {
/* dequeue request one by one from sw queue if queue is busy */
- blk_mq_do_dispatch_ctx(hctx);
+ run_again = blk_mq_do_dispatch_ctx(hctx);
} else {
blk_mq_flush_busy_ctxs(hctx, &rq_list);
blk_mq_dispatch_rq_list(q, &rq_list, false);
}
+
+ if (run_again)
+ blk_mq_run_hw_queue(hctx, true);
}

bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
--
2.25.0.341.g760bfbb309-goog

2020-02-06 10:20:17

by Ming Lei

Subject: Re: [PATCH] block: Limit number of items taken from the I/O scheduler in one go

On Wed, Feb 05, 2020 at 11:57:06AM -0800, Salman Qazi wrote:
> Flushes bypass the I/O scheduler and get added to hctx->dispatch
> in blk_mq_sched_bypass_insert. This can happen while a kworker is running
> hctx->run_work work item and is past the point in
> blk_mq_sched_dispatch_requests where hctx->dispatch is checked.
>
> The blk_mq_do_dispatch_sched call is not guaranteed to end in bounded time,
> because the I/O scheduler can feed an arbitrary number of commands.
>
> Since we have only one hctx->run_work, the commands waiting in
> hctx->dispatch will wait an arbitrary length of time for run_work to be
> rerun.
>
> A similar phenomenon exists with dispatches from the software queue.
>
> The solution is to poll hctx->dispatch in blk_mq_do_dispatch_sched and
> blk_mq_do_dispatch_ctx and return from the run_work handler and let it
> rerun.
>
> Signed-off-by: Salman Qazi <[email protected]>
> ---
> block/blk-mq-sched.c | 37 +++++++++++++++++++++++++++++++------
> 1 file changed, 31 insertions(+), 6 deletions(-)
>
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index ca22afd47b3d..52249fddeb66 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -84,12 +84,16 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
> * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
> * its queue by itself in its completion handler, so we don't need to
> * restart queue if .get_budget() returns BLK_STS_NO_RESOURCE.
> + *
> + * Returns true if hctx->dispatch was found non-empty and
> + * run_work has to be run again.
> */
> -static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> +static bool blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> {
> struct request_queue *q = hctx->queue;
> struct elevator_queue *e = q->elevator;
> LIST_HEAD(rq_list);
> + bool ret = false;
>
> do {
> struct request *rq;
> @@ -97,6 +101,11 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
> break;
>
> + if (!list_empty_careful(&hctx->dispatch)) {
> + ret = true;
> + break;
> + }
> +
> if (!blk_mq_get_dispatch_budget(hctx))
> break;
>
> @@ -113,6 +122,8 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> */
> list_add(&rq->queuelist, &rq_list);
> } while (blk_mq_dispatch_rq_list(q, &rq_list, true));
> +
> + return ret;
> }
>
> static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
> @@ -130,16 +141,25 @@ static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
> * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
> * its queue by itself in its completion handler, so we don't need to
> * restart queue if .get_budget() returns BLK_STS_NO_RESOURCE.
> + *
> + * Returns true if hctx->dispatch was found non-empty and
> + * run_work has to be run again.
> */
> -static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
> +static bool blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
> {
> struct request_queue *q = hctx->queue;
> LIST_HEAD(rq_list);
> struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
> + bool ret = false;
>
> do {
> struct request *rq;
>
> + if (!list_empty_careful(&hctx->dispatch)) {
> + ret = true;
> + break;
> + }
> +
> if (!sbitmap_any_bit_set(&hctx->ctx_map))
> break;
>
> @@ -165,6 +185,7 @@ static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
> } while (blk_mq_dispatch_rq_list(q, &rq_list, true));
>
> WRITE_ONCE(hctx->dispatch_from, ctx);
> + return ret;
> }
>
> void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> @@ -172,6 +193,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> struct request_queue *q = hctx->queue;
> struct elevator_queue *e = q->elevator;
> const bool has_sched_dispatch = e && e->type->ops.dispatch_request;
> + bool run_again = false;
> LIST_HEAD(rq_list);
>
> /* RCU or SRCU read lock is needed before checking quiesced flag */
> @@ -208,19 +230,22 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> blk_mq_sched_mark_restart_hctx(hctx);
> if (blk_mq_dispatch_rq_list(q, &rq_list, false)) {
> if (has_sched_dispatch)
> - blk_mq_do_dispatch_sched(hctx);
> + run_again = blk_mq_do_dispatch_sched(hctx);
> else
> - blk_mq_do_dispatch_ctx(hctx);
> + run_again = blk_mq_do_dispatch_ctx(hctx);
> }
> } else if (has_sched_dispatch) {
> - blk_mq_do_dispatch_sched(hctx);
> + run_again = blk_mq_do_dispatch_sched(hctx);
> } else if (hctx->dispatch_busy) {
> /* dequeue request one by one from sw queue if queue is busy */
> - blk_mq_do_dispatch_ctx(hctx);
> + run_again = blk_mq_do_dispatch_ctx(hctx);
> } else {
> blk_mq_flush_busy_ctxs(hctx, &rq_list);
> blk_mq_dispatch_rq_list(q, &rq_list, false);
> }
> +
> + if (run_again)
> + blk_mq_run_hw_queue(hctx, true);

One improvement may be to re-run locally in this function a limited number
of times (such as 1) first, then switch to blk_mq_run_hw_queue() if a re-run
is still needed.

This way may save one async hw queue run.

Thanks,
Ming

2020-02-06 21:13:43

by Salman Qazi

Subject: [PATCH] block: Limit number of items taken from the I/O scheduler in one go

Flushes bypass the I/O scheduler and get added to hctx->dispatch
in blk_mq_sched_bypass_insert. This can happen while a kworker is running
the hctx->run_work work item and is past the point in
blk_mq_sched_dispatch_requests where hctx->dispatch is checked.

The blk_mq_do_dispatch_sched call is not guaranteed to end in bounded time,
because the I/O scheduler can feed an arbitrary number of commands.

Since we have only one hctx->run_work, the commands waiting in
hctx->dispatch will wait an arbitrary length of time for run_work to be
rerun.

A similar phenomenon exists with dispatches from the software queue.

The solution is to poll hctx->dispatch in blk_mq_do_dispatch_sched and
blk_mq_do_dispatch_ctx and return from the run_work handler and let it
rerun.

Signed-off-by: Salman Qazi <[email protected]>
---
block/blk-mq-sched.c | 47 ++++++++++++++++++++++++++++++++++++++------
1 file changed, 41 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index ca22afd47b3d..84dde147f646 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -84,12 +84,16 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
* Only SCSI implements .get_budget and .put_budget, and SCSI restarts
* its queue by itself in its completion handler, so we don't need to
* restart queue if .get_budget() returns BLK_STS_NO_RESOURCE.
+ *
+ * Returns true if hctx->dispatch was found non-empty and
+ * run_work has to be run again.
*/
-static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
+static bool blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
{
struct request_queue *q = hctx->queue;
struct elevator_queue *e = q->elevator;
LIST_HEAD(rq_list);
+ bool ret = false;

do {
struct request *rq;
@@ -97,6 +101,11 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
break;

+ if (!list_empty_careful(&hctx->dispatch)) {
+ ret = true;
+ break;
+ }
+
if (!blk_mq_get_dispatch_budget(hctx))
break;

@@ -113,6 +122,8 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
*/
list_add(&rq->queuelist, &rq_list);
} while (blk_mq_dispatch_rq_list(q, &rq_list, true));
+
+ return ret;
}

static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
@@ -130,16 +141,25 @@ static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
* Only SCSI implements .get_budget and .put_budget, and SCSI restarts
* its queue by itself in its completion handler, so we don't need to
* restart queue if .get_budget() returns BLK_STS_NO_RESOURCE.
+ *
+ * Returns true if hctx->dispatch was found non-empty and
+ * run_work has to be run again.
*/
-static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
+static bool blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
{
struct request_queue *q = hctx->queue;
LIST_HEAD(rq_list);
struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
+ bool ret = false;

do {
struct request *rq;

+ if (!list_empty_careful(&hctx->dispatch)) {
+ ret = true;
+ break;
+ }
+
if (!sbitmap_any_bit_set(&hctx->ctx_map))
break;

@@ -165,6 +185,7 @@ static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
} while (blk_mq_dispatch_rq_list(q, &rq_list, true));

WRITE_ONCE(hctx->dispatch_from, ctx);
+ return ret;
}

void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
@@ -172,6 +193,8 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
struct request_queue *q = hctx->queue;
struct elevator_queue *e = q->elevator;
const bool has_sched_dispatch = e && e->type->ops.dispatch_request;
+ bool run_again;
+ bool restarted = false;
LIST_HEAD(rq_list);

/* RCU or SRCU read lock is needed before checking quiesced flag */
@@ -180,6 +203,9 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)

hctx->run++;

+again:
+ run_again = false;
+
/*
* If we have previous entries on our dispatch list, grab them first for
* more fair dispatch.
@@ -208,19 +234,28 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
blk_mq_sched_mark_restart_hctx(hctx);
if (blk_mq_dispatch_rq_list(q, &rq_list, false)) {
if (has_sched_dispatch)
- blk_mq_do_dispatch_sched(hctx);
+ run_again = blk_mq_do_dispatch_sched(hctx);
else
- blk_mq_do_dispatch_ctx(hctx);
+ run_again = blk_mq_do_dispatch_ctx(hctx);
}
} else if (has_sched_dispatch) {
- blk_mq_do_dispatch_sched(hctx);
+ run_again = blk_mq_do_dispatch_sched(hctx);
} else if (hctx->dispatch_busy) {
/* dequeue request one by one from sw queue if queue is busy */
- blk_mq_do_dispatch_ctx(hctx);
+ run_again = blk_mq_do_dispatch_ctx(hctx);
} else {
blk_mq_flush_busy_ctxs(hctx, &rq_list);
blk_mq_dispatch_rq_list(q, &rq_list, false);
}
+
+ if (run_again) {
+ if (!restarted) {
+ restarted = true;
+ goto again;
+ }
+
+ blk_mq_run_hw_queue(hctx, true);
+ }
}

bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
--
2.25.0.341.g760bfbb309-goog

2020-02-07 02:09:26

by Ming Lei

Subject: Re: [PATCH] block: Limit number of items taken from the I/O scheduler in one go

On Thu, Feb 06, 2020 at 01:12:22PM -0800, Salman Qazi wrote:
> Flushes bypass the I/O scheduler and get added to hctx->dispatch
> in blk_mq_sched_bypass_insert. This can happen while a kworker is running
> hctx->run_work work item and is past the point in
> blk_mq_sched_dispatch_requests where hctx->dispatch is checked.
>
> The blk_mq_do_dispatch_sched call is not guaranteed to end in bounded time,
> because the I/O scheduler can feed an arbitrary number of commands.
>
> Since we have only one hctx->run_work, the commands waiting in
> hctx->dispatch will wait an arbitrary length of time for run_work to be
> rerun.
>
> A similar phenomenon exists with dispatches from the software queue.
>
> The solution is to poll hctx->dispatch in blk_mq_do_dispatch_sched and
> blk_mq_do_dispatch_ctx and return from the run_work handler and let it
> rerun.
>
> Signed-off-by: Salman Qazi <[email protected]>
> ---
> block/blk-mq-sched.c | 47 ++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 41 insertions(+), 6 deletions(-)
>
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index ca22afd47b3d..84dde147f646 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -84,12 +84,16 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
> * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
> * its queue by itself in its completion handler, so we don't need to
> * restart queue if .get_budget() returns BLK_STS_NO_RESOURCE.
> + *
> + * Returns true if hctx->dispatch was found non-empty and
> + * run_work has to be run again.
> */
> -static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> +static bool blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> {
> struct request_queue *q = hctx->queue;
> struct elevator_queue *e = q->elevator;
> LIST_HEAD(rq_list);
> + bool ret = false;
>
> do {
> struct request *rq;
> @@ -97,6 +101,11 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
> break;
>
> + if (!list_empty_careful(&hctx->dispatch)) {
> + ret = true;
> + break;
> + }
> +
> if (!blk_mq_get_dispatch_budget(hctx))
> break;
>
> @@ -113,6 +122,8 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> */
> list_add(&rq->queuelist, &rq_list);
> } while (blk_mq_dispatch_rq_list(q, &rq_list, true));
> +
> + return ret;
> }
>
> static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
> @@ -130,16 +141,25 @@ static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
> * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
> * its queue by itself in its completion handler, so we don't need to
> * restart queue if .get_budget() returns BLK_STS_NO_RESOURCE.
> + *
> + * Returns true if hctx->dispatch was found non-empty and
> + * run_work has to be run again.
> */
> -static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
> +static bool blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
> {
> struct request_queue *q = hctx->queue;
> LIST_HEAD(rq_list);
> struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
> + bool ret = false;
>
> do {
> struct request *rq;
>
> + if (!list_empty_careful(&hctx->dispatch)) {
> + ret = true;
> + break;
> + }
> +
> if (!sbitmap_any_bit_set(&hctx->ctx_map))
> break;
>
> @@ -165,6 +185,7 @@ static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
> } while (blk_mq_dispatch_rq_list(q, &rq_list, true));
>
> WRITE_ONCE(hctx->dispatch_from, ctx);
> + return ret;
> }
>
> void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> @@ -172,6 +193,8 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> struct request_queue *q = hctx->queue;
> struct elevator_queue *e = q->elevator;
> const bool has_sched_dispatch = e && e->type->ops.dispatch_request;
> + bool run_again;
> + bool restarted = false;
> LIST_HEAD(rq_list);
>
> /* RCU or SRCU read lock is needed before checking quiesced flag */
> @@ -180,6 +203,9 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>
> hctx->run++;
>
> +again:
> + run_again = false;
> +
> /*
> * If we have previous entries on our dispatch list, grab them first for
> * more fair dispatch.
> @@ -208,19 +234,28 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> blk_mq_sched_mark_restart_hctx(hctx);
> if (blk_mq_dispatch_rq_list(q, &rq_list, false)) {
> if (has_sched_dispatch)
> - blk_mq_do_dispatch_sched(hctx);
> + run_again = blk_mq_do_dispatch_sched(hctx);
> else
> - blk_mq_do_dispatch_ctx(hctx);
> + run_again = blk_mq_do_dispatch_ctx(hctx);
> }
> } else if (has_sched_dispatch) {
> - blk_mq_do_dispatch_sched(hctx);
> + run_again = blk_mq_do_dispatch_sched(hctx);
> } else if (hctx->dispatch_busy) {
> /* dequeue request one by one from sw queue if queue is busy */
> - blk_mq_do_dispatch_ctx(hctx);
> + run_again = blk_mq_do_dispatch_ctx(hctx);
> } else {
> blk_mq_flush_busy_ctxs(hctx, &rq_list);
> blk_mq_dispatch_rq_list(q, &rq_list, false);
> }
> +
> + if (run_again) {
> + if (!restarted) {
> + restarted = true;
> + goto again;
> + }
> +
> + blk_mq_run_hw_queue(hctx, true);
> + }
> }
>
> bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
> --
> 2.25.0.341.g760bfbb309-goog
>

Reviewed-by: Ming Lei <[email protected]>

--
Ming

2020-02-07 15:28:18

by Bart Van Assche

Subject: Re: [PATCH] block: Limit number of items taken from the I/O scheduler in one go

On 2020-02-06 13:12, Salman Qazi wrote:
> + *
> + * Returns true if hctx->dispatch was found non-empty and
> + * run_work has to be run again.

Please elaborate on this comment and explain why this is necessary (to
avoid flush processing being postponed forever).

> + * Returns true if hctx->dispatch was found non-empty and
> + * run_work has to be run again.

Same comment here.

> +again:
> + run_again = false;
> +
> /*
> * If we have previous entries on our dispatch list, grab them first for
> * more fair dispatch.
> @@ -208,19 +234,28 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> blk_mq_sched_mark_restart_hctx(hctx);
> if (blk_mq_dispatch_rq_list(q, &rq_list, false)) {
> if (has_sched_dispatch)
> - blk_mq_do_dispatch_sched(hctx);
> + run_again = blk_mq_do_dispatch_sched(hctx);
> else
> - blk_mq_do_dispatch_ctx(hctx);
> + run_again = blk_mq_do_dispatch_ctx(hctx);
> }
> } else if (has_sched_dispatch) {
> - blk_mq_do_dispatch_sched(hctx);
> + run_again = blk_mq_do_dispatch_sched(hctx);
> } else if (hctx->dispatch_busy) {
> /* dequeue request one by one from sw queue if queue is busy */
> - blk_mq_do_dispatch_ctx(hctx);
> + run_again = blk_mq_do_dispatch_ctx(hctx);
> } else {
> blk_mq_flush_busy_ctxs(hctx, &rq_list);
> blk_mq_dispatch_rq_list(q, &rq_list, false);
> }
> +
> + if (run_again) {
> + if (!restarted) {
> + restarted = true;
> + goto again;
> + }
> +
> + blk_mq_run_hw_queue(hctx, true);
> + }

So this patch changes blk_mq_sched_dispatch_requests() such that it
iterates at most two times? How about implementing that loop with an
explicit for-loop? I think that will make
blk_mq_sched_dispatch_requests() easier to read. As you may know, forward
gotos are accepted in kernel code but backward gotos are frowned upon.

Thanks,

Bart.

2020-02-07 18:47:09

by Salman Qazi

Subject: Re: [PATCH] block: Limit number of items taken from the I/O scheduler in one go

On Fri, Feb 7, 2020 at 7:26 AM Bart Van Assche <[email protected]> wrote:
>
> On 2020-02-06 13:12, Salman Qazi wrote:
> > + *
> > + * Returns true if hctx->dispatch was found non-empty and
> > + * run_work has to be run again.
>
> Please elaborate this comment and explain why this is necessary (to
> avoid that flush processing is postponed forever).
>
> > + * Returns true if hctx->dispatch was found non-empty and
> > + * run_work has to be run again.
>
> Same comment here.

Will do.

>
> > +again:
> > + run_again = false;
> > +
> > /*
> > * If we have previous entries on our dispatch list, grab them first for
> > * more fair dispatch.
> > @@ -208,19 +234,28 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> > blk_mq_sched_mark_restart_hctx(hctx);
> > if (blk_mq_dispatch_rq_list(q, &rq_list, false)) {
> > if (has_sched_dispatch)
> > - blk_mq_do_dispatch_sched(hctx);
> > + run_again = blk_mq_do_dispatch_sched(hctx);
> > else
> > - blk_mq_do_dispatch_ctx(hctx);
> > + run_again = blk_mq_do_dispatch_ctx(hctx);
> > }
> > } else if (has_sched_dispatch) {
> > - blk_mq_do_dispatch_sched(hctx);
> > + run_again = blk_mq_do_dispatch_sched(hctx);
> > } else if (hctx->dispatch_busy) {
> > /* dequeue request one by one from sw queue if queue is busy */
> > - blk_mq_do_dispatch_ctx(hctx);
> > + run_again = blk_mq_do_dispatch_ctx(hctx);
> > } else {
> > blk_mq_flush_busy_ctxs(hctx, &rq_list);
> > blk_mq_dispatch_rq_list(q, &rq_list, false);
> > }
> > +
> > + if (run_again) {
> > + if (!restarted) {
> > + restarted = true;
> > + goto again;
> > + }
> > +
> > + blk_mq_run_hw_queue(hctx, true);
> > + }
>
> So this patch changes blk_mq_sched_dispatch_requests() such that it
> iterates at most two times? How about implementing that loop with an
> explicit for-loop? I think that will make
> blk_mq_sched_dispatch_requests() easier to read. As you may know forward
> goto's are accepted in kernel code but backward goto's are frowned upon.
>

About the goto, I don't know if backwards gotos in general are frowned
upon. There are plenty of examples
in the kernel source. This particular label, 'again' for instance:

$ grep -r again: mm/|wc -l
22
$ grep -r again: block/|wc -l
4

But, just because others have done it doesn't mean I should. So, I will
attempt to explain why I think this is a good idea. If I were to write this
as a for-loop, it would look like this:

for (i = 0; i == 0 || (run_again && i < 2); i++) {
	/* another level of 8 character wide indentation */
	run_again = false;
	/* a bunch of code that possibly sets run_again to true */
}

if (run_again)
	blk_mq_run_hw_queue(hctx, true);

[Another alternative is to set run_again to true, and simplify the for-loop
condition to run_again && i < 2. But, again, lots of verbiage and a boolean
in the for-loop condition.]

The for-loop is far from idiomatic. It's not clear what it does when you
first look at it. It distracts from the common path of the code, which is
something that almost always runs exactly once. There is now an additional
level of indentation. The readers of the code aren't any better off, because
they still have to figure out what run_again is and if they care about it.
And the only way to do that is to read the entire body of the loop, and
comments at the top of the functions.

The goto in this case preserves the intent of the code better. It is dealing
with an exceptional and unusual case. Indeed this kind of use is not unusual
in the kernel, for instance to deal with possible but unlikely races.

Just my $0.02.

> Thanks,
>
> Bart.

2020-02-07 19:05:39

by Salman Qazi

Subject: [PATCH] block: Limit number of items taken from the I/O scheduler in one go

Flushes bypass the I/O scheduler and get added to hctx->dispatch
in blk_mq_sched_bypass_insert. This can happen while a kworker is running
the hctx->run_work work item and is past the point in
blk_mq_sched_dispatch_requests where hctx->dispatch is checked.

The blk_mq_do_dispatch_sched call is not guaranteed to end in bounded time,
because the I/O scheduler can feed an arbitrary number of commands.

Since we have only one hctx->run_work, the commands waiting in
hctx->dispatch will wait an arbitrary length of time for run_work to be
rerun.

A similar phenomenon exists with dispatches from the software queue.

The solution is to poll hctx->dispatch in blk_mq_do_dispatch_sched and
blk_mq_do_dispatch_ctx and return from the run_work handler and let it
rerun.

Signed-off-by: Salman Qazi <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
---
block/blk-mq-sched.c | 49 ++++++++++++++++++++++++++++++++++++++------
1 file changed, 43 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index ca22afd47b3d..3e78c5bbb4d9 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -84,12 +84,17 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
* Only SCSI implements .get_budget and .put_budget, and SCSI restarts
* its queue by itself in its completion handler, so we don't need to
* restart queue if .get_budget() returns BLK_STS_NO_RESOURCE.
+ *
+ * Returns true if hctx->dispatch was found non-empty and
+ * run_work has to be run again. This is necessary to avoid
+ * starving flushes.
*/
-static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
+static bool blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
{
struct request_queue *q = hctx->queue;
struct elevator_queue *e = q->elevator;
LIST_HEAD(rq_list);
+ bool ret = false;

do {
struct request *rq;
@@ -97,6 +102,11 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
break;

+ if (!list_empty_careful(&hctx->dispatch)) {
+ ret = true;
+ break;
+ }
+
if (!blk_mq_get_dispatch_budget(hctx))
break;

@@ -113,6 +123,8 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
*/
list_add(&rq->queuelist, &rq_list);
} while (blk_mq_dispatch_rq_list(q, &rq_list, true));
+
+ return ret;
}

static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
@@ -130,16 +142,26 @@ static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
* Only SCSI implements .get_budget and .put_budget, and SCSI restarts
* its queue by itself in its completion handler, so we don't need to
* restart queue if .get_budget() returns BLK_STS_NO_RESOURCE.
+ *
+ * Returns true if hctx->dispatch was found non-empty and
+ * run_work has to be run again. This is necessary to avoid
+ * starving flushes.
*/
-static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
+static bool blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
{
struct request_queue *q = hctx->queue;
LIST_HEAD(rq_list);
struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
+ bool ret = false;

do {
struct request *rq;

+ if (!list_empty_careful(&hctx->dispatch)) {
+ ret = true;
+ break;
+ }
+
if (!sbitmap_any_bit_set(&hctx->ctx_map))
break;

@@ -165,6 +187,7 @@ static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
} while (blk_mq_dispatch_rq_list(q, &rq_list, true));

WRITE_ONCE(hctx->dispatch_from, ctx);
+ return ret;
}

void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
@@ -172,6 +195,8 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
struct request_queue *q = hctx->queue;
struct elevator_queue *e = q->elevator;
const bool has_sched_dispatch = e && e->type->ops.dispatch_request;
+ bool run_again;
+ bool restarted = false;
LIST_HEAD(rq_list);

/* RCU or SRCU read lock is needed before checking quiesced flag */
@@ -180,6 +205,9 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)

hctx->run++;

+again:
+ run_again = false;
+
/*
* If we have previous entries on our dispatch list, grab them first for
* more fair dispatch.
@@ -208,19 +236,28 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
blk_mq_sched_mark_restart_hctx(hctx);
if (blk_mq_dispatch_rq_list(q, &rq_list, false)) {
if (has_sched_dispatch)
- blk_mq_do_dispatch_sched(hctx);
+ run_again = blk_mq_do_dispatch_sched(hctx);
else
- blk_mq_do_dispatch_ctx(hctx);
+ run_again = blk_mq_do_dispatch_ctx(hctx);
}
} else if (has_sched_dispatch) {
- blk_mq_do_dispatch_sched(hctx);
+ run_again = blk_mq_do_dispatch_sched(hctx);
} else if (hctx->dispatch_busy) {
/* dequeue request one by one from sw queue if queue is busy */
- blk_mq_do_dispatch_ctx(hctx);
+ run_again = blk_mq_do_dispatch_ctx(hctx);
} else {
blk_mq_flush_busy_ctxs(hctx, &rq_list);
blk_mq_dispatch_rq_list(q, &rq_list, false);
}
+
+ if (run_again) {
+ if (!restarted) {
+ restarted = true;
+ goto again;
+ }
+
+ blk_mq_run_hw_queue(hctx, true);
+ }
}

bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
--
2.25.0.341.g760bfbb309-goog

2020-02-07 20:21:32

by Bart Van Assche

Subject: Re: [PATCH] block: Limit number of items taken from the I/O scheduler in one go

On 2/7/20 10:45 AM, Salman Qazi wrote:
> If I were to write this as a for-loop, it will look like this:
>
> for (i = 0; i == 0 || (run_again && i < 2); i++) {
> /* another level of 8 character wide indentation */
> run_again = false;
> /* a bunch of code that possibly sets run_again to true
> }
>
> if (run_again)
> blk_mq_run_hw_queue(hctx, true);

That's not what I meant. What I meant is a loop that iterates at most two
times and that breaks out of the loop if run_again == false.

BTW, I share your concern about the additional indentation by eight
positions. How about avoiding deeper indentation by introducing a new
function?
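
Roughly like this (untested sketch; __blk_mq_sched_dispatch_requests() is a
hypothetical helper that would hold the current body of
blk_mq_sched_dispatch_requests() and return run_again):

	bool run_again = false;
	int i;

	/* Restart at most once in this context ... */
	for (i = 0; i < 2; i++) {
		run_again = __blk_mq_sched_dispatch_requests(hctx);
		if (!run_again)
			break;
	}

	/* ... and punt to an async queue run if work is still pending. */
	if (run_again)
		blk_mq_run_hw_queue(hctx, true);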

Thanks,

Bart.

2020-02-07 20:39:35

by Salman Qazi

Subject: Re: [PATCH] block: Limit number of items taken from the I/O scheduler in one go

On Fri, Feb 7, 2020 at 12:19 PM Bart Van Assche <[email protected]> wrote:
>
> On 2/7/20 10:45 AM, Salman Qazi wrote:
> > If I were to write this as a for-loop, it will look like this:
> >
> > for (i = 0; i == 0 || (run_again && i < 2); i++) {
> > /* another level of 8 character wide indentation */
> > run_again = false;
> > /* a bunch of code that possibly sets run_again to true
> > }
> >
> > if (run_again)
> > blk_mq_run_hw_queue(hctx, true);
>
> That's not what I meant. What I meant is a loop that iterates at most
> two times and also to break out of the loop if run_again == false.
>

I picked the most compact variant to demonstrate the problem. Adding breaks
isn't really helping the readability.

for (i = 0; i < 2; i++) {
	run_again = false;
	/* bunch of code that possibly sets it to true */
	...
	if (!run_again)
		break;
}
if (run_again)
	blk_mq_run_hw_queue(hctx, true);

When I read this, I initially assume that the loop in general runs twice
and that this is the common case. It has the same problem with conveying
intent. Perhaps more importantly, the point of using programming constructs
is to shorten and simplify the code. There are still two if-statements in
addition to the loop. We haven't gained much by introducing the loop.

> BTW, I share your concern about the additional indentation by eight
> positions. How about avoiding deeper indentation by introducing a new
> function?

If there were a benefit to introducing the loop, this would be a good call.
But the way I see it, the introduction of another function is yet another
way in which the introduction of the loop makes the code less readable.

This is not a hill I want to die on. If the maintainer agrees with
you on this point, I will use a loop.
>
> Thanks,
>
> Bart.

2020-04-20 16:44:32

by Doug Anderson

Subject: Re: [PATCH] block: Limit number of items taken from the I/O scheduler in one go

Hi,

On Fri, Feb 7, 2020 at 12:38 PM Salman Qazi <[email protected]> wrote:
>
> On Fri, Feb 7, 2020 at 12:19 PM Bart Van Assche <[email protected]> wrote:
> >
> > On 2/7/20 10:45 AM, Salman Qazi wrote:
> > > If I were to write this as a for-loop, it will look like this:
> > >
> > > for (i = 0; i == 0 || (run_again && i < 2); i++) {
> > > /* another level of 8 character wide indentation */
> > > run_again = false;
> > > /* a bunch of code that possibly sets run_again to true
> > > }
> > >
> > > if (run_again)
> > > blk_mq_run_hw_queue(hctx, true);
> >
> > That's not what I meant. What I meant is a loop that iterates at most
> > two times and also to break out of the loop if run_again == false.
> >
>
> I picked the most compact variant to demonstrate the problem. Adding
> breaks isn't
> really helping the readability.
>
> for (i = 0; i < 2; i++) {
> run_again = false;
> /* bunch of code that possibly sets it to true */
> ...
> if (!run_again)
> break;
> }
> if (run_again)
> blk_mq_run_hw_queue(hctx, true);
>
> When I read this, I initially assume that the loop in general runs
> twice and that this is the common case. It has the
> same problem with conveying intent. Perhaps, more importantly, the
> point of using programming constructs is to shorten and simplify the
> code.
> There are still two if-statements in addition to the loop. We haven't
> gained much by introducing the loop.
>
> > BTW, I share your concern about the additional indentation by eight
> > positions. How about avoiding deeper indentation by introducing a new
> > function?
>
> If there was a benefit to introducing the loop, this would be a good
> call. But the way I see it, the introduction of another
> function is yet another way in which the introduction of the loop
> makes the code less readable.
>
> This is not a hill I want to die on. If the maintainer agrees with
> you on this point, I will use a loop.

I haven't done a massive amount of analysis of this patch, but since I
noticed it while debugging my own block series I've been keeping track
of it. Is there any status update here? We've been carrying this
"FROMLIST" in our Chrome OS trees for a little while, but that's not a
state we want to be in long-term. If it needs to spin before landing
upstream we should get the spin out and land it. If it's OK as-is
then it'd be nice to see it in mainline.

From the above I guess Salman was waiting for Jens to weigh in with an
opinion on the preferred bike shed color?

-Doug

2020-04-23 20:17:39

by Jesse Barnes

Subject: Re: [PATCH] block: Limit number of items taken from the I/O scheduler in one go

Jens, Bart, Ming, any update here? Or is this already applied (I didn't check)?

Thanks,
Jesse


On Mon, Apr 20, 2020 at 9:43 AM Doug Anderson <[email protected]> wrote:
>
> Hi,
>
> On Fri, Feb 7, 2020 at 12:38 PM Salman Qazi <[email protected]> wrote:
> >
> > On Fri, Feb 7, 2020 at 12:19 PM Bart Van Assche <[email protected]> wrote:
> > >
> > > On 2/7/20 10:45 AM, Salman Qazi wrote:
> > > > If I were to write this as a for-loop, it will look like this:
> > > >
> > > > for (i = 0; i == 0 || (run_again && i < 2); i++) {
> > > > /* another level of 8 character wide indentation */
> > > > run_again = false;
> > > > /* a bunch of code that possibly sets run_again to true
> > > > }
> > > >
> > > > if (run_again)
> > > > blk_mq_run_hw_queue(hctx, true);
> > >
> > > That's not what I meant. What I meant is a loop that iterates at most
> > > two times and also to break out of the loop if run_again == false.
> > >
> >
> > I picked the most compact variant to demonstrate the problem. Adding
> > breaks isn't
> > really helping the readability.
> >
> > for (i = 0; i < 2; i++) {
> > run_again = false;
> > /* bunch of code that possibly sets it to true */
> > ...
> > if (!run_again)
> > break;
> > }
> > if (run_again)
> > blk_mq_run_hw_queue(hctx, true);
> >
> > When I read this, I initially assume that the loop in general runs
> > twice and that this is the common case. It has the
> > same problem with conveying intent. Perhaps, more importantly, the
> > point of using programming constructs is to shorten and simplify the
> > code.
> > There are still two if-statements in addition to the loop. We haven't
> > gained much by introducing the loop.
> >
> > > BTW, I share your concern about the additional indentation by eight
> > > positions. How about avoiding deeper indentation by introducing a new
> > > function?
> >
> > If there was a benefit to introducing the loop, this would be a good
> > call. But the way I see it, the introduction of another
> > function is yet another way in which the introduction of the loop
> > makes the code less readable.
> >
> > This is not a hill I want to die on. If the maintainer agrees with
> > you on this point, I will use a loop.
>
> I haven't done a massive amount of analysis of this patch, but since I
> noticed it while debugging my own block series I've been keeping track
> of it. Is there any status update here? We've been carrying this
> "FROMLIST" in our Chrome OS trees for a little while, but that's not a
> state we want to be in long-term. If it needs to spin before landing
> upstream we should get the spin out and land it. If it's OK as-is
> then it'd be nice to see it in mainline.
>
> From the above I guess Salman was waiting for Jens to weigh in with an
> opinion on the prefered bike shed color?
>
> -Doug

2020-04-23 20:37:42

by Jens Axboe

Subject: Re: [PATCH] block: Limit number of items taken from the I/O scheduler in one go

On 4/23/20 2:13 PM, Jesse Barnes wrote:
> Jens, Bart, Ming, any update here? Or is this already applied (I didn't check)?

No updates on my end. I was expecting that a v2 would be posted after
all the discussion on it, but that doesn't seem to be the case.

--
Jens Axboe

2020-04-23 20:43:27

by Salman Qazi

Subject: Re: [PATCH] block: Limit number of items taken from the I/O scheduler in one go

I had posted a version on Feb 7th, but I guess I neglected to
explicitly label it as v2. I am sorry I am not well-acquainted with
the processes here.

On Thu, Apr 23, 2020 at 1:34 PM Jens Axboe <[email protected]> wrote:
>
> On 4/23/20 2:13 PM, Jesse Barnes wrote:
> > Jens, Bart, Ming, any update here? Or is this already applied (I didn't check)?
>
> No updates on my end. I was expecting that a v2 would be posted after
> all the discussion on it, but that doesn't seem to be the case.
>
> --
> Jens Axboe
>