2016-12-17 00:13:04

by Jens Axboe

Subject: [PATCHSET v4] blk-mq-scheduling framework

This is version 4 of this patchset, version 3 was posted here:

https://marc.info/?l=linux-block&m=148178513407631&w=2

From the discussion last time, I looked into the feasibility of having
two sets of tags for the same request pool, to avoid having to copy
some of the request fields at dispatch and completion time. To do that,
we'd have to replace the driver tag map(s) with our own, and augment
that with tag map(s) on the side representing the device queue depth.
Queuing IO with the scheduler would allocate from the new map, and
dispatching would acquire the "real" tag. We would need to change
drivers to do this, or add an extra indirection table to map a real
tag to the scheduler tag. We would also need a 1:1 mapping between
scheduler and hardware tag pools, or additional info to track it.
Unless someone can convince me otherwise, I think the current approach
is cleaner.
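
For reference, the indirection variant would have meant carrying
something like the following alongside the driver tags (purely a sketch
of the rejected idea, all names invented):

        struct sched_tag_map {
                struct blk_mq_tags *sched_tags;  /* scheduler-side pool, deeper */
                struct blk_mq_tags *driver_tags; /* bounded by device queue depth */
                /*
                 * driver tag -> scheduler tag, filled in when dispatch
                 * acquires the "real" tag for a queued request
                 */
                int *sched_tag_for;
        };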

I wasn't going to post v4 so soon, but I discovered a bug that led
to drastically decreased merging. With that fixed, this release should
be fast and on par with the merging that we get through the legacy
schedulers, especially on rotating storage.

Changes since v3:

- Keep the blk_mq_free_request/__blk_mq_free_request() as the
interface, and have those functions call the scheduler API
instead.

- Add insertion merging from unplugging.

- Ensure that RQF_STARTED is cleared when we get a new shadow
request, or merging will fail if it is already set.

- Improve the blk_mq_sched_init_hctx_data() implementation. From Omar.

- Make the shadow alloc/free interface more usable by schedulers
that use the software queues. From Omar.

- Fix a bug in the io context code.

- Put the is_shadow() helper in generic code, instead of in mq-deadline.

- Add prep patch that unexports blk_mq_free_hctx_request(), it's not
used by anyone.

- Remove the magic '256' queue depth from mq-deadline, replace with a
module parameter, 'queue_depth', that defaults to 256.

- Various cleanups.


2016-12-17 00:12:30

by Jens Axboe

Subject: [PATCH 1/8] block: move existing elevator ops to union

Prep patch for adding MQ ops as well, since anonymous unions with
designated initializers don't work on older compilers.
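
To illustrate what that means (a hypothetical sketch, not code from this
series): the convenient layout would have been an anonymous union, so the
existing ".ops = { ... }" initializers could stay untouched, but a
designated initializer that names a member of an anonymous union is
rejected by older gcc:

        struct elevator_type {
                /* hypothetical layout this series avoids */
                union {
                        struct elevator_ops ops;  /* legacy, single-queue */
                        /* ... blk-mq ops would go here ... */
                };
                /* ... */
        };

        static struct elevator_type iosched_deadline = {
                /*
                 * Initializing 'ops' by name here breaks on older
                 * compilers because the union is anonymous, hence the
                 * named union with the 'sq' member in this patch.
                 */
                .ops = {
                        .elevator_merge_fn = deadline_merge,
                },
        };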

Signed-off-by: Jens Axboe <[email protected]>
---
block/blk-ioc.c | 8 +++----
block/blk-merge.c | 4 ++--
block/blk.h | 10 ++++----
block/cfq-iosched.c | 2 +-
block/deadline-iosched.c | 2 +-
block/elevator.c | 60 ++++++++++++++++++++++++------------------------
block/noop-iosched.c | 2 +-
include/linux/elevator.h | 4 +++-
8 files changed, 47 insertions(+), 45 deletions(-)

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 381cb50a673c..ab372092a57d 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -43,8 +43,8 @@ static void ioc_exit_icq(struct io_cq *icq)
if (icq->flags & ICQ_EXITED)
return;

- if (et->ops.elevator_exit_icq_fn)
- et->ops.elevator_exit_icq_fn(icq);
+ if (et->ops.sq.elevator_exit_icq_fn)
+ et->ops.sq.elevator_exit_icq_fn(icq);

icq->flags |= ICQ_EXITED;
}
@@ -383,8 +383,8 @@ struct io_cq *ioc_create_icq(struct io_context *ioc, struct request_queue *q,
if (likely(!radix_tree_insert(&ioc->icq_tree, q->id, icq))) {
hlist_add_head(&icq->ioc_node, &ioc->icq_list);
list_add(&icq->q_node, &q->icq_list);
- if (et->ops.elevator_init_icq_fn)
- et->ops.elevator_init_icq_fn(icq);
+ if (et->ops.sq.elevator_init_icq_fn)
+ et->ops.sq.elevator_init_icq_fn(icq);
} else {
kmem_cache_free(et->icq_cache, icq);
icq = ioc_lookup_icq(ioc, q);
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 182398cb1524..480570b691dc 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -763,8 +763,8 @@ int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
{
struct elevator_queue *e = q->elevator;

- if (e->type->ops.elevator_allow_rq_merge_fn)
- if (!e->type->ops.elevator_allow_rq_merge_fn(q, rq, next))
+ if (e->type->ops.sq.elevator_allow_rq_merge_fn)
+ if (!e->type->ops.sq.elevator_allow_rq_merge_fn(q, rq, next))
return 0;

return attempt_merge(q, rq, next);
diff --git a/block/blk.h b/block/blk.h
index 041185e5f129..f46c0ac8ae3d 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -167,7 +167,7 @@ static inline struct request *__elv_next_request(struct request_queue *q)
return NULL;
}
if (unlikely(blk_queue_bypass(q)) ||
- !q->elevator->type->ops.elevator_dispatch_fn(q, 0))
+ !q->elevator->type->ops.sq.elevator_dispatch_fn(q, 0))
return NULL;
}
}
@@ -176,16 +176,16 @@ static inline void elv_activate_rq(struct request_queue *q, struct request *rq)
{
struct elevator_queue *e = q->elevator;

- if (e->type->ops.elevator_activate_req_fn)
- e->type->ops.elevator_activate_req_fn(q, rq);
+ if (e->type->ops.sq.elevator_activate_req_fn)
+ e->type->ops.sq.elevator_activate_req_fn(q, rq);
}

static inline void elv_deactivate_rq(struct request_queue *q, struct request *rq)
{
struct elevator_queue *e = q->elevator;

- if (e->type->ops.elevator_deactivate_req_fn)
- e->type->ops.elevator_deactivate_req_fn(q, rq);
+ if (e->type->ops.sq.elevator_deactivate_req_fn)
+ e->type->ops.sq.elevator_deactivate_req_fn(q, rq);
}

#ifdef CONFIG_FAIL_IO_TIMEOUT
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index c73a6fcaeb9d..37aeb20fa454 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -4837,7 +4837,7 @@ static struct elv_fs_entry cfq_attrs[] = {
};

static struct elevator_type iosched_cfq = {
- .ops = {
+ .ops.sq = {
.elevator_merge_fn = cfq_merge,
.elevator_merged_fn = cfq_merged_request,
.elevator_merge_req_fn = cfq_merged_requests,
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 55e0bb6d7da7..05fc0ea25a98 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -439,7 +439,7 @@ static struct elv_fs_entry deadline_attrs[] = {
};

static struct elevator_type iosched_deadline = {
- .ops = {
+ .ops.sq = {
.elevator_merge_fn = deadline_merge,
.elevator_merged_fn = deadline_merged_request,
.elevator_merge_req_fn = deadline_merged_requests,
diff --git a/block/elevator.c b/block/elevator.c
index 40f0c04e5ad3..022a26830297 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -58,8 +58,8 @@ static int elv_iosched_allow_bio_merge(struct request *rq, struct bio *bio)
struct request_queue *q = rq->q;
struct elevator_queue *e = q->elevator;

- if (e->type->ops.elevator_allow_bio_merge_fn)
- return e->type->ops.elevator_allow_bio_merge_fn(q, rq, bio);
+ if (e->type->ops.sq.elevator_allow_bio_merge_fn)
+ return e->type->ops.sq.elevator_allow_bio_merge_fn(q, rq, bio);

return 1;
}
@@ -224,7 +224,7 @@ int elevator_init(struct request_queue *q, char *name)
}
}

- err = e->ops.elevator_init_fn(q, e);
+ err = e->ops.sq.elevator_init_fn(q, e);
if (err)
elevator_put(e);
return err;
@@ -234,8 +234,8 @@ EXPORT_SYMBOL(elevator_init);
void elevator_exit(struct elevator_queue *e)
{
mutex_lock(&e->sysfs_lock);
- if (e->type->ops.elevator_exit_fn)
- e->type->ops.elevator_exit_fn(e);
+ if (e->type->ops.sq.elevator_exit_fn)
+ e->type->ops.sq.elevator_exit_fn(e);
mutex_unlock(&e->sysfs_lock);

kobject_put(&e->kobj);
@@ -443,8 +443,8 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
return ELEVATOR_BACK_MERGE;
}

- if (e->type->ops.elevator_merge_fn)
- return e->type->ops.elevator_merge_fn(q, req, bio);
+ if (e->type->ops.sq.elevator_merge_fn)
+ return e->type->ops.sq.elevator_merge_fn(q, req, bio);

return ELEVATOR_NO_MERGE;
}
@@ -495,8 +495,8 @@ void elv_merged_request(struct request_queue *q, struct request *rq, int type)
{
struct elevator_queue *e = q->elevator;

- if (e->type->ops.elevator_merged_fn)
- e->type->ops.elevator_merged_fn(q, rq, type);
+ if (e->type->ops.sq.elevator_merged_fn)
+ e->type->ops.sq.elevator_merged_fn(q, rq, type);

if (type == ELEVATOR_BACK_MERGE)
elv_rqhash_reposition(q, rq);
@@ -510,8 +510,8 @@ void elv_merge_requests(struct request_queue *q, struct request *rq,
struct elevator_queue *e = q->elevator;
const int next_sorted = next->rq_flags & RQF_SORTED;

- if (next_sorted && e->type->ops.elevator_merge_req_fn)
- e->type->ops.elevator_merge_req_fn(q, rq, next);
+ if (next_sorted && e->type->ops.sq.elevator_merge_req_fn)
+ e->type->ops.sq.elevator_merge_req_fn(q, rq, next);

elv_rqhash_reposition(q, rq);

@@ -528,8 +528,8 @@ void elv_bio_merged(struct request_queue *q, struct request *rq,
{
struct elevator_queue *e = q->elevator;

- if (e->type->ops.elevator_bio_merged_fn)
- e->type->ops.elevator_bio_merged_fn(q, rq, bio);
+ if (e->type->ops.sq.elevator_bio_merged_fn)
+ e->type->ops.sq.elevator_bio_merged_fn(q, rq, bio);
}

#ifdef CONFIG_PM
@@ -578,7 +578,7 @@ void elv_drain_elevator(struct request_queue *q)

lockdep_assert_held(q->queue_lock);

- while (q->elevator->type->ops.elevator_dispatch_fn(q, 1))
+ while (q->elevator->type->ops.sq.elevator_dispatch_fn(q, 1))
;
if (q->nr_sorted && printed++ < 10) {
printk(KERN_ERR "%s: forced dispatching is broken "
@@ -653,7 +653,7 @@ void __elv_add_request(struct request_queue *q, struct request *rq, int where)
* rq cannot be accessed after calling
* elevator_add_req_fn.
*/
- q->elevator->type->ops.elevator_add_req_fn(q, rq);
+ q->elevator->type->ops.sq.elevator_add_req_fn(q, rq);
break;

case ELEVATOR_INSERT_FLUSH:
@@ -682,8 +682,8 @@ struct request *elv_latter_request(struct request_queue *q, struct request *rq)
{
struct elevator_queue *e = q->elevator;

- if (e->type->ops.elevator_latter_req_fn)
- return e->type->ops.elevator_latter_req_fn(q, rq);
+ if (e->type->ops.sq.elevator_latter_req_fn)
+ return e->type->ops.sq.elevator_latter_req_fn(q, rq);
return NULL;
}

@@ -691,8 +691,8 @@ struct request *elv_former_request(struct request_queue *q, struct request *rq)
{
struct elevator_queue *e = q->elevator;

- if (e->type->ops.elevator_former_req_fn)
- return e->type->ops.elevator_former_req_fn(q, rq);
+ if (e->type->ops.sq.elevator_former_req_fn)
+ return e->type->ops.sq.elevator_former_req_fn(q, rq);
return NULL;
}

@@ -701,8 +701,8 @@ int elv_set_request(struct request_queue *q, struct request *rq,
{
struct elevator_queue *e = q->elevator;

- if (e->type->ops.elevator_set_req_fn)
- return e->type->ops.elevator_set_req_fn(q, rq, bio, gfp_mask);
+ if (e->type->ops.sq.elevator_set_req_fn)
+ return e->type->ops.sq.elevator_set_req_fn(q, rq, bio, gfp_mask);
return 0;
}

@@ -710,16 +710,16 @@ void elv_put_request(struct request_queue *q, struct request *rq)
{
struct elevator_queue *e = q->elevator;

- if (e->type->ops.elevator_put_req_fn)
- e->type->ops.elevator_put_req_fn(rq);
+ if (e->type->ops.sq.elevator_put_req_fn)
+ e->type->ops.sq.elevator_put_req_fn(rq);
}

int elv_may_queue(struct request_queue *q, unsigned int op)
{
struct elevator_queue *e = q->elevator;

- if (e->type->ops.elevator_may_queue_fn)
- return e->type->ops.elevator_may_queue_fn(q, op);
+ if (e->type->ops.sq.elevator_may_queue_fn)
+ return e->type->ops.sq.elevator_may_queue_fn(q, op);

return ELV_MQUEUE_MAY;
}
@@ -734,8 +734,8 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
if (blk_account_rq(rq)) {
q->in_flight[rq_is_sync(rq)]--;
if ((rq->rq_flags & RQF_SORTED) &&
- e->type->ops.elevator_completed_req_fn)
- e->type->ops.elevator_completed_req_fn(q, rq);
+ e->type->ops.sq.elevator_completed_req_fn)
+ e->type->ops.sq.elevator_completed_req_fn(q, rq);
}
}

@@ -803,8 +803,8 @@ int elv_register_queue(struct request_queue *q)
}
kobject_uevent(&e->kobj, KOBJ_ADD);
e->registered = 1;
- if (e->type->ops.elevator_registered_fn)
- e->type->ops.elevator_registered_fn(q);
+ if (e->type->ops.sq.elevator_registered_fn)
+ e->type->ops.sq.elevator_registered_fn(q);
}
return error;
}
@@ -912,7 +912,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
spin_unlock_irq(q->queue_lock);

/* allocate, init and register new elevator */
- err = new_e->ops.elevator_init_fn(q, new_e);
+ err = new_e->ops.sq.elevator_init_fn(q, new_e);
if (err)
goto fail_init;

diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index a163c487cf38..2d1b15d89b45 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -92,7 +92,7 @@ static void noop_exit_queue(struct elevator_queue *e)
}

static struct elevator_type elevator_noop = {
- .ops = {
+ .ops.sq = {
.elevator_merge_req_fn = noop_merged_requests,
.elevator_dispatch_fn = noop_dispatch,
.elevator_add_req_fn = noop_add_request,
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index b276e9ef0e0b..2a9e966eed03 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -94,7 +94,9 @@ struct elevator_type
struct kmem_cache *icq_cache;

/* fields provided by elevator implementation */
- struct elevator_ops ops;
+ union {
+ struct elevator_ops sq;
+ } ops;
size_t icq_size; /* see iocontext.h */
size_t icq_align; /* ditto */
struct elv_fs_entry *elevator_attrs;
--
2.7.4

2016-12-17 00:12:40

by Jens Axboe

Subject: [PATCH 2/8] blk-mq: make mq_ops a const pointer

We never change it, make that clear.

Signed-off-by: Jens Axboe <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
---
block/blk-mq.c | 2 +-
include/linux/blk-mq.h | 2 +-
include/linux/blkdev.h | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d79fdc11b1ee..87b7eaa1cb74 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -639,7 +639,7 @@ struct blk_mq_timeout_data {

void blk_mq_rq_timed_out(struct request *req, bool reserved)
{
- struct blk_mq_ops *ops = req->q->mq_ops;
+ const struct blk_mq_ops *ops = req->q->mq_ops;
enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;

/*
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 4a2ab5d99ff7..afc81d77e471 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -60,7 +60,7 @@ struct blk_mq_hw_ctx {

struct blk_mq_tag_set {
unsigned int *mq_map;
- struct blk_mq_ops *ops;
+ const struct blk_mq_ops *ops;
unsigned int nr_hw_queues;
unsigned int queue_depth; /* max hw supported */
unsigned int reserved_tags;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 286b2a264383..7c40fb838b44 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -408,7 +408,7 @@ struct request_queue {
dma_drain_needed_fn *dma_drain_needed;
lld_busy_fn *lld_busy_fn;

- struct blk_mq_ops *mq_ops;
+ const struct blk_mq_ops *mq_ops;

unsigned int *mq_map;

--
2.7.4

2016-12-17 00:12:51

by Jens Axboe

Subject: [PATCH 3/8] block: move rq_ioc() to blk.h

We want to use it outside of blk-core.c.

Signed-off-by: Jens Axboe <[email protected]>
---
block/blk-core.c | 16 ----------------
block/blk.h | 16 ++++++++++++++++
2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 61ba08c58b64..92baea07acbc 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1040,22 +1040,6 @@ static bool blk_rq_should_init_elevator(struct bio *bio)
}

/**
- * rq_ioc - determine io_context for request allocation
- * @bio: request being allocated is for this bio (can be %NULL)
- *
- * Determine io_context to use for request allocation for @bio. May return
- * %NULL if %current->io_context doesn't exist.
- */
-static struct io_context *rq_ioc(struct bio *bio)
-{
-#ifdef CONFIG_BLK_CGROUP
- if (bio && bio->bi_ioc)
- return bio->bi_ioc;
-#endif
- return current->io_context;
-}
-
-/**
* __get_request - get a free request
* @rl: request list to allocate from
* @op: operation and flags
diff --git a/block/blk.h b/block/blk.h
index f46c0ac8ae3d..9a716b5925a4 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -264,6 +264,22 @@ void ioc_clear_queue(struct request_queue *q);
int create_task_io_context(struct task_struct *task, gfp_t gfp_mask, int node);

/**
+ * rq_ioc - determine io_context for request allocation
+ * @bio: request being allocated is for this bio (can be %NULL)
+ *
+ * Determine io_context to use for request allocation for @bio. May return
+ * %NULL if %current->io_context doesn't exist.
+ */
+static inline struct io_context *rq_ioc(struct bio *bio)
+{
+#ifdef CONFIG_BLK_CGROUP
+ if (bio && bio->bi_ioc)
+ return bio->bi_ioc;
+#endif
+ return current->io_context;
+}
+
+/**
* create_io_context - try to create task->io_context
* @gfp_mask: allocation mask
* @node: allocation node
--
2.7.4

2016-12-17 00:13:24

by Jens Axboe

Subject: [PATCH 5/8] blk-mq: export some helpers we need to the scheduling framework

Signed-off-by: Jens Axboe <[email protected]>
---
block/blk-mq.c | 39 +++++++++++++++++++++------------------
block/blk-mq.h | 25 +++++++++++++++++++++++++
2 files changed, 46 insertions(+), 18 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6eeae30cc027..c3119f527bc1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -167,8 +167,8 @@ bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx)
}
EXPORT_SYMBOL(blk_mq_can_queue);

-static void blk_mq_rq_ctx_init(struct request_queue *q, struct blk_mq_ctx *ctx,
- struct request *rq, unsigned int op)
+void blk_mq_rq_ctx_init(struct request_queue *q, struct blk_mq_ctx *ctx,
+ struct request *rq, unsigned int op)
{
INIT_LIST_HEAD(&rq->queuelist);
/* csd/requeue_work/fifo_time is initialized before use */
@@ -213,9 +213,10 @@ static void blk_mq_rq_ctx_init(struct request_queue *q, struct blk_mq_ctx *ctx,

ctx->rq_dispatched[op_is_sync(op)]++;
}
+EXPORT_SYMBOL_GPL(blk_mq_rq_ctx_init);

-static struct request *
-__blk_mq_alloc_request(struct blk_mq_alloc_data *data, unsigned int op)
+struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data,
+ unsigned int op)
{
struct request *rq;
unsigned int tag;
@@ -236,6 +237,7 @@ __blk_mq_alloc_request(struct blk_mq_alloc_data *data, unsigned int op)

return NULL;
}
+EXPORT_SYMBOL_GPL(__blk_mq_alloc_request);

struct request *blk_mq_alloc_request(struct request_queue *q, int rw,
unsigned int flags)
@@ -319,8 +321,8 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q, int rw,
}
EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx);

-static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx,
- struct blk_mq_ctx *ctx, struct request *rq)
+void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
+ struct request *rq)
{
const int tag = rq->tag;
struct request_queue *q = rq->q;
@@ -802,7 +804,7 @@ static bool flush_busy_ctx(struct sbitmap *sb, unsigned int bitnr, void *data)
* Process software queues that have been marked busy, splicing them
* to the for-dispatch
*/
-static void flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
+void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
{
struct flush_busy_ctx_data data = {
.hctx = hctx,
@@ -811,6 +813,7 @@ static void flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)

sbitmap_for_each_set(&hctx->ctx_map, flush_busy_ctx, &data);
}
+EXPORT_SYMBOL_GPL(blk_mq_flush_busy_ctxs);

static inline unsigned int queued_to_index(unsigned int queued)
{
@@ -921,7 +924,7 @@ static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
/*
* Touch any software queue that has pending entries.
*/
- flush_busy_ctxs(hctx, &rq_list);
+ blk_mq_flush_busy_ctxs(hctx, &rq_list);

/*
* If we have previous entries on our dispatch list, grab them
@@ -1135,8 +1138,8 @@ static inline void __blk_mq_insert_req_list(struct blk_mq_hw_ctx *hctx,
list_add_tail(&rq->queuelist, &ctx->rq_list);
}

-static void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx,
- struct request *rq, bool at_head)
+void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
+ bool at_head)
{
struct blk_mq_ctx *ctx = rq->mq_ctx;

@@ -1550,8 +1553,8 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
return cookie;
}

-static void blk_mq_free_rq_map(struct blk_mq_tag_set *set,
- struct blk_mq_tags *tags, unsigned int hctx_idx)
+void blk_mq_free_rq_map(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
+ unsigned int hctx_idx)
{
struct page *page;

@@ -1588,8 +1591,8 @@ static size_t order_to_size(unsigned int order)
return (size_t)PAGE_SIZE << order;
}

-static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
- unsigned int hctx_idx)
+struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
+ unsigned int hctx_idx)
{
struct blk_mq_tags *tags;
unsigned int i, j, entries_per_page, max_order = 4;
@@ -2263,10 +2266,10 @@ static int blk_mq_queue_reinit_dead(unsigned int cpu)
* Now CPU1 is just onlined and a request is inserted into ctx1->rq_list
* and set bit0 in pending bitmap as ctx1->index_hw is still zero.
*
- * And then while running hw queue, flush_busy_ctxs() finds bit0 is set in
- * pending bitmap and tries to retrieve requests in hctx->ctxs[0]->rq_list.
- * But htx->ctxs[0] is a pointer to ctx0, so the request in ctx1->rq_list
- * is ignored.
+ * And then while running hw queue, blk_mq_flush_busy_ctxs() finds bit0 is set
+ * in pending bitmap and tries to retrieve requests in hctx->ctxs[0]->rq_list.
+ * But htx->ctxs[0] is a pointer to ctx0, so the request in ctx1->rq_list is
+ * ignored.
*/
static int blk_mq_queue_reinit_prepare(unsigned int cpu)
{
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 63e9116cddbd..e59f5ca520a2 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -32,6 +32,21 @@ void blk_mq_free_queue(struct request_queue *q);
int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
void blk_mq_wake_waiters(struct request_queue *q);
bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *, struct list_head *);
+void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
+
+/*
+ * Internal helpers for allocating/freeing the request map
+ */
+void blk_mq_free_rq_map(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
+ unsigned int hctx_idx);
+struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
+ unsigned int hctx_idx);
+
+/*
+ * Internal helpers for request insertion into sw queues
+ */
+void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
+ bool at_head);

/*
* CPU hotplug helpers
@@ -103,6 +118,16 @@ static inline void blk_mq_set_alloc_data(struct blk_mq_alloc_data *data,
data->hctx = hctx;
}

+/*
+ * Internal helpers for request allocation/init/free
+ */
+void blk_mq_rq_ctx_init(struct request_queue *q, struct blk_mq_ctx *ctx,
+ struct request *rq, unsigned int op);
+void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
+ struct request *rq);
+struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data,
+ unsigned int op);
+
static inline bool blk_mq_hctx_stopped(struct blk_mq_hw_ctx *hctx)
{
return test_bit(BLK_MQ_S_STOPPED, &hctx->state);
--
2.7.4

2016-12-17 00:13:15

by Jens Axboe

Subject: [PATCH 4/8] blk-mq: un-export blk_mq_free_hctx_request()

It's only used in blk-mq, kill it from the main exported header
and kill the symbol export as well.

Signed-off-by: Jens Axboe <[email protected]>
---
block/blk-mq.c | 5 ++---
include/linux/blk-mq.h | 1 -
2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 87b7eaa1cb74..6eeae30cc027 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -337,15 +337,14 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx,
blk_queue_exit(q);
}

-void blk_mq_free_hctx_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
+static void blk_mq_free_hctx_request(struct blk_mq_hw_ctx *hctx,
+ struct request *rq)
{
struct blk_mq_ctx *ctx = rq->mq_ctx;

ctx->rq_completed[rq_is_sync(rq)]++;
__blk_mq_free_request(hctx, ctx, rq);
-
}
-EXPORT_SYMBOL_GPL(blk_mq_free_hctx_request);

void blk_mq_free_request(struct request *rq)
{
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index afc81d77e471..2686f9e7302a 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -181,7 +181,6 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule);

void blk_mq_insert_request(struct request *, bool, bool, bool);
void blk_mq_free_request(struct request *rq);
-void blk_mq_free_hctx_request(struct blk_mq_hw_ctx *, struct request *rq);
bool blk_mq_can_queue(struct blk_mq_hw_ctx *);

enum {
--
2.7.4

2016-12-17 00:14:02

by Jens Axboe

Subject: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler

This is basically identical to deadline-iosched, except it registers
as an MQ capable scheduler. This is still a single queue design.

Signed-off-by: Jens Axboe <[email protected]>
---
block/Kconfig.iosched | 6 +
block/Makefile | 1 +
block/mq-deadline.c | 649 ++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 656 insertions(+)
create mode 100644 block/mq-deadline.c

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 421bef9c4c48..490ef2850fae 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -32,6 +32,12 @@ config IOSCHED_CFQ

This is the default I/O scheduler.

+config MQ_IOSCHED_DEADLINE
+ tristate "MQ deadline I/O scheduler"
+ default y
+ ---help---
+ MQ version of the deadline IO scheduler.
+
config CFQ_GROUP_IOSCHED
bool "CFQ Group Scheduling support"
depends on IOSCHED_CFQ && BLK_CGROUP
diff --git a/block/Makefile b/block/Makefile
index 2eee9e1bb6db..3ee0abd7205a 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -18,6 +18,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING) += blk-throttle.o
obj-$(CONFIG_IOSCHED_NOOP) += noop-iosched.o
obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
+obj-$(CONFIG_MQ_IOSCHED_DEADLINE) += mq-deadline.o

obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
obj-$(CONFIG_BLK_CMDLINE_PARSER) += cmdline-parser.o
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
new file mode 100644
index 000000000000..3cb9de21ab21
--- /dev/null
+++ b/block/mq-deadline.c
@@ -0,0 +1,649 @@
+/*
+ * MQ Deadline i/o scheduler - adaptation of the legacy deadline scheduler,
+ * for the blk-mq scheduling framework
+ *
+ * Copyright (C) 2016 Jens Axboe <[email protected]>
+ */
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/blkdev.h>
+#include <linux/blk-mq.h>
+#include <linux/elevator.h>
+#include <linux/bio.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/compiler.h>
+#include <linux/rbtree.h>
+#include <linux/sbitmap.h>
+
+#include "blk.h"
+#include "blk-mq.h"
+#include "blk-mq-tag.h"
+#include "blk-mq-sched.h"
+
+static unsigned int queue_depth = 256;
+module_param(queue_depth, uint, 0644);
+MODULE_PARM_DESC(queue_depth, "Use this value as the scheduler queue depth");
+
+/*
+ * See Documentation/block/deadline-iosched.txt
+ */
+static const int read_expire = HZ / 2; /* max time before a read is submitted. */
+static const int write_expire = 5 * HZ; /* ditto for writes, these limits are SOFT! */
+static const int writes_starved = 2; /* max times reads can starve a write */
+static const int fifo_batch = 16; /* # of sequential requests treated as one
+ by the above parameters. For throughput. */
+
+struct deadline_data {
+ /*
+ * run time data
+ */
+
+ /*
+ * requests (deadline_rq s) are present on both sort_list and fifo_list
+ */
+ struct rb_root sort_list[2];
+ struct list_head fifo_list[2];
+
+ /*
+ * next in sort order. read, write or both are NULL
+ */
+ struct request *next_rq[2];
+ unsigned int batching; /* number of sequential requests made */
+ unsigned int starved; /* times reads have starved writes */
+
+ /*
+ * settings that change how the i/o scheduler behaves
+ */
+ int fifo_expire[2];
+ int fifo_batch;
+ int writes_starved;
+ int front_merges;
+
+ spinlock_t lock;
+ struct list_head dispatch;
+ struct blk_mq_tags *tags;
+ atomic_t wait_index;
+};
+
+static inline struct rb_root *
+deadline_rb_root(struct deadline_data *dd, struct request *rq)
+{
+ return &dd->sort_list[rq_data_dir(rq)];
+}
+
+/*
+ * get the request after `rq' in sector-sorted order
+ */
+static inline struct request *
+deadline_latter_request(struct request *rq)
+{
+ struct rb_node *node = rb_next(&rq->rb_node);
+
+ if (node)
+ return rb_entry_rq(node);
+
+ return NULL;
+}
+
+static void
+deadline_add_rq_rb(struct deadline_data *dd, struct request *rq)
+{
+ struct rb_root *root = deadline_rb_root(dd, rq);
+
+ elv_rb_add(root, rq);
+}
+
+static inline void
+deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
+{
+ const int data_dir = rq_data_dir(rq);
+
+ if (dd->next_rq[data_dir] == rq)
+ dd->next_rq[data_dir] = deadline_latter_request(rq);
+
+ elv_rb_del(deadline_rb_root(dd, rq), rq);
+}
+
+/*
+ * remove rq from rbtree and fifo.
+ */
+static void deadline_remove_request(struct request_queue *q, struct request *rq)
+{
+ struct deadline_data *dd = q->elevator->elevator_data;
+
+ list_del_init(&rq->queuelist);
+
+ /*
+ * We might not be on the rbtree, if we are doing an insert merge
+ */
+ if (!RB_EMPTY_NODE(&rq->rb_node))
+ deadline_del_rq_rb(dd, rq);
+
+ elv_rqhash_del(q, rq);
+ if (q->last_merge == rq)
+ q->last_merge = NULL;
+}
+
+static void dd_request_merged(struct request_queue *q, struct request *req,
+ int type)
+{
+ struct deadline_data *dd = q->elevator->elevator_data;
+
+ /*
+ * if the merge was a front merge, we need to reposition request
+ */
+ if (type == ELEVATOR_FRONT_MERGE) {
+ elv_rb_del(deadline_rb_root(dd, req), req);
+ deadline_add_rq_rb(dd, req);
+ }
+}
+
+static void dd_merged_requests(struct request_queue *q, struct request *req,
+ struct request *next)
+{
+ /*
+ * if next expires before rq, assign its expire time to rq
+ * and move into next position (next will be deleted) in fifo
+ */
+ if (!list_empty(&req->queuelist) && !list_empty(&next->queuelist)) {
+ if (time_before((unsigned long)next->fifo_time,
+ (unsigned long)req->fifo_time)) {
+ list_move(&req->queuelist, &next->queuelist);
+ req->fifo_time = next->fifo_time;
+ }
+ }
+
+ /*
+ * kill knowledge of next, this one is a goner
+ */
+ deadline_remove_request(q, next);
+}
+
+/*
+ * move an entry to dispatch queue
+ */
+static void
+deadline_move_request(struct deadline_data *dd, struct request *rq)
+{
+ const int data_dir = rq_data_dir(rq);
+
+ dd->next_rq[READ] = NULL;
+ dd->next_rq[WRITE] = NULL;
+ dd->next_rq[data_dir] = deadline_latter_request(rq);
+
+ /*
+ * take it off the sort and fifo list
+ */
+ deadline_remove_request(rq->q, rq);
+}
+
+/*
+ * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
+ * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
+ */
+static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
+{
+ struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);
+
+ /*
+ * rq is expired!
+ */
+ if (time_after_eq(jiffies, (unsigned long)rq->fifo_time))
+ return 1;
+
+ return 0;
+}
+
+/*
+ * deadline_dispatch_requests selects the best request according to
+ * read/write expire, fifo_batch, etc
+ */
+static struct request *__dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
+{
+ struct deadline_data *dd = hctx->queue->elevator->elevator_data;
+ struct request *rq;
+ bool reads, writes;
+ int data_dir;
+
+ spin_lock(&dd->lock);
+
+ if (!list_empty(&dd->dispatch)) {
+ rq = list_first_entry(&dd->dispatch, struct request, queuelist);
+ list_del_init(&rq->queuelist);
+ goto done;
+ }
+
+ reads = !list_empty(&dd->fifo_list[READ]);
+ writes = !list_empty(&dd->fifo_list[WRITE]);
+
+ /*
+ * batches are currently reads XOR writes
+ */
+ if (dd->next_rq[WRITE])
+ rq = dd->next_rq[WRITE];
+ else
+ rq = dd->next_rq[READ];
+
+ if (rq && dd->batching < dd->fifo_batch)
+ /* we have a next request are still entitled to batch */
+ goto dispatch_request;
+
+ /*
+ * at this point we are not running a batch. select the appropriate
+ * data direction (read / write)
+ */
+
+ if (reads) {
+ BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
+
+ if (writes && (dd->starved++ >= dd->writes_starved))
+ goto dispatch_writes;
+
+ data_dir = READ;
+
+ goto dispatch_find_request;
+ }
+
+ /*
+ * there are either no reads or writes have been starved
+ */
+
+ if (writes) {
+dispatch_writes:
+ BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
+
+ dd->starved = 0;
+
+ data_dir = WRITE;
+
+ goto dispatch_find_request;
+ }
+
+ spin_unlock(&dd->lock);
+ return NULL;
+
+dispatch_find_request:
+ /*
+ * we are not running a batch, find best request for selected data_dir
+ */
+ if (deadline_check_fifo(dd, data_dir) || !dd->next_rq[data_dir]) {
+ /*
+ * A deadline has expired, the last request was in the other
+ * direction, or we have run out of higher-sectored requests.
+ * Start again from the request with the earliest expiry time.
+ */
+ rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
+ } else {
+ /*
+ * The last req was the same dir and we have a next request in
+ * sort order. No expired requests so continue on from here.
+ */
+ rq = dd->next_rq[data_dir];
+ }
+
+ dd->batching = 0;
+
+dispatch_request:
+ /*
+ * rq is the selected appropriate request.
+ */
+ dd->batching++;
+ deadline_move_request(dd, rq);
+done:
+ rq->rq_flags |= RQF_STARTED;
+ spin_unlock(&dd->lock);
+ return rq;
+}
+
+static void dd_dispatch_requests(struct blk_mq_hw_ctx *hctx,
+ struct list_head *rq_list)
+{
+ blk_mq_sched_dispatch_shadow_requests(hctx, rq_list, __dd_dispatch_request);
+}
+
+static void dd_exit_queue(struct elevator_queue *e)
+{
+ struct deadline_data *dd = e->elevator_data;
+
+ BUG_ON(!list_empty(&dd->fifo_list[READ]));
+ BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
+
+ blk_mq_sched_free_requests(dd->tags);
+ kfree(dd);
+}
+
+/*
+ * initialize elevator private data (deadline_data).
+ */
+static int dd_init_queue(struct request_queue *q, struct elevator_type *e)
+{
+ struct deadline_data *dd;
+ struct elevator_queue *eq;
+
+ eq = elevator_alloc(q, e);
+ if (!eq)
+ return -ENOMEM;
+
+ dd = kzalloc_node(sizeof(*dd), GFP_KERNEL, q->node);
+ if (!dd) {
+ kobject_put(&eq->kobj);
+ return -ENOMEM;
+ }
+ eq->elevator_data = dd;
+
+ dd->tags = blk_mq_sched_alloc_requests(queue_depth, q->node);
+ if (!dd->tags) {
+ kfree(dd);
+ kobject_put(&eq->kobj);
+ return -ENOMEM;
+ }
+
+ INIT_LIST_HEAD(&dd->fifo_list[READ]);
+ INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
+ dd->sort_list[READ] = RB_ROOT;
+ dd->sort_list[WRITE] = RB_ROOT;
+ dd->fifo_expire[READ] = read_expire;
+ dd->fifo_expire[WRITE] = write_expire;
+ dd->writes_starved = writes_starved;
+ dd->front_merges = 1;
+ dd->fifo_batch = fifo_batch;
+ spin_lock_init(&dd->lock);
+ INIT_LIST_HEAD(&dd->dispatch);
+ atomic_set(&dd->wait_index, 0);
+
+ q->elevator = eq;
+ return 0;
+}
+
+static int dd_request_merge(struct request_queue *q, struct request **rq,
+ struct bio *bio)
+{
+ struct deadline_data *dd = q->elevator->elevator_data;
+ sector_t sector = bio_end_sector(bio);
+ struct request *__rq;
+
+ if (!dd->front_merges)
+ return ELEVATOR_NO_MERGE;
+
+ __rq = elv_rb_find(&dd->sort_list[bio_data_dir(bio)], sector);
+ if (__rq) {
+ BUG_ON(sector != blk_rq_pos(__rq));
+
+ if (elv_bio_merge_ok(__rq, bio)) {
+ *rq = __rq;
+ return ELEVATOR_FRONT_MERGE;
+ }
+ }
+
+ return ELEVATOR_NO_MERGE;
+}
+
+static bool dd_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
+{
+ struct request_queue *q = hctx->queue;
+ struct deadline_data *dd = q->elevator->elevator_data;
+ int ret;
+
+ spin_lock(&dd->lock);
+ ret = blk_mq_sched_try_merge(q, bio);
+ spin_unlock(&dd->lock);
+
+ return ret;
+}
+
+/*
+ * add rq to rbtree and fifo
+ */
+static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
+ bool at_head)
+{
+ struct request_queue *q = hctx->queue;
+ struct deadline_data *dd = q->elevator->elevator_data;
+ const int data_dir = rq_data_dir(rq);
+
+ if (blk_mq_sched_try_insert_merge(q, rq))
+ return;
+
+ blk_mq_sched_request_inserted(rq);
+
+ /*
+ * If we're trying to insert a real request, just send it directly
+ * to the hardware dispatch list. This only happens for a requeue,
+ * or FUA/FLUSH requests.
+ */
+ if (!blk_mq_sched_rq_is_shadow(rq)) {
+ spin_lock(&hctx->lock);
+ list_add_tail(&rq->queuelist, &hctx->dispatch);
+ spin_unlock(&hctx->lock);
+ return;
+ }
+
+ if (at_head || rq->cmd_type != REQ_TYPE_FS) {
+ if (at_head)
+ list_add(&rq->queuelist, &dd->dispatch);
+ else
+ list_add_tail(&rq->queuelist, &dd->dispatch);
+ } else {
+ deadline_add_rq_rb(dd, rq);
+
+ if (rq_mergeable(rq)) {
+ elv_rqhash_add(q, rq);
+ if (!q->last_merge)
+ q->last_merge = rq;
+ }
+
+ /*
+ * set expire time and add to fifo list
+ */
+ rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
+ list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
+ }
+}
+
+static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
+ struct list_head *list, bool at_head)
+{
+ struct request_queue *q = hctx->queue;
+ struct deadline_data *dd = q->elevator->elevator_data;
+
+ spin_lock(&dd->lock);
+ while (!list_empty(list)) {
+ struct request *rq;
+
+ rq = list_first_entry(list, struct request, queuelist);
+ list_del_init(&rq->queuelist);
+ dd_insert_request(hctx, rq, at_head);
+ }
+ spin_unlock(&dd->lock);
+}
+
+static struct request *dd_get_request(struct request_queue *q, unsigned int op,
+ struct blk_mq_alloc_data *data)
+{
+ struct deadline_data *dd = q->elevator->elevator_data;
+ struct request *rq;
+
+ /*
+ * The flush machinery intercepts before we insert the request. As
+ * a work-around, just hand it back a real request.
+ */
+ if (unlikely(op & (REQ_PREFLUSH | REQ_FUA)))
+ rq = __blk_mq_alloc_request(data, op);
+ else {
+ rq = blk_mq_sched_alloc_shadow_request(q, data, dd->tags, &dd->wait_index);
+ if (rq)
+ blk_mq_rq_ctx_init(q, data->ctx, rq, op);
+ }
+
+ return rq;
+}
+
+static bool dd_put_request(struct request *rq)
+{
+ /*
+ * If it's a real request, we just have to free it. For a shadow
+ * request, we should only free it if we haven't started it. A
+ * started request is mapped to a real one, and the real one will
+ * free it. We can get here with request merges, since we then
+ * free the request before we start/issue it.
+ */
+ if (!blk_mq_sched_rq_is_shadow(rq))
+ return false;
+
+ if (!(rq->rq_flags & RQF_STARTED)) {
+ struct request_queue *q = rq->q;
+ struct deadline_data *dd = q->elevator->elevator_data;
+
+ /*
+ * IO completion would normally do this, but if we merge
+ * and free before we issue the request, drop both the
+ * tag and queue ref
+ */
+ blk_mq_sched_free_shadow_request(dd->tags, rq);
+ blk_queue_exit(q);
+ }
+
+ return true;
+}
+
+static void dd_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
+{
+ struct request *sched_rq = rq->end_io_data;
+
+ /*
+ * sched_rq can be NULL, if we haven't setup the shadow yet
+ * because we failed getting one.
+ */
+ if (sched_rq) {
+ struct deadline_data *dd = hctx->queue->elevator->elevator_data;
+
+ blk_mq_sched_free_shadow_request(dd->tags, sched_rq);
+ blk_mq_start_stopped_hw_queue(hctx, true);
+ }
+}
+
+static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
+{
+ struct deadline_data *dd = hctx->queue->elevator->elevator_data;
+
+ return !list_empty_careful(&dd->dispatch) ||
+ !list_empty_careful(&dd->fifo_list[0]) ||
+ !list_empty_careful(&dd->fifo_list[1]);
+}
+
+/*
+ * sysfs parts below
+ */
+static ssize_t
+deadline_var_show(int var, char *page)
+{
+ return sprintf(page, "%d\n", var);
+}
+
+static ssize_t
+deadline_var_store(int *var, const char *page, size_t count)
+{
+ char *p = (char *) page;
+
+ *var = simple_strtol(p, &p, 10);
+ return count;
+}
+
+#define SHOW_FUNCTION(__FUNC, __VAR, __CONV) \
+static ssize_t __FUNC(struct elevator_queue *e, char *page) \
+{ \
+ struct deadline_data *dd = e->elevator_data; \
+ int __data = __VAR; \
+ if (__CONV) \
+ __data = jiffies_to_msecs(__data); \
+ return deadline_var_show(__data, (page)); \
+}
+SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[READ], 1);
+SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[WRITE], 1);
+SHOW_FUNCTION(deadline_writes_starved_show, dd->writes_starved, 0);
+SHOW_FUNCTION(deadline_front_merges_show, dd->front_merges, 0);
+SHOW_FUNCTION(deadline_fifo_batch_show, dd->fifo_batch, 0);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \
+static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count) \
+{ \
+ struct deadline_data *dd = e->elevator_data; \
+ int __data; \
+ int ret = deadline_var_store(&__data, (page), count); \
+ if (__data < (MIN)) \
+ __data = (MIN); \
+ else if (__data > (MAX)) \
+ __data = (MAX); \
+ if (__CONV) \
+ *(__PTR) = msecs_to_jiffies(__data); \
+ else \
+ *(__PTR) = __data; \
+ return ret; \
+}
+STORE_FUNCTION(deadline_read_expire_store, &dd->fifo_expire[READ], 0, INT_MAX, 1);
+STORE_FUNCTION(deadline_write_expire_store, &dd->fifo_expire[WRITE], 0, INT_MAX, 1);
+STORE_FUNCTION(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX, 0);
+STORE_FUNCTION(deadline_front_merges_store, &dd->front_merges, 0, 1, 0);
+STORE_FUNCTION(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX, 0);
+#undef STORE_FUNCTION
+
+#define DD_ATTR(name) \
+ __ATTR(name, S_IRUGO|S_IWUSR, deadline_##name##_show, \
+ deadline_##name##_store)
+
+static struct elv_fs_entry deadline_attrs[] = {
+ DD_ATTR(read_expire),
+ DD_ATTR(write_expire),
+ DD_ATTR(writes_starved),
+ DD_ATTR(front_merges),
+ DD_ATTR(fifo_batch),
+ __ATTR_NULL
+};
+
+static struct elevator_type mq_deadline = {
+ .ops.mq = {
+ .get_request = dd_get_request,
+ .put_request = dd_put_request,
+ .insert_requests = dd_insert_requests,
+ .dispatch_requests = dd_dispatch_requests,
+ .completed_request = dd_completed_request,
+ .next_request = elv_rb_latter_request,
+ .former_request = elv_rb_former_request,
+ .bio_merge = dd_bio_merge,
+ .request_merge = dd_request_merge,
+ .requests_merged = dd_merged_requests,
+ .request_merged = dd_request_merged,
+ .has_work = dd_has_work,
+ .init_sched = dd_init_queue,
+ .exit_sched = dd_exit_queue,
+ },
+
+ .uses_mq = true,
+ .elevator_attrs = deadline_attrs,
+ .elevator_name = "mq-deadline",
+ .elevator_owner = THIS_MODULE,
+};
+
+static int __init deadline_init(void)
+{
+ if (!queue_depth) {
+ pr_err("mq-deadline: queue depth must be > 0\n");
+ return -EINVAL;
+ }
+ return elv_register(&mq_deadline);
+}
+
+static void __exit deadline_exit(void)
+{
+ elv_unregister(&mq_deadline);
+}
+
+module_init(deadline_init);
+module_exit(deadline_exit);
+
+MODULE_AUTHOR("Jens Axboe");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("MQ deadline IO scheduler");
--
2.7.4

2016-12-17 00:14:14

by Jens Axboe

Subject: [PATCH 8/8] blk-mq-sched: allow setting of default IO scheduler

Add Kconfig entries to manage what devices get assigned an MQ
scheduler, and add a blk-mq flag for drivers to opt out of scheduling.
The latter is useful for admin type queues that still allocate a blk-mq
queue and tag set, but aren't used for normal IO.
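
For illustration, opting out just means setting the new flag on the tag
set before allocating it, as the nvme admin queue does below (sketch; the
ops structure name is made up):

        set->ops = &my_admin_mq_ops;    /* hypothetical driver ops */
        set->nr_hw_queues = 1;
        set->queue_depth = 32;
        set->flags = BLK_MQ_F_NO_SCHED; /* never attach an IO scheduler */

        if (blk_mq_alloc_tag_set(set))
                return -ENOMEM;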

Signed-off-by: Jens Axboe <[email protected]>
---
block/Kconfig.iosched | 43 +++++++++++++++++++++++++++++++++++++------
block/blk-mq-sched.c | 19 +++++++++++++++++++
block/blk-mq-sched.h | 2 ++
block/blk-mq.c | 3 +++
block/elevator.c | 5 ++++-
drivers/nvme/host/pci.c | 1 +
include/linux/blk-mq.h | 1 +
7 files changed, 67 insertions(+), 7 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 490ef2850fae..96216cf18560 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -32,12 +32,6 @@ config IOSCHED_CFQ

This is the default I/O scheduler.

-config MQ_IOSCHED_DEADLINE
- tristate "MQ deadline I/O scheduler"
- default y
- ---help---
- MQ version of the deadline IO scheduler.
-
config CFQ_GROUP_IOSCHED
bool "CFQ Group Scheduling support"
depends on IOSCHED_CFQ && BLK_CGROUP
@@ -69,6 +63,43 @@ config DEFAULT_IOSCHED
default "cfq" if DEFAULT_CFQ
default "noop" if DEFAULT_NOOP

+config MQ_IOSCHED_DEADLINE
+ tristate "MQ deadline I/O scheduler"
+ default y
+ ---help---
+ MQ version of the deadline IO scheduler.
+
+config MQ_IOSCHED_NONE
+ bool
+ default y
+
+choice
+ prompt "Default MQ I/O scheduler"
+ default MQ_IOSCHED_NONE
+ help
+ Select the I/O scheduler which will be used by default for all
+ blk-mq managed block devices.
+
+ config DEFAULT_MQ_DEADLINE
+ bool "MQ Deadline" if MQ_IOSCHED_DEADLINE=y
+
+ config DEFAULT_MQ_NONE
+ bool "None"
+
+endchoice
+
+config DEFAULT_MQ_IOSCHED
+ string
+ default "mq-deadline" if DEFAULT_MQ_DEADLINE
+ default "none" if DEFAULT_MQ_NONE
+
+config MQ_IOSCHED_ONLY_SQ
+ bool "Enable blk-mq IO scheduler only for single queue devices"
+ default y
+ help
+ Say Y here, if you only want to enable IO scheduling on block
+ devices that have a single queue registered.
+
endmenu

endif
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index b7e1839d4785..1f06efcdaa2d 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -432,3 +432,22 @@ void blk_mq_sched_request_inserted(struct request *rq)
trace_block_rq_insert(rq->q, rq);
}
EXPORT_SYMBOL_GPL(blk_mq_sched_request_inserted);
+
+int blk_mq_sched_init(struct request_queue *q)
+{
+ int ret;
+
+#if defined(CONFIG_DEFAULT_MQ_NONE)
+ return 0;
+#endif
+#if defined(CONFIG_MQ_IOSCHED_ONLY_SQ)
+ if (q->nr_hw_queues > 1)
+ return 0;
+#endif
+
+ mutex_lock(&q->sysfs_lock);
+ ret = elevator_init(q, NULL);
+ mutex_unlock(&q->sysfs_lock);
+
+ return ret;
+}
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 1d1a4e9ce6ca..826f3e6991e3 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -37,6 +37,8 @@ bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq);

void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);

+int blk_mq_sched_init(struct request_queue *q);
+
static inline bool
blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
{
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 032dca4a27bf..0d8ea45b8562 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2105,6 +2105,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
INIT_LIST_HEAD(&q->requeue_list);
spin_lock_init(&q->requeue_lock);

+ if (!(set->flags & BLK_MQ_F_NO_SCHED))
+ blk_mq_sched_init(q);
+
if (q->nr_hw_queues > 1)
blk_queue_make_request(q, blk_mq_make_request);
else
diff --git a/block/elevator.c b/block/elevator.c
index e6b523360231..eb34c26f675f 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -219,7 +219,10 @@ int elevator_init(struct request_queue *q, char *name)
}

if (!e) {
- e = elevator_get(CONFIG_DEFAULT_IOSCHED, false);
+ if (q->mq_ops)
+ e = elevator_get(CONFIG_DEFAULT_MQ_IOSCHED, false);
+ else
+ e = elevator_get(CONFIG_DEFAULT_IOSCHED, false);
if (!e) {
printk(KERN_ERR
"Default I/O scheduler not found. " \
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d6e6bce93d0c..063410d9b3cc 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1188,6 +1188,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
dev->admin_tagset.timeout = ADMIN_TIMEOUT;
dev->admin_tagset.numa_node = dev_to_node(dev->dev);
dev->admin_tagset.cmd_size = nvme_cmd_size(dev);
+ dev->admin_tagset.flags = BLK_MQ_F_NO_SCHED;
dev->admin_tagset.driver_data = dev;

if (blk_mq_alloc_tag_set(&dev->admin_tagset))
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index e3159be841ff..9255ccb043f2 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -152,6 +152,7 @@ enum {
BLK_MQ_F_SG_MERGE = 1 << 2,
BLK_MQ_F_DEFER_ISSUE = 1 << 4,
BLK_MQ_F_BLOCKING = 1 << 5,
+ BLK_MQ_F_NO_SCHED = 1 << 6,
BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
BLK_MQ_F_ALLOC_POLICY_BITS = 1,

--
2.7.4

2016-12-17 00:14:26

by Jens Axboe

Subject: [PATCH 6/8] blk-mq-sched: add framework for MQ capable IO schedulers

This adds a set of hooks that intercepts the blk-mq path of
allocating/inserting/issuing/completing requests, allowing
us to develop a scheduler within that framework.

We reuse the existing elevator scheduler API on the registration
side, but augment that with the scheduler flagging support for
the blk-mq interface, and with a separate set of ops hooks for MQ
devices.

Schedulers can opt in to using shadow requests. Shadow requests
are internal requests that the scheduler uses for the allocation
and insertion steps, which are then mapped to a real driver request
at dispatch time. This is needed to separate the device queue depth
from the pool of requests that the scheduler has to work with.
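
Condensed, the expected usage looks roughly like what mq-deadline does in
the next patch ('sched_data' and 'pick_next_shadow' stand in for the
scheduler's own private data and selection routine):

        struct sched_data {                     /* scheduler private data */
                struct blk_mq_tags *tags;       /* shadow request pool */
                atomic_t wait_index;
        };

        /* the scheduler's own selection routine, like __dd_dispatch_request() */
        static struct request *pick_next_shadow(struct blk_mq_hw_ctx *hctx);

        static struct request *sched_get_request(struct request_queue *q,
                                                 unsigned int op,
                                                 struct blk_mq_alloc_data *data)
        {
                struct sched_data *sd = q->elevator->elevator_data;
                struct request *rq;

                /* allocate from the scheduler's private pool, not the driver tags */
                rq = blk_mq_sched_alloc_shadow_request(q, data, sd->tags,
                                                       &sd->wait_index);
                if (rq)
                        blk_mq_rq_ctx_init(q, data->ctx, rq, op);
                return rq;
        }

        static void sched_dispatch_requests(struct blk_mq_hw_ctx *hctx,
                                            struct list_head *rq_list)
        {
                /* shadows picked by the scheduler are mapped to real driver requests here */
                blk_mq_sched_dispatch_shadow_requests(hctx, rq_list,
                                                      pick_next_shadow);
        }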

Signed-off-by: Jens Axboe <[email protected]>
---
block/Makefile | 2 +-
block/blk-core.c | 3 +-
block/blk-exec.c | 3 +-
block/blk-flush.c | 7 +-
block/blk-merge.c | 2 +-
block/blk-mq-sched.c | 434 +++++++++++++++++++++++++++++++++++++++++++++++
block/blk-mq-sched.h | 209 +++++++++++++++++++++++
block/blk-mq.c | 197 +++++++++------------
block/blk-mq.h | 6 +-
block/elevator.c | 186 +++++++++++++++-----
include/linux/blk-mq.h | 3 +-
include/linux/elevator.h | 30 ++++
12 files changed, 914 insertions(+), 168 deletions(-)
create mode 100644 block/blk-mq-sched.c
create mode 100644 block/blk-mq-sched.h

diff --git a/block/Makefile b/block/Makefile
index a827f988c4e6..2eee9e1bb6db 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -6,7 +6,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
- blk-mq-sysfs.o blk-mq-cpumap.o ioctl.o \
+ blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \
genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
badblocks.o partitions/

diff --git a/block/blk-core.c b/block/blk-core.c
index 92baea07acbc..ee3a6f340cb8 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -39,6 +39,7 @@

#include "blk.h"
#include "blk-mq.h"
+#include "blk-mq-sched.h"
#include "blk-wbt.h"

EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
@@ -2127,7 +2128,7 @@ int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
if (q->mq_ops) {
if (blk_queue_io_stat(q))
blk_account_io_start(rq, true);
- blk_mq_insert_request(rq, false, true, false);
+ blk_mq_sched_insert_request(rq, false, true, false);
return 0;
}

diff --git a/block/blk-exec.c b/block/blk-exec.c
index 3ecb00a6cf45..86656fdfa637 100644
--- a/block/blk-exec.c
+++ b/block/blk-exec.c
@@ -9,6 +9,7 @@
#include <linux/sched/sysctl.h>

#include "blk.h"
+#include "blk-mq-sched.h"

/*
* for max sense size
@@ -65,7 +66,7 @@ void blk_execute_rq_nowait(struct request_queue *q, struct gendisk *bd_disk,
* be reused after dying flag is set
*/
if (q->mq_ops) {
- blk_mq_insert_request(rq, at_head, true, false);
+ blk_mq_sched_insert_request(rq, at_head, true, false);
return;
}

diff --git a/block/blk-flush.c b/block/blk-flush.c
index 20b7c7a02f1c..6a7c29d2eb3c 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -74,6 +74,7 @@
#include "blk.h"
#include "blk-mq.h"
#include "blk-mq-tag.h"
+#include "blk-mq-sched.h"

/* FLUSH/FUA sequences */
enum {
@@ -453,9 +454,9 @@ void blk_insert_flush(struct request *rq)
*/
if ((policy & REQ_FSEQ_DATA) &&
!(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
- if (q->mq_ops) {
- blk_mq_insert_request(rq, false, true, false);
- } else
+ if (q->mq_ops)
+ blk_mq_sched_insert_request(rq, false, true, false);
+ else
list_add_tail(&rq->queuelist, &q->queue_head);
return;
}
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 480570b691dc..6aa43dec5af4 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -763,7 +763,7 @@ int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
{
struct elevator_queue *e = q->elevator;

- if (e->type->ops.sq.elevator_allow_rq_merge_fn)
+ if (!e->uses_mq && e->type->ops.sq.elevator_allow_rq_merge_fn)
if (!e->type->ops.sq.elevator_allow_rq_merge_fn(q, rq, next))
return 0;

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
new file mode 100644
index 000000000000..b7e1839d4785
--- /dev/null
+++ b/block/blk-mq-sched.c
@@ -0,0 +1,434 @@
+/*
+ * blk-mq scheduling framework
+ *
+ * Copyright (C) 2016 Jens Axboe
+ */
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/blk-mq.h>
+
+#include <trace/events/block.h>
+
+#include "blk.h"
+#include "blk-mq.h"
+#include "blk-mq-sched.h"
+#include "blk-mq-tag.h"
+#include "blk-wbt.h"
+
+/*
+ * Empty set
+ */
+static const struct blk_mq_ops mq_sched_tag_ops = {
+};
+
+void blk_mq_sched_free_requests(struct blk_mq_tags *tags)
+{
+ blk_mq_free_rq_map(NULL, tags, 0);
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_free_requests);
+
+struct blk_mq_tags *blk_mq_sched_alloc_requests(unsigned int depth,
+ unsigned int numa_node)
+{
+ struct blk_mq_tag_set set = {
+ .ops = &mq_sched_tag_ops,
+ .nr_hw_queues = 1,
+ .queue_depth = depth,
+ .numa_node = numa_node,
+ };
+
+ return blk_mq_init_rq_map(&set, 0);
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_alloc_requests);
+
+void blk_mq_sched_free_hctx_data(struct request_queue *q,
+ void (*exit)(struct blk_mq_hw_ctx *))
+{
+ struct blk_mq_hw_ctx *hctx;
+ int i;
+
+ queue_for_each_hw_ctx(q, hctx, i) {
+ if (exit)
+ exit(hctx);
+ kfree(hctx->sched_data);
+ hctx->sched_data = NULL;
+ }
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_free_hctx_data);
+
+int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
+ int (*init)(struct blk_mq_hw_ctx *),
+ void (*exit)(struct blk_mq_hw_ctx *))
+{
+ struct blk_mq_hw_ctx *hctx;
+ int ret;
+ int i;
+
+ queue_for_each_hw_ctx(q, hctx, i) {
+ hctx->sched_data = kmalloc_node(size, GFP_KERNEL, hctx->numa_node);
+ if (!hctx->sched_data) {
+ ret = -ENOMEM;
+ goto error;
+ }
+
+ if (init) {
+ ret = init(hctx);
+ if (ret) {
+ /*
+ * We don't want to give exit() a partially
+ * initialized sched_data. init() must clean up
+ * if it fails.
+ */
+ kfree(hctx->sched_data);
+ hctx->sched_data = NULL;
+ goto error;
+ }
+ }
+ }
+
+ return 0;
+error:
+ blk_mq_sched_free_hctx_data(q, exit);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_init_hctx_data);
+
+struct request *blk_mq_sched_alloc_shadow_request(struct request_queue *q,
+ struct blk_mq_alloc_data *data,
+ struct blk_mq_tags *tags,
+ atomic_t *wait_index)
+{
+ struct sbq_wait_state *ws;
+ DEFINE_WAIT(wait);
+ struct request *rq;
+ int tag;
+
+ tag = __sbitmap_queue_get(&tags->bitmap_tags);
+ if (tag != -1)
+ goto done;
+
+ if (data->flags & BLK_MQ_REQ_NOWAIT)
+ return NULL;
+
+ ws = sbq_wait_ptr(&tags->bitmap_tags, wait_index);
+ do {
+ prepare_to_wait(&ws->wait, &wait, TASK_UNINTERRUPTIBLE);
+
+ tag = __sbitmap_queue_get(&tags->bitmap_tags);
+ if (tag != -1)
+ break;
+
+ blk_mq_run_hw_queue(data->hctx, false);
+
+ tag = __sbitmap_queue_get(&tags->bitmap_tags);
+ if (tag != -1)
+ break;
+
+ blk_mq_put_ctx(data->ctx);
+ io_schedule();
+
+ data->ctx = blk_mq_get_ctx(data->q);
+ data->hctx = blk_mq_map_queue(data->q, data->ctx->cpu);
+ finish_wait(&ws->wait, &wait);
+ ws = sbq_wait_ptr(&tags->bitmap_tags, wait_index);
+ } while (1);
+
+ finish_wait(&ws->wait, &wait);
+done:
+ rq = tags->rqs[tag];
+ rq->tag = tag;
+ rq->rq_flags = RQF_ALLOCED;
+ return rq;
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_alloc_shadow_request);
+
+void blk_mq_sched_free_shadow_request(struct blk_mq_tags *tags,
+ struct request *rq)
+{
+ WARN_ON_ONCE(!(rq->rq_flags & RQF_ALLOCED));
+ sbitmap_queue_clear(&tags->bitmap_tags, rq->tag, rq->mq_ctx->cpu);
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_free_shadow_request);
+
+static void rq_copy(struct request *rq, struct request *src)
+{
+#define FIELD_COPY(dst, src, name) ((dst)->name = (src)->name)
+ FIELD_COPY(rq, src, cpu);
+ FIELD_COPY(rq, src, cmd_type);
+ FIELD_COPY(rq, src, cmd_flags);
+ rq->rq_flags |= (src->rq_flags & (RQF_PREEMPT | RQF_QUIET | RQF_PM | RQF_DONTPREP));
+ rq->rq_flags &= ~RQF_IO_STAT;
+ FIELD_COPY(rq, src, __data_len);
+ FIELD_COPY(rq, src, __sector);
+ FIELD_COPY(rq, src, bio);
+ FIELD_COPY(rq, src, biotail);
+ FIELD_COPY(rq, src, rq_disk);
+ FIELD_COPY(rq, src, part);
+ FIELD_COPY(rq, src, issue_stat);
+ src->issue_stat.time = 0;
+ FIELD_COPY(rq, src, nr_phys_segments);
+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+ FIELD_COPY(rq, src, nr_integrity_segments);
+#endif
+ FIELD_COPY(rq, src, ioprio);
+ FIELD_COPY(rq, src, timeout);
+
+ if (src->cmd_type == REQ_TYPE_BLOCK_PC) {
+ FIELD_COPY(rq, src, cmd);
+ FIELD_COPY(rq, src, cmd_len);
+ FIELD_COPY(rq, src, extra_len);
+ FIELD_COPY(rq, src, sense_len);
+ FIELD_COPY(rq, src, resid_len);
+ FIELD_COPY(rq, src, sense);
+ FIELD_COPY(rq, src, retries);
+ }
+
+ src->bio = src->biotail = NULL;
+}
+
+static void sched_rq_end_io(struct request *rq, int error)
+{
+ struct request *sched_rq = rq->end_io_data;
+
+ FIELD_COPY(sched_rq, rq, resid_len);
+ FIELD_COPY(sched_rq, rq, extra_len);
+ FIELD_COPY(sched_rq, rq, sense_len);
+ FIELD_COPY(sched_rq, rq, errors);
+ FIELD_COPY(sched_rq, rq, retries);
+
+ blk_account_io_completion(sched_rq, blk_rq_bytes(sched_rq));
+ blk_account_io_done(sched_rq);
+
+ if (sched_rq->end_io)
+ sched_rq->end_io(sched_rq, error);
+
+ blk_mq_finish_request(rq);
+}
+
+static inline struct request *
+__blk_mq_sched_alloc_request(struct blk_mq_hw_ctx *hctx)
+{
+ struct blk_mq_alloc_data data;
+ struct request *rq;
+
+ data.q = hctx->queue;
+ data.flags = BLK_MQ_REQ_NOWAIT;
+ data.ctx = blk_mq_get_ctx(hctx->queue);
+ data.hctx = hctx;
+
+ rq = __blk_mq_alloc_request(&data, 0);
+ blk_mq_put_ctx(data.ctx);
+
+ if (!rq)
+ blk_mq_stop_hw_queue(hctx);
+
+ return rq;
+}
+
+static inline void
+__blk_mq_sched_init_request_from_shadow(struct request *rq,
+ struct request *sched_rq)
+{
+ WARN_ON_ONCE(!(sched_rq->rq_flags & RQF_ALLOCED));
+ rq_copy(rq, sched_rq);
+ rq->end_io = sched_rq_end_io;
+ rq->end_io_data = sched_rq;
+}
+
+struct request *
+blk_mq_sched_request_from_shadow(struct blk_mq_hw_ctx *hctx,
+ struct request *(*get_sched_rq)(struct blk_mq_hw_ctx *))
+{
+ struct request *rq, *sched_rq;
+
+ rq = __blk_mq_sched_alloc_request(hctx);
+ if (!rq)
+ return NULL;
+
+ sched_rq = get_sched_rq(hctx);
+ if (sched_rq) {
+ __blk_mq_sched_init_request_from_shadow(rq, sched_rq);
+ return rq;
+ }
+
+ /*
+ * __blk_mq_finish_request() drops a queue ref we already hold,
+ * so grab an extra one.
+ */
+ blk_queue_enter_live(hctx->queue);
+ __blk_mq_finish_request(hctx, rq->mq_ctx, rq);
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_request_from_shadow);
+
+struct request *__blk_mq_sched_request_from_shadow(struct blk_mq_hw_ctx *hctx,
+ struct request *sched_rq)
+{
+ struct request *rq;
+
+ rq = __blk_mq_sched_alloc_request(hctx);
+ if (rq)
+ __blk_mq_sched_init_request_from_shadow(rq, sched_rq);
+
+ return rq;
+}
+EXPORT_SYMBOL_GPL(__blk_mq_sched_request_from_shadow);
+
+static void __blk_mq_sched_assign_ioc(struct request_queue *q,
+ struct request *rq, struct io_context *ioc)
+{
+ struct io_cq *icq;
+
+ spin_lock_irq(q->queue_lock);
+ icq = ioc_lookup_icq(ioc, q);
+ spin_unlock_irq(q->queue_lock);
+
+ if (!icq) {
+ icq = ioc_create_icq(ioc, q, GFP_ATOMIC);
+ if (!icq)
+ return;
+ }
+
+ rq->elv.icq = icq;
+ if (!blk_mq_sched_get_rq_priv(q, rq)) {
+ get_io_context(icq->ioc);
+ return;
+ }
+
+ rq->elv.icq = NULL;
+}
+
+static void blk_mq_sched_assign_ioc(struct request_queue *q,
+ struct request *rq, struct bio *bio)
+{
+ struct io_context *ioc;
+
+ ioc = rq_ioc(bio);
+ if (ioc)
+ __blk_mq_sched_assign_ioc(q, rq, ioc);
+}
+
+struct request *blk_mq_sched_get_request(struct request_queue *q,
+ struct bio *bio,
+ unsigned int op,
+ struct blk_mq_alloc_data *data)
+{
+ struct elevator_queue *e = q->elevator;
+ struct blk_mq_hw_ctx *hctx;
+ struct blk_mq_ctx *ctx;
+ struct request *rq;
+
+ blk_queue_enter_live(q);
+ ctx = blk_mq_get_ctx(q);
+ hctx = blk_mq_map_queue(q, ctx->cpu);
+
+ blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
+
+ if (e && e->type->ops.mq.get_request)
+ rq = e->type->ops.mq.get_request(q, op, data);
+ else
+ rq = __blk_mq_alloc_request(data, op);
+
+ if (rq) {
+ rq->elv.icq = NULL;
+ if (e && e->type->icq_cache)
+ blk_mq_sched_assign_ioc(q, rq, bio);
+ data->hctx->queued++;
+ return rq;
+ }
+
+ blk_queue_exit(q);
+ return NULL;
+}
+
+void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
+{
+ struct elevator_queue *e = hctx->queue->elevator;
+ LIST_HEAD(rq_list);
+
+ if (unlikely(blk_mq_hctx_stopped(hctx)))
+ return;
+
+ hctx->run++;
+
+ /*
+ * If we have previous entries on our dispatch list, grab them first for
+ * more fair dispatch.
+ */
+ if (!list_empty_careful(&hctx->dispatch)) {
+ spin_lock(&hctx->lock);
+ if (!list_empty(&hctx->dispatch))
+ list_splice_init(&hctx->dispatch, &rq_list);
+ spin_unlock(&hctx->lock);
+ }
+
+ /*
+ * Only ask the scheduler for requests, if we didn't have residual
+ * requests from the dispatch list. This is to avoid the case where
+ * we only ever dispatch a fraction of the requests available because
+ * of low device queue depth. Once we pull requests out of the IO
+ * scheduler, we can no longer merge or sort them. So it's best to
+ * leave them there for as long as we can. Mark the hw queue as
+ * needing a restart in that case.
+ */
+ if (list_empty(&rq_list)) {
+ if (e && e->type->ops.mq.dispatch_requests)
+ e->type->ops.mq.dispatch_requests(hctx, &rq_list);
+ else
+ blk_mq_flush_busy_ctxs(hctx, &rq_list);
+ } else if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
+ set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
+
+ blk_mq_dispatch_rq_list(hctx, &rq_list);
+}
+
+bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio)
+{
+ struct request *rq;
+ int ret;
+
+ ret = elv_merge(q, &rq, bio);
+ if (ret == ELEVATOR_BACK_MERGE) {
+ if (bio_attempt_back_merge(q, rq, bio)) {
+ if (!attempt_back_merge(q, rq))
+ elv_merged_request(q, rq, ret);
+ return true;
+ }
+ } else if (ret == ELEVATOR_FRONT_MERGE) {
+ if (bio_attempt_front_merge(q, rq, bio)) {
+ if (!attempt_front_merge(q, rq))
+ elv_merged_request(q, rq, ret);
+ return true;
+ }
+ }
+
+ return false;
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_try_merge);
+
+bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
+{
+ struct elevator_queue *e = q->elevator;
+
+ if (e->type->ops.mq.bio_merge) {
+ struct blk_mq_ctx *ctx = blk_mq_get_ctx(q);
+ struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+
+ blk_mq_put_ctx(ctx);
+ return e->type->ops.mq.bio_merge(hctx, bio);
+ }
+
+ return false;
+}
+
+bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq)
+{
+ return rq_mergeable(rq) && elv_attempt_insert_merge(q, rq);
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_try_insert_merge);
+
+void blk_mq_sched_request_inserted(struct request *rq)
+{
+ trace_block_rq_insert(rq->q, rq);
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_request_inserted);
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
new file mode 100644
index 000000000000..1d1a4e9ce6ca
--- /dev/null
+++ b/block/blk-mq-sched.h
@@ -0,0 +1,209 @@
+#ifndef BLK_MQ_SCHED_H
+#define BLK_MQ_SCHED_H
+
+#include "blk-mq.h"
+#include "blk-wbt.h"
+
+struct blk_mq_tags *blk_mq_sched_alloc_requests(unsigned int depth, unsigned int numa_node);
+void blk_mq_sched_free_requests(struct blk_mq_tags *tags);
+
+int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
+ int (*init)(struct blk_mq_hw_ctx *),
+ void (*exit)(struct blk_mq_hw_ctx *));
+
+void blk_mq_sched_free_hctx_data(struct request_queue *q,
+ void (*exit)(struct blk_mq_hw_ctx *));
+
+void blk_mq_sched_free_shadow_request(struct blk_mq_tags *tags,
+ struct request *rq);
+struct request *blk_mq_sched_alloc_shadow_request(struct request_queue *q,
+ struct blk_mq_alloc_data *data,
+ struct blk_mq_tags *tags,
+ atomic_t *wait_index);
+struct request *
+blk_mq_sched_request_from_shadow(struct blk_mq_hw_ctx *hctx,
+ struct request *(*get_sched_rq)(struct blk_mq_hw_ctx *));
+struct request *
+__blk_mq_sched_request_from_shadow(struct blk_mq_hw_ctx *hctx,
+ struct request *sched_rq);
+
+struct request *blk_mq_sched_get_request(struct request_queue *q, struct bio *bio, unsigned int op, struct blk_mq_alloc_data *data);
+
+void __blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
+void blk_mq_sched_request_inserted(struct request *rq);
+bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio);
+bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio);
+bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq);
+
+void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
+
+static inline bool
+blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
+{
+ struct elevator_queue *e = q->elevator;
+
+ if (!e || blk_queue_nomerges(q) || !bio_mergeable(bio))
+ return false;
+
+ return __blk_mq_sched_bio_merge(q, bio);
+}
+
+static inline int blk_mq_sched_get_rq_priv(struct request_queue *q,
+ struct request *rq)
+{
+ struct elevator_queue *e = q->elevator;
+
+ if (e && e->type->ops.mq.get_rq_priv)
+ return e->type->ops.mq.get_rq_priv(q, rq);
+
+ return 0;
+}
+
+static inline void blk_mq_sched_put_rq_priv(struct request_queue *q,
+ struct request *rq)
+{
+ struct elevator_queue *e = q->elevator;
+
+ if (e && e->type->ops.mq.put_rq_priv)
+ e->type->ops.mq.put_rq_priv(q, rq);
+}
+
+static inline void blk_mq_sched_put_request(struct request *rq)
+{
+ struct request_queue *q = rq->q;
+ struct elevator_queue *e = q->elevator;
+ bool do_free = true;
+
+ wbt_done(q->rq_wb, &rq->issue_stat);
+
+ if (rq->rq_flags & RQF_ELVPRIV) {
+ blk_mq_sched_put_rq_priv(rq->q, rq);
+ if (rq->elv.icq) {
+ put_io_context(rq->elv.icq->ioc);
+ rq->elv.icq = NULL;
+ }
+ }
+
+ if (e && e->type->ops.mq.put_request)
+ do_free = !e->type->ops.mq.put_request(rq);
+ if (do_free)
+ blk_mq_finish_request(rq);
+}
+
+static inline void
+blk_mq_sched_insert_request(struct request *rq, bool at_head, bool run_queue,
+ bool async)
+{
+ struct request_queue *q = rq->q;
+ struct elevator_queue *e = q->elevator;
+ struct blk_mq_ctx *ctx = rq->mq_ctx;
+ struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+
+ if (e && e->type->ops.mq.insert_requests) {
+ LIST_HEAD(list);
+
+ list_add(&rq->queuelist, &list);
+ e->type->ops.mq.insert_requests(hctx, &list, at_head);
+ } else {
+ spin_lock(&ctx->lock);
+ __blk_mq_insert_request(hctx, rq, at_head);
+ spin_unlock(&ctx->lock);
+ }
+
+ if (run_queue)
+ blk_mq_run_hw_queue(hctx, async);
+}
+
+static inline void
+blk_mq_sched_insert_requests(struct request_queue *q, struct blk_mq_ctx *ctx,
+ struct list_head *list, bool run_queue_async)
+{
+ struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+ struct elevator_queue *e = hctx->queue->elevator;
+
+ if (e && e->type->ops.mq.insert_requests)
+ e->type->ops.mq.insert_requests(hctx, list, false);
+ else
+ blk_mq_insert_requests(hctx, ctx, list);
+
+ blk_mq_run_hw_queue(hctx, run_queue_async);
+}
+
+static inline void
+blk_mq_sched_dispatch_shadow_requests(struct blk_mq_hw_ctx *hctx,
+ struct list_head *rq_list,
+ struct request *(*get_sched_rq)(struct blk_mq_hw_ctx *))
+{
+ do {
+ struct request *rq;
+
+ rq = blk_mq_sched_request_from_shadow(hctx, get_sched_rq);
+ if (!rq)
+ break;
+
+ list_add_tail(&rq->queuelist, rq_list);
+ } while (1);
+}
+
+static inline bool
+blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq,
+ struct bio *bio)
+{
+ struct elevator_queue *e = q->elevator;
+
+ if (e && e->type->ops.mq.allow_merge)
+ return e->type->ops.mq.allow_merge(q, rq, bio);
+
+ return true;
+}
+
+static inline void
+blk_mq_sched_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
+{
+ struct elevator_queue *e = hctx->queue->elevator;
+
+ if (e && e->type->ops.mq.completed_request)
+ e->type->ops.mq.completed_request(hctx, rq);
+
+ if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
+ clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
+ blk_mq_run_hw_queue(hctx, true);
+ }
+}
+
+static inline void blk_mq_sched_started_request(struct request *rq)
+{
+ struct request_queue *q = rq->q;
+ struct elevator_queue *e = q->elevator;
+
+ if (e && e->type->ops.mq.started_request)
+ e->type->ops.mq.started_request(rq);
+}
+
+static inline void blk_mq_sched_requeue_request(struct request *rq)
+{
+ struct request_queue *q = rq->q;
+ struct elevator_queue *e = q->elevator;
+
+ if (e && e->type->ops.mq.requeue_request)
+ e->type->ops.mq.requeue_request(rq);
+}
+
+static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
+{
+ struct elevator_queue *e = hctx->queue->elevator;
+
+ if (e && e->type->ops.mq.has_work)
+ return e->type->ops.mq.has_work(hctx);
+
+ return false;
+}
+
+/*
+ * Returns true if this is an internal shadow request
+ */
+static inline bool blk_mq_sched_rq_is_shadow(struct request *rq)
+{
+ return (rq->rq_flags & RQF_ALLOCED) != 0;
+}
+#endif
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c3119f527bc1..032dca4a27bf 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -32,6 +32,7 @@
#include "blk-mq-tag.h"
#include "blk-stat.h"
#include "blk-wbt.h"
+#include "blk-mq-sched.h"

static DEFINE_MUTEX(all_q_mutex);
static LIST_HEAD(all_q_list);
@@ -41,7 +42,8 @@ static LIST_HEAD(all_q_list);
*/
static bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
{
- return sbitmap_any_bit_set(&hctx->ctx_map);
+ return sbitmap_any_bit_set(&hctx->ctx_map) ||
+ blk_mq_sched_has_work(hctx);
}

/*
@@ -242,26 +244,21 @@ EXPORT_SYMBOL_GPL(__blk_mq_alloc_request);
struct request *blk_mq_alloc_request(struct request_queue *q, int rw,
unsigned int flags)
{
- struct blk_mq_ctx *ctx;
- struct blk_mq_hw_ctx *hctx;
- struct request *rq;
struct blk_mq_alloc_data alloc_data;
+ struct request *rq;
int ret;

ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT);
if (ret)
return ERR_PTR(ret);

- ctx = blk_mq_get_ctx(q);
- hctx = blk_mq_map_queue(q, ctx->cpu);
- blk_mq_set_alloc_data(&alloc_data, q, flags, ctx, hctx);
- rq = __blk_mq_alloc_request(&alloc_data, rw);
- blk_mq_put_ctx(ctx);
+ rq = blk_mq_sched_get_request(q, NULL, rw, &alloc_data);

- if (!rq) {
- blk_queue_exit(q);
+ blk_mq_put_ctx(alloc_data.ctx);
+ blk_queue_exit(q);
+
+ if (!rq)
return ERR_PTR(-EWOULDBLOCK);
- }

rq->__data_len = 0;
rq->__sector = (sector_t) -1;
@@ -321,12 +318,14 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q, int rw,
}
EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx);

-void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
- struct request *rq)
+void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
+ struct request *rq)
{
const int tag = rq->tag;
struct request_queue *q = rq->q;

+ blk_mq_sched_completed_request(hctx, rq);
+
if (rq->rq_flags & RQF_MQ_INFLIGHT)
atomic_dec(&hctx->nr_active);

@@ -339,18 +338,23 @@ void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
blk_queue_exit(q);
}

-static void blk_mq_free_hctx_request(struct blk_mq_hw_ctx *hctx,
+static void blk_mq_finish_hctx_request(struct blk_mq_hw_ctx *hctx,
struct request *rq)
{
struct blk_mq_ctx *ctx = rq->mq_ctx;

ctx->rq_completed[rq_is_sync(rq)]++;
- __blk_mq_free_request(hctx, ctx, rq);
+ __blk_mq_finish_request(hctx, ctx, rq);
+}
+
+void blk_mq_finish_request(struct request *rq)
+{
+ blk_mq_finish_hctx_request(blk_mq_map_queue(rq->q, rq->mq_ctx->cpu), rq);
}

void blk_mq_free_request(struct request *rq)
{
- blk_mq_free_hctx_request(blk_mq_map_queue(rq->q, rq->mq_ctx->cpu), rq);
+ blk_mq_sched_put_request(rq);
}
EXPORT_SYMBOL_GPL(blk_mq_free_request);

@@ -468,6 +472,8 @@ void blk_mq_start_request(struct request *rq)
{
struct request_queue *q = rq->q;

+ blk_mq_sched_started_request(rq);
+
trace_block_rq_issue(q, rq);

rq->resid_len = blk_rq_bytes(rq);
@@ -516,6 +522,7 @@ static void __blk_mq_requeue_request(struct request *rq)

trace_block_rq_requeue(q, rq);
wbt_requeue(q->rq_wb, &rq->issue_stat);
+ blk_mq_sched_requeue_request(rq);

if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
if (q->dma_drain_size && blk_rq_bytes(rq))
@@ -550,13 +557,13 @@ static void blk_mq_requeue_work(struct work_struct *work)

rq->rq_flags &= ~RQF_SOFTBARRIER;
list_del_init(&rq->queuelist);
- blk_mq_insert_request(rq, true, false, false);
+ blk_mq_sched_insert_request(rq, true, false, false);
}

while (!list_empty(&rq_list)) {
rq = list_entry(rq_list.next, struct request, queuelist);
list_del_init(&rq->queuelist);
- blk_mq_insert_request(rq, false, false, false);
+ blk_mq_sched_insert_request(rq, false, false, false);
}

blk_mq_run_hw_queues(q, false);
@@ -762,8 +769,16 @@ static bool blk_mq_attempt_merge(struct request_queue *q,

if (!blk_rq_merge_ok(rq, bio))
continue;
+ if (!blk_mq_sched_allow_merge(q, rq, bio))
+ break;

el_ret = blk_try_merge(rq, bio);
+ if (el_ret == ELEVATOR_NO_MERGE)
+ continue;
+
+ if (!blk_mq_sched_allow_merge(q, rq, bio))
+ break;
+
if (el_ret == ELEVATOR_BACK_MERGE) {
if (bio_attempt_back_merge(q, rq, bio)) {
ctx->rq_merged++;
@@ -905,41 +920,6 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
return ret != BLK_MQ_RQ_QUEUE_BUSY;
}

-/*
- * Run this hardware queue, pulling any software queues mapped to it in.
- * Note that this function currently has various problems around ordering
- * of IO. In particular, we'd like FIFO behaviour on handling existing
- * items on the hctx->dispatch list. Ignore that for now.
- */
-static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
-{
- LIST_HEAD(rq_list);
- LIST_HEAD(driver_list);
-
- if (unlikely(blk_mq_hctx_stopped(hctx)))
- return;
-
- hctx->run++;
-
- /*
- * Touch any software queue that has pending entries.
- */
- blk_mq_flush_busy_ctxs(hctx, &rq_list);
-
- /*
- * If we have previous entries on our dispatch list, grab them
- * and stuff them at the front for more fair dispatch.
- */
- if (!list_empty_careful(&hctx->dispatch)) {
- spin_lock(&hctx->lock);
- if (!list_empty(&hctx->dispatch))
- list_splice_init(&hctx->dispatch, &rq_list);
- spin_unlock(&hctx->lock);
- }
-
- blk_mq_dispatch_rq_list(hctx, &rq_list);
-}
-
static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
{
int srcu_idx;
@@ -949,11 +929,11 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)

if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
rcu_read_lock();
- blk_mq_process_rq_list(hctx);
+ blk_mq_sched_dispatch_requests(hctx);
rcu_read_unlock();
} else {
srcu_idx = srcu_read_lock(&hctx->queue_rq_srcu);
- blk_mq_process_rq_list(hctx);
+ blk_mq_sched_dispatch_requests(hctx);
srcu_read_unlock(&hctx->queue_rq_srcu, srcu_idx);
}
}
@@ -1147,32 +1127,10 @@ void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
blk_mq_hctx_mark_pending(hctx, ctx);
}

-void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
- bool async)
-{
- struct blk_mq_ctx *ctx = rq->mq_ctx;
- struct request_queue *q = rq->q;
- struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
-
- spin_lock(&ctx->lock);
- __blk_mq_insert_request(hctx, rq, at_head);
- spin_unlock(&ctx->lock);
-
- if (run_queue)
- blk_mq_run_hw_queue(hctx, async);
-}
-
-static void blk_mq_insert_requests(struct request_queue *q,
- struct blk_mq_ctx *ctx,
- struct list_head *list,
- int depth,
- bool from_schedule)
+void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
+ struct list_head *list)

{
- struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
-
- trace_block_unplug(q, depth, !from_schedule);
-
/*
* preemption doesn't flush plug list, so it's possible ctx->cpu is
* offline now
@@ -1188,8 +1146,6 @@ static void blk_mq_insert_requests(struct request_queue *q,
}
blk_mq_hctx_mark_pending(hctx, ctx);
spin_unlock(&ctx->lock);
-
- blk_mq_run_hw_queue(hctx, from_schedule);
}

static int plug_ctx_cmp(void *priv, struct list_head *a, struct list_head *b)
@@ -1225,9 +1181,10 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
BUG_ON(!rq->q);
if (rq->mq_ctx != this_ctx) {
if (this_ctx) {
- blk_mq_insert_requests(this_q, this_ctx,
- &ctx_list, depth,
- from_schedule);
+ trace_block_unplug(this_q, depth, from_schedule);
+ blk_mq_sched_insert_requests(this_q, this_ctx,
+ &ctx_list,
+ from_schedule);
}

this_ctx = rq->mq_ctx;
@@ -1244,8 +1201,9 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
* on 'ctx_list'. Do those.
*/
if (this_ctx) {
- blk_mq_insert_requests(this_q, this_ctx, &ctx_list, depth,
- from_schedule);
+ trace_block_unplug(this_q, depth, from_schedule);
+ blk_mq_sched_insert_requests(this_q, this_ctx, &ctx_list,
+ from_schedule);
}
}

@@ -1283,46 +1241,32 @@ static inline bool blk_mq_merge_queue_io(struct blk_mq_hw_ctx *hctx,
}

spin_unlock(&ctx->lock);
- __blk_mq_free_request(hctx, ctx, rq);
+ __blk_mq_finish_request(hctx, ctx, rq);
return true;
}
}

-static struct request *blk_mq_map_request(struct request_queue *q,
- struct bio *bio,
- struct blk_mq_alloc_data *data)
-{
- struct blk_mq_hw_ctx *hctx;
- struct blk_mq_ctx *ctx;
- struct request *rq;
-
- blk_queue_enter_live(q);
- ctx = blk_mq_get_ctx(q);
- hctx = blk_mq_map_queue(q, ctx->cpu);
-
- trace_block_getrq(q, bio, bio->bi_opf);
- blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
- rq = __blk_mq_alloc_request(data, bio->bi_opf);
-
- data->hctx->queued++;
- return rq;
-}
-
static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
{
- int ret;
struct request_queue *q = rq->q;
- struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
struct blk_mq_queue_data bd = {
.rq = rq,
.list = NULL,
.last = 1
};
- blk_qc_t new_cookie = blk_tag_to_qc_t(rq->tag, hctx->queue_num);
+ struct blk_mq_hw_ctx *hctx;
+ blk_qc_t new_cookie;
+ int ret;
+
+ if (q->elevator)
+ goto insert;

+ hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
if (blk_mq_hctx_stopped(hctx))
goto insert;

+ new_cookie = blk_tag_to_qc_t(rq->tag, hctx->queue_num);
+
/*
* For OK queue, we are done. For error, kill it. Any other
* error (busy), just add it to our list as we previously
@@ -1344,7 +1288,7 @@ static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
}

insert:
- blk_mq_insert_request(rq, false, true, true);
+ blk_mq_sched_insert_request(rq, false, true, true);
}

/*
@@ -1377,9 +1321,14 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
blk_attempt_plug_merge(q, bio, &request_count, &same_queue_rq))
return BLK_QC_T_NONE;

+ if (blk_mq_sched_bio_merge(q, bio))
+ return BLK_QC_T_NONE;
+
wb_acct = wbt_wait(q->rq_wb, bio, NULL);

- rq = blk_mq_map_request(q, bio, &data);
+ trace_block_getrq(q, bio, bio->bi_opf);
+
+ rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
if (unlikely(!rq)) {
__wbt_done(q->rq_wb, wb_acct);
return BLK_QC_T_NONE;
@@ -1441,6 +1390,12 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
goto done;
}

+ if (q->elevator) {
+ blk_mq_put_ctx(data.ctx);
+ blk_mq_bio_to_request(rq, bio);
+ blk_mq_sched_insert_request(rq, false, true, true);
+ goto done;
+ }
if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
/*
* For a SYNC request, send it to the hardware immediately. For
@@ -1486,9 +1441,14 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
} else
request_count = blk_plug_queued_count(q);

+ if (blk_mq_sched_bio_merge(q, bio))
+ return BLK_QC_T_NONE;
+
wb_acct = wbt_wait(q->rq_wb, bio, NULL);

- rq = blk_mq_map_request(q, bio, &data);
+ trace_block_getrq(q, bio, bio->bi_opf);
+
+ rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
if (unlikely(!rq)) {
__wbt_done(q->rq_wb, wb_acct);
return BLK_QC_T_NONE;
@@ -1538,6 +1498,12 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
return cookie;
}

+ if (q->elevator) {
+ blk_mq_put_ctx(data.ctx);
+ blk_mq_bio_to_request(rq, bio);
+ blk_mq_sched_insert_request(rq, false, true, true);
+ goto done;
+ }
if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
/*
* For a SYNC request, send it to the hardware immediately. For
@@ -1550,6 +1516,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
}

blk_mq_put_ctx(data.ctx);
+done:
return cookie;
}

@@ -1558,7 +1525,7 @@ void blk_mq_free_rq_map(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
{
struct page *page;

- if (tags->rqs && set->ops->exit_request) {
+ if (tags->rqs && set && set->ops->exit_request) {
int i;

for (i = 0; i < tags->nr_tags; i++) {
diff --git a/block/blk-mq.h b/block/blk-mq.h
index e59f5ca520a2..898c3c9a60ec 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -47,7 +47,8 @@ struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
*/
void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
bool at_head);
-
+void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
+ struct list_head *list);
/*
* CPU hotplug helpers
*/
@@ -123,8 +124,9 @@ static inline void blk_mq_set_alloc_data(struct blk_mq_alloc_data *data,
*/
void blk_mq_rq_ctx_init(struct request_queue *q, struct blk_mq_ctx *ctx,
struct request *rq, unsigned int op);
-void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
+void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
struct request *rq);
+void blk_mq_finish_request(struct request *rq);
struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data,
unsigned int op);

diff --git a/block/elevator.c b/block/elevator.c
index 022a26830297..e6b523360231 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -40,6 +40,7 @@
#include <trace/events/block.h>

#include "blk.h"
+#include "blk-mq-sched.h"

static DEFINE_SPINLOCK(elv_list_lock);
static LIST_HEAD(elv_list);
@@ -58,7 +59,9 @@ static int elv_iosched_allow_bio_merge(struct request *rq, struct bio *bio)
struct request_queue *q = rq->q;
struct elevator_queue *e = q->elevator;

- if (e->type->ops.sq.elevator_allow_bio_merge_fn)
+ if (e->uses_mq && e->type->ops.mq.allow_merge)
+ return e->type->ops.mq.allow_merge(q, rq, bio);
+ else if (!e->uses_mq && e->type->ops.sq.elevator_allow_bio_merge_fn)
return e->type->ops.sq.elevator_allow_bio_merge_fn(q, rq, bio);

return 1;
@@ -163,6 +166,7 @@ struct elevator_queue *elevator_alloc(struct request_queue *q,
kobject_init(&eq->kobj, &elv_ktype);
mutex_init(&eq->sysfs_lock);
hash_init(eq->hash);
+ eq->uses_mq = e->uses_mq;

return eq;
}
@@ -219,12 +223,19 @@ int elevator_init(struct request_queue *q, char *name)
if (!e) {
printk(KERN_ERR
"Default I/O scheduler not found. " \
- "Using noop.\n");
+ "Using noop/none.\n");
+ if (q->mq_ops) {
+ elevator_put(e);
+ return 0;
+ }
e = elevator_get("noop", false);
}
}

- err = e->ops.sq.elevator_init_fn(q, e);
+ if (e->uses_mq)
+ err = e->ops.mq.init_sched(q, e);
+ else
+ err = e->ops.sq.elevator_init_fn(q, e);
if (err)
elevator_put(e);
return err;
@@ -234,7 +245,9 @@ EXPORT_SYMBOL(elevator_init);
void elevator_exit(struct elevator_queue *e)
{
mutex_lock(&e->sysfs_lock);
- if (e->type->ops.sq.elevator_exit_fn)
+ if (e->uses_mq && e->type->ops.mq.exit_sched)
+ e->type->ops.mq.exit_sched(e);
+ else if (!e->uses_mq && e->type->ops.sq.elevator_exit_fn)
e->type->ops.sq.elevator_exit_fn(e);
mutex_unlock(&e->sysfs_lock);

@@ -253,6 +266,7 @@ void elv_rqhash_del(struct request_queue *q, struct request *rq)
if (ELV_ON_HASH(rq))
__elv_rqhash_del(rq);
}
+EXPORT_SYMBOL_GPL(elv_rqhash_del);

void elv_rqhash_add(struct request_queue *q, struct request *rq)
{
@@ -262,6 +276,7 @@ void elv_rqhash_add(struct request_queue *q, struct request *rq)
hash_add(e->hash, &rq->hash, rq_hash_key(rq));
rq->rq_flags |= RQF_HASHED;
}
+EXPORT_SYMBOL_GPL(elv_rqhash_add);

void elv_rqhash_reposition(struct request_queue *q, struct request *rq)
{
@@ -443,7 +458,9 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
return ELEVATOR_BACK_MERGE;
}

- if (e->type->ops.sq.elevator_merge_fn)
+ if (e->uses_mq && e->type->ops.mq.request_merge)
+ return e->type->ops.mq.request_merge(q, req, bio);
+ else if (!e->uses_mq && e->type->ops.sq.elevator_merge_fn)
return e->type->ops.sq.elevator_merge_fn(q, req, bio);

return ELEVATOR_NO_MERGE;
@@ -456,8 +473,7 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
*
* Returns true if we merged, false otherwise
*/
-static bool elv_attempt_insert_merge(struct request_queue *q,
- struct request *rq)
+bool elv_attempt_insert_merge(struct request_queue *q, struct request *rq)
{
struct request *__rq;
bool ret;
@@ -495,7 +511,9 @@ void elv_merged_request(struct request_queue *q, struct request *rq, int type)
{
struct elevator_queue *e = q->elevator;

- if (e->type->ops.sq.elevator_merged_fn)
+ if (e->uses_mq && e->type->ops.mq.request_merged)
+ e->type->ops.mq.request_merged(q, rq, type);
+ else if (!e->uses_mq && e->type->ops.sq.elevator_merged_fn)
e->type->ops.sq.elevator_merged_fn(q, rq, type);

if (type == ELEVATOR_BACK_MERGE)
@@ -508,10 +526,15 @@ void elv_merge_requests(struct request_queue *q, struct request *rq,
struct request *next)
{
struct elevator_queue *e = q->elevator;
- const int next_sorted = next->rq_flags & RQF_SORTED;
-
- if (next_sorted && e->type->ops.sq.elevator_merge_req_fn)
- e->type->ops.sq.elevator_merge_req_fn(q, rq, next);
+ bool next_sorted = false;
+
+ if (e->uses_mq && e->type->ops.mq.requests_merged)
+ e->type->ops.mq.requests_merged(q, rq, next);
+ else if (e->type->ops.sq.elevator_merge_req_fn) {
+ next_sorted = next->rq_flags & RQF_SORTED;
+ if (next_sorted)
+ e->type->ops.sq.elevator_merge_req_fn(q, rq, next);
+ }

elv_rqhash_reposition(q, rq);

@@ -528,6 +551,9 @@ void elv_bio_merged(struct request_queue *q, struct request *rq,
{
struct elevator_queue *e = q->elevator;

+ if (WARN_ON_ONCE(e->uses_mq))
+ return;
+
if (e->type->ops.sq.elevator_bio_merged_fn)
e->type->ops.sq.elevator_bio_merged_fn(q, rq, bio);
}
@@ -682,8 +708,11 @@ struct request *elv_latter_request(struct request_queue *q, struct request *rq)
{
struct elevator_queue *e = q->elevator;

- if (e->type->ops.sq.elevator_latter_req_fn)
+ if (e->uses_mq && e->type->ops.mq.next_request)
+ return e->type->ops.mq.next_request(q, rq);
+ else if (!e->uses_mq && e->type->ops.sq.elevator_latter_req_fn)
return e->type->ops.sq.elevator_latter_req_fn(q, rq);
+
return NULL;
}

@@ -691,7 +720,9 @@ struct request *elv_former_request(struct request_queue *q, struct request *rq)
{
struct elevator_queue *e = q->elevator;

- if (e->type->ops.sq.elevator_former_req_fn)
+ if (e->uses_mq && e->type->ops.mq.former_request)
+ return e->type->ops.mq.former_request(q, rq);
+ if (!e->uses_mq && e->type->ops.sq.elevator_former_req_fn)
return e->type->ops.sq.elevator_former_req_fn(q, rq);
return NULL;
}
@@ -701,6 +732,9 @@ int elv_set_request(struct request_queue *q, struct request *rq,
{
struct elevator_queue *e = q->elevator;

+ if (WARN_ON_ONCE(e->uses_mq))
+ return 0;
+
if (e->type->ops.sq.elevator_set_req_fn)
return e->type->ops.sq.elevator_set_req_fn(q, rq, bio, gfp_mask);
return 0;
@@ -710,6 +744,9 @@ void elv_put_request(struct request_queue *q, struct request *rq)
{
struct elevator_queue *e = q->elevator;

+ if (WARN_ON_ONCE(e->uses_mq))
+ return;
+
if (e->type->ops.sq.elevator_put_req_fn)
e->type->ops.sq.elevator_put_req_fn(rq);
}
@@ -718,6 +755,9 @@ int elv_may_queue(struct request_queue *q, unsigned int op)
{
struct elevator_queue *e = q->elevator;

+ if (WARN_ON_ONCE(e->uses_mq))
+ return 0;
+
if (e->type->ops.sq.elevator_may_queue_fn)
return e->type->ops.sq.elevator_may_queue_fn(q, op);

@@ -728,6 +768,9 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
{
struct elevator_queue *e = q->elevator;

+ if (WARN_ON_ONCE(e->uses_mq))
+ return;
+
/*
* request is released from the driver, io must be done
*/
@@ -803,7 +846,7 @@ int elv_register_queue(struct request_queue *q)
}
kobject_uevent(&e->kobj, KOBJ_ADD);
e->registered = 1;
- if (e->type->ops.sq.elevator_registered_fn)
+ if (!e->uses_mq && e->type->ops.sq.elevator_registered_fn)
e->type->ops.sq.elevator_registered_fn(q);
}
return error;
@@ -891,9 +934,14 @@ EXPORT_SYMBOL_GPL(elv_unregister);
static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
{
struct elevator_queue *old = q->elevator;
- bool registered = old->registered;
+ bool old_registered = false;
int err;

+ if (q->mq_ops) {
+ blk_mq_freeze_queue(q);
+ blk_mq_quiesce_queue(q);
+ }
+
/*
* Turn on BYPASS and drain all requests w/ elevator private data.
* Block layer doesn't call into a quiesced elevator - all requests
@@ -901,32 +949,52 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
* using INSERT_BACK. All requests have SOFTBARRIER set and no
* merge happens either.
*/
- blk_queue_bypass_start(q);
+ if (old) {
+ old_registered = old->registered;

- /* unregister and clear all auxiliary data of the old elevator */
- if (registered)
- elv_unregister_queue(q);
+ if (!q->mq_ops)
+ blk_queue_bypass_start(q);

- spin_lock_irq(q->queue_lock);
- ioc_clear_queue(q);
- spin_unlock_irq(q->queue_lock);
+ /* unregister and clear all auxiliary data of the old elevator */
+ if (old_registered)
+ elv_unregister_queue(q);
+
+ spin_lock_irq(q->queue_lock);
+ ioc_clear_queue(q);
+ spin_unlock_irq(q->queue_lock);
+ }

/* allocate, init and register new elevator */
- err = new_e->ops.sq.elevator_init_fn(q, new_e);
- if (err)
- goto fail_init;
+ if (new_e) {
+ if (new_e->uses_mq)
+ err = new_e->ops.mq.init_sched(q, new_e);
+ else
+ err = new_e->ops.sq.elevator_init_fn(q, new_e);
+ if (err)
+ goto fail_init;

- if (registered) {
err = elv_register_queue(q);
if (err)
goto fail_register;
- }
+ } else
+ q->elevator = NULL;

/* done, kill the old one and finish */
- elevator_exit(old);
- blk_queue_bypass_end(q);
+ if (old) {
+ elevator_exit(old);
+ if (!q->mq_ops)
+ blk_queue_bypass_end(q);
+ }
+
+ if (q->mq_ops) {
+ blk_mq_unfreeze_queue(q);
+ blk_mq_start_stopped_hw_queues(q, true);
+ }

- blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
+ if (new_e)
+ blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
+ else
+ blk_add_trace_msg(q, "elv switch: none");

return 0;

@@ -934,9 +1002,16 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
elevator_exit(q->elevator);
fail_init:
/* switch failed, restore and re-register old elevator */
- q->elevator = old;
- elv_register_queue(q);
- blk_queue_bypass_end(q);
+ if (old) {
+ q->elevator = old;
+ elv_register_queue(q);
+ if (!q->mq_ops)
+ blk_queue_bypass_end(q);
+ }
+ if (q->mq_ops) {
+ blk_mq_unfreeze_queue(q);
+ blk_mq_start_stopped_hw_queues(q, true);
+ }

return err;
}
@@ -949,8 +1024,11 @@ static int __elevator_change(struct request_queue *q, const char *name)
char elevator_name[ELV_NAME_MAX];
struct elevator_type *e;

- if (!q->elevator)
- return -ENXIO;
+ /*
+ * Special case for mq, turn off scheduling
+ */
+ if (q->mq_ops && !strncmp(name, "none", 4))
+ return elevator_switch(q, NULL);

strlcpy(elevator_name, name, sizeof(elevator_name));
e = elevator_get(strstrip(elevator_name), true);
@@ -959,11 +1037,23 @@ static int __elevator_change(struct request_queue *q, const char *name)
return -EINVAL;
}

- if (!strcmp(elevator_name, q->elevator->type->elevator_name)) {
+ if (q->elevator &&
+ !strcmp(elevator_name, q->elevator->type->elevator_name)) {
elevator_put(e);
return 0;
}

+ if (!e->uses_mq && q->mq_ops) {
+ printk(KERN_ERR "blk-mq-sched: elv %s does not support mq\n", elevator_name);
+ elevator_put(e);
+ return -EINVAL;
+ }
+ if (e->uses_mq && !q->mq_ops) {
+ printk(KERN_ERR "blk-mq-sched: elv %s is for mq\n", elevator_name);
+ elevator_put(e);
+ return -EINVAL;
+ }
+
return elevator_switch(q, e);
}

@@ -985,7 +1075,7 @@ ssize_t elv_iosched_store(struct request_queue *q, const char *name,
{
int ret;

- if (!q->elevator)
+ if (!q->mq_ops || q->request_fn)
return count;

ret = __elevator_change(q, name);
@@ -999,24 +1089,34 @@ ssize_t elv_iosched_store(struct request_queue *q, const char *name,
ssize_t elv_iosched_show(struct request_queue *q, char *name)
{
struct elevator_queue *e = q->elevator;
- struct elevator_type *elv;
+ struct elevator_type *elv = NULL;
struct elevator_type *__e;
int len = 0;

- if (!q->elevator || !blk_queue_stackable(q))
+ if (!blk_queue_stackable(q))
return sprintf(name, "none\n");

- elv = e->type;
+ if (!q->elevator)
+ len += sprintf(name+len, "[none] ");
+ else
+ elv = e->type;

spin_lock(&elv_list_lock);
list_for_each_entry(__e, &elv_list, list) {
- if (!strcmp(elv->elevator_name, __e->elevator_name))
+ if (elv && !strcmp(elv->elevator_name, __e->elevator_name)) {
len += sprintf(name+len, "[%s] ", elv->elevator_name);
- else
+ continue;
+ }
+ if (__e->uses_mq && q->mq_ops)
+ len += sprintf(name+len, "%s ", __e->elevator_name);
+ else if (!__e->uses_mq && !q->mq_ops)
len += sprintf(name+len, "%s ", __e->elevator_name);
}
spin_unlock(&elv_list_lock);

+ if (q->mq_ops && q->elevator)
+ len += sprintf(name+len, "none");
+
len += sprintf(len+name, "\n");
return len;
}
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2686f9e7302a..e3159be841ff 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -22,6 +22,7 @@ struct blk_mq_hw_ctx {

unsigned long flags; /* BLK_MQ_F_* flags */

+ void *sched_data;
struct request_queue *queue;
struct blk_flush_queue *fq;

@@ -156,6 +157,7 @@ enum {

BLK_MQ_S_STOPPED = 0,
BLK_MQ_S_TAG_ACTIVE = 1,
+ BLK_MQ_S_SCHED_RESTART = 2,

BLK_MQ_MAX_DEPTH = 10240,

@@ -179,7 +181,6 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set);

void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule);

-void blk_mq_insert_request(struct request *, bool, bool, bool);
void blk_mq_free_request(struct request *rq);
bool blk_mq_can_queue(struct blk_mq_hw_ctx *);

diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 2a9e966eed03..417810b2d2f5 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -77,6 +77,32 @@ struct elevator_ops
elevator_registered_fn *elevator_registered_fn;
};

+struct blk_mq_alloc_data;
+struct blk_mq_hw_ctx;
+
+struct elevator_mq_ops {
+ int (*init_sched)(struct request_queue *, struct elevator_type *);
+ void (*exit_sched)(struct elevator_queue *);
+
+ bool (*allow_merge)(struct request_queue *, struct request *, struct bio *);
+ bool (*bio_merge)(struct blk_mq_hw_ctx *, struct bio *);
+ int (*request_merge)(struct request_queue *q, struct request **, struct bio *);
+ void (*request_merged)(struct request_queue *, struct request *, int);
+ void (*requests_merged)(struct request_queue *, struct request *, struct request *);
+ struct request *(*get_request)(struct request_queue *, unsigned int, struct blk_mq_alloc_data *);
+ bool (*put_request)(struct request *);
+ void (*insert_requests)(struct blk_mq_hw_ctx *, struct list_head *, bool);
+ void (*dispatch_requests)(struct blk_mq_hw_ctx *, struct list_head *);
+ bool (*has_work)(struct blk_mq_hw_ctx *);
+ void (*completed_request)(struct blk_mq_hw_ctx *, struct request *);
+ void (*started_request)(struct request *);
+ void (*requeue_request)(struct request *);
+ struct request *(*former_request)(struct request_queue *, struct request *);
+ struct request *(*next_request)(struct request_queue *, struct request *);
+ int (*get_rq_priv)(struct request_queue *, struct request *);
+ void (*put_rq_priv)(struct request_queue *, struct request *);
+};
+
#define ELV_NAME_MAX (16)

struct elv_fs_entry {
@@ -96,12 +122,14 @@ struct elevator_type
/* fields provided by elevator implementation */
union {
struct elevator_ops sq;
+ struct elevator_mq_ops mq;
} ops;
size_t icq_size; /* see iocontext.h */
size_t icq_align; /* ditto */
struct elv_fs_entry *elevator_attrs;
char elevator_name[ELV_NAME_MAX];
struct module *elevator_owner;
+ bool uses_mq;

/* managed by elevator core */
char icq_cache_name[ELV_NAME_MAX + 5]; /* elvname + "_io_cq" */
@@ -125,6 +153,7 @@ struct elevator_queue
struct kobject kobj;
struct mutex sysfs_lock;
unsigned int registered:1;
+ unsigned int uses_mq:1;
DECLARE_HASHTABLE(hash, ELV_HASH_BITS);
};

@@ -141,6 +170,7 @@ extern void elv_merge_requests(struct request_queue *, struct request *,
extern void elv_merged_request(struct request_queue *, struct request *, int);
extern void elv_bio_merged(struct request_queue *q, struct request *,
struct bio *);
+extern bool elv_attempt_insert_merge(struct request_queue *, struct request *);
extern void elv_requeue_request(struct request_queue *, struct request *);
extern struct request *elv_former_request(struct request_queue *, struct request *);
extern struct request *elv_latter_request(struct request_queue *, struct request *);
--
2.7.4

2016-12-19 11:32:46

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCHSET v4] blk-mq-scheduling framework


> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <[email protected]> ha scritto:
>
> This is version 4 of this patchset, version 3 was posted here:
>
> https://marc.info/?l=linux-block&m=148178513407631&w=2
>
> From the discussion last time, I looked into the feasibility of having
> two sets of tags for the same request pool, to avoid having to copy
> some of the request fields at dispatch and completion time. To do that,
> we'd have to replace the driver tag map(s) with our own, and augment
> that with tag map(s) on the side representing the device queue depth.
> Queuing IO with the scheduler would allocate from the new map, and
> dispatching would acquire the "real" tag. We would need to change
> drivers to do this, or add an extra indirection table to map a real
> tag to the scheduler tag. We would also need a 1:1 mapping between
> scheduler and hardware tag pools, or additional info to track it.
> Unless someone can convince me otherwise, I think the current approach
> is cleaner.
>
> I wasn't going to post v4 so soon, but I discovered a bug that led
> to drastically decreased merging. Especially on rotating storage,
> this release should be fast, and on par with the merging that we
> get through the legacy schedulers.
>

I'm now modifying bfq. You mentioned other missing pieces to come. Do
you already have an idea of what they are, so that I am somehow
prepared for what won't work even if my changes are right?

Thanks,
Paolo


2016-12-19 15:21:54

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCHSET v4] blk-mq-scheduling framework

On 12/19/2016 04:32 AM, Paolo Valente wrote:
>
>> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <[email protected]> ha scritto:
>>
>> [...]
>>
>
> I'm now modifying bfq. You mentioned other missing pieces to come. Do
> you already have an idea of what they are, so that I am somehow
> prepared for what won't work even if my changes are right?

I'm mostly talking about elevator ops hooks that aren't there in the new
framework, but exist in the old one. There should be no hidden
surprises, if that's what you are worried about.

On the ops side, the only ones I can think of are the activate and
deactivate, and those can be done in the dispatch_request hook for
activate, and put/requeue for deactivate.

Outside of that, some of them have been renamed, some have been
collapsed (like activate/deactivate), and yet others work a little
differently (like merging). See the mq-deadline conversion, and just
work through them one at a time.
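
To make the activate/deactivate folding a bit more concrete, here's a
rough sketch of what it could look like in a converted scheduler. This
is illustration only: the foo_* names and the foo_data struct are made
up and not part of this patchset, only the two hook signatures are
taken from the new elevator_mq_ops.

struct foo_data {
	spinlock_t lock;
	/* ... private scheduler state ... */
};

/*
 * foo_next_request(), foo_activate() and foo_deactivate() are
 * hypothetical stand-ins for whatever bookkeeping the legacy
 * activate/deactivate hooks used to do.
 */
static void foo_dispatch_requests(struct blk_mq_hw_ctx *hctx,
				  struct list_head *rq_list)
{
	struct foo_data *fd = hctx->queue->elevator->elevator_data;
	struct request *rq;

	spin_lock(&fd->lock);
	while ((rq = foo_next_request(fd)) != NULL) {
		/* what the old activate hook used to do */
		foo_activate(fd, rq);
		list_add_tail(&rq->queuelist, rq_list);
	}
	spin_unlock(&fd->lock);
}

static void foo_requeue_request(struct request *rq)
{
	struct foo_data *fd = rq->q->elevator->elevator_data;

	spin_lock(&fd->lock);
	/* what the old deactivate hook used to do */
	foo_deactivate(fd, rq);
	spin_unlock(&fd->lock);
}

The same deactivate-style cleanup can also live in ->put_request() for
requests that complete or get freed without ever being requeued.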

--
Jens Axboe

2016-12-19 15:33:46

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCHSET v4] blk-mq-scheduling framework

On 12/19/2016 08:20 AM, Jens Axboe wrote:
> On 12/19/2016 04:32 AM, Paolo Valente wrote:
>>
>>> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <[email protected]> ha scritto:
>>>
>>> [...]
>>>
>>
>> I'm now modifying bfq. You mentioned other missing pieces to come. Do
>> you already have an idea of what they are, so that I am somehow
>> prepared for what won't work even if my changes are right?
>
> I'm mostly talking about elevator ops hooks that aren't there in the new
> framework, but exist in the old one. There should be no hidden
> surprises, if that's what you are worried about.
>
> On the ops side, the only ones I can think of are the activate and
> deactivate, and those can be done in the dispatch_request hook for
> activate, and put/requeue for deactivate.
>
> Outside of that, some of them have been renamed, some have been
> collapsed (like activate/deactivate), and yet others work a little
> differently (like merging). See the mq-deadline conversion, and just
> work through them one at a time.

Some more details...

Outside of the differences outlined above, a major one is that the old
scheduler interfaces invoked almost all of the hooks with the device
queue lock held. That's no longer the case in the new framework; you
have to set up your own lock(s) for what you need. That's a lot saner.
One example is the attempt to merge a bio into an existing request,
which is the ->bio_merge() hook. If you look at mq-deadline, the hook
merely grabs its per-queue lock (dd->lock) and calls a blk-mq-sched
helper to do the merging. That helper, in turn, calls ->request_merge(),
so ->request_merge() runs under the lock that ->bio_merge() grabs.
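
Roughly, the pattern is the below. Treat it as a paraphrased sketch of
the mq-deadline hook rather than a verbatim quote of the patch, so the
details may differ slightly:

static bool dd_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
{
	struct request_queue *q = hctx->queue;
	struct deadline_data *dd = q->elevator->elevator_data;
	bool ret;

	/* dd->lock is the scheduler's own lock, not the queue lock */
	spin_lock(&dd->lock);
	/* blk_mq_sched_try_merge() -> elv_merge() -> ->request_merge() */
	ret = blk_mq_sched_try_merge(q, bio);
	spin_unlock(&dd->lock);

	return ret;
}

So whatever state ->request_merge() touches is serialized by nothing
more than the lock that ->bio_merge() chose to take.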

--
Jens Axboe



2016-12-19 18:21:17

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCHSET v4] blk-mq-scheduling framework


> Il giorno 19 dic 2016, alle ore 16:20, Jens Axboe <[email protected]> ha scritto:
>
> On 12/19/2016 04:32 AM, Paolo Valente wrote:
>>
>>> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <[email protected]> ha scritto:
>>>
>>> [...]
>>>
>>
>> I'm now modifying bfq. You mentioned other missing pieces to come. Do
>> you already have an idea of what they are, so that I am somehow
>> prepared for what won't work even if my changes are right?
>
> I'm mostly talking about elevator ops hooks that aren't there in the new
> framework, but exist in the old one. There should be no hidden
> surprises, if that's what you are worried about.
>
> On the ops side, the only ones I can think of are the activate and
> deactivate, and those can be done in the dispatch_request hook for
> activate, and put/requeue for deactivate.
>

You mean that there is no conceptual problem in moving the code of the
activate interface function into the dispatch function, and the code
of the deactivate into the put_request? (for a requeue it is a little
less clear to me, so one step at a time) Or am I missing
something more complex?

> Outside of that, some of them have been renamed, some have been
> collapsed (like activate/deactivate), and yet others work a little
> differently (like merging). See the mq-deadline conversion, and just
> work through them one at a time.
>

That's how I'm proceeding, thanks.

Thank you,
Paolo

> --
> Jens Axboe
>


2016-12-19 21:05:43

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCHSET v4] blk-mq-scheduling framework

On 12/19/2016 11:21 AM, Paolo Valente wrote:
>
>> Il giorno 19 dic 2016, alle ore 16:20, Jens Axboe <[email protected]> ha scritto:
>>
>> On 12/19/2016 04:32 AM, Paolo Valente wrote:
>>>
>>>> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <[email protected]> ha scritto:
>>>>
>>>> [...]
>>>>
>>>
>>> I'm now modifying bfq. You mentioned other missing pieces to come. Do
>>> you already have an idea of what they are, so that I am somehow
>>> prepared for what won't work even if my changes are right?
>>
>> I'm mostly talking about elevator ops hooks that aren't there in the new
>> framework, but exist in the old one. There should be no hidden
>> surprises, if that's what you are worried about.
>>
>> On the ops side, the only ones I can think of are the activate and
>> deactivate, and those can be done in the dispatch_request hook for
>> activate, and put/requeue for deactivate.
>>
>
> You mean that there is no conceptual problem in moving the code of the
> activate interface function into the dispatch function, and the code
> of the deactivate into the put_request? (for a requeue it is a little
> less clear to me, so one step at a time) Or am I missing
> something more complex?

Yes, what I mean is that there isn't a 1:1 mapping between the old ops
and the new ops. So you'll have to consider the cases.


--
Jens Axboe

2016-12-20 09:35:36

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler


> Il giorno 17 dic 2016, alle ore 01:12, Jens Axboe <[email protected]> ha scritto:
>
> This is basically identical to deadline-iosched, except it registers
> as a MQ capable scheduler. This is still a single queue design.
>
> Signed-off-by: Jens Axboe <[email protected]>
> ...
> +
> +static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
> +{
> + struct deadline_data *dd = hctx->queue->elevator->elevator_data;
> +
> + return !list_empty_careful(&dd->dispatch) ||
> + !list_empty_careful(&dd->fifo_list[0]) ||
> + !list_empty_careful(&dd->fifo_list[1]);

Just a request for clarification: if I'm not mistaken,
list_empty_careful can be used safely only if the only possible other
concurrent access is a delete. Or am I missing something?

If the above constraint does hold, then how are we guaranteed that it
is met? My doubt arises from, e.g., the possible concurrent list_add
from dd_insert_request.
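
For reference, the helper itself just reads the two pointers with no
synchronization of its own (reproduced from memory of
include/linux/list.h, so treat it as approximate):

/*
 * Approximate sketch of the helper; the header's own comment notes it
 * is only safe without locking if the only concurrent activity on the
 * entry is list_del_init().
 */
static inline int list_empty_careful(const struct list_head *head)
{
	struct list_head *next = head->next;

	return (next == head) && (next == head->prev);
}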

Thanks,
Paolo

> +}
> +
> +/*
> + * sysfs parts below
> + */
> +static ssize_t
> +deadline_var_show(int var, char *page)
> +{
> + return sprintf(page, "%d\n", var);
> +}
> +
> +static ssize_t
> +deadline_var_store(int *var, const char *page, size_t count)
> +{
> + char *p = (char *) page;
> +
> + *var = simple_strtol(p, &p, 10);
> + return count;
> +}
> +
> +#define SHOW_FUNCTION(__FUNC, __VAR, __CONV) \
> +static ssize_t __FUNC(struct elevator_queue *e, char *page) \
> +{ \
> + struct deadline_data *dd = e->elevator_data; \
> + int __data = __VAR; \
> + if (__CONV) \
> + __data = jiffies_to_msecs(__data); \
> + return deadline_var_show(__data, (page)); \
> +}
> +SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[READ], 1);
> +SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[WRITE], 1);
> +SHOW_FUNCTION(deadline_writes_starved_show, dd->writes_starved, 0);
> +SHOW_FUNCTION(deadline_front_merges_show, dd->front_merges, 0);
> +SHOW_FUNCTION(deadline_fifo_batch_show, dd->fifo_batch, 0);
> +#undef SHOW_FUNCTION
> +
> +#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \
> +static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count) \
> +{ \
> + struct deadline_data *dd = e->elevator_data; \
> + int __data; \
> + int ret = deadline_var_store(&__data, (page), count); \
> + if (__data < (MIN)) \
> + __data = (MIN); \
> + else if (__data > (MAX)) \
> + __data = (MAX); \
> + if (__CONV) \
> + *(__PTR) = msecs_to_jiffies(__data); \
> + else \
> + *(__PTR) = __data; \
> + return ret; \
> +}
> +STORE_FUNCTION(deadline_read_expire_store, &dd->fifo_expire[READ], 0, INT_MAX, 1);
> +STORE_FUNCTION(deadline_write_expire_store, &dd->fifo_expire[WRITE], 0, INT_MAX, 1);
> +STORE_FUNCTION(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX, 0);
> +STORE_FUNCTION(deadline_front_merges_store, &dd->front_merges, 0, 1, 0);
> +STORE_FUNCTION(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX, 0);
> +#undef STORE_FUNCTION
> +
> +#define DD_ATTR(name) \
> + __ATTR(name, S_IRUGO|S_IWUSR, deadline_##name##_show, \
> + deadline_##name##_store)
> +
> +static struct elv_fs_entry deadline_attrs[] = {
> + DD_ATTR(read_expire),
> + DD_ATTR(write_expire),
> + DD_ATTR(writes_starved),
> + DD_ATTR(front_merges),
> + DD_ATTR(fifo_batch),
> + __ATTR_NULL
> +};
> +
> +static struct elevator_type mq_deadline = {
> + .ops.mq = {
> + .get_request = dd_get_request,
> + .put_request = dd_put_request,
> + .insert_requests = dd_insert_requests,
> + .dispatch_requests = dd_dispatch_requests,
> + .completed_request = dd_completed_request,
> + .next_request = elv_rb_latter_request,
> + .former_request = elv_rb_former_request,
> + .bio_merge = dd_bio_merge,
> + .request_merge = dd_request_merge,
> + .requests_merged = dd_merged_requests,
> + .request_merged = dd_request_merged,
> + .has_work = dd_has_work,
> + .init_sched = dd_init_queue,
> + .exit_sched = dd_exit_queue,
> + },
> +
> + .uses_mq = true,
> + .elevator_attrs = deadline_attrs,
> + .elevator_name = "mq-deadline",
> + .elevator_owner = THIS_MODULE,
> +};
> +
> +static int __init deadline_init(void)
> +{
> + if (!queue_depth) {
> + pr_err("mq-deadline: queue depth must be > 0\n");
> + return -EINVAL;
> + }
> + return elv_register(&mq_deadline);
> +}
> +
> +static void __exit deadline_exit(void)
> +{
> + elv_unregister(&mq_deadline);
> +}
> +
> +module_init(deadline_init);
> +module_exit(deadline_exit);
> +
> +MODULE_AUTHOR("Jens Axboe");
> +MODULE_LICENSE("GPL");
> +MODULE_DESCRIPTION("MQ deadline IO scheduler");
> --
> 2.7.4
>


2016-12-20 10:12:08

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH 3/8] block: move rq_ioc() to blk.h


> On 17 Dec 2016, at 01:12, Jens Axboe <[email protected]> wrote:
>
> We want to use it outside of blk-core.c.
>

Hi Jens,
no hooks equivalent to elevator_init_icq_fn and elevator_exit_icq_fn
are invoked. In particular, the second hook lets bfq (as with cfq)
know when it can finally exit the queue associated with the icq.
I'm trying to figure out how to do without these hooks/signals,
but to no avail so far ...

Thanks,
Paolo

> Signed-off-by: Jens Axboe <[email protected]>
> ---
> block/blk-core.c | 16 ----------------
> block/blk.h | 16 ++++++++++++++++
> 2 files changed, 16 insertions(+), 16 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 61ba08c58b64..92baea07acbc 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1040,22 +1040,6 @@ static bool blk_rq_should_init_elevator(struct bio *bio)
> }
>
> /**
> - * rq_ioc - determine io_context for request allocation
> - * @bio: request being allocated is for this bio (can be %NULL)
> - *
> - * Determine io_context to use for request allocation for @bio. May return
> - * %NULL if %current->io_context doesn't exist.
> - */
> -static struct io_context *rq_ioc(struct bio *bio)
> -{
> -#ifdef CONFIG_BLK_CGROUP
> - if (bio && bio->bi_ioc)
> - return bio->bi_ioc;
> -#endif
> - return current->io_context;
> -}
> -
> -/**
> * __get_request - get a free request
> * @rl: request list to allocate from
> * @op: operation and flags
> diff --git a/block/blk.h b/block/blk.h
> index f46c0ac8ae3d..9a716b5925a4 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -264,6 +264,22 @@ void ioc_clear_queue(struct request_queue *q);
> int create_task_io_context(struct task_struct *task, gfp_t gfp_mask, int node);
>
> /**
> + * rq_ioc - determine io_context for request allocation
> + * @bio: request being allocated is for this bio (can be %NULL)
> + *
> + * Determine io_context to use for request allocation for @bio. May return
> + * %NULL if %current->io_context doesn't exist.
> + */
> +static inline struct io_context *rq_ioc(struct bio *bio)
> +{
> +#ifdef CONFIG_BLK_CGROUP
> + if (bio && bio->bi_ioc)
> + return bio->bi_ioc;
> +#endif
> + return current->io_context;
> +}
> +
> +/**
> * create_io_context - try to create task->io_context
> * @gfp_mask: allocation mask
> * @node: allocation node
> --
> 2.7.4
>


2016-12-20 11:55:16

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH 6/8] blk-mq-sched: add framework for MQ capable IO schedulers


> On 17 Dec 2016, at 01:12, Jens Axboe <[email protected]> wrote:
>
> This adds a set of hooks that intercepts the blk-mq path of
> allocating/inserting/issuing/completing requests, allowing
> us to develop a scheduler within that framework.
>
> We reuse the existing elevator scheduler API on the registration
> side, but augment that with the scheduler flagging support for
> the blk-mq interface, and with a separate set of ops hooks for MQ
> devices.
>
> Schedulers can opt in to using shadow requests. Shadow requests
> are internal requests that the scheduler uses for the allocate
> and insert part, which are then mapped to a real driver request
> at dispatch time. This is needed to separate the device queue depth
> from the pool of requests that the scheduler has to work with.
>
> Signed-off-by: Jens Axboe <[email protected]>

> ...
>
> +struct request *blk_mq_sched_get_request(struct request_queue *q,
> + struct bio *bio,
> + unsigned int op,
> + struct blk_mq_alloc_data *data)
> +{
> + struct elevator_queue *e = q->elevator;
> + struct blk_mq_hw_ctx *hctx;
> + struct blk_mq_ctx *ctx;
> + struct request *rq;
> +
> + blk_queue_enter_live(q);
> + ctx = blk_mq_get_ctx(q);
> + hctx = blk_mq_map_queue(q, ctx->cpu);
> +
> + blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
> +
> + if (e && e->type->ops.mq.get_request)
> + rq = e->type->ops.mq.get_request(q, op, data);

bio is not passed to the scheduler here. Yet bfq uses bio to get the
blkcg (invoking bio_blkcg). I'm not finding any workaround.
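
A minimal sketch of one way to close this gap (not part of the posted
patch, just an assumption about how the hook could be extended) would be
to thread the bio through the mq get_request signature:

        /* hypothetical signature change in struct elevator_mq_ops */
        struct request *(*get_request)(struct request_queue *, struct bio *,
                        unsigned int, struct blk_mq_alloc_data *);

        /* matching change to the call site above */
        if (e && e->type->ops.mq.get_request)
                rq = e->type->ops.mq.get_request(q, bio, op, data);

A scheduler could then call bio_blkcg(bio) (with bio possibly NULL for
internal allocations) from its own get_request implementation.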

> + else
> + rq = __blk_mq_alloc_request(data, op);
> +
> + if (rq) {
> + rq->elv.icq = NULL;
> + if (e && e->type->icq_cache)
> + blk_mq_sched_assign_ioc(q, rq, bio);

bfq needs rq->elv.icq to be consistent in bfq_get_request, but the
needed initialization seems to occur only after mq.get_request is
invoked.

Note: to minimize latency, I'm reporting immediately each problem that
apparently cannot be solved by just modifying bfq. But, if the
resulting higher number of micro-emails is annoying for you, I can
buffer my questions, and send you cumulative emails less frequently.

Thanks,
Paolo

> + data->hctx->queued++;
> + return rq;
> + }
> +
> + blk_queue_exit(q);
> + return NULL;
> +}
> +
> +void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> +{
> + struct elevator_queue *e = hctx->queue->elevator;
> + LIST_HEAD(rq_list);
> +
> + if (unlikely(blk_mq_hctx_stopped(hctx)))
> + return;
> +
> + hctx->run++;
> +
> + /*
> + * If we have previous entries on our dispatch list, grab them first for
> + * more fair dispatch.
> + */
> + if (!list_empty_careful(&hctx->dispatch)) {
> + spin_lock(&hctx->lock);
> + if (!list_empty(&hctx->dispatch))
> + list_splice_init(&hctx->dispatch, &rq_list);
> + spin_unlock(&hctx->lock);
> + }
> +
> + /*
> + * Only ask the scheduler for requests, if we didn't have residual
> + * requests from the dispatch list. This is to avoid the case where
> + * we only ever dispatch a fraction of the requests available because
> + * of low device queue depth. Once we pull requests out of the IO
> + * scheduler, we can no longer merge or sort them. So it's best to
> + * leave them there for as long as we can. Mark the hw queue as
> + * needing a restart in that case.
> + */
> + if (list_empty(&rq_list)) {
> + if (e && e->type->ops.mq.dispatch_requests)
> + e->type->ops.mq.dispatch_requests(hctx, &rq_list);
> + else
> + blk_mq_flush_busy_ctxs(hctx, &rq_list);
> + } else if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
> + set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
> +
> + blk_mq_dispatch_rq_list(hctx, &rq_list);
> +}
> +
> +bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio)
> +{
> + struct request *rq;
> + int ret;
> +
> + ret = elv_merge(q, &rq, bio);
> + if (ret == ELEVATOR_BACK_MERGE) {
> + if (bio_attempt_back_merge(q, rq, bio)) {
> + if (!attempt_back_merge(q, rq))
> + elv_merged_request(q, rq, ret);
> + return true;
> + }
> + } else if (ret == ELEVATOR_FRONT_MERGE) {
> + if (bio_attempt_front_merge(q, rq, bio)) {
> + if (!attempt_front_merge(q, rq))
> + elv_merged_request(q, rq, ret);
> + return true;
> + }
> + }
> +
> + return false;
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_sched_try_merge);
> +
> +bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
> +{
> + struct elevator_queue *e = q->elevator;
> +
> + if (e->type->ops.mq.bio_merge) {
> + struct blk_mq_ctx *ctx = blk_mq_get_ctx(q);
> + struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
> +
> + blk_mq_put_ctx(ctx);
> + return e->type->ops.mq.bio_merge(hctx, bio);
> + }
> +
> + return false;
> +}
> +
> +bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq)
> +{
> + return rq_mergeable(rq) && elv_attempt_insert_merge(q, rq);
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_sched_try_insert_merge);
> +
> +void blk_mq_sched_request_inserted(struct request *rq)
> +{
> + trace_block_rq_insert(rq->q, rq);
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_sched_request_inserted);
> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> new file mode 100644
> index 000000000000..1d1a4e9ce6ca
> --- /dev/null
> +++ b/block/blk-mq-sched.h
> @@ -0,0 +1,209 @@
> +#ifndef BLK_MQ_SCHED_H
> +#define BLK_MQ_SCHED_H
> +
> +#include "blk-mq.h"
> +#include "blk-wbt.h"
> +
> +struct blk_mq_tags *blk_mq_sched_alloc_requests(unsigned int depth, unsigned int numa_node);
> +void blk_mq_sched_free_requests(struct blk_mq_tags *tags);
> +
> +int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
> + int (*init)(struct blk_mq_hw_ctx *),
> + void (*exit)(struct blk_mq_hw_ctx *));
> +
> +void blk_mq_sched_free_hctx_data(struct request_queue *q,
> + void (*exit)(struct blk_mq_hw_ctx *));
> +
> +void blk_mq_sched_free_shadow_request(struct blk_mq_tags *tags,
> + struct request *rq);
> +struct request *blk_mq_sched_alloc_shadow_request(struct request_queue *q,
> + struct blk_mq_alloc_data *data,
> + struct blk_mq_tags *tags,
> + atomic_t *wait_index);
> +struct request *
> +blk_mq_sched_request_from_shadow(struct blk_mq_hw_ctx *hctx,
> + struct request *(*get_sched_rq)(struct blk_mq_hw_ctx *));
> +struct request *
> +__blk_mq_sched_request_from_shadow(struct blk_mq_hw_ctx *hctx,
> + struct request *sched_rq);
> +
> +struct request *blk_mq_sched_get_request(struct request_queue *q, struct bio *bio, unsigned int op, struct blk_mq_alloc_data *data);
> +
> +void __blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
> +void blk_mq_sched_request_inserted(struct request *rq);
> +bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio);
> +bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio);
> +bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq);
> +
> +void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
> +
> +static inline bool
> +blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
> +{
> + struct elevator_queue *e = q->elevator;
> +
> + if (!e || blk_queue_nomerges(q) || !bio_mergeable(bio))
> + return false;
> +
> + return __blk_mq_sched_bio_merge(q, bio);
> +}
> +
> +static inline int blk_mq_sched_get_rq_priv(struct request_queue *q,
> + struct request *rq)
> +{
> + struct elevator_queue *e = q->elevator;
> +
> + if (e && e->type->ops.mq.get_rq_priv)
> + return e->type->ops.mq.get_rq_priv(q, rq);
> +
> + return 0;
> +}
> +
> +static inline void blk_mq_sched_put_rq_priv(struct request_queue *q,
> + struct request *rq)
> +{
> + struct elevator_queue *e = q->elevator;
> +
> + if (e && e->type->ops.mq.put_rq_priv)
> + e->type->ops.mq.put_rq_priv(q, rq);
> +}
> +
> +static inline void blk_mq_sched_put_request(struct request *rq)
> +{
> + struct request_queue *q = rq->q;
> + struct elevator_queue *e = q->elevator;
> + bool do_free = true;
> +
> + wbt_done(q->rq_wb, &rq->issue_stat);
> +
> + if (rq->rq_flags & RQF_ELVPRIV) {
> + blk_mq_sched_put_rq_priv(rq->q, rq);
> + if (rq->elv.icq) {
> + put_io_context(rq->elv.icq->ioc);
> + rq->elv.icq = NULL;
> + }
> + }
> +
> + if (e && e->type->ops.mq.put_request)
> + do_free = !e->type->ops.mq.put_request(rq);
> + if (do_free)
> + blk_mq_finish_request(rq);
> +}
> +
> +static inline void
> +blk_mq_sched_insert_request(struct request *rq, bool at_head, bool run_queue,
> + bool async)
> +{
> + struct request_queue *q = rq->q;
> + struct elevator_queue *e = q->elevator;
> + struct blk_mq_ctx *ctx = rq->mq_ctx;
> + struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
> +
> + if (e && e->type->ops.mq.insert_requests) {
> + LIST_HEAD(list);
> +
> + list_add(&rq->queuelist, &list);
> + e->type->ops.mq.insert_requests(hctx, &list, at_head);
> + } else {
> + spin_lock(&ctx->lock);
> + __blk_mq_insert_request(hctx, rq, at_head);
> + spin_unlock(&ctx->lock);
> + }
> +
> + if (run_queue)
> + blk_mq_run_hw_queue(hctx, async);
> +}
> +
> +static inline void
> +blk_mq_sched_insert_requests(struct request_queue *q, struct blk_mq_ctx *ctx,
> + struct list_head *list, bool run_queue_async)
> +{
> + struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
> + struct elevator_queue *e = hctx->queue->elevator;
> +
> + if (e && e->type->ops.mq.insert_requests)
> + e->type->ops.mq.insert_requests(hctx, list, false);
> + else
> + blk_mq_insert_requests(hctx, ctx, list);
> +
> + blk_mq_run_hw_queue(hctx, run_queue_async);
> +}
> +
> +static inline void
> +blk_mq_sched_dispatch_shadow_requests(struct blk_mq_hw_ctx *hctx,
> + struct list_head *rq_list,
> + struct request *(*get_sched_rq)(struct blk_mq_hw_ctx *))
> +{
> + do {
> + struct request *rq;
> +
> + rq = blk_mq_sched_request_from_shadow(hctx, get_sched_rq);
> + if (!rq)
> + break;
> +
> + list_add_tail(&rq->queuelist, rq_list);
> + } while (1);
> +}
> +
> +static inline bool
> +blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq,
> + struct bio *bio)
> +{
> + struct elevator_queue *e = q->elevator;
> +
> + if (e && e->type->ops.mq.allow_merge)
> + return e->type->ops.mq.allow_merge(q, rq, bio);
> +
> + return true;
> +}
> +
> +static inline void
> +blk_mq_sched_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
> +{
> + struct elevator_queue *e = hctx->queue->elevator;
> +
> + if (e && e->type->ops.mq.completed_request)
> + e->type->ops.mq.completed_request(hctx, rq);
> +
> + if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
> + clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
> + blk_mq_run_hw_queue(hctx, true);
> + }
> +}
> +
> +static inline void blk_mq_sched_started_request(struct request *rq)
> +{
> + struct request_queue *q = rq->q;
> + struct elevator_queue *e = q->elevator;
> +
> + if (e && e->type->ops.mq.started_request)
> + e->type->ops.mq.started_request(rq);
> +}
> +
> +static inline void blk_mq_sched_requeue_request(struct request *rq)
> +{
> + struct request_queue *q = rq->q;
> + struct elevator_queue *e = q->elevator;
> +
> + if (e && e->type->ops.mq.requeue_request)
> + e->type->ops.mq.requeue_request(rq);
> +}
> +
> +static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
> +{
> + struct elevator_queue *e = hctx->queue->elevator;
> +
> + if (e && e->type->ops.mq.has_work)
> + return e->type->ops.mq.has_work(hctx);
> +
> + return false;
> +}
> +
> +/*
> + * Returns true if this is an internal shadow request
> + */
> +static inline bool blk_mq_sched_rq_is_shadow(struct request *rq)
> +{
> + return (rq->rq_flags & RQF_ALLOCED) != 0;
> +}
> +#endif
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index c3119f527bc1..032dca4a27bf 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -32,6 +32,7 @@
> #include "blk-mq-tag.h"
> #include "blk-stat.h"
> #include "blk-wbt.h"
> +#include "blk-mq-sched.h"
>
> static DEFINE_MUTEX(all_q_mutex);
> static LIST_HEAD(all_q_list);
> @@ -41,7 +42,8 @@ static LIST_HEAD(all_q_list);
> */
> static bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
> {
> - return sbitmap_any_bit_set(&hctx->ctx_map);
> + return sbitmap_any_bit_set(&hctx->ctx_map) ||
> + blk_mq_sched_has_work(hctx);
> }
>
> /*
> @@ -242,26 +244,21 @@ EXPORT_SYMBOL_GPL(__blk_mq_alloc_request);
> struct request *blk_mq_alloc_request(struct request_queue *q, int rw,
> unsigned int flags)
> {
> - struct blk_mq_ctx *ctx;
> - struct blk_mq_hw_ctx *hctx;
> - struct request *rq;
> struct blk_mq_alloc_data alloc_data;
> + struct request *rq;
> int ret;
>
> ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT);
> if (ret)
> return ERR_PTR(ret);
>
> - ctx = blk_mq_get_ctx(q);
> - hctx = blk_mq_map_queue(q, ctx->cpu);
> - blk_mq_set_alloc_data(&alloc_data, q, flags, ctx, hctx);
> - rq = __blk_mq_alloc_request(&alloc_data, rw);
> - blk_mq_put_ctx(ctx);
> + rq = blk_mq_sched_get_request(q, NULL, rw, &alloc_data);
>
> - if (!rq) {
> - blk_queue_exit(q);
> + blk_mq_put_ctx(alloc_data.ctx);
> + blk_queue_exit(q);
> +
> + if (!rq)
> return ERR_PTR(-EWOULDBLOCK);
> - }
>
> rq->__data_len = 0;
> rq->__sector = (sector_t) -1;
> @@ -321,12 +318,14 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q, int rw,
> }
> EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx);
>
> -void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
> - struct request *rq)
> +void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
> + struct request *rq)
> {
> const int tag = rq->tag;
> struct request_queue *q = rq->q;
>
> + blk_mq_sched_completed_request(hctx, rq);
> +
> if (rq->rq_flags & RQF_MQ_INFLIGHT)
> atomic_dec(&hctx->nr_active);
>
> @@ -339,18 +338,23 @@ void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
> blk_queue_exit(q);
> }
>
> -static void blk_mq_free_hctx_request(struct blk_mq_hw_ctx *hctx,
> +static void blk_mq_finish_hctx_request(struct blk_mq_hw_ctx *hctx,
> struct request *rq)
> {
> struct blk_mq_ctx *ctx = rq->mq_ctx;
>
> ctx->rq_completed[rq_is_sync(rq)]++;
> - __blk_mq_free_request(hctx, ctx, rq);
> + __blk_mq_finish_request(hctx, ctx, rq);
> +}
> +
> +void blk_mq_finish_request(struct request *rq)
> +{
> + blk_mq_finish_hctx_request(blk_mq_map_queue(rq->q, rq->mq_ctx->cpu), rq);
> }
>
> void blk_mq_free_request(struct request *rq)
> {
> - blk_mq_free_hctx_request(blk_mq_map_queue(rq->q, rq->mq_ctx->cpu), rq);
> + blk_mq_sched_put_request(rq);
> }
> EXPORT_SYMBOL_GPL(blk_mq_free_request);
>
> @@ -468,6 +472,8 @@ void blk_mq_start_request(struct request *rq)
> {
> struct request_queue *q = rq->q;
>
> + blk_mq_sched_started_request(rq);
> +
> trace_block_rq_issue(q, rq);
>
> rq->resid_len = blk_rq_bytes(rq);
> @@ -516,6 +522,7 @@ static void __blk_mq_requeue_request(struct request *rq)
>
> trace_block_rq_requeue(q, rq);
> wbt_requeue(q->rq_wb, &rq->issue_stat);
> + blk_mq_sched_requeue_request(rq);
>
> if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
> if (q->dma_drain_size && blk_rq_bytes(rq))
> @@ -550,13 +557,13 @@ static void blk_mq_requeue_work(struct work_struct *work)
>
> rq->rq_flags &= ~RQF_SOFTBARRIER;
> list_del_init(&rq->queuelist);
> - blk_mq_insert_request(rq, true, false, false);
> + blk_mq_sched_insert_request(rq, true, false, false);
> }
>
> while (!list_empty(&rq_list)) {
> rq = list_entry(rq_list.next, struct request, queuelist);
> list_del_init(&rq->queuelist);
> - blk_mq_insert_request(rq, false, false, false);
> + blk_mq_sched_insert_request(rq, false, false, false);
> }
>
> blk_mq_run_hw_queues(q, false);
> @@ -762,8 +769,16 @@ static bool blk_mq_attempt_merge(struct request_queue *q,
>
> if (!blk_rq_merge_ok(rq, bio))
> continue;
> + if (!blk_mq_sched_allow_merge(q, rq, bio))
> + break;
>
> el_ret = blk_try_merge(rq, bio);
> + if (el_ret == ELEVATOR_NO_MERGE)
> + continue;
> +
> + if (!blk_mq_sched_allow_merge(q, rq, bio))
> + break;
> +
> if (el_ret == ELEVATOR_BACK_MERGE) {
> if (bio_attempt_back_merge(q, rq, bio)) {
> ctx->rq_merged++;
> @@ -905,41 +920,6 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
> return ret != BLK_MQ_RQ_QUEUE_BUSY;
> }
>
> -/*
> - * Run this hardware queue, pulling any software queues mapped to it in.
> - * Note that this function currently has various problems around ordering
> - * of IO. In particular, we'd like FIFO behaviour on handling existing
> - * items on the hctx->dispatch list. Ignore that for now.
> - */
> -static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
> -{
> - LIST_HEAD(rq_list);
> - LIST_HEAD(driver_list);
> -
> - if (unlikely(blk_mq_hctx_stopped(hctx)))
> - return;
> -
> - hctx->run++;
> -
> - /*
> - * Touch any software queue that has pending entries.
> - */
> - blk_mq_flush_busy_ctxs(hctx, &rq_list);
> -
> - /*
> - * If we have previous entries on our dispatch list, grab them
> - * and stuff them at the front for more fair dispatch.
> - */
> - if (!list_empty_careful(&hctx->dispatch)) {
> - spin_lock(&hctx->lock);
> - if (!list_empty(&hctx->dispatch))
> - list_splice_init(&hctx->dispatch, &rq_list);
> - spin_unlock(&hctx->lock);
> - }
> -
> - blk_mq_dispatch_rq_list(hctx, &rq_list);
> -}
> -
> static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
> {
> int srcu_idx;
> @@ -949,11 +929,11 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
>
> if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
> rcu_read_lock();
> - blk_mq_process_rq_list(hctx);
> + blk_mq_sched_dispatch_requests(hctx);
> rcu_read_unlock();
> } else {
> srcu_idx = srcu_read_lock(&hctx->queue_rq_srcu);
> - blk_mq_process_rq_list(hctx);
> + blk_mq_sched_dispatch_requests(hctx);
> srcu_read_unlock(&hctx->queue_rq_srcu, srcu_idx);
> }
> }
> @@ -1147,32 +1127,10 @@ void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
> blk_mq_hctx_mark_pending(hctx, ctx);
> }
>
> -void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
> - bool async)
> -{
> - struct blk_mq_ctx *ctx = rq->mq_ctx;
> - struct request_queue *q = rq->q;
> - struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
> -
> - spin_lock(&ctx->lock);
> - __blk_mq_insert_request(hctx, rq, at_head);
> - spin_unlock(&ctx->lock);
> -
> - if (run_queue)
> - blk_mq_run_hw_queue(hctx, async);
> -}
> -
> -static void blk_mq_insert_requests(struct request_queue *q,
> - struct blk_mq_ctx *ctx,
> - struct list_head *list,
> - int depth,
> - bool from_schedule)
> +void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
> + struct list_head *list)
>
> {
> - struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
> -
> - trace_block_unplug(q, depth, !from_schedule);
> -
> /*
> * preemption doesn't flush plug list, so it's possible ctx->cpu is
> * offline now
> @@ -1188,8 +1146,6 @@ static void blk_mq_insert_requests(struct request_queue *q,
> }
> blk_mq_hctx_mark_pending(hctx, ctx);
> spin_unlock(&ctx->lock);
> -
> - blk_mq_run_hw_queue(hctx, from_schedule);
> }
>
> static int plug_ctx_cmp(void *priv, struct list_head *a, struct list_head *b)
> @@ -1225,9 +1181,10 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
> BUG_ON(!rq->q);
> if (rq->mq_ctx != this_ctx) {
> if (this_ctx) {
> - blk_mq_insert_requests(this_q, this_ctx,
> - &ctx_list, depth,
> - from_schedule);
> + trace_block_unplug(this_q, depth, from_schedule);
> + blk_mq_sched_insert_requests(this_q, this_ctx,
> + &ctx_list,
> + from_schedule);
> }
>
> this_ctx = rq->mq_ctx;
> @@ -1244,8 +1201,9 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
> * on 'ctx_list'. Do those.
> */
> if (this_ctx) {
> - blk_mq_insert_requests(this_q, this_ctx, &ctx_list, depth,
> - from_schedule);
> + trace_block_unplug(this_q, depth, from_schedule);
> + blk_mq_sched_insert_requests(this_q, this_ctx, &ctx_list,
> + from_schedule);
> }
> }
>
> @@ -1283,46 +1241,32 @@ static inline bool blk_mq_merge_queue_io(struct blk_mq_hw_ctx *hctx,
> }
>
> spin_unlock(&ctx->lock);
> - __blk_mq_free_request(hctx, ctx, rq);
> + __blk_mq_finish_request(hctx, ctx, rq);
> return true;
> }
> }
>
> -static struct request *blk_mq_map_request(struct request_queue *q,
> - struct bio *bio,
> - struct blk_mq_alloc_data *data)
> -{
> - struct blk_mq_hw_ctx *hctx;
> - struct blk_mq_ctx *ctx;
> - struct request *rq;
> -
> - blk_queue_enter_live(q);
> - ctx = blk_mq_get_ctx(q);
> - hctx = blk_mq_map_queue(q, ctx->cpu);
> -
> - trace_block_getrq(q, bio, bio->bi_opf);
> - blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
> - rq = __blk_mq_alloc_request(data, bio->bi_opf);
> -
> - data->hctx->queued++;
> - return rq;
> -}
> -
> static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
> {
> - int ret;
> struct request_queue *q = rq->q;
> - struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
> struct blk_mq_queue_data bd = {
> .rq = rq,
> .list = NULL,
> .last = 1
> };
> - blk_qc_t new_cookie = blk_tag_to_qc_t(rq->tag, hctx->queue_num);
> + struct blk_mq_hw_ctx *hctx;
> + blk_qc_t new_cookie;
> + int ret;
> +
> + if (q->elevator)
> + goto insert;
>
> + hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
> if (blk_mq_hctx_stopped(hctx))
> goto insert;
>
> + new_cookie = blk_tag_to_qc_t(rq->tag, hctx->queue_num);
> +
> /*
> * For OK queue, we are done. For error, kill it. Any other
> * error (busy), just add it to our list as we previously
> @@ -1344,7 +1288,7 @@ static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
> }
>
> insert:
> - blk_mq_insert_request(rq, false, true, true);
> + blk_mq_sched_insert_request(rq, false, true, true);
> }
>
> /*
> @@ -1377,9 +1321,14 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
> blk_attempt_plug_merge(q, bio, &request_count, &same_queue_rq))
> return BLK_QC_T_NONE;
>
> + if (blk_mq_sched_bio_merge(q, bio))
> + return BLK_QC_T_NONE;
> +
> wb_acct = wbt_wait(q->rq_wb, bio, NULL);
>
> - rq = blk_mq_map_request(q, bio, &data);
> + trace_block_getrq(q, bio, bio->bi_opf);
> +
> + rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
> if (unlikely(!rq)) {
> __wbt_done(q->rq_wb, wb_acct);
> return BLK_QC_T_NONE;
> @@ -1441,6 +1390,12 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
> goto done;
> }
>
> + if (q->elevator) {
> + blk_mq_put_ctx(data.ctx);
> + blk_mq_bio_to_request(rq, bio);
> + blk_mq_sched_insert_request(rq, false, true, true);
> + goto done;
> + }
> if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
> /*
> * For a SYNC request, send it to the hardware immediately. For
> @@ -1486,9 +1441,14 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
> } else
> request_count = blk_plug_queued_count(q);
>
> + if (blk_mq_sched_bio_merge(q, bio))
> + return BLK_QC_T_NONE;
> +
> wb_acct = wbt_wait(q->rq_wb, bio, NULL);
>
> - rq = blk_mq_map_request(q, bio, &data);
> + trace_block_getrq(q, bio, bio->bi_opf);
> +
> + rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
> if (unlikely(!rq)) {
> __wbt_done(q->rq_wb, wb_acct);
> return BLK_QC_T_NONE;
> @@ -1538,6 +1498,12 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
> return cookie;
> }
>
> + if (q->elevator) {
> + blk_mq_put_ctx(data.ctx);
> + blk_mq_bio_to_request(rq, bio);
> + blk_mq_sched_insert_request(rq, false, true, true);
> + goto done;
> + }
> if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
> /*
> * For a SYNC request, send it to the hardware immediately. For
> @@ -1550,6 +1516,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
> }
>
> blk_mq_put_ctx(data.ctx);
> +done:
> return cookie;
> }
>
> @@ -1558,7 +1525,7 @@ void blk_mq_free_rq_map(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
> {
> struct page *page;
>
> - if (tags->rqs && set->ops->exit_request) {
> + if (tags->rqs && set && set->ops->exit_request) {
> int i;
>
> for (i = 0; i < tags->nr_tags; i++) {
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index e59f5ca520a2..898c3c9a60ec 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -47,7 +47,8 @@ struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
> */
> void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
> bool at_head);
> -
> +void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
> + struct list_head *list);
> /*
> * CPU hotplug helpers
> */
> @@ -123,8 +124,9 @@ static inline void blk_mq_set_alloc_data(struct blk_mq_alloc_data *data,
> */
> void blk_mq_rq_ctx_init(struct request_queue *q, struct blk_mq_ctx *ctx,
> struct request *rq, unsigned int op);
> -void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
> +void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
> struct request *rq);
> +void blk_mq_finish_request(struct request *rq);
> struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data,
> unsigned int op);
>
> diff --git a/block/elevator.c b/block/elevator.c
> index 022a26830297..e6b523360231 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -40,6 +40,7 @@
> #include <trace/events/block.h>
>
> #include "blk.h"
> +#include "blk-mq-sched.h"
>
> static DEFINE_SPINLOCK(elv_list_lock);
> static LIST_HEAD(elv_list);
> @@ -58,7 +59,9 @@ static int elv_iosched_allow_bio_merge(struct request *rq, struct bio *bio)
> struct request_queue *q = rq->q;
> struct elevator_queue *e = q->elevator;
>
> - if (e->type->ops.sq.elevator_allow_bio_merge_fn)
> + if (e->uses_mq && e->type->ops.mq.allow_merge)
> + return e->type->ops.mq.allow_merge(q, rq, bio);
> + else if (!e->uses_mq && e->type->ops.sq.elevator_allow_bio_merge_fn)
> return e->type->ops.sq.elevator_allow_bio_merge_fn(q, rq, bio);
>
> return 1;
> @@ -163,6 +166,7 @@ struct elevator_queue *elevator_alloc(struct request_queue *q,
> kobject_init(&eq->kobj, &elv_ktype);
> mutex_init(&eq->sysfs_lock);
> hash_init(eq->hash);
> + eq->uses_mq = e->uses_mq;
>
> return eq;
> }
> @@ -219,12 +223,19 @@ int elevator_init(struct request_queue *q, char *name)
> if (!e) {
> printk(KERN_ERR
> "Default I/O scheduler not found. " \
> - "Using noop.\n");
> + "Using noop/none.\n");
> + if (q->mq_ops) {
> + elevator_put(e);
> + return 0;
> + }
> e = elevator_get("noop", false);
> }
> }
>
> - err = e->ops.sq.elevator_init_fn(q, e);
> + if (e->uses_mq)
> + err = e->ops.mq.init_sched(q, e);
> + else
> + err = e->ops.sq.elevator_init_fn(q, e);
> if (err)
> elevator_put(e);
> return err;
> @@ -234,7 +245,9 @@ EXPORT_SYMBOL(elevator_init);
> void elevator_exit(struct elevator_queue *e)
> {
> mutex_lock(&e->sysfs_lock);
> - if (e->type->ops.sq.elevator_exit_fn)
> + if (e->uses_mq && e->type->ops.mq.exit_sched)
> + e->type->ops.mq.exit_sched(e);
> + else if (!e->uses_mq && e->type->ops.sq.elevator_exit_fn)
> e->type->ops.sq.elevator_exit_fn(e);
> mutex_unlock(&e->sysfs_lock);
>
> @@ -253,6 +266,7 @@ void elv_rqhash_del(struct request_queue *q, struct request *rq)
> if (ELV_ON_HASH(rq))
> __elv_rqhash_del(rq);
> }
> +EXPORT_SYMBOL_GPL(elv_rqhash_del);
>
> void elv_rqhash_add(struct request_queue *q, struct request *rq)
> {
> @@ -262,6 +276,7 @@ void elv_rqhash_add(struct request_queue *q, struct request *rq)
> hash_add(e->hash, &rq->hash, rq_hash_key(rq));
> rq->rq_flags |= RQF_HASHED;
> }
> +EXPORT_SYMBOL_GPL(elv_rqhash_add);
>
> void elv_rqhash_reposition(struct request_queue *q, struct request *rq)
> {
> @@ -443,7 +458,9 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
> return ELEVATOR_BACK_MERGE;
> }
>
> - if (e->type->ops.sq.elevator_merge_fn)
> + if (e->uses_mq && e->type->ops.mq.request_merge)
> + return e->type->ops.mq.request_merge(q, req, bio);
> + else if (!e->uses_mq && e->type->ops.sq.elevator_merge_fn)
> return e->type->ops.sq.elevator_merge_fn(q, req, bio);
>
> return ELEVATOR_NO_MERGE;
> @@ -456,8 +473,7 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
> *
> * Returns true if we merged, false otherwise
> */
> -static bool elv_attempt_insert_merge(struct request_queue *q,
> - struct request *rq)
> +bool elv_attempt_insert_merge(struct request_queue *q, struct request *rq)
> {
> struct request *__rq;
> bool ret;
> @@ -495,7 +511,9 @@ void elv_merged_request(struct request_queue *q, struct request *rq, int type)
> {
> struct elevator_queue *e = q->elevator;
>
> - if (e->type->ops.sq.elevator_merged_fn)
> + if (e->uses_mq && e->type->ops.mq.request_merged)
> + e->type->ops.mq.request_merged(q, rq, type);
> + else if (!e->uses_mq && e->type->ops.sq.elevator_merged_fn)
> e->type->ops.sq.elevator_merged_fn(q, rq, type);
>
> if (type == ELEVATOR_BACK_MERGE)
> @@ -508,10 +526,15 @@ void elv_merge_requests(struct request_queue *q, struct request *rq,
> struct request *next)
> {
> struct elevator_queue *e = q->elevator;
> - const int next_sorted = next->rq_flags & RQF_SORTED;
> -
> - if (next_sorted && e->type->ops.sq.elevator_merge_req_fn)
> - e->type->ops.sq.elevator_merge_req_fn(q, rq, next);
> + bool next_sorted = false;
> +
> + if (e->uses_mq && e->type->ops.mq.requests_merged)
> + e->type->ops.mq.requests_merged(q, rq, next);
> + else if (e->type->ops.sq.elevator_merge_req_fn) {
> + next_sorted = next->rq_flags & RQF_SORTED;
> + if (next_sorted)
> + e->type->ops.sq.elevator_merge_req_fn(q, rq, next);
> + }
>
> elv_rqhash_reposition(q, rq);
>
> @@ -528,6 +551,9 @@ void elv_bio_merged(struct request_queue *q, struct request *rq,
> {
> struct elevator_queue *e = q->elevator;
>
> + if (WARN_ON_ONCE(e->uses_mq))
> + return;
> +
> if (e->type->ops.sq.elevator_bio_merged_fn)
> e->type->ops.sq.elevator_bio_merged_fn(q, rq, bio);
> }
> @@ -682,8 +708,11 @@ struct request *elv_latter_request(struct request_queue *q, struct request *rq)
> {
> struct elevator_queue *e = q->elevator;
>
> - if (e->type->ops.sq.elevator_latter_req_fn)
> + if (e->uses_mq && e->type->ops.mq.next_request)
> + return e->type->ops.mq.next_request(q, rq);
> + else if (!e->uses_mq && e->type->ops.sq.elevator_latter_req_fn)
> return e->type->ops.sq.elevator_latter_req_fn(q, rq);
> +
> return NULL;
> }
>
> @@ -691,7 +720,9 @@ struct request *elv_former_request(struct request_queue *q, struct request *rq)
> {
> struct elevator_queue *e = q->elevator;
>
> - if (e->type->ops.sq.elevator_former_req_fn)
> + if (e->uses_mq && e->type->ops.mq.former_request)
> + return e->type->ops.mq.former_request(q, rq);
> + if (!e->uses_mq && e->type->ops.sq.elevator_former_req_fn)
> return e->type->ops.sq.elevator_former_req_fn(q, rq);
> return NULL;
> }
> @@ -701,6 +732,9 @@ int elv_set_request(struct request_queue *q, struct request *rq,
> {
> struct elevator_queue *e = q->elevator;
>
> + if (WARN_ON_ONCE(e->uses_mq))
> + return 0;
> +
> if (e->type->ops.sq.elevator_set_req_fn)
> return e->type->ops.sq.elevator_set_req_fn(q, rq, bio, gfp_mask);
> return 0;
> @@ -710,6 +744,9 @@ void elv_put_request(struct request_queue *q, struct request *rq)
> {
> struct elevator_queue *e = q->elevator;
>
> + if (WARN_ON_ONCE(e->uses_mq))
> + return;
> +
> if (e->type->ops.sq.elevator_put_req_fn)
> e->type->ops.sq.elevator_put_req_fn(rq);
> }
> @@ -718,6 +755,9 @@ int elv_may_queue(struct request_queue *q, unsigned int op)
> {
> struct elevator_queue *e = q->elevator;
>
> + if (WARN_ON_ONCE(e->uses_mq))
> + return 0;
> +
> if (e->type->ops.sq.elevator_may_queue_fn)
> return e->type->ops.sq.elevator_may_queue_fn(q, op);
>
> @@ -728,6 +768,9 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
> {
> struct elevator_queue *e = q->elevator;
>
> + if (WARN_ON_ONCE(e->uses_mq))
> + return;
> +
> /*
> * request is released from the driver, io must be done
> */
> @@ -803,7 +846,7 @@ int elv_register_queue(struct request_queue *q)
> }
> kobject_uevent(&e->kobj, KOBJ_ADD);
> e->registered = 1;
> - if (e->type->ops.sq.elevator_registered_fn)
> + if (!e->uses_mq && e->type->ops.sq.elevator_registered_fn)
> e->type->ops.sq.elevator_registered_fn(q);
> }
> return error;
> @@ -891,9 +934,14 @@ EXPORT_SYMBOL_GPL(elv_unregister);
> static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
> {
> struct elevator_queue *old = q->elevator;
> - bool registered = old->registered;
> + bool old_registered = false;
> int err;
>
> + if (q->mq_ops) {
> + blk_mq_freeze_queue(q);
> + blk_mq_quiesce_queue(q);
> + }
> +
> /*
> * Turn on BYPASS and drain all requests w/ elevator private data.
> * Block layer doesn't call into a quiesced elevator - all requests
> @@ -901,32 +949,52 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
> * using INSERT_BACK. All requests have SOFTBARRIER set and no
> * merge happens either.
> */
> - blk_queue_bypass_start(q);
> + if (old) {
> + old_registered = old->registered;
>
> - /* unregister and clear all auxiliary data of the old elevator */
> - if (registered)
> - elv_unregister_queue(q);
> + if (!q->mq_ops)
> + blk_queue_bypass_start(q);
>
> - spin_lock_irq(q->queue_lock);
> - ioc_clear_queue(q);
> - spin_unlock_irq(q->queue_lock);
> + /* unregister and clear all auxiliary data of the old elevator */
> + if (old_registered)
> + elv_unregister_queue(q);
> +
> + spin_lock_irq(q->queue_lock);
> + ioc_clear_queue(q);
> + spin_unlock_irq(q->queue_lock);
> + }
>
> /* allocate, init and register new elevator */
> - err = new_e->ops.sq.elevator_init_fn(q, new_e);
> - if (err)
> - goto fail_init;
> + if (new_e) {
> + if (new_e->uses_mq)
> + err = new_e->ops.mq.init_sched(q, new_e);
> + else
> + err = new_e->ops.sq.elevator_init_fn(q, new_e);
> + if (err)
> + goto fail_init;
>
> - if (registered) {
> err = elv_register_queue(q);
> if (err)
> goto fail_register;
> - }
> + } else
> + q->elevator = NULL;
>
> /* done, kill the old one and finish */
> - elevator_exit(old);
> - blk_queue_bypass_end(q);
> + if (old) {
> + elevator_exit(old);
> + if (!q->mq_ops)
> + blk_queue_bypass_end(q);
> + }
> +
> + if (q->mq_ops) {
> + blk_mq_unfreeze_queue(q);
> + blk_mq_start_stopped_hw_queues(q, true);
> + }
>
> - blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
> + if (new_e)
> + blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
> + else
> + blk_add_trace_msg(q, "elv switch: none");
>
> return 0;
>
> @@ -934,9 +1002,16 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
> elevator_exit(q->elevator);
> fail_init:
> /* switch failed, restore and re-register old elevator */
> - q->elevator = old;
> - elv_register_queue(q);
> - blk_queue_bypass_end(q);
> + if (old) {
> + q->elevator = old;
> + elv_register_queue(q);
> + if (!q->mq_ops)
> + blk_queue_bypass_end(q);
> + }
> + if (q->mq_ops) {
> + blk_mq_unfreeze_queue(q);
> + blk_mq_start_stopped_hw_queues(q, true);
> + }
>
> return err;
> }
> @@ -949,8 +1024,11 @@ static int __elevator_change(struct request_queue *q, const char *name)
> char elevator_name[ELV_NAME_MAX];
> struct elevator_type *e;
>
> - if (!q->elevator)
> - return -ENXIO;
> + /*
> + * Special case for mq, turn off scheduling
> + */
> + if (q->mq_ops && !strncmp(name, "none", 4))
> + return elevator_switch(q, NULL);
>
> strlcpy(elevator_name, name, sizeof(elevator_name));
> e = elevator_get(strstrip(elevator_name), true);
> @@ -959,11 +1037,23 @@ static int __elevator_change(struct request_queue *q, const char *name)
> return -EINVAL;
> }
>
> - if (!strcmp(elevator_name, q->elevator->type->elevator_name)) {
> + if (q->elevator &&
> + !strcmp(elevator_name, q->elevator->type->elevator_name)) {
> elevator_put(e);
> return 0;
> }
>
> + if (!e->uses_mq && q->mq_ops) {
> + printk(KERN_ERR "blk-mq-sched: elv %s does not support mq\n", elevator_name);
> + elevator_put(e);
> + return -EINVAL;
> + }
> + if (e->uses_mq && !q->mq_ops) {
> + printk(KERN_ERR "blk-mq-sched: elv %s is for mq\n", elevator_name);
> + elevator_put(e);
> + return -EINVAL;
> + }
> +
> return elevator_switch(q, e);
> }
>
> @@ -985,7 +1075,7 @@ ssize_t elv_iosched_store(struct request_queue *q, const char *name,
> {
> int ret;
>
> - if (!q->elevator)
> + if (!q->mq_ops || q->request_fn)
> return count;
>
> ret = __elevator_change(q, name);
> @@ -999,24 +1089,34 @@ ssize_t elv_iosched_store(struct request_queue *q, const char *name,
> ssize_t elv_iosched_show(struct request_queue *q, char *name)
> {
> struct elevator_queue *e = q->elevator;
> - struct elevator_type *elv;
> + struct elevator_type *elv = NULL;
> struct elevator_type *__e;
> int len = 0;
>
> - if (!q->elevator || !blk_queue_stackable(q))
> + if (!blk_queue_stackable(q))
> return sprintf(name, "none\n");
>
> - elv = e->type;
> + if (!q->elevator)
> + len += sprintf(name+len, "[none] ");
> + else
> + elv = e->type;
>
> spin_lock(&elv_list_lock);
> list_for_each_entry(__e, &elv_list, list) {
> - if (!strcmp(elv->elevator_name, __e->elevator_name))
> + if (elv && !strcmp(elv->elevator_name, __e->elevator_name)) {
> len += sprintf(name+len, "[%s] ", elv->elevator_name);
> - else
> + continue;
> + }
> + if (__e->uses_mq && q->mq_ops)
> + len += sprintf(name+len, "%s ", __e->elevator_name);
> + else if (!__e->uses_mq && !q->mq_ops)
> len += sprintf(name+len, "%s ", __e->elevator_name);
> }
> spin_unlock(&elv_list_lock);
>
> + if (q->mq_ops && q->elevator)
> + len += sprintf(name+len, "none");
> +
> len += sprintf(len+name, "\n");
> return len;
> }
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 2686f9e7302a..e3159be841ff 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -22,6 +22,7 @@ struct blk_mq_hw_ctx {
>
> unsigned long flags; /* BLK_MQ_F_* flags */
>
> + void *sched_data;
> struct request_queue *queue;
> struct blk_flush_queue *fq;
>
> @@ -156,6 +157,7 @@ enum {
>
> BLK_MQ_S_STOPPED = 0,
> BLK_MQ_S_TAG_ACTIVE = 1,
> + BLK_MQ_S_SCHED_RESTART = 2,
>
> BLK_MQ_MAX_DEPTH = 10240,
>
> @@ -179,7 +181,6 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set);
>
> void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule);
>
> -void blk_mq_insert_request(struct request *, bool, bool, bool);
> void blk_mq_free_request(struct request *rq);
> bool blk_mq_can_queue(struct blk_mq_hw_ctx *);
>
> diff --git a/include/linux/elevator.h b/include/linux/elevator.h
> index 2a9e966eed03..417810b2d2f5 100644
> --- a/include/linux/elevator.h
> +++ b/include/linux/elevator.h
> @@ -77,6 +77,32 @@ struct elevator_ops
> elevator_registered_fn *elevator_registered_fn;
> };
>
> +struct blk_mq_alloc_data;
> +struct blk_mq_hw_ctx;
> +
> +struct elevator_mq_ops {
> + int (*init_sched)(struct request_queue *, struct elevator_type *);
> + void (*exit_sched)(struct elevator_queue *);
> +
> + bool (*allow_merge)(struct request_queue *, struct request *, struct bio *);
> + bool (*bio_merge)(struct blk_mq_hw_ctx *, struct bio *);
> + int (*request_merge)(struct request_queue *q, struct request **, struct bio *);
> + void (*request_merged)(struct request_queue *, struct request *, int);
> + void (*requests_merged)(struct request_queue *, struct request *, struct request *);
> + struct request *(*get_request)(struct request_queue *, unsigned int, struct blk_mq_alloc_data *);
> + bool (*put_request)(struct request *);
> + void (*insert_requests)(struct blk_mq_hw_ctx *, struct list_head *, bool);
> + void (*dispatch_requests)(struct blk_mq_hw_ctx *, struct list_head *);
> + bool (*has_work)(struct blk_mq_hw_ctx *);
> + void (*completed_request)(struct blk_mq_hw_ctx *, struct request *);
> + void (*started_request)(struct request *);
> + void (*requeue_request)(struct request *);
> + struct request *(*former_request)(struct request_queue *, struct request *);
> + struct request *(*next_request)(struct request_queue *, struct request *);
> + int (*get_rq_priv)(struct request_queue *, struct request *);
> + void (*put_rq_priv)(struct request_queue *, struct request *);
> +};
> +
> #define ELV_NAME_MAX (16)
>
> struct elv_fs_entry {
> @@ -96,12 +122,14 @@ struct elevator_type
> /* fields provided by elevator implementation */
> union {
> struct elevator_ops sq;
> + struct elevator_mq_ops mq;
> } ops;
> size_t icq_size; /* see iocontext.h */
> size_t icq_align; /* ditto */
> struct elv_fs_entry *elevator_attrs;
> char elevator_name[ELV_NAME_MAX];
> struct module *elevator_owner;
> + bool uses_mq;
>
> /* managed by elevator core */
> char icq_cache_name[ELV_NAME_MAX + 5]; /* elvname + "_io_cq" */
> @@ -125,6 +153,7 @@ struct elevator_queue
> struct kobject kobj;
> struct mutex sysfs_lock;
> unsigned int registered:1;
> + unsigned int uses_mq:1;
> DECLARE_HASHTABLE(hash, ELV_HASH_BITS);
> };
>
> @@ -141,6 +170,7 @@ extern void elv_merge_requests(struct request_queue *, struct request *,
> extern void elv_merged_request(struct request_queue *, struct request *, int);
> extern void elv_bio_merged(struct request_queue *q, struct request *,
> struct bio *);
> +extern bool elv_attempt_insert_merge(struct request_queue *, struct request *);
> extern void elv_requeue_request(struct request_queue *, struct request *);
> extern struct request *elv_former_request(struct request_queue *, struct request *);
> extern struct request *elv_latter_request(struct request_queue *, struct request *);
> --
> 2.7.4
>


2016-12-20 15:46:38

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 6/8] blk-mq-sched: add framework for MQ capable IO schedulers

On 12/20/2016 04:55 AM, Paolo Valente wrote:
>> +struct request *blk_mq_sched_get_request(struct request_queue *q,
>> + struct bio *bio,
>> + unsigned int op,
>> + struct blk_mq_alloc_data *data)
>> +{
>> + struct elevator_queue *e = q->elevator;
>> + struct blk_mq_hw_ctx *hctx;
>> + struct blk_mq_ctx *ctx;
>> + struct request *rq;
>> +
>> + blk_queue_enter_live(q);
>> + ctx = blk_mq_get_ctx(q);
>> + hctx = blk_mq_map_queue(q, ctx->cpu);
>> +
>> + blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
>> +
>> + if (e && e->type->ops.mq.get_request)
>> + rq = e->type->ops.mq.get_request(q, op, data);
>
> bio is not passed to the scheduler here. Yet bfq uses bio to get the
> blkcg (invoking bio_blkcg). I'm not finding any workaround.

One important note here - what I'm posting is a work in progress, it's
by no means set in stone. So when you find missing items like this, feel
free to fix them up and send a patch. I will then fold in that patch. Or
if you don't feel comfortable fixing it up, let me know, and I'll fix it
up next time I touch it.

>> + else
>> + rq = __blk_mq_alloc_request(data, op);
>> +
>> + if (rq) {
>> + rq->elv.icq = NULL;
>> + if (e && e->type->icq_cache)
>> + blk_mq_sched_assign_ioc(q, rq, bio);
>
> bfq needs rq->elv.icq to be consistent in bfq_get_request, but the
> needed initialization seems to occur only after mq.get_request is
> invoked.
>
> Note: to minimize latency, I'm reporting immediately each problem that
> apparently cannot be solved by just modifying bfq. But, if the
> resulting higher number of micro-emails is annoying for you, I can
> buffer my questions, and send you cumulative emails less frequently.

That's perfectly fine, I prefer knowing earlier rather than later. But
do also remember that it's fine to send a patch to fix those things up,
you don't have to wait for me.

--
Jens Axboe

2016-12-20 15:46:51

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler

On 12/20/2016 02:34 AM, Paolo Valente wrote:
>
>> On 17 Dec 2016, at 01:12, Jens Axboe <[email protected]> wrote:
>>
>> This is basically identical to deadline-iosched, except it registers
>> as a MQ capable scheduler. This is still a single queue design.
>>
>> Signed-off-by: Jens Axboe <[email protected]>
>> ...
>> +
>> +static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
>> +{
>> + struct deadline_data *dd = hctx->queue->elevator->elevator_data;
>> +
>> + return !list_empty_careful(&dd->dispatch) ||
>> + !list_empty_careful(&dd->fifo_list[0]) ||
>> + !list_empty_careful(&dd->fifo_list[1]);
>
> Just a request for clarification: if I'm not mistaken,
> list_empty_careful can be used safely only if the only possible other
> concurrent access is a delete. Or am I missing something?

We can "solve" that with memory barriers. For now, it's safe to ignore
on your end.


--
Jens Axboe

2016-12-20 15:47:12

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 3/8] block: move rq_ioc() to blk.h

On 12/20/2016 03:12 AM, Paolo Valente wrote:
>
>> On 17 Dec 2016, at 01:12, Jens Axboe <[email protected]> wrote:
>>
>> We want to use it outside of blk-core.c.
>>
>
> Hi Jens,
> no hooks equivalent to elevator_init_icq_fn and elevator_exit_icq_fn
> are invoked. In particular, the second hook lets bfq (as with cfq)
> know when it can finally exit the queue associated with the icq.
> I'm trying to figure out how to do without these hooks/signals,
> but to no avail so far ...

Yep, those need to be added.

--
Jens Axboe

2016-12-20 22:14:27

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 3/8] block: move rq_ioc() to blk.h

On 12/20/2016 08:46 AM, Jens Axboe wrote:
> On 12/20/2016 03:12 AM, Paolo Valente wrote:
>>
>>> On 17 Dec 2016, at 01:12, Jens Axboe <[email protected]> wrote:
>>>
>>> We want to use it outside of blk-core.c.
>>>
>>
>> Hi Jens,
>> no hooks equivalent to elevator_init_icq_fn and elevator_exit_icq_fn
>> are invoked. In particular, the second hook lets bfq (as with cfq)
>> know when it can finally exit the queue associated with the icq.
>> I'm trying to figure out how to do without these hooks/signals,
>> but to no avail so far ...
>
> Yep, those need to be added.

Done, pushed out.

--
Jens Axboe

2016-12-21 02:22:35

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 6/8] blk-mq-sched: add framework for MQ capable IO schedulers

On Tue, Dec 20 2016, Paolo Valente wrote:
> > + else
> > + rq = __blk_mq_alloc_request(data, op);
> > +
> > + if (rq) {
> > + rq->elv.icq = NULL;
> > + if (e && e->type->icq_cache)
> > + blk_mq_sched_assign_ioc(q, rq, bio);
>
> bfq needs rq->elv.icq to be consistent in bfq_get_request, but the
> needed initialization seems to occur only after mq.get_request is
> invoked.

Can you do it from get/put_rq_priv? The icq is assigned there. If not,
we can redo this part, not a big deal.
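
A minimal sketch of what that could look like (purely illustrative; the
function name and body below are assumptions, not bfq code):

        static int bfq_get_rq_priv(struct request_queue *q, struct request *rq)
        {
                /*
                 * rq->elv.icq has already been assigned by the time the
                 * rq_priv hook runs, so icq-dependent setup that used to
                 * live in the old set_request path can happen here.
                 */
                struct io_cq *icq = rq->elv.icq;

                rq->elv.priv[0] = icq ? icq->ioc : NULL;
                return 0;
        }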

--
Jens Axboe

2016-12-21 11:59:42

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler

On 12/17/2016 01:12 AM, Jens Axboe wrote:
> +static bool dd_put_request(struct request *rq)
> +{
> + /*
> +  * If it's a real request, we just have to free it. For a shadow
> +  * request, we should only free it if we haven't started it. A
> +  * started request is mapped to a real one, and the real one will
> +  * free it. We can get here with request merges, since we then
> +  * free the request before we start/issue it.
> +  */
> + if (!blk_mq_sched_rq_is_shadow(rq))
> + return false;
> +
> + if (!(rq->rq_flags & RQF_STARTED)) {
> + struct request_queue *q = rq->q;
> + struct deadline_data *dd = q->elevator->elevator_data;
> +
> + /*
> +  * IO completion would normally do this, but if we merge
> +  * and free before we issue the request, drop both the
> +  * tag and queue ref
> +  */
> + blk_mq_sched_free_shadow_request(dd->tags, rq);
> + blk_queue_exit(q);
> + }
> +
> + return true;
> +}

Hello Jens,

Since this patch is the first patch that introduces a call to blk_queue_exit()
from a module other than the block layer core, shouldn't this patch export the
blk_queue_exit() function? An attempt to build mq-deadline as a module resulted
in the following:

ERROR: "blk_queue_exit" [block/mq-deadline.ko] undefined!
make[1]: *** [scripts/Makefile.modpost:91: __modpost] Error 1
make: *** [Makefile:1198: modules] Error 2
Execution failed: make all

Bart.

2016-12-21 14:23:17

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler

On 12/21/2016 04:59 AM, Bart Van Assche wrote:
> Since this patch is the first patch that introduces a call to
> blk_queue_exit() from a module other than the block layer core,
> shouldn't this patch export the blk_queue_exit() function? An attempt
> to build mq-deadline as a module resulted in the following:
>
> ERROR: "blk_queue_exit" [block/mq-deadline.ko] undefined!
> make[1]: *** [scripts/Makefile.modpost:91: __modpost] Error 1
> make: *** [Makefile:1198: modules] Error 2
> Execution failed: make all

Yes, it should. I'll make the export for now, I want to move that check
and free/drop into the generic code so that the schedulers don't have to
worry about it.
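
For reference, a sketch of the interim export, assuming blk_queue_exit()
keeps its current one-line body in block/blk-core.c:

        void blk_queue_exit(struct request_queue *q)
        {
                percpu_ref_put(&q->q_usage_counter);
        }
        EXPORT_SYMBOL_GPL(blk_queue_exit);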

--
Jens Axboe

2016-12-22 09:59:34

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH 6/8] blk-mq-sched: add framework for MQ capable IO schedulers


> On 17 Dec 2016, at 01:12, Jens Axboe <[email protected]> wrote:
>
> This adds a set of hooks that intercepts the blk-mq path of
> allocating/inserting/issuing/completing requests, allowing
> us to develop a scheduler within that framework.
>
> We reuse the existing elevator scheduler API on the registration
> side, but augment that with the scheduler flagging support for
> the blk-mq interface, and with a separate set of ops hooks for MQ
> devices.
>
> Schedulers can opt in to using shadow requests. Shadow requests
> are internal requests that the scheduler uses for the allocate
> and insert part, which are then mapped to a real driver request
> at dispatch time. This is needed to separate the device queue depth
> from the pool of requests that the scheduler has to work with.
>
> Signed-off-by: Jens Axboe <[email protected]>
>
...

> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> new file mode 100644
> index 000000000000..b7e1839d4785
> --- /dev/null
> +++ b/block/blk-mq-sched.c

> ...
> +static inline bool
> +blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq,
> + struct bio *bio)
> +{
> + struct elevator_queue *e = q->elevator;
> +
> + if (e && e->type->ops.mq.allow_merge)
> + return e->type->ops.mq.allow_merge(q, rq, bio);
> +
> + return true;
> +}
> +

Something does not seem to add up here:
e->type->ops.mq.allow_merge may be called only from
blk_mq_sched_allow_merge, which, in turn, may be called only from
blk_mq_attempt_merge, which, finally, may be called only from
blk_mq_merge_queue_io. Yet the latter is called only if there is
no elevator (lines 1399 and 1507 in blk-mq.c).

Therefore, e->type->ops.mq.allow_merge can never be called, whether or
not an elevator is set. Bear with me if I'm missing something big, but
I thought it was worth reporting this.
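
For reference, this is the guard from the patch above that bypasses
blk_mq_merge_queue_io() whenever a scheduler is attached (the
blk_sq_make_request() path carries an identical one):

        if (q->elevator) {
                blk_mq_put_ctx(data.ctx);
                blk_mq_bio_to_request(rq, bio);
                blk_mq_sched_insert_request(rq, false, true, true);
                goto done;
        }

blk_mq_merge_queue_io() is reached only right after this block, i.e.
only when q->elevator is NULL.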

Paolo


2016-12-22 11:13:55

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH 6/8] blk-mq-sched: add framework for MQ capable IO schedulers


> On 22 Dec 2016, at 10:59, Paolo Valente <[email protected]> wrote:
>
>>
>> On 17 Dec 2016, at 01:12, Jens Axboe <[email protected]> wrote:
>>
>> This adds a set of hooks that intercepts the blk-mq path of
>> allocating/inserting/issuing/completing requests, allowing
>> us to develop a scheduler within that framework.
>>
>> We reuse the existing elevator scheduler API on the registration
>> side, but augment that with the scheduler flagging support for
>> the blk-mq interface, and with a separate set of ops hooks for MQ
>> devices.
>>
>> Schedulers can opt in to using shadow requests. Shadow requests
>> are internal requests that the scheduler uses for the allocate
>> and insert part, which are then mapped to a real driver request
>> at dispatch time. This is needed to separate the device queue depth
>> from the pool of requests that the scheduler has to work with.
>>
>> Signed-off-by: Jens Axboe <[email protected]>
>>
> ...
>
>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
>> new file mode 100644
>> index 000000000000..b7e1839d4785
>> --- /dev/null
>> +++ b/block/blk-mq-sched.c
>
>> ...
>> +static inline bool
>> +blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq,
>> + struct bio *bio)
>> +{
>> + struct elevator_queue *e = q->elevator;
>> +
>> + if (e && e->type->ops.mq.allow_merge)
>> + return e->type->ops.mq.allow_merge(q, rq, bio);
>> +
>> + return true;
>> +}
>> +
>
> Something does not seem to add up here:
> e->type->ops.mq.allow_merge may be called only from
> blk_mq_sched_allow_merge, which, in turn, may be called only from
> blk_mq_attempt_merge, which, finally, may be called only from
> blk_mq_merge_queue_io. Yet the latter is called only if there is
> no elevator (lines 1399 and 1507 in blk-mq.c).
>
> Therefore, e->type->ops.mq.allow_merge can never be called, whether or
> not an elevator is set. Bear with me if I'm missing something big, but
> I thought it was worth reporting this.
>

Just another detail: if e->type->ops.mq.allow_merge does get invoked
from the above path, then it is of course invoked without the
scheduler lock held. In contrast, if this function gets invoked
from dd_bio_merge, then the scheduler lock is held.

To handle these opposite cases, I don't know whether checking whether
the lock is held (and possibly taking it) from inside
e->type->ops.mq.allow_merge is a good solution. In any case, before
possibly trying it, I will wait for some feedback on the main problem,
i.e., on the fact that e->type->ops.mq.allow_merge seems unreachable
in the above path.

Thanks,
Paolo

> Paolo
>


2016-12-22 15:20:12

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH 6/8] blk-mq-sched: add framework for MQ capable IO schedulers


> On 21 Dec 2016, at 03:22, Jens Axboe <[email protected]> wrote:
>
> On Tue, Dec 20 2016, Paolo Valente wrote:
>>> + else
>>> + rq = __blk_mq_alloc_request(data, op);
>>> +
>>> + if (rq) {
>>> + rq->elv.icq = NULL;
>>> + if (e && e->type->icq_cache)
>>> + blk_mq_sched_assign_ioc(q, rq, bio);
>>
>> bfq needs rq->elv.icq to be consistent in bfq_get_request, but the
>> needed initialization seems to occur only after mq.get_request is
>> invoked.
>
> Can you do it from get/put_rq_priv?

Definitely, I just overlooked them, sorry :(

Thanks,
Paolo

> The icq is assigned there. If not,
> we can redo this part, not a big deal.
>
> --
> Jens Axboe
>


2016-12-22 15:28:35

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCHSET v4] blk-mq-scheduling framework


> On 19 Dec 2016, at 22:05, Jens Axboe <[email protected]> wrote:
>
> On 12/19/2016 11:21 AM, Paolo Valente wrote:
>>
>>> On 19 Dec 2016, at 16:20, Jens Axboe <[email protected]> wrote:
>>>
>>> On 12/19/2016 04:32 AM, Paolo Valente wrote:
>>>>
>>>>> On 17 Dec 2016, at 01:12, Jens Axboe <[email protected]> wrote:
>>>>>
>>>>> This is version 4 of this patchset, version 3 was posted here:
>>>>>
>>>>> https://marc.info/?l=linux-block&m=148178513407631&w=2
>>>>>
>>>>> From the discussion last time, I looked into the feasibility of having
>>>>> two sets of tags for the same request pool, to avoid having to copy
>>>>> some of the request fields at dispatch and completion time. To do that,
>>>>> we'd have to replace the driver tag map(s) with our own, and augment
>>>>> that with tag map(s) on the side representing the device queue depth.
>>>>> Queuing IO with the scheduler would allocate from the new map, and
>>>>> dispatching would acquire the "real" tag. We would need to change
>>>>> drivers to do this, or add an extra indirection table to map a real
>>>>> tag to the scheduler tag. We would also need a 1:1 mapping between
>>>>> scheduler and hardware tag pools, or additional info to track it.
>>>>> Unless someone can convince me otherwise, I think the current approach
>>>>> is cleaner.
>>>>>
>>>>> I wasn't going to post v4 so soon, but I discovered a bug that led
>>>>> to drastically decreased merging. Especially on rotating storage,
>>>>> this release should be fast, and on par with the merging that we
>>>>> get through the legacy schedulers.
>>>>>
>>>>
>>>> I'm starting to modify bfq. You mentioned other missing pieces to
>>>> come. Do you already have an idea of what they are, so that I am
>>>> somehow prepared for what won't work even if my changes are right?
>>>
>>> I'm mostly talking about elevator ops hooks that aren't there in the new
>>> framework, but exist in the old one. There should be no hidden
>>> surprises, if that's what you are worried about.
>>>
>>> On the ops side, the only ones I can think of are the activate and
>>> deactivate, and those can be done in the dispatch_request hook for
>>> activate, and put/requeue for deactivate.
>>>
>>
>> You mean that there is no conceptual problem in moving the code of the
>> activate interface function into the dispatch function, and the code
>> of the deactivate into the put_request? (for a requeue it is a little
>> less clear to me, so one step at a time) Or am I missing
>> something more complex?
>
> Yes, what I mean is that there isn't a 1:1 mapping between the old ops
> and the new ops. So you'll have to consider the cases.
>
>

Problem: whereas it seems easy and safe to do the simple increment that
was done in activate_request somewhere else, I wonder whether it may
happen that a request is deactivated before being completed. If that may
happen, then, without a deactivate_request hook, the increments would
remain unbalanced. Or are request completions always guaranteed, unless
some hw/sw component breaks?
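
To make the accounting question concrete, here is a minimal sketch, with
hypothetical names that are not part of the patchset, of how the old
activate/deactivate counter could be kept balanced with the new hooks,
assuming the decrement is made to run exactly once per dispatched
request, whether it completes or is requeued/put:

/*
 * Hypothetical accounting, not from the series: the increment that
 * activate_request used to do moves into the dispatch path, and the
 * decrement that deactivate_request used to do must run exactly once
 * per dispatched request, on completion or on requeue/put.
 */
struct sched_stats {
	atomic_t dispatched;	/* requests currently owned by the driver */
};

static void sched_account_dispatch(struct sched_stats *st)
{
	atomic_inc(&st->dispatched);	/* was done by activate_request */
}

static void sched_account_done(struct sched_stats *st)
{
	/* call from ->completed_request() and from the requeue/put path */
	atomic_dec(&st->dispatched);	/* was done by deactivate_request */
}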

Thanks,
Paolo

> --
> Jens Axboe
>


2016-12-22 16:07:14

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler


> On 17 Dec 2016, at 01:12, Jens Axboe <[email protected]> wrote:
>
> This is basically identical to deadline-iosched, except it registers
> as a MQ capable scheduler. This is still a single queue design.
>
> Signed-off-by: Jens Axboe <[email protected]>
> ---

...

> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> new file mode 100644
> index 000000000000..3cb9de21ab21
> --- /dev/null
> +++ b/block/mq-deadline.c
> ...
> +/*
> + * remove rq from rbtree and fifo.
> + */
> +static void deadline_remove_request(struct request_queue *q, struct request *rq)
> +{
> + struct deadline_data *dd = q->elevator->elevator_data;
> +
> + list_del_init(&rq->queuelist);
> +
> + /*
> + * We might not be on the rbtree, if we are doing an insert merge
> + */
> + if (!RB_EMPTY_NODE(&rq->rb_node))
> + deadline_del_rq_rb(dd, rq);
> +

I've been scratching my head over the last three instructions, but to
no avail. If I understand correctly, the
list_del_init(&rq->queuelist);
removes rq from the fifo list. But, if so, I don't understand how it
could be possible that rq has not been added to the rb tree too.

Another interpretation I tried is that the above three lines correctly
handle the following case, where rq has not been inserted into the
deadline fifo queue or rb tree at all: when dd_insert_request was
executed for rq, blk_mq_sched_try_insert_merge succeeded. Yet, in that
case, the
list_del_init(&rq->queuelist);
does not seem to make sense.

Could you please shed some light on this for me?

Thanks,
Paolo


2016-12-22 16:37:44

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCHSET v4] blk-mq-scheduling framework

On Fri, 2016-12-16 at 17:12 -0700, Jens Axboe wrote:
> From the discussion last time, I looked into the feasibility of having
> two sets of tags for the same request pool, to avoid having to copy
> some of the request fields at dispatch and completion time. To do that,
> we'd have to replace the driver tag map(s) with our own, and augment
> that with tag map(s) on the side representing the device queue depth.
> Queuing IO with the scheduler would allocate from the new map, and
> dispatching would acquire the "real" tag. We would need to change
> drivers to do this, or add an extra indirection table to map a real
> tag to the scheduler tag. We would also need a 1:1 mapping between
> scheduler and hardware tag pools, or additional info to track it.
> Unless someone can convince me otherwise, I think the current approach
> is cleaner.

Hello Jens,

Can you have a look at the attached patches? These implement the "two tags
per request" approach without a table that maps one tag type to the other
or any other ugly construct. __blk_mq_alloc_request() is modified such that
it assigns rq->sched_tag and sched_tags->rqs[] instead of rq->tag and
tags->rqs[]. rq->tag and tags->rqs[] are assigned just before dispatch by
blk_mq_assign_drv_tag(). This approach results in significantly less code
than the approach proposed in v4 of your blk-mq-sched patch series. Memory
usage is lower because only a single set of requests is allocated. The
runtime overhead is lower because request fields no longer have to be
copied between the requests owned by the block driver and the requests
owned by the I/O scheduler. I can boot a VM from the virtio-blk driver but
otherwise the attached patches have not yet been tested.
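
For readers following along, here is a rough sketch of the dispatch-side
assignment described above. The helper name matches the attached patches,
but the body is paraphrased and get_driver_tag() is only a placeholder
for however a free tag is taken from the driver tag set, not a real
blk-mq function:

/*
 * Paraphrased sketch, not the actual patch: requests are allocated
 * against the scheduler tag set (rq->sched_tag), and only get a driver
 * tag right before being handed to the driver.
 */
static int blk_mq_assign_drv_tag(struct blk_mq_hw_ctx *hctx, struct request *rq)
{
	int tag = get_driver_tag(hctx);		/* placeholder helper */

	if (tag < 0)
		return -EBUSY;			/* device queue full, stop dispatching */

	rq->tag = tag;
	hctx->tags->rqs[tag] = rq;		/* keeps blk_mq_tag_to_rq() working */
	return 0;
}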

Thanks,

Bart.


Attachments:
0001-blk-mq-Revert-some-of-the-blk-mq-sched-framework-cha.patch (21.25 kB)
0002-blk-mq-Make-the-blk_mq_-get-put-_tag-callers-specify.patch (9.59 kB)
0003-blk-mq-Split-driver-and-scheduler-tags.patch (14.03 kB)

2016-12-22 16:50:09

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH 7/8] mq-deadline: add blk-mq adaptation of the deadline IO scheduler


> On 17 Dec 2016, at 01:12, Jens Axboe <[email protected]> wrote:
>
> This is basically identical to deadline-iosched, except it registers
> as a MQ capable scheduler. This is still a single queue design.
>

One last question (for today ...): in mq-deadline there are no
"schedule dispatch" or "unplug work" functions. In blk, CFQ and BFQ
do these schedules/unplugs in a lot of cases. What's the right
replacement? Just doing nothing?

Thanks,
Paolo

> Signed-off-by: Jens Axboe <[email protected]>
> ---
> block/Kconfig.iosched | 6 +
> block/Makefile | 1 +
> block/mq-deadline.c | 649 ++++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 656 insertions(+)
> create mode 100644 block/mq-deadline.c
>
> diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
> index 421bef9c4c48..490ef2850fae 100644
> --- a/block/Kconfig.iosched
> +++ b/block/Kconfig.iosched
> @@ -32,6 +32,12 @@ config IOSCHED_CFQ
>
> This is the default I/O scheduler.
>
> +config MQ_IOSCHED_DEADLINE
> + tristate "MQ deadline I/O scheduler"
> + default y
> + ---help---
> + MQ version of the deadline IO scheduler.
> +
> config CFQ_GROUP_IOSCHED
> bool "CFQ Group Scheduling support"
> depends on IOSCHED_CFQ && BLK_CGROUP
> diff --git a/block/Makefile b/block/Makefile
> index 2eee9e1bb6db..3ee0abd7205a 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -18,6 +18,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING) += blk-throttle.o
> obj-$(CONFIG_IOSCHED_NOOP) += noop-iosched.o
> obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
> obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
> +obj-$(CONFIG_MQ_IOSCHED_DEADLINE) += mq-deadline.o
>
> obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
> obj-$(CONFIG_BLK_CMDLINE_PARSER) += cmdline-parser.o
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> new file mode 100644
> index 000000000000..3cb9de21ab21
> --- /dev/null
> +++ b/block/mq-deadline.c
> @@ -0,0 +1,649 @@
> +/*
> + * MQ Deadline i/o scheduler - adaptation of the legacy deadline scheduler,
> + * for the blk-mq scheduling framework
> + *
> + * Copyright (C) 2016 Jens Axboe <[email protected]>
> + */
> +#include <linux/kernel.h>
> +#include <linux/fs.h>
> +#include <linux/blkdev.h>
> +#include <linux/blk-mq.h>
> +#include <linux/elevator.h>
> +#include <linux/bio.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/init.h>
> +#include <linux/compiler.h>
> +#include <linux/rbtree.h>
> +#include <linux/sbitmap.h>
> +
> +#include "blk.h"
> +#include "blk-mq.h"
> +#include "blk-mq-tag.h"
> +#include "blk-mq-sched.h"
> +
> +static unsigned int queue_depth = 256;
> +module_param(queue_depth, uint, 0644);
> +MODULE_PARM_DESC(queue_depth, "Use this value as the scheduler queue depth");
> +
> +/*
> + * See Documentation/block/deadline-iosched.txt
> + */
> +static const int read_expire = HZ / 2; /* max time before a read is submitted. */
> +static const int write_expire = 5 * HZ; /* ditto for writes, these limits are SOFT! */
> +static const int writes_starved = 2; /* max times reads can starve a write */
> +static const int fifo_batch = 16; /* # of sequential requests treated as one
> + by the above parameters. For throughput. */
> +
> +struct deadline_data {
> + /*
> + * run time data
> + */
> +
> + /*
> + * requests (deadline_rq s) are present on both sort_list and fifo_list
> + */
> + struct rb_root sort_list[2];
> + struct list_head fifo_list[2];
> +
> + /*
> + * next in sort order. read, write or both are NULL
> + */
> + struct request *next_rq[2];
> + unsigned int batching; /* number of sequential requests made */
> + unsigned int starved; /* times reads have starved writes */
> +
> + /*
> + * settings that change how the i/o scheduler behaves
> + */
> + int fifo_expire[2];
> + int fifo_batch;
> + int writes_starved;
> + int front_merges;
> +
> + spinlock_t lock;
> + struct list_head dispatch;
> + struct blk_mq_tags *tags;
> + atomic_t wait_index;
> +};
> +
> +static inline struct rb_root *
> +deadline_rb_root(struct deadline_data *dd, struct request *rq)
> +{
> + return &dd->sort_list[rq_data_dir(rq)];
> +}
> +
> +/*
> + * get the request after `rq' in sector-sorted order
> + */
> +static inline struct request *
> +deadline_latter_request(struct request *rq)
> +{
> + struct rb_node *node = rb_next(&rq->rb_node);
> +
> + if (node)
> + return rb_entry_rq(node);
> +
> + return NULL;
> +}
> +
> +static void
> +deadline_add_rq_rb(struct deadline_data *dd, struct request *rq)
> +{
> + struct rb_root *root = deadline_rb_root(dd, rq);
> +
> + elv_rb_add(root, rq);
> +}
> +
> +static inline void
> +deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
> +{
> + const int data_dir = rq_data_dir(rq);
> +
> + if (dd->next_rq[data_dir] == rq)
> + dd->next_rq[data_dir] = deadline_latter_request(rq);
> +
> + elv_rb_del(deadline_rb_root(dd, rq), rq);
> +}
> +
> +/*
> + * remove rq from rbtree and fifo.
> + */
> +static void deadline_remove_request(struct request_queue *q, struct request *rq)
> +{
> + struct deadline_data *dd = q->elevator->elevator_data;
> +
> + list_del_init(&rq->queuelist);
> +
> + /*
> + * We might not be on the rbtree, if we are doing an insert merge
> + */
> + if (!RB_EMPTY_NODE(&rq->rb_node))
> + deadline_del_rq_rb(dd, rq);
> +
> + elv_rqhash_del(q, rq);
> + if (q->last_merge == rq)
> + q->last_merge = NULL;
> +}
> +
> +static void dd_request_merged(struct request_queue *q, struct request *req,
> + int type)
> +{
> + struct deadline_data *dd = q->elevator->elevator_data;
> +
> + /*
> + * if the merge was a front merge, we need to reposition request
> + */
> + if (type == ELEVATOR_FRONT_MERGE) {
> + elv_rb_del(deadline_rb_root(dd, req), req);
> + deadline_add_rq_rb(dd, req);
> + }
> +}
> +
> +static void dd_merged_requests(struct request_queue *q, struct request *req,
> + struct request *next)
> +{
> + /*
> + * if next expires before rq, assign its expire time to rq
> + * and move into next position (next will be deleted) in fifo
> + */
> + if (!list_empty(&req->queuelist) && !list_empty(&next->queuelist)) {
> + if (time_before((unsigned long)next->fifo_time,
> + (unsigned long)req->fifo_time)) {
> + list_move(&req->queuelist, &next->queuelist);
> + req->fifo_time = next->fifo_time;
> + }
> + }
> +
> + /*
> + * kill knowledge of next, this one is a goner
> + */
> + deadline_remove_request(q, next);
> +}
> +
> +/*
> + * move an entry to dispatch queue
> + */
> +static void
> +deadline_move_request(struct deadline_data *dd, struct request *rq)
> +{
> + const int data_dir = rq_data_dir(rq);
> +
> + dd->next_rq[READ] = NULL;
> + dd->next_rq[WRITE] = NULL;
> + dd->next_rq[data_dir] = deadline_latter_request(rq);
> +
> + /*
> + * take it off the sort and fifo list
> + */
> + deadline_remove_request(rq->q, rq);
> +}
> +
> +/*
> + * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
> + * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
> + */
> +static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
> +{
> + struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);
> +
> + /*
> + * rq is expired!
> + */
> + if (time_after_eq(jiffies, (unsigned long)rq->fifo_time))
> + return 1;
> +
> + return 0;
> +}
> +
> +/*
> + * deadline_dispatch_requests selects the best request according to
> + * read/write expire, fifo_batch, etc
> + */
> +static struct request *__dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
> +{
> + struct deadline_data *dd = hctx->queue->elevator->elevator_data;
> + struct request *rq;
> + bool reads, writes;
> + int data_dir;
> +
> + spin_lock(&dd->lock);
> +
> + if (!list_empty(&dd->dispatch)) {
> + rq = list_first_entry(&dd->dispatch, struct request, queuelist);
> + list_del_init(&rq->queuelist);
> + goto done;
> + }
> +
> + reads = !list_empty(&dd->fifo_list[READ]);
> + writes = !list_empty(&dd->fifo_list[WRITE]);
> +
> + /*
> + * batches are currently reads XOR writes
> + */
> + if (dd->next_rq[WRITE])
> + rq = dd->next_rq[WRITE];
> + else
> + rq = dd->next_rq[READ];
> +
> + if (rq && dd->batching < dd->fifo_batch)
> + /* we have a next request and are still entitled to batch */
> + goto dispatch_request;
> +
> + /*
> + * at this point we are not running a batch. select the appropriate
> + * data direction (read / write)
> + */
> +
> + if (reads) {
> + BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
> +
> + if (writes && (dd->starved++ >= dd->writes_starved))
> + goto dispatch_writes;
> +
> + data_dir = READ;
> +
> + goto dispatch_find_request;
> + }
> +
> + /*
> + * there are either no reads or writes have been starved
> + */
> +
> + if (writes) {
> +dispatch_writes:
> + BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
> +
> + dd->starved = 0;
> +
> + data_dir = WRITE;
> +
> + goto dispatch_find_request;
> + }
> +
> + spin_unlock(&dd->lock);
> + return NULL;
> +
> +dispatch_find_request:
> + /*
> + * we are not running a batch, find best request for selected data_dir
> + */
> + if (deadline_check_fifo(dd, data_dir) || !dd->next_rq[data_dir]) {
> + /*
> + * A deadline has expired, the last request was in the other
> + * direction, or we have run out of higher-sectored requests.
> + * Start again from the request with the earliest expiry time.
> + */
> + rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
> + } else {
> + /*
> + * The last req was the same dir and we have a next request in
> + * sort order. No expired requests so continue on from here.
> + */
> + rq = dd->next_rq[data_dir];
> + }
> +
> + dd->batching = 0;
> +
> +dispatch_request:
> + /*
> + * rq is the selected appropriate request.
> + */
> + dd->batching++;
> + deadline_move_request(dd, rq);
> +done:
> + rq->rq_flags |= RQF_STARTED;
> + spin_unlock(&dd->lock);
> + return rq;
> +}
> +
> +static void dd_dispatch_requests(struct blk_mq_hw_ctx *hctx,
> + struct list_head *rq_list)
> +{
> + blk_mq_sched_dispatch_shadow_requests(hctx, rq_list, __dd_dispatch_request);
> +}
> +
> +static void dd_exit_queue(struct elevator_queue *e)
> +{
> + struct deadline_data *dd = e->elevator_data;
> +
> + BUG_ON(!list_empty(&dd->fifo_list[READ]));
> + BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
> +
> + blk_mq_sched_free_requests(dd->tags);
> + kfree(dd);
> +}
> +
> +/*
> + * initialize elevator private data (deadline_data).
> + */
> +static int dd_init_queue(struct request_queue *q, struct elevator_type *e)
> +{
> + struct deadline_data *dd;
> + struct elevator_queue *eq;
> +
> + eq = elevator_alloc(q, e);
> + if (!eq)
> + return -ENOMEM;
> +
> + dd = kzalloc_node(sizeof(*dd), GFP_KERNEL, q->node);
> + if (!dd) {
> + kobject_put(&eq->kobj);
> + return -ENOMEM;
> + }
> + eq->elevator_data = dd;
> +
> + dd->tags = blk_mq_sched_alloc_requests(queue_depth, q->node);
> + if (!dd->tags) {
> + kfree(dd);
> + kobject_put(&eq->kobj);
> + return -ENOMEM;
> + }
> +
> + INIT_LIST_HEAD(&dd->fifo_list[READ]);
> + INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
> + dd->sort_list[READ] = RB_ROOT;
> + dd->sort_list[WRITE] = RB_ROOT;
> + dd->fifo_expire[READ] = read_expire;
> + dd->fifo_expire[WRITE] = write_expire;
> + dd->writes_starved = writes_starved;
> + dd->front_merges = 1;
> + dd->fifo_batch = fifo_batch;
> + spin_lock_init(&dd->lock);
> + INIT_LIST_HEAD(&dd->dispatch);
> + atomic_set(&dd->wait_index, 0);
> +
> + q->elevator = eq;
> + return 0;
> +}
> +
> +static int dd_request_merge(struct request_queue *q, struct request **rq,
> + struct bio *bio)
> +{
> + struct deadline_data *dd = q->elevator->elevator_data;
> + sector_t sector = bio_end_sector(bio);
> + struct request *__rq;
> +
> + if (!dd->front_merges)
> + return ELEVATOR_NO_MERGE;
> +
> + __rq = elv_rb_find(&dd->sort_list[bio_data_dir(bio)], sector);
> + if (__rq) {
> + BUG_ON(sector != blk_rq_pos(__rq));
> +
> + if (elv_bio_merge_ok(__rq, bio)) {
> + *rq = __rq;
> + return ELEVATOR_FRONT_MERGE;
> + }
> + }
> +
> + return ELEVATOR_NO_MERGE;
> +}
> +
> +static bool dd_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
> +{
> + struct request_queue *q = hctx->queue;
> + struct deadline_data *dd = q->elevator->elevator_data;
> + int ret;
> +
> + spin_lock(&dd->lock);
> + ret = blk_mq_sched_try_merge(q, bio);
> + spin_unlock(&dd->lock);
> +
> + return ret;
> +}
> +
> +/*
> + * add rq to rbtree and fifo
> + */
> +static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
> + bool at_head)
> +{
> + struct request_queue *q = hctx->queue;
> + struct deadline_data *dd = q->elevator->elevator_data;
> + const int data_dir = rq_data_dir(rq);
> +
> + if (blk_mq_sched_try_insert_merge(q, rq))
> + return;
> +
> + blk_mq_sched_request_inserted(rq);
> +
> + /*
> + * If we're trying to insert a real request, just send it directly
> + * to the hardware dispatch list. This only happens for a requeue,
> + * or FUA/FLUSH requests.
> + */
> + if (!blk_mq_sched_rq_is_shadow(rq)) {
> + spin_lock(&hctx->lock);
> + list_add_tail(&rq->queuelist, &hctx->dispatch);
> + spin_unlock(&hctx->lock);
> + return;
> + }
> +
> + if (at_head || rq->cmd_type != REQ_TYPE_FS) {
> + if (at_head)
> + list_add(&rq->queuelist, &dd->dispatch);
> + else
> + list_add_tail(&rq->queuelist, &dd->dispatch);
> + } else {
> + deadline_add_rq_rb(dd, rq);
> +
> + if (rq_mergeable(rq)) {
> + elv_rqhash_add(q, rq);
> + if (!q->last_merge)
> + q->last_merge = rq;
> + }
> +
> + /*
> + * set expire time and add to fifo list
> + */
> + rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
> + list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
> + }
> +}
> +
> +static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
> + struct list_head *list, bool at_head)
> +{
> + struct request_queue *q = hctx->queue;
> + struct deadline_data *dd = q->elevator->elevator_data;
> +
> + spin_lock(&dd->lock);
> + while (!list_empty(list)) {
> + struct request *rq;
> +
> + rq = list_first_entry(list, struct request, queuelist);
> + list_del_init(&rq->queuelist);
> + dd_insert_request(hctx, rq, at_head);
> + }
> + spin_unlock(&dd->lock);
> +}
> +
> +static struct request *dd_get_request(struct request_queue *q, unsigned int op,
> + struct blk_mq_alloc_data *data)
> +{
> + struct deadline_data *dd = q->elevator->elevator_data;
> + struct request *rq;
> +
> + /*
> + * The flush machinery intercepts before we insert the request. As
> + * a work-around, just hand it back a real request.
> + */
> + if (unlikely(op & (REQ_PREFLUSH | REQ_FUA)))
> + rq = __blk_mq_alloc_request(data, op);
> + else {
> + rq = blk_mq_sched_alloc_shadow_request(q, data, dd->tags, &dd->wait_index);
> + if (rq)
> + blk_mq_rq_ctx_init(q, data->ctx, rq, op);
> + }
> +
> + return rq;
> +}
> +
> +static bool dd_put_request(struct request *rq)
> +{
> + /*
> + * If it's a real request, we just have to free it. For a shadow
> + * request, we should only free it if we haven't started it. A
> + * started request is mapped to a real one, and the real one will
> + * free it. We can get here with request merges, since we then
> + * free the request before we start/issue it.
> + */
> + if (!blk_mq_sched_rq_is_shadow(rq))
> + return false;
> +
> + if (!(rq->rq_flags & RQF_STARTED)) {
> + struct request_queue *q = rq->q;
> + struct deadline_data *dd = q->elevator->elevator_data;
> +
> + /*
> + * IO completion would normally do this, but if we merge
> + * and free before we issue the request, drop both the
> + * tag and queue ref
> + */
> + blk_mq_sched_free_shadow_request(dd->tags, rq);
> + blk_queue_exit(q);
> + }
> +
> + return true;
> +}
> +
> +static void dd_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
> +{
> + struct request *sched_rq = rq->end_io_data;
> +
> + /*
> + * sched_rq can be NULL, if we haven't setup the shadow yet
> + * because we failed getting one.
> + */
> + if (sched_rq) {
> + struct deadline_data *dd = hctx->queue->elevator->elevator_data;
> +
> + blk_mq_sched_free_shadow_request(dd->tags, sched_rq);
> + blk_mq_start_stopped_hw_queue(hctx, true);
> + }
> +}
> +
> +static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
> +{
> + struct deadline_data *dd = hctx->queue->elevator->elevator_data;
> +
> + return !list_empty_careful(&dd->dispatch) ||
> + !list_empty_careful(&dd->fifo_list[0]) ||
> + !list_empty_careful(&dd->fifo_list[1]);
> +}
> +
> +/*
> + * sysfs parts below
> + */
> +static ssize_t
> +deadline_var_show(int var, char *page)
> +{
> + return sprintf(page, "%d\n", var);
> +}
> +
> +static ssize_t
> +deadline_var_store(int *var, const char *page, size_t count)
> +{
> + char *p = (char *) page;
> +
> + *var = simple_strtol(p, &p, 10);
> + return count;
> +}
> +
> +#define SHOW_FUNCTION(__FUNC, __VAR, __CONV) \
> +static ssize_t __FUNC(struct elevator_queue *e, char *page) \
> +{ \
> + struct deadline_data *dd = e->elevator_data; \
> + int __data = __VAR; \
> + if (__CONV) \
> + __data = jiffies_to_msecs(__data); \
> + return deadline_var_show(__data, (page)); \
> +}
> +SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[READ], 1);
> +SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[WRITE], 1);
> +SHOW_FUNCTION(deadline_writes_starved_show, dd->writes_starved, 0);
> +SHOW_FUNCTION(deadline_front_merges_show, dd->front_merges, 0);
> +SHOW_FUNCTION(deadline_fifo_batch_show, dd->fifo_batch, 0);
> +#undef SHOW_FUNCTION
> +
> +#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \
> +static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count) \
> +{ \
> + struct deadline_data *dd = e->elevator_data; \
> + int __data; \
> + int ret = deadline_var_store(&__data, (page), count); \
> + if (__data < (MIN)) \
> + __data = (MIN); \
> + else if (__data > (MAX)) \
> + __data = (MAX); \
> + if (__CONV) \
> + *(__PTR) = msecs_to_jiffies(__data); \
> + else \
> + *(__PTR) = __data; \
> + return ret; \
> +}
> +STORE_FUNCTION(deadline_read_expire_store, &dd->fifo_expire[READ], 0, INT_MAX, 1);
> +STORE_FUNCTION(deadline_write_expire_store, &dd->fifo_expire[WRITE], 0, INT_MAX, 1);
> +STORE_FUNCTION(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX, 0);
> +STORE_FUNCTION(deadline_front_merges_store, &dd->front_merges, 0, 1, 0);
> +STORE_FUNCTION(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX, 0);
> +#undef STORE_FUNCTION
> +
> +#define DD_ATTR(name) \
> + __ATTR(name, S_IRUGO|S_IWUSR, deadline_##name##_show, \
> + deadline_##name##_store)
> +
> +static struct elv_fs_entry deadline_attrs[] = {
> + DD_ATTR(read_expire),
> + DD_ATTR(write_expire),
> + DD_ATTR(writes_starved),
> + DD_ATTR(front_merges),
> + DD_ATTR(fifo_batch),
> + __ATTR_NULL
> +};
> +
> +static struct elevator_type mq_deadline = {
> + .ops.mq = {
> + .get_request = dd_get_request,
> + .put_request = dd_put_request,
> + .insert_requests = dd_insert_requests,
> + .dispatch_requests = dd_dispatch_requests,
> + .completed_request = dd_completed_request,
> + .next_request = elv_rb_latter_request,
> + .former_request = elv_rb_former_request,
> + .bio_merge = dd_bio_merge,
> + .request_merge = dd_request_merge,
> + .requests_merged = dd_merged_requests,
> + .request_merged = dd_request_merged,
> + .has_work = dd_has_work,
> + .init_sched = dd_init_queue,
> + .exit_sched = dd_exit_queue,
> + },
> +
> + .uses_mq = true,
> + .elevator_attrs = deadline_attrs,
> + .elevator_name = "mq-deadline",
> + .elevator_owner = THIS_MODULE,
> +};
> +
> +static int __init deadline_init(void)
> +{
> + if (!queue_depth) {
> + pr_err("mq-deadline: queue depth must be > 0\n");
> + return -EINVAL;
> + }
> + return elv_register(&mq_deadline);
> +}
> +
> +static void __exit deadline_exit(void)
> +{
> + elv_unregister(&mq_deadline);
> +}
> +
> +module_init(deadline_init);
> +module_exit(deadline_exit);
> +
> +MODULE_AUTHOR("Jens Axboe");
> +MODULE_LICENSE("GPL");
> +MODULE_DESCRIPTION("MQ deadline IO scheduler");
> --
> 2.7.4
>


2016-12-22 16:52:24

by Omar Sandoval

[permalink] [raw]
Subject: Re: [PATCHSET v4] blk-mq-scheduling framework

On Thu, Dec 22, 2016 at 04:23:24PM +0000, Bart Van Assche wrote:
> On Fri, 2016-12-16 at 17:12 -0700, Jens Axboe wrote:
> > From the discussion last time, I looked into the feasibility of having
> > two sets of tags for the same request pool, to avoid having to copy
> > some of the request fields at dispatch and completion time. To do that,
> > we'd have to replace the driver tag map(s) with our own, and augment
> > that with tag map(s) on the side representing the device queue depth.
> > Queuing IO with the scheduler would allocate from the new map, and
> > dispatching would acquire the "real" tag. We would need to change
> > drivers to do this, or add an extra indirection table to map a real
> > tag to the scheduler tag. We would also need a 1:1 mapping between
> > scheduler and hardware tag pools, or additional info to track it.
> > Unless someone can convince me otherwise, I think the current approach
> > is cleaner.
>
> Hello Jens,
>
> Can you have a look at the attached patches? These implement the "two tags
> per request" approach without a table that maps one tag type to the other
> or any other ugly construct. __blk_mq_alloc_request() is modified such that
> it assigns rq->sched_tag and sched_tags->rqs[] instead of rq->tag and
> tags->rqs[]. rq->tag and tags->rqs[] are assigned just before dispatch by
> blk_mq_assign_drv_tag(). This approach results in significantly less code
> than the approach proposed in v4 of your blk-mq-sched patch series. Memory
> usage is lower because only a single set of requests is allocated. The
> runtime overhead is lower because request fields no longer have to be
> copied between the requests owned by the block driver and the requests
> owned by the I/O scheduler. I can boot a VM from the virtio-blk driver but
> otherwise the attached patches have not yet been tested.
>
> Thanks,
>
> Bart.

Hey, Bart,

This approach occurred to us, but we couldn't figure out a way to make
blk_mq_tag_to_rq() work with it. From skimming over the patches, I
didn't see a solution to that problem.
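
For context, blk_mq_tag_to_rq() at that point was roughly a direct
lookup into tags->rqs[] (paraphrased from block/blk-mq.c), which is why
whatever owns the driver tags also has to keep that array populated:

struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
{
	if (tag < tags->nr_tags)
		return tags->rqs[tag];

	return NULL;
}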

2016-12-22 16:57:50

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCHSET v4] blk-mq-scheduling framework

On Thu, 2016-12-22 at 08:52 -0800, Omar Sandoval wrote:
> This approach occurred to us, but we couldn't figure out a way to make
> blk_mq_tag_to_rq() work with it. From skimming over the patches, I
> didn't see a solution to that problem.

Hello Omar,

Can you clarify your comment? Since my patches initialize both tags->rqs[]
and sched_tags->rqs[] the function blk_mq_tag_to_rq() should still work.

Bart.

2016-12-22 17:12:42

by Omar Sandoval

[permalink] [raw]
Subject: Re: [PATCHSET v4] blk-mq-scheduling framework

On Thu, Dec 22, 2016 at 04:57:36PM +0000, Bart Van Assche wrote:
> On Thu, 2016-12-22 at 08:52 -0800, Omar Sandoval wrote:
> > This approach occurred to us, but we couldn't figure out a way to make
> > blk_mq_tag_to_rq() work with it. From skimming over the patches, I
> > didn't see a solution to that problem.
>
> Hello Omar,
>
> Can you clarify your comment? Since my patches initialize both tags->rqs[]
> and sched_tags->rqs[] the function blk_mq_tag_to_rq() should still work.
>
> Bart.

Sorry, you're right, it does work, but tags->rqs[] ends up being the
extra lookup table. I suspect that the runtime overhead of keeping that
up to date could be worse than copying the rq fields if you have lots of
CPUs but only one hardware queue.

2016-12-22 17:40:09

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCHSET v4] blk-mq-scheduling framework

On Thu, 2016-12-22 at 09:12 -0800, Omar Sandoval wrote:
> On Thu, Dec 22, 2016 at 04:57:36PM +0000, Bart Van Assche wrote:
> > On Thu, 2016-12-22 at 08:52 -0800, Omar Sandoval wrote:
> > > This approach occurred to us, but we couldn't figure out a way to make
> > > blk_mq_tag_to_rq() work with it. From skimming over the patches, I
> > > didn't see a solution to that problem.
> >
> > Can you clarify your comment? Since my patches initialize both tags->rqs[]
> > and sched_tags->rqs[] the function blk_mq_tag_to_rq() should still work.
>
> Sorry, you're right, it does work, but tags->rqs[] ends up being the
> extra lookup table. I suspect that the runtime overhead of keeping that
> up to date could be worse than copying the rq fields if you have lots of
> CPUs but only one hardware queue.

Hello Omar,

I'm not sure that anything can be done if the number of CPUs that are
submitting I/O is large compared to the queue depth, so I don't think we
should spend our time on that case. If the queue depth is large enough,
the sbitmap code will allocate tags such that different CPUs use
different rqs[] elements.

The advantages of the approach I proposed are such that I am convinced
it is what we should start from, addressing contention on the
tags->rqs[] array later if measurements show that it is necessary.

Bart.

2016-12-23 10:12:49

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH 6/8] blk-mq-sched: add framework for MQ capable IO schedulers


> On 22 Dec 2016, at 10:59, Paolo Valente <[email protected]> wrote:
>
>>
>> On 17 Dec 2016, at 01:12, Jens Axboe <[email protected]> wrote:
>>
>> This adds a set of hooks that intercepts the blk-mq path of
>> allocating/inserting/issuing/completing requests, allowing
>> us to develop a scheduler within that framework.
>>
>> We reuse the existing elevator scheduler API on the registration
>> side, but augment that with the scheduler flagging support for
>> the blk-mq interface, and with a separate set of ops hooks for MQ
>> devices.
>>
>> Schedulers can opt in to using shadow requests. Shadow requests
>> are internal requests that the scheduler uses for the allocate
>> and insert part, which are then mapped to a real driver request
>> at dispatch time. This is needed to separate the device queue depth
>> from the pool of requests that the scheduler has to work with.
>>
>> Signed-off-by: Jens Axboe <[email protected]>
>>
> ...
>
>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
>> new file mode 100644
>> index 000000000000..b7e1839d4785
>> --- /dev/null
>> +++ b/block/blk-mq-sched.c
>
>> ...
>> +static inline bool
>> +blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq,
>> + struct bio *bio)
>> +{
>> + struct elevator_queue *e = q->elevator;
>> +
>> + if (e && e->type->ops.mq.allow_merge)
>> + return e->type->ops.mq.allow_merge(q, rq, bio);
>> +
>> + return true;
>> +}
>> +
>
> Something does not seem to add up here:
> e->type->ops.mq.allow_merge may be called only from
> blk_mq_sched_allow_merge, which, in turn, may be called only from
> blk_mq_attempt_merge, which, finally, may be called only from
> blk_mq_merge_queue_io. Yet the latter may be called only if there is
> no elevator (lines 1399 and 1507 in blk-mq.c).
>
> Therefore, e->type->ops.mq.allow_merge can never be called, whether or
> not there is an elevator. Bear with me if I'm missing something huge,
> but I thought it was worth reporting this.
>

Jens,
I forgot to add that I'm willing (and would be happy) to propose a fix
for this, and possibly for the other problems too, on my own. It's just
that I'm not yet expert enough to do it without first receiving some
feedback or instructions from you. In this specific case, I don't
even know yet whether this is really a bug.

Thanks, and merry Christmas if we don't get in touch before,
Paolo

> Paolo
>