As a follow-up to this posting from yesterday:
https://marc.info/?l=linux-block&m=148115232806065&w=2
this is version 2. I wanted to post a new one fairly quickly, as there
ended up being a number of potential crashes in v1. This one should be
solid: I've run mq-deadline on both NVMe and regular rotating storage,
and we handle the various merging cases correctly.
You can download it from git as well:
git://git.kernel.dk/linux-block blk-mq-sched.2
Note that this is based on for-4.10/block, which is in turn based on
v4.9-rc1. I suggest pulling it into my for-next branch, which would
then merge nicely with 'master' as well.
Changes since v1:
- Add Kconfig entries to allow the user to choose what the default
scheduler should be for blk-mq, and whether that depends on the
number of hardware queues.
- Properly abstract the whole get/put of a request, so we can manage
  the lifetime properly (see the sketch after this list).
- Enable full merging on mq-deadline (front/back, bio-to-rq, rq-to-rq).
Has full feature parity with deadline now.
- Export necessary symbols for compiling mq-deadline as a module.
- Various API adjustments for the mq schedulers.
- Various cleanups and improvements.
- Fix a lot of bugs. A lot. Upgrade!
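As a reference for the get/put abstraction mentioned above, this is
roughly the shape it takes later in the series (in blk-mq-sched.h); a
sketch only, quoted here for convenience:

  static inline void blk_mq_sched_put_request(struct request *rq)
  {
          struct request_queue *q = rq->q;
          struct elevator_queue *e = q->elevator;

          /* hand the request back to the scheduler, if one is attached */
          if (e && e->type->mq_ops.put_request)
                  e->type->mq_ops.put_request(rq);
          else
                  blk_mq_free_request(rq);
  }

Allocation goes through a matching blk_mq_sched_get_request() helper, so
whether a request is a real driver-tagged request or a scheduler-private
shadow request is decided in one place.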
block/Kconfig.iosched | 37 ++
block/Makefile | 3
block/blk-core.c | 9
block/blk-exec.c | 3
block/blk-flush.c | 7
block/blk-merge.c | 3
block/blk-mq-sched.c | 265 +++++++++++++++++++
block/blk-mq-sched.h | 188 +++++++++++++
block/blk-mq-tag.c | 1
block/blk-mq.c | 254 ++++++++++--------
block/blk-mq.h | 35 +-
block/elevator.c | 194 ++++++++++----
block/mq-deadline.c | 647 +++++++++++++++++++++++++++++++++++++++++++++++
drivers/nvme/host/pci.c | 1
include/linux/blk-mq.h | 4
include/linux/elevator.h | 34 ++
16 files changed, 1495 insertions(+), 190 deletions(-)
Signed-off-by: Jens Axboe <[email protected]>
---
block/elevator.c | 8 ++++----
include/linux/elevator.h | 5 +++++
2 files changed, 9 insertions(+), 4 deletions(-)
diff --git a/block/elevator.c b/block/elevator.c
index a18a5db274e4..40f0c04e5ad3 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -248,13 +248,13 @@ static inline void __elv_rqhash_del(struct request *rq)
rq->rq_flags &= ~RQF_HASHED;
}
-static void elv_rqhash_del(struct request_queue *q, struct request *rq)
+void elv_rqhash_del(struct request_queue *q, struct request *rq)
{
if (ELV_ON_HASH(rq))
__elv_rqhash_del(rq);
}
-static void elv_rqhash_add(struct request_queue *q, struct request *rq)
+void elv_rqhash_add(struct request_queue *q, struct request *rq)
{
struct elevator_queue *e = q->elevator;
@@ -263,13 +263,13 @@ static void elv_rqhash_add(struct request_queue *q, struct request *rq)
rq->rq_flags |= RQF_HASHED;
}
-static void elv_rqhash_reposition(struct request_queue *q, struct request *rq)
+void elv_rqhash_reposition(struct request_queue *q, struct request *rq)
{
__elv_rqhash_del(rq);
elv_rqhash_add(q, rq);
}
-static struct request *elv_rqhash_find(struct request_queue *q, sector_t offset)
+struct request *elv_rqhash_find(struct request_queue *q, sector_t offset)
{
struct elevator_queue *e = q->elevator;
struct hlist_node *next;
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index f219c9aed360..b276e9ef0e0b 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -108,6 +108,11 @@ struct elevator_type
#define ELV_HASH_BITS 6
+void elv_rqhash_del(struct request_queue *q, struct request *rq);
+void elv_rqhash_add(struct request_queue *q, struct request *rq);
+void elv_rqhash_reposition(struct request_queue *q, struct request *rq);
+struct request *elv_rqhash_find(struct request_queue *q, sector_t offset);
+
/*
* each queue has an elevator_queue associated with it
*/
--
2.7.4
Currently we ask for the queue to be run async, but we don't actually
flag the queue to be run. We don't need to run it async here, but we
should run it. So fix up the parameters.
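For clarity, the parameter mapping (the prototype below is assumed from
the series base, it is not restated in this patch):

  /*
   * Assumed prototype:
   *
   *   void blk_mq_insert_request(struct request *rq, bool at_head,
   *                              bool run_queue, bool async);
   */
  blk_mq_insert_request(rq, false, false, true);  /* before: don't run, async */
  blk_mq_insert_request(rq, false, true, false);  /* after: run the queue, not async */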
Signed-off-by: Jens Axboe <[email protected]>
---
block/blk-flush.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 1bdbb3d3e5f5..27a42dab5a36 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -426,7 +426,7 @@ void blk_insert_flush(struct request *rq)
if ((policy & REQ_FSEQ_DATA) &&
!(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
if (q->mq_ops) {
- blk_mq_insert_request(rq, false, false, true);
+ blk_mq_insert_request(rq, false, true, false);
} else
list_add_tail(&rq->queuelist, &q->queue_head);
return;
--
2.7.4
Takes a list of requests and dispatches them. Any residual requests
are moved to the hctx dispatch list.
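As a sketch of the intended calling pattern (helper names as used
elsewhere in this series), a caller collects requests into a list and
hands it off; whatever the driver could not take is parked on
hctx->dispatch for the next queue run:

  LIST_HEAD(rq_list);

  /* pull in any pending requests from the software queues */
  flush_busy_ctxs(hctx, &rq_list);

  if (!blk_mq_dispatch_rq_list(hctx, &rq_list)) {
          /*
           * Driver returned BUSY; residual requests were moved to
           * hctx->dispatch and the queue will be re-run.
           */
  }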
Signed-off-by: Jens Axboe <[email protected]>
---
block/blk-mq.c | 85 ++++++++++++++++++++++++++++++++--------------------------
block/blk-mq.h | 1 +
2 files changed, 48 insertions(+), 38 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b216746be9d3..abbf7cca4d0d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -821,41 +821,13 @@ static inline unsigned int queued_to_index(unsigned int queued)
return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
}
-/*
- * Run this hardware queue, pulling any software queues mapped to it in.
- * Note that this function currently has various problems around ordering
- * of IO. In particular, we'd like FIFO behaviour on handling existing
- * items on the hctx->dispatch list. Ignore that for now.
- */
-static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
+bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
{
struct request_queue *q = hctx->queue;
struct request *rq;
- LIST_HEAD(rq_list);
LIST_HEAD(driver_list);
struct list_head *dptr;
- int queued;
-
- if (unlikely(blk_mq_hctx_stopped(hctx)))
- return;
-
- hctx->run++;
-
- /*
- * Touch any software queue that has pending entries.
- */
- flush_busy_ctxs(hctx, &rq_list);
-
- /*
- * If we have previous entries on our dispatch list, grab them
- * and stuff them at the front for more fair dispatch.
- */
- if (!list_empty_careful(&hctx->dispatch)) {
- spin_lock(&hctx->lock);
- if (!list_empty(&hctx->dispatch))
- list_splice_init(&hctx->dispatch, &rq_list);
- spin_unlock(&hctx->lock);
- }
+ int queued, ret = BLK_MQ_RQ_QUEUE_OK;
/*
* Start off with dptr being NULL, so we start the first request
@@ -867,16 +839,15 @@ static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
* Now process all the entries, sending them to the driver.
*/
queued = 0;
- while (!list_empty(&rq_list)) {
+ while (!list_empty(list)) {
struct blk_mq_queue_data bd;
- int ret;
- rq = list_first_entry(&rq_list, struct request, queuelist);
+ rq = list_first_entry(list, struct request, queuelist);
list_del_init(&rq->queuelist);
bd.rq = rq;
bd.list = dptr;
- bd.last = list_empty(&rq_list);
+ bd.last = list_empty(list);
ret = q->mq_ops->queue_rq(hctx, &bd);
switch (ret) {
@@ -884,7 +855,7 @@ static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
queued++;
break;
case BLK_MQ_RQ_QUEUE_BUSY:
- list_add(&rq->queuelist, &rq_list);
+ list_add(&rq->queuelist, list);
__blk_mq_requeue_request(rq);
break;
default:
@@ -902,7 +873,7 @@ static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
* We've done the first request. If we have more than 1
* left in the list, set dptr to defer issue.
*/
- if (!dptr && rq_list.next != rq_list.prev)
+ if (!dptr && list->next != list->prev)
dptr = &driver_list;
}
@@ -912,10 +883,11 @@ static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
* Any items that need requeuing? Stuff them into hctx->dispatch,
* that is where we will continue on next queue run.
*/
- if (!list_empty(&rq_list)) {
+ if (!list_empty(list)) {
spin_lock(&hctx->lock);
- list_splice(&rq_list, &hctx->dispatch);
+ list_splice(list, &hctx->dispatch);
spin_unlock(&hctx->lock);
+
/*
* the queue is expected stopped with BLK_MQ_RQ_QUEUE_BUSY, but
* it's possible the queue is stopped and restarted again
@@ -927,6 +899,43 @@ static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
**/
blk_mq_run_hw_queue(hctx, true);
}
+
+ return ret != BLK_MQ_RQ_QUEUE_BUSY;
+}
+
+/*
+ * Run this hardware queue, pulling any software queues mapped to it in.
+ * Note that this function currently has various problems around ordering
+ * of IO. In particular, we'd like FIFO behaviour on handling existing
+ * items on the hctx->dispatch list. Ignore that for now.
+ */
+static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
+{
+ LIST_HEAD(rq_list);
+ LIST_HEAD(driver_list);
+
+ if (unlikely(blk_mq_hctx_stopped(hctx)))
+ return;
+
+ hctx->run++;
+
+ /*
+ * Touch any software queue that has pending entries.
+ */
+ flush_busy_ctxs(hctx, &rq_list);
+
+ /*
+ * If we have previous entries on our dispatch list, grab them
+ * and stuff them at the front for more fair dispatch.
+ */
+ if (!list_empty_careful(&hctx->dispatch)) {
+ spin_lock(&hctx->lock);
+ if (!list_empty(&hctx->dispatch))
+ list_splice_init(&hctx->dispatch, &rq_list);
+ spin_unlock(&hctx->lock);
+ }
+
+ blk_mq_dispatch_rq_list(hctx, &rq_list);
}
static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
diff --git a/block/blk-mq.h b/block/blk-mq.h
index b444370ae05b..3a54dd32a6fc 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -31,6 +31,7 @@ void blk_mq_freeze_queue(struct request_queue *q);
void blk_mq_free_queue(struct request_queue *q);
int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
void blk_mq_wake_waiters(struct request_queue *q);
+bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *, struct list_head *);
/*
* CPU hotplug helpers
--
2.7.4
Signed-off-by: Jens Axboe <[email protected]>
---
block/Kconfig.iosched | 6 +
block/Makefile | 1 +
block/mq-deadline.c | 647 ++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 654 insertions(+)
create mode 100644 block/mq-deadline.c
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 421bef9c4c48..490ef2850fae 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -32,6 +32,12 @@ config IOSCHED_CFQ
This is the default I/O scheduler.
+config MQ_IOSCHED_DEADLINE
+ tristate "MQ deadline I/O scheduler"
+ default y
+ ---help---
+ MQ version of the deadline IO scheduler.
+
config CFQ_GROUP_IOSCHED
bool "CFQ Group Scheduling support"
depends on IOSCHED_CFQ && BLK_CGROUP
diff --git a/block/Makefile b/block/Makefile
index 2eee9e1bb6db..3ee0abd7205a 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -18,6 +18,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING) += blk-throttle.o
obj-$(CONFIG_IOSCHED_NOOP) += noop-iosched.o
obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
+obj-$(CONFIG_MQ_IOSCHED_DEADLINE) += mq-deadline.o
obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
obj-$(CONFIG_BLK_CMDLINE_PARSER) += cmdline-parser.o
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
new file mode 100644
index 000000000000..dfd30b68bfc4
--- /dev/null
+++ b/block/mq-deadline.c
@@ -0,0 +1,647 @@
+/*
+ * MQ Deadline i/o scheduler - adaptation of the legacy deadline scheduler,
+ * for the blk-mq scheduling framework
+ *
+ * Copyright (C) 2016 Jens Axboe <[email protected]>
+ */
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/blkdev.h>
+#include <linux/blk-mq.h>
+#include <linux/elevator.h>
+#include <linux/bio.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/compiler.h>
+#include <linux/rbtree.h>
+#include <linux/sbitmap.h>
+
+#include "blk.h"
+#include "blk-mq.h"
+#include "blk-mq-tag.h"
+#include "blk-mq-sched.h"
+
+static inline bool dd_rq_is_shadow(struct request *rq)
+{
+ return rq->rq_flags & RQF_ALLOCED;
+}
+
+/*
+ * See Documentation/block/deadline-iosched.txt
+ */
+static const int read_expire = HZ / 2; /* max time before a read is submitted. */
+static const int write_expire = 5 * HZ; /* ditto for writes, these limits are SOFT! */
+static const int writes_starved = 2; /* max times reads can starve a write */
+static const int fifo_batch = 16; /* # of sequential requests treated as one
+ by the above parameters. For throughput. */
+
+struct deadline_data {
+ /*
+ * run time data
+ */
+
+ /*
+ * requests (deadline_rq s) are present on both sort_list and fifo_list
+ */
+ struct rb_root sort_list[2];
+ struct list_head fifo_list[2];
+
+ /*
+ * next in sort order. read, write or both are NULL
+ */
+ struct request *next_rq[2];
+ unsigned int batching; /* number of sequential requests made */
+ unsigned int starved; /* times reads have starved writes */
+
+ /*
+ * settings that change how the i/o scheduler behaves
+ */
+ int fifo_expire[2];
+ int fifo_batch;
+ int writes_starved;
+ int front_merges;
+
+ spinlock_t lock;
+ struct list_head dispatch;
+ struct blk_mq_tags *tags;
+ atomic_t wait_index;
+};
+
+static inline struct rb_root *
+deadline_rb_root(struct deadline_data *dd, struct request *rq)
+{
+ return &dd->sort_list[rq_data_dir(rq)];
+}
+
+/*
+ * get the request after `rq' in sector-sorted order
+ */
+static inline struct request *
+deadline_latter_request(struct request *rq)
+{
+ struct rb_node *node = rb_next(&rq->rb_node);
+
+ if (node)
+ return rb_entry_rq(node);
+
+ return NULL;
+}
+
+static void
+deadline_add_rq_rb(struct deadline_data *dd, struct request *rq)
+{
+ struct rb_root *root = deadline_rb_root(dd, rq);
+
+ elv_rb_add(root, rq);
+}
+
+static inline void
+deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
+{
+ const int data_dir = rq_data_dir(rq);
+
+ if (dd->next_rq[data_dir] == rq)
+ dd->next_rq[data_dir] = deadline_latter_request(rq);
+
+ elv_rb_del(deadline_rb_root(dd, rq), rq);
+}
+
+/*
+ * remove rq from rbtree and fifo.
+ */
+static void deadline_remove_request(struct request_queue *q, struct request *rq)
+{
+ struct deadline_data *dd = q->elevator->elevator_data;
+
+ list_del_init(&rq->queuelist);
+ deadline_del_rq_rb(dd, rq);
+
+ elv_rqhash_del(q, rq);
+ if (q->last_merge == rq)
+ q->last_merge = NULL;
+}
+
+static void dd_merged_requests(struct request_queue *q, struct request *req,
+ struct request *next)
+{
+ /*
+ * if next expires before rq, assign its expire time to rq
+ * and move into next position (next will be deleted) in fifo
+ */
+ if (!list_empty(&req->queuelist) && !list_empty(&next->queuelist)) {
+ if (time_before((unsigned long)next->fifo_time,
+ (unsigned long)req->fifo_time)) {
+ list_move(&req->queuelist, &next->queuelist);
+ req->fifo_time = next->fifo_time;
+ }
+ }
+
+ /*
+ * kill knowledge of next, this one is a goner
+ */
+ deadline_remove_request(q, next);
+}
+
+/*
+ * move an entry to dispatch queue
+ */
+static void
+deadline_move_request(struct deadline_data *dd, struct request *rq)
+{
+ const int data_dir = rq_data_dir(rq);
+
+ dd->next_rq[READ] = NULL;
+ dd->next_rq[WRITE] = NULL;
+ dd->next_rq[data_dir] = deadline_latter_request(rq);
+
+ /*
+ * take it off the sort and fifo list
+ */
+ deadline_remove_request(rq->q, rq);
+}
+
+/*
+ * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
+ * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
+ */
+static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
+{
+ struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);
+
+ /*
+ * rq is expired!
+ */
+ if (time_after_eq(jiffies, (unsigned long)rq->fifo_time))
+ return 1;
+
+ return 0;
+}
+
+/*
+ * deadline_dispatch_requests selects the best request according to
+ * read/write expire, fifo_batch, etc
+ */
+static struct request *__dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
+{
+ struct deadline_data *dd = hctx->queue->elevator->elevator_data;
+ struct request *rq;
+ bool reads, writes;
+ int data_dir;
+
+ spin_lock(&dd->lock);
+
+ if (!list_empty(&dd->dispatch)) {
+ rq = list_first_entry(&dd->dispatch, struct request, queuelist);
+ list_del_init(&rq->queuelist);
+ goto done;
+ }
+
+ reads = !list_empty(&dd->fifo_list[READ]);
+ writes = !list_empty(&dd->fifo_list[WRITE]);
+
+ /*
+ * batches are currently reads XOR writes
+ */
+ if (dd->next_rq[WRITE])
+ rq = dd->next_rq[WRITE];
+ else
+ rq = dd->next_rq[READ];
+
+ if (rq && dd->batching < dd->fifo_batch)
+ /* we have a next request and are still entitled to batch */
+ goto dispatch_request;
+
+ /*
+ * at this point we are not running a batch. select the appropriate
+ * data direction (read / write)
+ */
+
+ if (reads) {
+ BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
+
+ if (writes && (dd->starved++ >= dd->writes_starved))
+ goto dispatch_writes;
+
+ data_dir = READ;
+
+ goto dispatch_find_request;
+ }
+
+ /*
+ * there are either no reads or writes have been starved
+ */
+
+ if (writes) {
+dispatch_writes:
+ BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
+
+ dd->starved = 0;
+
+ data_dir = WRITE;
+
+ goto dispatch_find_request;
+ }
+
+ spin_unlock(&dd->lock);
+ return NULL;
+
+dispatch_find_request:
+ /*
+ * we are not running a batch, find best request for selected data_dir
+ */
+ if (deadline_check_fifo(dd, data_dir) || !dd->next_rq[data_dir]) {
+ /*
+ * A deadline has expired, the last request was in the other
+ * direction, or we have run out of higher-sectored requests.
+ * Start again from the request with the earliest expiry time.
+ */
+ rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
+ } else {
+ /*
+ * The last req was the same dir and we have a next request in
+ * sort order. No expired requests so continue on from here.
+ */
+ rq = dd->next_rq[data_dir];
+ }
+
+ dd->batching = 0;
+
+dispatch_request:
+ /*
+ * rq is the selected appropriate request.
+ */
+ dd->batching++;
+ deadline_move_request(dd, rq);
+done:
+ rq->rq_flags |= RQF_STARTED;
+ spin_unlock(&dd->lock);
+ return rq;
+}
+
+static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
+{
+ return blk_mq_sched_request_from_shadow(hctx, __dd_dispatch_request);
+}
+
+static void dd_exit_queue(struct elevator_queue *e)
+{
+ struct deadline_data *dd = e->elevator_data;
+
+ BUG_ON(!list_empty(&dd->fifo_list[READ]));
+ BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
+
+ blk_mq_sched_free_requests(dd->tags);
+ kfree(dd);
+}
+
+/*
+ * initialize elevator private data (deadline_data).
+ */
+static int dd_init_queue(struct request_queue *q, struct elevator_type *e)
+{
+ struct deadline_data *dd;
+ struct elevator_queue *eq;
+
+ eq = elevator_alloc(q, e);
+ if (!eq)
+ return -ENOMEM;
+
+ dd = kzalloc_node(sizeof(*dd), GFP_KERNEL, q->node);
+ if (!dd) {
+ kobject_put(&eq->kobj);
+ return -ENOMEM;
+ }
+ eq->elevator_data = dd;
+
+ dd->tags = blk_mq_sched_alloc_requests(256, q->node);
+ if (!dd->tags) {
+ kfree(dd);
+ kobject_put(&eq->kobj);
+ return -ENOMEM;
+ }
+
+ INIT_LIST_HEAD(&dd->fifo_list[READ]);
+ INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
+ dd->sort_list[READ] = RB_ROOT;
+ dd->sort_list[WRITE] = RB_ROOT;
+ dd->fifo_expire[READ] = read_expire;
+ dd->fifo_expire[WRITE] = write_expire;
+ dd->writes_starved = writes_starved;
+ dd->front_merges = 1;
+ dd->fifo_batch = fifo_batch;
+ spin_lock_init(&dd->lock);
+ INIT_LIST_HEAD(&dd->dispatch);
+ atomic_set(&dd->wait_index, 0);
+
+ q->elevator = eq;
+ return 0;
+}
+
+static int __dd_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio,
+ struct request **req)
+{
+ struct request_queue *q = hctx->queue;
+ struct deadline_data *dd = q->elevator->elevator_data;
+ struct request *__rq;
+ sector_t sector;
+ int ret;
+
+ /*
+ * First try one-hit cache.
+ */
+ if (q->last_merge && elv_bio_merge_ok(q->last_merge, bio)) {
+ ret = blk_try_merge(q->last_merge, bio);
+ if (ret != ELEVATOR_NO_MERGE) {
+ *req = q->last_merge;
+ return ret;
+ }
+ }
+
+ if (blk_queue_noxmerges(q))
+ return ELEVATOR_NO_MERGE;
+
+ /*
+ * See if our hash lookup can find a potential backmerge.
+ */
+ __rq = elv_rqhash_find(q, bio->bi_iter.bi_sector);
+ if (__rq && elv_bio_merge_ok(__rq, bio)) {
+ *req = __rq;
+ return ELEVATOR_BACK_MERGE;
+ }
+
+ if (!dd->front_merges)
+ return ELEVATOR_NO_MERGE;
+
+ sector = bio_end_sector(bio);
+
+ __rq = elv_rb_find(&dd->sort_list[bio_data_dir(bio)], sector);
+ if (__rq) {
+ BUG_ON(sector != blk_rq_pos(__rq));
+
+ if (elv_bio_merge_ok(__rq, bio)) {
+ *req = __rq;
+ return ELEVATOR_FRONT_MERGE;
+ }
+ }
+
+ return ELEVATOR_NO_MERGE;
+}
+
+static bool dd_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
+{
+ struct request_queue *q = hctx->queue;
+ struct deadline_data *dd = q->elevator->elevator_data;
+ struct request *rq;
+ int ret;
+
+ spin_lock(&dd->lock);
+
+ ret = __dd_bio_merge(hctx, bio, &rq);
+
+ if (ret == ELEVATOR_BACK_MERGE) {
+ if (bio_attempt_back_merge(q, rq, bio)) {
+ q->last_merge = rq;
+ elv_rqhash_reposition(q, rq);
+ if (!attempt_back_merge(q, rq))
+ elv_merged_request(q, rq, ret);
+ goto done;
+ }
+ ret = ELEVATOR_NO_MERGE;
+ } else if (ret == ELEVATOR_FRONT_MERGE) {
+ if (bio_attempt_front_merge(q, rq, bio)) {
+ q->last_merge = rq;
+ elv_rb_del(deadline_rb_root(dd, rq), rq);
+ deadline_add_rq_rb(dd, rq);
+ if (!attempt_front_merge(q, rq))
+ elv_merged_request(q, rq, ret);
+ goto done;
+ }
+ ret = ELEVATOR_NO_MERGE;
+ }
+
+done:
+ spin_unlock(&dd->lock);
+ return ret != ELEVATOR_NO_MERGE;
+}
+
+/*
+ * add rq to rbtree and fifo
+ */
+static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
+ bool at_head)
+{
+ struct request_queue *q = hctx->queue;
+ struct deadline_data *dd = q->elevator->elevator_data;
+ const int data_dir = rq_data_dir(rq);
+
+ /*
+ * If we're trying to insert a real request, just send it directly
+ * to the hardware dispatch list. This only happens for a requeue,
+ * or FUA/FLUSH requests.
+ */
+ if (!dd_rq_is_shadow(rq)) {
+ spin_lock(&hctx->lock);
+ list_add_tail(&rq->queuelist, &hctx->dispatch);
+ spin_unlock(&hctx->lock);
+ return;
+ }
+
+ spin_lock(&dd->lock);
+
+ if (at_head || rq->cmd_type != REQ_TYPE_FS) {
+ if (at_head)
+ list_add(&rq->queuelist, &dd->dispatch);
+ else
+ list_add_tail(&rq->queuelist, &dd->dispatch);
+ } else {
+ deadline_add_rq_rb(dd, rq);
+
+ if (rq_mergeable(rq)) {
+ elv_rqhash_add(q, rq);
+ if (!q->last_merge)
+ q->last_merge = rq;
+ }
+
+ /*
+ * set expire time and add to fifo list
+ */
+ rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
+ list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
+ }
+
+ spin_unlock(&dd->lock);
+}
+
+static struct request *dd_get_request(struct request_queue *q, unsigned int op,
+ struct blk_mq_alloc_data *data)
+{
+ struct deadline_data *dd = q->elevator->elevator_data;
+ struct request *rq;
+
+ /*
+ * The flush machinery intercepts before we insert the request. As
+ * a work-around, just hand it back a real request.
+ */
+ if (unlikely(op & (REQ_PREFLUSH | REQ_FUA)))
+ rq = __blk_mq_alloc_request(data, op);
+ else {
+ rq = blk_mq_sched_alloc_shadow_request(q, data, dd->tags, &dd->wait_index);
+ if (rq)
+ blk_mq_rq_ctx_init(q, data->ctx, rq, op);
+ }
+
+ return rq;
+}
+
+static void dd_put_request(struct request *rq)
+{
+ /*
+ * If it's a real request, we just have to free it. For a shadow
+ * request, we should only free it if we haven't started it. A
+ * started request is mapped to a real one, and the real one will
+ * free it. We can get here with request merges, since we then
+ * free the request before we start/issue it.
+ */
+ if (!dd_rq_is_shadow(rq))
+ blk_mq_free_request(rq);
+ else if (!(rq->rq_flags & RQF_STARTED)) {
+ struct deadline_data *dd = rq->q->elevator->elevator_data;
+
+ blk_mq_sched_free_shadow_request(dd->tags, rq);
+ }
+}
+
+static void dd_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
+{
+ struct request *sched_rq = rq->end_io_data;
+
+ /*
+ * sched_rq can be NULL, if we haven't set up the shadow yet
+ * because we failed to get one.
+ */
+ if (sched_rq) {
+ struct deadline_data *dd = hctx->queue->elevator->elevator_data;
+
+ blk_mq_sched_free_shadow_request(dd->tags, sched_rq);
+ blk_mq_start_stopped_hw_queue(hctx, true);
+ }
+}
+
+static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
+{
+ struct deadline_data *dd = hctx->queue->elevator->elevator_data;
+
+ return !list_empty_careful(&dd->dispatch) ||
+ !list_empty_careful(&dd->fifo_list[0]) ||
+ !list_empty_careful(&dd->fifo_list[1]);
+}
+
+/*
+ * sysfs parts below
+ */
+static ssize_t
+deadline_var_show(int var, char *page)
+{
+ return sprintf(page, "%d\n", var);
+}
+
+static ssize_t
+deadline_var_store(int *var, const char *page, size_t count)
+{
+ char *p = (char *) page;
+
+ *var = simple_strtol(p, &p, 10);
+ return count;
+}
+
+#define SHOW_FUNCTION(__FUNC, __VAR, __CONV) \
+static ssize_t __FUNC(struct elevator_queue *e, char *page) \
+{ \
+ struct deadline_data *dd = e->elevator_data; \
+ int __data = __VAR; \
+ if (__CONV) \
+ __data = jiffies_to_msecs(__data); \
+ return deadline_var_show(__data, (page)); \
+}
+SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[READ], 1);
+SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[WRITE], 1);
+SHOW_FUNCTION(deadline_writes_starved_show, dd->writes_starved, 0);
+SHOW_FUNCTION(deadline_front_merges_show, dd->front_merges, 0);
+SHOW_FUNCTION(deadline_fifo_batch_show, dd->fifo_batch, 0);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \
+static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count) \
+{ \
+ struct deadline_data *dd = e->elevator_data; \
+ int __data; \
+ int ret = deadline_var_store(&__data, (page), count); \
+ if (__data < (MIN)) \
+ __data = (MIN); \
+ else if (__data > (MAX)) \
+ __data = (MAX); \
+ if (__CONV) \
+ *(__PTR) = msecs_to_jiffies(__data); \
+ else \
+ *(__PTR) = __data; \
+ return ret; \
+}
+STORE_FUNCTION(deadline_read_expire_store, &dd->fifo_expire[READ], 0, INT_MAX, 1);
+STORE_FUNCTION(deadline_write_expire_store, &dd->fifo_expire[WRITE], 0, INT_MAX, 1);
+STORE_FUNCTION(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX, 0);
+STORE_FUNCTION(deadline_front_merges_store, &dd->front_merges, 0, 1, 0);
+STORE_FUNCTION(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX, 0);
+#undef STORE_FUNCTION
+
+#define DD_ATTR(name) \
+ __ATTR(name, S_IRUGO|S_IWUSR, deadline_##name##_show, \
+ deadline_##name##_store)
+
+static struct elv_fs_entry deadline_attrs[] = {
+ DD_ATTR(read_expire),
+ DD_ATTR(write_expire),
+ DD_ATTR(writes_starved),
+ DD_ATTR(front_merges),
+ DD_ATTR(fifo_batch),
+ __ATTR_NULL
+};
+
+static struct elevator_type mq_deadline = {
+ .mq_ops = {
+ .get_request = dd_get_request,
+ .put_request = dd_put_request,
+ .insert_request = dd_insert_request,
+ .dispatch_request = dd_dispatch_request,
+ .completed_request = dd_completed_request,
+ .next_request = elv_rb_latter_request,
+ .former_request = elv_rb_former_request,
+ .bio_merge = dd_bio_merge,
+ .requests_merged = dd_merged_requests,
+ .has_work = dd_has_work,
+ .init_sched = dd_init_queue,
+ .exit_sched = dd_exit_queue,
+ },
+
+ .uses_mq = true,
+ .elevator_attrs = deadline_attrs,
+ .elevator_name = "mq-deadline",
+ .elevator_owner = THIS_MODULE,
+};
+
+static int __init deadline_init(void)
+{
+ return elv_register(&mq_deadline);
+}
+
+static void __exit deadline_exit(void)
+{
+ elv_unregister(&mq_deadline);
+}
+
+module_init(deadline_init);
+module_exit(deadline_exit);
+
+MODULE_AUTHOR("Jens Axboe");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("MQ deadline IO scheduler");
--
2.7.4
We have a variant for all hardware queues, but not one for a single
hardware queue.
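A sketch of why the per-hctx variant is useful (the hook name below is
hypothetical, but it mirrors how mq-deadline uses the helper elsewhere
in this series): a scheduler that stopped a single hardware queue can
restart just that one from its completion path, rather than kicking
every queue on the device:

  static void sched_completed_request(struct blk_mq_hw_ctx *hctx,
                                      struct request *rq)
  {
          /* per-scheduler completion bookkeeping goes here */

          /* restart only this hardware queue, run it async */
          blk_mq_start_stopped_hw_queue(hctx, true);
  }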
Signed-off-by: Jens Axboe <[email protected]>
---
block/blk-mq.c | 18 +++++++++++-------
include/linux/blk-mq.h | 1 +
2 files changed, 12 insertions(+), 7 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 90db5b490df9..b216746be9d3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1064,18 +1064,22 @@ void blk_mq_start_hw_queues(struct request_queue *q)
}
EXPORT_SYMBOL(blk_mq_start_hw_queues);
+void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
+{
+ if (!blk_mq_hctx_stopped(hctx))
+ return;
+
+ clear_bit(BLK_MQ_S_STOPPED, &hctx->state);
+ blk_mq_run_hw_queue(hctx, async);
+}
+
void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async)
{
struct blk_mq_hw_ctx *hctx;
int i;
- queue_for_each_hw_ctx(q, hctx, i) {
- if (!blk_mq_hctx_stopped(hctx))
- continue;
-
- clear_bit(BLK_MQ_S_STOPPED, &hctx->state);
- blk_mq_run_hw_queue(hctx, async);
- }
+ queue_for_each_hw_ctx(q, hctx, i)
+ blk_mq_start_stopped_hw_queue(hctx, async);
}
EXPORT_SYMBOL(blk_mq_start_stopped_hw_queues);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 35a0af5ede6d..87e404aae267 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -231,6 +231,7 @@ void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx);
void blk_mq_start_hw_queue(struct blk_mq_hw_ctx *hctx);
void blk_mq_stop_hw_queues(struct request_queue *q);
void blk_mq_start_hw_queues(struct request_queue *q);
+void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async);
void blk_mq_run_hw_queues(struct request_queue *q, bool async);
void blk_mq_delay_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
--
2.7.4
Signed-off-by: Jens Axboe <[email protected]>
---
block/Kconfig.iosched | 43 +++++++++++++++++++++++++++++++++++++------
block/blk-mq-sched.c | 19 +++++++++++++++++++
block/blk-mq-sched.h | 1 +
block/blk-mq.c | 3 +++
block/elevator.c | 5 ++++-
drivers/nvme/host/pci.c | 1 +
include/linux/blk-mq.h | 1 +
7 files changed, 66 insertions(+), 7 deletions(-)
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 490ef2850fae..00502a3d76b7 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -32,12 +32,6 @@ config IOSCHED_CFQ
This is the default I/O scheduler.
-config MQ_IOSCHED_DEADLINE
- tristate "MQ deadline I/O scheduler"
- default y
- ---help---
- MQ version of the deadline IO scheduler.
-
config CFQ_GROUP_IOSCHED
bool "CFQ Group Scheduling support"
depends on IOSCHED_CFQ && BLK_CGROUP
@@ -69,6 +63,43 @@ config DEFAULT_IOSCHED
default "cfq" if DEFAULT_CFQ
default "noop" if DEFAULT_NOOP
+config MQ_IOSCHED_DEADLINE
+ tristate "MQ deadline I/O scheduler"
+ default y
+ ---help---
+ MQ version of the deadline IO scheduler.
+
+config MQ_IOSCHED_NONE
+ bool
+ default y
+
+choice
+ prompt "Default MQ I/O scheduler"
+ default MQ_IOSCHED_NONE
+ help
+ Select the I/O scheduler which will be used by default for all
+ blk-mq managed block devices.
+
+ config DEFAULT_MQ_DEADLINE
+ bool "MQ Deadline" if MQ_IOSCHED_DEADLINE=y
+
+ config DEFAULT_MQ_NONE
+ bool "None"
+
+endchoice
+
+config DEFAULT_MQ_IOSCHED
+ string
+ default "mq-deadline" if DEFAULT_MQ_DEADLINE
+ default "none" if DEFAULT_MQ_NONE
+
endmenu
+config MQ_IOSCHED_ONLY_SQ
+ bool "Enable blk-mq IO scheduler only for single queue devices"
+ default y
+ help
+ Say Y here, if you only want to enable IO scheduling on block
+ devices that have a single queue registered.
+
endif
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 9213366e67d1..bcab84d325c2 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -244,3 +244,22 @@ void __blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
blk_mq_dispatch_rq_list(hctx, &rq_list);
}
+
+int blk_mq_sched_init(struct request_queue *q)
+{
+ int ret;
+
+#if defined(CONFIG_DEFAULT_MQ_NONE)
+ return 0;
+#endif
+#if defined(CONFIG_MQ_IOSCHED_ONLY_SQ)
+ if (q->nr_hw_queues > 1)
+ return 0;
+#endif
+
+ mutex_lock(&q->sysfs_lock);
+ ret = elevator_init(q, NULL);
+ mutex_unlock(&q->sysfs_lock);
+
+ return ret;
+}
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 609c80506cfc..391ecc00f520 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -25,6 +25,7 @@ struct request *
blk_mq_sched_request_from_shadow(struct blk_mq_hw_ctx *hctx,
struct request *(*get_sched_rq)(struct blk_mq_hw_ctx *));
+int blk_mq_sched_init(struct request_queue *q);
struct blk_mq_alloc_data {
/* input parameter */
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 019de6f0fd06..9eeffd76f729 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2141,6 +2141,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
INIT_LIST_HEAD(&q->requeue_list);
spin_lock_init(&q->requeue_lock);
+ if (!(set->flags & BLK_MQ_F_NO_SCHED))
+ blk_mq_sched_init(q);
+
if (q->nr_hw_queues > 1)
blk_queue_make_request(q, blk_mq_make_request);
else
diff --git a/block/elevator.c b/block/elevator.c
index f1191b3b0ff3..368976d05f0a 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -219,7 +219,10 @@ int elevator_init(struct request_queue *q, char *name)
}
if (!e) {
- e = elevator_get(CONFIG_DEFAULT_IOSCHED, false);
+ if (q->mq_ops)
+ e = elevator_get(CONFIG_DEFAULT_MQ_IOSCHED, false);
+ else
+ e = elevator_get(CONFIG_DEFAULT_IOSCHED, false);
if (!e) {
printk(KERN_ERR
"Default I/O scheduler not found. " \
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 82b9b3f1f21d..7777ec58252f 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1186,6 +1186,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
dev->admin_tagset.timeout = ADMIN_TIMEOUT;
dev->admin_tagset.numa_node = dev_to_node(dev->dev);
dev->admin_tagset.cmd_size = nvme_cmd_size(dev);
+ dev->admin_tagset.flags = BLK_MQ_F_NO_SCHED;
dev->admin_tagset.driver_data = dev;
if (blk_mq_alloc_tag_set(&dev->admin_tagset))
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index c86b314dde97..7c470bf4d7bf 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -152,6 +152,7 @@ enum {
BLK_MQ_F_SG_MERGE = 1 << 2,
BLK_MQ_F_DEFER_ISSUE = 1 << 4,
BLK_MQ_F_BLOCKING = 1 << 5,
+ BLK_MQ_F_NO_SCHED = 1 << 6,
BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
BLK_MQ_F_ALLOC_POLICY_BITS = 1,
--
2.7.4
Signed-off-by: Jens Axboe <[email protected]>
---
block/Makefile | 2 +-
block/blk-core.c | 9 +-
block/blk-exec.c | 3 +-
block/blk-flush.c | 7 +-
block/blk-merge.c | 3 +
block/blk-mq-sched.c | 246 +++++++++++++++++++++++++++++++++++++++++++++++
block/blk-mq-sched.h | 187 +++++++++++++++++++++++++++++++++++
block/blk-mq-tag.c | 1 +
block/blk-mq.c | 150 +++++++++++++++--------------
block/blk-mq.h | 34 +++----
block/elevator.c | 181 ++++++++++++++++++++++++++--------
include/linux/blk-mq.h | 2 +-
include/linux/elevator.h | 29 +++++-
13 files changed, 713 insertions(+), 141 deletions(-)
create mode 100644 block/blk-mq-sched.c
create mode 100644 block/blk-mq-sched.h
diff --git a/block/Makefile b/block/Makefile
index a827f988c4e6..2eee9e1bb6db 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -6,7 +6,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
- blk-mq-sysfs.o blk-mq-cpumap.o ioctl.o \
+ blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \
genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
badblocks.o partitions/
diff --git a/block/blk-core.c b/block/blk-core.c
index 4b7ec5958055..3f83414d6986 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -39,6 +39,7 @@
#include "blk.h"
#include "blk-mq.h"
+#include "blk-mq-sched.h"
#include "blk-wbt.h"
EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
@@ -1428,7 +1429,7 @@ void __blk_put_request(struct request_queue *q, struct request *req)
return;
if (q->mq_ops) {
- blk_mq_free_request(req);
+ blk_mq_sched_put_request(req);
return;
}
@@ -1464,7 +1465,7 @@ void blk_put_request(struct request *req)
struct request_queue *q = req->q;
if (q->mq_ops)
- blk_mq_free_request(req);
+ blk_mq_sched_put_request(req);
else {
unsigned long flags;
@@ -1528,6 +1529,7 @@ bool bio_attempt_back_merge(struct request_queue *q, struct request *req,
blk_account_io_start(req, false);
return true;
}
+EXPORT_SYMBOL_GPL(bio_attempt_back_merge);
bool bio_attempt_front_merge(struct request_queue *q, struct request *req,
struct bio *bio)
@@ -1552,6 +1554,7 @@ bool bio_attempt_front_merge(struct request_queue *q, struct request *req,
blk_account_io_start(req, false);
return true;
}
+EXPORT_SYMBOL_GPL(bio_attempt_front_merge);
/**
* blk_attempt_plug_merge - try to merge with %current's plugged list
@@ -2173,7 +2176,7 @@ int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
if (q->mq_ops) {
if (blk_queue_io_stat(q))
blk_account_io_start(rq, true);
- blk_mq_insert_request(rq, false, true, false);
+ blk_mq_sched_insert_request(rq, false, true, false);
return 0;
}
diff --git a/block/blk-exec.c b/block/blk-exec.c
index 3ecb00a6cf45..86656fdfa637 100644
--- a/block/blk-exec.c
+++ b/block/blk-exec.c
@@ -9,6 +9,7 @@
#include <linux/sched/sysctl.h>
#include "blk.h"
+#include "blk-mq-sched.h"
/*
* for max sense size
@@ -65,7 +66,7 @@ void blk_execute_rq_nowait(struct request_queue *q, struct gendisk *bd_disk,
* be reused after dying flag is set
*/
if (q->mq_ops) {
- blk_mq_insert_request(rq, at_head, true, false);
+ blk_mq_sched_insert_request(rq, at_head, true, false);
return;
}
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 27a42dab5a36..63b91697d167 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -74,6 +74,7 @@
#include "blk.h"
#include "blk-mq.h"
#include "blk-mq-tag.h"
+#include "blk-mq-sched.h"
/* FLUSH/FUA sequences */
enum {
@@ -425,9 +426,9 @@ void blk_insert_flush(struct request *rq)
*/
if ((policy & REQ_FSEQ_DATA) &&
!(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
- if (q->mq_ops) {
- blk_mq_insert_request(rq, false, true, false);
- } else
+ if (q->mq_ops)
+ blk_mq_sched_insert_request(rq, false, true, false);
+ else
list_add_tail(&rq->queuelist, &q->queue_head);
return;
}
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 1002afdfee99..01247812e13f 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -766,6 +766,7 @@ int attempt_back_merge(struct request_queue *q, struct request *rq)
return 0;
}
+EXPORT_SYMBOL_GPL(attempt_back_merge);
int attempt_front_merge(struct request_queue *q, struct request *rq)
{
@@ -776,6 +777,7 @@ int attempt_front_merge(struct request_queue *q, struct request *rq)
return 0;
}
+EXPORT_SYMBOL_GPL(attempt_front_merge);
int blk_attempt_req_merge(struct request_queue *q, struct request *rq,
struct request *next)
@@ -825,3 +827,4 @@ int blk_try_merge(struct request *rq, struct bio *bio)
return ELEVATOR_FRONT_MERGE;
return ELEVATOR_NO_MERGE;
}
+EXPORT_SYMBOL_GPL(blk_try_merge);
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
new file mode 100644
index 000000000000..9213366e67d1
--- /dev/null
+++ b/block/blk-mq-sched.c
@@ -0,0 +1,246 @@
+#include <linux/kernel.h>
+#include <linux/module.h>
+
+#include <linux/blk-mq.h>
+#include "blk.h"
+#include "blk-mq.h"
+#include "blk-mq-sched.h"
+#include "blk-mq-tag.h"
+#include "blk-wbt.h"
+
+/*
+ * Empty set
+ */
+static struct blk_mq_ops mq_sched_tag_ops = {
+ .queue_rq = NULL,
+};
+
+void blk_mq_sched_free_requests(struct blk_mq_tags *tags)
+{
+ blk_mq_free_rq_map(NULL, tags, 0);
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_free_requests);
+
+struct blk_mq_tags *blk_mq_sched_alloc_requests(unsigned int depth,
+ unsigned int numa_node)
+{
+ struct blk_mq_tag_set set = {
+ .ops = &mq_sched_tag_ops,
+ .nr_hw_queues = 1,
+ .queue_depth = depth,
+ .numa_node = numa_node,
+ };
+
+ return blk_mq_init_rq_map(&set, 0);
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_alloc_requests);
+
+void blk_mq_sched_free_hctx_data(struct request_queue *q,
+ void (*exit)(struct blk_mq_hw_ctx *))
+{
+ struct blk_mq_hw_ctx *hctx;
+ int i;
+
+ queue_for_each_hw_ctx(q, hctx, i) {
+ if (exit)
+ exit(hctx);
+ kfree(hctx->sched_data);
+ hctx->sched_data = NULL;
+ }
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_free_hctx_data);
+
+int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
+ void (*init)(struct blk_mq_hw_ctx *))
+{
+ struct blk_mq_hw_ctx *hctx;
+ int i;
+
+ queue_for_each_hw_ctx(q, hctx, i) {
+ hctx->sched_data = kmalloc_node(size, GFP_KERNEL, hctx->numa_node);
+ if (!hctx->sched_data)
+ goto error;
+
+ if (init)
+ init(hctx);
+ }
+
+ return 0;
+error:
+ blk_mq_sched_free_hctx_data(q, NULL);
+ return -ENOMEM;
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_init_hctx_data);
+
+struct request *blk_mq_sched_alloc_shadow_request(struct request_queue *q,
+ struct blk_mq_alloc_data *data,
+ struct blk_mq_tags *tags,
+ atomic_t *wait_index)
+{
+ struct sbq_wait_state *ws;
+ DEFINE_WAIT(wait);
+ struct request *rq;
+ int tag;
+
+ tag = __sbitmap_queue_get(&tags->bitmap_tags);
+ if (tag != -1)
+ goto done;
+
+ if (data->flags & BLK_MQ_REQ_NOWAIT)
+ return NULL;
+
+ ws = sbq_wait_ptr(&tags->bitmap_tags, wait_index);
+ do {
+ prepare_to_wait(&ws->wait, &wait, TASK_UNINTERRUPTIBLE);
+
+ tag = __sbitmap_queue_get(&tags->bitmap_tags);
+ if (tag != -1)
+ break;
+
+ blk_mq_run_hw_queue(data->hctx, false);
+
+ tag = __sbitmap_queue_get(&tags->bitmap_tags);
+ if (tag != -1)
+ break;
+
+ blk_mq_put_ctx(data->ctx);
+ io_schedule();
+
+ data->ctx = blk_mq_get_ctx(data->q);
+ data->hctx = blk_mq_map_queue(data->q, data->ctx->cpu);
+ finish_wait(&ws->wait, &wait);
+ ws = sbq_wait_ptr(&tags->bitmap_tags, wait_index);
+ } while (1);
+
+ finish_wait(&ws->wait, &wait);
+done:
+ rq = tags->rqs[tag];
+ rq->tag = tag;
+ rq->rq_flags |= RQF_ALLOCED;
+ return rq;
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_alloc_shadow_request);
+
+void blk_mq_sched_free_shadow_request(struct blk_mq_tags *tags,
+ struct request *rq)
+{
+ WARN_ON_ONCE(!(rq->rq_flags & RQF_ALLOCED));
+ sbitmap_queue_clear(&tags->bitmap_tags, rq->tag, rq->mq_ctx->cpu);
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_free_shadow_request);
+
+static void rq_copy(struct request *rq, struct request *src)
+{
+#define FIELD_COPY(dst, src, name) ((dst)->name = (src)->name)
+ FIELD_COPY(rq, src, cpu);
+ FIELD_COPY(rq, src, cmd_type);
+ FIELD_COPY(rq, src, cmd_flags);
+ rq->rq_flags |= (src->rq_flags & (RQF_PREEMPT | RQF_QUIET | RQF_PM | RQF_DONTPREP));
+ rq->rq_flags &= ~RQF_IO_STAT;
+ FIELD_COPY(rq, src, __data_len);
+ FIELD_COPY(rq, src, __sector);
+ FIELD_COPY(rq, src, bio);
+ FIELD_COPY(rq, src, biotail);
+ FIELD_COPY(rq, src, rq_disk);
+ FIELD_COPY(rq, src, part);
+ FIELD_COPY(rq, src, nr_phys_segments);
+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+ FIELD_COPY(rq, src, nr_integrity_segments);
+#endif
+ FIELD_COPY(rq, src, ioprio);
+ FIELD_COPY(rq, src, timeout);
+
+ if (src->cmd_type == REQ_TYPE_BLOCK_PC) {
+ FIELD_COPY(rq, src, cmd);
+ FIELD_COPY(rq, src, cmd_len);
+ FIELD_COPY(rq, src, extra_len);
+ FIELD_COPY(rq, src, sense_len);
+ FIELD_COPY(rq, src, resid_len);
+ FIELD_COPY(rq, src, sense);
+ FIELD_COPY(rq, src, retries);
+ }
+
+ src->bio = src->biotail = NULL;
+}
+
+static void sched_rq_end_io(struct request *rq, int error)
+{
+ struct request *sched_rq = rq->end_io_data;
+
+ FIELD_COPY(sched_rq, rq, resid_len);
+ FIELD_COPY(sched_rq, rq, extra_len);
+ FIELD_COPY(sched_rq, rq, sense_len);
+ FIELD_COPY(sched_rq, rq, errors);
+ FIELD_COPY(sched_rq, rq, retries);
+
+ blk_account_io_completion(sched_rq, blk_rq_bytes(sched_rq));
+ blk_account_io_done(sched_rq);
+
+ wbt_done(sched_rq->q->rq_wb, &sched_rq->issue_stat);
+
+ if (sched_rq->end_io)
+ sched_rq->end_io(sched_rq, error);
+
+ blk_mq_free_request(rq);
+}
+
+struct request *
+blk_mq_sched_request_from_shadow(struct blk_mq_hw_ctx *hctx,
+ struct request *(*get_sched_rq)(struct blk_mq_hw_ctx *))
+{
+ struct blk_mq_alloc_data data;
+ struct request *sched_rq, *rq;
+
+ data.q = hctx->queue;
+ data.flags = BLK_MQ_REQ_NOWAIT;
+ data.ctx = blk_mq_get_ctx(hctx->queue);
+ data.hctx = hctx;
+
+ rq = __blk_mq_alloc_request(&data, 0);
+ blk_mq_put_ctx(data.ctx);
+
+ if (!rq) {
+ blk_mq_stop_hw_queue(hctx);
+ return NULL;
+ }
+
+ sched_rq = get_sched_rq(hctx);
+
+ if (!sched_rq) {
+ blk_queue_enter_live(hctx->queue);
+ __blk_mq_free_request(hctx, data.ctx, rq);
+ return NULL;
+ }
+
+ WARN_ON_ONCE(!(sched_rq->rq_flags & RQF_ALLOCED));
+ rq_copy(rq, sched_rq);
+ rq->end_io = sched_rq_end_io;
+ rq->end_io_data = sched_rq;
+
+ return rq;
+}
+EXPORT_SYMBOL_GPL(blk_mq_sched_request_from_shadow);
+
+void __blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
+{
+ struct elevator_queue *e = hctx->queue->elevator;
+ struct request *rq;
+ LIST_HEAD(rq_list);
+
+ if (unlikely(blk_mq_hctx_stopped(hctx)))
+ return;
+
+ hctx->run++;
+
+ if (!list_empty(&hctx->dispatch)) {
+ spin_lock(&hctx->lock);
+ if (!list_empty(&hctx->dispatch))
+ list_splice_init(&hctx->dispatch, &rq_list);
+ spin_unlock(&hctx->lock);
+ }
+
+ while ((rq = e->type->mq_ops.dispatch_request(hctx)) != NULL)
+ list_add_tail(&rq->queuelist, &rq_list);
+
+ blk_mq_dispatch_rq_list(hctx, &rq_list);
+}
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
new file mode 100644
index 000000000000..609c80506cfc
--- /dev/null
+++ b/block/blk-mq-sched.h
@@ -0,0 +1,187 @@
+#ifndef BLK_MQ_SCHED_H
+#define BLK_MQ_SCHED_H
+
+#include "blk-mq.h"
+
+struct blk_mq_hw_ctx;
+struct blk_mq_ctx;
+struct request_queue;
+
+struct blk_mq_tags *blk_mq_sched_alloc_requests(unsigned int depth, unsigned int numa_node);
+void blk_mq_sched_free_requests(struct blk_mq_tags *tags);
+
+int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
+ void (*init)(struct blk_mq_hw_ctx *));
+void blk_mq_sched_free_hctx_data(struct request_queue *q,
+ void (*exit)(struct blk_mq_hw_ctx *));
+
+void blk_mq_sched_free_shadow_request(struct blk_mq_tags *tags,
+ struct request *rq);
+struct request *blk_mq_sched_alloc_shadow_request(struct request_queue *q,
+ struct blk_mq_alloc_data *data,
+ struct blk_mq_tags *tags,
+ atomic_t *wait_index);
+struct request *
+blk_mq_sched_request_from_shadow(struct blk_mq_hw_ctx *hctx,
+ struct request *(*get_sched_rq)(struct blk_mq_hw_ctx *));
+
+
+struct blk_mq_alloc_data {
+ /* input parameter */
+ struct request_queue *q;
+ unsigned int flags;
+
+ /* input & output parameter */
+ struct blk_mq_ctx *ctx;
+ struct blk_mq_hw_ctx *hctx;
+};
+
+static inline void blk_mq_set_alloc_data(struct blk_mq_alloc_data *data,
+ struct request_queue *q, unsigned int flags,
+ struct blk_mq_ctx *ctx, struct blk_mq_hw_ctx *hctx)
+{
+ data->q = q;
+ data->flags = flags;
+ data->ctx = ctx;
+ data->hctx = hctx;
+}
+
+static inline bool
+blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
+{
+ struct elevator_queue *e = q->elevator;
+
+ if (blk_queue_nomerges(q) || !bio_mergeable(bio))
+ return false;
+
+ if (e) {
+ struct blk_mq_ctx *ctx = blk_mq_get_ctx(q);
+ struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+
+ blk_mq_put_ctx(ctx);
+ return e->type->mq_ops.bio_merge(hctx, bio);
+ }
+
+ return false;
+}
+
+static inline void blk_mq_sched_put_request(struct request *rq)
+{
+ struct request_queue *q = rq->q;
+ struct elevator_queue *e = q->elevator;
+
+ if (e && e->type->mq_ops.put_request)
+ e->type->mq_ops.put_request(rq);
+ else
+ blk_mq_free_request(rq);
+}
+
+static inline struct request *
+blk_mq_sched_get_request(struct request_queue *q, unsigned int op,
+ struct blk_mq_alloc_data *data)
+{
+ struct elevator_queue *e = q->elevator;
+ struct blk_mq_hw_ctx *hctx;
+ struct blk_mq_ctx *ctx;
+ struct request *rq;
+
+ blk_queue_enter_live(q);
+ ctx = blk_mq_get_ctx(q);
+ hctx = blk_mq_map_queue(q, ctx->cpu);
+
+ blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
+
+ if (e && e->type->mq_ops.get_request)
+ rq = e->type->mq_ops.get_request(q, op, data);
+ else
+ rq = __blk_mq_alloc_request(data, op);
+
+ if (rq)
+ data->hctx->queued++;
+
+ return rq;
+
+}
+
+static inline void
+blk_mq_sched_insert_request(struct request *rq, bool at_head, bool run_queue,
+ bool async)
+{
+ struct request_queue *q = rq->q;
+ struct elevator_queue *e = q->elevator;
+ struct blk_mq_ctx *ctx = rq->mq_ctx;
+ struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+
+ if (e)
+ e->type->mq_ops.insert_request(hctx, rq, at_head);
+ else {
+ spin_lock(&ctx->lock);
+ __blk_mq_insert_request(hctx, rq, at_head);
+ spin_unlock(&ctx->lock);
+ }
+
+ if (run_queue)
+ blk_mq_run_hw_queue(hctx, async);
+}
+
+static inline bool
+blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq,
+ struct bio *bio)
+{
+ struct elevator_queue *e = q->elevator;
+
+ if (e && e->type->mq_ops.allow_merge)
+ return e->type->mq_ops.allow_merge(q, rq, bio);
+
+ return true;
+}
+
+static inline void
+blk_mq_sched_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
+{
+ struct elevator_queue *e = hctx->queue->elevator;
+
+ if (e && e->type->mq_ops.completed_request)
+ e->type->mq_ops.completed_request(hctx, rq);
+}
+
+static inline void blk_mq_sched_started_request(struct request *rq)
+{
+ struct request_queue *q = rq->q;
+ struct elevator_queue *e = q->elevator;
+
+ if (e && e->type->mq_ops.started_request)
+ e->type->mq_ops.started_request(rq);
+}
+
+static inline void blk_mq_sched_requeue_request(struct request *rq)
+{
+ struct request_queue *q = rq->q;
+ struct elevator_queue *e = q->elevator;
+
+ if (e && e->type->mq_ops.requeue_request)
+ e->type->mq_ops.requeue_request(rq);
+}
+
+static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
+{
+ struct elevator_queue *e = hctx->queue->elevator;
+
+ if (e && e->type->mq_ops.has_work)
+ return e->type->mq_ops.has_work(hctx);
+
+ return false;
+}
+
+void __blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
+
+static inline void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
+{
+ if (hctx->queue->elevator)
+ __blk_mq_sched_dispatch_requests(hctx);
+ else
+ blk_mq_process_sw_list(hctx);
+}
+
+
+#endif
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index dcf5ce3ba4bf..bbd494e23d57 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -12,6 +12,7 @@
#include "blk.h"
#include "blk-mq.h"
#include "blk-mq-tag.h"
+#include "blk-mq-sched.h"
bool blk_mq_has_free_tags(struct blk_mq_tags *tags)
{
diff --git a/block/blk-mq.c b/block/blk-mq.c
index abbf7cca4d0d..019de6f0fd06 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -32,6 +32,7 @@
#include "blk-mq-tag.h"
#include "blk-stat.h"
#include "blk-wbt.h"
+#include "blk-mq-sched.h"
static DEFINE_MUTEX(all_q_mutex);
static LIST_HEAD(all_q_list);
@@ -41,7 +42,8 @@ static LIST_HEAD(all_q_list);
*/
static bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
{
- return sbitmap_any_bit_set(&hctx->ctx_map);
+ return sbitmap_any_bit_set(&hctx->ctx_map) ||
+ blk_mq_sched_has_work(hctx);
}
/*
@@ -167,8 +169,8 @@ bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx)
}
EXPORT_SYMBOL(blk_mq_can_queue);
-static void blk_mq_rq_ctx_init(struct request_queue *q, struct blk_mq_ctx *ctx,
- struct request *rq, unsigned int op)
+void blk_mq_rq_ctx_init(struct request_queue *q, struct blk_mq_ctx *ctx,
+ struct request *rq, unsigned int op)
{
INIT_LIST_HEAD(&rq->queuelist);
/* csd/requeue_work/fifo_time is initialized before use */
@@ -213,9 +215,10 @@ static void blk_mq_rq_ctx_init(struct request_queue *q, struct blk_mq_ctx *ctx,
ctx->rq_dispatched[op_is_sync(op)]++;
}
+EXPORT_SYMBOL_GPL(blk_mq_rq_ctx_init);
-static struct request *
-__blk_mq_alloc_request(struct blk_mq_alloc_data *data, unsigned int op)
+struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data,
+ unsigned int op)
{
struct request *rq;
unsigned int tag;
@@ -236,25 +239,23 @@ __blk_mq_alloc_request(struct blk_mq_alloc_data *data, unsigned int op)
return NULL;
}
+EXPORT_SYMBOL_GPL(__blk_mq_alloc_request);
struct request *blk_mq_alloc_request(struct request_queue *q, int rw,
unsigned int flags)
{
- struct blk_mq_ctx *ctx;
- struct blk_mq_hw_ctx *hctx;
- struct request *rq;
struct blk_mq_alloc_data alloc_data;
+ struct request *rq;
int ret;
ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT);
if (ret)
return ERR_PTR(ret);
- ctx = blk_mq_get_ctx(q);
- hctx = blk_mq_map_queue(q, ctx->cpu);
- blk_mq_set_alloc_data(&alloc_data, q, flags, ctx, hctx);
- rq = __blk_mq_alloc_request(&alloc_data, rw);
- blk_mq_put_ctx(ctx);
+ rq = blk_mq_sched_get_request(q, rw, &alloc_data);
+
+ blk_mq_put_ctx(alloc_data.ctx);
+ blk_queue_exit(q);
if (!rq) {
blk_queue_exit(q);
@@ -319,12 +320,14 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q, int rw,
}
EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx);
-static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx,
- struct blk_mq_ctx *ctx, struct request *rq)
+void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
+ struct request *rq)
{
const int tag = rq->tag;
struct request_queue *q = rq->q;
+ blk_mq_sched_completed_request(hctx, rq);
+
if (rq->rq_flags & RQF_MQ_INFLIGHT)
atomic_dec(&hctx->nr_active);
@@ -467,6 +470,8 @@ void blk_mq_start_request(struct request *rq)
{
struct request_queue *q = rq->q;
+ blk_mq_sched_started_request(rq);
+
trace_block_rq_issue(q, rq);
rq->resid_len = blk_rq_bytes(rq);
@@ -515,6 +520,7 @@ static void __blk_mq_requeue_request(struct request *rq)
trace_block_rq_requeue(q, rq);
wbt_requeue(q->rq_wb, &rq->issue_stat);
+ blk_mq_sched_requeue_request(rq);
if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
if (q->dma_drain_size && blk_rq_bytes(rq))
@@ -549,13 +555,13 @@ static void blk_mq_requeue_work(struct work_struct *work)
rq->rq_flags &= ~RQF_SOFTBARRIER;
list_del_init(&rq->queuelist);
- blk_mq_insert_request(rq, true, false, false);
+ blk_mq_sched_insert_request(rq, true, false, false);
}
while (!list_empty(&rq_list)) {
rq = list_entry(rq_list.next, struct request, queuelist);
list_del_init(&rq->queuelist);
- blk_mq_insert_request(rq, false, false, false);
+ blk_mq_sched_insert_request(rq, false, false, false);
}
blk_mq_run_hw_queues(q, false);
@@ -761,8 +767,16 @@ static bool blk_mq_attempt_merge(struct request_queue *q,
if (!blk_rq_merge_ok(rq, bio))
continue;
+ if (!blk_mq_sched_allow_merge(q, rq, bio))
+ break;
el_ret = blk_try_merge(rq, bio);
+ if (el_ret == ELEVATOR_NO_MERGE)
+ continue;
+
+ if (!blk_mq_sched_allow_merge(q, rq, bio))
+ break;
+
if (el_ret == ELEVATOR_BACK_MERGE) {
if (bio_attempt_back_merge(q, rq, bio)) {
ctx->rq_merged++;
@@ -909,7 +923,7 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
* of IO. In particular, we'd like FIFO behaviour on handling existing
* items on the hctx->dispatch list. Ignore that for now.
*/
-static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
+void blk_mq_process_sw_list(struct blk_mq_hw_ctx *hctx)
{
LIST_HEAD(rq_list);
LIST_HEAD(driver_list);
@@ -947,11 +961,11 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
rcu_read_lock();
- blk_mq_process_rq_list(hctx);
+ blk_mq_sched_dispatch_requests(hctx);
rcu_read_unlock();
} else {
srcu_idx = srcu_read_lock(&hctx->queue_rq_srcu);
- blk_mq_process_rq_list(hctx);
+ blk_mq_sched_dispatch_requests(hctx);
srcu_read_unlock(&hctx->queue_rq_srcu, srcu_idx);
}
}
@@ -1081,6 +1095,7 @@ void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
clear_bit(BLK_MQ_S_STOPPED, &hctx->state);
blk_mq_run_hw_queue(hctx, async);
}
+EXPORT_SYMBOL_GPL(blk_mq_start_stopped_hw_queue);
void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async)
{
@@ -1135,8 +1150,8 @@ static inline void __blk_mq_insert_req_list(struct blk_mq_hw_ctx *hctx,
list_add_tail(&rq->queuelist, &ctx->rq_list);
}
-static void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx,
- struct request *rq, bool at_head)
+void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
+ bool at_head)
{
struct blk_mq_ctx *ctx = rq->mq_ctx;
@@ -1144,21 +1159,6 @@ static void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx,
blk_mq_hctx_mark_pending(hctx, ctx);
}
-void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
- bool async)
-{
- struct blk_mq_ctx *ctx = rq->mq_ctx;
- struct request_queue *q = rq->q;
- struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
-
- spin_lock(&ctx->lock);
- __blk_mq_insert_request(hctx, rq, at_head);
- spin_unlock(&ctx->lock);
-
- if (run_queue)
- blk_mq_run_hw_queue(hctx, async);
-}
-
static void blk_mq_insert_requests(struct request_queue *q,
struct blk_mq_ctx *ctx,
struct list_head *list,
@@ -1174,17 +1174,14 @@ static void blk_mq_insert_requests(struct request_queue *q,
* preemption doesn't flush plug list, so it's possible ctx->cpu is
* offline now
*/
- spin_lock(&ctx->lock);
while (!list_empty(list)) {
struct request *rq;
rq = list_first_entry(list, struct request, queuelist);
BUG_ON(rq->mq_ctx != ctx);
list_del_init(&rq->queuelist);
- __blk_mq_insert_req_list(hctx, rq, false);
+ blk_mq_sched_insert_request(rq, false, false, false);
}
- blk_mq_hctx_mark_pending(hctx, ctx);
- spin_unlock(&ctx->lock);
blk_mq_run_hw_queue(hctx, from_schedule);
}
@@ -1285,41 +1282,27 @@ static inline bool blk_mq_merge_queue_io(struct blk_mq_hw_ctx *hctx,
}
}
-static struct request *blk_mq_map_request(struct request_queue *q,
- struct bio *bio,
- struct blk_mq_alloc_data *data)
-{
- struct blk_mq_hw_ctx *hctx;
- struct blk_mq_ctx *ctx;
- struct request *rq;
-
- blk_queue_enter_live(q);
- ctx = blk_mq_get_ctx(q);
- hctx = blk_mq_map_queue(q, ctx->cpu);
-
- trace_block_getrq(q, bio, bio->bi_opf);
- blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
- rq = __blk_mq_alloc_request(data, bio->bi_opf);
-
- data->hctx->queued++;
- return rq;
-}
-
static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
{
- int ret;
struct request_queue *q = rq->q;
- struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
struct blk_mq_queue_data bd = {
.rq = rq,
.list = NULL,
.last = 1
};
- blk_qc_t new_cookie = blk_tag_to_qc_t(rq->tag, hctx->queue_num);
+ struct blk_mq_hw_ctx *hctx;
+ blk_qc_t new_cookie;
+ int ret;
+
+ if (q->elevator)
+ goto insert;
+ hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
if (blk_mq_hctx_stopped(hctx))
goto insert;
+ new_cookie = blk_tag_to_qc_t(rq->tag, hctx->queue_num);
+
/*
* For OK queue, we are done. For error, kill it. Any other
* error (busy), just add it to our list as we previously
@@ -1341,7 +1324,7 @@ static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
}
insert:
- blk_mq_insert_request(rq, false, true, true);
+ blk_mq_sched_insert_request(rq, false, true, true);
}
/*
@@ -1374,9 +1357,14 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
blk_attempt_plug_merge(q, bio, &request_count, &same_queue_rq))
return BLK_QC_T_NONE;
+ if (blk_mq_sched_bio_merge(q, bio))
+ return BLK_QC_T_NONE;
+
wb_acct = wbt_wait(q->rq_wb, bio, NULL);
- rq = blk_mq_map_request(q, bio, &data);
+ trace_block_getrq(q, bio, bio->bi_opf);
+
+ rq = blk_mq_sched_get_request(q, bio->bi_opf, &data);
if (unlikely(!rq)) {
__wbt_done(q->rq_wb, wb_acct);
return BLK_QC_T_NONE;
@@ -1438,6 +1426,12 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
goto done;
}
+ if (q->elevator) {
+ blk_mq_put_ctx(data.ctx);
+ blk_mq_bio_to_request(rq, bio);
+ blk_mq_sched_insert_request(rq, false, true, true);
+ goto done;
+ }
if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
/*
* For a SYNC request, send it to the hardware immediately. For
@@ -1483,9 +1477,14 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
} else
request_count = blk_plug_queued_count(q);
+ if (blk_mq_sched_bio_merge(q, bio))
+ return BLK_QC_T_NONE;
+
wb_acct = wbt_wait(q->rq_wb, bio, NULL);
- rq = blk_mq_map_request(q, bio, &data);
+ trace_block_getrq(q, bio, bio->bi_opf);
+
+ rq = blk_mq_sched_get_request(q, bio->bi_opf, &data);
if (unlikely(!rq)) {
__wbt_done(q->rq_wb, wb_acct);
return BLK_QC_T_NONE;
@@ -1535,6 +1534,12 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
return cookie;
}
+ if (q->elevator) {
+ blk_mq_put_ctx(data.ctx);
+ blk_mq_bio_to_request(rq, bio);
+ blk_mq_sched_insert_request(rq, false, true, true);
+ goto done;
+ }
if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
/*
* For a SYNC request, send it to the hardware immediately. For
@@ -1547,15 +1552,16 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
}
blk_mq_put_ctx(data.ctx);
+done:
return cookie;
}
-static void blk_mq_free_rq_map(struct blk_mq_tag_set *set,
- struct blk_mq_tags *tags, unsigned int hctx_idx)
+void blk_mq_free_rq_map(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
+ unsigned int hctx_idx)
{
struct page *page;
- if (tags->rqs && set->ops->exit_request) {
+ if (tags->rqs && set && set->ops->exit_request) {
int i;
for (i = 0; i < tags->nr_tags; i++) {
@@ -1588,8 +1594,8 @@ static size_t order_to_size(unsigned int order)
return (size_t)PAGE_SIZE << order;
}
-static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
- unsigned int hctx_idx)
+struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
+ unsigned int hctx_idx)
{
struct blk_mq_tags *tags;
unsigned int i, j, entries_per_page, max_order = 4;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 3a54dd32a6fc..ddce89bb0461 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -84,26 +84,6 @@ static inline void blk_mq_put_ctx(struct blk_mq_ctx *ctx)
put_cpu();
}
-struct blk_mq_alloc_data {
- /* input parameter */
- struct request_queue *q;
- unsigned int flags;
-
- /* input & output parameter */
- struct blk_mq_ctx *ctx;
- struct blk_mq_hw_ctx *hctx;
-};
-
-static inline void blk_mq_set_alloc_data(struct blk_mq_alloc_data *data,
- struct request_queue *q, unsigned int flags,
- struct blk_mq_ctx *ctx, struct blk_mq_hw_ctx *hctx)
-{
- data->q = q;
- data->flags = flags;
- data->ctx = ctx;
- data->hctx = hctx;
-}
-
static inline bool blk_mq_hctx_stopped(struct blk_mq_hw_ctx *hctx)
{
return test_bit(BLK_MQ_S_STOPPED, &hctx->state);
@@ -114,4 +94,18 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
return hctx->nr_ctx && hctx->tags;
}
+void blk_mq_free_rq_map(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
+ unsigned int hctx_idx);
+struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
+ unsigned int hctx_idx);
+void blk_mq_rq_ctx_init(struct request_queue *q, struct blk_mq_ctx *ctx,
+ struct request *rq, unsigned int op);
+void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
+ struct request *rq);
+struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data,
+ unsigned int op);
+void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
+ bool at_head);
+void blk_mq_process_sw_list(struct blk_mq_hw_ctx *hctx);
+
#endif
diff --git a/block/elevator.c b/block/elevator.c
index 40f0c04e5ad3..f1191b3b0ff3 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -40,6 +40,7 @@
#include <trace/events/block.h>
#include "blk.h"
+#include "blk-mq-sched.h"
static DEFINE_SPINLOCK(elv_list_lock);
static LIST_HEAD(elv_list);
@@ -58,7 +59,9 @@ static int elv_iosched_allow_bio_merge(struct request *rq, struct bio *bio)
struct request_queue *q = rq->q;
struct elevator_queue *e = q->elevator;
- if (e->type->ops.elevator_allow_bio_merge_fn)
+ if (e->uses_mq && e->type->mq_ops.allow_merge)
+ return e->type->mq_ops.allow_merge(q, rq, bio);
+ else if (!e->uses_mq && e->type->ops.elevator_allow_bio_merge_fn)
return e->type->ops.elevator_allow_bio_merge_fn(q, rq, bio);
return 1;
@@ -163,6 +166,7 @@ struct elevator_queue *elevator_alloc(struct request_queue *q,
kobject_init(&eq->kobj, &elv_ktype);
mutex_init(&eq->sysfs_lock);
hash_init(eq->hash);
+ eq->uses_mq = e->uses_mq;
return eq;
}
@@ -224,7 +228,10 @@ int elevator_init(struct request_queue *q, char *name)
}
}
- err = e->ops.elevator_init_fn(q, e);
+ if (e->uses_mq)
+ err = e->mq_ops.init_sched(q, e);
+ else
+ err = e->ops.elevator_init_fn(q, e);
if (err)
elevator_put(e);
return err;
@@ -234,7 +241,9 @@ EXPORT_SYMBOL(elevator_init);
void elevator_exit(struct elevator_queue *e)
{
mutex_lock(&e->sysfs_lock);
- if (e->type->ops.elevator_exit_fn)
+ if (e->uses_mq && e->type->mq_ops.exit_sched)
+ e->type->mq_ops.exit_sched(e);
+ else if (!e->uses_mq && e->type->ops.elevator_exit_fn)
e->type->ops.elevator_exit_fn(e);
mutex_unlock(&e->sysfs_lock);
@@ -253,6 +262,7 @@ void elv_rqhash_del(struct request_queue *q, struct request *rq)
if (ELV_ON_HASH(rq))
__elv_rqhash_del(rq);
}
+EXPORT_SYMBOL_GPL(elv_rqhash_del);
void elv_rqhash_add(struct request_queue *q, struct request *rq)
{
@@ -262,12 +272,14 @@ void elv_rqhash_add(struct request_queue *q, struct request *rq)
hash_add(e->hash, &rq->hash, rq_hash_key(rq));
rq->rq_flags |= RQF_HASHED;
}
+EXPORT_SYMBOL_GPL(elv_rqhash_add);
void elv_rqhash_reposition(struct request_queue *q, struct request *rq)
{
__elv_rqhash_del(rq);
elv_rqhash_add(q, rq);
}
+EXPORT_SYMBOL_GPL(elv_rqhash_reposition);
struct request *elv_rqhash_find(struct request_queue *q, sector_t offset)
{
@@ -289,6 +301,7 @@ struct request *elv_rqhash_find(struct request_queue *q, sector_t offset)
return NULL;
}
+EXPORT_SYMBOL_GPL(elv_rqhash_find);
/*
* RB-tree support functions for inserting/lookup/removal of requests
@@ -411,6 +424,9 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
struct request *__rq;
int ret;
+ if (WARN_ON_ONCE(e->uses_mq))
+ return ELEVATOR_NO_MERGE;
+
/*
* Levels of merges:
* nomerges: No merges at all attempted
@@ -462,6 +478,9 @@ static bool elv_attempt_insert_merge(struct request_queue *q,
struct request *__rq;
bool ret;
+ if (WARN_ON_ONCE(q->elevator && q->elevator->uses_mq))
+ return false;
+
if (blk_queue_nomerges(q))
return false;
@@ -495,7 +514,7 @@ void elv_merged_request(struct request_queue *q, struct request *rq, int type)
{
struct elevator_queue *e = q->elevator;
- if (e->type->ops.elevator_merged_fn)
+ if (!e->uses_mq && e->type->ops.elevator_merged_fn)
e->type->ops.elevator_merged_fn(q, rq, type);
if (type == ELEVATOR_BACK_MERGE)
@@ -508,10 +527,15 @@ void elv_merge_requests(struct request_queue *q, struct request *rq,
struct request *next)
{
struct elevator_queue *e = q->elevator;
- const int next_sorted = next->rq_flags & RQF_SORTED;
-
- if (next_sorted && e->type->ops.elevator_merge_req_fn)
- e->type->ops.elevator_merge_req_fn(q, rq, next);
+ bool next_sorted = false;
+
+ if (e->uses_mq && e->type->mq_ops.requests_merged)
+ e->type->mq_ops.requests_merged(q, rq, next);
+ else if (e->type->ops.elevator_merge_req_fn) {
+ next_sorted = next->rq_flags & RQF_SORTED;
+ if (next_sorted)
+ e->type->ops.elevator_merge_req_fn(q, rq, next);
+ }
elv_rqhash_reposition(q, rq);
@@ -528,6 +552,9 @@ void elv_bio_merged(struct request_queue *q, struct request *rq,
{
struct elevator_queue *e = q->elevator;
+ if (WARN_ON_ONCE(e->uses_mq))
+ return;
+
if (e->type->ops.elevator_bio_merged_fn)
e->type->ops.elevator_bio_merged_fn(q, rq, bio);
}
@@ -682,8 +709,11 @@ struct request *elv_latter_request(struct request_queue *q, struct request *rq)
{
struct elevator_queue *e = q->elevator;
- if (e->type->ops.elevator_latter_req_fn)
+ if (e->uses_mq && e->type->mq_ops.next_request)
+ return e->type->mq_ops.next_request(q, rq);
+ else if (!e->uses_mq && e->type->ops.elevator_latter_req_fn)
return e->type->ops.elevator_latter_req_fn(q, rq);
+
return NULL;
}
@@ -691,7 +721,9 @@ struct request *elv_former_request(struct request_queue *q, struct request *rq)
{
struct elevator_queue *e = q->elevator;
- if (e->type->ops.elevator_former_req_fn)
+ if (e->uses_mq && e->type->mq_ops.former_request)
+ return e->type->mq_ops.former_request(q, rq);
+ if (!e->uses_mq && e->type->ops.elevator_former_req_fn)
return e->type->ops.elevator_former_req_fn(q, rq);
return NULL;
}
@@ -701,6 +733,9 @@ int elv_set_request(struct request_queue *q, struct request *rq,
{
struct elevator_queue *e = q->elevator;
+ if (WARN_ON_ONCE(e->uses_mq))
+ return 0;
+
if (e->type->ops.elevator_set_req_fn)
return e->type->ops.elevator_set_req_fn(q, rq, bio, gfp_mask);
return 0;
@@ -710,6 +745,9 @@ void elv_put_request(struct request_queue *q, struct request *rq)
{
struct elevator_queue *e = q->elevator;
+ if (WARN_ON_ONCE(e->uses_mq))
+ return;
+
if (e->type->ops.elevator_put_req_fn)
e->type->ops.elevator_put_req_fn(rq);
}
@@ -718,6 +756,9 @@ int elv_may_queue(struct request_queue *q, unsigned int op)
{
struct elevator_queue *e = q->elevator;
+ if (WARN_ON_ONCE(e->uses_mq))
+ return 0;
+
if (e->type->ops.elevator_may_queue_fn)
return e->type->ops.elevator_may_queue_fn(q, op);
@@ -728,6 +769,9 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
{
struct elevator_queue *e = q->elevator;
+ if (WARN_ON_ONCE(e->uses_mq))
+ return;
+
/*
* request is released from the driver, io must be done
*/
@@ -803,7 +847,7 @@ int elv_register_queue(struct request_queue *q)
}
kobject_uevent(&e->kobj, KOBJ_ADD);
e->registered = 1;
- if (e->type->ops.elevator_registered_fn)
+ if (!e->uses_mq && e->type->ops.elevator_registered_fn)
e->type->ops.elevator_registered_fn(q);
}
return error;
@@ -891,9 +935,14 @@ EXPORT_SYMBOL_GPL(elv_unregister);
static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
{
struct elevator_queue *old = q->elevator;
- bool registered = old->registered;
+ bool old_registered = false;
int err;
+ if (q->mq_ops) {
+ blk_mq_freeze_queue(q);
+ blk_mq_quiesce_queue(q);
+ }
+
/*
* Turn on BYPASS and drain all requests w/ elevator private data.
* Block layer doesn't call into a quiesced elevator - all requests
@@ -901,32 +950,54 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
* using INSERT_BACK. All requests have SOFTBARRIER set and no
* merge happens either.
*/
- blk_queue_bypass_start(q);
+ if (old) {
+ old_registered = old->registered;
+
+ if (!q->mq_ops)
+ blk_queue_bypass_start(q);
- /* unregister and clear all auxiliary data of the old elevator */
- if (registered)
- elv_unregister_queue(q);
+ /* unregister and clear all auxiliary data of the old elevator */
+ if (old_registered)
+ elv_unregister_queue(q);
- spin_lock_irq(q->queue_lock);
- ioc_clear_queue(q);
- spin_unlock_irq(q->queue_lock);
+ if (q->queue_lock) {
+ spin_lock_irq(q->queue_lock);
+ ioc_clear_queue(q);
+ spin_unlock_irq(q->queue_lock);
+ }
+ }
/* allocate, init and register new elevator */
- err = new_e->ops.elevator_init_fn(q, new_e);
- if (err)
- goto fail_init;
+ if (new_e) {
+ if (new_e->uses_mq)
+ err = new_e->mq_ops.init_sched(q, new_e);
+ else
+ err = new_e->ops.elevator_init_fn(q, new_e);
+ if (err)
+ goto fail_init;
- if (registered) {
err = elv_register_queue(q);
if (err)
goto fail_register;
- }
+ } else
+ q->elevator = NULL;
/* done, kill the old one and finish */
- elevator_exit(old);
- blk_queue_bypass_end(q);
+ if (old) {
+ elevator_exit(old);
+ if (!q->mq_ops)
+ blk_queue_bypass_end(q);
+ }
- blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
+ if (q->mq_ops) {
+ blk_mq_unfreeze_queue(q);
+ blk_mq_start_stopped_hw_queues(q, true);
+ }
+
+ if (new_e)
+ blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
+ else
+ blk_add_trace_msg(q, "elv switch: none");
return 0;
@@ -934,9 +1005,16 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
elevator_exit(q->elevator);
fail_init:
/* switch failed, restore and re-register old elevator */
- q->elevator = old;
- elv_register_queue(q);
- blk_queue_bypass_end(q);
+ if (old) {
+ q->elevator = old;
+ elv_register_queue(q);
+ if (!q->mq_ops)
+ blk_queue_bypass_end(q);
+ }
+ if (q->mq_ops) {
+ blk_mq_unfreeze_queue(q);
+ blk_mq_start_stopped_hw_queues(q, true);
+ }
return err;
}
@@ -949,8 +1027,11 @@ static int __elevator_change(struct request_queue *q, const char *name)
char elevator_name[ELV_NAME_MAX];
struct elevator_type *e;
- if (!q->elevator)
- return -ENXIO;
+ /*
+ * Special case for mq, turn off scheduling
+ */
+ if (q->mq_ops && !strncmp(name, "none", 4))
+ return elevator_switch(q, NULL);
strlcpy(elevator_name, name, sizeof(elevator_name));
e = elevator_get(strstrip(elevator_name), true);
@@ -959,11 +1040,23 @@ static int __elevator_change(struct request_queue *q, const char *name)
return -EINVAL;
}
- if (!strcmp(elevator_name, q->elevator->type->elevator_name)) {
+ if (q->elevator &&
+ !strcmp(elevator_name, q->elevator->type->elevator_name)) {
elevator_put(e);
return 0;
}
+ if (!e->uses_mq && q->mq_ops) {
+ printk(KERN_ERR "blk-mq-sched: elv %s does not support mq\n", elevator_name);
+ elevator_put(e);
+ return -EINVAL;
+ }
+ if (e->uses_mq && !q->mq_ops) {
+ printk(KERN_ERR "blk-mq-sched: elv %s is for mq\n", elevator_name);
+ elevator_put(e);
+ return -EINVAL;
+ }
+
return elevator_switch(q, e);
}
@@ -985,7 +1078,7 @@ ssize_t elv_iosched_store(struct request_queue *q, const char *name,
{
int ret;
- if (!q->elevator)
+ if (!q->mq_ops || q->request_fn)
return count;
ret = __elevator_change(q, name);
@@ -999,24 +1092,34 @@ ssize_t elv_iosched_store(struct request_queue *q, const char *name,
ssize_t elv_iosched_show(struct request_queue *q, char *name)
{
struct elevator_queue *e = q->elevator;
- struct elevator_type *elv;
+ struct elevator_type *elv = NULL;
struct elevator_type *__e;
int len = 0;
- if (!q->elevator || !blk_queue_stackable(q))
+ if (!blk_queue_stackable(q))
return sprintf(name, "none\n");
- elv = e->type;
+ if (!q->elevator)
+ len += sprintf(name+len, "[none] ");
+ else
+ elv = e->type;
spin_lock(&elv_list_lock);
list_for_each_entry(__e, &elv_list, list) {
- if (!strcmp(elv->elevator_name, __e->elevator_name))
+ if (elv && !strcmp(elv->elevator_name, __e->elevator_name)) {
len += sprintf(name+len, "[%s] ", elv->elevator_name);
- else
+ continue;
+ }
+ if (__e->uses_mq && q->mq_ops)
+ len += sprintf(name+len, "%s ", __e->elevator_name);
+ else if (!__e->uses_mq && !q->mq_ops)
len += sprintf(name+len, "%s ", __e->elevator_name);
}
spin_unlock(&elv_list_lock);
+ if (q->mq_ops && q->elevator)
+ len += sprintf(name+len, "none");
+
len += sprintf(len+name, "\n");
return len;
}
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 87e404aae267..c86b314dde97 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -22,6 +22,7 @@ struct blk_mq_hw_ctx {
unsigned long flags; /* BLK_MQ_F_* flags */
+ void *sched_data;
struct request_queue *queue;
struct blk_flush_queue *fq;
@@ -179,7 +180,6 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set);
void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule);
-void blk_mq_insert_request(struct request *, bool, bool, bool);
void blk_mq_free_request(struct request *rq);
void blk_mq_free_hctx_request(struct blk_mq_hw_ctx *, struct request *rq);
bool blk_mq_can_queue(struct blk_mq_hw_ctx *);
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index b276e9ef0e0b..5d013f2b9071 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -77,6 +77,28 @@ struct elevator_ops
elevator_registered_fn *elevator_registered_fn;
};
+struct blk_mq_alloc_data;
+struct blk_mq_hw_ctx;
+
+struct elevator_mq_ops {
+ int (*init_sched)(struct request_queue *, struct elevator_type *);
+ void (*exit_sched)(struct elevator_queue *);
+
+ bool (*allow_merge)(struct request_queue *, struct request *, struct bio *);
+ bool (*bio_merge)(struct blk_mq_hw_ctx *, struct bio *);
+ void (*requests_merged)(struct request_queue *, struct request *, struct request *);
+ struct request *(*get_request)(struct request_queue *, unsigned int, struct blk_mq_alloc_data *);
+ void (*put_request)(struct request *);
+ void (*insert_request)(struct blk_mq_hw_ctx *, struct request *, bool);
+ struct request *(*dispatch_request)(struct blk_mq_hw_ctx *);
+ bool (*has_work)(struct blk_mq_hw_ctx *);
+ void (*completed_request)(struct blk_mq_hw_ctx *, struct request *);
+ void (*started_request)(struct request *);
+ void (*requeue_request)(struct request *);
+ struct request *(*former_request)(struct request_queue *, struct request *);
+ struct request *(*next_request)(struct request_queue *, struct request *);
+};
+
#define ELV_NAME_MAX (16)
struct elv_fs_entry {
@@ -94,12 +116,16 @@ struct elevator_type
struct kmem_cache *icq_cache;
/* fields provided by elevator implementation */
- struct elevator_ops ops;
+ union {
+ struct elevator_ops ops;
+ struct elevator_mq_ops mq_ops;
+ };
size_t icq_size; /* see iocontext.h */
size_t icq_align; /* ditto */
struct elv_fs_entry *elevator_attrs;
char elevator_name[ELV_NAME_MAX];
struct module *elevator_owner;
+ bool uses_mq;
/* managed by elevator core */
char icq_cache_name[ELV_NAME_MAX + 5]; /* elvname + "_io_cq" */
@@ -123,6 +149,7 @@ struct elevator_queue
struct kobject kobj;
struct mutex sysfs_lock;
unsigned int registered:1;
+ unsigned int uses_mq:1;
DECLARE_HASHTABLE(hash, ELV_HASH_BITS);
};
--
2.7.4
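To make the new mq hooks concrete: a scheduler registers an elevator_type with the mq_ops side of the union filled in and uses_mq set, roughly like the sketch below. This is modelled loosely on mq-deadline from this series; the dd_* callback names are placeholders, and only the hooks a scheduler actually implements need to be set.

/*
 * Sketch of how an mq scheduler would register itself with the new
 * union ops and uses_mq flag above. The dd_* callbacks are placeholder
 * names for the scheduler's own implementations.
 */
static struct elevator_type mq_sched_sketch = {
        .mq_ops = {
                .init_sched             = dd_init_queue,
                .exit_sched             = dd_exit_queue,
                .get_request            = dd_get_request,
                .put_request            = dd_put_request,
                .insert_request         = dd_insert_request,
                .dispatch_request       = dd_dispatch_request,
                .has_work               = dd_has_work,
                .bio_merge              = dd_bio_merge,
        },
        .uses_mq        = true,
        .elevator_name  = "mq-deadline",
        .elevator_owner = THIS_MODULE,
};

Registration would then go through the usual elv_register() path, with __elevator_change()/elevator_switch() above deciding whether the type is compatible with the queue (mq scheduler on an mq queue, legacy scheduler on a legacy queue).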
On 12/08/2016 09:13 PM, Jens Axboe wrote:
> Takes a list of requests, and dispatches it. Moves any residual
> requests to the dispatch list.
>
> Signed-off-by: Jens Axboe <[email protected]>
> ---
> block/blk-mq.c | 85 ++++++++++++++++++++++++++++++++--------------------------
> block/blk-mq.h | 1 +
> 2 files changed, 48 insertions(+), 38 deletions(-)
>
Reviewed-by: Hannes Reinecke <[email protected]>
Cheers,
Hannes
--
Dr. Hannes Reinecke Teamlead Storage & Networking
[email protected] +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
On 12/08/2016 09:13 PM, Jens Axboe wrote:
> Signed-off-by: Jens Axboe <[email protected]>
> ---
> block/elevator.c | 8 ++++----
> include/linux/elevator.h | 5 +++++
> 2 files changed, 9 insertions(+), 4 deletions(-)
>
Reviewed-by: Hannes Reinecke <[email protected]>
Cheers,
Hannes
--
Dr. Hannes Reinecke Teamlead Storage & Networking
[email protected] +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
On 12/08/2016 09:13 PM, Jens Axboe wrote:
> Currently we pass in to run the queue async, but don't flag the
> queue to be run. We don't need to run it async here, but we should
> run it. So fixup the parameters.
>
> Signed-off-by: Jens Axboe <[email protected]>
> ---
> block/blk-flush.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/block/blk-flush.c b/block/blk-flush.c
> index 1bdbb3d3e5f5..27a42dab5a36 100644
> --- a/block/blk-flush.c
> +++ b/block/blk-flush.c
> @@ -426,7 +426,7 @@ void blk_insert_flush(struct request *rq)
> if ((policy & REQ_FSEQ_DATA) &&
> !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
> if (q->mq_ops) {
> - blk_mq_insert_request(rq, false, false, true);
> + blk_mq_insert_request(rq, false, true, false);
> } else
> list_add_tail(&rq->queuelist, &q->queue_head);
> return;
>
Reviewed-by: Hannes Reinecke <[email protected]>
Cheers,
Hannes
--
Dr. Hannes Reinecke Teamlead Storage & Networking
[email protected] +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
On 12/08/2016 09:13 PM, Jens Axboe wrote:
> We have a variant for all hardware queues, but not one for a single
> hardware queue.
Reviewed-by: Bart Van Assche <[email protected]>
On 12/08/2016 09:13 PM, Jens Axboe wrote:
> +static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
> +{
> + LIST_HEAD(rq_list);
> + LIST_HEAD(driver_list);
Hello Jens,
driver_list is not used in this function so please consider removing
that variable from blk_mq_process_rq_list(). Otherwise this patch looks
fine to me.
Bart.
Minor comments.
On 12/9/2016 1:43 AM, Jens Axboe wrote:
> Takes a list of requests, and dispatches it. Moves any residual
> requests to the dispatch list.
>
> Signed-off-by: Jens Axboe <[email protected]>
> ---
> block/blk-mq.c | 85 ++++++++++++++++++++++++++++++++--------------------------
> block/blk-mq.h | 1 +
> 2 files changed, 48 insertions(+), 38 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index b216746be9d3..abbf7cca4d0d 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -821,41 +821,13 @@ static inline unsigned int queued_to_index(unsigned int queued)
> return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
> }
>
> -/*
> - * Run this hardware queue, pulling any software queues mapped to it in.
> - * Note that this function currently has various problems around ordering
> - * of IO. In particular, we'd like FIFO behaviour on handling existing
> - * items on the hctx->dispatch list. Ignore that for now.
> - */
> -static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
> +bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
> {
> struct request_queue *q = hctx->queue;
> struct request *rq;
> - LIST_HEAD(rq_list);
> LIST_HEAD(driver_list);
> struct list_head *dptr;
> - int queued;
> -
> - if (unlikely(blk_mq_hctx_stopped(hctx)))
> - return;
> -
> - hctx->run++;
> -
> - /*
> - * Touch any software queue that has pending entries.
> - */
> - flush_busy_ctxs(hctx, &rq_list);
> -
> - /*
> - * If we have previous entries on our dispatch list, grab them
> - * and stuff them at the front for more fair dispatch.
> - */
> - if (!list_empty_careful(&hctx->dispatch)) {
> - spin_lock(&hctx->lock);
> - if (!list_empty(&hctx->dispatch))
> - list_splice_init(&hctx->dispatch, &rq_list);
> - spin_unlock(&hctx->lock);
> - }
> + int queued, ret = BLK_MQ_RQ_QUEUE_OK;
>
> /*
> * Start off with dptr being NULL, so we start the first request
> @@ -867,16 +839,15 @@ static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
> * Now process all the entries, sending them to the driver.
> */
> queued = 0;
> - while (!list_empty(&rq_list)) {
> + while (!list_empty(list)) {
> struct blk_mq_queue_data bd;
> - int ret;
>
> - rq = list_first_entry(&rq_list, struct request, queuelist);
> + rq = list_first_entry(list, struct request, queuelist);
> list_del_init(&rq->queuelist);
>
> bd.rq = rq;
> bd.list = dptr;
> - bd.last = list_empty(&rq_list);
> + bd.last = list_empty(list);
>
> ret = q->mq_ops->queue_rq(hctx, &bd);
> switch (ret) {
> @@ -884,7 +855,7 @@ static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
> queued++;
> break;
> case BLK_MQ_RQ_QUEUE_BUSY:
> - list_add(&rq->queuelist, &rq_list);
> + list_add(&rq->queuelist, list);
> __blk_mq_requeue_request(rq);
> break;
> default:
> @@ -902,7 +873,7 @@ static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
> * We've done the first request. If we have more than 1
> * left in the list, set dptr to defer issue.
> */
> - if (!dptr && rq_list.next != rq_list.prev)
> + if (!dptr && list->next != list->prev)
> dptr = &driver_list;
> }
>
> @@ -912,10 +883,11 @@ static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
> * Any items that need requeuing? Stuff them into hctx->dispatch,
> * that is where we will continue on next queue run.
> */
> - if (!list_empty(&rq_list)) {
> + if (!list_empty(list)) {
> spin_lock(&hctx->lock);
> - list_splice(&rq_list, &hctx->dispatch);
> + list_splice(list, &hctx->dispatch);
> spin_unlock(&hctx->lock);
> +
> /*
> * the queue is expected stopped with BLK_MQ_RQ_QUEUE_BUSY, but
> * it's possible the queue is stopped and restarted again
> @@ -927,6 +899,43 @@ static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
> **/
> blk_mq_run_hw_queue(hctx, true);
> }
> +
> + return ret != BLK_MQ_RQ_QUEUE_BUSY;
> +}
> +
> +/*
> + * Run this hardware queue, pulling any software queues mapped to it in.
> + * Note that this function currently has various problems around ordering
> + * of IO. In particular, we'd like FIFO behaviour on handling existing
> + * items on the hctx->dispatch list. Ignore that for now.
> + */
> +static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
> +{
> + LIST_HEAD(rq_list);
> + LIST_HEAD(driver_list);
driver_list is not required, since it is not used in this function anymore.
> +
> + if (unlikely(blk_mq_hctx_stopped(hctx)))
> + return;
> +
> + hctx->run++;
> +
> + /*
> + * Touch any software queue that has pending entries.
> + */
> + flush_busy_ctxs(hctx, &rq_list);
> +
> + /*
> + * If we have previous entries on our dispatch list, grab them
> + * and stuff them at the front for more fair dispatch.
> + */
> + if (!list_empty_careful(&hctx->dispatch)) {
> + spin_lock(&hctx->lock);
> + if (!list_empty(&hctx->dispatch))
list_splice_init already checks for list_empty. So this may be
redundant. Please check.
> + list_splice_init(&hctx->dispatch, &rq_list);
> + spin_unlock(&hctx->lock);
> + }
> +
> + blk_mq_dispatch_rq_list(hctx, &rq_list);
> }
>
> static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index b444370ae05b..3a54dd32a6fc 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -31,6 +31,7 @@ void blk_mq_freeze_queue(struct request_queue *q);
> void blk_mq_free_queue(struct request_queue *q);
> int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
> void blk_mq_wake_waiters(struct request_queue *q);
> +bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *, struct list_head *);
>
> /*
> * CPU hotplug helpers
>
> Il giorno 08 dic 2016, alle ore 21:13, Jens Axboe <[email protected]> ha scritto:
>
> As a followup to this posting from yesterday:
>
> https://marc.info/?l=linux-block&m=148115232806065&w=2
>
> this is version 2. I wanted to post a new one fairly quickly, as there
> ended up being a number of potential crashes in v1. This one should be
> solid, I've run mq-deadline on both NVMe and regular rotating storage,
> and we handle the various merging cases correctly.
>
> You can download it from git as well:
>
> git://git.kernel.dk/linux-block blk-mq-sched.2
>
> Note that this is based on for-4.10/block, which is in turn based on
> v4.9-rc1. I suggest pulling it into my for-next branch, which would
> then merge nicely with 'master' as well.
>
Hi Jens,
this is just to tell you that I have finished running some extensive
tests on this patch series (throughput, responsiveness, low latency
for soft real time). No regression w.r.t. blk detected, and no
crashes or other anomalies.
Starting to work on BFQ port. Please be patient with my little
expertise on mq environment, and with my next silly questions!
Thanks,
Paolo
> Changes since v1:
>
> - Add Kconfig entries to allow the user to choose what the default
> scheduler should be for blk-mq, and whether that depends on the
> number of hardware queues.
>
> - Properly abstract the whole get/put of a request, so we can manage
> the life time properly.
>
> - Enable full merging on mq-deadline (front/back, bio-to-rq, rq-to-rq).
> Has full feature parity with deadline now.
>
> - Export necessary symbols for compiling mq-deadline as a module.
>
> - Various API adjustments for the mq schedulers.
>
> - Various cleanups and improvements.
>
> - Fix a lot of bugs. A lot. Upgrade!
>
> block/Kconfig.iosched | 37 ++
> block/Makefile | 3
> block/blk-core.c | 9
> block/blk-exec.c | 3
> block/blk-flush.c | 7
> block/blk-merge.c | 3
> block/blk-mq-sched.c | 265 +++++++++++++++++++
> block/blk-mq-sched.h | 188 +++++++++++++
> block/blk-mq-tag.c | 1
> block/blk-mq.c | 254 ++++++++++--------
> block/blk-mq.h | 35 +-
> block/elevator.c | 194 ++++++++++----
> block/mq-deadline.c | 647 +++++++++++++++++++++++++++++++++++++++++++++++
> drivers/nvme/host/pci.c | 1
> include/linux/blk-mq.h | 4
> include/linux/elevator.h | 34 ++
> 16 files changed, 1495 insertions(+), 190 deletions(-)
>
On 12/13/2016 10:18 AM, Ritesh Harjani wrote:
> On 12/9/2016 1:43 AM, Jens Axboe wrote:
>> + if (!list_empty_careful(&hctx->dispatch)) {
>> + spin_lock(&hctx->lock);
>> + if (!list_empty(&hctx->dispatch))
> list_splice_init already checks for list_empty. So this may be
> redundant. Please check.
Hello Ritesh,
I think the list_empty() check is on purpose and is intended as a
performance optimization.
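Spelled out, the pattern looks like this (names taken from the quoted hunk, wrapped in a hypothetical helper for readability): the lockless list_empty_careful() avoids taking hctx->lock on the common path where the dispatch list is empty, and the list_empty() re-check under the lock is the one the splice actually relies on.

static void splice_dispatch_list(struct blk_mq_hw_ctx *hctx,
                                 struct list_head *rq_list)
{
        /* cheap, lockless check first: skip the lock if there is nothing */
        if (!list_empty_careful(&hctx->dispatch)) {
                spin_lock(&hctx->lock);
                /* authoritative check under the lock before splicing */
                if (!list_empty(&hctx->dispatch))
                        list_splice_init(&hctx->dispatch, rq_list);
                spin_unlock(&hctx->lock);
        }
}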
Bart.
On 12/08/2016 09:13 PM, Jens Axboe wrote:
> +config DEFAULT_MQ_IOSCHED
> + string
> + default "mq-deadline" if DEFAULT_MQ_DEADLINE
> + default "none" if DEFAULT_MQ_NONE
> +
> endmenu
>
> +config MQ_IOSCHED_ONLY_SQ
> + bool "Enable blk-mq IO scheduler only for single queue devices"
> + default y
> + help
> + Say Y here, if you only want to enable IO scheduling on block
> + devices that have a single queue registered.
> +
> endif
Hello Jens,
Shouldn't the MQ_IOSCHED_ONLY_SQ entry be placed before "endmenu" such
that it is displayed in the I/O scheduler menu instead of the block menu?
Bart.
On 12/08/2016 09:13 PM, Jens Axboe wrote:
> +static int dd_init_queue(struct request_queue *q, struct elevator_type *e)
> +{
> + struct deadline_data *dd;
> + struct elevator_queue *eq;
> +
> + eq = elevator_alloc(q, e);
> + if (!eq)
> + return -ENOMEM;
> +
> + dd = kzalloc_node(sizeof(*dd), GFP_KERNEL, q->node);
> + if (!dd) {
> + kobject_put(&eq->kobj);
> + return -ENOMEM;
> + }
> + eq->elevator_data = dd;
> +
> + dd->tags = blk_mq_sched_alloc_requests(256, q->node);
> + if (!dd->tags) {
> + kfree(dd);
> + kobject_put(&eq->kobj);
> + return -ENOMEM;
> + }
Hello Jens,
Please add a comment that explains where the number 256 comes from.
Thanks,
Bart.
On 12/08/2016 09:13 PM, Jens Axboe wrote:
> +/*
> + * Empty set
> + */
> +static struct blk_mq_ops mq_sched_tag_ops = {
> + .queue_rq = NULL,
> +};
Hello Jens,
Would "static struct blk_mq_ops mq_sched_tag_ops;" have been sufficient?
Can this data structure be declared 'const' if the blk_mq_ops pointers
in struct blk_mq_tag_set and struct request_queue are also declared const?
> +struct request *blk_mq_sched_alloc_shadow_request(struct request_queue *q,
> + struct blk_mq_alloc_data *data,
> + struct blk_mq_tags *tags,
> + atomic_t *wait_index)
> +{
Using the word "shadow" in the function name suggests to me that there
is a shadow request for every request and a request for every shadow
request. However, my understanding from the code is that there can be
requests without shadow requests (for e.g. a flush) and shadow requests
without requests. Shouldn't the name of this function reflect that, e.g.
by using "sched" or "elv" in the function name instead of "shadow"?
> +struct request *
> +blk_mq_sched_request_from_shadow(struct blk_mq_hw_ctx *hctx,
> + struct request *(*get_sched_rq)(struct blk_mq_hw_ctx *))
This function dequeues a request from the I/O scheduler queue, allocates
a request, copies the relevant request structure members into that
request and makes the request refer to the shadow request. Isn't the
request dispatching more important than associating the request with the
shadow request? If so, how about making the function name reflect that?
> +{
> + struct blk_mq_alloc_data data;
> + struct request *sched_rq, *rq;
> +
> + data.q = hctx->queue;
> + data.flags = BLK_MQ_REQ_NOWAIT;
> + data.ctx = blk_mq_get_ctx(hctx->queue);
> + data.hctx = hctx;
> +
> + rq = __blk_mq_alloc_request(&data, 0);
> + blk_mq_put_ctx(data.ctx);
> +
> + if (!rq) {
> + blk_mq_stop_hw_queue(hctx);
> + return NULL;
> + }
> +
> + sched_rq = get_sched_rq(hctx);
> +
> + if (!sched_rq) {
> + blk_queue_enter_live(hctx->queue);
> + __blk_mq_free_request(hctx, data.ctx, rq);
> + return NULL;
> + }
The mq deadline scheduler calls this function with get_sched_rq ==
__dd_dispatch_request. If __blk_mq_alloc_request() fails, shouldn't the
request that was removed from the scheduler queue be pushed back onto
that queue? Additionally, are you sure it's necessary to call
blk_queue_enter_live() from the error path?
Bart.
On 12/08/2016 09:13 PM, Jens Axboe wrote:
> +static inline void blk_mq_sched_put_request(struct request *rq)
> +{
> + struct request_queue *q = rq->q;
> + struct elevator_queue *e = q->elevator;
> +
> + if (e && e->type->mq_ops.put_request)
> + e->type->mq_ops.put_request(rq);
> + else
> + blk_mq_free_request(rq);
> +}
blk_mq_free_request() always triggers a call of blk_queue_exit().
dd_put_request() only triggers a call of blk_queue_exit() if it is not a
shadow request. Is that on purpose?
> +static inline struct request *
> +blk_mq_sched_get_request(struct request_queue *q, unsigned int op,
> + struct blk_mq_alloc_data *data)
> +{
> + struct elevator_queue *e = q->elevator;
> + struct blk_mq_hw_ctx *hctx;
> + struct blk_mq_ctx *ctx;
> + struct request *rq;
> +
> + blk_queue_enter_live(q);
> + ctx = blk_mq_get_ctx(q);
> + hctx = blk_mq_map_queue(q, ctx->cpu);
> +
> + blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
> +
> + if (e && e->type->mq_ops.get_request)
> + rq = e->type->mq_ops.get_request(q, op, data);
> + else
> + rq = __blk_mq_alloc_request(data, op);
> +
> + if (rq)
> + data->hctx->queued++;
> +
> + return rq;
> +
> +}
Some but not all callers of blk_mq_sched_get_request() call
blk_queue_exit() if this function returns NULL. Please consider moving
the blk_queue_exit() call from the blk_mq_alloc_request() error path
into this function. I think that will make it a lot easier to verify
whether or not the blk_queue_enter() / blk_queue_exit() calls are
balanced properly.
Additionally, since blk_queue_enter() / blk_queue_exit() calls by
blk_mq_sched_get_request() and blk_mq_sched_put_request() must be
balanced and since the latter function only calls blk_queue_exit() for
non-shadow requests, shouldn't blk_mq_sched_get_request() call
blk_queue_enter_live() only if __blk_mq_alloc_request() is called?
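Expressed as code, the suggestion amounts to something like the following. This is a hypothetical reshuffle of the helper quoted above, not the actual patch; it only addresses the queue usage reference, everything else is kept as quoted.

static inline struct request *
blk_mq_sched_get_request(struct request_queue *q, unsigned int op,
                         struct blk_mq_alloc_data *data)
{
        struct elevator_queue *e = q->elevator;
        struct blk_mq_hw_ctx *hctx;
        struct blk_mq_ctx *ctx;
        struct request *rq;

        blk_queue_enter_live(q);
        ctx = blk_mq_get_ctx(q);
        hctx = blk_mq_map_queue(q, ctx->cpu);

        blk_mq_set_alloc_data(data, q, 0, ctx, hctx);

        if (e && e->type->mq_ops.get_request)
                rq = e->type->mq_ops.get_request(q, op, data);
        else
                rq = __blk_mq_alloc_request(data, op);

        if (rq) {
                data->hctx->queued++;
                return rq;
        }

        /* drop the reference taken above, so callers never have to */
        blk_queue_exit(q);
        return NULL;
}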
Thanks,
Bart.
On 12/13/2016 01:51 AM, Bart Van Assche wrote:
> On 12/08/2016 09:13 PM, Jens Axboe wrote:
>> +static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
>> +{
>> + LIST_HEAD(rq_list);
>> + LIST_HEAD(driver_list);
>
> Hello Jens,
>
> driver_list is not used in this function so please consider removing
> that variable from blk_mq_process_rq_list(). Otherwise this patch looks
> fine to me.
Thanks Bart, this already got fixed up in the current branch.
--
Jens Axboe
On 12/13/2016 03:13 AM, Bart Van Assche wrote:
> On 12/08/2016 09:13 PM, Jens Axboe wrote:
>> +config DEFAULT_MQ_IOSCHED
>> + string
>> + default "mq-deadline" if DEFAULT_MQ_DEADLINE
>> + default "none" if DEFAULT_MQ_NONE
>> +
>> endmenu
>>
>> +config MQ_IOSCHED_ONLY_SQ
>> + bool "Enable blk-mq IO scheduler only for single queue devices"
>> + default y
>> + help
>> + Say Y here, if you only want to enable IO scheduling on block
>> + devices that have a single queue registered.
>> +
>> endif
>
> Hello Jens,
>
> Shouldn't the MQ_IOSCHED_ONLY_SQ entry be placed before "endmenu" such
> that it is displayed in the I/O scheduler menu instead of the block menu?
Good catch, yes it should. I'll move it.
--
Jens Axboe
On 12/13/2016 04:04 AM, Bart Van Assche wrote:
> On 12/08/2016 09:13 PM, Jens Axboe wrote:
>> +static int dd_init_queue(struct request_queue *q, struct elevator_type *e)
>> +{
>> + struct deadline_data *dd;
>> + struct elevator_queue *eq;
>> +
>> + eq = elevator_alloc(q, e);
>> + if (!eq)
>> + return -ENOMEM;
>> +
>> + dd = kzalloc_node(sizeof(*dd), GFP_KERNEL, q->node);
>> + if (!dd) {
>> + kobject_put(&eq->kobj);
>> + return -ENOMEM;
>> + }
>> + eq->elevator_data = dd;
>> +
>> + dd->tags = blk_mq_sched_alloc_requests(256, q->node);
>> + if (!dd->tags) {
>> + kfree(dd);
>> + kobject_put(&eq->kobj);
>> + return -ENOMEM;
>> + }
>
> Hello Jens,
>
> Please add a comment that explains where the number 256 comes from.
Pulled out of my... I'll add a comment! Really this should just be
->nr_requests soft setting, the 256 is just a random sane default that I
chose for now. I had forgotten about that, thanks.
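For the record, tying it to the soft setting could look like this. This is a sketch only, under the assumption that blk_mq_sched_alloc_requests() keeps the (depth, node) signature quoted above; the helper name and constant are made up for illustration.

/* fallback if the queue's nr_requests soft setting is not usable */
#define DD_DEFAULT_QUEUE_DEPTH  256

static unsigned int dd_queue_depth(struct request_queue *q)
{
        /* prefer the administrator-visible soft limit over a magic constant */
        return q->nr_requests ? q->nr_requests : DD_DEFAULT_QUEUE_DEPTH;
}

dd_init_queue() would then pass dd_queue_depth(q) to blk_mq_sched_alloc_requests() instead of the literal 256.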
--
Jens Axboe
On 12/13/2016 06:56 AM, Bart Van Assche wrote:
> On 12/08/2016 09:13 PM, Jens Axboe wrote:
>> +/*
>> + * Empty set
>> + */
>> +static struct blk_mq_ops mq_sched_tag_ops = {
>> + .queue_rq = NULL,
>> +};
>
> Hello Jens,
>
> Would "static struct blk_mq_ops mq_sched_tag_ops;" have been sufficient?
> Can this data structure be declared 'const' if the blk_mq_ops pointers
> in struct blk_mq_tag_set and struct request_queue are also declared const?
Yes, the static should be enough to ensure that it's all zeroes. I did
have this as const, but then realized I'd have to change a few other
places too. I'll make that change, hopefully it'll just work out.
>> +struct request *blk_mq_sched_alloc_shadow_request(struct request_queue *q,
>> + struct blk_mq_alloc_data *data,
>> + struct blk_mq_tags *tags,
>> + atomic_t *wait_index)
>> +{
>
> Using the word "shadow" in the function name suggests to me that there
> is a shadow request for every request and a request for every shadow
> request. However, my understanding from the code is that there can be
> requests without shadow requests (for e.g. a flush) and shadow requests
> without requests. Shouldn't the name of this function reflect that, e.g.
> by using "sched" or "elv" in the function name instead of "shadow"?
Shadow might not be the best name. Most do have shadows though, it's
only the rare exception like the flush, that you mention. I'll see if I
can come up with a better name.
>> +struct request *
>> +blk_mq_sched_request_from_shadow(struct blk_mq_hw_ctx *hctx,
>> + struct request *(*get_sched_rq)(struct blk_mq_hw_ctx *))
>
> This function dequeues a request from the I/O scheduler queue, allocates
> a request, copies the relevant request structure members into that
> request and makes the request refer to the shadow request. Isn't the
> request dispatching more important than associating the request with the
> shadow request? If so, how about making the function name reflect that?
Sure, I can update the naming. Will need to anyway, if we get rid of the
shadow naming.
>> +{
>> + struct blk_mq_alloc_data data;
>> + struct request *sched_rq, *rq;
>> +
>> + data.q = hctx->queue;
>> + data.flags = BLK_MQ_REQ_NOWAIT;
>> + data.ctx = blk_mq_get_ctx(hctx->queue);
>> + data.hctx = hctx;
>> +
>> + rq = __blk_mq_alloc_request(&data, 0);
>> + blk_mq_put_ctx(data.ctx);
>> +
>> + if (!rq) {
>> + blk_mq_stop_hw_queue(hctx);
>> + return NULL;
>> + }
>> +
>> + sched_rq = get_sched_rq(hctx);
>> +
>> + if (!sched_rq) {
>> + blk_queue_enter_live(hctx->queue);
>> + __blk_mq_free_request(hctx, data.ctx, rq);
>> + return NULL;
>> + }
>
> The mq deadline scheduler calls this function with get_sched_rq ==
> __dd_dispatch_request. If __blk_mq_alloc_request() fails, shouldn't the
> request that was removed from the scheduler queue be pushed back onto
> that queue? Additionally, are you sure it's necessary to call
> blk_queue_enter_live() from the error path?
If __blk_mq_alloc_request() fails, we haven't pulled a request from the
scheduler yet. The extra ref is needed because __blk_mq_alloc_request()
doesn't take a queue reference of its own, while __blk_mq_free_request()
does put one.
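For readers following the reference counting, here is the error path quoted above with the accounting spelled out in comments (same code as in the patch; blk_queue_enter_live()/blk_queue_exit() track the queue usage count):

        rq = __blk_mq_alloc_request(&data, 0);  /* takes no queue reference itself */
        blk_mq_put_ctx(data.ctx);

        if (!rq) {
                /* nothing pulled from the scheduler yet, nothing to undo */
                blk_mq_stop_hw_queue(hctx);
                return NULL;
        }

        sched_rq = get_sched_rq(hctx);

        if (!sched_rq) {
                /*
                 * __blk_mq_free_request() drops a queue reference via
                 * blk_queue_exit(), so take one first to stay balanced.
                 */
                blk_queue_enter_live(hctx->queue);
                __blk_mq_free_request(hctx, data.ctx, rq);
                return NULL;
        }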
--
Jens Axboe
On Tue, Dec 13 2016, Paolo Valente wrote:
>
> > Il giorno 08 dic 2016, alle ore 21:13, Jens Axboe <[email protected]> ha scritto:
> >
> > As a followup to this posting from yesterday:
> >
> > https://marc.info/?l=linux-block&m=148115232806065&w=2
> >
> > this is version 2. I wanted to post a new one fairly quickly, as there
> > ended up being a number of potential crashes in v1. This one should be
> > solid, I've run mq-deadline on both NVMe and regular rotating storage,
> > and we handle the various merging cases correctly.
> >
> > You can download it from git as well:
> >
> > git://git.kernel.dk/linux-block blk-mq-sched.2
> >
> > Note that this is based on for-4.10/block, which is in turn based on
> > v4.9-rc1. I suggest pulling it into my for-next branch, which would
> > then merge nicely with 'master' as well.
> >
>
> Hi Jens,
> this is just to tell you that I have finished running some extensive
> tests on this patch series (throughput, responsiveness, low latency
> for soft real time). No regression w.r.t. blk detected, and no
> crashes or other anomalies.
>
> Starting to work on BFQ port. Please be patient with my little
> expertise on mq environment, and with my next silly questions!
No worries, ask away if you have questions. As you might have seen, it's
still a little bit of a moving target, but it's getting closer every
day. I'll post a v3 later today hopefully that will be a good fix point
for you. I'll need to add the io context setup etc, that's not there
yet, as only cfq/bfq uses that.
--
Jens Axboe
On Tue, Dec 13 2016, Bart Van Assche wrote:
> On 12/08/2016 09:13 PM, Jens Axboe wrote:
> >+static inline void blk_mq_sched_put_request(struct request *rq)
> >+{
> >+ struct request_queue *q = rq->q;
> >+ struct elevator_queue *e = q->elevator;
> >+
> >+ if (e && e->type->mq_ops.put_request)
> >+ e->type->mq_ops.put_request(rq);
> >+ else
> >+ blk_mq_free_request(rq);
> >+}
>
> blk_mq_free_request() always triggers a call of blk_queue_exit().
> dd_put_request() only triggers a call of blk_queue_exit() if it is not a
> shadow request. Is that on purpose?
If the scheduler doesn't define get/put requests, then the lifetime
follows the normal setup. If we do define them, then dd_put_request()
only wants to put the request if it's one where we did setup a shadow.
> >+static inline struct request *
> >+blk_mq_sched_get_request(struct request_queue *q, unsigned int op,
> >+ struct blk_mq_alloc_data *data)
> >+{
> >+ struct elevator_queue *e = q->elevator;
> >+ struct blk_mq_hw_ctx *hctx;
> >+ struct blk_mq_ctx *ctx;
> >+ struct request *rq;
> >+
> >+ blk_queue_enter_live(q);
> >+ ctx = blk_mq_get_ctx(q);
> >+ hctx = blk_mq_map_queue(q, ctx->cpu);
> >+
> >+ blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
> >+
> >+ if (e && e->type->mq_ops.get_request)
> >+ rq = e->type->mq_ops.get_request(q, op, data);
> >+ else
> >+ rq = __blk_mq_alloc_request(data, op);
> >+
> >+ if (rq)
> >+ data->hctx->queued++;
> >+
> >+ return rq;
> >+
> >+}
>
> Some but not all callers of blk_mq_sched_get_request() call blk_queue_exit()
> if this function returns NULL. Please consider moving the blk_queue_exit()
> call from the blk_mq_alloc_request() error path into this function. I think
> that will make it a lot easier to verify whether or not the
> blk_queue_enter() / blk_queue_exit() calls are balanced properly.
Agree, I'll make the change, it'll be easier to read then.
> Additionally, since blk_queue_enter() / blk_queue_exit() calls by
> blk_mq_sched_get_request() and blk_mq_sched_put_request() must be balanced
> and since the latter function only calls blk_queue_exit() for non-shadow
> requests, shouldn't blk_mq_sched_get_request() call blk_queue_enter_live()
> only if __blk_mq_alloc_request() is called?
I'll double check that part, there might be a bug or at least a chance
to clean this up a bit. I did verify most of this at some point, and
tested it with the scheduler switching. That part falls apart pretty
quickly, if the references aren't matched exactly.
--
Jens Axboe
> Il giorno 13 dic 2016, alle ore 16:17, Jens Axboe <[email protected]> ha scritto:
>
> On Tue, Dec 13 2016, Paolo Valente wrote:
>>
>>> Il giorno 08 dic 2016, alle ore 21:13, Jens Axboe <[email protected]> ha scritto:
>>>
>>> As a followup to this posting from yesterday:
>>>
>>> https://marc.info/?l=linux-block&m=148115232806065&w=2
>>>
>>> this is version 2. I wanted to post a new one fairly quickly, as there
>>> ended up being a number of potential crashes in v1. This one should be
>>> solid, I've run mq-deadline on both NVMe and regular rotating storage,
>>> and we handle the various merging cases correctly.
>>>
>>> You can download it from git as well:
>>>
>>> git://git.kernel.dk/linux-block blk-mq-sched.2
>>>
>>> Note that this is based on for-4.10/block, which is in turn based on
>>> v4.9-rc1. I suggest pulling it into my for-next branch, which would
>>> then merge nicely with 'master' as well.
>>>
>>
>> Hi Jens,
>> this is just to tell you that I have finished running some extensive
>> tests on this patch series (throughput, responsiveness, low latency
>> for soft real time). No regression w.r.t. blk detected, and no
>> crashes or other anomalies.
>>
>> Starting to work on BFQ port. Please be patient with my little
>> expertise on mq environment, and with my next silly questions!
>
> No worries, ask away if you have questions. As you might have seen, it's
> still a little bit of a moving target, but it's getting closer every
> day. I'll post a v3 later today hopefully that will be a good fix point
> for you. I'll need to add the io context setup etc, that's not there
> yet, as only cfq/bfq uses that.
>
You anticipated the question that was worrying me more, how to handle
iocontexts :) I'll go on studying your patches while waiting for this
(last, right?) missing piece for bfq.
Should you implement a modified version of cfq, to test your last
extensions, I would of course appreciate very much to have a look at
it (if you are willing to share it, of course).
Thanks,
Paolo
> --
> Jens Axboe
On 12/13/2016 09:15 AM, Paolo Valente wrote:
>
>> Il giorno 13 dic 2016, alle ore 16:17, Jens Axboe <[email protected]> ha scritto:
>>
>> On Tue, Dec 13 2016, Paolo Valente wrote:
>>>
>>>> Il giorno 08 dic 2016, alle ore 21:13, Jens Axboe <[email protected]> ha scritto:
>>>>
>>>> As a followup to this posting from yesterday:
>>>>
>>>> https://marc.info/?l=linux-block&m=148115232806065&w=2
>>>>
>>>> this is version 2. I wanted to post a new one fairly quickly, as there
>>>> ended up being a number of potential crashes in v1. This one should be
>>>> solid, I've run mq-deadline on both NVMe and regular rotating storage,
>>>> and we handle the various merging cases correctly.
>>>>
>>>> You can download it from git as well:
>>>>
>>>> git://git.kernel.dk/linux-block blk-mq-sched.2
>>>>
>>>> Note that this is based on for-4.10/block, which is in turn based on
>>>> v4.9-rc1. I suggest pulling it into my for-next branch, which would
>>>> then merge nicely with 'master' as well.
>>>>
>>>
>>> Hi Jens,
>>> this is just to tell you that I have finished running some extensive
>>> tests on this patch series (throughput, responsiveness, low latency
>>> for soft real time). No regression w.r.t. blk detected, and no
>>> crashes or other anomalies.
>>>
>>> Starting to work on BFQ port. Please be patient with my little
>>> expertise on mq environment, and with my next silly questions!
>>
>> No worries, ask away if you have questions. As you might have seen, it's
>> still a little bit of a moving target, but it's getting closer every
>> day. I'll post a v3 later today hopefully that will be a good fix point
>> for you. I'll need to add the io context setup etc, that's not there
>> yet, as only cfq/bfq uses that.
>>
>
> You anticipated the question that was worrying me more, how to handle
> iocontexts :) I'll go on studying your patches while waiting for this
> (last, right?) missing piece for bfq.
It's the last missing larger piece. We probably have a few hooks that
BFQ/CFQ currently uses that aren't wired up yet in the elevator_ops for
mq, so you'll probably have to do those as you go. I can take a look,
but I would prefer they be done on an as-needed basis. Perhaps we can
get rid of some of them.
> Should you implement a modified version of cfq, to test your last
> extensions, I would of course appreciate very much to have a look at
> it (if you are willing to share it, of course).
I most likely won't do that, as it would be a waste of time on my end.
If you need help with the BFQ parts, I'll help you out.
--
Jens Axboe
On 12/13/2016 09:28 AM, Jens Axboe wrote:
>>> No worries, ask away if you have questions. As you might have seen, it's
>>> still a little bit of a moving target, but it's getting closer every
>>> day. I'll post a v3 later today hopefully that will be a good fix point
>>> for you. I'll need to add the io context setup etc, that's not there
>>> yet, as only cfq/bfq uses that.
>>>
>>
>> You anticipated the question that was worrying me more, how to handle
>> iocontexts :) I'll go on studying your patches while waiting for this
>> (last, right?) missing piece for bfq.
>
> It's the last missing larger piece. We probably have a few hooks that
> BFQ/CFQ currently uses that aren't wired up yet in the elevator_ops for
> mq, so you'll probably have to do those as you go. I can take a look,
> but I would prefer they be done on an as-needed basis. Perhaps we can
> get rid of some of them.
The current 'blk-mq-sched' branch has support for getting the IO
contexts set up and assigned to requests. It only works off the task
io_context for now; we ignore anything set in the bio. But that's a
minor thing, generally it should work for you.
Note that the mq ops have different naming than the classic elevator
ops. For instance, the set_request/put_request are
get_rq_priv/put_rq_priv instead. Others are different as well. In
general, refer to mq-deadline.c for how the hooks work and you can
compare with deadline-iosched.c, since they are still very close.
Note that the io context linking uses the embedded queue lock,
q->queue_lock, whereas for other things you are free to use a lock
embedded in your elevator data. Again, refer to mq-deadline, which uses
dd->lock to protect the hash/rbtree. If mq-deadline used io contexts, it
would manage those behind q->queue_lock.
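As a hypothetical illustration of that locking split (all names below are invented; the structure mirrors what mq-deadline does with dd->lock):

/*
 * io_context linking would be serialized by the queue's embedded
 * q->queue_lock, while the scheduler's own structures use a private
 * lock carried in the elevator data.
 */
struct my_sched_data {
        spinlock_t lock;                /* protects fifo below */
        struct list_head fifo;
};

static void my_sched_insert_request(struct blk_mq_hw_ctx *hctx,
                                    struct request *rq, bool at_head)
{
        struct my_sched_data *sd = hctx->queue->elevator->elevator_data;

        spin_lock(&sd->lock);           /* private lock, not q->queue_lock */
        if (at_head)
                list_add(&rq->queuelist, &sd->fifo);
        else
                list_add_tail(&rq->queuelist, &sd->fifo);
        spin_unlock(&sd->lock);
}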
--
Jens Axboe
On 12/08/2016 09:13 PM, Jens Axboe wrote:
> +static inline bool dd_rq_is_shadow(struct request *rq)
> +{
> + return rq->rq_flags & RQF_ALLOCED;
> +}
Hello Jens,
Something minor: because req_flags_t has been defined using __bitwise
(typedef __u32 __bitwise req_flags_t) sparse complains for the above
function about converting req_flags_t into bool. How about changing the
body of that function into "return (rq->rq_flags & RQF_ALLOCED) != 0" to
keep sparse happy?
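i.e. something like:

static inline bool dd_rq_is_shadow(struct request *rq)
{
        /* explicit comparison avoids the sparse warning about converting
         * the __bitwise req_flags_t to bool */
        return (rq->rq_flags & RQF_ALLOCED) != 0;
}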
Bart.
On 12/13/2016 04:14 PM, Jens Axboe wrote:
> On 12/13/2016 06:56 AM, Bart Van Assche wrote:
>> On 12/08/2016 09:13 PM, Jens Axboe wrote:
>>> +struct request *blk_mq_sched_alloc_shadow_request(struct request_queue *q,
>>> + struct blk_mq_alloc_data *data,
>>> + struct blk_mq_tags *tags,
>>> + atomic_t *wait_index)
>>> +{
>>
>> Using the word "shadow" in the function name suggests to me that there
>> is a shadow request for every request and a request for every shadow
>> request. However, my understanding from the code is that there can be
>> requests without shadow requests (for e.g. a flush) and shadow requests
>> without requests. Shouldn't the name of this function reflect that, e.g.
>> by using "sched" or "elv" in the function name instead of "shadow"?
>
> Shadow might not be the best name. Most do have shadows though, it's
> only the rare exception like the flush, that you mention. I'll see if I
> can come up with a better name.
Hello Jens,
One aspect of this patch series that might turn out to be a maintenance
burden is the copying between original and shadow requests. It is easy
to overlook that rq_copy() has to be updated if a field would ever be
added to struct request. Additionally, having to allocate two request
structures per I/O instead of one will have a runtime overhead. Do you
think the following approach would work?
- Instead of using two request structures per I/O, only use a single
request structure.
- Instead of storing one tag in the request structure, store two tags
in that structure. One tag comes from the I/O scheduler tag set
(size: nr_requests) and the other from the tag set associated with
the block driver (size: HBA queue depth).
- Only add a request to the hctx dispatch list after a block driver tag
has been assigned. This means that an I/O scheduler must keep a
request structure on a list it manages itself as long as no block
driver tag has been assigned.
- sysfs_list_show() is modified such that it shows both tags.
Thanks,
Bart.
On 12/14/2016 01:09 AM, Bart Van Assche wrote:
> On 12/08/2016 09:13 PM, Jens Axboe wrote:
>> +static inline bool dd_rq_is_shadow(struct request *rq)
>> +{
>> + return rq->rq_flags & RQF_ALLOCED;
>> +}
>
> Hello Jens,
>
> Something minor: because req_flags_t has been defined using __bitwise
> (typedef __u32 __bitwise req_flags_t) sparse complains for the above
> function about converting req_flags_t into bool. How about changing the
> body of that function into "return (rq->rq_flags & RQF_ALLOCED) != 0" to
> keep sparse happy?
Sure, I can fold in that change.
--
Jens Axboe
On 12/14/2016 03:31 AM, Bart Van Assche wrote:
> On 12/13/2016 04:14 PM, Jens Axboe wrote:
>> On 12/13/2016 06:56 AM, Bart Van Assche wrote:
>>> On 12/08/2016 09:13 PM, Jens Axboe wrote:
>>>> +struct request *blk_mq_sched_alloc_shadow_request(struct request_queue *q,
>>>> + struct blk_mq_alloc_data *data,
>>>> + struct blk_mq_tags *tags,
>>>> + atomic_t *wait_index)
>>>> +{
>>>
>>> Using the word "shadow" in the function name suggests to me that there
>>> is a shadow request for every request and a request for every shadow
>>> request. However, my understanding from the code is that there can be
>>> requests without shadow requests (for e.g. a flush) and shadow requests
>>> without requests. Shouldn't the name of this function reflect that, e.g.
>>> by using "sched" or "elv" in the function name instead of "shadow"?
>>
>> Shadow might not be the best name. Most do have shadows though, it's
>> only the rare exception like the flush, that you mention. I'll see if I
>> can come up with a better name.
>
> Hello Jens,
>
> One aspect of this patch series that might turn out to be a maintenance
> burden is the copying between original and shadow requests. It is easy
> to overlook that rq_copy() has to be updated if a field would ever be
> added to struct request. Additionally, having to allocate two request
> structures per I/O instead of one will have a runtime overhead. Do you
> think the following approach would work?
> - Instead of using two request structures per I/O, only use a single
> request structure.
> - Instead of storing one tag in the request structure, store two tags
> in that structure. One tag comes from the I/O scheduler tag set
> (size: nr_requests) and the other from the tag set associated with
> the block driver (size: HBA queue depth).
> - Only add a request to the hctx dispatch list after a block driver tag
> has been assigned. This means that an I/O scheduler must keep a
> request structure on a list it manages itself as long as no block
> driver tag has been assigned.
> - sysfs_list_show() is modified such that it shows both tags.
I have considered doing exactly that, but decided to go down the other
path. I may still revisit it; it's not that I'm a huge fan of the shadow
requests and the necessary copying. We don't change struct request that
often, so I don't think it's going to be a big maintenance burden. But
it'd be hard to claim that it's super pretty...
I'll play with the idea.
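For reference, the dispatch rule Bart describes might look roughly like this. It is a sketch only; all names below are invented for illustration, and the driver-tag helper is assumed, not an existing API.

/* hypothetical helper: take a tag from the driver (hardware) tag set */
static bool two_tag_assign_driver_tag(struct blk_mq_hw_ctx *hctx,
                                      struct request *rq);

/*
 * "One request, two tags" dispatch rule: requests on the scheduler's
 * private list hold only a scheduler tag; they are moved towards the
 * dispatch path only once a driver tag has been assigned.
 */
static void two_tag_dispatch(struct blk_mq_hw_ctx *hctx,
                             struct list_head *sched_list,
                             struct list_head *driver_list)
{
        struct request *rq, *next;

        list_for_each_entry_safe(rq, next, sched_list, queuelist) {
                if (!two_tag_assign_driver_tag(hctx, rq))
                        break;          /* out of driver tags, retry later */
                list_move_tail(&rq->queuelist, driver_list);
        }
}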
--
Jens Axboe