2014-06-06 12:21:16

by Matias Bjørling

[permalink] [raw]
Subject: [PATCH v6] conversion to blk-mq

Hi all,

Here is the updated patchset, on top of Jens' latest for-linus branch. You may
use the nvmemq_review branch at:

https://github.com/MatiasBjorling/linux-collab nvmemq_review

and see the individual contributions in the nvmemq_v6 branch.

I've removed the WIP flag, let me know if there is any outstanding work.
If not, I'll create final version on top of latest linus master and Matthew
master branch patches for final patch.

Changes since v5:
* Splits are now supported directly within blk-mq
* Remove nvme_queue->cpu_mask variable
* Remove unnecessary null check
* Style fixups

Changes since v4:
* Fix timeout retries
* Fix naming in nvme_init_hctx
* Fix racy behavior of admin queue in nvme_dev_remove
* Fix wrong return values in nvme_queue_request
* Put cqe_seen back
* Introduce abort_completion for killing timed out I/Os
* Move locks outside of nvme_submit_iod
* Various renaming and style fixes

Changes since v3:
* Added abortion logic
* Fixed possible race on abortion
* Removed req data with flush. Handled by by blk-mq
* Added safety check for submitting user rq to admin queue.
* Use dev->online_queues for nr_hw_queues
* Fix loop with initialization in nvme_create_io_queues
* Style fixups

Changes since v2:
* rebased on top of current 3.16/core.
* use blk-mq queue management for spreading io queues
* removed rcu handling and allocated all io queues up front for mgmt by blk-mq
* removed the need for hotplugging notification
* fixed flush data handling
* fixed double free of spinlock
* various cleanup

Matias Bjørling (1):
NVMe: conversion to blk-mq

drivers/block/nvme-core.c | 1178 +++++++++++++++++++--------------------------
drivers/block/nvme-scsi.c | 8 +-
include/linux/nvme.h | 14 +-
3 files changed, 496 insertions(+), 704 deletions(-)

--
1.9.1


2014-06-06 12:21:27

by Matias Bjørling

[permalink] [raw]
Subject: [PATCH v6] NVMe: conversion to blk-mq

This converts the current NVMe driver to utilize the blk-mq layer.

Contributions in this patch from:

Sam Bradshaw <[email protected]>
Jens Axboe <[email protected]>
Keith Busch <[email protected]>
Christoph Hellwig <[email protected]>
Robert Nelson <[email protected]>

Signed-off-by: Matias Bjørling <[email protected]>
---
drivers/block/nvme-core.c | 1178 +++++++++++++++++++--------------------------
drivers/block/nvme-scsi.c | 8 +-
include/linux/nvme.h | 14 +-
3 files changed, 496 insertions(+), 704 deletions(-)

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index f8c032b..6255ee7 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -13,9 +13,9 @@
*/

#include <linux/nvme.h>
-#include <linux/bio.h>
#include <linux/bitops.h>
#include <linux/blkdev.h>
+#include <linux/blk-mq.h>
#include <linux/cpu.h>
#include <linux/delay.h>
#include <linux/errno.h>
@@ -33,7 +33,6 @@
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/pci.h>
-#include <linux/percpu.h>
#include <linux/poison.h>
#include <linux/ptrace.h>
#include <linux/sched.h>
@@ -42,11 +41,11 @@
#include <scsi/sg.h>
#include <asm-generic/io-64-nonatomic-lo-hi.h>

-#define NVME_Q_DEPTH 1024
+#define NVME_Q_DEPTH 1024
+#define NVME_AQ_DEPTH 64
#define SQ_SIZE(depth) (depth * sizeof(struct nvme_command))
#define CQ_SIZE(depth) (depth * sizeof(struct nvme_completion))
-#define ADMIN_TIMEOUT (60 * HZ)
-#define IOD_TIMEOUT (4 * NVME_IO_TIMEOUT)
+#define ADMIN_TIMEOUT (60 * HZ)

unsigned char io_timeout = 30;
module_param(io_timeout, byte, 0644);
@@ -65,10 +64,12 @@ static struct workqueue_struct *nvme_workq;
static wait_queue_head_t nvme_kthread_wait;

static void nvme_reset_failed_dev(struct work_struct *ws);
+static int nvme_process_cq(struct nvme_queue *nvmeq);

struct async_cmd_info {
struct kthread_work work;
struct kthread_worker *worker;
+ struct request *req;
u32 result;
int status;
void *ctx;
@@ -79,7 +80,6 @@ struct async_cmd_info {
* commands and one for I/O commands).
*/
struct nvme_queue {
- struct rcu_head r_head;
struct device *q_dmadev;
struct nvme_dev *dev;
char irqname[24]; /* nvme4294967295-65535\0 */
@@ -88,10 +88,6 @@ struct nvme_queue {
volatile struct nvme_completion *cqes;
dma_addr_t sq_dma_addr;
dma_addr_t cq_dma_addr;
- wait_queue_head_t sq_full;
- wait_queue_t sq_cong_wait;
- struct bio_list sq_cong;
- struct list_head iod_bio;
u32 __iomem *q_db;
u16 q_depth;
u16 cq_vector;
@@ -102,9 +98,8 @@ struct nvme_queue {
u8 cq_phase;
u8 cqe_seen;
u8 q_suspended;
- cpumask_var_t cpu_mask;
struct async_cmd_info cmdinfo;
- unsigned long cmdid_data[];
+ struct blk_mq_hw_ctx *hctx;
};

/*
@@ -132,62 +127,72 @@ typedef void (*nvme_completion_fn)(struct nvme_queue *, void *,
struct nvme_cmd_info {
nvme_completion_fn fn;
void *ctx;
- unsigned long timeout;
int aborted;
+ struct nvme_queue *nvmeq;
};

-static struct nvme_cmd_info *nvme_cmd_info(struct nvme_queue *nvmeq)
+static int nvme_admin_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
+ unsigned int hctx_idx)
{
- return (void *)&nvmeq->cmdid_data[BITS_TO_LONGS(nvmeq->q_depth)];
+ struct nvme_dev *dev = data;
+ struct nvme_queue *nvmeq = dev->queues[0];
+
+ WARN_ON(nvmeq->hctx);
+ nvmeq->hctx = hctx;
+ hctx->driver_data = nvmeq;
+ return 0;
}

-static unsigned nvme_queue_extra(int depth)
+static int nvme_admin_init_request(void *data, struct request *req,
+ unsigned int hctx_idx, unsigned int rq_idx,
+ unsigned int numa_node)
{
- return DIV_ROUND_UP(depth, 8) + (depth * sizeof(struct nvme_cmd_info));
+ struct nvme_dev *dev = data;
+ struct nvme_cmd_info *cmd = blk_mq_rq_to_pdu(req);
+ struct nvme_queue *nvmeq = dev->queues[0];
+
+ WARN_ON(!nvmeq);
+ WARN_ON(!cmd);
+ cmd->nvmeq = nvmeq;
+ return 0;
}

-/**
- * alloc_cmdid() - Allocate a Command ID
- * @nvmeq: The queue that will be used for this command
- * @ctx: A pointer that will be passed to the handler
- * @handler: The function to call on completion
- *
- * Allocate a Command ID for a queue. The data passed in will
- * be passed to the completion handler. This is implemented by using
- * the bottom two bits of the ctx pointer to store the handler ID.
- * Passing in a pointer that's not 4-byte aligned will cause a BUG.
- * We can change this if it becomes a problem.
- *
- * May be called with local interrupts disabled and the q_lock held,
- * or with interrupts enabled and no locks held.
- */
-static int alloc_cmdid(struct nvme_queue *nvmeq, void *ctx,
- nvme_completion_fn handler, unsigned timeout)
+static int nvme_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
+ unsigned int hctx_idx)
{
- int depth = nvmeq->q_depth - 1;
- struct nvme_cmd_info *info = nvme_cmd_info(nvmeq);
- int cmdid;
+ struct nvme_dev *dev = data;
+ struct nvme_queue *nvmeq = dev->queues[(hctx_idx % dev->queue_count)
+ + 1];
+ /* nvmeq queues are shared between namespaces. We assume here that
+ * blk-mq map the tags so they match up with the nvme queue tags */
+ if (!nvmeq->hctx)
+ nvmeq->hctx = hctx;
+ else
+ WARN_ON(nvmeq->hctx->tags != hctx->tags);
+ hctx->driver_data = nvmeq;
+ return 0;
+}

- do {
- cmdid = find_first_zero_bit(nvmeq->cmdid_data, depth);
- if (cmdid >= depth)
- return -EBUSY;
- } while (test_and_set_bit(cmdid, nvmeq->cmdid_data));
+static int nvme_init_request(void *data, struct request *req,
+ unsigned int hctx_idx, unsigned int rq_idx,
+ unsigned int numa_node)
+{
+ struct nvme_dev *dev = data;
+ struct nvme_cmd_info *cmd = blk_mq_rq_to_pdu(req);
+ struct nvme_queue *nvmeq = dev->queues[hctx_idx + 1];

- info[cmdid].fn = handler;
- info[cmdid].ctx = ctx;
- info[cmdid].timeout = jiffies + timeout;
- info[cmdid].aborted = 0;
- return cmdid;
+ WARN_ON(!nvmeq);
+ WARN_ON(!cmd);
+ cmd->nvmeq = nvmeq;
+ return 0;
}

-static int alloc_cmdid_killable(struct nvme_queue *nvmeq, void *ctx,
- nvme_completion_fn handler, unsigned timeout)
+static void nvme_set_info(struct nvme_cmd_info *cmd, void *ctx,
+ nvme_completion_fn handler)
{
- int cmdid;
- wait_event_killable(nvmeq->sq_full,
- (cmdid = alloc_cmdid(nvmeq, ctx, handler, timeout)) >= 0);
- return (cmdid < 0) ? -EINTR : cmdid;
+ cmd->fn = handler;
+ cmd->ctx = ctx;
+ cmd->aborted = 0;
}

/* Special values must be less than 0x1000 */
@@ -195,20 +200,12 @@ static int alloc_cmdid_killable(struct nvme_queue *nvmeq, void *ctx,
#define CMD_CTX_CANCELLED (0x30C + CMD_CTX_BASE)
#define CMD_CTX_COMPLETED (0x310 + CMD_CTX_BASE)
#define CMD_CTX_INVALID (0x314 + CMD_CTX_BASE)
-#define CMD_CTX_FLUSH (0x318 + CMD_CTX_BASE)
-#define CMD_CTX_ABORT (0x31C + CMD_CTX_BASE)

static void special_completion(struct nvme_queue *nvmeq, void *ctx,
struct nvme_completion *cqe)
{
if (ctx == CMD_CTX_CANCELLED)
return;
- if (ctx == CMD_CTX_FLUSH)
- return;
- if (ctx == CMD_CTX_ABORT) {
- ++nvmeq->dev->abort_limit;
- return;
- }
if (ctx == CMD_CTX_COMPLETED) {
dev_warn(nvmeq->q_dmadev,
"completed id %d twice on queue %d\n",
@@ -225,6 +222,44 @@ static void special_completion(struct nvme_queue *nvmeq, void *ctx,
dev_warn(nvmeq->q_dmadev, "Unknown special completion %p\n", ctx);
}

+static void *cancel_cmd_info(struct nvme_cmd_info *cmd, nvme_completion_fn *fn)
+{
+ void *ctx;
+
+ if (fn)
+ *fn = cmd->fn;
+ ctx = cmd->ctx;
+ cmd->fn = special_completion;
+ cmd->ctx = CMD_CTX_CANCELLED;
+ return ctx;
+}
+
+static void abort_completion(struct nvme_queue *nvmeq, void *ctx,
+ struct nvme_completion *cqe)
+{
+ struct request *req;
+ struct nvme_cmd_info *aborted = ctx;
+ struct nvme_queue *a_nvmeq = aborted->nvmeq;
+ struct blk_mq_hw_ctx *hctx = nvmeq->hctx;
+ void *a_ctx;
+ nvme_completion_fn a_fn;
+ static struct nvme_completion a_cqe = {
+ .status = cpu_to_le16(NVME_SC_ABORT_REQ << 1),
+ };
+
+ req = blk_mq_tag_to_rq(hctx->tags, cqe->command_id);
+ blk_put_request(req);
+
+ if (!cqe->status)
+ dev_warn(nvmeq->q_dmadev, "Could not abort I/O %d QID %d",
+ req->tag, nvmeq->qid);
+
+ a_ctx = cancel_cmd_info(aborted, &a_fn);
+ a_fn(a_nvmeq, a_ctx, &a_cqe);
+
+ ++nvmeq->dev->abort_limit;
+}
+
static void async_completion(struct nvme_queue *nvmeq, void *ctx,
struct nvme_completion *cqe)
{
@@ -232,75 +267,38 @@ static void async_completion(struct nvme_queue *nvmeq, void *ctx,
cmdinfo->result = le32_to_cpup(&cqe->result);
cmdinfo->status = le16_to_cpup(&cqe->status) >> 1;
queue_kthread_work(cmdinfo->worker, &cmdinfo->work);
+ blk_put_request(cmdinfo->req);
+}
+
+static inline struct nvme_cmd_info *get_cmd_from_tag(struct nvme_queue *nvmeq,
+ unsigned int tag)
+{
+ struct blk_mq_hw_ctx *hctx = nvmeq->hctx;
+ struct request *req = blk_mq_tag_to_rq(hctx->tags, tag);
+
+ return blk_mq_rq_to_pdu(req);
}

/*
* Called with local interrupts disabled and the q_lock held. May not sleep.
*/
-static void *free_cmdid(struct nvme_queue *nvmeq, int cmdid,
+static void *nvme_finish_cmd(struct nvme_queue *nvmeq, int tag,
nvme_completion_fn *fn)
{
+ struct nvme_cmd_info *cmd = get_cmd_from_tag(nvmeq, tag);
void *ctx;
- struct nvme_cmd_info *info = nvme_cmd_info(nvmeq);
-
- if (cmdid >= nvmeq->q_depth || !info[cmdid].fn) {
- if (fn)
- *fn = special_completion;
+ if (tag >= nvmeq->q_depth) {
+ *fn = special_completion;
return CMD_CTX_INVALID;
}
if (fn)
- *fn = info[cmdid].fn;
- ctx = info[cmdid].ctx;
- info[cmdid].fn = special_completion;
- info[cmdid].ctx = CMD_CTX_COMPLETED;
- clear_bit(cmdid, nvmeq->cmdid_data);
- wake_up(&nvmeq->sq_full);
+ *fn = cmd->fn;
+ ctx = cmd->ctx;
+ cmd->fn = special_completion;
+ cmd->ctx = CMD_CTX_COMPLETED;
return ctx;
}

-static void *cancel_cmdid(struct nvme_queue *nvmeq, int cmdid,
- nvme_completion_fn *fn)
-{
- void *ctx;
- struct nvme_cmd_info *info = nvme_cmd_info(nvmeq);
- if (fn)
- *fn = info[cmdid].fn;
- ctx = info[cmdid].ctx;
- info[cmdid].fn = special_completion;
- info[cmdid].ctx = CMD_CTX_CANCELLED;
- return ctx;
-}
-
-static struct nvme_queue *raw_nvmeq(struct nvme_dev *dev, int qid)
-{
- return rcu_dereference_raw(dev->queues[qid]);
-}
-
-static struct nvme_queue *get_nvmeq(struct nvme_dev *dev) __acquires(RCU)
-{
- unsigned queue_id = get_cpu_var(*dev->io_queue);
- rcu_read_lock();
- return rcu_dereference(dev->queues[queue_id]);
-}
-
-static void put_nvmeq(struct nvme_queue *nvmeq) __releases(RCU)
-{
- rcu_read_unlock();
- put_cpu_var(nvmeq->dev->io_queue);
-}
-
-static struct nvme_queue *lock_nvmeq(struct nvme_dev *dev, int q_idx)
- __acquires(RCU)
-{
- rcu_read_lock();
- return rcu_dereference(dev->queues[q_idx]);
-}
-
-static void unlock_nvmeq(struct nvme_queue *nvmeq) __releases(RCU)
-{
- rcu_read_unlock();
-}
-
/**
* nvme_submit_cmd() - Copy a command into a queue and ring the doorbell
* @nvmeq: The queue to use
@@ -357,7 +355,6 @@ nvme_alloc_iod(unsigned nseg, unsigned nbytes, gfp_t gfp)
iod->length = nbytes;
iod->nents = 0;
iod->first_dma = 0ULL;
- iod->start_time = jiffies;
}

return iod;
@@ -381,59 +378,31 @@ void nvme_free_iod(struct nvme_dev *dev, struct nvme_iod *iod)
kfree(iod);
}

-static void nvme_start_io_acct(struct bio *bio)
-{
- struct gendisk *disk = bio->bi_bdev->bd_disk;
- const int rw = bio_data_dir(bio);
- int cpu = part_stat_lock();
- part_round_stats(cpu, &disk->part0);
- part_stat_inc(cpu, &disk->part0, ios[rw]);
- part_stat_add(cpu, &disk->part0, sectors[rw], bio_sectors(bio));
- part_inc_in_flight(&disk->part0, rw);
- part_stat_unlock();
-}
-
-static void nvme_end_io_acct(struct bio *bio, unsigned long start_time)
-{
- struct gendisk *disk = bio->bi_bdev->bd_disk;
- const int rw = bio_data_dir(bio);
- unsigned long duration = jiffies - start_time;
- int cpu = part_stat_lock();
- part_stat_add(cpu, &disk->part0, ticks[rw], duration);
- part_round_stats(cpu, &disk->part0);
- part_dec_in_flight(&disk->part0, rw);
- part_stat_unlock();
-}
-
-static void bio_completion(struct nvme_queue *nvmeq, void *ctx,
+static void req_completion(struct nvme_queue *nvmeq, void *ctx,
struct nvme_completion *cqe)
{
struct nvme_iod *iod = ctx;
- struct bio *bio = iod->private;
+ struct request *req = iod->private;
+
u16 status = le16_to_cpup(&cqe->status) >> 1;

if (unlikely(status)) {
- if (!(status & NVME_SC_DNR ||
- bio->bi_rw & REQ_FAILFAST_MASK) &&
- (jiffies - iod->start_time) < IOD_TIMEOUT) {
- if (!waitqueue_active(&nvmeq->sq_full))
- add_wait_queue(&nvmeq->sq_full,
- &nvmeq->sq_cong_wait);
- list_add_tail(&iod->node, &nvmeq->iod_bio);
- wake_up(&nvmeq->sq_full);
+ if (!(status & NVME_SC_DNR || blk_noretry_request(req))
+ && (jiffies - req->start_time) < req->timeout) {
+ blk_mq_requeue_request(req);
return;
}
- }
+ req->errors = -EIO;
+ } else
+ req->errors = 0;
+
if (iod->nents) {
- dma_unmap_sg(nvmeq->q_dmadev, iod->sg, iod->nents,
- bio_data_dir(bio) ? DMA_TO_DEVICE : DMA_FROM_DEVICE);
- nvme_end_io_acct(bio, iod->start_time);
+ dma_unmap_sg(&nvmeq->dev->pci_dev->dev, iod->sg, iod->nents,
+ rq_data_dir(req) ? DMA_TO_DEVICE : DMA_FROM_DEVICE);
}
nvme_free_iod(nvmeq->dev, iod);
- if (status)
- bio_endio(bio, -EIO);
- else
- bio_endio(bio, 0);
+
+ blk_mq_complete_request(req);
}

/* length is in bytes. gfp flags indicates whether we may sleep. */
@@ -515,86 +484,25 @@ int nvme_setup_prps(struct nvme_dev *dev, struct nvme_iod *iod, int total_len,
return total_len;
}

-static int nvme_split_and_submit(struct bio *bio, struct nvme_queue *nvmeq,
- int len)
-{
- struct bio *split = bio_split(bio, len >> 9, GFP_ATOMIC, NULL);
- if (!split)
- return -ENOMEM;
-
- bio_chain(split, bio);
-
- if (!waitqueue_active(&nvmeq->sq_full))
- add_wait_queue(&nvmeq->sq_full, &nvmeq->sq_cong_wait);
- bio_list_add(&nvmeq->sq_cong, split);
- bio_list_add(&nvmeq->sq_cong, bio);
- wake_up(&nvmeq->sq_full);
-
- return 0;
-}
-
-/* NVMe scatterlists require no holes in the virtual address */
-#define BIOVEC_NOT_VIRT_MERGEABLE(vec1, vec2) ((vec2)->bv_offset || \
- (((vec1)->bv_offset + (vec1)->bv_len) % PAGE_SIZE))
-
-static int nvme_map_bio(struct nvme_queue *nvmeq, struct nvme_iod *iod,
- struct bio *bio, enum dma_data_direction dma_dir, int psegs)
-{
- struct bio_vec bvec, bvprv;
- struct bvec_iter iter;
- struct scatterlist *sg = NULL;
- int length = 0, nsegs = 0, split_len = bio->bi_iter.bi_size;
- int first = 1;
-
- if (nvmeq->dev->stripe_size)
- split_len = nvmeq->dev->stripe_size -
- ((bio->bi_iter.bi_sector << 9) &
- (nvmeq->dev->stripe_size - 1));
-
- sg_init_table(iod->sg, psegs);
- bio_for_each_segment(bvec, bio, iter) {
- if (!first && BIOVEC_PHYS_MERGEABLE(&bvprv, &bvec)) {
- sg->length += bvec.bv_len;
- } else {
- if (!first && BIOVEC_NOT_VIRT_MERGEABLE(&bvprv, &bvec))
- return nvme_split_and_submit(bio, nvmeq,
- length);
-
- sg = sg ? sg + 1 : iod->sg;
- sg_set_page(sg, bvec.bv_page,
- bvec.bv_len, bvec.bv_offset);
- nsegs++;
- }
-
- if (split_len - length < bvec.bv_len)
- return nvme_split_and_submit(bio, nvmeq, split_len);
- length += bvec.bv_len;
- bvprv = bvec;
- first = 0;
- }
- iod->nents = nsegs;
- sg_mark_end(sg);
- if (dma_map_sg(nvmeq->q_dmadev, iod->sg, iod->nents, dma_dir) == 0)
- return -ENOMEM;
-
- BUG_ON(length != bio->bi_iter.bi_size);
- return length;
-}
-
-static int nvme_submit_discard(struct nvme_queue *nvmeq, struct nvme_ns *ns,
- struct bio *bio, struct nvme_iod *iod, int cmdid)
+/*
+ * We reuse the small pool to allocate the 16-byte range here as it is not
+ * worth having a special pool for these or additional cases to handle freeing
+ * the iod.
+ */
+static void nvme_submit_discard(struct nvme_queue *nvmeq, struct nvme_ns *ns,
+ struct request *req, struct nvme_iod *iod)
{
struct nvme_dsm_range *range =
(struct nvme_dsm_range *)iod_list(iod)[0];
struct nvme_command *cmnd = &nvmeq->sq_cmds[nvmeq->sq_tail];

range->cattr = cpu_to_le32(0);
- range->nlb = cpu_to_le32(bio->bi_iter.bi_size >> ns->lba_shift);
- range->slba = cpu_to_le64(nvme_block_nr(ns, bio->bi_iter.bi_sector));
+ range->nlb = cpu_to_le32(blk_rq_bytes(req) >> ns->lba_shift);
+ range->slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));

memset(cmnd, 0, sizeof(*cmnd));
cmnd->dsm.opcode = nvme_cmd_dsm;
- cmnd->dsm.command_id = cmdid;
+ cmnd->dsm.command_id = req->tag;
cmnd->dsm.nsid = cpu_to_le32(ns->ns_id);
cmnd->dsm.prp1 = cpu_to_le64(iod->first_dma);
cmnd->dsm.nr = 0;
@@ -603,11 +511,9 @@ static int nvme_submit_discard(struct nvme_queue *nvmeq, struct nvme_ns *ns,
if (++nvmeq->sq_tail == nvmeq->q_depth)
nvmeq->sq_tail = 0;
writel(nvmeq->sq_tail, nvmeq->q_db);
-
- return 0;
}

-static int nvme_submit_flush(struct nvme_queue *nvmeq, struct nvme_ns *ns,
+static void nvme_submit_flush(struct nvme_queue *nvmeq, struct nvme_ns *ns,
int cmdid)
{
struct nvme_command *cmnd = &nvmeq->sq_cmds[nvmeq->sq_tail];
@@ -620,59 +526,34 @@ static int nvme_submit_flush(struct nvme_queue *nvmeq, struct nvme_ns *ns,
if (++nvmeq->sq_tail == nvmeq->q_depth)
nvmeq->sq_tail = 0;
writel(nvmeq->sq_tail, nvmeq->q_db);
-
- return 0;
-}
-
-int nvme_submit_flush_data(struct nvme_queue *nvmeq, struct nvme_ns *ns)
-{
- int cmdid = alloc_cmdid(nvmeq, (void *)CMD_CTX_FLUSH,
- special_completion, NVME_IO_TIMEOUT);
- if (unlikely(cmdid < 0))
- return cmdid;
-
- return nvme_submit_flush(nvmeq, ns, cmdid);
}

-static int nvme_submit_iod(struct nvme_queue *nvmeq, struct nvme_iod *iod)
+static int nvme_submit_iod(struct nvme_queue *nvmeq, struct nvme_iod *iod,
+ struct nvme_ns *ns)
{
- struct bio *bio = iod->private;
- struct nvme_ns *ns = bio->bi_bdev->bd_disk->private_data;
+ struct request *req = iod->private;
struct nvme_command *cmnd;
- int cmdid;
- u16 control;
- u32 dsmgmt;
+ u16 control = 0;
+ u32 dsmgmt = 0;

- cmdid = alloc_cmdid(nvmeq, iod, bio_completion, NVME_IO_TIMEOUT);
- if (unlikely(cmdid < 0))
- return cmdid;
-
- if (bio->bi_rw & REQ_DISCARD)
- return nvme_submit_discard(nvmeq, ns, bio, iod, cmdid);
- if ((bio->bi_rw & REQ_FLUSH) && !iod->nents)
- return nvme_submit_flush(nvmeq, ns, cmdid);
-
- control = 0;
- if (bio->bi_rw & REQ_FUA)
+ if (req->cmd_flags & REQ_FUA)
control |= NVME_RW_FUA;
- if (bio->bi_rw & (REQ_FAILFAST_DEV | REQ_RAHEAD))
+ if (req->cmd_flags & (REQ_FAILFAST_DEV | REQ_RAHEAD))
control |= NVME_RW_LR;

- dsmgmt = 0;
- if (bio->bi_rw & REQ_RAHEAD)
+ if (req->cmd_flags & REQ_RAHEAD)
dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;

cmnd = &nvmeq->sq_cmds[nvmeq->sq_tail];
memset(cmnd, 0, sizeof(*cmnd));

- cmnd->rw.opcode = bio_data_dir(bio) ? nvme_cmd_write : nvme_cmd_read;
- cmnd->rw.command_id = cmdid;
+ cmnd->rw.opcode = (rq_data_dir(req) ? nvme_cmd_write : nvme_cmd_read);
+ cmnd->rw.command_id = req->tag;
cmnd->rw.nsid = cpu_to_le32(ns->ns_id);
cmnd->rw.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
cmnd->rw.prp2 = cpu_to_le64(iod->first_dma);
- cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, bio->bi_iter.bi_sector));
- cmnd->rw.length =
- cpu_to_le16((bio->bi_iter.bi_size >> ns->lba_shift) - 1);
+ cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
+ cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
cmnd->rw.control = cpu_to_le16(control);
cmnd->rw.dsmgmt = cpu_to_le32(dsmgmt);

@@ -683,28 +564,32 @@ static int nvme_submit_iod(struct nvme_queue *nvmeq, struct nvme_iod *iod)
return 0;
}

-/*
- * Called with local interrupts disabled and the q_lock held. May not sleep.
- */
-static int nvme_submit_bio_queue(struct nvme_queue *nvmeq, struct nvme_ns *ns,
- struct bio *bio)
+static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
{
+ struct nvme_ns *ns = hctx->queue->queuedata;
+ struct nvme_queue *nvmeq = hctx->driver_data;
+ struct nvme_cmd_info *cmd = blk_mq_rq_to_pdu(req);
struct nvme_iod *iod;
- int psegs = bio_phys_segments(ns->queue, bio);
- int result;
+ enum dma_data_direction dma_dir;
+ int psegs = req->nr_phys_segments;
+ int result = BLK_MQ_RQ_QUEUE_BUSY;
+ /*
+ * Requeued IO has already been prepped
+ */
+ iod = req->special;
+ if (iod)
+ goto submit_iod;

- if ((bio->bi_rw & REQ_FLUSH) && psegs) {
- result = nvme_submit_flush_data(nvmeq, ns);
- if (result)
- return result;
- }
-
- iod = nvme_alloc_iod(psegs, bio->bi_iter.bi_size, GFP_ATOMIC);
+ iod = nvme_alloc_iod(psegs, blk_rq_bytes(req), GFP_ATOMIC);
if (!iod)
- return -ENOMEM;
+ return result;

- iod->private = bio;
- if (bio->bi_rw & REQ_DISCARD) {
+ iod->private = req;
+ req->special = iod;
+
+ nvme_set_info(cmd, iod, req_completion);
+
+ if (req->cmd_flags & REQ_DISCARD) {
void *range;
/*
* We reuse the small pool to allocate the 16-byte range here
@@ -714,33 +599,53 @@ static int nvme_submit_bio_queue(struct nvme_queue *nvmeq, struct nvme_ns *ns,
range = dma_pool_alloc(nvmeq->dev->prp_small_pool,
GFP_ATOMIC,
&iod->first_dma);
- if (!range) {
- result = -ENOMEM;
- goto free_iod;
- }
+ if (!range)
+ goto finish_cmd;
iod_list(iod)[0] = (__le64 *)range;
iod->npages = 0;
} else if (psegs) {
- result = nvme_map_bio(nvmeq, iod, bio,
- bio_data_dir(bio) ? DMA_TO_DEVICE : DMA_FROM_DEVICE,
- psegs);
- if (result <= 0)
- goto free_iod;
- if (nvme_setup_prps(nvmeq->dev, iod, result, GFP_ATOMIC) !=
- result) {
- result = -ENOMEM;
- goto free_iod;
+ dma_dir = rq_data_dir(req) ? DMA_TO_DEVICE : DMA_FROM_DEVICE;
+
+ sg_init_table(iod->sg, psegs);
+ iod->nents = blk_rq_map_sg(req->q, req, iod->sg);
+ if (!iod->nents) {
+ result = BLK_MQ_RQ_QUEUE_ERROR;
+ goto finish_cmd;
}
- nvme_start_io_acct(bio);
+
+ if (!dma_map_sg(nvmeq->q_dmadev, iod->sg, iod->nents, dma_dir))
+ goto finish_cmd;
+
+ if (blk_rq_bytes(req) != nvme_setup_prps(nvmeq->dev, iod,
+ blk_rq_bytes(req), GFP_ATOMIC))
+ goto finish_cmd;
+ }
+
+ submit_iod:
+ spin_lock_irq(&nvmeq->q_lock);
+ if (nvmeq->q_suspended) {
+ spin_unlock_irq(&nvmeq->q_lock);
+ goto finish_cmd;
}
- if (unlikely(nvme_submit_iod(nvmeq, iod))) {
- if (!waitqueue_active(&nvmeq->sq_full))
- add_wait_queue(&nvmeq->sq_full, &nvmeq->sq_cong_wait);
- list_add_tail(&iod->node, &nvmeq->iod_bio);
+
+ if (req->cmd_flags & REQ_DISCARD) {
+ nvme_submit_discard(nvmeq, ns, req, iod);
+ goto queued;
+ }
+
+ if (req->cmd_flags & REQ_FLUSH) {
+ nvme_submit_flush(nvmeq, ns, req->tag);
+ goto queued;
}
- return 0;

- free_iod:
+ nvme_submit_iod(nvmeq, iod, ns);
+ queued:
+ nvme_process_cq(nvmeq);
+ spin_unlock_irq(&nvmeq->q_lock);
+ return BLK_MQ_RQ_QUEUE_OK;
+
+ finish_cmd:
+ nvme_finish_cmd(nvmeq, req->tag, NULL);
nvme_free_iod(nvmeq->dev, iod);
return result;
}
@@ -763,8 +668,7 @@ static int nvme_process_cq(struct nvme_queue *nvmeq)
head = 0;
phase = !phase;
}
-
- ctx = free_cmdid(nvmeq, cqe.command_id, &fn);
+ ctx = nvme_finish_cmd(nvmeq, cqe.command_id, &fn);
fn(nvmeq, ctx, &cqe);
}

@@ -785,30 +689,12 @@ static int nvme_process_cq(struct nvme_queue *nvmeq)
return 1;
}

-static void nvme_make_request(struct request_queue *q, struct bio *bio)
+/* Admin queue isn't initialized as a request queue. If at some point this
+ * happens anyway, make sure to notify the user */
+static int nvme_admin_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
{
- struct nvme_ns *ns = q->queuedata;
- struct nvme_queue *nvmeq = get_nvmeq(ns->dev);
- int result = -EBUSY;
-
- if (!nvmeq) {
- put_nvmeq(NULL);
- bio_endio(bio, -EIO);
- return;
- }
-
- spin_lock_irq(&nvmeq->q_lock);
- if (!nvmeq->q_suspended && bio_list_empty(&nvmeq->sq_cong))
- result = nvme_submit_bio_queue(nvmeq, ns, bio);
- if (unlikely(result)) {
- if (!waitqueue_active(&nvmeq->sq_full))
- add_wait_queue(&nvmeq->sq_full, &nvmeq->sq_cong_wait);
- bio_list_add(&nvmeq->sq_cong, bio);
- }
-
- nvme_process_cq(nvmeq);
- spin_unlock_irq(&nvmeq->q_lock);
- put_nvmeq(nvmeq);
+ WARN_ON(1);
+ return BLK_MQ_RQ_QUEUE_ERROR;
}

static irqreturn_t nvme_irq(int irq, void *data)
@@ -832,10 +718,11 @@ static irqreturn_t nvme_irq_check(int irq, void *data)
return IRQ_WAKE_THREAD;
}

-static void nvme_abort_command(struct nvme_queue *nvmeq, int cmdid)
+static void nvme_abort_cmd_info(struct nvme_queue *nvmeq, struct nvme_cmd_info *
+ cmd_info)
{
spin_lock_irq(&nvmeq->q_lock);
- cancel_cmdid(nvmeq, cmdid, NULL);
+ cancel_cmd_info(cmd_info, NULL);
spin_unlock_irq(&nvmeq->q_lock);
}

@@ -858,46 +745,31 @@ static void sync_completion(struct nvme_queue *nvmeq, void *ctx,
* Returns 0 on success. If the result is negative, it's a Linux error code;
* if the result is positive, it's an NVM Express status code
*/
-static int nvme_submit_sync_cmd(struct nvme_dev *dev, int q_idx,
- struct nvme_command *cmd,
+static int nvme_submit_sync_cmd(struct request *req, struct nvme_command *cmd,
u32 *result, unsigned timeout)
{
- int cmdid, ret;
+ int ret;
struct sync_cmd_info cmdinfo;
- struct nvme_queue *nvmeq;
-
- nvmeq = lock_nvmeq(dev, q_idx);
- if (!nvmeq) {
- unlock_nvmeq(nvmeq);
- return -ENODEV;
- }
+ struct nvme_cmd_info *cmd_rq = blk_mq_rq_to_pdu(req);
+ struct nvme_queue *nvmeq = cmd_rq->nvmeq;

cmdinfo.task = current;
cmdinfo.status = -EINTR;

- cmdid = alloc_cmdid(nvmeq, &cmdinfo, sync_completion, timeout);
- if (cmdid < 0) {
- unlock_nvmeq(nvmeq);
- return cmdid;
- }
- cmd->common.command_id = cmdid;
+ cmd->common.command_id = req->tag;
+
+ nvme_set_info(cmd_rq, &cmdinfo, sync_completion);

set_current_state(TASK_KILLABLE);
ret = nvme_submit_cmd(nvmeq, cmd);
if (ret) {
- free_cmdid(nvmeq, cmdid, NULL);
- unlock_nvmeq(nvmeq);
+ nvme_finish_cmd(nvmeq, req->tag, NULL);
set_current_state(TASK_RUNNING);
- return ret;
}
- unlock_nvmeq(nvmeq);
schedule_timeout(timeout);

if (cmdinfo.status == -EINTR) {
- nvmeq = lock_nvmeq(dev, q_idx);
- if (nvmeq)
- nvme_abort_command(nvmeq, cmdid);
- unlock_nvmeq(nvmeq);
+ nvme_abort_cmd_info(nvmeq, blk_mq_rq_to_pdu(req));
return -EINTR;
}

@@ -907,59 +779,78 @@ static int nvme_submit_sync_cmd(struct nvme_dev *dev, int q_idx,
return cmdinfo.status;
}

-static int nvme_submit_async_cmd(struct nvme_queue *nvmeq,
+static int nvme_submit_admin_async_cmd(struct nvme_dev *dev,
struct nvme_command *cmd,
struct async_cmd_info *cmdinfo, unsigned timeout)
{
- int cmdid;
+ struct nvme_queue *nvmeq = dev->queues[0];
+ struct request *req;
+ struct nvme_cmd_info *cmd_rq;

- cmdid = alloc_cmdid_killable(nvmeq, cmdinfo, async_completion, timeout);
- if (cmdid < 0)
- return cmdid;
+ req = blk_mq_alloc_request(dev->admin_q, WRITE, GFP_KERNEL, false);
+ if (!req)
+ return -ENOMEM;
+
+ req->timeout = timeout;
+ cmd_rq = blk_mq_rq_to_pdu(req);
+ cmdinfo->req = req;
+ nvme_set_info(cmd_rq, cmdinfo, async_completion);
cmdinfo->status = -EINTR;
- cmd->common.command_id = cmdid;
+
+ cmd->common.command_id = req->tag;
+
return nvme_submit_cmd(nvmeq, cmd);
}

+int __nvme_submit_admin_cmd(struct nvme_dev *dev, struct nvme_command *cmd,
+ u32 *result, unsigned timeout)
+{
+ int res;
+ struct request *req;
+
+ req = blk_mq_alloc_request(dev->admin_q, WRITE, GFP_KERNEL, false);
+ if (!req)
+ return -ENOMEM;
+ res = nvme_submit_sync_cmd(req, cmd, result, timeout);
+ blk_put_request(req);
+ return res;
+}
+
int nvme_submit_admin_cmd(struct nvme_dev *dev, struct nvme_command *cmd,
u32 *result)
{
- return nvme_submit_sync_cmd(dev, 0, cmd, result, ADMIN_TIMEOUT);
+ return __nvme_submit_admin_cmd(dev, cmd, result, ADMIN_TIMEOUT);
}

-int nvme_submit_io_cmd(struct nvme_dev *dev, struct nvme_command *cmd,
- u32 *result)
+int nvme_submit_io_cmd(struct nvme_dev *dev, struct nvme_ns *ns,
+ struct nvme_command *cmd, u32 *result)
{
- return nvme_submit_sync_cmd(dev, smp_processor_id() + 1, cmd, result,
- NVME_IO_TIMEOUT);
-}
+ int res;
+ struct request *req;

-static int nvme_submit_admin_cmd_async(struct nvme_dev *dev,
- struct nvme_command *cmd, struct async_cmd_info *cmdinfo)
-{
- return nvme_submit_async_cmd(raw_nvmeq(dev, 0), cmd, cmdinfo,
- ADMIN_TIMEOUT);
+ req = blk_mq_alloc_request(ns->queue, WRITE, (GFP_KERNEL|__GFP_WAIT),
+ false);
+ if (!req)
+ return -ENOMEM;
+ res = nvme_submit_sync_cmd(req, cmd, result, NVME_IO_TIMEOUT);
+ blk_put_request(req);
+ return res;
}

static int adapter_delete_queue(struct nvme_dev *dev, u8 opcode, u16 id)
{
- int status;
struct nvme_command c;

memset(&c, 0, sizeof(c));
c.delete_queue.opcode = opcode;
c.delete_queue.qid = cpu_to_le16(id);

- status = nvme_submit_admin_cmd(dev, &c, NULL);
- if (status)
- return -EIO;
- return 0;
+ return nvme_submit_admin_cmd(dev, &c, NULL);
}

static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid,
struct nvme_queue *nvmeq)
{
- int status;
struct nvme_command c;
int flags = NVME_QUEUE_PHYS_CONTIG | NVME_CQ_IRQ_ENABLED;

@@ -971,16 +862,12 @@ static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid,
c.create_cq.cq_flags = cpu_to_le16(flags);
c.create_cq.irq_vector = cpu_to_le16(nvmeq->cq_vector);

- status = nvme_submit_admin_cmd(dev, &c, NULL);
- if (status)
- return -EIO;
- return 0;
+ return nvme_submit_admin_cmd(dev, &c, NULL);
}

static int adapter_alloc_sq(struct nvme_dev *dev, u16 qid,
struct nvme_queue *nvmeq)
{
- int status;
struct nvme_command c;
int flags = NVME_QUEUE_PHYS_CONTIG | NVME_SQ_PRIO_MEDIUM;

@@ -992,10 +879,7 @@ static int adapter_alloc_sq(struct nvme_dev *dev, u16 qid,
c.create_sq.sq_flags = cpu_to_le16(flags);
c.create_sq.cqid = cpu_to_le16(qid);

- status = nvme_submit_admin_cmd(dev, &c, NULL);
- if (status)
- return -EIO;
- return 0;
+ return nvme_submit_admin_cmd(dev, &c, NULL);
}

static int adapter_delete_cq(struct nvme_dev *dev, u16 cqid)
@@ -1051,28 +935,27 @@ int nvme_set_features(struct nvme_dev *dev, unsigned fid, unsigned dword11,
}

/**
- * nvme_abort_cmd - Attempt aborting a command
- * @cmdid: Command id of a timed out IO
- * @queue: The queue with timed out IO
+ * nvme_abort_req - Attempt aborting a request
*
* Schedule controller reset if the command was already aborted once before and
* still hasn't been returned to the driver, or if this is the admin queue.
*/
-static void nvme_abort_cmd(int cmdid, struct nvme_queue *nvmeq)
+static void nvme_abort_req(struct request *req)
{
- int a_cmdid;
+ struct nvme_cmd_info *cmd_rq = blk_mq_rq_to_pdu(req);
+ struct nvme_queue *nvmeq = cmd_rq->nvmeq;
+ struct nvme_dev *dev = nvmeq->dev;
+ struct request *abort_req;
+ struct nvme_cmd_info *abort_cmd;
struct nvme_command cmd;
- struct nvme_dev *dev = nvmeq->dev;
- struct nvme_cmd_info *info = nvme_cmd_info(nvmeq);
- struct nvme_queue *adminq;

- if (!nvmeq->qid || info[cmdid].aborted) {
+ if (!nvmeq->qid || cmd_rq->aborted) {
if (work_busy(&dev->reset_work))
return;
list_del_init(&dev->node);
dev_warn(&dev->pci_dev->dev,
- "I/O %d QID %d timeout, reset controller\n", cmdid,
- nvmeq->qid);
+ "I/O %d QID %d timeout, reset controller\n",
+ req->tag, nvmeq->qid);
dev->reset_workfn = nvme_reset_failed_dev;
queue_work(nvme_workq, &dev->reset_work);
return;
@@ -1081,89 +964,102 @@ static void nvme_abort_cmd(int cmdid, struct nvme_queue *nvmeq)
if (!dev->abort_limit)
return;

- adminq = rcu_dereference(dev->queues[0]);
- a_cmdid = alloc_cmdid(adminq, CMD_CTX_ABORT, special_completion,
- ADMIN_TIMEOUT);
- if (a_cmdid < 0)
+ abort_req = blk_mq_alloc_request(dev->admin_q, WRITE, GFP_ATOMIC,
+ true);
+ if (!abort_req)
return;

+ abort_cmd = blk_mq_rq_to_pdu(abort_req);
+ nvme_set_info(abort_cmd, cmd_rq, abort_completion);
+
memset(&cmd, 0, sizeof(cmd));
cmd.abort.opcode = nvme_admin_abort_cmd;
- cmd.abort.cid = cmdid;
+ cmd.abort.cid = req->tag;
cmd.abort.sqid = cpu_to_le16(nvmeq->qid);
- cmd.abort.command_id = a_cmdid;
+ cmd.abort.command_id = abort_req->tag;

--dev->abort_limit;
- info[cmdid].aborted = 1;
- info[cmdid].timeout = jiffies + ADMIN_TIMEOUT;
+ cmd_rq->aborted = 1;

- dev_warn(nvmeq->q_dmadev, "Aborting I/O %d QID %d\n", cmdid,
+ dev_warn(nvmeq->q_dmadev, "Aborting I/O %d QID %d\n", req->tag,
nvmeq->qid);
- nvme_submit_cmd(adminq, &cmd);
+ if (nvme_submit_cmd(dev->queues[0], &cmd) < 0) {
+ dev_warn(nvmeq->q_dmadev, "Could not abort I/O %d QID %d",
+ req->tag, nvmeq->qid);
+ }
}

-/**
- * nvme_cancel_ios - Cancel outstanding I/Os
- * @queue: The queue to cancel I/Os on
- * @timeout: True to only cancel I/Os which have timed out
- */
-static void nvme_cancel_ios(struct nvme_queue *nvmeq, bool timeout)
+static void nvme_cancel_queue_ios(void *data, unsigned long *tag_map)
{
- int depth = nvmeq->q_depth - 1;
- struct nvme_cmd_info *info = nvme_cmd_info(nvmeq);
- unsigned long now = jiffies;
- int cmdid;
+ struct nvme_queue *nvmeq = data;
+ struct blk_mq_hw_ctx *hctx = nvmeq->hctx;
+ unsigned int tag = 0;

- for_each_set_bit(cmdid, nvmeq->cmdid_data, depth) {
+ tag = 0;
+ do {
+ struct request *req;
void *ctx;
nvme_completion_fn fn;
+ struct nvme_cmd_info *cmd;
static struct nvme_completion cqe = {
.status = cpu_to_le16(NVME_SC_ABORT_REQ << 1),
};
+ int qdepth = nvmeq == nvmeq->dev->queues[0] ?
+ nvmeq->dev->admin_tagset.queue_depth :
+ nvmeq->dev->tagset.queue_depth;

- if (timeout && !time_after(now, info[cmdid].timeout))
- continue;
- if (info[cmdid].ctx == CMD_CTX_CANCELLED)
- continue;
- if (timeout && nvmeq->dev->initialized) {
- nvme_abort_cmd(cmdid, nvmeq);
+ /* zero'd bits are free tags */
+ tag = find_next_zero_bit(tag_map, qdepth, tag);
+ if (tag >= qdepth)
+ break;
+
+ req = blk_mq_tag_to_rq(hctx->tags, tag++);
+ cmd = blk_mq_rq_to_pdu(req);
+
+ if (cmd->ctx == CMD_CTX_CANCELLED)
continue;
- }
- dev_warn(nvmeq->q_dmadev, "Cancelling I/O %d QID %d\n", cmdid,
- nvmeq->qid);
- ctx = cancel_cmdid(nvmeq, cmdid, &fn);
+
+ dev_warn(nvmeq->q_dmadev, "Cancelling I/O %d QID %d\n",
+ req->tag, nvmeq->qid);
+ ctx = cancel_cmd_info(cmd, &fn);
fn(nvmeq, ctx, &cqe);
- }
+
+ } while (1);
+}
+
+/**
+ * nvme_cancel_ios - Cancel outstanding I/Os
+ * @nvmeq: The queue to cancel I/Os on
+ * @tagset: The tag set associated with the queue
+ */
+static void nvme_cancel_ios(struct nvme_queue *nvmeq)
+{
+ struct blk_mq_hw_ctx *hctx = nvmeq->hctx;
+
+ if (nvmeq->dev->initialized)
+ blk_mq_tag_busy_iter(hctx->tags, nvme_cancel_queue_ios, nvmeq);
}

-static void nvme_free_queue(struct rcu_head *r)
+static enum blk_eh_timer_return nvme_timeout(struct request *req)
{
- struct nvme_queue *nvmeq = container_of(r, struct nvme_queue, r_head);
-
- spin_lock_irq(&nvmeq->q_lock);
- while (bio_list_peek(&nvmeq->sq_cong)) {
- struct bio *bio = bio_list_pop(&nvmeq->sq_cong);
- bio_endio(bio, -EIO);
- }
- while (!list_empty(&nvmeq->iod_bio)) {
- static struct nvme_completion cqe = {
- .status = cpu_to_le16(
- (NVME_SC_ABORT_REQ | NVME_SC_DNR) << 1),
- };
- struct nvme_iod *iod = list_first_entry(&nvmeq->iod_bio,
- struct nvme_iod,
- node);
- list_del(&iod->node);
- bio_completion(nvmeq, iod, &cqe);
- }
- spin_unlock_irq(&nvmeq->q_lock);
+ struct nvme_cmd_info *cmd = blk_mq_rq_to_pdu(req);
+ struct nvme_queue *nvmeq = cmd->nvmeq;

+ dev_warn(nvmeq->q_dmadev, "Cancelling I/O %d QID %d\n", req->tag,
+ nvmeq->qid);
+
+ if (nvmeq->dev->initialized)
+ nvme_abort_req(req);
+
+ return BLK_EH_HANDLED;
+}
+
+static void nvme_free_queue(struct nvme_queue *nvmeq)
+{
dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
(void *)nvmeq->cqes, nvmeq->cq_dma_addr);
dma_free_coherent(nvmeq->q_dmadev, SQ_SIZE(nvmeq->q_depth),
nvmeq->sq_cmds, nvmeq->sq_dma_addr);
- if (nvmeq->qid)
- free_cpumask_var(nvmeq->cpu_mask);
kfree(nvmeq);
}

@@ -1172,10 +1068,10 @@ static void nvme_free_queues(struct nvme_dev *dev, int lowest)
int i;

for (i = dev->queue_count - 1; i >= lowest; i--) {
- struct nvme_queue *nvmeq = raw_nvmeq(dev, i);
- rcu_assign_pointer(dev->queues[i], NULL);
- call_rcu(&nvmeq->r_head, nvme_free_queue);
+ struct nvme_queue *nvmeq = dev->queues[i];
dev->queue_count--;
+ dev->queues[i] = NULL;
+ nvme_free_queue(nvmeq);
}
}

@@ -1208,13 +1104,13 @@ static void nvme_clear_queue(struct nvme_queue *nvmeq)
{
spin_lock_irq(&nvmeq->q_lock);
nvme_process_cq(nvmeq);
- nvme_cancel_ios(nvmeq, false);
+ nvme_cancel_ios(nvmeq);
spin_unlock_irq(&nvmeq->q_lock);
}

static void nvme_disable_queue(struct nvme_dev *dev, int qid)
{
- struct nvme_queue *nvmeq = raw_nvmeq(dev, qid);
+ struct nvme_queue *nvmeq = dev->queues[qid];

if (!nvmeq)
return;
@@ -1234,8 +1130,7 @@ static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
int depth, int vector)
{
struct device *dmadev = &dev->pci_dev->dev;
- unsigned extra = nvme_queue_extra(depth);
- struct nvme_queue *nvmeq = kzalloc(sizeof(*nvmeq) + extra, GFP_KERNEL);
+ struct nvme_queue *nvmeq = kzalloc(sizeof(*nvmeq), GFP_KERNEL);
if (!nvmeq)
return NULL;

@@ -1250,9 +1145,6 @@ static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
if (!nvmeq->sq_cmds)
goto free_cqdma;

- if (qid && !zalloc_cpumask_var(&nvmeq->cpu_mask, GFP_KERNEL))
- goto free_sqdma;
-
nvmeq->q_dmadev = dmadev;
nvmeq->dev = dev;
snprintf(nvmeq->irqname, sizeof(nvmeq->irqname), "nvme%dq%d",
@@ -1260,23 +1152,16 @@ static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
spin_lock_init(&nvmeq->q_lock);
nvmeq->cq_head = 0;
nvmeq->cq_phase = 1;
- init_waitqueue_head(&nvmeq->sq_full);
- init_waitqueue_entry(&nvmeq->sq_cong_wait, nvme_thread);
- bio_list_init(&nvmeq->sq_cong);
- INIT_LIST_HEAD(&nvmeq->iod_bio);
nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
nvmeq->q_depth = depth;
nvmeq->cq_vector = vector;
nvmeq->qid = qid;
nvmeq->q_suspended = 1;
dev->queue_count++;
- rcu_assign_pointer(dev->queues[qid], nvmeq);
+ dev->queues[qid] = nvmeq;

return nvmeq;

- free_sqdma:
- dma_free_coherent(dmadev, SQ_SIZE(depth), (void *)nvmeq->sq_cmds,
- nvmeq->sq_dma_addr);
free_cqdma:
dma_free_coherent(dmadev, CQ_SIZE(depth), (void *)nvmeq->cqes,
nvmeq->cq_dma_addr);
@@ -1299,15 +1184,12 @@ static int queue_request_irq(struct nvme_dev *dev, struct nvme_queue *nvmeq,
static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
{
struct nvme_dev *dev = nvmeq->dev;
- unsigned extra = nvme_queue_extra(nvmeq->q_depth);

nvmeq->sq_tail = 0;
nvmeq->cq_head = 0;
nvmeq->cq_phase = 1;
nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
- memset(nvmeq->cmdid_data, 0, extra);
memset((void *)nvmeq->cqes, 0, CQ_SIZE(nvmeq->q_depth));
- nvme_cancel_ios(nvmeq, false);
nvmeq->q_suspended = 0;
dev->online_queues++;
}
@@ -1408,6 +1290,53 @@ static int nvme_shutdown_ctrl(struct nvme_dev *dev)
return 0;
}

+static struct blk_mq_ops nvme_mq_admin_ops = {
+ .queue_rq = nvme_admin_queue_rq,
+ .map_queue = blk_mq_map_queue,
+ .init_hctx = nvme_admin_init_hctx,
+ .init_request = nvme_admin_init_request,
+ .timeout = nvme_timeout,
+};
+
+static struct blk_mq_ops nvme_mq_ops = {
+ .queue_rq = nvme_queue_rq,
+ .map_queue = blk_mq_map_queue,
+ .init_hctx = nvme_init_hctx,
+ .init_request = nvme_init_request,
+ .timeout = nvme_timeout,
+};
+
+static int nvme_alloc_admin_tags(struct nvme_dev *dev)
+{
+ if (!dev->admin_q) {
+ dev->admin_tagset.ops = &nvme_mq_admin_ops;
+ dev->admin_tagset.nr_hw_queues = 1;
+ dev->admin_tagset.queue_depth = NVME_AQ_DEPTH;
+ dev->admin_tagset.timeout = ADMIN_TIMEOUT;
+ dev->admin_tagset.reserved_tags = 1,
+ dev->admin_tagset.numa_node = dev_to_node(&dev->pci_dev->dev);
+ dev->admin_tagset.cmd_size = sizeof(struct nvme_cmd_info);
+ dev->admin_tagset.driver_data = dev;
+
+ if (blk_mq_alloc_tag_set(&dev->admin_tagset))
+ return -ENOMEM;
+
+ dev->admin_q = blk_mq_init_queue(&dev->admin_tagset);
+ if (!dev->admin_q) {
+ blk_mq_free_tag_set(&dev->admin_tagset);
+ return -ENOMEM;
+ }
+ }
+
+ return 0;
+}
+
+static void nvme_free_admin_tags(struct nvme_dev *dev)
+{
+ if (dev->admin_q)
+ blk_mq_free_tag_set(&dev->admin_tagset);
+}
+
static int nvme_configure_admin_queue(struct nvme_dev *dev)
{
int result;
@@ -1419,9 +1348,9 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
if (result < 0)
return result;

- nvmeq = raw_nvmeq(dev, 0);
+ nvmeq = dev->queues[0];
if (!nvmeq) {
- nvmeq = nvme_alloc_queue(dev, 0, 64, 0);
+ nvmeq = nvme_alloc_queue(dev, 0, NVME_AQ_DEPTH, 0);
if (!nvmeq)
return -ENOMEM;
}
@@ -1441,16 +1370,26 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)

result = nvme_enable_ctrl(dev, cap);
if (result)
- return result;
+ goto free_nvmeq;
+
+ result = nvme_alloc_admin_tags(dev);
+ if (result)
+ goto free_nvmeq;

result = queue_request_irq(dev, nvmeq, nvmeq->irqname);
if (result)
- return result;
+ goto free_tags;

spin_lock_irq(&nvmeq->q_lock);
nvme_init_queue(nvmeq, 0);
spin_unlock_irq(&nvmeq->q_lock);
return result;
+
+ free_tags:
+ nvme_free_admin_tags(dev);
+ free_nvmeq:
+ nvme_free_queues(dev, 0);
+ return result;
}

struct nvme_iod *nvme_map_user_pages(struct nvme_dev *dev, int write,
@@ -1605,7 +1544,7 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
if (length != (io.nblocks + 1) << ns->lba_shift)
status = -ENOMEM;
else
- status = nvme_submit_io_cmd(dev, &c, NULL);
+ status = nvme_submit_io_cmd(dev, ns, &c, NULL);

if (meta_len) {
if (status == NVME_SC_SUCCESS && !(io.opcode & 1)) {
@@ -1677,10 +1616,11 @@ static int nvme_user_admin_cmd(struct nvme_dev *dev,

timeout = cmd.timeout_ms ? msecs_to_jiffies(cmd.timeout_ms) :
ADMIN_TIMEOUT;
+
if (length != cmd.data_len)
status = -ENOMEM;
else
- status = nvme_submit_sync_cmd(dev, 0, &c, &cmd.result, timeout);
+ status = __nvme_submit_admin_cmd(dev, &c, &cmd.result, timeout);

if (cmd.data_len) {
nvme_unmap_user_pages(dev, cmd.opcode & 1, iod);
@@ -1769,41 +1709,6 @@ static const struct block_device_operations nvme_fops = {
.getgeo = nvme_getgeo,
};

-static void nvme_resubmit_iods(struct nvme_queue *nvmeq)
-{
- struct nvme_iod *iod, *next;
-
- list_for_each_entry_safe(iod, next, &nvmeq->iod_bio, node) {
- if (unlikely(nvme_submit_iod(nvmeq, iod)))
- break;
- list_del(&iod->node);
- if (bio_list_empty(&nvmeq->sq_cong) &&
- list_empty(&nvmeq->iod_bio))
- remove_wait_queue(&nvmeq->sq_full,
- &nvmeq->sq_cong_wait);
- }
-}
-
-static void nvme_resubmit_bios(struct nvme_queue *nvmeq)
-{
- while (bio_list_peek(&nvmeq->sq_cong)) {
- struct bio *bio = bio_list_pop(&nvmeq->sq_cong);
- struct nvme_ns *ns = bio->bi_bdev->bd_disk->private_data;
-
- if (bio_list_empty(&nvmeq->sq_cong) &&
- list_empty(&nvmeq->iod_bio))
- remove_wait_queue(&nvmeq->sq_full,
- &nvmeq->sq_cong_wait);
- if (nvme_submit_bio_queue(nvmeq, ns, bio)) {
- if (!waitqueue_active(&nvmeq->sq_full))
- add_wait_queue(&nvmeq->sq_full,
- &nvmeq->sq_cong_wait);
- bio_list_add_head(&nvmeq->sq_cong, bio);
- break;
- }
- }
-}
-
static int nvme_kthread(void *data)
{
struct nvme_dev *dev, *next;
@@ -1824,23 +1729,17 @@ static int nvme_kthread(void *data)
queue_work(nvme_workq, &dev->reset_work);
continue;
}
- rcu_read_lock();
for (i = 0; i < dev->queue_count; i++) {
- struct nvme_queue *nvmeq =
- rcu_dereference(dev->queues[i]);
+ struct nvme_queue *nvmeq = dev->queues[i];
if (!nvmeq)
continue;
spin_lock_irq(&nvmeq->q_lock);
if (nvmeq->q_suspended)
goto unlock;
nvme_process_cq(nvmeq);
- nvme_cancel_ios(nvmeq, true);
- nvme_resubmit_bios(nvmeq);
- nvme_resubmit_iods(nvmeq);
unlock:
spin_unlock_irq(&nvmeq->q_lock);
}
- rcu_read_unlock();
}
spin_unlock(&dev_list_lock);
schedule_timeout(round_jiffies_relative(HZ));
@@ -1863,27 +1762,29 @@ static struct nvme_ns *nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid,
{
struct nvme_ns *ns;
struct gendisk *disk;
+ int node = dev_to_node(&dev->pci_dev->dev);
int lbaf;

if (rt->attributes & NVME_LBART_ATTRIB_HIDE)
return NULL;

- ns = kzalloc(sizeof(*ns), GFP_KERNEL);
+ ns = kzalloc_node(sizeof(*ns), GFP_KERNEL, node);
if (!ns)
return NULL;
- ns->queue = blk_alloc_queue(GFP_KERNEL);
+ ns->queue = blk_mq_init_queue(&dev->tagset);
if (!ns->queue)
goto out_free_ns;
- ns->queue->queue_flags = QUEUE_FLAG_DEFAULT;
+ queue_flag_set_unlocked(QUEUE_FLAG_DEFAULT, ns->queue);
queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, ns->queue);
queue_flag_set_unlocked(QUEUE_FLAG_NONROT, ns->queue);
- blk_queue_make_request(ns->queue, nvme_make_request);
+ queue_flag_clear_unlocked(QUEUE_FLAG_IO_STAT, ns->queue);
ns->dev = dev;
ns->queue->queuedata = ns;

- disk = alloc_disk(0);
+ disk = alloc_disk_node(0, node);
if (!disk)
goto out_free_queue;
+
ns->ns_id = nsid;
ns->disk = disk;
lbaf = id->flbas & 0xf;
@@ -1892,6 +1793,8 @@ static struct nvme_ns *nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid,
blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift);
if (dev->max_hw_sectors)
blk_queue_max_hw_sectors(ns->queue, dev->max_hw_sectors);
+ if (dev->stripe_size)
+ blk_queue_chunk_sectors(ns->queue, dev->stripe_size >> 9);
if (dev->vwc & NVME_CTRL_VWC_PRESENT)
blk_queue_flush(ns->queue, REQ_FLUSH | REQ_FUA);

@@ -1917,143 +1820,19 @@ static struct nvme_ns *nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid,
return NULL;
}

-static int nvme_find_closest_node(int node)
-{
- int n, val, min_val = INT_MAX, best_node = node;
-
- for_each_online_node(n) {
- if (n == node)
- continue;
- val = node_distance(node, n);
- if (val < min_val) {
- min_val = val;
- best_node = n;
- }
- }
- return best_node;
-}
-
-static void nvme_set_queue_cpus(cpumask_t *qmask, struct nvme_queue *nvmeq,
- int count)
-{
- int cpu;
- for_each_cpu(cpu, qmask) {
- if (cpumask_weight(nvmeq->cpu_mask) >= count)
- break;
- if (!cpumask_test_and_set_cpu(cpu, nvmeq->cpu_mask))
- *per_cpu_ptr(nvmeq->dev->io_queue, cpu) = nvmeq->qid;
- }
-}
-
-static void nvme_add_cpus(cpumask_t *mask, const cpumask_t *unassigned_cpus,
- const cpumask_t *new_mask, struct nvme_queue *nvmeq, int cpus_per_queue)
-{
- int next_cpu;
- for_each_cpu(next_cpu, new_mask) {
- cpumask_or(mask, mask, get_cpu_mask(next_cpu));
- cpumask_or(mask, mask, topology_thread_cpumask(next_cpu));
- cpumask_and(mask, mask, unassigned_cpus);
- nvme_set_queue_cpus(mask, nvmeq, cpus_per_queue);
- }
-}
-
static void nvme_create_io_queues(struct nvme_dev *dev)
{
- unsigned i, max;
+ unsigned i;

- max = min(dev->max_qid, num_online_cpus());
- for (i = dev->queue_count; i <= max; i++)
+ for (i = dev->queue_count; i <= dev->max_qid; i++)
if (!nvme_alloc_queue(dev, i, dev->q_depth, i - 1))
break;

- max = min(dev->queue_count - 1, num_online_cpus());
- for (i = dev->online_queues; i <= max; i++)
- if (nvme_create_queue(raw_nvmeq(dev, i), i))
+ for (i = dev->online_queues; i <= dev->queue_count - 1; i++)
+ if (nvme_create_queue(dev->queues[i], i))
break;
}

-/*
- * If there are fewer queues than online cpus, this will try to optimally
- * assign a queue to multiple cpus by grouping cpus that are "close" together:
- * thread siblings, core, socket, closest node, then whatever else is
- * available.
- */
-static void nvme_assign_io_queues(struct nvme_dev *dev)
-{
- unsigned cpu, cpus_per_queue, queues, remainder, i;
- cpumask_var_t unassigned_cpus;
-
- nvme_create_io_queues(dev);
-
- queues = min(dev->online_queues - 1, num_online_cpus());
- if (!queues)
- return;
-
- cpus_per_queue = num_online_cpus() / queues;
- remainder = queues - (num_online_cpus() - queues * cpus_per_queue);
-
- if (!alloc_cpumask_var(&unassigned_cpus, GFP_KERNEL))
- return;
-
- cpumask_copy(unassigned_cpus, cpu_online_mask);
- cpu = cpumask_first(unassigned_cpus);
- for (i = 1; i <= queues; i++) {
- struct nvme_queue *nvmeq = lock_nvmeq(dev, i);
- cpumask_t mask;
-
- cpumask_clear(nvmeq->cpu_mask);
- if (!cpumask_weight(unassigned_cpus)) {
- unlock_nvmeq(nvmeq);
- break;
- }
-
- mask = *get_cpu_mask(cpu);
- nvme_set_queue_cpus(&mask, nvmeq, cpus_per_queue);
- if (cpus_weight(mask) < cpus_per_queue)
- nvme_add_cpus(&mask, unassigned_cpus,
- topology_thread_cpumask(cpu),
- nvmeq, cpus_per_queue);
- if (cpus_weight(mask) < cpus_per_queue)
- nvme_add_cpus(&mask, unassigned_cpus,
- topology_core_cpumask(cpu),
- nvmeq, cpus_per_queue);
- if (cpus_weight(mask) < cpus_per_queue)
- nvme_add_cpus(&mask, unassigned_cpus,
- cpumask_of_node(cpu_to_node(cpu)),
- nvmeq, cpus_per_queue);
- if (cpus_weight(mask) < cpus_per_queue)
- nvme_add_cpus(&mask, unassigned_cpus,
- cpumask_of_node(
- nvme_find_closest_node(
- cpu_to_node(cpu))),
- nvmeq, cpus_per_queue);
- if (cpus_weight(mask) < cpus_per_queue)
- nvme_add_cpus(&mask, unassigned_cpus,
- unassigned_cpus,
- nvmeq, cpus_per_queue);
-
- WARN(cpumask_weight(nvmeq->cpu_mask) != cpus_per_queue,
- "nvme%d qid:%d mis-matched queue-to-cpu assignment\n",
- dev->instance, i);
-
- irq_set_affinity_hint(dev->entry[nvmeq->cq_vector].vector,
- nvmeq->cpu_mask);
- cpumask_andnot(unassigned_cpus, unassigned_cpus,
- nvmeq->cpu_mask);
- cpu = cpumask_next(cpu, unassigned_cpus);
- if (remainder && !--remainder)
- cpus_per_queue++;
- unlock_nvmeq(nvmeq);
- }
- WARN(cpumask_weight(unassigned_cpus), "nvme%d unassigned online cpus\n",
- dev->instance);
- i = 0;
- cpumask_andnot(unassigned_cpus, cpu_possible_mask, cpu_online_mask);
- for_each_cpu(cpu, unassigned_cpus)
- *per_cpu_ptr(dev->io_queue, cpu) = (i++ % queues) + 1;
- free_cpumask_var(unassigned_cpus);
-}
-
static int set_queue_count(struct nvme_dev *dev, int count)
{
int status;
@@ -2077,22 +1856,9 @@ static size_t db_bar_size(struct nvme_dev *dev, unsigned nr_io_queues)
return 4096 + ((nr_io_queues + 1) * 8 * dev->db_stride);
}

-static int nvme_cpu_notify(struct notifier_block *self,
- unsigned long action, void *hcpu)
-{
- struct nvme_dev *dev = container_of(self, struct nvme_dev, nb);
- switch (action) {
- case CPU_ONLINE:
- case CPU_DEAD:
- nvme_assign_io_queues(dev);
- break;
- }
- return NOTIFY_OK;
-}
-
static int nvme_setup_io_queues(struct nvme_dev *dev)
{
- struct nvme_queue *adminq = raw_nvmeq(dev, 0);
+ struct nvme_queue *adminq = dev->queues[0];
struct pci_dev *pdev = dev->pci_dev;
int result, i, vecs, nr_io_queues, size;

@@ -2151,12 +1917,7 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)

/* Free previously allocated queues that are no longer usable */
nvme_free_queues(dev, nr_io_queues + 1);
- nvme_assign_io_queues(dev);
-
- dev->nb.notifier_call = &nvme_cpu_notify;
- result = register_hotcpu_notifier(&dev->nb);
- if (result)
- goto free_queues;
+ nvme_create_io_queues(dev);

return 0;

@@ -2205,8 +1966,29 @@ static int nvme_dev_add(struct nvme_dev *dev)
if (ctrl->mdts)
dev->max_hw_sectors = 1 << (ctrl->mdts + shift - 9);
if ((pdev->vendor == PCI_VENDOR_ID_INTEL) &&
- (pdev->device == 0x0953) && ctrl->vs[3])
+ (pdev->device == 0x0953) && ctrl->vs[3]) {
+ unsigned int max_hw_sectors;
+
dev->stripe_size = 1 << (ctrl->vs[3] + shift);
+ max_hw_sectors = dev->stripe_size >> (shift - 9);
+ if (dev->max_hw_sectors) {
+ dev->max_hw_sectors = min(max_hw_sectors,
+ dev->max_hw_sectors);
+ } else
+ dev->max_hw_sectors = max_hw_sectors;
+ }
+
+ dev->tagset.ops = &nvme_mq_ops;
+ dev->tagset.nr_hw_queues = dev->online_queues - 1;
+ dev->tagset.timeout = NVME_IO_TIMEOUT;
+ dev->tagset.numa_node = dev_to_node(&dev->pci_dev->dev);
+ dev->tagset.queue_depth = min_t(int, dev->q_depth, BLK_MQ_MAX_DEPTH);
+ dev->tagset.cmd_size = sizeof(struct nvme_cmd_info);
+ dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE;
+ dev->tagset.driver_data = dev;
+
+ if (blk_mq_alloc_tag_set(&dev->tagset))
+ goto out;

id_ns = mem;
for (i = 1; i <= nn; i++) {
@@ -2356,7 +2138,8 @@ static int adapter_async_del_queue(struct nvme_queue *nvmeq, u8 opcode,
c.delete_queue.qid = cpu_to_le16(nvmeq->qid);

init_kthread_work(&nvmeq->cmdinfo.work, fn);
- return nvme_submit_admin_cmd_async(nvmeq->dev, &c, &nvmeq->cmdinfo);
+ return nvme_submit_admin_async_cmd(nvmeq->dev, &c, &nvmeq->cmdinfo,
+ ADMIN_TIMEOUT);
}

static void nvme_del_cq_work_handler(struct kthread_work *work)
@@ -2419,7 +2202,7 @@ static void nvme_disable_io_queues(struct nvme_dev *dev)
atomic_set(&dq.refcount, 0);
dq.worker = &worker;
for (i = dev->queue_count - 1; i > 0; i--) {
- struct nvme_queue *nvmeq = raw_nvmeq(dev, i);
+ struct nvme_queue *nvmeq = dev->queues[i];

if (nvme_suspend_queue(nvmeq))
continue;
@@ -2457,13 +2240,12 @@ static void nvme_dev_shutdown(struct nvme_dev *dev)
int i;

dev->initialized = 0;
- unregister_hotcpu_notifier(&dev->nb);

nvme_dev_list_remove(dev);

if (!dev->bar || (dev->bar && readl(&dev->bar->csts) == -1)) {
for (i = dev->queue_count - 1; i >= 0; i--) {
- struct nvme_queue *nvmeq = raw_nvmeq(dev, i);
+ struct nvme_queue *nvmeq = dev->queues[i];
nvme_suspend_queue(nvmeq);
nvme_clear_queue(nvmeq);
}
@@ -2475,6 +2257,12 @@ static void nvme_dev_shutdown(struct nvme_dev *dev)
nvme_dev_unmap(dev);
}

+static void nvme_dev_remove_admin(struct nvme_dev *dev)
+{
+ if (dev->admin_q && !blk_queue_dying(dev->admin_q))
+ blk_cleanup_queue(dev->admin_q);
+}
+
static void nvme_dev_remove(struct nvme_dev *dev)
{
struct nvme_ns *ns;
@@ -2556,7 +2344,7 @@ static void nvme_free_dev(struct kref *kref)
struct nvme_dev *dev = container_of(kref, struct nvme_dev, kref);

nvme_free_namespaces(dev);
- free_percpu(dev->io_queue);
+ blk_mq_free_tag_set(&dev->tagset);
kfree(dev->queues);
kfree(dev->entry);
kfree(dev);
@@ -2708,23 +2496,24 @@ static void nvme_reset_workfn(struct work_struct *work)

static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
- int result = -ENOMEM;
+ int node, result = -ENOMEM;
struct nvme_dev *dev;

- dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+ node = dev_to_node(&pdev->dev);
+ if (node == NUMA_NO_NODE)
+ set_dev_node(&pdev->dev, 0);
+
+ dev = kzalloc_node(sizeof(*dev), GFP_KERNEL, node);
if (!dev)
return -ENOMEM;
- dev->entry = kcalloc(num_possible_cpus(), sizeof(*dev->entry),
- GFP_KERNEL);
+ dev->entry = kzalloc_node(num_possible_cpus() * sizeof(*dev->entry),
+ GFP_KERNEL, node);
if (!dev->entry)
goto free;
- dev->queues = kcalloc(num_possible_cpus() + 1, sizeof(void *),
- GFP_KERNEL);
+ dev->queues = kzalloc_node((num_possible_cpus() + 1) * sizeof(void *),
+ GFP_KERNEL, node);
if (!dev->queues)
goto free;
- dev->io_queue = alloc_percpu(unsigned short);
- if (!dev->io_queue)
- goto free;

INIT_LIST_HEAD(&dev->namespaces);
dev->reset_workfn = nvme_reset_failed_dev;
@@ -2766,6 +2555,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)

remove:
nvme_dev_remove(dev);
+ nvme_dev_remove_admin(dev);
nvme_free_namespaces(dev);
shutdown:
nvme_dev_shutdown(dev);
@@ -2775,7 +2565,6 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
release:
nvme_release_instance(dev);
free:
- free_percpu(dev->io_queue);
kfree(dev->queues);
kfree(dev->entry);
kfree(dev);
@@ -2811,8 +2600,9 @@ static void nvme_remove(struct pci_dev *pdev)
misc_deregister(&dev->miscdev);
nvme_dev_remove(dev);
nvme_dev_shutdown(dev);
+ nvme_dev_remove_admin(dev);
nvme_free_queues(dev, 0);
- rcu_barrier();
+ nvme_free_admin_tags(dev);
nvme_release_instance(dev);
nvme_release_prp_pools(dev);
kref_put(&dev->kref, nvme_free_dev);
diff --git a/drivers/block/nvme-scsi.c b/drivers/block/nvme-scsi.c
index da3b252..d386cae 100644
--- a/drivers/block/nvme-scsi.c
+++ b/drivers/block/nvme-scsi.c
@@ -2103,7 +2103,7 @@ static int nvme_trans_do_nvme_io(struct nvme_ns *ns, struct sg_io_hdr *hdr,

nvme_offset += unit_num_blocks;

- nvme_sc = nvme_submit_io_cmd(dev, &c, NULL);
+ nvme_sc = nvme_submit_io_cmd(dev, ns, &c, NULL);
if (nvme_sc != NVME_SC_SUCCESS) {
nvme_unmap_user_pages(dev,
(is_write) ? DMA_TO_DEVICE : DMA_FROM_DEVICE,
@@ -2656,7 +2656,7 @@ static int nvme_trans_start_stop(struct nvme_ns *ns, struct sg_io_hdr *hdr,
c.common.opcode = nvme_cmd_flush;
c.common.nsid = cpu_to_le32(ns->ns_id);

- nvme_sc = nvme_submit_io_cmd(ns->dev, &c, NULL);
+ nvme_sc = nvme_submit_io_cmd(ns->dev, ns, &c, NULL);
res = nvme_trans_status_code(hdr, nvme_sc);
if (res)
goto out;
@@ -2684,7 +2684,7 @@ static int nvme_trans_synchronize_cache(struct nvme_ns *ns,
c.common.opcode = nvme_cmd_flush;
c.common.nsid = cpu_to_le32(ns->ns_id);

- nvme_sc = nvme_submit_io_cmd(ns->dev, &c, NULL);
+ nvme_sc = nvme_submit_io_cmd(ns->dev, ns, &c, NULL);

res = nvme_trans_status_code(hdr, nvme_sc);
if (res)
@@ -2892,7 +2892,7 @@ static int nvme_trans_unmap(struct nvme_ns *ns, struct sg_io_hdr *hdr,
c.dsm.nr = cpu_to_le32(ndesc - 1);
c.dsm.attributes = cpu_to_le32(NVME_DSMGMT_AD);

- nvme_sc = nvme_submit_io_cmd(dev, &c, NULL);
+ nvme_sc = nvme_submit_io_cmd(dev, ns, &c, NULL);
res = nvme_trans_status_code(hdr, nvme_sc);

dma_free_coherent(&dev->pci_dev->dev, ndesc * sizeof(*range),
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 6266373..38aade9 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -19,6 +19,7 @@
#include <linux/pci.h>
#include <linux/miscdevice.h>
#include <linux/kref.h>
+#include <linux/blk-mq.h>

struct nvme_bar {
__u64 cap; /* Controller Capabilities */
@@ -70,8 +71,10 @@ extern unsigned char io_timeout;
*/
struct nvme_dev {
struct list_head node;
- struct nvme_queue __rcu **queues;
- unsigned short __percpu *io_queue;
+ struct nvme_queue **queues;
+ struct request_queue *admin_q;
+ struct blk_mq_tag_set tagset;
+ struct blk_mq_tag_set admin_tagset;
u32 __iomem *dbs;
struct pci_dev *pci_dev;
struct dma_pool *prp_page_pool;
@@ -90,7 +93,6 @@ struct nvme_dev {
struct miscdevice miscdev;
work_func_t reset_workfn;
struct work_struct reset_work;
- struct notifier_block nb;
char name[12];
char serial[20];
char model[40];
@@ -132,7 +134,6 @@ struct nvme_iod {
int offset; /* Of PRP list */
int nents; /* Used in scatterlist */
int length; /* Of data, in bytes */
- unsigned long start_time;
dma_addr_t first_dma;
struct list_head node;
struct scatterlist sg[0];
@@ -150,12 +151,13 @@ static inline u64 nvme_block_nr(struct nvme_ns *ns, sector_t sector)
*/
void nvme_free_iod(struct nvme_dev *dev, struct nvme_iod *iod);

-int nvme_setup_prps(struct nvme_dev *, struct nvme_iod *, int , gfp_t);
+int nvme_setup_prps(struct nvme_dev *, struct nvme_iod *, int, gfp_t);
struct nvme_iod *nvme_map_user_pages(struct nvme_dev *dev, int write,
unsigned long addr, unsigned length);
void nvme_unmap_user_pages(struct nvme_dev *dev, int write,
struct nvme_iod *iod);
-int nvme_submit_io_cmd(struct nvme_dev *, struct nvme_command *, u32 *);
+int nvme_submit_io_cmd(struct nvme_dev *, struct nvme_ns *,
+ struct nvme_command *, u32 *);
int nvme_submit_flush_data(struct nvme_queue *nvmeq, struct nvme_ns *ns);
int nvme_submit_admin_cmd(struct nvme_dev *, struct nvme_command *,
u32 *result);
--
1.9.1

2014-06-09 04:35:53

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH v6] NVMe: conversion to blk-mq

On Fri, Jun 6, 2014 at 8:20 PM, Matias Bjørling <[email protected]> wrote:
> This converts the current NVMe driver to utilize the blk-mq layer.

Looks it can't be applied cleanly against 3.15-rc8 + Jens's for-linux
branch, when I fix the conflict manually, below failure is triggered:

[ 487.696057] nvme 0000:00:07.0: Cancelling I/O 202 QID 1
[ 487.699005] nvme 0000:00:07.0: Aborting I/O 202 QID 1
[ 487.704074] nvme 0000:00:07.0: Cancelling I/O 202 QID 1
[ 487.717881] nvme 0000:00:07.0: Aborting I/O 202 QID 1
[ 487.736093] end_request: I/O error, dev nvme0n1, sector 91532352
[ 487.747378] nvme 0000:00:07.0: completed id 0 twice on queue 0


when running fio randread(libaio, iodepth:64) with more than 3 jobs.

And looks no such failure when jobs is 1 or 2.

Thanks,
--
Ming Lei

2014-06-09 06:00:48

by Matias Bjørling

[permalink] [raw]
Subject: Re: [PATCH v6] NVMe: conversion to blk-mq

On Mon, Jun 9, 2014 at 6:35 AM, Ming Lei <[email protected]> wrote:
> On Fri, Jun 6, 2014 at 8:20 PM, Matias Bjørling <[email protected]> wrote:
>> This converts the current NVMe driver to utilize the blk-mq layer.
>
> Looks it can't be applied cleanly against 3.15-rc8 + Jens's for-linux
> branch, when I fix the conflict manually, below failure is triggered:
>
> [ 487.696057] nvme 0000:00:07.0: Cancelling I/O 202 QID 1
> [ 487.699005] nvme 0000:00:07.0: Aborting I/O 202 QID 1
> [ 487.704074] nvme 0000:00:07.0: Cancelling I/O 202 QID 1
> [ 487.717881] nvme 0000:00:07.0: Aborting I/O 202 QID 1
> [ 487.736093] end_request: I/O error, dev nvme0n1, sector 91532352
> [ 487.747378] nvme 0000:00:07.0: completed id 0 twice on queue 0
>
>
> when running fio randread(libaio, iodepth:64) with more than 3 jobs.
>
> And looks no such failure when jobs is 1 or 2.

Can you try with the nvmemq_review branch at

https://github.com/MatiasBjorling/linux-collab.git

I'll take a look tomorrow and see if I can reproduce it. Thanks!

>
> Thanks,
> --
> Ming Lei

2014-06-09 07:50:29

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH v6] NVMe: conversion to blk-mq

On Mon, Jun 9, 2014 at 2:00 PM, Matias Bjørling <[email protected]> wrote:
> On Mon, Jun 9, 2014 at 6:35 AM, Ming Lei <[email protected]> wrote:
>> On Fri, Jun 6, 2014 at 8:20 PM, Matias Bjørling <[email protected]> wrote:
>>> This converts the current NVMe driver to utilize the blk-mq layer.
>>
>> Looks it can't be applied cleanly against 3.15-rc8 + Jens's for-linux
>> branch, when I fix the conflict manually, below failure is triggered:
>>
>> [ 487.696057] nvme 0000:00:07.0: Cancelling I/O 202 QID 1
>> [ 487.699005] nvme 0000:00:07.0: Aborting I/O 202 QID 1
>> [ 487.704074] nvme 0000:00:07.0: Cancelling I/O 202 QID 1
>> [ 487.717881] nvme 0000:00:07.0: Aborting I/O 202 QID 1
>> [ 487.736093] end_request: I/O error, dev nvme0n1, sector 91532352
>> [ 487.747378] nvme 0000:00:07.0: completed id 0 twice on queue 0
>>
>>
>> when running fio randread(libaio, iodepth:64) with more than 3 jobs.
>>
>> And looks no such failure when jobs is 1 or 2.
>
> Can you try with the nvmemq_review branch at
>
> https://github.com/MatiasBjorling/linux-collab.git

Looks git-pull from the branch does work, so you
might have out-of-tree patches.


Thanks,
--
Ming Lei

2014-06-09 07:53:33

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH v6] NVMe: conversion to blk-mq

On Mon, Jun 9, 2014 at 3:50 PM, Ming Lei <[email protected]> wrote:
> On Mon, Jun 9, 2014 at 2:00 PM, Matias Bjørling <[email protected]> wrote:
>> On Mon, Jun 9, 2014 at 6:35 AM, Ming Lei <[email protected]> wrote:
>>> On Fri, Jun 6, 2014 at 8:20 PM, Matias Bjørling <[email protected]> wrote:
>>>> This converts the current NVMe driver to utilize the blk-mq layer.
>>>
>>> Looks it can't be applied cleanly against 3.15-rc8 + Jens's for-linux
>>> branch, when I fix the conflict manually, below failure is triggered:
>>>
>>> [ 487.696057] nvme 0000:00:07.0: Cancelling I/O 202 QID 1
>>> [ 487.699005] nvme 0000:00:07.0: Aborting I/O 202 QID 1
>>> [ 487.704074] nvme 0000:00:07.0: Cancelling I/O 202 QID 1
>>> [ 487.717881] nvme 0000:00:07.0: Aborting I/O 202 QID 1
>>> [ 487.736093] end_request: I/O error, dev nvme0n1, sector 91532352
>>> [ 487.747378] nvme 0000:00:07.0: completed id 0 twice on queue 0
>>>
>>>
>>> when running fio randread(libaio, iodepth:64) with more than 3 jobs.
>>>
>>> And looks no such failure when jobs is 1 or 2.
>>
>> Can you try with the nvmemq_review branch at
>>
>> https://github.com/MatiasBjorling/linux-collab.git
>
> Looks git-pull from the branch does work, so you
> might have out-of-tree patches.

After pulling from your tree, the problem still persists.

I test nvme over qemu, and both linus/next tree can
work well with qemu nvme.


Thanks,
--
Ming Lei

2014-06-09 10:40:56

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH v6] NVMe: conversion to blk-mq

On Mon, Jun 9, 2014 at 3:53 PM, Ming Lei <[email protected]>
>
> After pulling from your tree, the problem still persists.
>
> I test nvme over qemu, and both linus/next tree can
> work well with qemu nvme.

One problem I found is that rq->start_time isn't set
after you cleared IO_STAT intentionally, then the
request can't be requeued any more. Once the request
can be requeued, the problem becomes req hanging.

The root cause is that device returns
NVME_INTERNAL_DEV_ERROR(0x6) with your conversion
patch.

Thanks,
--
Ming Lei

2014-06-09 14:09:33

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH v6] NVMe: conversion to blk-mq

On 2014-06-08 22:35, Ming Lei wrote:
> On Fri, Jun 6, 2014 at 8:20 PM, Matias Bjørling <[email protected]> wrote:
>> This converts the current NVMe driver to utilize the blk-mq layer.
>
> Looks it can't be applied cleanly against 3.15-rc8 + Jens's for-linux
> branch, when I fix the conflict manually, below failure is triggered:
>
> [ 487.696057] nvme 0000:00:07.0: Cancelling I/O 202 QID 1
> [ 487.699005] nvme 0000:00:07.0: Aborting I/O 202 QID 1
> [ 487.704074] nvme 0000:00:07.0: Cancelling I/O 202 QID 1
> [ 487.717881] nvme 0000:00:07.0: Aborting I/O 202 QID 1
> [ 487.736093] end_request: I/O error, dev nvme0n1, sector 91532352
> [ 487.747378] nvme 0000:00:07.0: completed id 0 twice on queue 0
>
>
> when running fio randread(libaio, iodepth:64) with more than 3 jobs.
>
> And looks no such failure when jobs is 1 or 2.

It's a known issue, make sure you have this patch:

http://git.kernel.dk/?p=linux-block.git;a=commit;h=f6be4fb4bcb396fc3b1c134b7863351972de081f


--
Jens Axboe

2014-06-09 14:59:54

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v6] NVMe: conversion to blk-mq

On Mon, Jun 09, 2014 at 08:08:50AM -0600, Jens Axboe wrote:
> >[ 487.704074] nvme 0000:00:07.0: Cancelling I/O 202 QID 1
> >[ 487.717881] nvme 0000:00:07.0: Aborting I/O 202 QID 1
> >[ 487.736093] end_request: I/O error, dev nvme0n1, sector 91532352
> >[ 487.747378] nvme 0000:00:07.0: completed id 0 twice on queue 0
> >
> >
> >when running fio randread(libaio, iodepth:64) with more than 3 jobs.
> >
> >And looks no such failure when jobs is 1 or 2.
>
> It's a known issue, make sure you have this patch:
>
> http://git.kernel.dk/?p=linux-block.git;a=commit;h=f6be4fb4bcb396fc3b1c134b7863351972de081f

So without this patch we can get duplicate completions? That could
explain the issues that Robert sees with scsi-mq and hpsa..

2014-06-09 15:07:55

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH v6] NVMe: conversion to blk-mq

On Mon, Jun 9, 2014 at 10:08 PM, Jens Axboe <[email protected]> wrote:
> On 2014-06-08 22:35, Ming Lei wrote:
>>
>> On Fri, Jun 6, 2014 at 8:20 PM, Matias Bjørling <[email protected]> wrote:
>>>
>>> This converts the current NVMe driver to utilize the blk-mq layer.
>>
>>
>> Looks it can't be applied cleanly against 3.15-rc8 + Jens's for-linux
>> branch, when I fix the conflict manually, below failure is triggered:
>>
>> [ 487.696057] nvme 0000:00:07.0: Cancelling I/O 202 QID 1
>> [ 487.699005] nvme 0000:00:07.0: Aborting I/O 202 QID 1
>> [ 487.704074] nvme 0000:00:07.0: Cancelling I/O 202 QID 1
>> [ 487.717881] nvme 0000:00:07.0: Aborting I/O 202 QID 1
>> [ 487.736093] end_request: I/O error, dev nvme0n1, sector 91532352
>> [ 487.747378] nvme 0000:00:07.0: completed id 0 twice on queue 0
>>
>>
>> when running fio randread(libaio, iodepth:64) with more than 3 jobs.
>>
>> And looks no such failure when jobs is 1 or 2.
>
>
> It's a known issue, make sure you have this patch:
>
> http://git.kernel.dk/?p=linux-block.git;a=commit;h=f6be4fb4bcb396fc3b1c134b7863351972de081f

Looks assignment to 'rq->deadline' in blk_mq_start_request()
may be removed, since blk_add_timer() already did that.

But that isn't the only issue.

Another one is that req->start_time isn't set without IO_STAT in
nvme, which may cause similar issue too, so req->start_time
can be set in nvme's queue_rq callback.

My real issue should be nvme specific: NVME_INTERNAL_DEV_ERROR(0x6) is
returned from device
with the conversion patch, but no such issue in current bio mode.


Thanks,
--
Ming Lei

2014-06-09 15:11:13

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH v6] NVMe: conversion to blk-mq

On 2014-06-09 09:07, Ming Lei wrote:
> On Mon, Jun 9, 2014 at 10:08 PM, Jens Axboe <[email protected]> wrote:
>> On 2014-06-08 22:35, Ming Lei wrote:
>>>
>>> On Fri, Jun 6, 2014 at 8:20 PM, Matias Bjørling <[email protected]> wrote:
>>>>
>>>> This converts the current NVMe driver to utilize the blk-mq layer.
>>>
>>>
>>> Looks it can't be applied cleanly against 3.15-rc8 + Jens's for-linux
>>> branch, when I fix the conflict manually, below failure is triggered:
>>>
>>> [ 487.696057] nvme 0000:00:07.0: Cancelling I/O 202 QID 1
>>> [ 487.699005] nvme 0000:00:07.0: Aborting I/O 202 QID 1
>>> [ 487.704074] nvme 0000:00:07.0: Cancelling I/O 202 QID 1
>>> [ 487.717881] nvme 0000:00:07.0: Aborting I/O 202 QID 1
>>> [ 487.736093] end_request: I/O error, dev nvme0n1, sector 91532352
>>> [ 487.747378] nvme 0000:00:07.0: completed id 0 twice on queue 0
>>>
>>>
>>> when running fio randread(libaio, iodepth:64) with more than 3 jobs.
>>>
>>> And looks no such failure when jobs is 1 or 2.
>>
>>
>> It's a known issue, make sure you have this patch:
>>
>> https://urldefense.proofpoint.com/v1/url?u=http://git.kernel.dk/?p%3Dlinux-block.git%3Ba%3Dcommit%3Bh%3Df6be4fb4bcb396fc3b1c134b7863351972de081f&k=ZVNjlDMF0FElm4dQtryO4A%3D%3D%0A&r=3JMVyziIyZtZ5cv9eWNLwQ%3D%3D%0A&m=04gqJgKKgVgGynQnyGFy4OmxWZ5%2FJ%2FwBiXQH3K%2BEZ4U%3D%0A&s=b137111b72079e93492763bfa23fd3458cd9b287a5eb3b68f61b69c9cda6b7a5
>
> Looks assignment to 'rq->deadline' in blk_mq_start_request()
> may be removed, since blk_add_timer() already did that.
>
> But that isn't the only issue.
>
> Another one is that req->start_time isn't set without IO_STAT in
> nvme, which may cause similar issue too, so req->start_time
> can be set in nvme's queue_rq callback.
>
> My real issue should be nvme specific: NVME_INTERNAL_DEV_ERROR(0x6) is
> returned from device
> with the conversion patch, but no such issue in current bio mode.

You are right, I didn't get to the bottom of the thread first.
Might be better to just have ->start_time set always, regardless of io
stats being enabled or not.

--
Jens Axboe

2014-06-09 15:16:01

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH v6] NVMe: conversion to blk-mq

On Mon, Jun 9, 2014 at 11:10 PM, Jens Axboe <[email protected]> wrote:
> On 2014-06-09 09:07, Ming Lei wrote:
>
> You are right, I didn't get to the bottom of the thread first.
> Might be better to just have ->start_time set always, regardless of io stats
> being enabled or not.

Yes, just like what non-mq code did.


Thanks,
--
Ming Lei

2014-06-09 15:28:09

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH v6] NVMe: conversion to blk-mq

On 2014-06-09 08:59, Christoph Hellwig wrote:
> On Mon, Jun 09, 2014 at 08:08:50AM -0600, Jens Axboe wrote:
>>> [ 487.704074] nvme 0000:00:07.0: Cancelling I/O 202 QID 1
>>> [ 487.717881] nvme 0000:00:07.0: Aborting I/O 202 QID 1
>>> [ 487.736093] end_request: I/O error, dev nvme0n1, sector 91532352
>>> [ 487.747378] nvme 0000:00:07.0: completed id 0 twice on queue 0
>>>
>>>
>>> when running fio randread(libaio, iodepth:64) with more than 3 jobs.
>>>
>>> And looks no such failure when jobs is 1 or 2.
>>
>> It's a known issue, make sure you have this patch:
>>
>> https://urldefense.proofpoint.com/v1/url?u=http://git.kernel.dk/?p%3Dlinux-block.git%3Ba%3Dcommit%3Bh%3Df6be4fb4bcb396fc3b1c134b7863351972de081f&k=ZVNjlDMF0FElm4dQtryO4A%3D%3D%0A&r=3JMVyziIyZtZ5cv9eWNLwQ%3D%3D%0A&m=MKK7A%2B9AnXXwtMV9YPSFEprXId0z0WLtd2JQPxnaNVc%3D%0A&s=45425e5d17390ae912457035ff081b95a7084725706ffe91166a1bdb25d8bf3d
>
> So without this patch we can get duplicate completions? That could
> explain the issues that Robert sees with scsi-mq and hpsa..

I don't see how that would happen in general, except if the timeout
part was buggy.

--
Jens Axboe

2014-06-09 15:35:04

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH v6] NVMe: conversion to blk-mq

On Mon, Jun 9, 2014 at 11:27 PM, Jens Axboe <[email protected]> wrote:
> On 2014-06-09 08:59, Christoph Hellwig wrote:
>>
>> On Mon, Jun 09, 2014 at 08:08:50AM -0600, Jens Axboe wrote:
>>>>
>>>> [ 487.704074] nvme 0000:00:07.0: Cancelling I/O 202 QID 1
>>>> [ 487.717881] nvme 0000:00:07.0: Aborting I/O 202 QID 1
>>>> [ 487.736093] end_request: I/O error, dev nvme0n1, sector 91532352
>>>> [ 487.747378] nvme 0000:00:07.0: completed id 0 twice on queue 0
>>>>
>>>>
>>>> when running fio randread(libaio, iodepth:64) with more than 3 jobs.
>>>>
>>>> And looks no such failure when jobs is 1 or 2.
>>>
>>>
>>> It's a known issue, make sure you have this patch:
>>>
>>>
>>> https://urldefense.proofpoint.com/v1/url?u=http://git.kernel.dk/?p%3Dlinux-block.git%3Ba%3Dcommit%3Bh%3Df6be4fb4bcb396fc3b1c134b7863351972de081f&k=ZVNjlDMF0FElm4dQtryO4A%3D%3D%0A&r=3JMVyziIyZtZ5cv9eWNLwQ%3D%3D%0A&m=MKK7A%2B9AnXXwtMV9YPSFEprXId0z0WLtd2JQPxnaNVc%3D%0A&s=45425e5d17390ae912457035ff081b95a7084725706ffe91166a1bdb25d8bf3d
>>
>>
>> So without this patch we can get duplicate completions? That could
>> explain the issues that Robert sees with scsi-mq and hpsa..

Use-after-free may cause duplicate completions.

BTW, is the scsi-mq patchset public now?

Thanks,
--
Ming Lei

2014-06-09 15:41:05

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH v6] NVMe: conversion to blk-mq

On 2014-06-09 09:15, Ming Lei wrote:
> On Mon, Jun 9, 2014 at 11:10 PM, Jens Axboe <[email protected]> wrote:
>> On 2014-06-09 09:07, Ming Lei wrote:
>>
>> You are right, I didn't get to the bottom of the thread first.
>> Might be better to just have ->start_time set always, regardless of io stats
>> being enabled or not.
>
> Yes, just like what non-mq code did.

I queued up a fix for it.

--
Jens Axboe

2014-06-10 11:36:04

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v6] NVMe: conversion to blk-mq

On Mon, Jun 09, 2014 at 11:35:00PM +0800, Ming Lei wrote:
>
> Use-after-free may cause duplicate completions.
>
> BTW, is the scsi-mq patchset public now?

It always has been. I've posted it fairly frequently to linux-scsi,
and git://git.infradead.org/users/hch/scsi.git always has the latest
code, with scsi-mq-wip.11 as the current iteration.

2014-06-10 11:55:42

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH v6] NVMe: conversion to blk-mq

On Tue, Jun 10, 2014 at 7:35 PM, Christoph Hellwig <[email protected]> wrote:
> On Mon, Jun 09, 2014 at 11:35:00PM +0800, Ming Lei wrote:
>>
>> Use-after-free may cause duplicate completions.
>>
>> BTW, is the scsi-mq patchset public now?
>
> It always has been. I've posted it fairly frequently to linux-scsi,
> and git://git.infradead.org/users/hch/scsi.git always has the latest
> code, with scsi-mq-wip.11 as the current iteration.

That is great, thanks a lot for sharing it.

Even google can't tell me where the tree is, :-(

Thanks,
--
Ming Lei

2014-06-10 12:40:44

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v6] NVMe: conversion to blk-mq

On Tue, Jun 10, 2014 at 07:55:40PM +0800, Ming Lei wrote:
> That is great, thanks a lot for sharing it.
>
> Even google can't tell me where the tree is, :-(

The git tree was mentioned in every posting of the series to linux-scsi
(and I think lkml as well).

2014-06-10 19:42:18

by Matias Bjørling

[permalink] [raw]
Subject: Re: [PATCH v6] NVMe: conversion to blk-mq

>
> I queued up a fix for it.
>

Rebased nvmemq_review on top of it. Thanks!

2014-06-12 00:58:43

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH v6] NVMe: conversion to blk-mq

On Mon, Jun 9, 2014 at 6:40 PM, Ming Lei <[email protected]> wrote:
> The root cause is that device returns
> NVME_INTERNAL_DEV_ERROR(0x6) with your conversion
> patch.

The above problem is caused by qemu not handling -EAGAIN from
io_submit(), so please ignore the report.


Thanks,
--
Ming Lei