LinuxLists.cc - [PATCH v5] conversion to blk-mq

2014-06-02 20:55:32

Subject: [PATCH v5] conversion to blk-mq

Hi Matthew and Keith,

Here is an updated patch with the feedback from the previous days. It's against
Jens' for-3.16/core tree. You may use the nvmemq_wip_review branch at:

https://github.com/MatiasBjorling/linux-collab nvmemq_wip_review

and see the changes from v4 to v5 here:

https://github.com/MatiasBjorling/linux-collab/commit/e650bc5

Changes since v4:
* Fix timeout retries
* Fix naming in nvme_init_hctx
* Fix racy behavior of admin queue in nvme_dev_remove
* Fix wrong return values in nvme_queue_request
* Put cqe_seen back
* Introduce abort_completion for killing timed out I/Os
* Move locks outside of nvme_submit_iod
* Various renaming and style fixes

Changes since v3:
* Added abortion logic
* Fixed possible race on abortion
* Removed req data with flush. Handled by by blk-mq
* Added safety check for submitting user rq to admin queue.
* Use dev->online_queues for nr_hw_queues
* Fix loop with initialization in nvme_create_io_queues
* Style fixups

Changes since v2:
* rebased on top of current 3.16/core.
* use blk-mq queue management for spreading io queues
* removed rcu handling and allocated all io queues up front for mgmt by blk-mq
* removed the need for hotplugging notification
* fixed flush data handling
* fixed double free of spinlock
* various cleanup

Matias Bjørling (1):
NVMe: conversion to blk-mq

drivers/block/nvme-core.c | 1149 ++++++++++++++++++---------------------------
drivers/block/nvme-scsi.c | 8 +-
include/linux/nvme.h | 14 +-
3 files changed, 478 insertions(+), 693 deletions(-)

--
1.9.1

2014-06-02 20:55:46

by Matias Bjørling

[permalink] [raw]

Subject: [PATCH v5] NVMe: conversion to blk-mq

This converts the current NVMe driver to utilize the blk-mq layer.

Contributions in this patch from:

Sam Bradshaw <[email protected]>
Jens Axboe <[email protected]>
Christoph Hellwig <[email protected]>
Robert Nelson <[email protected]>

Signed-off-by: Matias Bjørling <[email protected]>
---
drivers/block/nvme-core.c | 1149 ++++++++++++++++++---------------------------
drivers/block/nvme-scsi.c | 8 +-
include/linux/nvme.h | 14 +-
3 files changed, 478 insertions(+), 693 deletions(-)

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index 9212dcc..dd556d6 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -13,9 +13,9 @@
*/

#include <linux/nvme.h>
-#include <linux/bio.h>
#include <linux/bitops.h>
#include <linux/blkdev.h>
+#include <linux/blk-mq.h>
#include <linux/cpu.h>
#include <linux/delay.h>
#include <linux/errno.h>
@@ -33,7 +33,6 @@
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/pci.h>
-#include <linux/percpu.h>
#include <linux/poison.h>
#include <linux/ptrace.h>
#include <linux/sched.h>
@@ -42,11 +41,11 @@
#include <scsi/sg.h>
#include <asm-generic/io-64-nonatomic-lo-hi.h>

-#define NVME_Q_DEPTH 1024
+#define NVME_Q_DEPTH 1024
+#define NVME_AQ_DEPTH 64
#define SQ_SIZE(depth) (depth * sizeof(struct nvme_command))
#define CQ_SIZE(depth) (depth * sizeof(struct nvme_completion))
-#define ADMIN_TIMEOUT (60 * HZ)
-#define IOD_TIMEOUT (4 * NVME_IO_TIMEOUT)
+#define ADMIN_TIMEOUT (60 * HZ)

unsigned char io_timeout = 30;
module_param(io_timeout, byte, 0644);
@@ -65,10 +64,12 @@ static struct workqueue_struct *nvme_workq;
static wait_queue_head_t nvme_kthread_wait;

static void nvme_reset_failed_dev(struct work_struct *ws);
+static int nvme_process_cq(struct nvme_queue *nvmeq);

struct async_cmd_info {
struct kthread_work work;
struct kthread_worker *worker;
+ struct request *req;
u32 result;
int status;
void *ctx;
@@ -79,7 +80,6 @@ struct async_cmd_info {
* commands and one for I/O commands).
*/
struct nvme_queue {
- struct rcu_head r_head;
struct device *q_dmadev;
struct nvme_dev *dev;
char irqname[24]; /* nvme4294967295-65535\0 */
@@ -88,10 +88,6 @@ struct nvme_queue {
volatile struct nvme_completion *cqes;
dma_addr_t sq_dma_addr;
dma_addr_t cq_dma_addr;
- wait_queue_head_t sq_full;
- wait_queue_t sq_cong_wait;
- struct bio_list sq_cong;
- struct list_head iod_bio;
u32 __iomem *q_db;
u16 q_depth;
u16 cq_vector;
@@ -104,7 +100,7 @@ struct nvme_queue {
u8 q_suspended;
cpumask_var_t cpu_mask;
struct async_cmd_info cmdinfo;
- unsigned long cmdid_data[];
+ struct blk_mq_hw_ctx *hctx;
};

/*
@@ -132,62 +128,70 @@ typedef void (*nvme_completion_fn)(struct nvme_queue *, void *,
struct nvme_cmd_info {
nvme_completion_fn fn;
void *ctx;
- unsigned long timeout;
int aborted;
+ struct nvme_queue *nvmeq;
};

-static struct nvme_cmd_info *nvme_cmd_info(struct nvme_queue *nvmeq)
+static int nvme_admin_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
+ unsigned int hctx_idx)
{
- return (void *)&nvmeq->cmdid_data[BITS_TO_LONGS(nvmeq->q_depth)];
+ struct nvme_dev *dev = data;
+ struct nvme_queue *nvmeq = dev->queues[0];
+ BUG_ON(!nvmeq);
+ WARN_ON(nvmeq->hctx);
+ nvmeq->hctx = hctx;
+ hctx->driver_data = nvmeq;
+ return 0;
}

-static unsigned nvme_queue_extra(int depth)
+static int nvme_admin_init_request(void *data, struct request *req,
+ unsigned int hctx_idx, unsigned int rq_idx,
+ unsigned int numa_node)
{
- return DIV_ROUND_UP(depth, 8) + (depth * sizeof(struct nvme_cmd_info));
+ struct nvme_dev *dev = data;
+ struct nvme_cmd_info *cmd = blk_mq_rq_to_pdu(req);
+ struct nvme_queue *nvmeq = dev->queues[0];
+ WARN_ON(!nvmeq);
+ WARN_ON(!cmd);
+ cmd->nvmeq = nvmeq;
+ return 0;
}

-/**
- * alloc_cmdid() - Allocate a Command ID
- * @nvmeq: The queue that will be used for this command
- * @ctx: A pointer that will be passed to the handler
- * @handler: The function to call on completion
- *
- * Allocate a Command ID for a queue. The data passed in will
- * be passed to the completion handler. This is implemented by using
- * the bottom two bits of the ctx pointer to store the handler ID.
- * Passing in a pointer that's not 4-byte aligned will cause a BUG.
- * We can change this if it becomes a problem.
- *
- * May be called with local interrupts disabled and the q_lock held,
- * or with interrupts enabled and no locks held.
- */
-static int alloc_cmdid(struct nvme_queue *nvmeq, void *ctx,
- nvme_completion_fn handler, unsigned timeout)
+static int nvme_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
+ unsigned int hctx_idx)
{
- int depth = nvmeq->q_depth - 1;
- struct nvme_cmd_info *info = nvme_cmd_info(nvmeq);
- int cmdid;
-
- do {
- cmdid = find_first_zero_bit(nvmeq->cmdid_data, depth);
- if (cmdid >= depth)
- return -EBUSY;
- } while (test_and_set_bit(cmdid, nvmeq->cmdid_data));
+ struct nvme_dev *dev = data;
+ struct nvme_queue *nvmeq = dev->queues[(hctx_idx % dev->queue_count)
+ + 1];
+ /* nvmeq queues are shared between namespaces. We assume here that
+ * blk-mq map the tags so they match up with the nvme queue tags */
+ if (!nvmeq->hctx)
+ nvmeq->hctx = hctx;
+ else
+ WARN_ON(nvmeq->hctx->tags != hctx->tags);
+ hctx->driver_data = nvmeq;
+ return 0;
+}

- info[cmdid].fn = handler;
- info[cmdid].ctx = ctx;
- info[cmdid].timeout = jiffies + timeout;
- info[cmdid].aborted = 0;
- return cmdid;
+static int nvme_init_request(void *data, struct request *req,
+ unsigned int hctx_idx, unsigned int rq_idx,
+ unsigned int numa_node)
+{
+ struct nvme_dev *dev = data;
+ struct nvme_cmd_info *cmd = blk_mq_rq_to_pdu(req);
+ struct nvme_queue *nvmeq = dev->queues[hctx_idx + 1];
+ WARN_ON(!nvmeq);
+ WARN_ON(!cmd);
+ cmd->nvmeq = nvmeq;
+ return 0;
}

-static int alloc_cmdid_killable(struct nvme_queue *nvmeq, void *ctx,
- nvme_completion_fn handler, unsigned timeout)
+static void nvme_set_info(struct nvme_cmd_info *cmd, void *ctx,
+ nvme_completion_fn handler)
{
- int cmdid;
- wait_event_killable(nvmeq->sq_full,
- (cmdid = alloc_cmdid(nvmeq, ctx, handler, timeout)) >= 0);
- return (cmdid < 0) ? -EINTR : cmdid;
+ cmd->fn = handler;
+ cmd->ctx = ctx;
+ cmd->aborted = 0;
}

/* Special values must be less than 0x1000 */
@@ -195,20 +199,12 @@ static int alloc_cmdid_killable(struct nvme_queue *nvmeq, void *ctx,
#define CMD_CTX_CANCELLED (0x30C + CMD_CTX_BASE)
#define CMD_CTX_COMPLETED (0x310 + CMD_CTX_BASE)
#define CMD_CTX_INVALID (0x314 + CMD_CTX_BASE)
-#define CMD_CTX_FLUSH (0x318 + CMD_CTX_BASE)
-#define CMD_CTX_ABORT (0x31C + CMD_CTX_BASE)

static void special_completion(struct nvme_queue *nvmeq, void *ctx,
struct nvme_completion *cqe)
{
if (ctx == CMD_CTX_CANCELLED)
return;
- if (ctx == CMD_CTX_FLUSH)
- return;
- if (ctx == CMD_CTX_ABORT) {
- ++nvmeq->dev->abort_limit;
- return;
- }
if (ctx == CMD_CTX_COMPLETED) {
dev_warn(nvmeq->q_dmadev,
"completed id %d twice on queue %d\n",
@@ -225,6 +221,42 @@ static void special_completion(struct nvme_queue *nvmeq, void *ctx,
dev_warn(nvmeq->q_dmadev, "Unknown special completion %p\n", ctx);
}

+static void *cancel_cmd_info(struct nvme_cmd_info *cmd, nvme_completion_fn *fn)
+{
+ void *ctx;
+ if (fn)
+ *fn = cmd->fn;
+ ctx = cmd->ctx;
+ cmd->fn = special_completion;
+ cmd->ctx = CMD_CTX_CANCELLED;
+ return ctx;
+}
+
+static void abort_completion(struct nvme_queue *nvmeq, void *ctx,
+ struct nvme_completion *cqe)
+{
+ struct request *req;
+ struct nvme_cmd_info *aborted = ctx;
+ struct nvme_queue *a_nvmeq = aborted->nvmeq;
+ void *a_ctx;
+ nvme_completion_fn a_fn;
+ static struct nvme_completion a_cqe = {
+ .status = cpu_to_le16(NVME_SC_ABORT_REQ << 1),
+ };
+
+ req = blk_mq_tag_to_rq(nvmeq->hctx, cqe->command_id);
+ blk_put_request(req);
+
+ if (!cqe->status)
+ dev_warn(nvmeq->q_dmadev, "Could not abort I/O %d QID %d",
+ req->tag, nvmeq->qid);
+
+ a_ctx = cancel_cmd_info(aborted, &a_fn);
+ a_fn(a_nvmeq, a_ctx, &a_cqe);
+
+ ++nvmeq->dev->abort_limit;
+}
+
static void async_completion(struct nvme_queue *nvmeq, void *ctx,
struct nvme_completion *cqe)
{
@@ -232,75 +264,37 @@ static void async_completion(struct nvme_queue *nvmeq, void *ctx,
cmdinfo->result = le32_to_cpup(&cqe->result);
cmdinfo->status = le16_to_cpup(&cqe->status) >> 1;
queue_kthread_work(cmdinfo->worker, &cmdinfo->work);
+ blk_put_request(cmdinfo->req);
+}
+
+static inline struct nvme_cmd_info *get_cmd_from_tag(struct nvme_queue *nvmeq,
+ unsigned int tag)
+{
+ struct blk_mq_hw_ctx *hctx = nvmeq->hctx;
+ struct request *req = blk_mq_tag_to_rq(hctx, tag);
+ return blk_mq_rq_to_pdu(req);
}

/*
* Called with local interrupts disabled and the q_lock held. May not sleep.
*/
-static void *free_cmdid(struct nvme_queue *nvmeq, int cmdid,
+static void *nvme_finish_cmd(struct nvme_queue *nvmeq, int tag,
nvme_completion_fn *fn)
{
+ struct nvme_cmd_info *cmd = get_cmd_from_tag(nvmeq, tag);
void *ctx;
- struct nvme_cmd_info *info = nvme_cmd_info(nvmeq);
-
- if (cmdid >= nvmeq->q_depth || !info[cmdid].fn) {
- if (fn)
- *fn = special_completion;
+ if (tag >= nvmeq->q_depth) {
+ *fn = special_completion;
return CMD_CTX_INVALID;
}
if (fn)
- *fn = info[cmdid].fn;
- ctx = info[cmdid].ctx;
- info[cmdid].fn = special_completion;
- info[cmdid].ctx = CMD_CTX_COMPLETED;
- clear_bit(cmdid, nvmeq->cmdid_data);
- wake_up(&nvmeq->sq_full);
+ *fn = cmd->fn;
+ ctx = cmd->ctx;
+ cmd->fn = special_completion;
+ cmd->ctx = CMD_CTX_COMPLETED;
return ctx;
}

-static void *cancel_cmdid(struct nvme_queue *nvmeq, int cmdid,
- nvme_completion_fn *fn)
-{
- void *ctx;
- struct nvme_cmd_info *info = nvme_cmd_info(nvmeq);
- if (fn)
- *fn = info[cmdid].fn;
- ctx = info[cmdid].ctx;
- info[cmdid].fn = special_completion;
- info[cmdid].ctx = CMD_CTX_CANCELLED;
- return ctx;
-}
-
-static struct nvme_queue *raw_nvmeq(struct nvme_dev *dev, int qid)
-{
- return rcu_dereference_raw(dev->queues[qid]);
-}
-
-static struct nvme_queue *get_nvmeq(struct nvme_dev *dev) __acquires(RCU)
-{
- unsigned queue_id = get_cpu_var(*dev->io_queue);
- rcu_read_lock();
- return rcu_dereference(dev->queues[queue_id]);
-}
-
-static void put_nvmeq(struct nvme_queue *nvmeq) __releases(RCU)
-{
- rcu_read_unlock();
- put_cpu_var(nvmeq->dev->io_queue);
-}
-
-static struct nvme_queue *lock_nvmeq(struct nvme_dev *dev, int q_idx)
- __acquires(RCU)
-{
- rcu_read_lock();
- return rcu_dereference(dev->queues[q_idx]);
-}
-
-static void unlock_nvmeq(struct nvme_queue *nvmeq) __releases(RCU)
-{
- rcu_read_unlock();
-}
-
/**
* nvme_submit_cmd() - Copy a command into a queue and ring the doorbell
* @nvmeq: The queue to use
@@ -357,7 +351,6 @@ nvme_alloc_iod(unsigned nseg, unsigned nbytes, gfp_t gfp)
iod->length = nbytes;
iod->nents = 0;
iod->first_dma = 0ULL;
- iod->start_time = jiffies;
}

return iod;
@@ -381,59 +374,31 @@ void nvme_free_iod(struct nvme_dev *dev, struct nvme_iod *iod)
kfree(iod);
}

-static void nvme_start_io_acct(struct bio *bio)
-{
- struct gendisk *disk = bio->bi_bdev->bd_disk;
- const int rw = bio_data_dir(bio);
- int cpu = part_stat_lock();
- part_round_stats(cpu, &disk->part0);
- part_stat_inc(cpu, &disk->part0, ios[rw]);
- part_stat_add(cpu, &disk->part0, sectors[rw], bio_sectors(bio));
- part_inc_in_flight(&disk->part0, rw);
- part_stat_unlock();
-}
-
-static void nvme_end_io_acct(struct bio *bio, unsigned long start_time)
-{
- struct gendisk *disk = bio->bi_bdev->bd_disk;
- const int rw = bio_data_dir(bio);
- unsigned long duration = jiffies - start_time;
- int cpu = part_stat_lock();
- part_stat_add(cpu, &disk->part0, ticks[rw], duration);
- part_round_stats(cpu, &disk->part0);
- part_dec_in_flight(&disk->part0, rw);
- part_stat_unlock();
-}
-
-static void bio_completion(struct nvme_queue *nvmeq, void *ctx,
+static void req_completion(struct nvme_queue *nvmeq, void *ctx,
struct nvme_completion *cqe)
{
struct nvme_iod *iod = ctx;
- struct bio *bio = iod->private;
+ struct request *req = iod->private;
+
u16 status = le16_to_cpup(&cqe->status) >> 1;

if (unlikely(status)) {
- if (!(status & NVME_SC_DNR ||
- bio->bi_rw & REQ_FAILFAST_MASK) &&
- (jiffies - iod->start_time) < IOD_TIMEOUT) {
- if (!waitqueue_active(&nvmeq->sq_full))
- add_wait_queue(&nvmeq->sq_full,
- &nvmeq->sq_cong_wait);
- list_add_tail(&iod->node, &nvmeq->iod_bio);
- wake_up(&nvmeq->sq_full);
+ if (!(status & NVME_SC_DNR || blk_noretry_request(req))
+ && (jiffies - req->start_time) < req->timeout) {
+ blk_mq_requeue_request(req);
return;
}
- }
+ req->errors = -EIO;
+ } else
+ req->errors = 0;
+
if (iod->nents) {
- dma_unmap_sg(nvmeq->q_dmadev, iod->sg, iod->nents,
- bio_data_dir(bio) ? DMA_TO_DEVICE : DMA_FROM_DEVICE);
- nvme_end_io_acct(bio, iod->start_time);
+ dma_unmap_sg(&nvmeq->dev->pci_dev->dev, iod->sg, iod->nents,
+ rq_data_dir(req) ? DMA_TO_DEVICE : DMA_FROM_DEVICE);
}
nvme_free_iod(nvmeq->dev, iod);
- if (status)
- bio_endio(bio, -EIO);
- else
- bio_endio(bio, 0);
+
+ blk_mq_complete_request(req);
}

/* length is in bytes. gfp flags indicates whether we may sleep. */
@@ -515,86 +480,25 @@ int nvme_setup_prps(struct nvme_dev *dev, struct nvme_iod *iod, int total_len,
return total_len;
}

-static int nvme_split_and_submit(struct bio *bio, struct nvme_queue *nvmeq,
- int len)
-{
- struct bio *split = bio_split(bio, len >> 9, GFP_ATOMIC, NULL);
- if (!split)
- return -ENOMEM;
-
- bio_chain(split, bio);
-
- if (!waitqueue_active(&nvmeq->sq_full))
- add_wait_queue(&nvmeq->sq_full, &nvmeq->sq_cong_wait);
- bio_list_add(&nvmeq->sq_cong, split);
- bio_list_add(&nvmeq->sq_cong, bio);
- wake_up(&nvmeq->sq_full);
-
- return 0;
-}
-
-/* NVMe scatterlists require no holes in the virtual address */
-#define BIOVEC_NOT_VIRT_MERGEABLE(vec1, vec2) ((vec2)->bv_offset || \
- (((vec1)->bv_offset + (vec1)->bv_len) % PAGE_SIZE))
-
-static int nvme_map_bio(struct nvme_queue *nvmeq, struct nvme_iod *iod,
- struct bio *bio, enum dma_data_direction dma_dir, int psegs)
-{
- struct bio_vec bvec, bvprv;
- struct bvec_iter iter;
- struct scatterlist *sg = NULL;
- int length = 0, nsegs = 0, split_len = bio->bi_iter.bi_size;
- int first = 1;
-
- if (nvmeq->dev->stripe_size)
- split_len = nvmeq->dev->stripe_size -
- ((bio->bi_iter.bi_sector << 9) &
- (nvmeq->dev->stripe_size - 1));
-
- sg_init_table(iod->sg, psegs);
- bio_for_each_segment(bvec, bio, iter) {
- if (!first && BIOVEC_PHYS_MERGEABLE(&bvprv, &bvec)) {
- sg->length += bvec.bv_len;
- } else {
- if (!first && BIOVEC_NOT_VIRT_MERGEABLE(&bvprv, &bvec))
- return nvme_split_and_submit(bio, nvmeq,
- length);
-
- sg = sg ? sg + 1 : iod->sg;
- sg_set_page(sg, bvec.bv_page,
- bvec.bv_len, bvec.bv_offset);
- nsegs++;
- }
-
- if (split_len - length < bvec.bv_len)
- return nvme_split_and_submit(bio, nvmeq, split_len);
- length += bvec.bv_len;
- bvprv = bvec;
- first = 0;
- }
- iod->nents = nsegs;
- sg_mark_end(sg);
- if (dma_map_sg(nvmeq->q_dmadev, iod->sg, iod->nents, dma_dir) == 0)
- return -ENOMEM;
-
- BUG_ON(length != bio->bi_iter.bi_size);
- return length;
-}
-
-static int nvme_submit_discard(struct nvme_queue *nvmeq, struct nvme_ns *ns,
- struct bio *bio, struct nvme_iod *iod, int cmdid)
+/*
+ * We reuse the small pool to allocate the 16-byte range here as it is not
+ * worth having a special pool for these or additional cases to handle freeing
+ * the iod.
+ */
+static void nvme_submit_discard(struct nvme_queue *nvmeq, struct nvme_ns *ns,
+ struct request *req, struct nvme_iod *iod)
{
struct nvme_dsm_range *range =
(struct nvme_dsm_range *)iod_list(iod)[0];
struct nvme_command *cmnd = &nvmeq->sq_cmds[nvmeq->sq_tail];

range->cattr = cpu_to_le32(0);
- range->nlb = cpu_to_le32(bio->bi_iter.bi_size >> ns->lba_shift);
- range->slba = cpu_to_le64(nvme_block_nr(ns, bio->bi_iter.bi_sector));
+ range->nlb = cpu_to_le32(blk_rq_bytes(req) >> ns->lba_shift);
+ range->slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));

memset(cmnd, 0, sizeof(*cmnd));
cmnd->dsm.opcode = nvme_cmd_dsm;
- cmnd->dsm.command_id = cmdid;
+ cmnd->dsm.command_id = req->tag;
cmnd->dsm.nsid = cpu_to_le32(ns->ns_id);
cmnd->dsm.prp1 = cpu_to_le64(iod->first_dma);
cmnd->dsm.nr = 0;
@@ -603,11 +507,9 @@ static int nvme_submit_discard(struct nvme_queue *nvmeq, struct nvme_ns *ns,
if (++nvmeq->sq_tail == nvmeq->q_depth)
nvmeq->sq_tail = 0;
writel(nvmeq->sq_tail, nvmeq->q_db);
-
- return 0;
}

-static int nvme_submit_flush(struct nvme_queue *nvmeq, struct nvme_ns *ns,
+static void nvme_submit_flush(struct nvme_queue *nvmeq, struct nvme_ns *ns,
int cmdid)
{
struct nvme_command *cmnd = &nvmeq->sq_cmds[nvmeq->sq_tail];
@@ -620,59 +522,34 @@ static int nvme_submit_flush(struct nvme_queue *nvmeq, struct nvme_ns *ns,
if (++nvmeq->sq_tail == nvmeq->q_depth)
nvmeq->sq_tail = 0;
writel(nvmeq->sq_tail, nvmeq->q_db);
-
- return 0;
}

-int nvme_submit_flush_data(struct nvme_queue *nvmeq, struct nvme_ns *ns)
+static int nvme_submit_iod(struct nvme_queue *nvmeq, struct nvme_iod *iod,
+ struct nvme_ns *ns)
{
- int cmdid = alloc_cmdid(nvmeq, (void *)CMD_CTX_FLUSH,
- special_completion, NVME_IO_TIMEOUT);
- if (unlikely(cmdid < 0))
- return cmdid;
-
- return nvme_submit_flush(nvmeq, ns, cmdid);
-}
-
-static int nvme_submit_iod(struct nvme_queue *nvmeq, struct nvme_iod *iod)
-{
- struct bio *bio = iod->private;
- struct nvme_ns *ns = bio->bi_bdev->bd_disk->private_data;
+ struct request *req = iod->private;
struct nvme_command *cmnd;
- int cmdid;
- u16 control;
- u32 dsmgmt;
-
- cmdid = alloc_cmdid(nvmeq, iod, bio_completion, NVME_IO_TIMEOUT);
- if (unlikely(cmdid < 0))
- return cmdid;
+ u16 control = 0;
+ u32 dsmgmt = 0;

- if (bio->bi_rw & REQ_DISCARD)
- return nvme_submit_discard(nvmeq, ns, bio, iod, cmdid);
- if ((bio->bi_rw & REQ_FLUSH) && !iod->nents)
- return nvme_submit_flush(nvmeq, ns, cmdid);
-
- control = 0;
- if (bio->bi_rw & REQ_FUA)
+ if (req->cmd_flags & REQ_FUA)
control |= NVME_RW_FUA;
- if (bio->bi_rw & (REQ_FAILFAST_DEV | REQ_RAHEAD))
+ if (req->cmd_flags & (REQ_FAILFAST_DEV | REQ_RAHEAD))
control |= NVME_RW_LR;

- dsmgmt = 0;
- if (bio->bi_rw & REQ_RAHEAD)
+ if (req->cmd_flags & REQ_RAHEAD)
dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;

cmnd = &nvmeq->sq_cmds[nvmeq->sq_tail];
memset(cmnd, 0, sizeof(*cmnd));

- cmnd->rw.opcode = bio_data_dir(bio) ? nvme_cmd_write : nvme_cmd_read;
- cmnd->rw.command_id = cmdid;
+ cmnd->rw.opcode = (rq_data_dir(req) ? nvme_cmd_write : nvme_cmd_read);
+ cmnd->rw.command_id = req->tag;
cmnd->rw.nsid = cpu_to_le32(ns->ns_id);
cmnd->rw.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
cmnd->rw.prp2 = cpu_to_le64(iod->first_dma);
- cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, bio->bi_iter.bi_sector));
- cmnd->rw.length =
- cpu_to_le16((bio->bi_iter.bi_size >> ns->lba_shift) - 1);
+ cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
+ cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
cmnd->rw.control = cpu_to_le16(control);
cmnd->rw.dsmgmt = cpu_to_le32(dsmgmt);

@@ -683,28 +560,32 @@ static int nvme_submit_iod(struct nvme_queue *nvmeq, struct nvme_iod *iod)
return 0;
}

-/*
- * Called with local interrupts disabled and the q_lock held. May not sleep.
- */
-static int nvme_submit_bio_queue(struct nvme_queue *nvmeq, struct nvme_ns *ns,
- struct bio *bio)
+static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
{
+ struct nvme_ns *ns = hctx->queue->queuedata;
+ struct nvme_queue *nvmeq = hctx->driver_data;
+ struct nvme_cmd_info *cmd = blk_mq_rq_to_pdu(req);
struct nvme_iod *iod;
- int psegs = bio_phys_segments(ns->queue, bio);
- int result;
-
- if ((bio->bi_rw & REQ_FLUSH) && psegs) {
- result = nvme_submit_flush_data(nvmeq, ns);
- if (result)
- return result;
- }
+ enum dma_data_direction dma_dir;
+ int psegs = req->nr_phys_segments;
+ int result = BLK_MQ_RQ_QUEUE_BUSY;
+ /*
+ * Requeued IO has already been prepped
+ */
+ iod = req->special;
+ if (iod)
+ goto submit_iod;

- iod = nvme_alloc_iod(psegs, bio->bi_iter.bi_size, GFP_ATOMIC);
+ iod = nvme_alloc_iod(psegs, blk_rq_bytes(req), GFP_ATOMIC);
if (!iod)
- return -ENOMEM;
+ return result;
+
+ iod->private = req;
+ req->special = iod;

- iod->private = bio;
- if (bio->bi_rw & REQ_DISCARD) {
+ nvme_set_info(cmd, iod, req_completion);
+
+ if (req->cmd_flags & REQ_DISCARD) {
void *range;
/*
* We reuse the small pool to allocate the 16-byte range here
@@ -714,33 +595,53 @@ static int nvme_submit_bio_queue(struct nvme_queue *nvmeq, struct nvme_ns *ns,
range = dma_pool_alloc(nvmeq->dev->prp_small_pool,
GFP_ATOMIC,
&iod->first_dma);
- if (!range) {
- result = -ENOMEM;
- goto free_iod;
- }
+ if (!range)
+ goto finish_cmd;
iod_list(iod)[0] = (__le64 *)range;
iod->npages = 0;
} else if (psegs) {
- result = nvme_map_bio(nvmeq, iod, bio,
- bio_data_dir(bio) ? DMA_TO_DEVICE : DMA_FROM_DEVICE,
- psegs);
- if (result <= 0)
- goto free_iod;
- if (nvme_setup_prps(nvmeq->dev, iod, result, GFP_ATOMIC) !=
- result) {
- result = -ENOMEM;
- goto free_iod;
+ dma_dir = rq_data_dir(req) ? DMA_TO_DEVICE : DMA_FROM_DEVICE;
+
+ sg_init_table(iod->sg, psegs);
+ iod->nents = blk_rq_map_sg(req->q, req, iod->sg);
+ if (!iod->nents) {
+ result = BLK_MQ_RQ_QUEUE_ERROR;
+ goto finish_cmd;
}
- nvme_start_io_acct(bio);
+
+ if (!dma_map_sg(nvmeq->q_dmadev, iod->sg, iod->nents, dma_dir))
+ goto finish_cmd;
+
+ if (blk_rq_bytes(req) != nvme_setup_prps(nvmeq->dev, iod,
+ blk_rq_bytes(req), GFP_ATOMIC))
+ goto finish_cmd;
+ }
+
+ submit_iod:
+ spin_lock_irq(&nvmeq->q_lock);
+ if (nvmeq->q_suspended) {
+ spin_unlock_irq(&nvmeq->q_lock);
+ goto finish_cmd;
}
- if (unlikely(nvme_submit_iod(nvmeq, iod))) {
- if (!waitqueue_active(&nvmeq->sq_full))
- add_wait_queue(&nvmeq->sq_full, &nvmeq->sq_cong_wait);
- list_add_tail(&iod->node, &nvmeq->iod_bio);
+
+ if (req->cmd_flags & REQ_DISCARD) {
+ nvme_submit_discard(nvmeq, ns, req, iod);
+ goto queued;
}
- return 0;

- free_iod:
+ if (req->cmd_flags & REQ_FLUSH) {
+ nvme_submit_flush(nvmeq, ns, req->tag);
+ goto queued;
+ }
+
+ nvme_submit_iod(nvmeq, iod, ns);
+ queued:
+ nvme_process_cq(nvmeq);
+ spin_unlock_irq(&nvmeq->q_lock);
+ return BLK_MQ_RQ_QUEUE_OK;
+
+ finish_cmd:
+ nvme_finish_cmd(nvmeq, req->tag, NULL);
nvme_free_iod(nvmeq->dev, iod);
return result;
}
@@ -763,8 +664,7 @@ static int nvme_process_cq(struct nvme_queue *nvmeq)
head = 0;
phase = !phase;
}
-
- ctx = free_cmdid(nvmeq, cqe.command_id, &fn);
+ ctx = nvme_finish_cmd(nvmeq, cqe.command_id, &fn);
fn(nvmeq, ctx, &cqe);
}

@@ -785,30 +685,12 @@ static int nvme_process_cq(struct nvme_queue *nvmeq)
return 1;
}

-static void nvme_make_request(struct request_queue *q, struct bio *bio)
+/* Admin queue isn't initialized as a request queue. If at some point this
+ * happens anyway, make sure to notify the user */
+static int nvme_admin_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
{
- struct nvme_ns *ns = q->queuedata;
- struct nvme_queue *nvmeq = get_nvmeq(ns->dev);
- int result = -EBUSY;
-
- if (!nvmeq) {
- put_nvmeq(NULL);
- bio_endio(bio, -EIO);
- return;
- }
-
- spin_lock_irq(&nvmeq->q_lock);
- if (!nvmeq->q_suspended && bio_list_empty(&nvmeq->sq_cong))
- result = nvme_submit_bio_queue(nvmeq, ns, bio);
- if (unlikely(result)) {
- if (!waitqueue_active(&nvmeq->sq_full))
- add_wait_queue(&nvmeq->sq_full, &nvmeq->sq_cong_wait);
- bio_list_add(&nvmeq->sq_cong, bio);
- }
-
- nvme_process_cq(nvmeq);
- spin_unlock_irq(&nvmeq->q_lock);
- put_nvmeq(nvmeq);
+ WARN_ON(1);
+ return BLK_MQ_RQ_QUEUE_ERROR;
}

static irqreturn_t nvme_irq(int irq, void *data)
@@ -832,10 +714,11 @@ static irqreturn_t nvme_irq_check(int irq, void *data)
return IRQ_WAKE_THREAD;
}

-static void nvme_abort_command(struct nvme_queue *nvmeq, int cmdid)
+static void nvme_abort_cmd_info(struct nvme_queue *nvmeq, struct nvme_cmd_info *
+ cmd_info)
{
spin_lock_irq(&nvmeq->q_lock);
- cancel_cmdid(nvmeq, cmdid, NULL);
+ cancel_cmd_info(cmd_info, NULL);
spin_unlock_irq(&nvmeq->q_lock);
}

@@ -858,46 +741,31 @@ static void sync_completion(struct nvme_queue *nvmeq, void *ctx,
* Returns 0 on success. If the result is negative, it's a Linux error code;
* if the result is positive, it's an NVM Express status code
*/
-static int nvme_submit_sync_cmd(struct nvme_dev *dev, int q_idx,
- struct nvme_command *cmd,
+static int nvme_submit_sync_cmd(struct request *req, struct nvme_command *cmd,
u32 *result, unsigned timeout)
{
- int cmdid, ret;
+ int ret;
struct sync_cmd_info cmdinfo;
- struct nvme_queue *nvmeq;
-
- nvmeq = lock_nvmeq(dev, q_idx);
- if (!nvmeq) {
- unlock_nvmeq(nvmeq);
- return -ENODEV;
- }
+ struct nvme_cmd_info *cmd_rq = blk_mq_rq_to_pdu(req);
+ struct nvme_queue *nvmeq = cmd_rq->nvmeq;

cmdinfo.task = current;
cmdinfo.status = -EINTR;

- cmdid = alloc_cmdid(nvmeq, &cmdinfo, sync_completion, timeout);
- if (cmdid < 0) {
- unlock_nvmeq(nvmeq);
- return cmdid;
- }
- cmd->common.command_id = cmdid;
+ cmd->common.command_id = req->tag;
+
+ nvme_set_info(cmd_rq, &cmdinfo, sync_completion);

set_current_state(TASK_KILLABLE);
ret = nvme_submit_cmd(nvmeq, cmd);
if (ret) {
- free_cmdid(nvmeq, cmdid, NULL);
- unlock_nvmeq(nvmeq);
+ nvme_finish_cmd(nvmeq, req->tag, NULL);
set_current_state(TASK_RUNNING);
- return ret;
}
- unlock_nvmeq(nvmeq);
schedule_timeout(timeout);

if (cmdinfo.status == -EINTR) {
- nvmeq = lock_nvmeq(dev, q_idx);
- if (nvmeq)
- nvme_abort_command(nvmeq, cmdid);
- unlock_nvmeq(nvmeq);
+ nvme_abort_cmd_info(nvmeq, blk_mq_rq_to_pdu(req));
return -EINTR;
}

@@ -907,59 +775,78 @@ static int nvme_submit_sync_cmd(struct nvme_dev *dev, int q_idx,
return cmdinfo.status;
}

-static int nvme_submit_async_cmd(struct nvme_queue *nvmeq,
+static int nvme_submit_admin_async_cmd(struct nvme_dev *dev,
struct nvme_command *cmd,
struct async_cmd_info *cmdinfo, unsigned timeout)
{
- int cmdid;
+ struct nvme_queue *nvmeq = dev->queues[0];
+ struct request *req;
+ struct nvme_cmd_info *cmd_rq;

- cmdid = alloc_cmdid_killable(nvmeq, cmdinfo, async_completion, timeout);
- if (cmdid < 0)
- return cmdid;
+ req = blk_mq_alloc_request(dev->admin_q, WRITE, GFP_KERNEL, false);
+ if (!req)
+ return -ENOMEM;
+
+ req->timeout = timeout;
+ cmd_rq = blk_mq_rq_to_pdu(req);
+ cmdinfo->req = req;
+ nvme_set_info(cmd_rq, cmdinfo, async_completion);
cmdinfo->status = -EINTR;
- cmd->common.command_id = cmdid;
+
+ cmd->common.command_id = req->tag;
+
return nvme_submit_cmd(nvmeq, cmd);
}

-int nvme_submit_admin_cmd(struct nvme_dev *dev, struct nvme_command *cmd,
- u32 *result)
+int __nvme_submit_admin_cmd(struct nvme_dev *dev, struct nvme_command *cmd,
+ u32 *result, unsigned timeout)
{
- return nvme_submit_sync_cmd(dev, 0, cmd, result, ADMIN_TIMEOUT);
+ int res;
+ struct request *req;
+
+ req = blk_mq_alloc_request(dev->admin_q, WRITE, GFP_KERNEL, false);
+ if (!req)
+ return -ENOMEM;
+ res = nvme_submit_sync_cmd(req, cmd, result, timeout);
+ blk_put_request(req);
+ return res;
}

-int nvme_submit_io_cmd(struct nvme_dev *dev, struct nvme_command *cmd,
+int nvme_submit_admin_cmd(struct nvme_dev *dev, struct nvme_command *cmd,
u32 *result)
{
- return nvme_submit_sync_cmd(dev, smp_processor_id() + 1, cmd, result,
- NVME_IO_TIMEOUT);
+ return __nvme_submit_admin_cmd(dev, cmd, result, ADMIN_TIMEOUT);
}

-static int nvme_submit_admin_cmd_async(struct nvme_dev *dev,
- struct nvme_command *cmd, struct async_cmd_info *cmdinfo)
+int nvme_submit_io_cmd(struct nvme_dev *dev, struct nvme_ns *ns,
+ struct nvme_command *cmd, u32 *result)
{
- return nvme_submit_async_cmd(raw_nvmeq(dev, 0), cmd, cmdinfo,
- ADMIN_TIMEOUT);
+ int res;
+ struct request *req;
+
+ req = blk_mq_alloc_request(ns->queue, WRITE, (GFP_KERNEL|__GFP_WAIT),
+ false);
+ if (!req)
+ return -ENOMEM;
+ res = nvme_submit_sync_cmd(req, cmd, result, NVME_IO_TIMEOUT);
+ blk_put_request(req);
+ return res;
}

static int adapter_delete_queue(struct nvme_dev *dev, u8 opcode, u16 id)
{
- int status;
struct nvme_command c;

memset(&c, 0, sizeof(c));
c.delete_queue.opcode = opcode;
c.delete_queue.qid = cpu_to_le16(id);

- status = nvme_submit_admin_cmd(dev, &c, NULL);
- if (status)
- return -EIO;
- return 0;
+ return nvme_submit_admin_cmd(dev, &c, NULL);
}

static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid,
struct nvme_queue *nvmeq)
{
- int status;
struct nvme_command c;
int flags = NVME_QUEUE_PHYS_CONTIG | NVME_CQ_IRQ_ENABLED;

@@ -971,16 +858,12 @@ static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid,
c.create_cq.cq_flags = cpu_to_le16(flags);
c.create_cq.irq_vector = cpu_to_le16(nvmeq->cq_vector);

- status = nvme_submit_admin_cmd(dev, &c, NULL);
- if (status)
- return -EIO;
- return 0;
+ return nvme_submit_admin_cmd(dev, &c, NULL);
}

static int adapter_alloc_sq(struct nvme_dev *dev, u16 qid,
struct nvme_queue *nvmeq)
{
- int status;
struct nvme_command c;
int flags = NVME_QUEUE_PHYS_CONTIG | NVME_SQ_PRIO_MEDIUM;

@@ -992,10 +875,7 @@ static int adapter_alloc_sq(struct nvme_dev *dev, u16 qid,
c.create_sq.sq_flags = cpu_to_le16(flags);
c.create_sq.cqid = cpu_to_le16(qid);

- status = nvme_submit_admin_cmd(dev, &c, NULL);
- if (status)
- return -EIO;
- return 0;
+ return nvme_submit_admin_cmd(dev, &c, NULL);
}

static int adapter_delete_cq(struct nvme_dev *dev, u16 cqid)
@@ -1051,28 +931,27 @@ int nvme_set_features(struct nvme_dev *dev, unsigned fid, unsigned dword11,
}

/**
- * nvme_abort_cmd - Attempt aborting a command
- * @cmdid: Command id of a timed out IO
- * @queue: The queue with timed out IO
+ * nvme_abort_req - Attempt aborting a request
*
* Schedule controller reset if the command was already aborted once before and
* still hasn't been returned to the driver, or if this is the admin queue.
*/
-static void nvme_abort_cmd(int cmdid, struct nvme_queue *nvmeq)
+static void nvme_abort_req(struct request *req)
{
- int a_cmdid;
- struct nvme_command cmd;
+ struct nvme_cmd_info *cmd_rq = blk_mq_rq_to_pdu(req);
+ struct nvme_queue *nvmeq = cmd_rq->nvmeq;
struct nvme_dev *dev = nvmeq->dev;
- struct nvme_cmd_info *info = nvme_cmd_info(nvmeq);
- struct nvme_queue *adminq;
+ struct request *abort_req;
+ struct nvme_cmd_info *abort_cmd;
+ struct nvme_command cmd;

- if (!nvmeq->qid || info[cmdid].aborted) {
+ if (!nvmeq->qid || cmd_rq->aborted) {
if (work_busy(&dev->reset_work))
return;
list_del_init(&dev->node);
dev_warn(&dev->pci_dev->dev,
- "I/O %d QID %d timeout, reset controller\n", cmdid,
- nvmeq->qid);
+ "I/O %d QID %d timeout, reset controller\n",
+ req->tag, nvmeq->qid);
dev->reset_workfn = nvme_reset_failed_dev;
queue_work(nvme_workq, &dev->reset_work);
return;
@@ -1081,83 +960,98 @@ static void nvme_abort_cmd(int cmdid, struct nvme_queue *nvmeq)
if (!dev->abort_limit)
return;

- adminq = rcu_dereference(dev->queues[0]);
- a_cmdid = alloc_cmdid(adminq, CMD_CTX_ABORT, special_completion,
- ADMIN_TIMEOUT);
- if (a_cmdid < 0)
+ abort_req = blk_mq_alloc_request(dev->admin_q, WRITE, GFP_ATOMIC,
+ true);
+ if (!abort_req)
return;

+ abort_cmd = blk_mq_rq_to_pdu(abort_req);
+ nvme_set_info(abort_cmd, cmd_rq, abort_completion);
+
memset(&cmd, 0, sizeof(cmd));
cmd.abort.opcode = nvme_admin_abort_cmd;
- cmd.abort.cid = cmdid;
+ cmd.abort.cid = req->tag;
cmd.abort.sqid = cpu_to_le16(nvmeq->qid);
- cmd.abort.command_id = a_cmdid;
+ cmd.abort.command_id = abort_req->tag;

--dev->abort_limit;
- info[cmdid].aborted = 1;
- info[cmdid].timeout = jiffies + ADMIN_TIMEOUT;
+ cmd_rq->aborted = 1;

- dev_warn(nvmeq->q_dmadev, "Aborting I/O %d QID %d\n", cmdid,
+ dev_warn(nvmeq->q_dmadev, "Aborting I/O %d QID %d\n", req->tag,
nvmeq->qid);
- nvme_submit_cmd(adminq, &cmd);
+ if (nvme_submit_cmd(dev->queues[0], &cmd) < 0) {
+ dev_warn(nvmeq->q_dmadev, "Could not abort I/O %d QID %d",
+ req->tag, nvmeq->qid);
+ }
}

-/**
- * nvme_cancel_ios - Cancel outstanding I/Os
- * @queue: The queue to cancel I/Os on
- * @timeout: True to only cancel I/Os which have timed out
- */
-static void nvme_cancel_ios(struct nvme_queue *nvmeq, bool timeout)
+static void nvme_cancel_queue_ios(void *data, unsigned long *tag_map)
{
- int depth = nvmeq->q_depth - 1;
- struct nvme_cmd_info *info = nvme_cmd_info(nvmeq);
- unsigned long now = jiffies;
- int cmdid;
+ struct nvme_queue *nvmeq = data;
+ struct blk_mq_hw_ctx *hctx = nvmeq->hctx;
+ unsigned int tag = 0;

- for_each_set_bit(cmdid, nvmeq->cmdid_data, depth) {
+ tag = 0;
+ do {
+ struct request *req;
void *ctx;
nvme_completion_fn fn;
+ struct nvme_cmd_info *cmd;
static struct nvme_completion cqe = {
.status = cpu_to_le16(NVME_SC_ABORT_REQ << 1),
};
+ int qdepth = nvmeq == nvmeq->dev->queues[0] ?
+ nvmeq->dev->admin_tagset.queue_depth :
+ nvmeq->dev->tagset.queue_depth;

- if (timeout && !time_after(now, info[cmdid].timeout))
- continue;
- if (info[cmdid].ctx == CMD_CTX_CANCELLED)
- continue;
- if (timeout && nvmeq->dev->initialized) {
- nvme_abort_cmd(cmdid, nvmeq);
+ /* zero'd bits are free tags */
+ tag = find_next_zero_bit(tag_map, qdepth, tag);
+ if (tag >= qdepth)
+ break;
+
+ req = blk_mq_tag_to_rq(hctx, tag++);
+ cmd = blk_mq_rq_to_pdu(req);
+
+ if (cmd->ctx == CMD_CTX_CANCELLED)
continue;
- }
- dev_warn(nvmeq->q_dmadev, "Cancelling I/O %d QID %d\n", cmdid,
- nvmeq->qid);
- ctx = cancel_cmdid(nvmeq, cmdid, &fn);
+
+ dev_warn(nvmeq->q_dmadev, "Cancelling I/O %d QID %d\n",
+ req->tag, nvmeq->qid);
+ ctx = cancel_cmd_info(cmd, &fn);
fn(nvmeq, ctx, &cqe);
- }
+
+ } while (1);
}

-static void nvme_free_queue(struct rcu_head *r)
+/**
+ * nvme_cancel_ios - Cancel outstanding I/Os
+ * @nvmeq: The queue to cancel I/Os on
+ * @tagset: The tag set associated with the queue
+ */
+static void nvme_cancel_ios(struct nvme_queue *nvmeq)
{
- struct nvme_queue *nvmeq = container_of(r, struct nvme_queue, r_head);
+ struct blk_mq_hw_ctx *hctx = nvmeq->hctx;

- spin_lock_irq(&nvmeq->q_lock);
- while (bio_list_peek(&nvmeq->sq_cong)) {
- struct bio *bio = bio_list_pop(&nvmeq->sq_cong);
- bio_endio(bio, -EIO);
- }
- while (!list_empty(&nvmeq->iod_bio)) {
- static struct nvme_completion cqe = {
- .status = cpu_to_le16(
- (NVME_SC_ABORT_REQ | NVME_SC_DNR) << 1),
- };
- struct nvme_iod *iod = list_first_entry(&nvmeq->iod_bio,
- struct nvme_iod,
- node);
- list_del(&iod->node);
- bio_completion(nvmeq, iod, &cqe);
- }
- spin_unlock_irq(&nvmeq->q_lock);
+ if (nvmeq->dev->initialized)
+ blk_mq_tag_busy_iter(hctx->tags, nvme_cancel_queue_ios, nvmeq);
+}
+
+static enum blk_eh_timer_return nvme_timeout(struct request *req)
+{
+ struct nvme_cmd_info *cmd = blk_mq_rq_to_pdu(req);
+ struct nvme_queue *nvmeq = cmd->nvmeq;
+
+ dev_warn(nvmeq->q_dmadev, "Cancelling I/O %d QID %d\n", req->tag,
+ nvmeq->qid);
+
+ if (nvmeq->dev->initialized)
+ nvme_abort_req(req);
+
+ return BLK_EH_HANDLED;
+}

+static void nvme_free_queue(struct nvme_queue *nvmeq)
+{
dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
(void *)nvmeq->cqes, nvmeq->cq_dma_addr);
dma_free_coherent(nvmeq->q_dmadev, SQ_SIZE(nvmeq->q_depth),
@@ -1172,10 +1066,10 @@ static void nvme_free_queues(struct nvme_dev *dev, int lowest)
int i;

for (i = dev->queue_count - 1; i >= lowest; i--) {
- struct nvme_queue *nvmeq = raw_nvmeq(dev, i);
- rcu_assign_pointer(dev->queues[i], NULL);
- call_rcu(&nvmeq->r_head, nvme_free_queue);
+ struct nvme_queue *nvmeq = dev->queues[i];
+ nvme_free_queue(nvmeq);
dev->queue_count--;
+ dev->queues[i] = NULL;
}
}

@@ -1208,13 +1102,13 @@ static void nvme_clear_queue(struct nvme_queue *nvmeq)
{
spin_lock_irq(&nvmeq->q_lock);
nvme_process_cq(nvmeq);
- nvme_cancel_ios(nvmeq, false);
+ nvme_cancel_ios(nvmeq);
spin_unlock_irq(&nvmeq->q_lock);
}

static void nvme_disable_queue(struct nvme_dev *dev, int qid)
{
- struct nvme_queue *nvmeq = raw_nvmeq(dev, qid);
+ struct nvme_queue *nvmeq = dev->queues[qid];

if (!nvmeq)
return;
@@ -1234,8 +1128,7 @@ static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
int depth, int vector)
{
struct device *dmadev = &dev->pci_dev->dev;
- unsigned extra = nvme_queue_extra(depth);
- struct nvme_queue *nvmeq = kzalloc(sizeof(*nvmeq) + extra, GFP_KERNEL);
+ struct nvme_queue *nvmeq = kzalloc(sizeof(*nvmeq), GFP_KERNEL);
if (!nvmeq)
return NULL;

@@ -1260,17 +1153,13 @@ static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
spin_lock_init(&nvmeq->q_lock);
nvmeq->cq_head = 0;
nvmeq->cq_phase = 1;
- init_waitqueue_head(&nvmeq->sq_full);
- init_waitqueue_entry(&nvmeq->sq_cong_wait, nvme_thread);
- bio_list_init(&nvmeq->sq_cong);
- INIT_LIST_HEAD(&nvmeq->iod_bio);
nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
nvmeq->q_depth = depth;
nvmeq->cq_vector = vector;
nvmeq->qid = qid;
nvmeq->q_suspended = 1;
dev->queue_count++;
- rcu_assign_pointer(dev->queues[qid], nvmeq);
+ dev->queues[qid] = nvmeq;

return nvmeq;

@@ -1299,15 +1188,12 @@ static int queue_request_irq(struct nvme_dev *dev, struct nvme_queue *nvmeq,
static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
{
struct nvme_dev *dev = nvmeq->dev;
- unsigned extra = nvme_queue_extra(nvmeq->q_depth);

nvmeq->sq_tail = 0;
nvmeq->cq_head = 0;
nvmeq->cq_phase = 1;
nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
- memset(nvmeq->cmdid_data, 0, extra);
memset((void *)nvmeq->cqes, 0, CQ_SIZE(nvmeq->q_depth));
- nvme_cancel_ios(nvmeq, false);
nvmeq->q_suspended = 0;
dev->online_queues++;
}
@@ -1408,6 +1294,53 @@ static int nvme_shutdown_ctrl(struct nvme_dev *dev)
return 0;
}

+static struct blk_mq_ops nvme_mq_admin_ops = {
+ .queue_rq = nvme_admin_queue_rq,
+ .map_queue = blk_mq_map_queue,
+ .init_hctx = nvme_admin_init_hctx,
+ .init_request = nvme_admin_init_request,
+ .timeout = nvme_timeout,
+};
+
+static struct blk_mq_ops nvme_mq_ops = {
+ .queue_rq = nvme_queue_rq,
+ .map_queue = blk_mq_map_queue,
+ .init_hctx = nvme_init_hctx,
+ .init_request = nvme_init_request,
+ .timeout = nvme_timeout,
+};
+
+static int nvme_alloc_admin_tags(struct nvme_dev *dev)
+{
+ if (!dev->admin_q) {
+ dev->admin_tagset.ops = &nvme_mq_admin_ops;
+ dev->admin_tagset.nr_hw_queues = 1;
+ dev->admin_tagset.queue_depth = NVME_AQ_DEPTH;
+ dev->admin_tagset.timeout = ADMIN_TIMEOUT;
+ dev->admin_tagset.reserved_tags = 1,
+ dev->admin_tagset.numa_node = dev_to_node(&dev->pci_dev->dev);
+ dev->admin_tagset.cmd_size = sizeof(struct nvme_cmd_info);
+ dev->admin_tagset.driver_data = dev;
+
+ if (blk_mq_alloc_tag_set(&dev->admin_tagset))
+ return -ENOMEM;
+
+ dev->admin_q = blk_mq_init_queue(&dev->admin_tagset);
+ if (!dev->admin_q) {
+ blk_mq_free_tag_set(&dev->admin_tagset);
+ return -ENOMEM;
+ }
+ }
+
+ return 0;
+}
+
+static void nvme_free_admin_tags(struct nvme_dev *dev)
+{
+ if (dev->admin_q)
+ blk_mq_free_tag_set(&dev->admin_tagset);
+}
+
static int nvme_configure_admin_queue(struct nvme_dev *dev)
{
int result;
@@ -1419,9 +1352,9 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
if (result < 0)
return result;

- nvmeq = raw_nvmeq(dev, 0);
+ nvmeq = dev->queues[0];
if (!nvmeq) {
- nvmeq = nvme_alloc_queue(dev, 0, 64, 0);
+ nvmeq = nvme_alloc_queue(dev, 0, NVME_AQ_DEPTH, 0);
if (!nvmeq)
return -ENOMEM;
}
@@ -1441,16 +1374,26 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)

result = nvme_enable_ctrl(dev, cap);
if (result)
- return result;
+ goto free_nvmeq;
+
+ result = nvme_alloc_admin_tags(dev);
+ if (result)
+ goto free_nvmeq;

result = queue_request_irq(dev, nvmeq, nvmeq->irqname);
if (result)
- return result;
+ goto free_tags;

spin_lock_irq(&nvmeq->q_lock);
nvme_init_queue(nvmeq, 0);
spin_unlock_irq(&nvmeq->q_lock);
return result;
+
+ free_tags:
+ nvme_free_admin_tags(dev);
+ free_nvmeq:
+ nvme_free_queues(dev, 0);
+ return result;
}

struct nvme_iod *nvme_map_user_pages(struct nvme_dev *dev, int write,
@@ -1605,7 +1548,7 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
if (length != (io.nblocks + 1) << ns->lba_shift)
status = -ENOMEM;
else
- status = nvme_submit_io_cmd(dev, &c, NULL);
+ status = nvme_submit_io_cmd(dev, ns, &c, NULL);

if (meta_len) {
if (status == NVME_SC_SUCCESS && !(io.opcode & 1)) {
@@ -1677,10 +1620,11 @@ static int nvme_user_admin_cmd(struct nvme_dev *dev,

timeout = cmd.timeout_ms ? msecs_to_jiffies(cmd.timeout_ms) :
ADMIN_TIMEOUT;
+
if (length != cmd.data_len)
status = -ENOMEM;
else
- status = nvme_submit_sync_cmd(dev, 0, &c, &cmd.result, timeout);
+ status = __nvme_submit_admin_cmd(dev, &c, &cmd.result, timeout);

if (cmd.data_len) {
nvme_unmap_user_pages(dev, cmd.opcode & 1, iod);
@@ -1769,41 +1713,6 @@ static const struct block_device_operations nvme_fops = {
.getgeo = nvme_getgeo,
};

-static void nvme_resubmit_iods(struct nvme_queue *nvmeq)
-{
- struct nvme_iod *iod, *next;
-
- list_for_each_entry_safe(iod, next, &nvmeq->iod_bio, node) {
- if (unlikely(nvme_submit_iod(nvmeq, iod)))
- break;
- list_del(&iod->node);
- if (bio_list_empty(&nvmeq->sq_cong) &&
- list_empty(&nvmeq->iod_bio))
- remove_wait_queue(&nvmeq->sq_full,
- &nvmeq->sq_cong_wait);
- }
-}
-
-static void nvme_resubmit_bios(struct nvme_queue *nvmeq)
-{
- while (bio_list_peek(&nvmeq->sq_cong)) {
- struct bio *bio = bio_list_pop(&nvmeq->sq_cong);
- struct nvme_ns *ns = bio->bi_bdev->bd_disk->private_data;
-
- if (bio_list_empty(&nvmeq->sq_cong) &&
- list_empty(&nvmeq->iod_bio))
- remove_wait_queue(&nvmeq->sq_full,
- &nvmeq->sq_cong_wait);
- if (nvme_submit_bio_queue(nvmeq, ns, bio)) {
- if (!waitqueue_active(&nvmeq->sq_full))
- add_wait_queue(&nvmeq->sq_full,
- &nvmeq->sq_cong_wait);
- bio_list_add_head(&nvmeq->sq_cong, bio);
- break;
- }
- }
-}
-
static int nvme_kthread(void *data)
{
struct nvme_dev *dev, *next;
@@ -1824,23 +1733,17 @@ static int nvme_kthread(void *data)
queue_work(nvme_workq, &dev->reset_work);
continue;
}
- rcu_read_lock();
for (i = 0; i < dev->queue_count; i++) {
- struct nvme_queue *nvmeq =
- rcu_dereference(dev->queues[i]);
+ struct nvme_queue *nvmeq = dev->queues[i];
if (!nvmeq)
continue;
spin_lock_irq(&nvmeq->q_lock);
if (nvmeq->q_suspended)
goto unlock;
nvme_process_cq(nvmeq);
- nvme_cancel_ios(nvmeq, true);
- nvme_resubmit_bios(nvmeq);
- nvme_resubmit_iods(nvmeq);
unlock:
spin_unlock_irq(&nvmeq->q_lock);
}
- rcu_read_unlock();
}
spin_unlock(&dev_list_lock);
schedule_timeout(round_jiffies_relative(HZ));
@@ -1863,27 +1766,29 @@ static struct nvme_ns *nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid,
{
struct nvme_ns *ns;
struct gendisk *disk;
+ int node = dev_to_node(&dev->pci_dev->dev);
int lbaf;

if (rt->attributes & NVME_LBART_ATTRIB_HIDE)
return NULL;

- ns = kzalloc(sizeof(*ns), GFP_KERNEL);
+ ns = kzalloc_node(sizeof(*ns), GFP_KERNEL, node);
if (!ns)
return NULL;
- ns->queue = blk_alloc_queue(GFP_KERNEL);
+ ns->queue = blk_mq_init_queue(&dev->tagset);
if (!ns->queue)
goto out_free_ns;
- ns->queue->queue_flags = QUEUE_FLAG_DEFAULT;
+ queue_flag_set_unlocked(QUEUE_FLAG_DEFAULT, ns->queue);
queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, ns->queue);
queue_flag_set_unlocked(QUEUE_FLAG_NONROT, ns->queue);
- blk_queue_make_request(ns->queue, nvme_make_request);
+ queue_flag_clear_unlocked(QUEUE_FLAG_IO_STAT, ns->queue);
ns->dev = dev;
ns->queue->queuedata = ns;

- disk = alloc_disk(0);
+ disk = alloc_disk_node(0, node);
if (!disk)
goto out_free_queue;
+
ns->ns_id = nsid;
ns->disk = disk;
lbaf = id->flbas & 0xf;
@@ -1917,143 +1822,19 @@ static struct nvme_ns *nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid,
return NULL;
}

-static int nvme_find_closest_node(int node)
-{
- int n, val, min_val = INT_MAX, best_node = node;
-
- for_each_online_node(n) {
- if (n == node)
- continue;
- val = node_distance(node, n);
- if (val < min_val) {
- min_val = val;
- best_node = n;
- }
- }
- return best_node;
-}
-
-static void nvme_set_queue_cpus(cpumask_t *qmask, struct nvme_queue *nvmeq,
- int count)
-{
- int cpu;
- for_each_cpu(cpu, qmask) {
- if (cpumask_weight(nvmeq->cpu_mask) >= count)
- break;
- if (!cpumask_test_and_set_cpu(cpu, nvmeq->cpu_mask))
- *per_cpu_ptr(nvmeq->dev->io_queue, cpu) = nvmeq->qid;
- }
-}
-
-static void nvme_add_cpus(cpumask_t *mask, const cpumask_t *unassigned_cpus,
- const cpumask_t *new_mask, struct nvme_queue *nvmeq, int cpus_per_queue)
-{
- int next_cpu;
- for_each_cpu(next_cpu, new_mask) {
- cpumask_or(mask, mask, get_cpu_mask(next_cpu));
- cpumask_or(mask, mask, topology_thread_cpumask(next_cpu));
- cpumask_and(mask, mask, unassigned_cpus);
- nvme_set_queue_cpus(mask, nvmeq, cpus_per_queue);
- }
-}
-
static void nvme_create_io_queues(struct nvme_dev *dev)
{
- unsigned i, max;
+ unsigned i;

- max = min(dev->max_qid, num_online_cpus());
- for (i = dev->queue_count; i <= max; i++)
+ for (i = dev->queue_count; i <= dev->max_qid; i++)
if (!nvme_alloc_queue(dev, i, dev->q_depth, i - 1))
break;

- max = min(dev->queue_count - 1, num_online_cpus());
- for (i = dev->online_queues; i <= max; i++)
- if (nvme_create_queue(raw_nvmeq(dev, i), i))
+ for (i = dev->online_queues; i <= dev->queue_count - 1; i++)
+ if (nvme_create_queue(dev->queues[i], i))
break;
}

-/*
- * If there are fewer queues than online cpus, this will try to optimally
- * assign a queue to multiple cpus by grouping cpus that are "close" together:
- * thread siblings, core, socket, closest node, then whatever else is
- * available.
- */
-static void nvme_assign_io_queues(struct nvme_dev *dev)
-{
- unsigned cpu, cpus_per_queue, queues, remainder, i;
- cpumask_var_t unassigned_cpus;
-
- nvme_create_io_queues(dev);
-
- queues = min(dev->online_queues - 1, num_online_cpus());
- if (!queues)
- return;
-
- cpus_per_queue = num_online_cpus() / queues;
- remainder = queues - (num_online_cpus() - queues * cpus_per_queue);
-
- if (!alloc_cpumask_var(&unassigned_cpus, GFP_KERNEL))
- return;
-
- cpumask_copy(unassigned_cpus, cpu_online_mask);
- cpu = cpumask_first(unassigned_cpus);
- for (i = 1; i <= queues; i++) {
- struct nvme_queue *nvmeq = lock_nvmeq(dev, i);
- cpumask_t mask;
-
- cpumask_clear(nvmeq->cpu_mask);
- if (!cpumask_weight(unassigned_cpus)) {
- unlock_nvmeq(nvmeq);
- break;
- }
-
- mask = *get_cpu_mask(cpu);
- nvme_set_queue_cpus(&mask, nvmeq, cpus_per_queue);
- if (cpus_weight(mask) < cpus_per_queue)
- nvme_add_cpus(&mask, unassigned_cpus,
- topology_thread_cpumask(cpu),
- nvmeq, cpus_per_queue);
- if (cpus_weight(mask) < cpus_per_queue)
- nvme_add_cpus(&mask, unassigned_cpus,
- topology_core_cpumask(cpu),
- nvmeq, cpus_per_queue);
- if (cpus_weight(mask) < cpus_per_queue)
- nvme_add_cpus(&mask, unassigned_cpus,
- cpumask_of_node(cpu_to_node(cpu)),
- nvmeq, cpus_per_queue);
- if (cpus_weight(mask) < cpus_per_queue)
- nvme_add_cpus(&mask, unassigned_cpus,
- cpumask_of_node(
- nvme_find_closest_node(
- cpu_to_node(cpu))),
- nvmeq, cpus_per_queue);
- if (cpus_weight(mask) < cpus_per_queue)
- nvme_add_cpus(&mask, unassigned_cpus,
- unassigned_cpus,
- nvmeq, cpus_per_queue);
-
- WARN(cpumask_weight(nvmeq->cpu_mask) != cpus_per_queue,
- "nvme%d qid:%d mis-matched queue-to-cpu assignment\n",
- dev->instance, i);
-
- irq_set_affinity_hint(dev->entry[nvmeq->cq_vector].vector,
- nvmeq->cpu_mask);
- cpumask_andnot(unassigned_cpus, unassigned_cpus,
- nvmeq->cpu_mask);
- cpu = cpumask_next(cpu, unassigned_cpus);
- if (remainder && !--remainder)
- cpus_per_queue++;
- unlock_nvmeq(nvmeq);
- }
- WARN(cpumask_weight(unassigned_cpus), "nvme%d unassigned online cpus\n",
- dev->instance);
- i = 0;
- cpumask_andnot(unassigned_cpus, cpu_possible_mask, cpu_online_mask);
- for_each_cpu(cpu, unassigned_cpus)
- *per_cpu_ptr(dev->io_queue, cpu) = (i++ % queues) + 1;
- free_cpumask_var(unassigned_cpus);
-}
-
static int set_queue_count(struct nvme_dev *dev, int count)
{
int status;
@@ -2077,22 +1858,9 @@ static size_t db_bar_size(struct nvme_dev *dev, unsigned nr_io_queues)
return 4096 + ((nr_io_queues + 1) * 8 * dev->db_stride);
}

-static int nvme_cpu_notify(struct notifier_block *self,
- unsigned long action, void *hcpu)
-{
- struct nvme_dev *dev = container_of(self, struct nvme_dev, nb);
- switch (action) {
- case CPU_ONLINE:
- case CPU_DEAD:
- nvme_assign_io_queues(dev);
- break;
- }
- return NOTIFY_OK;
-}
-
static int nvme_setup_io_queues(struct nvme_dev *dev)
{
- struct nvme_queue *adminq = raw_nvmeq(dev, 0);
+ struct nvme_queue *adminq = dev->queues[0];
struct pci_dev *pdev = dev->pci_dev;
int result, i, vecs, nr_io_queues, size;

@@ -2151,12 +1919,7 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)

/* Free previously allocated queues that are no longer usable */
nvme_free_queues(dev, nr_io_queues + 1);
- nvme_assign_io_queues(dev);
-
- dev->nb.notifier_call = &nvme_cpu_notify;
- result = register_hotcpu_notifier(&dev->nb);
- if (result)
- goto free_queues;
+ nvme_create_io_queues(dev);

return 0;

@@ -2208,6 +1971,18 @@ static int nvme_dev_add(struct nvme_dev *dev)
(pdev->device == 0x0953) && ctrl->vs[3])
dev->stripe_size = 1 << (ctrl->vs[3] + shift);

+ dev->tagset.ops = &nvme_mq_ops;
+ dev->tagset.nr_hw_queues = dev->online_queues - 1;
+ dev->tagset.timeout = NVME_IO_TIMEOUT;
+ dev->tagset.numa_node = dev_to_node(&dev->pci_dev->dev);
+ dev->tagset.queue_depth = min_t(int, dev->q_depth, BLK_MQ_MAX_DEPTH);
+ dev->tagset.cmd_size = sizeof(struct nvme_cmd_info);
+ dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE;
+ dev->tagset.driver_data = dev;
+
+ if (blk_mq_alloc_tag_set(&dev->tagset))
+ goto out;
+
id_ns = mem;
for (i = 1; i <= nn; i++) {
res = nvme_identify(dev, i, 0, dma_addr);
@@ -2356,7 +2131,8 @@ static int adapter_async_del_queue(struct nvme_queue *nvmeq, u8 opcode,
c.delete_queue.qid = cpu_to_le16(nvmeq->qid);

init_kthread_work(&nvmeq->cmdinfo.work, fn);
- return nvme_submit_admin_cmd_async(nvmeq->dev, &c, &nvmeq->cmdinfo);
+ return nvme_submit_admin_async_cmd(nvmeq->dev, &c, &nvmeq->cmdinfo,
+ ADMIN_TIMEOUT);
}

static void nvme_del_cq_work_handler(struct kthread_work *work)
@@ -2419,7 +2195,7 @@ static void nvme_disable_io_queues(struct nvme_dev *dev)
atomic_set(&dq.refcount, 0);
dq.worker = &worker;
for (i = dev->queue_count - 1; i > 0; i--) {
- struct nvme_queue *nvmeq = raw_nvmeq(dev, i);
+ struct nvme_queue *nvmeq = dev->queues[i];

if (nvme_suspend_queue(nvmeq))
continue;
@@ -2457,13 +2233,12 @@ static void nvme_dev_shutdown(struct nvme_dev *dev)
int i;

dev->initialized = 0;
- unregister_hotcpu_notifier(&dev->nb);

nvme_dev_list_remove(dev);

if (!dev->bar || (dev->bar && readl(&dev->bar->csts) == -1)) {
for (i = dev->queue_count - 1; i >= 0; i--) {
- struct nvme_queue *nvmeq = raw_nvmeq(dev, i);
+ struct nvme_queue *nvmeq = dev->queues[i];
nvme_suspend_queue(nvmeq);
nvme_clear_queue(nvmeq);
}
@@ -2475,6 +2250,12 @@ static void nvme_dev_shutdown(struct nvme_dev *dev)
nvme_dev_unmap(dev);
}

+static void nvme_dev_remove_admin(struct nvme_dev *dev)
+{
+ if (dev->admin_q && !blk_queue_dying(dev->admin_q))
+ blk_cleanup_queue(dev->admin_q);
+}
+
static void nvme_dev_remove(struct nvme_dev *dev)
{
struct nvme_ns *ns;
@@ -2556,7 +2337,7 @@ static void nvme_free_dev(struct kref *kref)
struct nvme_dev *dev = container_of(kref, struct nvme_dev, kref);

nvme_free_namespaces(dev);
- free_percpu(dev->io_queue);
+ blk_mq_free_tag_set(&dev->tagset);
kfree(dev->queues);
kfree(dev->entry);
kfree(dev);
@@ -2708,23 +2489,24 @@ static void nvme_reset_workfn(struct work_struct *work)

static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
- int result = -ENOMEM;
+ int node, result = -ENOMEM;
struct nvme_dev *dev;

- dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+ node = dev_to_node(&pdev->dev);
+ if (node == NUMA_NO_NODE)
+ set_dev_node(&pdev->dev, 0);
+
+ dev = kzalloc_node(sizeof(*dev), GFP_KERNEL, node);
if (!dev)
return -ENOMEM;
- dev->entry = kcalloc(num_possible_cpus(), sizeof(*dev->entry),
- GFP_KERNEL);
+ dev->entry = kzalloc_node(num_possible_cpus() * sizeof(*dev->entry),
+ GFP_KERNEL, node);
if (!dev->entry)
goto free;
- dev->queues = kcalloc(num_possible_cpus() + 1, sizeof(void *),
- GFP_KERNEL);
+ dev->queues = kzalloc_node((num_possible_cpus() + 1) * sizeof(void *),
+ GFP_KERNEL, node);
if (!dev->queues)
goto free;
- dev->io_queue = alloc_percpu(unsigned short);
- if (!dev->io_queue)
- goto free;

INIT_LIST_HEAD(&dev->namespaces);
dev->reset_workfn = nvme_reset_failed_dev;
@@ -2766,6 +2548,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)

remove:
nvme_dev_remove(dev);
+ nvme_dev_remove_admin(dev);
nvme_free_namespaces(dev);
shutdown:
nvme_dev_shutdown(dev);
@@ -2775,7 +2558,6 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
release:
nvme_release_instance(dev);
free:
- free_percpu(dev->io_queue);
kfree(dev->queues);
kfree(dev->entry);
kfree(dev);
@@ -2801,8 +2583,9 @@ static void nvme_remove(struct pci_dev *pdev)
misc_deregister(&dev->miscdev);
nvme_dev_remove(dev);
nvme_dev_shutdown(dev);
+ nvme_dev_remove_admin(dev);
nvme_free_queues(dev, 0);
- rcu_barrier();
+ nvme_free_admin_tags(dev);
nvme_release_instance(dev);
nvme_release_prp_pools(dev);
kref_put(&dev->kref, nvme_free_dev);
diff --git a/drivers/block/nvme-scsi.c b/drivers/block/nvme-scsi.c
index da3b252..d386cae 100644
--- a/drivers/block/nvme-scsi.c
+++ b/drivers/block/nvme-scsi.c
@@ -2103,7 +2103,7 @@ static int nvme_trans_do_nvme_io(struct nvme_ns *ns, struct sg_io_hdr *hdr,

nvme_offset += unit_num_blocks;

- nvme_sc = nvme_submit_io_cmd(dev, &c, NULL);
+ nvme_sc = nvme_submit_io_cmd(dev, ns, &c, NULL);
if (nvme_sc != NVME_SC_SUCCESS) {
nvme_unmap_user_pages(dev,
(is_write) ? DMA_TO_DEVICE : DMA_FROM_DEVICE,
@@ -2656,7 +2656,7 @@ static int nvme_trans_start_stop(struct nvme_ns *ns, struct sg_io_hdr *hdr,
c.common.opcode = nvme_cmd_flush;
c.common.nsid = cpu_to_le32(ns->ns_id);

- nvme_sc = nvme_submit_io_cmd(ns->dev, &c, NULL);
+ nvme_sc = nvme_submit_io_cmd(ns->dev, ns, &c, NULL);
res = nvme_trans_status_code(hdr, nvme_sc);
if (res)
goto out;
@@ -2684,7 +2684,7 @@ static int nvme_trans_synchronize_cache(struct nvme_ns *ns,
c.common.opcode = nvme_cmd_flush;
c.common.nsid = cpu_to_le32(ns->ns_id);

- nvme_sc = nvme_submit_io_cmd(ns->dev, &c, NULL);
+ nvme_sc = nvme_submit_io_cmd(ns->dev, ns, &c, NULL);

res = nvme_trans_status_code(hdr, nvme_sc);
if (res)
@@ -2892,7 +2892,7 @@ static int nvme_trans_unmap(struct nvme_ns *ns, struct sg_io_hdr *hdr,
c.dsm.nr = cpu_to_le32(ndesc - 1);
c.dsm.attributes = cpu_to_le32(NVME_DSMGMT_AD);

- nvme_sc = nvme_submit_io_cmd(dev, &c, NULL);
+ nvme_sc = nvme_submit_io_cmd(dev, ns, &c, NULL);
res = nvme_trans_status_code(hdr, nvme_sc);

dma_free_coherent(&dev->pci_dev->dev, ndesc * sizeof(*range),
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 6266373..38aade9 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -19,6 +19,7 @@
#include <linux/pci.h>
#include <linux/miscdevice.h>
#include <linux/kref.h>
+#include <linux/blk-mq.h>

struct nvme_bar {
__u64 cap; /* Controller Capabilities */
@@ -70,8 +71,10 @@ extern unsigned char io_timeout;
*/
struct nvme_dev {
struct list_head node;
- struct nvme_queue __rcu **queues;
- unsigned short __percpu *io_queue;
+ struct nvme_queue **queues;
+ struct request_queue *admin_q;
+ struct blk_mq_tag_set tagset;
+ struct blk_mq_tag_set admin_tagset;
u32 __iomem *dbs;
struct pci_dev *pci_dev;
struct dma_pool *prp_page_pool;
@@ -90,7 +93,6 @@ struct nvme_dev {
struct miscdevice miscdev;
work_func_t reset_workfn;
struct work_struct reset_work;
- struct notifier_block nb;
char name[12];
char serial[20];
char model[40];
@@ -132,7 +134,6 @@ struct nvme_iod {
int offset; /* Of PRP list */
int nents; /* Used in scatterlist */
int length; /* Of data, in bytes */
- unsigned long start_time;
dma_addr_t first_dma;
struct list_head node;
struct scatterlist sg[0];
@@ -150,12 +151,13 @@ static inline u64 nvme_block_nr(struct nvme_ns *ns, sector_t sector)
*/
void nvme_free_iod(struct nvme_dev *dev, struct nvme_iod *iod);

-int nvme_setup_prps(struct nvme_dev *, struct nvme_iod *, int , gfp_t);
+int nvme_setup_prps(struct nvme_dev *, struct nvme_iod *, int, gfp_t);
struct nvme_iod *nvme_map_user_pages(struct nvme_dev *dev, int write,
unsigned long addr, unsigned length);
void nvme_unmap_user_pages(struct nvme_dev *dev, int write,
struct nvme_iod *iod);
-int nvme_submit_io_cmd(struct nvme_dev *, struct nvme_command *, u32 *);
+int nvme_submit_io_cmd(struct nvme_dev *, struct nvme_ns *,
+ struct nvme_command *, u32 *);
int nvme_submit_flush_data(struct nvme_queue *nvmeq, struct nvme_ns *ns);
int nvme_submit_admin_cmd(struct nvme_dev *, struct nvme_command *,
u32 *result);
--
1.9.1

2014-06-02 22:50:09

by Keith Busch

[permalink] [raw]

Subject: Re: [PATCH v5] conversion to blk-mq

On Mon, 2 Jun 2014, Matias Bjørling wrote:
> Hi Matthew and Keith,
>
> Here is an updated patch with the feedback from the previous days. It's against
> Jens' for-3.16/core tree. You may use the nvmemq_wip_review branch at:

I'm testing this on my normal hardware now. As I feared, hot removal
doesn't work while the device is actively being managed. Here's the oops:

[ 1267.018283] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
[ 1267.018288] IP: [<ffffffff811c51b8>] blk_mq_map_queue+0xf/0x1e
[ 1267.018292] PGD b5ed5067 PUD b57e2067 PMD 0
[ 1267.018294] Oops: 0000 [#1] SMP
[ 1267.018343] Modules linked in: nvme parport_pc ppdev lp parport dlm sctp libcrc32c configfs nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc md4 hmac cifs bridge stp llc joydev jfs hid_generic usbhid hid loop md_mod x86_pkg_temp_thermal coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd microcode ehci_pci ehci_hcd usbcore pcspkr lpc_ich acpi_cpufreq ioatdma mfd_core usb_common i2c_i801 evdev wmi ipmi_si tpm_tis ipmi_msghandler tpm processor thermal_sys button ext4 crc16 jbd2 mbcache dm_mod nbd sg sr_mod cdrom sd_mod crc_t10dif crct10dif_common isci libsas ahci scsi_transport_sas libahci igb libata ptp pps_core i2c_algo_bit scsi_mod i2c_core dca [last unloaded: nvme]
[ 1267.018346] CPU: 1 PID: 6618 Comm: nvme_id_ctrl Tainted: G W 3.15.0-rc1+ #2
[ 1267.018347] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
[ 1267.018349] task: ffff88042879eef0 ti: ffff8800b5f14000 task.ti: ffff8800b5f14000
[ 1267.018354] RIP: 0010:[<ffffffff811c51b8>] [<ffffffff811c51b8>] blk_mq_map_queue+0xf/0x1e
[ 1267.018356] RSP: 0018:ffff8800b5f15db0 EFLAGS: 00010206
[ 1267.018357] RAX: 0000000000000000 RBX: ffffe8fbffa21e80 RCX: ffff88042f3952e0
[ 1267.018359] RDX: 0000000000001c10 RSI: 0000000000000001 RDI: ffff8800b5e9f008
[ 1267.018360] RBP: ffff88042d993480 R08: ffff8800b5e18cc0 R09: 0000000000000000
[ 1267.018362] R10: 0000000000000ca0 R11: 0000000000000ca0 R12: ffff8800b5f15e94
[ 1267.018363] R13: 0000000000003a98 R14: 00007fffc7b44c40 R15: 0000000000000000
[ 1267.018365] FS: 00007feb4cb05700(0000) GS:ffff88043f620000(0000) knlGS:0000000000000000
[ 1267.018367] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1267.018369] CR2: 0000000000000004 CR3: 00000000b50f2000 CR4: 00000000000407e0
[ 1267.018369] Stack:
[ 1267.018372] ffffffff811c6334 00000000fffffffc ffff88042d993480 00000000fffffffc
[ 1267.018375] ffffffffa0532d73 0000000000000ca0 00000000fffffff4 ffff88042562b440
[ 1267.018378] ffff88042d853000 0000000000001000 ffffffffa05348f3 ffffc9001518b2a8
[ 1267.018378] Call Trace:
[ 1267.018383] [<ffffffff811c6334>] ? blk_mq_free_request+0x37/0x48
[ 1267.018388] [<ffffffffa0532d73>] ? __nvme_submit_admin_cmd+0x7f/0x8a [nvme]
[ 1267.018392] [<ffffffffa05348f3>] ? nvme_user_admin_cmd+0x144/0x1b1 [nvme]
[ 1267.018397] [<ffffffffa053497d>] ? nvme_dev_ioctl+0x1d/0x2b [nvme]
[ 1267.018399] [<ffffffff81125916>] ? do_vfs_ioctl+0x3f2/0x43b
[ 1267.018402] [<ffffffff8105ca6b>] ? finish_task_switch+0x84/0xc4
[ 1267.018406] [<ffffffff81394a8c>] ? __schedule+0x45c/0x4f0
[ 1267.018408] [<ffffffff811259ad>] ? SyS_ioctl+0x4e/0x7d
[ 1267.018411] [<ffffffff8139c6d2>] ? system_call_fastpath+0x16/0x1b
[ 1267.018435] Code: 8b 4a 38 48 39 4e 38 72 12 74 06 b8 01 00 00 00 c3 48 8b 4a 60 48 39 4e 60 73 f0 c3 66 66 66 66 90 48 8b 87 e0 00 00 00 48 63 f6 <8b> 14 b0 48 8b 87 f8 00 00 00 48 8b 04 d0 c3 89 ff f0 48 0f ab
[ 1267.018438] RIP [<ffffffff811c51b8>] blk_mq_map_queue+0xf/0x1e
[ 1267.018439] RSP <ffff8800b5f15db0>
[ 1267.018440] CR2: 0000000000000004
[ 1267.018443] ---[ end trace 1e244ae5f3ceb23b ]---

2014-06-02 23:06:27

by Keith Busch

[permalink] [raw]

Subject: Re: [PATCH v5] conversion to blk-mq

Depending on the timing, it might fail in alloc instead of free:

Jun 2 16:45:40 kbgrz1 kernel: [ 265.421243] NULL pointer dereference at (null)
Jun 2 16:45:40 kbgrz1 kernel: [ 265.434284] PGD 429acf067 PUD 42ce28067 PMD 0
Jun 2 16:45:40 kbgrz1 kernel: [ 265.439565] Oops: 0000 [#1] SMP
Jun 2 16:45:40 kbgrz1 kernel: [ 265.443413] Modules linked in: nvme parport_pc ppdev lp parport dlm sctp libcrc32c configfs nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc md4 hmac cifs bridge stp llc jfs joydev hid_generic usbhid hid loop md_mod x86_pkg_temp_thermal coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd ehci_pci ehci_hcd microcode usbcore pcspkr ioatdma lpc_ich usb_common mfd_core i2c_i801 evdev wmi acpi_cpufreq tpm_tis ipmi_si tpm ipmi_msghandler processor thermal_sys button ext4 crc16 jbd2 mbcache dm_mod nbd sg sd_mod sr_mod crc_t10dif cdrom crct10dif_common isci libsas igb ahci scsi_transport_sas libahci ptp libata pps_core i2c_algo_bit i2c_core scsi_mod dca
Jun 2 16:45:40 kbgrz1 kernel: [ 265.526398] CPU: 4 PID: 5454 Comm: nvme_id_ctrl Not tainted 3.15.0-rc1+ #2
Jun 2 16:45:40 kbgrz1 kernel: [ 265.534181] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
Jun 2 16:45:40 kbgrz1 kernel: [ 265.545789] task: ffff88042e418390 ti: ffff8804283e6000 task.ti: ffff8804283e6000
Jun 2 16:45:40 kbgrz1 kernel: [ 265.554270] RIP: 0010:[<ffffffff811c51b8>] [<ffffffff811c51b8>] blk_mq_map_queue+0xf/0x1e
Jun 2 16:45:40 kbgrz1 kernel: [ 265.563720] RSP: 0018:ffff8804283e7d80 EFLAGS: 00010286
Jun 2 16:45:40 kbgrz1 kernel: [ 265.569755] RAX: 0000000000000000 RBX: ffff880035a06008 RCX: 00000000000f4240
Jun 2 16:45:40 kbgrz1 kernel: [ 265.577830] RDX: 000000000028a360 RSI: 0000000000000000 RDI: ffff880035a06008
Jun 2 16:45:40 kbgrz1 kernel: [ 265.585904] RBP: ffff88043f680000 R08: 0000000000000000 R09: 0000000000001000
Jun 2 16:45:40 kbgrz1 kernel: [ 265.593979] R10: 0000000000001000 R11: 0000000000000410 R12: 00000000000000d0
Jun 2 16:45:40 kbgrz1 kernel: [ 265.602053] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
Jun 2 16:45:40 kbgrz1 kernel: [ 265.610134] FS: 00007f74b8bc9700(0000) GS:ffff88043f680000(0000) knlGS:0000000000000000
Jun 2 16:45:40 kbgrz1 kernel: [ 265.619303] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 2 16:45:40 kbgrz1 kernel: [ 265.625824] CR2: 0000000000000000 CR3: 00000004292c4000 CR4: 00000000000407e0
Jun 2 16:45:40 kbgrz1 kernel: [ 265.633889] Stack:
Jun 2 16:45:40 kbgrz1 kernel: [ 265.636236] ffffffff811c689f 0000000000000000 00000000fffffff4 ffff8804283e7e10
Jun 2 16:45:40 kbgrz1 kernel: [ 265.644949] ffff8804283e7e94 000000000000007d 00007fff4f8a73b0 0000000000000000
Jun 2 16:45:40 kbgrz1 kernel: [ 265.653653] ffffffffa055acc7 0000000000000246 000000002d0aaec0 ffff88042d0aaec0
Jun 2 16:45:40 kbgrz1 kernel: [ 265.662358] Call Trace:
Jun 2 16:45:40 kbgrz1 kernel: [ 265.665194] [<ffffffff811c689f>] ? blk_mq_alloc_request+0x54/0xd5
Jun 2 16:45:40 kbgrz1 kernel: [ 265.672217] [<ffffffffa055acc7>] ? __nvme_submit_admin_cmd+0x2d/0x68 [nvme]
Jun 2 16:45:40 kbgrz1 kernel: [ 265.680196] [<ffffffffa055c879>] ? nvme_user_admin_cmd+0x144/0x1b1 [nvme]
Jun 2 16:45:40 kbgrz1 kernel: [ 265.687987] [<ffffffffa055c903>] ? nvme_dev_ioctl+0x1d/0x2b [nvme]
Jun 2 16:45:40 kbgrz1 kernel: [ 265.695107] [<ffffffff81125916>] ? do_vfs_ioctl+0x3f2/0x43b
Jun 2 16:45:40 kbgrz1 kernel: [ 265.701547] [<ffffffff8105ca6b>] ? finish_task_switch+0x84/0xc4
Jun 2 16:45:40 kbgrz1 kernel: [ 265.708382] [<ffffffff81394a8c>] ? __schedule+0x45c/0x4f0
Jun 2 16:45:40 kbgrz1 kernel: [ 265.714603] [<ffffffff811259ad>] ? SyS_ioctl+0x4e/0x7d
Jun 2 16:45:40 kbgrz1 kernel: [ 265.720555] [<ffffffff8139c6d2>] ? system_call_fastpath+0x16/0x1b
Jun 2 16:45:40 kbgrz1 kernel: [ 265.727560] Code: 8b 4a 38 48 39 4e 38 72 12 74 06 b8 01 00 00 00 c3 48 8b 4a 60 48 39 4e 60 73 f0 c3 66 66 66 66 90 48 8b 87 e0 00 00 00 48 63 f6 <8b> 14 b0 48 8b 87 f8 00 00 00 48 8b 04 d0 c3 89 ff f0 48 0f ab
Jun 2 16:45:40 kbgrz1 kernel: [ 265.760706] RSP <ffff8804283e7d80>
Jun 2 16:45:40 kbgrz1 kernel: [ 265.764705] CR2: 0000000000000000
Jun 2 16:45:40 kbgrz1 kernel: [ 265.768531] ---[ end trace 785048a51785f51e ]---

On Mon, 2 Jun 2014, Keith Busch wrote:
> On Mon, 2 Jun 2014, Matias Bjørling wrote:
>> Hi Matthew and Keith,
>>
>> Here is an updated patch with the feedback from the previous days. It's
>> against
>> Jens' for-3.16/core tree. You may use the nvmemq_wip_review branch at:
>
> I'm testing this on my normal hardware now. As I feared, hot removal
> doesn't work while the device is actively being managed. Here's the oops:
>
> [ 1267.018283] BUG: unable to handle kernel NULL pointer dereference at
> 0000000000000004
> [ 1267.018288] IP: [<ffffffff811c51b8>] blk_mq_map_queue+0xf/0x1e
> [ 1267.018292] PGD b5ed5067 PUD b57e2067 PMD 0
> [ 1267.018294] Oops: 0000 [#1] SMP
> [ 1267.018343] Modules linked in: nvme parport_pc ppdev lp parport dlm sctp
> libcrc32c configfs nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache
> sunrpc md4 hmac cifs bridge stp llc joydev jfs hid_generic usbhid hid loop
> md_mod x86_pkg_temp_thermal coretemp kvm_intel kvm iTCO_wdt
> iTCO_vendor_support crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64
> glue_helper lrw gf128mul ablk_helper cryptd microcode ehci_pci ehci_hcd
> usbcore pcspkr lpc_ich acpi_cpufreq ioatdma mfd_core usb_common i2c_i801
> evdev wmi ipmi_si tpm_tis ipmi_msghandler tpm processor thermal_sys button
> ext4 crc16 jbd2 mbcache dm_mod nbd sg sr_mod cdrom sd_mod crc_t10dif
> crct10dif_common isci libsas ahci scsi_transport_sas libahci igb libata ptp
> pps_core i2c_algo_bit scsi_mod i2c_core dca [last unloaded: nvme]
> [ 1267.018346] CPU: 1 PID: 6618 Comm: nvme_id_ctrl Tainted: G W
> 3.15.0-rc1+ #2
> [ 1267.018347] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS
> SE5C600.86B.02.02.0002.122320131210 12/23/2013
> [ 1267.018349] task: ffff88042879eef0 ti: ffff8800b5f14000 task.ti:
> ffff8800b5f14000
> [ 1267.018354] RIP: 0010:[<ffffffff811c51b8>] [<ffffffff811c51b8>]
> blk_mq_map_queue+0xf/0x1e
> [ 1267.018356] RSP: 0018:ffff8800b5f15db0 EFLAGS: 00010206
> [ 1267.018357] RAX: 0000000000000000 RBX: ffffe8fbffa21e80 RCX:
> ffff88042f3952e0
> [ 1267.018359] RDX: 0000000000001c10 RSI: 0000000000000001 RDI:
> ffff8800b5e9f008
> [ 1267.018360] RBP: ffff88042d993480 R08: ffff8800b5e18cc0 R09:
> 0000000000000000
> [ 1267.018362] R10: 0000000000000ca0 R11: 0000000000000ca0 R12:
> ffff8800b5f15e94
> [ 1267.018363] R13: 0000000000003a98 R14: 00007fffc7b44c40 R15:
> 0000000000000000
> [ 1267.018365] FS: 00007feb4cb05700(0000) GS:ffff88043f620000(0000)
> knlGS:0000000000000000
> [ 1267.018367] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1267.018369] CR2: 0000000000000004 CR3: 00000000b50f2000 CR4:
> 00000000000407e0
> [ 1267.018369] Stack:
> [ 1267.018372] ffffffff811c6334 00000000fffffffc ffff88042d993480
> 00000000fffffffc
> [ 1267.018375] ffffffffa0532d73 0000000000000ca0 00000000fffffff4
> ffff88042562b440
> [ 1267.018378] ffff88042d853000 0000000000001000 ffffffffa05348f3
> ffffc9001518b2a8
> [ 1267.018378] Call Trace:
> [ 1267.018383] [<ffffffff811c6334>] ? blk_mq_free_request+0x37/0x48
> [ 1267.018388] [<ffffffffa0532d73>] ? __nvme_submit_admin_cmd+0x7f/0x8a
> [nvme]
> [ 1267.018392] [<ffffffffa05348f3>] ? nvme_user_admin_cmd+0x144/0x1b1 [nvme]
> [ 1267.018397] [<ffffffffa053497d>] ? nvme_dev_ioctl+0x1d/0x2b [nvme]
> [ 1267.018399] [<ffffffff81125916>] ? do_vfs_ioctl+0x3f2/0x43b
> [ 1267.018402] [<ffffffff8105ca6b>] ? finish_task_switch+0x84/0xc4
> [ 1267.018406] [<ffffffff81394a8c>] ? __schedule+0x45c/0x4f0
> [ 1267.018408] [<ffffffff811259ad>] ? SyS_ioctl+0x4e/0x7d
> [ 1267.018411] [<ffffffff8139c6d2>] ? system_call_fastpath+0x16/0x1b
> [ 1267.018435] Code: 8b 4a 38 48 39 4e 38 72 12 74 06 b8 01 00 00 00 c3 48 8b
> 4a 60 48 39 4e 60 73 f0 c3 66 66 66 66 90 48 8b 87 e0 00 00 00 48 63 f6 <8b>
> 14 b0 48 8b 87 f8 00 00 00 48 8b 04 d0 c3 89 ff f0 48 0f ab
> [ 1267.018438] RIP [<ffffffff811c51b8>] blk_mq_map_queue+0xf/0x1e
> [ 1267.018439] RSP <ffff8800b5f15db0>
> [ 1267.018440] CR2: 0000000000000004
> [ 1267.018443] ---[ end trace 1e244ae5f3ceb23b ]---
>

2014-06-03 13:56:50

by Matias Bjørling

[permalink] [raw]

Subject: Re: [PATCH v5] conversion to blk-mq

On 06/03/2014 01:06 AM, Keith Busch wrote:
> Depending on the timing, it might fail in alloc instead of free:
>
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.421243] NULL pointer dereference
> at (null)
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.434284] PGD 429acf067 PUD
> 42ce28067 PMD 0
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.439565] Oops: 0000 [#1] SMP
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.443413] Modules linked in: nvme
> parport_pc ppdev lp parport dlm sctp libcrc32c configfs nfsd auth_rpcgss
> oid_registry nfs_acl nfs lockd fscache sunrpc md4 hmac cifs bridge stp
> llc jfs joydev hid_generic usbhid hid loop md_mod x86_pkg_temp_thermal
> coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support crc32c_intel
> ghash_clmulni_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul
> ablk_helper cryptd ehci_pci ehci_hcd microcode usbcore pcspkr ioatdma
> lpc_ich usb_common mfd_core i2c_i801 evdev wmi acpi_cpufreq tpm_tis
> ipmi_si tpm ipmi_msghandler processor thermal_sys button ext4 crc16 jbd2
> mbcache dm_mod nbd sg sd_mod sr_mod crc_t10dif cdrom crct10dif_common
> isci libsas igb ahci scsi_transport_sas libahci ptp libata pps_core
> i2c_algo_bit i2c_core scsi_mod dca
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.526398] CPU: 4 PID: 5454 Comm:
> nvme_id_ctrl Not tainted 3.15.0-rc1+ #2
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.534181] Hardware name: Intel
> Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210
> 12/23/2013
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.545789] task: ffff88042e418390 ti:
> ffff8804283e6000 task.ti: ffff8804283e6000
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.554270] RIP:
> 0010:[<ffffffff811c51b8>] [<ffffffff811c51b8>] blk_mq_map_queue+0xf/0x1e
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.563720] RSP:
> 0018:ffff8804283e7d80 EFLAGS: 00010286
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.569755] RAX: 0000000000000000 RBX:
> ffff880035a06008 RCX: 00000000000f4240
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.577830] RDX: 000000000028a360 RSI:
> 0000000000000000 RDI: ffff880035a06008
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.585904] RBP: ffff88043f680000 R08:
> 0000000000000000 R09: 0000000000001000
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.593979] R10: 0000000000001000 R11:
> 0000000000000410 R12: 00000000000000d0
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.602053] R13: 0000000000000001 R14:
> 0000000000000000 R15: 0000000000000000
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.610134] FS:
> 00007f74b8bc9700(0000) GS:ffff88043f680000(0000) knlGS:0000000000000000
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.619303] CS: 0010 DS: 0000 ES:
> 0000 CR0: 0000000080050033
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.625824] CR2: 0000000000000000 CR3:
> 00000004292c4000 CR4: 00000000000407e0
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.633889] Stack:
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.636236] ffffffff811c689f
> 0000000000000000 00000000fffffff4 ffff8804283e7e10
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.644949] ffff8804283e7e94
> 000000000000007d 00007fff4f8a73b0 0000000000000000
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.653653] ffffffffa055acc7
> 0000000000000246 000000002d0aaec0 ffff88042d0aaec0
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.662358] Call Trace:
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.665194] [<ffffffff811c689f>] ?
> blk_mq_alloc_request+0x54/0xd5
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.672217] [<ffffffffa055acc7>] ?
> __nvme_submit_admin_cmd+0x2d/0x68 [nvme]
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.680196] [<ffffffffa055c879>] ?
> nvme_user_admin_cmd+0x144/0x1b1 [nvme]
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.687987] [<ffffffffa055c903>] ?
> nvme_dev_ioctl+0x1d/0x2b [nvme]
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.695107] [<ffffffff81125916>] ?
> do_vfs_ioctl+0x3f2/0x43b
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.701547] [<ffffffff8105ca6b>] ?
> finish_task_switch+0x84/0xc4
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.708382] [<ffffffff81394a8c>] ?
> __schedule+0x45c/0x4f0
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.714603] [<ffffffff811259ad>] ?
> SyS_ioctl+0x4e/0x7d
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.720555] [<ffffffff8139c6d2>] ?
> system_call_fastpath+0x16/0x1b
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.727560] Code: 8b 4a 38 48 39 4e 38
> 72 12 74 06 b8 01 00 00 00 c3 48 8b 4a 60 48 39 4e 60 73 f0 c3 66 66 66
> 66 90 48 8b 87 e0 00 00 00 48 63 f6 <8b> 14 b0 48 8b 87 f8 00 00 00 48
> 8b 04 d0 c3 89 ff f0 48 0f ab
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.760706] RSP <ffff8804283e7d80>
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.764705] CR2: 0000000000000000
> Jun 2 16:45:40 kbgrz1 kernel: [ 265.768531] ---[ end trace
> 785048a51785f51e ]---
>
> On Mon, 2 Jun 2014, Keith Busch wrote:
>> On Mon, 2 Jun 2014, Matias Bjørling wrote:
>>> Hi Matthew and Keith,
>>>
>>> Here is an updated patch with the feedback from the previous days.
>>> It's against
>>> Jens' for-3.16/core tree. You may use the nvmemq_wip_review branch at:
>>
>> I'm testing this on my normal hardware now. As I feared, hot removal
>> doesn't work while the device is actively being managed. Here's the oops:
>>
>> [ 1267.018283] BUG: unable to handle kernel NULL pointer dereference
>> at 0000000000000004
>> [ 1267.018288] IP: [<ffffffff811c51b8>] blk_mq_map_queue+0xf/0x1e
>> [ 1267.018292] PGD b5ed5067 PUD b57e2067 PMD 0
>> [ 1267.018294] Oops: 0000 [#1] SMP
>> [ 1267.018343] Modules linked in: nvme parport_pc ppdev lp parport dlm
>> sctp libcrc32c configfs nfsd auth_rpcgss oid_registry nfs_acl nfs
>> lockd fscache sunrpc md4 hmac cifs bridge stp llc joydev jfs
>> hid_generic usbhid hid loop md_mod x86_pkg_temp_thermal coretemp
>> kvm_intel kvm iTCO_wdt iTCO_vendor_support crc32c_intel
>> ghash_clmulni_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul
>> ablk_helper cryptd microcode ehci_pci ehci_hcd usbcore pcspkr lpc_ich
>> acpi_cpufreq ioatdma mfd_core usb_common i2c_i801 evdev wmi ipmi_si
>> tpm_tis ipmi_msghandler tpm processor thermal_sys button ext4 crc16
>> jbd2 mbcache dm_mod nbd sg sr_mod cdrom sd_mod crc_t10dif
>> crct10dif_common isci libsas ahci scsi_transport_sas libahci igb
>> libata ptp pps_core i2c_algo_bit scsi_mod i2c_core dca [last unloaded:
>> nvme]
>> [ 1267.018346] CPU: 1 PID: 6618 Comm: nvme_id_ctrl Tainted: G W
>> 3.15.0-rc1+ #2
>> [ 1267.018347] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS
>> SE5C600.86B.02.02.0002.122320131210 12/23/2013
>> [ 1267.018349] task: ffff88042879eef0 ti: ffff8800b5f14000 task.ti:
>> ffff8800b5f14000
>> [ 1267.018354] RIP: 0010:[<ffffffff811c51b8>] [<ffffffff811c51b8>]
>> blk_mq_map_queue+0xf/0x1e
>> [ 1267.018356] RSP: 0018:ffff8800b5f15db0 EFLAGS: 00010206
>> [ 1267.018357] RAX: 0000000000000000 RBX: ffffe8fbffa21e80 RCX:
>> ffff88042f3952e0
>> [ 1267.018359] RDX: 0000000000001c10 RSI: 0000000000000001 RDI:
>> ffff8800b5e9f008
>> [ 1267.018360] RBP: ffff88042d993480 R08: ffff8800b5e18cc0 R09:
>> 0000000000000000
>> [ 1267.018362] R10: 0000000000000ca0 R11: 0000000000000ca0 R12:
>> ffff8800b5f15e94
>> [ 1267.018363] R13: 0000000000003a98 R14: 00007fffc7b44c40 R15:
>> 0000000000000000
>> [ 1267.018365] FS: 00007feb4cb05700(0000) GS:ffff88043f620000(0000)
>> knlGS:0000000000000000
>> [ 1267.018367] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 1267.018369] CR2: 0000000000000004 CR3: 00000000b50f2000 CR4:
>> 00000000000407e0
>> [ 1267.018369] Stack:
>> [ 1267.018372] ffffffff811c6334 00000000fffffffc ffff88042d993480
>> 00000000fffffffc
>> [ 1267.018375] ffffffffa0532d73 0000000000000ca0 00000000fffffff4
>> ffff88042562b440
>> [ 1267.018378] ffff88042d853000 0000000000001000 ffffffffa05348f3
>> ffffc9001518b2a8
>> [ 1267.018378] Call Trace:
>> [ 1267.018383] [<ffffffff811c6334>] ? blk_mq_free_request+0x37/0x48
>> [ 1267.018388] [<ffffffffa0532d73>] ?
>> __nvme_submit_admin_cmd+0x7f/0x8a [nvme]
>> [ 1267.018392] [<ffffffffa05348f3>] ? nvme_user_admin_cmd+0x144/0x1b1
>> [nvme]
>> [ 1267.018397] [<ffffffffa053497d>] ? nvme_dev_ioctl+0x1d/0x2b [nvme]
>> [ 1267.018399] [<ffffffff81125916>] ? do_vfs_ioctl+0x3f2/0x43b
>> [ 1267.018402] [<ffffffff8105ca6b>] ? finish_task_switch+0x84/0xc4
>> [ 1267.018406] [<ffffffff81394a8c>] ? __schedule+0x45c/0x4f0
>> [ 1267.018408] [<ffffffff811259ad>] ? SyS_ioctl+0x4e/0x7d
>> [ 1267.018411] [<ffffffff8139c6d2>] ? system_call_fastpath+0x16/0x1b
>> [ 1267.018435] Code: 8b 4a 38 48 39 4e 38 72 12 74 06 b8 01 00 00 00
>> c3 48 8b 4a 60 48 39 4e 60 73 f0 c3 66 66 66 66 90 48 8b 87 e0 00 00
>> 00 48 63 f6 <8b> 14 b0 48 8b 87 f8 00 00 00 48 8b 04 d0 c3 89 ff f0 48
>> 0f ab
>> [ 1267.018438] RIP [<ffffffff811c51b8>] blk_mq_map_queue+0xf/0x1e
>> [ 1267.018439] RSP <ffff8800b5f15db0>
>> [ 1267.018440] CR2: 0000000000000004
>> [ 1267.018443] ---[ end trace 1e244ae5f3ceb23b ]---
>>

Thanks. I'll look into it.

2014-06-03 20:12:37

by Matias Bjørling

[permalink] [raw]

Subject: Re: [PATCH v5] conversion to blk-mq

2014-06-03 22:23:36

by Keith Busch

[permalink] [raw]

Subject: Re: [PATCH v5] conversion to blk-mq

On Tue, 3 Jun 2014, Matias Bjorling wrote:
>
> Keith, will you take the nvmemq_wip_v6 branch for a spin? Thanks!

Still fails as before:

[ 88.933881] BUG: unable to handle kernel NULL pointer dereference at 0000000000000014
[ 88.942900] IP: [<ffffffff811c51b8>] blk_mq_map_queue+0xf/0x1e
[ 88.949605] PGD 427be0067 PUD 425495067 PMD 0
[ 88.954915] Oops: 0000 [#1] SMP
[ 88.958787] Modules linked in: nvme parport_pc ppdev lp parport dlm sctp libcrc32c configfs nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc md4 hmac cifs bridge stp llc joydev jfs hid_generic usbhid hid loop md_mod x86_pkg_temp_thermal coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd microcode ehci_pci ehci_hcd pcspkr usbcore lpc_ich ioatdma usb_common mfd_core evdev i2c_i801 wmi acpi_cpufreq tpm_tis tpm ipmi_si ipmi_msghandler processor thermal_sys button ext4 crc16 jbd2 mbcache dm_mod nbd sg sd_mod sr_mod crc_t10dif cdrom crct10dif_common isci ahci libsas igb libahci scsi_transport_sas ptp pps_core i2c_algo_bit libata i2c_core scsi_mod dca
[ 89.042529] CPU: 5 PID: 4544 Comm: nvme_id_ctrl Not tainted 3.15.0-rc1+ #3
[ 89.050295] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
[ 89.061856] task: ffff88042bbdb0d0 ti: ffff88042c24c000 task.ti: ffff88042c24c000
[ 89.070305] RIP: 0010:[<ffffffff811c51b8>] [<ffffffff811c51b8>] blk_mq_map_queue+0xf/0x1e
[ 89.079747] RSP: 0018:ffff88042c24dda0 EFLAGS: 00010202
[ 89.085795] RAX: 0000000000000000 RBX: ffffe8fbffaa1b00 RCX: ffff88042e8ec4b0
[ 89.093868] RDX: 0000000000008be6 RSI: 0000000000000005 RDI: ffff88042abdf048
[ 89.101950] RBP: ffff88042c2b81c0 R08: ffff88042c24c000 R09: ffff880035c58410
[ 89.110033] R10: ffff88043f6b2dc0 R11: ffff88043f6b2dc0 R12: ffff88042c24de94
[ 89.118119] R13: 000000000000007d R14: 00007fff0cd892b0 R15: 0000000000000000
[ 89.126210] FS: 00007f39866c5700(0000) GS:ffff88043f6a0000(0000) knlGS:0000000000000000
[ 89.135387] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 89.141916] CR2: 0000000000000014 CR3: 000000042b387000 CR4: 00000000000407e0
[ 89.149997] Stack:
[ 89.152353] ffffffff811c6334 00000000fffffffc ffff88042c2b81c0 ffff88042c24de10
[ 89.161096] ffffffffa054dcbb 0000000000000246 00000000fffffffc ffff8800b5e05cc0
[ 89.169839] 00000000fffffff4 ffff8800b5e05cc0 ffff88042bbc3000 0000000000001000
[ 89.178583] Call Trace:
[ 89.181429] [<ffffffff811c6334>] ? blk_mq_free_request+0x37/0x48
[ 89.188360] [<ffffffffa054dcbb>] ? __nvme_submit_admin_cmd+0x52/0x68 [nvme]
[ 89.196349] [<ffffffffa054f761>] ? nvme_user_admin_cmd+0x144/0x1b1 [nvme]
[ 89.204150] [<ffffffffa054f7eb>] ? nvme_dev_ioctl+0x1d/0x2b [nvme]
[ 89.211278] [<ffffffff81125916>] ? do_vfs_ioctl+0x3f2/0x43b
[ 89.217710] [<ffffffff81117e35>] ? vfs_write+0xde/0xfc
[ 89.223657] [<ffffffff811259ad>] ? SyS_ioctl+0x4e/0x7d
[ 89.229622] [<ffffffff8139c6d2>] ? system_call_fastpath+0x16/0x1b
[ 89.236636] Code: 8b 4a 38 48 39 4e 38 72 12 74 06 b8 01 00 00 00 c3 48 8b 4a 60 48 39 4e 60 73 f0 c3 66 66 66 66 90 48 8b 87 e0 00 00 00 48 63 f6 <8b> 14 b0 48 8b 87 f8 00 00 00 48 8b 04 d0 c3 89 ff f0 48 0f ab
[ 89.263435] RIP [<ffffffff811c51b8>] blk_mq_map_queue+0xf/0x1e
[ 89.270237] RSP <ffff88042c24dda0>
[ 89.274237] CR2: 0000000000000014
[ 89.278095] ---[ end trace 54c0e8cbb1fe2ec3 ]---

2014-06-03 22:27:46

by Keith Busch

[permalink] [raw]

Subject: Re: [PATCH v5] conversion to blk-mq

> On Tue, 3 Jun 2014, Matias Bjorling wrote:
>>
>> Keith, will you take the nvmemq_wip_v6 branch for a spin? Thanks!

BTW, if you want to test this out yourself, it's pretty simple to
recreate. I just run a simple user admin program sending nvme passthrough
commands in a tight loop, then run:

# echo 1 > /sys/bus/pci/devices/<bdf>/remove

> Still fails as before:
>
> [ 88.933881] BUG: unable to handle kernel NULL pointer dereference at
> 0000000000000014
> [ 88.942900] IP: [<ffffffff811c51b8>] blk_mq_map_queue+0xf/0x1e
> [ 88.949605] PGD 427be0067 PUD 425495067 PMD 0
> [ 88.954915] Oops: 0000 [#1] SMP
> [ 88.958787] Modules linked in: nvme parport_pc ppdev lp parport dlm sctp
> libcrc32c configfs nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache
> sunrpc md4 hmac cifs bridge stp llc joydev jfs hid_generic usbhid hid loop
> md_mod x86_pkg_temp_thermal coretemp kvm_intel kvm iTCO_wdt
> iTCO_vendor_support crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64
> glue_helper lrw gf128mul ablk_helper cryptd microcode ehci_pci ehci_hcd
> pcspkr usbcore lpc_ich ioatdma usb_common mfd_core evdev i2c_i801 wmi
> acpi_cpufreq tpm_tis tpm ipmi_si ipmi_msghandler processor thermal_sys button
> ext4 crc16 jbd2 mbcache dm_mod nbd sg sd_mod sr_mod crc_t10dif cdrom
> crct10dif_common isci ahci libsas igb libahci scsi_transport_sas ptp pps_core
> i2c_algo_bit libata i2c_core scsi_mod dca
> [ 89.042529] CPU: 5 PID: 4544 Comm: nvme_id_ctrl Not tainted 3.15.0-rc1+ #3
> [ 89.050295] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS
> SE5C600.86B.02.02.0002.122320131210 12/23/2013
> [ 89.061856] task: ffff88042bbdb0d0 ti: ffff88042c24c000 task.ti:
> ffff88042c24c000
> [ 89.070305] RIP: 0010:[<ffffffff811c51b8>] [<ffffffff811c51b8>]
> blk_mq_map_queue+0xf/0x1e
> [ 89.079747] RSP: 0018:ffff88042c24dda0 EFLAGS: 00010202
> [ 89.085795] RAX: 0000000000000000 RBX: ffffe8fbffaa1b00 RCX:
> ffff88042e8ec4b0
> [ 89.093868] RDX: 0000000000008be6 RSI: 0000000000000005 RDI:
> ffff88042abdf048
> [ 89.101950] RBP: ffff88042c2b81c0 R08: ffff88042c24c000 R09:
> ffff880035c58410
> [ 89.110033] R10: ffff88043f6b2dc0 R11: ffff88043f6b2dc0 R12:
> ffff88042c24de94
> [ 89.118119] R13: 000000000000007d R14: 00007fff0cd892b0 R15:
> 0000000000000000
> [ 89.126210] FS: 00007f39866c5700(0000) GS:ffff88043f6a0000(0000)
> knlGS:0000000000000000
> [ 89.135387] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 89.141916] CR2: 0000000000000014 CR3: 000000042b387000 CR4:
> 00000000000407e0
> [ 89.149997] Stack:
> [ 89.152353] ffffffff811c6334 00000000fffffffc ffff88042c2b81c0
> ffff88042c24de10
> [ 89.161096] ffffffffa054dcbb 0000000000000246 00000000fffffffc
> ffff8800b5e05cc0
> [ 89.169839] 00000000fffffff4 ffff8800b5e05cc0 ffff88042bbc3000
> 0000000000001000
> [ 89.178583] Call Trace:
> [ 89.181429] [<ffffffff811c6334>] ? blk_mq_free_request+0x37/0x48
> [ 89.188360] [<ffffffffa054dcbb>] ? __nvme_submit_admin_cmd+0x52/0x68
> [nvme]
> [ 89.196349] [<ffffffffa054f761>] ? nvme_user_admin_cmd+0x144/0x1b1 [nvme]
> [ 89.204150] [<ffffffffa054f7eb>] ? nvme_dev_ioctl+0x1d/0x2b [nvme]
> [ 89.211278] [<ffffffff81125916>] ? do_vfs_ioctl+0x3f2/0x43b
> [ 89.217710] [<ffffffff81117e35>] ? vfs_write+0xde/0xfc
> [ 89.223657] [<ffffffff811259ad>] ? SyS_ioctl+0x4e/0x7d
> [ 89.229622] [<ffffffff8139c6d2>] ? system_call_fastpath+0x16/0x1b
> [ 89.236636] Code: 8b 4a 38 48 39 4e 38 72 12 74 06 b8 01 00 00 00 c3 48 8b
> 4a 60 48 39 4e 60 73 f0 c3 66 66 66 66 90 48 8b 87 e0 00 00 00 48 63 f6 <8b>
> 14 b0 48 8b 87 f8 00 00 00 48 8b 04 d0 c3 89 ff f0 48 0f ab
> [ 89.263435] RIP [<ffffffff811c51b8>] blk_mq_map_queue+0xf/0x1e
> [ 89.270237] RSP <ffff88042c24dda0>
> [ 89.274237] CR2: 0000000000000014
> [ 89.278095] ---[ end trace 54c0e8cbb1fe2ec3 ]---
>

2014-06-04 09:16:31

by Matias Bjørling

[permalink] [raw]

Subject: Re: [PATCH v5] conversion to blk-mq

On 06/04/2014 12:27 AM, Keith Busch wrote:
>> On Tue, 3 Jun 2014, Matias Bjorling wrote:
>>>
>>> Keith, will you take the nvmemq_wip_v6 branch for a spin? Thanks!
>
> BTW, if you want to test this out yourself, it's pretty simple to
> recreate. I just run a simple user admin program sending nvme passthrough
> commands in a tight loop, then run:
>
> # echo 1 > /sys/bus/pci/devices/<bdf>/remove

I can't recreate- I use the nvme_get_feature program to continuously hit
the ioctl path, testing using your nvme qemu branch.

>
>> Still fails as before:
>>
>> [ 88.933881] BUG: unable to handle kernel NULL pointer dereference
>> at 0000000000000014
>> [ 88.942900] IP: [<ffffffff811c51b8>] blk_mq_map_queue+0xf/0x1e
>> [ 88.949605] PGD 427be0067 PUD 425495067 PMD 0
>> [ 88.954915] Oops: 0000 [#1] SMP
>> [ 88.958787] Modules linked in: nvme parport_pc ppdev lp parport dlm
>> sctp libcrc32c configfs nfsd auth_rpcgss oid_registry nfs_acl nfs
>> lockd fscache sunrpc md4 hmac cifs bridge stp llc joydev jfs
>> hid_generic usbhid hid loop md_mod x86_pkg_temp_thermal coretemp
>> kvm_intel kvm iTCO_wdt iTCO_vendor_support crc32c_intel
>> ghash_clmulni_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul
>> ablk_helper cryptd microcode ehci_pci ehci_hcd pcspkr usbcore lpc_ich
>> ioatdma usb_common mfd_core evdev i2c_i801 wmi acpi_cpufreq tpm_tis
>> tpm ipmi_si ipmi_msghandler processor thermal_sys button ext4 crc16
>> jbd2 mbcache dm_mod nbd sg sd_mod sr_mod crc_t10dif cdrom
>> crct10dif_common isci ahci libsas igb libahci scsi_transport_sas ptp
>> pps_core i2c_algo_bit libata i2c_core scsi_mod dca
>> [ 89.042529] CPU: 5 PID: 4544 Comm: nvme_id_ctrl Not tainted
>> 3.15.0-rc1+ #3
>> [ 89.050295] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS
>> SE5C600.86B.02.02.0002.122320131210 12/23/2013
>> [ 89.061856] task: ffff88042bbdb0d0 ti: ffff88042c24c000 task.ti:
>> ffff88042c24c000
>> [ 89.070305] RIP: 0010:[<ffffffff811c51b8>] [<ffffffff811c51b8>]
>> blk_mq_map_queue+0xf/0x1e
>> [ 89.079747] RSP: 0018:ffff88042c24dda0 EFLAGS: 00010202
>> [ 89.085795] RAX: 0000000000000000 RBX: ffffe8fbffaa1b00 RCX:
>> ffff88042e8ec4b0
>> [ 89.093868] RDX: 0000000000008be6 RSI: 0000000000000005 RDI:
>> ffff88042abdf048
>> [ 89.101950] RBP: ffff88042c2b81c0 R08: ffff88042c24c000 R09:
>> ffff880035c58410
>> [ 89.110033] R10: ffff88043f6b2dc0 R11: ffff88043f6b2dc0 R12:
>> ffff88042c24de94
>> [ 89.118119] R13: 000000000000007d R14: 00007fff0cd892b0 R15:
>> 0000000000000000
>> [ 89.126210] FS: 00007f39866c5700(0000) GS:ffff88043f6a0000(0000)
>> knlGS:0000000000000000
>> [ 89.135387] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 89.141916] CR2: 0000000000000014 CR3: 000000042b387000 CR4:
>> 00000000000407e0
>> [ 89.149997] Stack:
>> [ 89.152353] ffffffff811c6334 00000000fffffffc ffff88042c2b81c0
>> ffff88042c24de10
>> [ 89.161096] ffffffffa054dcbb 0000000000000246 00000000fffffffc
>> ffff8800b5e05cc0
>> [ 89.169839] 00000000fffffff4 ffff8800b5e05cc0 ffff88042bbc3000
>> 0000000000001000
>> [ 89.178583] Call Trace:
>> [ 89.181429] [<ffffffff811c6334>] ? blk_mq_free_request+0x37/0x48
>> [ 89.188360] [<ffffffffa054dcbb>] ?
>> __nvme_submit_admin_cmd+0x52/0x68 [nvme]
>> [ 89.196349] [<ffffffffa054f761>] ? nvme_user_admin_cmd+0x144/0x1b1
>> [nvme]
>> [ 89.204150] [<ffffffffa054f7eb>] ? nvme_dev_ioctl+0x1d/0x2b [nvme]
>> [ 89.211278] [<ffffffff81125916>] ? do_vfs_ioctl+0x3f2/0x43b
>> [ 89.217710] [<ffffffff81117e35>] ? vfs_write+0xde/0xfc
>> [ 89.223657] [<ffffffff811259ad>] ? SyS_ioctl+0x4e/0x7d
>> [ 89.229622] [<ffffffff8139c6d2>] ? system_call_fastpath+0x16/0x1b
>> [ 89.236636] Code: 8b 4a 38 48 39 4e 38 72 12 74 06 b8 01 00 00 00
>> c3 48 8b 4a 60 48 39 4e 60 73 f0 c3 66 66 66 66 90 48 8b 87 e0 00 00
>> 00 48 63 f6 <8b> 14 b0 48 8b 87 f8 00 00 00 48 8b 04 d0 c3 89 ff f0 48
>> 0f ab
>> [ 89.263435] RIP [<ffffffff811c51b8>] blk_mq_map_queue+0xf/0x1e
>> [ 89.270237] RSP <ffff88042c24dda0>
>> [ 89.274237] CR2: 0000000000000014
>> [ 89.278095] ---[ end trace 54c0e8cbb1fe2ec3 ]---
>>

2014-06-04 18:28:25

by Keith Busch

[permalink] [raw]

Subject: Re: [PATCH v5] conversion to blk-mq

On Wed, 4 Jun 2014, Matias Bj?rling wrote:
> On 06/04/2014 12:27 AM, Keith Busch wrote:
>>> On Tue, 3 Jun 2014, Matias Bjorling wrote:
>>>>
>>>> Keith, will you take the nvmemq_wip_v6 branch for a spin? Thanks!
>>
>> BTW, if you want to test this out yourself, it's pretty simple to
>> recreate. I just run a simple user admin program sending nvme passthrough
>> commands in a tight loop, then run:
>>
>> # echo 1 > /sys/bus/pci/devices/<bdf>/remove
>
> I can't recreate- I use the nvme_get_feature program to continuously hit the
> ioctl path, testing using your nvme qemu branch.

Okay, I'll try to fix it.

I think there are multiple problems, but the first is that since there
is no gendisk associated with the admin_q, the QUEUE_FLAG_INIT_DONE flag
is never set, and blk_mq_queue_enter returns successful whenever this
flag is not set even though this queue is dying, so we enter with all
its invalid pointers.

Here's a couple diff's. The first fixes the kernel oops by not entering a
dying queue. The second is just a few unrelated clean-ups in nvme-core.c.

I still can't complete my current hot-removal test, though; something
appears hung, but haven't nailed that down yet.

Please let me know what you think! Thanks.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d10013b..5a9ae8a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -105,6 +105,10 @@ static int blk_mq_queue_enter(struct request_queue *q)
__percpu_counter_add(&q->mq_usage_counter, 1, 1000000);
smp_wmb();
/* we have problems to freeze the queue if it's initializing */
+ if (blk_queue_dying(q)) {
+ __percpu_counter_add(&q->mq_usage_counter, -1, 1000000);
+ ret = -ENODEV;
+ }
if (!blk_queue_bypass(q) || !blk_queue_init_done(q))
return 0;

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index 243a5e6..22e9c82 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -98,7 +98,6 @@ struct nvme_queue {
u8 cq_phase;
u8 cqe_seen;
u8 q_suspended;
- cpumask_var_t cpu_mask;
struct async_cmd_info cmdinfo;
struct blk_mq_hw_ctx *hctx;
};
@@ -1055,8 +1054,6 @@ static void nvme_free_queue(struct nvme_queue *nvmeq)
(void *)nvmeq->cqes, nvmeq->cq_dma_addr);
dma_free_coherent(nvmeq->q_dmadev, SQ_SIZE(nvmeq->q_depth),
nvmeq->sq_cmds, nvmeq->sq_dma_addr);
- if (nvmeq->qid)
- free_cpumask_var(nvmeq->cpu_mask);
kfree(nvmeq);
}

@@ -1066,9 +1063,9 @@ static void nvme_free_queues(struct nvme_dev *dev, int lowest)

for (i = dev->queue_count - 1; i >= lowest; i--) {
struct nvme_queue *nvmeq = dev->queues[i];
- nvme_free_queue(nvmeq);
dev->queue_count--;
dev->queues[i] = NULL;
+ nvme_free_queue(nvmeq);
}
}

@@ -1142,9 +1139,6 @@ static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
if (!nvmeq->sq_cmds)
goto free_cqdma;

- if (qid && !zalloc_cpumask_var(&nvmeq->cpu_mask, GFP_KERNEL))
- goto free_sqdma;
-
nvmeq->q_dmadev = dmadev;
nvmeq->dev = dev;
snprintf(nvmeq->irqname, sizeof(nvmeq->irqname), "nvme%dq%d",
@@ -1162,9 +1156,6 @@ static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,

return nvmeq;

- free_sqdma:
- dma_free_coherent(dmadev, SQ_SIZE(depth), (void *)nvmeq->sq_cmds,
- nvmeq->sq_dma_addr);
free_cqdma:
dma_free_coherent(dmadev, CQ_SIZE(depth), (void *)nvmeq->cqes,
nvmeq->cq_dma_addr);

2014-06-04 18:42:53

by Jens Axboe

[permalink] [raw]

Subject: Re: [PATCH v5] conversion to blk-mq

On 06/04/2014 12:28 PM, Keith Busch wrote:
> On Wed, 4 Jun 2014, Matias Bj?rling wrote:
>> On 06/04/2014 12:27 AM, Keith Busch wrote:
>>>> On Tue, 3 Jun 2014, Matias Bjorling wrote:
>>>>>
>>>>> Keith, will you take the nvmemq_wip_v6 branch for a spin? Thanks!
>>>
>>> BTW, if you want to test this out yourself, it's pretty simple to
>>> recreate. I just run a simple user admin program sending nvme
>>> passthrough
>>> commands in a tight loop, then run:
>>>
>>> # echo 1 > /sys/bus/pci/devices/<bdf>/remove
>>
>> I can't recreate- I use the nvme_get_feature program to continuously
>> hit the ioctl path, testing using your nvme qemu branch.
>
> Okay, I'll try to fix it.
>
> I think there are multiple problems, but the first is that since there
> is no gendisk associated with the admin_q, the QUEUE_FLAG_INIT_DONE flag
> is never set, and blk_mq_queue_enter returns successful whenever this
> flag is not set even though this queue is dying, so we enter with all
> its invalid pointers.
>
> Here's a couple diff's. The first fixes the kernel oops by not entering a
> dying queue. The second is just a few unrelated clean-ups in nvme-core.c.
>
> I still can't complete my current hot-removal test, though; something
> appears hung, but haven't nailed that down yet.
>
> Please let me know what you think! Thanks.
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d10013b..5a9ae8a 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -105,6 +105,10 @@ static int blk_mq_queue_enter(struct request_queue *q)
> __percpu_counter_add(&q->mq_usage_counter, 1, 1000000);
> smp_wmb();
> /* we have problems to freeze the queue if it's initializing */
> + if (blk_queue_dying(q)) {
> + __percpu_counter_add(&q->mq_usage_counter, -1, 1000000);
> + ret = -ENODEV;
> + }
> if (!blk_queue_bypass(q) || !blk_queue_init_done(q))
> return 0;

Are you testing against 3.13? You really need the current tree for this,
otherwise I'm sure you'll run into issues (as you appear to be :-)

--
Jens Axboe

2014-06-04 18:52:23

by Keith Busch

[permalink] [raw]

Subject: Re: [PATCH v5] conversion to blk-mq

On Wed, 4 Jun 2014, Jens Axboe wrote:
> On 06/04/2014 12:28 PM, Keith Busch wrote:
> Are you testing against 3.13? You really need the current tree for this,
> otherwise I'm sure you'll run into issues (as you appear to be :-)

I'm using Matias' current tree:

git://github.com/MatiasBjorling/linux-collab nvmeq_wip_6

Is this the right one to use? Looks like a 3.15rc1+.

Also ... obviously that's the wrong diff from before. I didn't save my
file before creating the diff. Should have looked like this:

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d10013b..d1b986c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -105,7 +105,8 @@ static int blk_mq_queue_enter(struct request_queue *q)
__percpu_counter_add(&q->mq_usage_counter, 1, 1000000);
smp_wmb();
/* we have problems to freeze the queue if it's initializing */
- if (!blk_queue_bypass(q) || !blk_queue_init_done(q))
+ if (!blk_queue_dying(q) && (!blk_queue_bypass(q) ||
+ !blk_queue_init_done(q)))
return 0;

__percpu_counter_add(&q->mq_usage_counter, -1, 1000000);

2014-06-04 18:55:15

by Jens Axboe

[permalink] [raw]

Subject: Re: [PATCH v5] conversion to blk-mq

On 06/04/2014 12:52 PM, Keith Busch wrote:
> On Wed, 4 Jun 2014, Jens Axboe wrote:
>> On 06/04/2014 12:28 PM, Keith Busch wrote:
>> Are you testing against 3.13? You really need the current tree for this,
>> otherwise I'm sure you'll run into issues (as you appear to be :-)
>
> I'm using Matias' current tree:
>
> git://github.com/MatiasBjorling/linux-collab nvmeq_wip_6
>
> Is this the right one to use? Looks like a 3.15rc1+.

It is, I misread that diff wrong.

> Also ... obviously that's the wrong diff from before. I didn't save my
> file before creating the diff. Should have looked like this:
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d10013b..d1b986c 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -105,7 +105,8 @@ static int blk_mq_queue_enter(struct request_queue *q)
> __percpu_counter_add(&q->mq_usage_counter, 1, 1000000);
> smp_wmb();
> /* we have problems to freeze the queue if it's initializing */
> - if (!blk_queue_bypass(q) || !blk_queue_init_done(q))
> + if (!blk_queue_dying(q) && (!blk_queue_bypass(q) ||
> + !blk_queue_init_done(q)))
> return 0;
>
> __percpu_counter_add(&q->mq_usage_counter, -1, 1000000);

This looks better.

--
Jens Axboe

2014-06-04 20:01:15

by Matias Bjørling

[permalink] [raw]

Subject: Re: [PATCH v5] conversion to blk-mq

On 06/04/2014 08:52 PM, Keith Busch wrote:
> On Wed, 4 Jun 2014, Jens Axboe wrote:
>> On 06/04/2014 12:28 PM, Keith Busch wrote:
>> Are you testing against 3.13? You really need the current tree for this,
>> otherwise I'm sure you'll run into issues (as you appear to be :-)
>
> I'm using Matias' current tree:
>
> git://github.com/MatiasBjorling/linux-collab nvmeq_wip_6
>
> Is this the right one to use? Looks like a 3.15rc1+.
>
> Also ... obviously that's the wrong diff from before. I didn't save my
> file before creating the diff. Should have looked like this:
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d10013b..d1b986c 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -105,7 +105,8 @@ static int blk_mq_queue_enter(struct request_queue *q)
> __percpu_counter_add(&q->mq_usage_counter, 1, 1000000);
> smp_wmb();
> /* we have problems to freeze the queue if it's initializing */
> - if (!blk_queue_bypass(q) || !blk_queue_init_done(q))
> + if (!blk_queue_dying(q) && (!blk_queue_bypass(q) ||
> + !blk_queue_init_done(q)))
> return 0;
>
> __percpu_counter_add(&q->mq_usage_counter, -1, 1000000);

Thanks!

I've put your patch into the nvmemq_wip_v6 branch and rebased it on top
of Jens' for-linus.